Friday, April 29, 2016

Creating all possible unique Cartesian pairs [x y] combinations from set, where pairs [x y] not equal pairs [y x] and x,y from the same collection and not numbers

To make proper Cartesian pairs set it's required to aggregate all pairs  [x y] excluding
[x y] = [y x]. In my case I also need to remove numbers and garbage (like dates, 'Yes', No') . In other words all combinations like this : ['Bank of America' 'loan product'] ['loan product' 'Bank of America'] must meet only once not twice. Initial idea using hash map was not so efficient. I consider pair set as matrix [x y] where every element of this matrix is [x y]ij which was created after multiplication of two vectors x and y . In this square matrix all elements on top of diagonal are duplicates , they are equal to element bellow diagonal [x y]12 = [x y]21 because x,y from the same collection. (see example from this blog: How to visualise big CSV files  . This example of one pair from this file:  ['Bank of America' 'loan product'] ['loan product' 'Bank of America' ] ). To do not count element on top of diagonal of this matrix next condition will be used: [x y]ij where i <j . Total number of combination from the CSV file which I process  will follow this formula:
.

which equal:



In our case CSV files around 200MB ( see this blog:  How to visualise big CSV files ) n= 455000 * 18 and r =2.    . So follow the formula it  Number of combinations will be
670760991810000 . Even after removal garbage it's still impossible to get calculation finished in finite time.

 
(ns clojure.examples.cartesianpairs
 (:gen-class))

(defn cartesian-pairs 
"Function read col and return list of all possible pairs [ [x1 x2] .. [xi xj] ] where xi not = xj and xi not number   "
[ coll ]

  (->  
  (for [x coll  y coll  :when (not= x y) :when (and (not (number? x)) ( not (number? y))) 
  :when ( < (.indexOf coll x)  (.indexOf coll y))  ] 
   (str ":" x y )  ) 
 
  ) 
)

(def my-coll [123 45 "b" "f"  "d" "e" 'f 123 1234 4534] ) 
;(for [l '(1 2 3)]  (println "Hi") ) ;(swap! collection assoc (str ":" x ) 1) )
(->> 
my-coll
cartesian-pairs
println
)

(println (reduce conj #{} (cartesian-pairs my-coll) ))

(println 
(reduce #(assoc %1 %2 (inc (%1 %2 0)))
        {}
         (cartesian-pairs my-coll) )
)

;(println @collection)
;(println (into {} #{(cartesian-pairs my-coll)} ))
; (println (get (into {} #{(cartesian-pairs my-coll)} ) ":[b f]") ) ;(println collection)

 
  
 
 
The final Result won't have any duplication hash map of Cartesian pairs:
                                                                                                                                  
(:bf :bd :be :bf :fd :fe :ff :de :df :ef)                                                                                                                                       
#{:fe :bf :df :be :de :bd :ff :fd :ef}                                                                                                                                          
{:bf 2, :bd 1, :be 1, :fd 1, :fe 1, :ff 1, :de 1, :df 1, :ef 1}  
 

No comments: