Thursday, April 28, 2016

How to visualize generic CSV files

The goal of this post to show how to populate all possible combinations of [Financial products x Financial organizations]  or [Financial products] and etc. . The occurrence  frequency of  those in Complaints database give full picture of complaints events happened per bank, product and etc.  This is combinatoric task which may be  similar  to this  blog  .
      This task is implemented by  Clojure nugget that tested on 'Customer Complaints' database from US Government statistic data  web site.To  download the database csv file  by using this instructions:
  1.  Go to Government's web sites:  US , China I prefer US which more accurate and accessable Web site. Go to US Consumer Complaint  choose federal  click CSV icon .
  2. Data parsed by Clojure but you can use Incanta API  to visualise it. I will show in next post .
  3. To run this nugget add Clojure data-csv  API or use another one:    Clojure CSV API : into your project. For data-csv use this dependency :project.clj  
(defproject data-reader "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [
  [org.clojure/clojure "1.8.0"]
  [org.clojure/data.csv "0.1.3"]
  ]
  :main ^:skip-aot data-reader.core
  :target-path "target/%s"
  :profiles {:uberjar {:aot :all}})

The result of this nugget is repetition's  frequency  of all of unique values in 2nd column which is "Financial product". : 
Consumer Complaints Database  (parsing..)!
rows count: 554542
columns count 18

#object[clojure.lang.Atom 0x147e2ae7 {:status :ready, :val {:Mortgage 186152, :Debt_collection 100719, :Bank_account_or_service 62437, :Cred
it_reporting 91533, :Money_transfers 3798, :Prepaid_card 2464, :Credit_card 66303, :Other_financial_service 548, :Consumer_Loan 20925, :Student_loan 15800, :Payday_loan 3862, :ver 0.0.1}}]
The result is similar to the next SQL with grouping and having :
SELECT COLUMN , MAX(COLUMN) FROM TABLE GROUP BY COLUMN 
(ns data-reader.core
  (:gen-class))
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

   
(def collection (atom {:ver "0.0.1" }))
   
(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Consumer Complaints Database  (parsing..)!")
  
  (defn csv-map
  "ZipMaps header as keys and values from lines."
  [head & line]
    (map #(zipmap (map keyword head) %1) line))
   

 (defn csv-map-ext 
   [head & rest-lines]
   ; (map csv-map rest-lines)
    (for [x rest-lines] (csv-map head x) )
 ) 
  
  
(defn match-vector [myVec]
(loop [data myVec, index 0]
     (def first-data (first data))
     (if-not (nil? (re-find #"(APR)" first-data))
      first-data
      (recur (rest data) (inc index)))
    )
)

(defn printObject 
" get object from the list assuming it is map and print all it's elements"
[& param]
; debug (println (nth param 1))
           ( def object (nth param 1) )
      ( ->
       (println)
       (print (object :Company) " - " (object :Product)  ) 
     )
)  


(defn ws-replce
"Replace whitespace in string  - 'par'"
[param]
(clojure.string/replace param #"[ ]+" "_")
)

(defn add-keyvalue-to-map
"add a new key to map or if exist increment value corresponding for this key"
[param  collection]
    (def col-key (read-string (str ":" param)))
    ;( println "debug add-keyvalue-to-map.1")
 ;( println  "col-key =" col-key )
 ;( println  "collection = " collection )
 ;( println  "@collection = " @collection )
 ;( println  "(get @collection col-key) = " (get @collection col-key))
    (if-not (nil? (get @collection col-key ))
    ( swap! collection update-in [col-key] inc)
  ;; notice that the value swapped in, is part of the returned value
    ( swap! collection assoc col-key 1)
 ) ; if

)
 
(with-open [in-file (io/reader "resources/Consumer_Complaints.csv")]
 (doall
    ;(csv/read-csv in-file))
  (let [csv (csv/read-csv in-file)] ; :separator \;)]
    (println  "rows count:" (count csv) )
    (println "columns count"(count (first csv)))
 ;(def ls (last csv))
 (def first-data (first csv))
  
   (for [x (rest csv)] 
    ( ->
    
        ;(def match (re-find #"(loan|suspend)" (reduce str x)))
  
  ;(if-not (empty? match) )
  ;( ->
   ;( println ( str "debug for - match: " match))
   ;(println (ws-replce (get x 1)))
         (add-keyvalue-to-map (ws-replce (get x 1)) collection)

   
       ; (println (nth (csv-map first-data x) 0))
             ; (printObject ( nth (csv-map first-data x) 0))
     ;(println (get x 1)) 
     ;(get x 1)
     

    ; (ws-replce (get x 1)) collection)
     ;(get-in [:LASTUPDATETIME])
       ;println
  ;)
  ;else do nothing
  ;(println "Not matched")
    )

 )
 ; this break the code
 ;(println collection)
   ) ; let  
  ) ;doall
) ; with 
(println collection)
 ) ;main


The final result of this nugget return only occurrence of one column. In this example it's second column.. To get more detailed information it's nice to see this combination: "Financial organization" and "Financial product". It's about Cartesian pair generation:: Creating all possible unique Cartesian pairs [x y] combinations from set, where pairs [x y] not equal pairs [y x] and x,y from the same collection and not numbers
and this post: Create a word frequency map out of a large string s This function help to generate all possible pairs
 
(defn cartesian-pairs 
"Function read col and return list of all possible pairs [ [x1 y1] .. [xi yj] ] where (x not equal y) and x,y not number for [xi yj] where i  
  (for [x coll  y coll  :when (not= x y) :when (and (not (number? x)) ( not (number? y))) 
  :when ( < (.indexOf coll x)  (.indexOf coll y))  ] 
   (str ":" x y )  ) 
 
  ) 
)



(defn add-cartesian-pairs-to-map
;"add a new key of Cartesian pairs  to map or if exist increment value corresponding for this key
;Result of this function: 
{:bf 2, :bd 1, :be 1, :fd 1, :fe 1, :ff 1, :de 1, :df 1, :ef 1}  
 where collection parameter is : [123 45 "b" "f"  "d" "e" 'f 123 1234 4534] "
[collection]
(reduce #(assoc %1 %2 (inc (%1 %2 0)))
        {}
         (cartesian-pairs collection) )
)


Then modify few lines in already existed code: add-keyvalue-to-map  This function has different behavior . Instead of running over only one column and generate only one {key: value} pair . It  is going over  all columns and generate all possible combinations of {key: value}, {[key1 key2]: value } . where value frequency occurrence of key combination in CSV file. After collecting ll together and runing this nugget after 40 min I'v got out ofmemory exception. CSV file too big 455 000 records 18 columns. So next post I will make more smart Cartesian pair selection more meaningful.
This is final version which stop only by "Out of memory exceptions"
(ns csv-statistic.core
  (:gen-class))
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])
(use 'clojure.string)
   
(def garbage-pattern #"(\w*day)|No|Yes|(N\/A)|(^[0-9])") ; where day pattern word
(def number-pattern #"^[0-9]+$")
(def us-phone-pattern #"^\D?(\d{3})\D?\D?(\d{3})\D?(\d{4})$")
(def date-pattern #"^(((0?[1-9]|1[012])/(0?[1-9]|1\d|2[0-8])|(0?[13456789]|1[012])/(29|30)|(0?[13578]|1[02])/31)/(19|[2-9]\d)\d{2}|0?2/29/((19|[2-9]\d)(0[48]|[2468][048]|[13579][26])|(([2468][048]|[3579][26])00)))$")
(def word-split-pattern #"\s+")
   
(def collection (atom {:ver "0.0.1" }))
   
(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Consumer Complaints Database  (parsing..)!")
  
  (defn csv-map
  "ZipMaps header as keys and values from lines."
  [head & line]
    (map #(zipmap (map keyword head) %1) line))
   

 (defn csv-map-ext 
   [head & rest-lines]
   ; (map csv-map rest-lines)
    (for [x rest-lines] (csv-map head x) )
 ) 
 
(defn cartesian-pairs 
"Function read col and return list of all possible pairs [ [x1 x2] .. [xi xj] ] where xi not = xj and xi not number   "
[ coll ]

  (->  
  (for [x coll  y coll  :when ( < (count x) 20)  :when (not= x y) :when (and (not (number? x)) ( not (number? y))) 
  :when ( < (.indexOf coll x)  (.indexOf coll y))  ] 
   (str ":" x y )  ) 
 
  ) 
) 
  
(defn match-vector [myVec]
(loop [data myVec, index 0]
     (def first-data (first data))
     (if-not (nil? (re-find #"(APR)" first-data))
      first-data
      (recur (rest data) (inc index)))
    )
)

(defn printObject 
" get object from the list assuming it is map and print all it's elements"
[& param]

           ( def object (nth param 1) )
      ( ->
       (println)
       (print (object :Company) " - " (object :Product)  ) 
      
     )
)  


(defn ws-replce
"Replace whitespace in string  - 'par'"
[param]
;(clojure.string/replace param #"(\w+)" "\U$0" )
(clojure.string/replace
(clojure.string/replace param #"([\s\w\-_]|^)([a-z0-9-_]+) "
     (fn [[b]]
       (str  (clojure.string/upper-case  (get b 0 ))  (get b 1 ) (get b 2 ) (get b 3 ) (get b 4 ) (get b 5 ) (get b 6 )(get b 7 )
           )))
 #"[ ]" "")
)



(defn add-keyvalue-to-map
"add a new key to map or if exist increment value corresponding for this key"
[param  collection]
    (def col-key (read-string (str ":" param)))
    ;( println "debug add-keyvalue-to-map.1")
 ;( println  "col-key =" col-key )
 ;( println  "collection = " collection )
 ;( println  "@collection = " @collection )
 ;( println  "(get @collection col-key) = " (get @collection col-key))
    (if-not (nil? (get @collection col-key ))
    ( swap! collection update-in [col-key] inc)
  ;; notice that the value swapped in, is part of the returned value
    ( swap! collection assoc col-key 1)
 ) ; if
 
 ;(println "after if")
)

(defn alones
[ coll ]
"Function read col and return list of [x] "
   (->
   (for [x coll] ; :when (not ( blank? x)) :when( < (count x) 15 ) :when (nil? (re-find number-pattern x)) :when (nil? (re-find date-pattern x)) :when (nil? (re-find garbage-pattern x))  ]
    (str ":" (ws-replce (trim x)) ))  
   ;println
   )  
)


 
(defn validator
"Function validator used to agregate condition by AND boolean multiplication"
[x]
 (and (not (number? x)) (not ( blank? x)) ( < (count x) 15 ) (nil? (re-find number-pattern x)) (nil? (re-find date-pattern x)) (nil? (re-find garbage-pattern x))  )
 
) 

(defn validated-indexes
[ coll ]
"Function read col and return list of [clollection x indexes which passed validation [ 1,5, ..k,.. l] where k,l 
   (for [x coll  :when (validator x)  ]
    (.indexOf coll x))
    ;println
   )  
 ))  
)

;(def myfunc-memo (memoize add-cartesian-pairs-to-map))
  (def row-n (atom 0))

(defn reg-exp-filter
 [x y]
 ( and (nil? (re-find number-pattern x )) (nil? (re-find date-pattern x)) (nil? (re-find garbage-pattern x ))
     (nil? (re-find number-pattern y )) (nil? (re-find date-pattern y)) (nil? (re-find garbage-pattern y )) )
)
(defn cartesian-pairs 
"Function read col and return list of all possible pairs [ [x1 y1] .. [xi yj] ] where (x not equal y) and x,y not number for [xi yj] where i  
 
 (time (doall 
  (for [x coll  y coll :when (and (not ( blank? x))(not ( blank? y))(not= x y)) :when (and ( < (.indexOf coll x)(.indexOf coll y)) ) ]

   (str ":" (ws-replce (trim x)) "|" (ws-replce (trim y)) )  ) 
))
   ) 
  ;(println (swap! row-n inc) )
)





(defn reduce-csv-row
    "Accepts a csv-row (a vector) a list of columns to extract, 
     and reduces (and returns) a csv-row to a subset based on 
     selection using the values in col-nums (a vector of integer 
     vector positions.)"
 
    [csv-rows col-nums]
 
    (reduce 
        (fn [out-csv-row col-num]
            ; Don't consider short vectors containing junk.
            ;(println "col-num: " col-num " out-csv-row: " out-csv-row)
            (def row-num 0)
            (if-not (<= (count csv-rows) 1)
               (conj out-csv-row 
                 (for [x csv-rows ]
                       (nth x col-num nil)
                 )
               )    
            ))
        []
        col-nums))

 
(defn add-cartesian-pairs-to-map
;"add a new key of Cartesian pairs  to map or if exist increment value corresponding for this key
;Result of this function: {:bf 2, :bd 1, :be 1, :fd 1, :fe 1, :ff 1, :de 1, :df 1, :ef 1}  
; where collection parameter is : [123 45 "b" "f"  "d" "e" 'f 123 1234 4534] "
[my-coll indexes]
;(reduce #(assoc %1 %2  (inc (%1 %2 0))) 
(reduce #(swap! collection assoc %2 (inc (%1 %2 0)))
        {}
  ;(alones (reduce into  ( reduce-csv-row  my-coll indexes )))) ;only [x]
        ;(reduce conj (cartesian-pairs my-coll) (alones my-coll) ))   ; both [x] [x y]
  (cartesian-pairs (reduce into (reduce-csv-row  my-coll indexes )))) ; only [x y] pairs
)

(defn add-uniquekeys-to-map
;"add a new key of Cartesian pairs  to map or if exist increment value corresponding for this key
;Result of this function: {:bf 2, :bd 1, :be 1, :fd 1, :fe 1, :ff 1, :de 1, :df 1, :ef 1}  
; where collection parameter is : [123 45 "b" "f"  "d" "e" 'f 123 1234 4534] "
[my-coll indexes]
;(reduce #(assoc %1 %2  (inc (%1 %2 0))) 
(reduce #(swap! collection assoc %2 (inc (%1 %2 0)))
        {}
  (alones (reduce into  ( reduce-csv-row  my-coll indexes )))) ;only [x]
        ;(reduce conj (cartesian-pairs my-coll) (alones my-coll) ))   ; both [x] [x y]
  ;(cartesian-pairs (reduce into (reduce-csv-row  my-coll indexes )))) ; only [x y] pairs
)

 
(with-open [in-file (io/reader "resources/book.csv")]
 (doall
    ;(csv/read-csv in-file))
  (let [csv (csv/read-csv in-file)] ; :separator \;)]
    (println  "rows count:" (count csv) )
    (println "columns count"(count (first csv)))
 (def first-data (first csv))
 (println first-data)
 
 (def csv-file-one-row (nth csv 1))
 
 (def indexes (validated-indexes csv-file-one-row))

  (time (doall  
   (add-uniquekeys-to-map (rest csv) indexes) 
        )) 
  
       (println "Hash map of unique elements with theirs frequency: " collection) 
  ; This is very time and memory consuming function  
     ;(time (doall  
  ; (add-cartesian-pairs-to-map (rest csv) indexes) 
   ; (add-cartesian-pairs-to-map x) ;(list ws-replce x ) ;"Elapsed time: 16135.905126 msecs"
  ;( myfunc-memo x) ;"Elapsed time: 20594.399339 msecs"
        ;)) 

   (time (doall  
  ; (println
     (for [x (rest csv) y (rest csv) indx indexes :when (and (not ( blank? (nth x indx))) (not ( blank? (nth y indx))) (not= (nth x indx) (nth y indx)) )  ] ;:when (and ( < (.indexOf coll x)(.indexOf coll y)) ) 
    ; get first index and use it to navigate it to column , then use this column to create all possible combination's elements from this column with element of collection 
    (add-keyvalue-to-map ( str (ws-replce (nth x indx)) "|" (ws-replce (nth y indx)) ) collection)
    
   ) ;for
  ; )
   )) 
   
;(println "row: " (inc row)

  

   ) ; let  
  ) ;doall
) ; with 

  
       (println "Hash map of unique Cartesian pairs elements with theirs frequency: " collection) 
 ) ;main


No comments: