Monday, May 2, 2016

CSV file reducing examples in Clojure

Did you try to parse CSV file in size of 1 Gb ? And not only parse it but get some crunch from it. For example grab some statistic info: "How many times this word or this cell content meet in all others cells". In case of financial data for if file has complains based by financial institutions it's means how many complains every financial institution has , and at the same time the same function return how many complains every financial product has . If the function retrieve nice chart and plus do it fast then it's best functional tool which cost to spend some time on it. Reducing CSV files and preparing data for analysing could be done at least in two ways: Fist by selecting column by column (all code here  on GitHub: csv-statistic)


(defn reduce-csv-row
    "Accepts a csv-row (a vector) a list of columns to extract, 
     and reduces (and returns) a csv-row to a subset based on 
     selection using the values in col-nums (a vector of integer 
     vector positions.)"

    [csv-row col-nums]

    (reduce  
        (fn [out-csv-row col-num]
            ; Don't consider short vectors containing junk.
            (println "col-num: " col-num " out-csv-row: " out-csv-row)
            (if-not (<= (count csv-row) 1)
                (conj out-csv-row (nth csv-row col-num nil))))
        []
        col-nums))
(->>
   

  (reduce-csv-row [[4 5 6][1 2 7] [5 6]] [1 2] )
  println
)
 
Gives the result:
col-num:  1  out-csv-row:  []                                                                                                                                                                                                                           
col-num:  2  out-csv-row:  [[1 2 7]]                                                                                                                                                                                                                    
[[1 2 7] [5 6]]
This function really reduce CSV files by rows so [1 2 7] and[5 6] are 1st and 2nd row started from 0. It's defenetly not enough and I created another function which reduce CSV files by column's indexes:
(defn reduce-csv-row
    "Accepts a csv-row (a vector) a list of columns to extract, 
     and reduces (and returns) a csv-row to a subset based on 
     selection using the values in col-nums (a vector of integer 
     vector positions.)"
 
    [csv-rows col-nums]
 
    (reduce 
        (fn [out-csv-row col-num]
            ; Don't consider short vectors containing junk.
            (println "col-num: " col-num " out-csv-row: " out-csv-row)
            (def row-num 0)
            (if-not (<= (count csv-rows) 1)
               (conj out-csv-row 
                 (for [x csv-rows]
                       (nth x col-num nil)
                 )
               )    
            ))
        []
        col-nums))
This is example how to call this function:
(->
  (reduce-csv-row 
   [[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] [20 21 22 23 24 25 26 27 28 29 10 11 12 13 14 15] [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45] [50 51 52 53 54 55 56 57 58 59 60 11 12 13 14 15]]
  (validated-indexes csv-file-one-row) )
  ;(validated-indexes line) 
  println
) 
The final result selected columns:
Col-num:  0  out-csv-row:  []                                                                                                                                                                                                                          
col-num:  8  out-csv-row:  [(0 20 30 50)]                                                                                                                                                                                                              
col-num:  9  out-csv-row:  [(0 20 30 50) (8 28 38 58)]                                                                                                                                                                                                 
[(0 20 30 50) (8 28 38 58) (9 29 39 59)]

Second more smart way and more generic by removal garbage cells by regular expressions Here some regular expressions:
(ns clojure.examples.regular-exp
 (:gen-class))
(require '[clojure.string :as str])

(def some-quote 
  (str "It was the best of times. "
  "It was the worst of times. It was Friday "
  "night and it was late."))

(def day-pattern #"\w*day ")
(def garbage-pattern #"(\w*day)|No|Yes|(N\/A)") ; where day pattern word
(def number-pattern #"(\S+):(\d+)")
(def date-pattern #"^(((0?[1-9]|1[012])/(0?[1-9]|1\d|2[0-8])|(0?[13456789]|1[012])/(29|30)|(0?[13578]|1[02])/31)/(19|[2-9]\d)\d{2}|0?2/29/((19|[2-9]\d)(0[48]|[2468][048]|[13579][26])|(([2468][048]|[3579][26])00)))$")
(def word-split-pattern #"\s+")
(defn java-interop-regex
  "just doing Clojure regex with Java APIs"
  []
  (let [day-found (re-find day-pattern some-quote)]
    (println "Is there a day? ... " day-found)))
 

(def line " RX packets:1871074138 errors:5 1/12/96 dropped:48 overruns:9")
(def ^:dynamic *inc-matcher* (re-matcher #"\d+" line))
(def ^:dynamic *exc-matcher* (re-matcher date-pattern line))
( ->
 (re-find *inc-matcher*)
 println  
)

( ->>
 (nil? (re-find garbage-pattern line)) 
 ;(nil? (re-find date-pattern line)) 
 ;(nil? (re-find number-pattern line))
 ;(str/split (str/trim line) word-split-pattern)
 println  
)



Reducing by regular expression it's more generic and after applying regular expression patterns to reduce garbage it will remove cell which have 'Yes'/'No' date and empty strings No we have pretty result now. Reducing CSV elements could be done by one operation. No atomic collection will be required no nutation.
This one operation code:
(defn add-cartesian-pairs-to-map
;"add a new key of Cartesian pairs  to map or if exist increment value corresponding for this key
;Result of this function: {:bf 2, :bd 1, :be 1, :fd 1, :fe 1, :ff 1, :de 1, :df 1, :ef 1}  
; where collection parameter is : [123 45 "b" "f"  "d" "e" 'f 123 1234 4534] "
[my-coll]
(reduce #(assoc %1 %2  (inc (%1 %2 0))) 
;(reduce #(swap! collection assoc %2 (inc (%1 %2 0)))
        {}
  (alones (reduce into my-coll)))
        ;(reduce conj (cartesian-pairs my-coll) (alones my-coll) ))
)
This function use standard csv vector of vectors embedded vector is CSV row :
(add-cartesian-pairs-to-map (rest csv)) ;"Elapsed time: 10049.654962 msecs" with atomic collection and "Elapsed time: 10476.827934 msecs" with regular collection
To make Cartesian's pairs uncomment this
;(reduce conj (cartesian-pairs my-coll) (alones my-coll) ))
Consumer Complaints Database  (parsing..)!
rows count: 554542
columns count 18
[Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code Tags Consumer
 consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID]
"Elapsed time: 10049.654962 msecs"
#object[clojure.lang.Atom 0x26adfd2d {:status :ready, :val {:Ocwen 20950, :`2285 1, :BarclaysPLC 2735, :[36XX 1, :IBERIABANK 93, :FirstBank
52, :CollegeAssist 3, :NCS,Inc 7, :Sarma 27, :Man-DataInc. 6, :NIC,Inc. 12, :SC 6979, :Plastyc,Inc. 4, :TCARInc 1, :MH 27, :MEMACORP 1, :441
4 1, :`3290 1, :NC 15450, :Inprogress 3861, :AspenYoLLC 26, :Creditcard 87048, :TaxServ,LLC 7, :Othermortgage 74213, :AES/PHEAA 1805, :FHCMo
rtgage 1, :CHS,Inc. 12, :Creditrepair 54, :Servicemember 26258, :6061 1, :ConServe 191, :Mayan,LLC 4, :FM 32, :RPMMortgage 5, :PLSGROUP,INC
41, :GMFinancial 501, :Latefee 2829, :CathayBank 8, :RiaFinancial 36, :GCFS,Inc. 14, :PW 9, :2346 1, :DLC,LLC 24, :MT 889, :M.A.R.S.Inc. 35,
 :MS 2562, :Xoom 34, :WI 6098, :Rewards 1779, :Credico,Inc. 21, :2531 1, :Lenderfi,Inc. 3, :0802 1, :Closed 13368, :IdoNotknow 21580, :HOVG,
LLC 266, :GU 69, :AloricaInc. 609, :ERC 2692, :OR 6577, :4451 1, :Web 360334, :9582 1, :DebtWaspaid 11283, :VT 941, :IsNotme 1, :P.S.C.,Inc
40, :DevalLLC 16, :DLGLLC 1, :NVRInc 45, :8022 1, :ME 1925, :NFM,Inc. 6, :Elly,LLC 1, :(9682 1, :0288 2, :IvanBrown 1, :Hewitt-Capital 16, :
Comerica 785, :IPAC'SInc. 1, :FDIC 3, :BMOHarris 749, :4566 1, :VAmortgage 3725, :ChaparraInc 2, :CashMoneyLLC 2, :Bizcorp,LLC 2, :OutTech,I
nc. 8, :Recheck,Inc 2, :JzanusLtd. 6, :Innovis 121, :N.A.R.,Inc. 50, :OlderAmerican 45144, :AK 634, :RFNA,LP 39, :Nexcheck,LLC 5, :7007 1, :
4412 1, :WY 547, :[2754 1, :WA 11504, :AssentInc. 3, :OH 17316, :MyCRD 1, :2759 1, :CCBAInc. 4, :8910 1, :Vehicleloan 12137, :Auto 2510, :Ve
hiclelease 1637, :NV 6738, :2752 1, :PlusFour,Inc 26, :GFSII,LLC 8, :CarMax,Inc. 75, :HI 1924, :LexisNexis 204, :0236 1, :EverBank 813, :KS
2688, :Addingmoney 154, :-0142 1, :MetLifeBank 184, :5313 1, :Walwick,Inc 7, :Bankruptcy 379, :QuickenLoans 964, :RI 1894, :9837 1, :Seterus
,Inc. 2832, :7611 1, :9454 2, :NH 2822, :6314 1, :Jomax,LLC 6, :9730 1, :TFLCTrust 1, :AL 5610, :C/CFinancial 2, :Nelnet 138, :TowerLoan 150
, :PMGI,LLC 1, :Privacy 389, :USForexInc. 3, :Experian 30768, :Remitly,Inc. 4, :QuikFundInc. 1, :fradulent 1, :Bogman,Inc 37, :MidFirstBank
5, :-3343 1, :3704 1, :Sigue 5, :USAASavings 2163, :FrostBank 36, :PMAB,LLC 65, :LoanDepot 233, :UnifinInc. 14, :GETARATELLC 1, :UMBBank 107
, :RAB,INC. 10, :9521 1, :Titleloan 326, :E*TradeBank 148, :NY 38124, :NPSGROUP 17, :Gettingaloan 689, :FlagstarBank 1711, :CMGFinancial 42,
 :TeccoSystems 1, :1012 1, :Otherfee 1664, :MoneyGram 1011, :Fax 8092, :PulteMortgage 44, :Equifax 31736, :*1212 1, :6062 1, :CWFinancial 5,
 :BankUnited 191, :TFCGroup,LLC 2, :BOCLLC 29, :Acopia,LLC 4, :Mobiloans,LLC 180, :IL 19546, :MO 7208, :1421 1, :Olddebt 1, :MCMCAutoLTD 2,
:ALCHROINC. 34, :CuroInc. 1, :Overlimifee 177, :RMB,Inc. 26, :ArgonCredit 19, :BankOfHawaii 66, :MiLEND,Inc. 2, :Webcollex,LLC 49, :PHHMortg
age 1474, :GoogleInc. 78, :SweHomesLP 7, :Arbitration 252, :BNYMellon 152, :UT 2944, :0160 1, :Afni,Inc. 991, :LKManley,Inc 3, :DTAGROUPLLC
39, :U.S.Bancorp 9533, :VHDA 30, :CFMGroupLLC 12, :VA 18076, :SeleneFinance 642, :9407 1, :1891 1, :KY 3972, :MN 6402, :AcreMortgage 5, :Arv
estBank 429, :Dirick,Inc. 1, :CFSACCEPTANCE 2, :D.B.M.E.,INC 4, :AS 17, :SenteMortgage 1, :ResourcePro 1, :MTACCInc. 1, :6661 2, :EmbassyLoa
ns 1, :RevCrest,Inc. 1, :Payrollcard 313, :Transitcard 30, :FlurishInc. 15, :NJ 22332, :JPayInc. 6, :[9217 1, :GA 24444, :Dailycalls 1, :Dis
closures 29, :DCQLLC 7, :5322 2, :MD 17611, :Bliksum,LLC 72, :CommerceBank 175, :CapitalOne 15598, :FCHoldCoLLC 51, :EarmLLC 15, :SFS,Inc 37
, :AstoriaBank 208, :PLDR 1, :NavyFCU 1590, :SoundMortgage 3, :SiwellInc 14, :CT 6424, :1120 1, :ULRS,Inc. 8, :NRAGroup,LLC 305, :EZCORP,Inc
. 105, :MFP,inc. 25, :0740 1, :Daiyaan,Inc 1, :SklarLawLLC 2, :AmeriCollect 100, :CalLoanCorp 2, :8640 1, :9027 1, :FactualData 44, :4530 1,
 :LenderLive 92, :Cashadvance 193, :Equidata,Inc. 35, :[9507 2, :Prepaidcard 2464, :Kwikcash,Inc. 1, :DirectCapital 22, :KRockfordLLC 1, :Am
erisave 169, :L2C 1, :SD 753, :AA 8, :Grant&Weber 184, :MI 14439, :Pay-O-Matic 2, :UBSBank 42, :IN 6096, :GuildMortgage 110, :UnionBank 412,
 :LHRInc. 47, :AmegyBank 3, :Citibank 25479, :7705 1, :DC 3212, :eConCreditLP 1, :3057 1, :2830 1, :9212 1, :7825 1, :Discover 4739, :Whitne
yBank 127, :ScottradBank 179, :Rabobank 57, :9330 1, :Referral 109184, :DE 2904, :EastWestBank 64, :LA 5371, :SimpleLoans 1, :LoanToLearn 63
, :8021 1, :3495 1, :Email 344, :DeutscheBank 48, :2015 1, :Escallate,LLC 83, :RapidMortgage 1, :PNCBankN.A. 6961, :[9175 1, :3461 1, :BBVAC
ompass 1013, :FHAmortgage 19115, :ArmadaCorp. 18, :OmegaRMS,LLC 29, :SynovusBank 259, :`29XX 1, :JemcapLLC 3, :Io,Inc. 47, :Ineedhelp 1, :Ar
borMortgage 3, :7122 1, :Jonsue,LLC 1, :3830 1, :ConsumerLoan 20925, :Kinum,Inc. 13, :AZ 12280, :Jvsgroup 3, :VWCredit,Inc 124, :Americash 5
, :-2914 1, :BluFiLending 6, :[7864 4, :CFFunding 2, :`1195 1, :RevSolve,Inc. 33, :Publicrecord 5549, :ALG 1, :Rossman&Co 13, :6415 1, :FMSI
nc. 61, :EMortgage 1, :AbleMortgage 1, :NTFN,Inc 4, :TX 41178, :FTC 13, :Accountstatus 26719, :Medical 13237, :Excessivfees 58, :TN 8481, :S
allieMae 851, :Mortgage 189576, :ArmconCorp 17, :1121 2, :RodenburLLP 18, :CA 81317, :PASCO,Inc 4, :ND 473, :AccessGroup 83, :9029 1, :Payof
fprocess 1766, :Conn's,Inc. 154, :NDS,LLC. 1, :0707 1, :StratedgInc 1, :iServeTrust 7, :ONYGLO,INC. 1, :WV 1448, :4302 1, :AR 2371, :2638 1,
 :0926 1, :[29XX 1, :LoanCare 571, :VI 141, :BB&TFinancial 2562, :BayEquity 7, :Viamericas 5, :RANLife,Inc. 1, :OCC 2, :8916 1, :PinReid,LLC
 2, :LPSG 2, :1000 1, :Ceannate 72, :Valarity,LLC 48, :MA 11062, :3370 1, :GreatLakes 87, :InvestorBank 58, :.6303 1, :3021 1, :PA 19920, :V
ericheck 1, :GARRETTFIRM 4, :4361 1, :0832 1, :Pawnloan 59, :FedloanHelp 1, :MOHELA 95, :9370 1, :3271 1, :CRLHomeLoans 1, :Epn,Inc. 8, :WLC
C 31, :MP 19, :Checkcashing 153, :ARCServices 6, :Remex,Inc. 25, :MVBA,L.L.C. 3, :LoganViewLLC 3, :9970 1, :VeriCore,LLC 5, :Veldos,LLC 27,
:Apelles,LLC 20, :GFS,Inc. 6, :ZOA,LLC 6, :WebsterBank 204, :-1631 1, :.0875 1, :4443 1, :KeyBankNA 1420, :3316 1, :O.D.REI,Inc 1, :MorganSt
anley 252, :SIRVA,Inc. 3, :SpiriterLLC 1, :USCB,INC. 44, :CFS2,Inc. 2, :SCVInc. 17, :DriveTime 240, :Other 15075, :MEFA 38, :DHIMortgage 33,
 :0611 1, :AE 216, :CO 9463, :WeCollectInc 12, :Fees 139, :CreditOneLLC 5, :ZestFinance 5, :9176 1, :GMFSLLC 5, :Postalmail 36687, :Moneytre
Inc 11, :0106 1, :1972 1, :GryphonCorp 93, :FL 53462, :LPH,Inc. 2, :Amex 5740, :6126 1, :[45XX 1, :NE 1734, :EarnestInc. 4, :9000 1, :ARMWNY
LLC 132, :TheTRAFGroup 26, :ver 0.0.1, :CLSMortgage 1, :PR 1304, :3270 1, :CoreLogic 219, :EBA,LLC 3, :CB1,Inc. 12, :EZLoans,Inc. 1, :Ophrys
,L.L.C. 5, :Jormandy,LLC 3, :Mobilewallet 239, :Convergys 24, :*0104 1, :OK 3610, :IA 2280, :FirstCitizens 168, :ResMacInc. 6, :[3220 1, :Mo
neyorder 70, :FraudOrscam 2041, :Studentloan 15800, :CashCall,Inc. 937, :Phone 39891, :CNPSolutions 2, :Avante 59, :SFMC,LP 1, :4843 1, :071
0 1, :2162 1, :ID 1929, :N&SPartners 3, :CMM,LLC 2, :SVIGroupInc. 1, :PentagonFCU 459, :7790 1, :FixMedia 2, :AP 152, :Accountterms 7208, :J
MD,LLC 13, :MEDCAH,Inc. 8, :NM 2767, :RMPGroupInc. 285, :1158 1}}]

No comments: