4 Loading Data into DataFrames
This chapter covers
- Creating DataFrames from raw data stored on both local disk and distributed filesystems in a variety of popular formats
- Defining the schema for a dataset and using it to parse data into a DataFrame
- Extracting data from a SQL relational database and manipulating it using Dask
I’ve given you a lot of concepts to chew on over the course of the previous three chapters, all of which will serve you well along your journey to becoming a Dask expert. But we’re now ready to roll up our sleeves and get into working with some data. The obvious place to start any data science project is to gather and import the data you’re interested in studying, and this book is no exception!
One of the unique challenges that data scientists face is our tendency to study data at rest, or data that wasn’t specifically collected for the purpose of predictive modeling and analysis. This is quite different from a traditional academic study in which data is carefully and thoughtfully collected. Consequently, you’re likely to come across a wide variety of storage media and data formats throughout your career. We will cover reading data in some of the most popular formats and storage systems in this chapter, but by no means does this chapter cover the full extent of Dask’s abilities. Dask is very flexible in many ways, and the DataFrame API’s ability to interface with a very large number of data collection and storage systems is a shining example of that.
4.1 Reading Data from Text Files
We’ll start with the simplest and most common format you’re likely to come across: delimited text files. Delimited text files come in many flavors, but all share the common concept of using special characters called delimiters that are used to divide data up into logical rows and columns.
Every delimited text file format has two types of delimiters: row delimiters and column delimiters. A row delimiter is a special character which indicates that you’ve reached the end of a row, and any additional data to the right of it should be considered part of the next row. The most common row delimiter is simply a newline character. Using the newline character as a row delimiter is a standard choice because it provides the additional benefit of breaking up the raw data visually and reflects the layout of a spreadsheet.
Likewise, a column delimiter indicates the end of a column, and any data to the right of it should be treated as part of the next column. Of all the popular column delimiters out there, the comma (,) is the most frequently used. In fact, delimited text files that use comma column delimiters have a special file format named for it: comma separated values or CSV for short. Among other common options are pipe (|), tab, space, and semicolon.
Figure 4.1. The structure of a delimited text file

In Figure 4.1, you can see the general structure of a delimited text file. This one in particular is a CSV file because we’re using commas as the column delimiter. Also, since we’re using the newline as the row delimiter, you can see that each row is on its own line. Two additional attributes of a delimited text file that we haven’t discussed yet include an optional header row, indicated in red, and text qualifiers. A header row is simply the use of the first row to specify names of columns. Here, Person ID, Last Name, and First Name aren’t descriptions of a person; they are metadata that describe the data structure. While not required, a header row can be helpful for communicating what your data structure is supposed to hold.
Text qualifiers are yet another type of special character, used to denote that the contents of a column are a text string. They can be very useful in instances where the actual data is allowed to contain characters that are also being used as row or column delimiters. This is a fairly common issue when working with CSV files that contain text data, because commas normally show up in text. Surrounding these columns with text qualifiers indicates that any instances of the column or row delimiters inside the text qualifiers should be ignored.
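To make the role of text qualifiers concrete, here is a minimal sketch using Python’s standard csv module. The sample data is hypothetical and is not taken from the parking ticket dataset; it simply contains one field with an embedded comma.

```python
import csv
import io

# Hypothetical sample data: the Last Name of the first record contains a
# comma, so it is wrapped in double quotes (the text qualifier).
raw = 'Person ID,Last Name,First Name\n1,"Smith, Jr.",John\n2,Jones,Mary\n'

# Naive splitting treats the comma inside the quotes as a column delimiter.
naive_rows = [line.split(',') for line in raw.splitlines()]

# csv.reader honors the text qualifier and keeps the quoted field intact.
parsed_rows = list(csv.reader(io.StringIO(raw), delimiter=',', quotechar='"'))

print(naive_rows[1])   # four fields: the name was split in two
print(parsed_rows[1])  # three fields: the name stays whole
```

The naive split produces one field too many for the quoted record, while the qualifier-aware reader returns the three columns promised by the header row.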
Now that you’ve had a look at the structure of delimited text files, let’s have a look at how to apply this knowledge by importing some delimited text files into Dask. The NYC Parking Ticket data we briefly looked at in Chapter 2 comes as a set of CSV files, so this will be a perfect dataset to work with for this example. If you haven’t downloaded the data already, you can do so by visiting www.kaggle.com/new-york-city/nyc-parking-tickets. As I mentioned before, I’ve unzipped the data into the same folder as the Jupyter notebook I’m working in for convenience’s sake. If you’ve put your data elsewhere, you’ll need to change the file path to match the location where you saved the data.
Listing 4.1. Importing CSV files using Dask defaults
>>> import dask.dataframe as dd
>>> from dask.diagnostics import ProgressBar
>>> fy14 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv')
>>> fy15 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2015.csv')
>>> fy16 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv')
>>> fy17 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2017.csv')
>>> fy17
In Listing 4.1, the first two lines should look familiar: we’re simply importing the DataFrame library and the ProgressBar context. In the next four lines of code, we’re reading in the four CSV files that come with the NYC Parking Ticket dataset. For now, we’ll read each file into its own separate DataFrame. Let’s have a look at what happened by inspecting the fy17 DataFrame.
Figure 4.2. The metadata of the fy17 DataFrame

In Figure 4.2, we see the metadata of the fy17 DataFrame. Using the default 64MB blocksize, the data was split into 33 partitions. You might recall this from Chapter 3. You can also see the column names at the top, but where did those come from? By default, Dask assumes that your CSV files will have a header row, and our file indeed has a header row. If you look at the raw CSV file in your favorite text editor, you will see the column names on the first line of the file. If you want to see all the column names, you can inspect the columns attribute of the DataFrame.
Listing 4.2. Inspecting the columns of a DataFrame
>>> fy17.columns
'''
Produces the output:
Index([u'Summons Number', u'Plate ID', u'Registration State', u'Plate Type',
       u'Issue Date', u'Violation Code', u'Vehicle Body Type', u'Vehicle Make',
       u'Issuing Agency', u'Street Code1', u'Street Code2', u'Street Code3',
       u'Vehicle Expiration Date', u'Violation Location',
       u'Violation Precinct', u'Issuer Precinct', u'Issuer Code',
       u'Issuer Command', u'Issuer Squad', u'Violation Time',
       u'Time First Observed', u'Violation County',
       u'Violation In Front Of Or Opposite', u'House Number', u'Street Name',
       u'Intersecting Street', u'Date First Observed', u'Law Section',
       u'Sub Division', u'Violation Legal Code', u'Days Parking In Effect ',
       u'From Hours In Effect', u'To Hours In Effect', u'Vehicle Color',
       u'Unregistered Vehicle?', u'Vehicle Year', u'Meter Number',
       u'Feet From Curb', u'Violation Post Code', u'Violation Description',
       u'No Standing or Stopping Violation', u'Hydrant Violation',
       u'Double Parking Violation'],
      dtype='object')
'''
If you happen to take a look at the columns of any other DataFrame, such as fy14 (Parking Tickets for 2014), you’ll notice that the columns are different from the fy17 (Parking Tickets for 2017) DataFrame. It looks as though the NYC government changed what data they collect about parking violations in 2017. For example, the latitude and longitude of the violation was not recorded prior to 2017, so these columns won’t be useful for analyzing year over year trends (such as how parking violation “hotspots” migrate throughout the city). If we simply concatenated the datasets together as is, we would get a resulting DataFrame with an awful lot of missing values. Before we combine the datasets together, we should find the columns that all four of the DataFrames have in common. Then we should be able to simply union the DataFrames together to produce a new DataFrame that contains all four years of data.
We could manually look at each DataFrame’s columns and deduce which columns overlap, but that would be terribly inefficient. Instead, we’ll automate the process by taking advantage of the DataFrames’ columns attribute and Python’s set operations. Listing 4.3 shows you how to do this.
Listing 4.3. Finding the common columns between the four DataFrames
# Import for Python 3.x
>>> from functools import reduce
>>> columns = [set(fy14.columns),
...            set(fy15.columns),
...            set(fy16.columns),
...            set(fy17.columns)]
>>> common_columns = list(reduce(lambda a, i: a.intersection(i), columns))
After importing reduce, we create a list that contains four set objects, respectively representing each DataFrame’s columns. On the next line, we take advantage of the intersection method of set objects, which returns a set containing the items that exist in both of the sets it’s comparing. Wrapping this in a reduce function, we’re able to walk through each DataFrame’s metadata, pull out the columns that are common to all four DataFrames, and discard any columns that aren’t found in all four DataFrames. What we’re left with is the following abbreviated list of columns:
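The folding behavior of reduce can be easier to see on a small example. The following sketch uses hypothetical column lists in place of the four DataFrames’ columns attributes, and also shows that set.intersection accepts several sets at once, which gives the same result without reduce.

```python
from functools import reduce

# Hypothetical column lists standing in for the four DataFrames' columns.
column_lists = [
    ['Summons Number', 'Plate ID', 'Issue Date', 'Latitude'],
    ['Summons Number', 'Plate ID', 'Issue Date', 'Longitude'],
    ['Summons Number', 'Plate ID', 'Issue Date'],
    ['Summons Number', 'Plate ID', 'Issue Date', 'Violation Code'],
]
column_sets = [set(cols) for cols in column_lists]

# reduce folds the intersection across the sets pairwise, left to right.
common_via_reduce = reduce(lambda acc, cols: acc.intersection(cols), column_sets)

# set.intersection also accepts several sets at once.
common_direct = set.intersection(*column_sets)

# Sets are unordered, so sort the result for a stable, readable list.
common_columns_demo = sorted(common_via_reduce)
print(common_columns_demo)
```

Because sets carry no ordering, the order of the resulting column list is arbitrary unless you sort it, which is why the abbreviated list shown in the chapter does not follow the column order of the CSV files.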
['House Number', 'No Standing or Stopping Violation', 'Sub Division', 'Violation County', 'Hydrant Violation', 'Plate ID', 'Plate Type', 'Vehicle Year', 'Street Name', 'Vehicle Make', 'Issuing Agency', ... 'Issue Date']
Now that we have a set of common columns shared by all four of the DataFrames, let’s take a look at the first couple of rows of the fy17 DataFrame.
Listing 4.4. Looking at the head of the fy17 DataFrame
>>> fy17[common_columns].head()
# Produces the following output:
Figure 4.3. The first five rows of the fy17 DataFrame using the common column set

Two important things are happening in Listing 4.4: the column filtering operation and the top collecting operation. Specifying one or more columns in square brackets to the right of the DataFrame name is the primary way you can select/filter columns in the DataFrame. Since common_columns is a list of column names, we can pass that in to the column selector and get a result containing the columns contained in the list. We’ve also chained a call to the head method, which allows you to view the top n rows of a DataFrame. By default, it will return the first five rows of the DataFrame, but you can specify the number of rows you wish to retrieve as an argument. For example, fy17.head(10) will return the first ten rows of the DataFrame. Keep in mind that when you get rows back from Dask, they’re being loaded into your computer’s RAM. So, if you try to return too many rows of data, you will receive an out of memory error. Now let’s try the same call on the fy14 DataFrame.
Listing 4.5. Looking at the head of the fy14 DataFrame
>>> fy14[common_columns].head()
# Produces the following output:
Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------+---------+----------+
| Column                | Found   | Expected |
+-----------------------+---------+----------+
| Issuer Squad          | object  | int64    |
| Unregistered Vehicle? | float64 | int64    |
| Violation Description | object  | float64  |
| Violation Legal Code  | object  | float64  |
| Violation Post Code   | object  | float64  |
+-----------------------+---------+----------+

The following columns also raised exceptions on conversion:

- Issuer Squad
  ValueError('cannot convert float NaN to integer',)
- Violation Description
  ValueError('invalid literal for float(): 42-Exp. Muni-Mtr (Com. Mtr. Z)',)
- Violation Legal Code
  ValueError('could not convert string to float: T',)
- Violation Post Code
  ValueError('invalid literal for float(): 05 -',)

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually
Looks like Dask ran into trouble when trying to read the fy14 data! Thankfully, the Dask development team has given us some pretty detailed information in this error message about what happened. Five columns, “Issuer Squad”, “Unregistered Vehicle?”, “Violation Description”, “Violation Legal Code”, and “Violation Post Code”, failed to be read correctly because their datatypes were not what Dask expected. As we learned in Chapter 2, Dask uses random sampling to infer datatypes in order to avoid scanning the entire (potentially massive) DataFrame. While this usually works well, it can break down when there are a large number of missing values in a column, or when the vast majority of data can be classified as one datatype (such as an integer) but there are a small number of edge cases that break that assumption (such as a random string or two). When that happens, Dask will throw an exception once it begins to work on a computation. In order to help Dask read our dataset correctly, we’ll need to manually define a schema for our data instead of relying on type inference. Before we get around to doing that, let’s review what datatypes are available in Dask so we can create an appropriate schema for our data.
4.1.1 Using Dask Datatypes
Similar to relational database systems, column datatypes play an important role in Dask DataFrames. They control what kind of operations can be performed on a column, how overloaded operators (+, -, etc.) behave, and how memory is allocated to store and access the column’s values. Unlike most collections and objects in Python, Dask DataFrames use explicit typing rather than duck typing. This means that all values contained in a column must conform to the same datatype. As we saw already, Dask will throw errors if values in a column are found that violate the column’s datatype.
Since Dask DataFrames consist of partitions made up of Pandas DataFrames, which in turn are complex collections of NumPy arrays, Dask sources its datatypes from NumPy. The NumPy library is a powerful and important mathematics library for Python. It enables users to perform advanced operations from linear algebra, calculus, and trigonometry. This library is important for the needs of data science because it provides the cornerstone mathematics for many statistical analysis methods and machine learning algorithms in Python. Let’s take a look at NumPy’s datatypes.
Figure 4.4. NumPy datatypes used by Dask

Figure 4.4 lists all of the NumPy datatypes. As you can see, many of these reflect the primitive types in Python. The biggest difference is that NumPy datatypes can be explicitly sized with a specified bit-width. For example, the int32 datatype is a 32-bit integer that allows any integer between −2,147,483,648 and 2,147,483,647. Python, by comparison, does not let you choose a width: in Python 3 the built-in int is an arbitrary-precision object that grows as needed, so even a small integer occupies more memory than a fixed-width NumPy value. The advantage of using smaller datatypes where appropriate is that you can hold more data in RAM and the CPU’s cache at one time, leading to faster, more efficient computations. This means that when creating a schema for your data, you should always choose the smallest possible datatype to hold your data. The risk, however, is that if a value exceeds the maximum size allowed by the particular datatype, you will experience overflow errors, so you should think carefully about the range and domain of your data.
For example, consider house prices in the United States: home prices are typically above $32,767 and are unlikely to exceed $2,147,483,647 for quite some time if historical inflation rates prevail. Therefore, if you were to store house prices rounded to the nearest whole dollar, the int32 datatype would be most appropriate. While the int64 type is wide enough to hold this range of numbers, it would be inefficient to use more than 32 bits of memory to store each value. Likewise, int8 or int16 would not be large enough to hold the data, resulting in an overflow error.
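The reasoning behind the house price example can be sketched in plain Python. The helper names and the sample prices below are hypothetical, introduced only for illustration; the ranges follow from the standard two’s-complement layout of signed integers.

```python
def signed_range(bits):
    """Return the (min, max) values of a two's-complement signed integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_signed_width(low, high, widths=(8, 16, 32, 64)):
    """Return the narrowest bit-width whose range contains [low, high]."""
    for bits in widths:
        range_min, range_max = signed_range(bits)
        if range_min <= low and high <= range_max:
            return bits
    raise OverflowError('no candidate width can hold this range')


# Hypothetical house prices, rounded to whole dollars.
prices = [85_000, 249_900, 1_750_000]
width = smallest_signed_width(min(prices), max(prices))

print(signed_range(16))  # too narrow for typical house prices
print(signed_range(32))  # wide enough
print(width)
```

Note that a check like this only covers the values you have actually seen, so it is worth leaving headroom when the domain of the data could grow.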
If none of the NumPy datatypes are appropriate for the kind of data you have, a column can be stored as an “object” type, which represents any Python object. This is also the datatype that Dask will default to when its type inference comes across a column that has a mix of numbers and strings, or when type inference cannot determine an appropriate datatype to use. However, there’s one common exception to this rule that happens when you have a column with a high percentage of missing data. Take a look at Figure 4.5, which shows part of the output of that last error message again:
Figure 4.5. A Dask error showing mismatched datatypes

Would you really believe that a column called “Violation Description” should be a floating-point number? Probably not! Typically, we can expect description columns to be text, and therefore Dask should use an object datatype. Then why did Dask’s type inference think the column holds 64-bit floating point numbers? It turns out that a large majority of records in this DataFrame have missing violation descriptions. In the raw data, they are simply blank. Dask treats blank records as null values when parsing files, and by default fills in missing values with NumPy’s NaN (not a number) object called np.nan. If you use Python’s built-in type function to inspect the datatype of an object, it reports that np.nan is a float type. So, since Dask’s type inference randomly selected a bunch of np.nan objects when trying to infer the type of the Violation Description column, it assumed that the column must contain floating point numbers. Now let’s fix the problem so we can read in our DataFrame with the appropriate datatypes.
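The following sketch illustrates why a sample made up entirely of blanks looks like a float column. It uses Python’s own float('nan'), which is the same kind of float object as np.nan, and a hypothetical sample of blank cells; it is a toy stand-in for the idea, not Dask’s actual inference code.

```python
import math

# float('nan') is an ordinary Python float, just like np.nan.
nan = float('nan')

print(type(nan))        # the built-in float type
print(math.isnan(nan))

# Hypothetical sample drawn from a text column where every sampled cell is blank.
sample = ['', '', '', '']

# Blank cells are replaced by NaN while parsing.
parsed = [nan if cell == '' else cell for cell in sample]

# With only blanks in the sample, float is the only type that was observed.
inferred_types = {type(value).__name__ for value in parsed}
print(inferred_types)
```

If even one non-blank description had landed in the sample, the set of observed types would have included str as well, which is the kind of evidence that pushes a column toward the object datatype.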
4.1.2 Creating Schemas for Dask DataFrames
Oftentimes when working with a dataset, you’ll know each column’s datatype, whether it can contain missing values, and its valid range of values ahead of time. This information is collectively known as the data’s schema. You’re especially likely to know the schema for a dataset if it came from a relational database. Each column in a database table must have a well-known datatype. If you have this information ahead of time, using it with Dask is as easy as writing up the schema and applying it to the read_csv method. You’ll see how to do that at the end of this section. However, sometimes you might not know what the schema is ahead of time, and you’ll need to figure it out on your own. Perhaps you’re pulling data from a web API which hasn’t been properly documented, or you’re analyzing a data extract and you don’t have access to the data source. Neither of these approaches is ideal because they can be tedious and time consuming, but sometimes you may really have no other option. Here are two methods you can try:
- Guess-and-check
- Manually sample the data
The guess-and-check method isn’t complicated. If you have well named columns, such as “Product Description”, “Sales Amount”, etc., you can try to infer what kind of data each column contains using the names. If you run into a datatype error while running a computation like the ones we’ve seen, simply update the schema and start over again. The advantage of this method is that you can quickly and easily try different schemas, but the downside is that it may become tedious to constantly restart your computations if they continue to fail due to datatype issues.
The manual sampling method aims to be a bit more sophisticated but can take more time up front since it involves scanning through some of the data to profile it. However, if you’re planning to analyze the dataset anyway, it’s not “wasted” time in the sense that you will be familiarizing yourself with the data while creating the schema. Let’s look at how we can do this:
Listing 4.6. Building a generic schema
# np.str was only an alias for the builtin str and was removed in NumPy 1.24,
# so the builtin str is used directly
>>> dtype_tuples = [(x, str) for x in common_columns]
>>> dtypes = dict(dtype_tuples)
>>> dtypes
# Displays the following output:
{'Date First Observed': str,
 'Days Parking In Effect ': str,
 'Double Parking Violation': str,
 'Feet From Curb': str,
 'From Hours In Effect': str,
 ...
}
First we need to build a dictionary that maps column names to datatypes. This must be done because the dtype argument that we’ll feed this object into later expects a dictionary type. To do that, in Listing 4.6, we first walk through the common_columns list that we made earlier to hold all of the column names that can be found in all four DataFrames. We transform each column name into a tuple containing the column name and the string datatype, str (which older NumPy releases also exposed under the alias np.str). On the second line, we take the list of tuples and convert them into a dict, the partial contents of which are displayed. Finally, we can pass the dict to the read_csv function to apply the schema to the DataFrame.
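As a side note, the list-of-tuples step is one of several equivalent ways to build such a dictionary in Python. The sketch below uses a hypothetical three-column subset and also shows how a generic all-strings schema can later be refined one column at a time.

```python
# Hypothetical subset of the common column names.
columns = ['Plate ID', 'Vehicle Year', 'Issue Date']

# Three equivalent ways to map every column name to the str datatype.
from_tuples = dict([(name, str) for name in columns])
from_comprehension = {name: str for name in columns}
from_keys = dict.fromkeys(columns, str)

# Start from the generic schema, then override individual columns as you
# learn more about them.
schema = dict(from_comprehension)
schema['Vehicle Year'] = float

print(schema)
```

Copying the generic schema before overriding entries keeps the all-strings version available as a fallback if a refined datatype turns out to be wrong.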
Listing 4.7. Creating a DataFrame with an explicit schema
>>> fy14 = dd.read_csv('nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2014__August_2013___June_2014_.csv', dtype=dtypes)
>>> with ProgressBar():
...     fy14[common_columns].head()
Listing 4.7 looks largely the same as the first time we read in the 2014 data file. However, this time we specified the dtype argument and passed in our schema dictionary. What happens under the hood is that Dask will disable type inference for the columns that have matching keys in the dtype dictionary and use the explicitly specified types instead. While it’s perfectly reasonable to only include the columns you want to change, it’s best to not rely on Dask’s type inference at all whenever possible. Here I’ve shown you how to create an explicit schema for all columns in a DataFrame, and I encourage you to make this a regular practice when working with big datasets. With this particular schema, we’re telling Dask to just assume that all of the columns are strings. Now if we try to view the first five rows of the DataFrame again, using fy14[common_columns].head(), Dask doesn’t throw an error message! But we’re not done yet. We now need to have a look at each column and pick a more appropriate datatype (if possible) to maximize efficiency. Let’s have a look at the Vehicle Year column:
Listing 4.8. Inspecting the Vehicle Year column
>>> with ProgressBar():
...     fy14['Vehicle Year'].unique().head(10)
# Produces the following output:
0    2013
1    2012
2       0
3    2010
4    2011
5    2001
6    2005
7    1998
8    1995
9    2003
Name: Vehicle Year, dtype: object
In Listing 4.8, we’re simply looking at 10 of the unique values contained in the Vehicle Year column. It looks like they are all integers that would fit comfortably in the uint16 datatype. uint16 is the most appropriate because years can’t be negative values, and these years are too large to be stored in uint8 (which has a maximum size of 255). If we had seen any letters or special characters, we would not need to proceed any further with analyzing this column. The string datatype we had already selected would be the only datatype suitable for the column.
One thing to be careful about is that a sample of 10 unique values might not be a large enough sample to determine that there aren’t any edge cases you need to consider. You could use .compute() instead of .head() to bring back all of the unique values, but this might not be a good idea if the particular column you’re looking at has a high degree of uniqueness to it (such as a primary key or a high-dimensional category). The range of 10-50 unique samples has served me well in most cases, but sometimes you will still run into edge cases where you will need to go back and tweak your datatypes.
Since we’re thinking an integer datatype might be appropriate for this column, we need to check one more thing: whether there are any missing values in this column. As you learned earlier, Dask represents missing values with np.nan, which is considered to be a float type object. Unfortunately, np.nan cannot be cast or coerced to an integer datatype. In the next chapter we will learn how to deal with missing values, but for now, if we come across a column with missing values, we will need to ensure that the column uses a datatype that can support the np.nan object. This means that if the Vehicle Year column contains any missing values, we’ll be required to use a float32 datatype and not the uint16 datatype we originally thought appropriate, because uint16 is unable to store np.nan.
Listing 4.9. Checking the Vehicle Year column for missing values
>>> with ProgressBar():
...     fy14['Vehicle Year'].isnull().values.any().compute()
# Produces the following output:
True
In Listing 4.9, we’re using the isnull method, which checks each value in the specified column for existence of np.nan. It returns True if np.nan is found and False if it’s not, and then aggregates the checks for all rows into a Boolean Series. Chaining with .values.any() reduces the Boolean Series to a single True if at least one row is True, and False if no rows are True. This means that if the code in Listing 4.9 returns True, at least one row in the Vehicle Year column is missing. If it returned False, it would indicate that no rows in the Vehicle Year column are missing data. Since we have missing values in the Vehicle Year column, we must use the float32 datatype for the column instead of uint16.
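The decision made for Vehicle Year follows a small rule that can be sketched in plain Python: letters force a string type, clean integers allow an integer type, and integers with gaps need a float type so NaN can be stored. The suggest_dtype helper and the sample values below are hypothetical and operate on raw text cells rather than on a Dask column.

```python
def suggest_dtype(values):
    """Suggest 'int', 'float', or 'str' for a column of raw text cells.

    Blank strings are treated as missing values.
    """
    present = [value for value in values if value != '']
    has_missing = len(present) < len(values)
    try:
        for value in present:
            int(value)
    except ValueError:
        # At least one cell is not an integer literal, so keep strings.
        return 'str'
    # Integers with gaps need a float type so NaN can be stored.
    return 'float' if has_missing else 'int'


# NaN cannot be converted to an integer, which is why gaps rule out int types.
try:
    int(float('nan'))
    nan_converts = True
except ValueError:
    nan_converts = False

print(suggest_dtype(['2013', '2012', '0']))
print(suggest_dtype(['2013', '', '2010']))
print(suggest_dtype(['2013', 'N/A']))
print(nan_converts)
```

The failed conversion mirrors the “cannot convert float NaN to integer” message in the Dask error output shown earlier for the Issuer Squad column.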
Now, we should repeat the process for the remaining 42 columns. For brevity’s sake, I’ve gone ahead and done this for you. In this particular instance, we could also use the data dictionary posted on the Kaggle webpage (at https://www.kaggle.com/new-york-city/nyc-parking-tickets/data) to help speed along this process.
Listing 4.10. The final schema for the NYC Parking Ticket Data
>>> import numpy as np
# The builtin str replaces np.str, an alias that was removed in NumPy 1.24
>>> dtypes = {
...     'Date First Observed': str,
...     'Days Parking In Effect ': str,
...     'Double Parking Violation': str,
...     'Feet From Curb': np.float32,
...     'From Hours In Effect': str,
...     'House Number': str,
...     'Hydrant Violation': str,
...     'Intersecting Street': str,
...     'Issue Date': str,
...     'Issuer Code': np.float32,
...     'Issuer Command': str,
...     'Issuer Precinct': np.float32,
...     'Issuer Squad': str,
...     'Issuing Agency': str,
...     'Law Section': np.float32,
...     'Meter Number': str,
...     'No Standing or Stopping Violation': str,
...     'Plate ID': str,
...     'Plate Type': str,
...     'Registration State': str,
...     'Street Code1': np.uint32,
...     'Street Code2': np.uint32,
...     'Street Code3': np.uint32,
...     'Street Name': str,
...     'Sub Division': str,
...     'Summons Number': np.uint32,
...     'Time First Observed': str,
...     'To Hours In Effect': str,
...     'Unregistered Vehicle?': str,
...     'Vehicle Body Type': str,
...     'Vehicle Color': str,
...     'Vehicle Expiration Date': str,
...     'Vehicle Make': str,
...     'Vehicle Year': np.float32,
...     'Violation Code': np.uint16,
...     'Violation County': str,
...     'Violation Description': str,
...     'Violation In Front Of Or Opposite': str,
...     'Violation Legal Code': str,
...     'Violation Location': str,
...     'Violation Post Code': str,
...     'Violation Precinct': np.float32,
...     'Violation Time': str
... }
Listing 4.10 contains the final schema for the NYC Parking Ticket data. Let’s use it to reload all four of the DataFrames, then union all four years of data together into a final DataFrame.
Listing 4.11. Applying the schema to all four DataFrames
>>> data = dd.read_csv('nyc-parking-tickets/*.csv', dtype=dtypes, usecols=common_columns)
In Listing 4.11 we reload the data and apply the schema we created. Notice that instead of loading four separate files into four separate DataFrames, we’re now loading all CSV files contained in the nyc-parking-tickets folder into a single DataFrame by using the * wildcard. Dask provides this for convenience since it’s common to split large datasets into multiple files, especially on distributed filesystems. As before, we’re passing the final schema into the dtype argument, and we’re now also passing the list of columns we want to keep into the usecols argument. usecols takes a list of column names and drops any columns from the resulting DataFrame that aren’t specified in the list. Since we only care about analyzing the data we have available for all four years, we’ll choose to simply ignore the columns that aren’t shared across all four years.
usecols is an interesting argument because if you look at the Dask API documentation, it’s not listed. It might not be immediately obvious why this is, but it’s because the argument comes from Pandas. Since each partition of a Dask DataFrame is a Pandas DataFrame, you can pass along any Pandas arguments through the *args and **kwargs interfaces and they will control the underlying Pandas DataFrames that make up each partition. This interface is also how you can control things like which column delimiter should be used, whether the data has a header or not, etc. The Pandas API documentation for read_csv and its many arguments can be found at: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
We’ve now read in the data and we are ready to clean and analyze this DataFrame. If you count the rows, we have over 42.3 million parking violations to explore! However, before we get into that, we will look at interfacing with a few other storage systems as well as writing data. We’ll now look at reading data from relational database systems.
4.2 Reading Data from Relational Databases
Reading data from a relational database system (RDBMS) into Dask is fairly easy. In fact, you’re likely to find that the most tedious part of interfacing with RDBMSs is setting up and configuring your Dask environment to do so. Because of the wide variety of RDBMSs used in production environments, we can’t cover the specifics for each one here. But there is a substantial amount of documentation and support available online for the specific RDBMS you’re working with. The most important thing to be aware of is that, when using Dask in a multi-node cluster, your client machine is not the only machine that will need access to the database. Each worker node needs to be able to access the database server, so it’s important to install the correct software and configure each node in the cluster to be able to do so.
Dask uses the SQLAlchemy library to interface with RDBMSs, and I recommend using the pyodbc library to manage your ODBC drivers. This means you will need to install and configure SQLAlchemy, pyodbc, and the ODBC drivers for your specific RDBMS on each machine in your cluster for Dask to work correctly. To learn more about SQLAlchemy, you can check out https://www.sqlalchemy.org/library.html. Likewise, you can learn more about pyodbc at https://github.com/mkleehammer/pyodbc/wiki.
Listing 4.12. Reading a SQL table into a Dask DataFrame
>>> username = 'jesse'
>>> password = 'DataScienceRulez'
>>> hostname = 'localhost'
>>> database_name = 'DSAS'
>>> odbc_driver = 'ODBC+Driver+13+for+SQL+Server'
>>> connection_string = 'mssql+pyodbc://{0}:{1}@{2}/{3}?driver={4}'.format(username, password, hostname, database_name, odbc_driver)
>>> data = dd.read_sql_table('violations', connection_string, index_col='Summons Number')
In Listing 4.12, we first set up a connection to our database server by building a connection string. For this particular example, I’m using SQL Server on Linux from the official SQL Server Docker container on a Mac. Your connection string might look different based on the database server and operating system you’re running on. The last line demonstrates how to use the read_sql_table function to connect to the database and create the DataFrame. The first argument is the name of the database table you want to query, the second argument is the connection string, and the third argument is the column to use as the DataFrame’s index. These are the three required arguments for this function to work. However, there are a few important assumptions that you should be aware of.
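One practical detail when assembling a connection string by hand: characters such as @ or / in a username or password must be URL-encoded, otherwise they are mistaken for the separators of the URL itself. The sketch below wraps the string formatting in a hypothetical build_mssql_url helper and uses a made-up password; only urllib.parse.quote_plus comes from the standard library.

```python
from urllib.parse import quote_plus


def build_mssql_url(username, password, hostname, database, driver):
    """Assemble an mssql+pyodbc connection URL (hypothetical helper).

    The credentials and the driver name are URL-encoded so that special
    characters are not confused with the URL's own separators.
    """
    return 'mssql+pyodbc://{0}:{1}@{2}/{3}?driver={4}'.format(
        quote_plus(username),
        quote_plus(password),
        hostname,
        database,
        quote_plus(driver),
    )


# Made-up password containing characters that would otherwise break the URL.
url = build_mssql_url('jesse', 'p@ss/word', 'localhost', 'DSAS',
                      'ODBC Driver 13 for SQL Server')
print(url)
```

Encoding the driver name this way also produces the plus-separated form used for odbc_driver in Listing 4.12, so the driver can be written with ordinary spaces.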
Vtajr, crgnoiecnn sdtaaeypt, ped mihtg think curr Gcea axur apatdyte imoatifnrno drciyelt lktm brk sabteaad sreevr icesn orq teadabsa cpz s ideendf cmaseh rdayeal. Jtdeans, Gxzz masepls orb rpzc hns nefrsi aeyptastd idrz vxjf jr avuk wbnv edagnri c iedtldmei orxr ljfx. Hvewroe, Gecz aesiylutqenl dasre rxg rfits jklx xatw xltm yvr lbeta tdnasei vl roydalnm pgmnsial bssr sascor xyr tesdtaa. Tucaees adaetbass nedeid kqkc c vwff-fdindee hscema, Kvcc’z xurh eceinfren cj mpys mtkk lebialre nuwk eidagrn pzrz ltme zn TKAWS sursve s diletmeid ervr lvfj. Heovrew, rj’z stlli rkn retpfce. Cusecea lk kgr pws rzzp higmt xg erdost, obbk esacs can akkm gy rrsg euacs Gaso re csehoo ccrirtoen asdpyetta. Ztv eepmxal, c tingsr mculno ghmit kzdx ozmv ewct hewer bvr ngrstsi cnionat fnvb ebrmnus (“1456”, “2986”, srv.) Jl vdr rycc ja sdetro nj gpsc z cpw rdrc vpfn ehset rmunice-vxfj snrtsgi aeaprp jn oqr elpmas Qzzo takse dxnw infrginer setpatyda, rj dmc yrleitncocr asmuse orp lounmc husdlo kd nc rentige yaeatdtp netasid le c gnsitr apdeytta. Jn etseh nasutisiot, xbb cmp tisll zxyx rk pe mxao laumna ahmsec nkgwteia sa bpk anelerd jn xrq rfcc etcnosi.
The second assumption is how the data should be partitioned. If the index_col (currently set to 'Summons Number') is a numeric or date/time datatype, Dask will automatically infer boundaries and partition the data based on a 256-megabyte block size (which is larger than read_csv's 64-megabyte block size). However, if the index_col is not a numeric or date/time datatype, you must either specify the number of partitions or the boundaries to partition the data by.
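The arithmetic behind automatic partitioning on a numeric index can be sketched roughly as follows. This is an assumption-laden simplification rather than Dask's actual implementation: the function estimate_divisions is hypothetical, and it assumes the total table size and the minimum and maximum index values are already known.

```python
import math


def estimate_divisions(index_min, index_max, total_bytes, block_bytes=256 * 2**20):
    """Split [index_min, index_max] into evenly spaced boundaries,
    aiming for roughly block_bytes of data per partition."""
    npartitions = max(1, math.ceil(total_bytes / block_bytes))
    step = (index_max - index_min) / npartitions
    boundaries = [index_min + i * step for i in range(npartitions)]
    boundaries.append(index_max)
    return boundaries


# A hypothetical 1 GiB table whose index runs from 0 to 1000.
divisions = estimate_divisions(0, 1000, total_bytes=1024 * 2**20)
print(len(divisions) - 1)   # number of partitions
print(divisions)            # partition boundaries
```

Evenly spaced boundaries only give evenly sized partitions when the index values are spread fairly uniformly, which is one reason a string index needs explicit guidance.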
Listing 4.13. Even partitioning on a non-numeric or date/time index
>>> data = dd.read_sql_table('violations', connection_string, index_col='Vehicle Color', npartitions=200)
In Listing 4.13, we chose to index the DataFrame by the Vehicle Color column, which is a string column. Therefore, we have to specify how the DataFrame should be partitioned. Here, using the npartitions argument, we are telling Dask to split the DataFrame into 200 even-sized pieces. Alternatively, we can manually specify boundaries for the partitions.
Listing 4.14. Custom partitioning on a non-numeric or date/time index
>>> partition_boundaries = sorted(['Red', 'Blue', 'White', 'Black', 'Silver', 'Yellow'])
>>> data = dd.read_sql_table('violations', connection_string, index_col='Vehicle Color', divisions=partition_boundaries)
Listing 4.14 shows how to manually define partition boundaries. The important thing to note about this is that Dask uses these boundaries as an alphabetically sorted half-closed interval. This means that you won't have partitions that only contain the color defined by their boundary. For example, because green is alphabetically between blue and red, green cars will fall into the red partition. The "red partition" is actually all colors that are alphabetically greater than blue and alphabetically less than or equal to red. This isn't really intuitive at first and can take some getting used to.
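The interval behavior can be explored without a database. The sketch below uses the standard library's bisect module and follows the convention described in the prose (greater than the lower boundary, less than or equal to the upper one). Which end of the interval is closed is taken from the text and may differ between Dask versions; partition_for is a hypothetical helper.

```python
from bisect import bisect_left

partition_boundaries = sorted(['Red', 'Blue', 'White', 'Black', 'Silver', 'Yellow'])


def partition_for(color, boundaries):
    """Return the (lower, upper) boundaries with lower < color <= upper,
    or None if the color falls outside every interval."""
    i = bisect_left(boundaries, color)
    if i == 0:
        # Only the very first boundary value itself is in range here.
        return (boundaries[0], boundaries[1]) if color == boundaries[0] else None
    if i == len(boundaries):
        return None   # sorts after the last boundary
    return (boundaries[i - 1], boundaries[i])


for color in ['Green', 'Red', 'Silver', 'Amber']:
    print(color, partition_for(color, partition_boundaries))
```

Note that a color sorting before the first boundary or after the last one lands in no interval at all, so choose the outermost boundaries to cover the full range of values you expect.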
The third assumption that Dask makes when you only pass the minimum required parameters is that you want to select all columns from the table. You can limit the columns you get back using the columns argument, which behaves similarly to the usecols argument in read_csv. While you are allowed to use SQLAlchemy expressions in the argument, I recommend that you avoid offloading any computations to the database server, since you lose the advantages of parallelizing that computation that Dask gives you.
Listing 4.15. Selecting a subset of columns
>>> column_filter = ['Summons Number', 'Plate ID', 'Vehicle Color']
# Equivalent to:
# SELECT [Summons Number], [Plate ID], [Vehicle Color] FROM dbo.violations
>>> data = dd.read_sql_table('violations', connection_string, index_col='Summons Number', columns=column_filter)
Listing 4.15 shows how to add a column filter to the connection query. Here we've created a list of column names that exist in the table, then we pass them to the columns argument. You can use the column filter even if you are querying a view instead of a table.
The fourth and final assumption made by providing the minimum arguments is the schema selection. When I say "schema" here, I'm not referring to the datatypes used by the DataFrame; I'm referring to the database schema object that RDBMSs use to group tables into logical clusters (such as dim/fact in a data warehouse or sales, hr, etc. in a transactional database). If you don't provide a schema, the database driver will use the default for the platform. For SQL Server, this results in Dask looking for the violations table in the dbo schema. If we had put the table in a different schema, perhaps one called chapterFour, we would receive a "table not found" error.
Listing 4.16. Specifying a database schema
# Equivalent to:
# SELECT * FROM chapterFour.violations
>>> data = dd.read_sql_table('violations', connection_string, index_col='Summons Number', schema='chapterFour')
Listing 4.16 shows you how to select a specific schema from Dask. Passing the schema name into the schema argument will cause Dask to use the provided database schema rather than the default.
Like read_csv, Dask allows you to forward along arguments to the underlying calls to the Pandas read_sql function being used at the partition level to create the Pandas DataFrames. We've covered all of the most important functions here, but if you need an extra degree of customization, have a look at the API documentation for the Pandas read_sql function. All of its arguments can be manipulated using the *args and **kwargs interfaces provided by Dask DataFrames. Now we'll look at how Dask deals with distributed filesystems.
4.3 Reading Data from HDFS and S3
While it's very likely that many datasets you'll come across throughout your work will be stored in relational databases, powerful alternatives are rapidly growing in popularity. Most notable are the developments in distributed filesystem technologies from 2006 onward. Powered by technologies like Apache Hadoop and Amazon's Simple Storage Service (or S3 for short), distributed filesystems bring the same benefits to file storage that distributed computing brings to data processing: increased throughput, scalability, and robustness. Using a distributed computing framework alongside a distributed filesystem technology is a harmonious combination: in the most advanced distributed filesystems, such as the Hadoop Distributed File System (HDFS), nodes are aware of data locality, allowing computations to be shipped to the data rather than the data shipped to the compute resources. This saves a lot of time and back-and-forth communication over the network.
Figure 4.6. Running a distributed computation without a distributed filesystem

In Figure 4.6, you can see the inefficiencies that arise when working with non-distributed data. There is a significant bottleneck caused by the need to chunk up and ship data to the other nodes in the cluster. Under this configuration, when Dask reads in the data it will partition the DataFrame as per usual, but the other worker nodes can't do any work until a partition of data is sent to them. Because it takes some time to transfer these 64-megabyte chunks over the network, the total computation time will be increased by the time it takes to ship data back and forth between the node that has the data and the other workers. This becomes even more problematic if the size of the cluster grows by any significant amount. If we had several hundred (or more) worker nodes vying for chunks of data all at once, the networking stack on the data node could easily get saturated with requests and slow to a crawl. Both of these problems can be mitigated by using a distributed filesystem.
Figure 4.7. Running a distributed computation on a distributed filesystem

Figure 4.7 paints a different picture. Instead of creating a bottleneck by holding data on only one node, the distributed filesystem chunks up data ahead of time and spreads it across multiple machines. It's standard practice in many distributed filesystems to store redundant copies of chunks/partitions both for reliability and performance. From the perspective of reliability, storing each partition in triplicate (which is a common default configuration) means that all three machines holding a copy would have to fail before any data loss occurs. The probability of multiple machines failing in a short amount of time is much lower than the probability of one machine failing, so it adds an extra layer of safety at a nominal cost of additional storage. From a performance perspective, spreading the data out across the cluster makes it more likely that a node containing the data will be available to run a computation when requested. Or, in the event that all worker nodes that hold that partition are already busy, one of them can ship the data to another worker node. In this case, spreading out the data avoids any single node getting saturated by requests for data. If one node is busy serving up a bunch of data, it can offload some of those requests to other nodes that hold the requested data.
Figure 4.8. Shipping computations to the data

In Figure 4.8, you can see an example of why data-local distributed filesystems are even more advantageous. The node controlling the orchestration of the distributed computation (called the driver) knows that the data it wants to process is available in a few locations because the distributed filesystem maintains a catalogue of the data held within the system. It will first ask the machines that have the data locally whether they're busy or not. If one of the nodes is not busy, the driver will instruct the worker node to perform the computation. If all the nodes are busy, the driver can either choose to wait until one of the worker nodes is free, or it can instruct another free worker node to get the data remotely and run the computation.
HDFS and S3 are two of the most popular distributed filesystems, but they have one key difference for our purposes: HDFS is designed to allow computations to run on the same nodes that serve up data, and S3 is not. Amazon designed S3 as a web service dedicated solely to file storage and retrieval. There's absolutely no way to execute application code on S3 servers. This means that when you work with data stored in S3, you will always have to transmit partitions from S3 to a Dask worker node in order to process it. Let's now take a look at how we can use Dask to read data from these systems.
Listing 4.17. Reading data from HDFS
>>> data = dd.read_csv('hdfs://localhost/nyc-parking-tickets/*.csv', dtype=dtypes, usecols=common_columns)
In Listing 4.17, we have a read_csv call that should look very familiar by now. In fact, the only thing that's changed is the file path. Prefixing the file path with hdfs:// tells Dask to look for the files on an HDFS cluster instead of the local filesystem, and localhost indicates that Dask should query the local HDFS NameNode for information on the whereabouts of the file.
All of the arguments for read_csv that you learned before can still be used here. In this way, Dask makes it extremely easy to work with HDFS. The only additional requirement is that you install the hdfs3 library on each of your Dask workers. This library allows Dask to communicate with HDFS, and therefore this functionality won't work if you haven't installed the package. You can simply install the package with pip or conda (hdfs3 is on the conda-forge channel).
Listing 4.18. Reading data from S3
>>> data = dd.read_csv('s3://my-bucket/nyc-parking-tickets/*.csv', dtype=dtypes, usecols=common_columns)
In Listing 4.18, our read_csv call is (again) almost exactly the same as Listing 4.17. This time, however, we've prefixed the file path with s3:// to tell Dask that the data is located on an S3 filesystem, and my-bucket lets Dask know to look for the files in the S3 bucket associated with your AWS account named "my-bucket".
In order to use the S3 functionality, you must have the s3fs library installed on each Dask worker. Like hdfs3, this library can be installed simply through pip or conda (from the conda-forge channel). The final requirement is that each Dask worker is properly configured for authenticating with S3. s3fs uses the boto library to communicate with S3. You can learn more about configuring boto at: http://boto.cloudhackers.com/en/latest/getting_started.html. The most common S3 authentication configuration consists of using the AWS Access Key and AWS Secret Access Key. Rather than injecting these keys in your code, it's a better idea to set these values using environment variables or a configuration file. Boto will check both the environment variables and the default configuration paths automatically, so there's no need to pass authentication credentials directly to Dask. Otherwise, as with using HDFS, the call to read_csv allows you to do all the same things as if you were operating on a local filesystem. Dask really makes it easy to work with distributed filesystems!
Now that you have some experience working with a few different storage systems, we'll round out the "reading data" part of this chapter by talking about a special file format that is very useful for fast computations.
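Because missing credentials tend to surface only when a worker first touches S3, a quick pre-flight check can save some debugging. The following is a small sketch that looks for the two standard AWS environment variables; the helper missing_aws_variables is hypothetical, and a missing variable is not necessarily an error, since credentials may come from a configuration file instead.

```python
import os

# The two standard environment variable names for AWS key-based authentication.
REQUIRED_AWS_VARIABLES = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')


def missing_aws_variables(environ=os.environ):
    """Return the names of credential variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARIABLES if not environ.get(name)]


missing = missing_aws_variables()
if missing:
    print('Not set in this environment:', ', '.join(missing))
else:
    print('Both AWS credential variables are set.')
```

Running the same check on every worker, not just on the machine that submits the job, matches the requirement that each worker be able to authenticate on its own.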
4.4 Reading Data in Parquet Format
CSV and other delimited text files are great for their simplicity and portability, but they aren't really optimized for the best performance, especially when performing complex data operations such as sorts, merges, and aggregations. While there are a wide variety of file formats that attempt to increase efficiency in many different ways, with mixed results, one of the more recent high-profile file formats is Apache Parquet. Parquet is a high-performance columnar storage format jointly developed by Twitter and Cloudera that was designed with use on distributed filesystems in mind. Its design brings several key advantages to the table over text-based formats: more efficient use of IO, better compression, and strict typing.
Figure 4.9. The structure of Parquet compared with delimited text files

Figure 4.9 shows the difference in how data is stored in Parquet format versus a row-oriented storage scheme like CSV. With row-oriented formats, values are stored on disk and in memory sequentially based on the row position of the data. Consider what we'd have to do if we wanted to perform an aggregate function over a single column (call it x), such as finding the mean. In order to collect all the values of x, we'd have to scan over 10 values in order to get the 4 values we want. This means we spend more time waiting for IO completion just to throw away over half of the values read from disk. Compare that with the columnar format: in that format, we'd simply grab the sequential chunk of x values and have all four values we want. This seeking operation is much faster and more efficient.
Another significant advantage of applying column-oriented chunking of the data is that the data can now be partitioned and distributed by column. This leads to much faster and more efficient shuffle operations, since only the columns that are necessary for an operation need to be transmitted over the network instead of entire rows.
Finally, efficient compression is also a major advantage of Parquet. With column-oriented data, it's possible to apply different compression schemes to individual columns so the data becomes compressed in the most efficient way possible. Python's Parquet library supports many of the popular compression algorithms such as gzip, lzo, and snappy.
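The scan counts quoted above can be reproduced with a small in-memory model. This sketch assumes a hypothetical table of four rows and three columns, with the column of interest stored first; the values are invented, and plain Python lists stand in for bytes on disk.

```python
# Hypothetical table: 4 rows, 3 columns; the first column is the one we aggregate.
rows = [
    (1.0, 'a', True),
    (2.0, 'b', False),
    (3.0, 'c', True),
    (4.0, 'd', False),
]
n_rows, n_cols = len(rows), len(rows[0])

# Row-oriented layout: all values of row 0, then all values of row 1, and so on.
row_layout = [value for row in rows for value in row]
# Column-oriented layout: all values of column 0, then column 1, and so on.
column_layout = [row[col] for col in range(n_cols) for row in rows]

# Row layout: the wanted values sit at every n_cols-th position, so a
# sequential read has to pass over everything up to the last one.
wanted_positions = [i * n_cols for i in range(n_rows)]
values_scanned_row = wanted_positions[-1] + 1
x_from_rows = [row_layout[p] for p in wanted_positions]

# Column layout: the wanted values form one contiguous chunk at the front.
x_from_columns = column_layout[:n_rows]
values_scanned_column = len(x_from_columns)

mean_x = sum(x_from_columns) / n_rows
print(values_scanned_row, values_scanned_column, mean_x)
```

Both layouts yield the same four values; the difference is only in how much unrelated data has to be read along the way.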
To use Parquet with Dask, you need to make sure you have the fastparquet or pyarrow library installed, both of which can be installed either via pip or conda (conda-forge). I would generally recommend using pyarrow over fastparquet, as it has better support for serializing complex nested data structures. You can also install the compression libraries you want to use, such as python-snappy or python-lzo, which are also available via pip or conda (conda-forge). Now let's take a look at reading the NYC Parking Ticket dataset one more time in Parquet format.
Listing 4.19. Reading in Parquet data
>>> data = dd.read_parquet('nyc-parking-tickets-prq')
Listing 4.19 is about as simple as it gets! The read_parquet method is used to create a Dask DataFrame from one or more Parquet files, and the only required argument is the path. One thing to notice about this call that might look strange: nyc-parking-tickets-prq is a directory, not a file. That's because datasets stored as Parquet are typically written to disk pre-partitioned, resulting in potentially hundreds or thousands of individual files. Dask provides this method for convenience so you don't have to manually create a long list of filenames to pass in. You can specify a single Parquet file in the path if you want to, but it's much more typical to see Parquet datasets referenced as a directory of files rather than individual files.
Listing 4.20. Reading Parquet files from distributed filesystems
>>> data = dd.read_parquet('hdfs://localhost/nyc-parking-tickets-prq')
# OR
>>> data = dd.read_parquet('s3://my-bucket/nyc-parking-tickets-prq')
Listing 4.20 shows how to read Parquet from distributed filesystems. Just as with delimited text files, the only difference is specifying a distributed filesystem protocol, such as hdfs or s3, and specifying the relevant path to the data.
Parquet is stored with a pre-defined schema, so there are no options to mess with datatypes. The only real relevant options that Dask gives you to control importing Parquet data are column filters and index selection. These work the same way as with the other file formats. By default, they will be inferred from the schema stored alongside the data, but you can override that selection by manually passing in values to the relevant arguments.
Listing 4.21. Specifying Parquet read options
>>> columns = ['Summons Number', 'Plate ID', 'Vehicle Color']
>>> data = dd.read_parquet('nyc-parking-tickets-prq', columns=columns, index='Plate ID')
In Listing 4.21, we pick a few columns that we want to read from the dataset and put them in a list called columns. We then pass in the list to the columns argument, and we specify "Plate ID" to be used as the index by passing it in to the index argument. The result of this will be a Dask DataFrame only containing the three columns shown above and sorted/indexed by the "Plate ID" column.
We've now covered a number of ways to get data into Dask from a myriad of systems and formats. As you can see, the DataFrame API offers a great deal of flexibility to ingest structured data in fairly simple ways. In the next chapter, we'll cover fundamental data transformations and, naturally, writing data back out in a number of different ways.
4.5 Summary
In this chapter you learned
- How to create DataFrames from raw data stored in a variety of formats (including CSV and Parquet) on both local disk and distributed filesystems (HDFS, S3)
- How to define the schema for a dataset and use it to parse the data into a DataFrame
- How to extract data from a SQL relational database and manipulate it using Dask