3 Introducing Dask DataFrames

This chapter covers

  • Why Dask DataFrames are useful for analyzing structured data
  • The basic structure of Dask DataFrames
  • How Dask DataFrames use partitioning to distribute tasks across many workers
  • Limitations of Dask DataFrames

In the previous chapter, we started exploring how Dask uses DAGs to coordinate and manage complex tasks across many machines. However, we only looked at some simple examples using the Delayed API to help illustrate how Dask code relates to elements of a DAG. In this chapter, we’ll begin to take a closer look at the DataFrame API. Dask DataFrames wrap Delayed objects around Pandas DataFrames to allow you to operate on more sophisticated data structures.

Figure 3.1. The components and layers that make up Dask

Rather than writing your own complex web of functions, the DataFrame API contains a whole host of complex transformation methods, such as Cartesian products, joins, and grouping operations, that are useful for common data manipulation tasks. Before we cover those operations in depth, which we will do in Chapter 5, we’ll start by exploring how DataFrames are structured, and we’ll cover how Dask distributes DataFrame partitions across a cluster of servers. Also, I’ll point out some performance pitfalls and best practices throughout the chapter to consider when using DataFrames.

3.1   Why Use DataFrames?

The shape of data found “in the wild” is usually described in one of two ways: structured or unstructured. Structured data is made up of rows and columns: from the humble spreadsheet to complex relational database systems, structured data is an intuitive way to store information.

Figure 3.2. An example of structured data

It’s natural to gravitate toward this format when thinking about data because the structure helps keep related bits of information together in the same visual space. A row represents a logical entity: in the above spreadsheet, each row represents a person. Rows are made up of one or more columns, which represent things we know about each entity. In the above spreadsheet, we’ve captured each person’s last name, first name, date of birth, and a unique identifier. Many kinds of data can be fit into this shape: transactional data from point-of-sale systems, results from a marketing survey, clickstream data, and even image data once it’s been specially encoded.

Because of the way that structured data is organized and stored, it’s easy to think of many different ways to manipulate the data. For example, we could find the earliest date of birth in the dataset, filter out people that don’t match a certain pattern, group people together by their last name, or sort people by their first name. Compare that with how the data might look if we stored it in several list objects:

Listing 3.1. A List Representation of Figure 3.2
>>> personIDs = [1,2,3]
>>> personLastNames = ['Smith', 'Williams', 'Williams']
>>> personFirstName = ['John', 'Bill', 'Jane']
>>> personDOBs = ['10/6/82', '7/4/90', '5/6/89']

In the above code listing, each of the columns is stored as a separate list. While it’s still possible to do all of the transformations previously suggested, it’s not immediately evident that the four lists are related to each other and comprise a complete dataset. Furthermore, the code required for operations like grouping and sorting on this data would be quite complex and require a substantial understanding of data structures and algorithms to write code that performs efficiently. While there are many different data structures available in Python that we could use to represent this data, none are as intuitive for storing structured data as the DataFrame.
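
To make the point about parallel lists concrete, here is a minimal sketch of one of the operations mentioned earlier (finding the eldest person for each last name) written against the lists from Listing 3.1 in plain Python. The snake_case variable names and the month/day/two-digit-year date format are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import datetime

# The same data as Listing 3.1, held in four parallel lists
person_ids = [1, 2, 3]
last_names = ['Smith', 'Williams', 'Williams']
first_names = ['John', 'Bill', 'Jane']
dobs = ['10/6/82', '7/4/90', '5/6/89']

# Stitch the four lists back into rows, then bucket the rows by last name
groups = defaultdict(list)
for person_id, last, first, dob in zip(person_ids, last_names, first_names, dobs):
    birth_date = datetime.strptime(dob, '%m/%d/%y')  # assumed date format
    groups[last].append((birth_date, first, person_id))

# The eldest person in each group has the earliest birth date
eldest_by_last_name = {last: min(rows)[1] for last, rows in groups.items()}
print(eldest_by_last_name)
```

Even for three rows, the code has to re-associate the lists by position, parse the dates, and manage the grouping by hand, which is exactly the bookkeeping a DataFrame takes over.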

Like a spreadsheet or a database table, DataFrames are organized into rows and columns. However, there are a few additional terms to be aware of when working with DataFrames: indexes and axes.

Figure 3.3. A Dask representation of the structured data example from Figure 3.2

The example in Figure 3.3 shows a DataFrame representation of the structured data from Figure 3.2. Notice the additional labels on the diagram: rows are referred to as “axis 0” and columns are referred to as “axis 1”. This is important to remember when working with DataFrame operations that reshape the data. DataFrame operations default to working along axis 0, so unless you explicitly specify otherwise, Dask will perform operations row-wise.
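
The difference between the two axes is easiest to see with an aggregation. The sketch below uses a small Pandas DataFrame with made-up numeric columns (`math` and `art` are hypothetical names), on the assumption that Dask DataFrames follow the same axis convention as the Pandas DataFrames they wrap.

```python
import pandas as pd

# Hypothetical numeric data, used only to make the axes easy to see
scores = pd.DataFrame({'math': [90, 80, 70], 'art': [60, 50, 40]})

# axis=0 (the default): work down the rows, leaving one value per column
column_totals = scores.sum(axis=0)

# axis=1: work across the columns, leaving one value per row
row_totals = scores.sum(axis=1)

print(column_totals.to_dict())
print(row_totals.tolist())
```

Calling `scores.sum()` with no argument gives the same result as `axis=0`, which is the default behavior described above.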

The other area highlighted on Figure 3.3 is the index. The index provides an identifier for each row. Ideally, these identifiers should be unique, especially if you plan to use the index as a key to join with another DataFrame. However, Dask does not enforce uniqueness, so you can have duplicate indices if necessary. By default, DataFrames are created with a sequential integer index like the one seen above. If you want to specify your own index, you can set one of the columns in the DataFrame to be used as an index, or you can derive your own Index object and assign it to be the index of the DataFrame. We cover common indexing functions in depth in Chapter 5, but the importance of indices in Dask cannot be overstated: they hold the key to distributing DataFrame workloads across clusters of machines. With that in mind, we’ll now take a look at how indices are used to form partitions.
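
As a small illustration of these index properties, the following sketch uses Pandas directly (the building block of each Dask partition) with the first three people from the example. It shows the default sequential index, promoting a column to the index, and the fact that nothing prevents an index from holding duplicate values.

```python
import pandas as pd

people = pd.DataFrame({'Person ID': [1, 2, 3],
                       'Last Name': ['Smith', 'Williams', 'Williams']})

# Default: a sequential integer index
print(people.index.tolist())

# Promote an existing column to be the index
by_id = people.set_index('Person ID')
print(by_id.index.is_unique)

# An index is allowed to contain duplicates
by_last_name = people.set_index('Last Name')
print(by_last_name.index.is_unique)
```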

3.2   Dask and Pandas

As I mentioned a few times, Pandas is a very popular and powerful framework for analyzing structured data, but its biggest limitation is that it was not designed with scalability in mind. Pandas is exceptionally well suited for handling small structured datasets and is highly optimized to perform fast and efficient operations on data stored in memory. However, as we saw in our hypothetical kitchen scenario in Chapter 1, as the volume of work increases substantially it can be a better choice to hire additional help and spread the tasks across many workers. This is where Dask’s DataFrame API comes in: by providing a wrapper around Pandas that intelligently splits huge data frames into smaller pieces and spreads them across a cluster of workers, operations on huge datasets can be completed much more quickly and robustly.

The different pieces of the DataFrame that Dask oversees are called partitions. Each partition is a relatively small DataFrame that can be dispatched to any worker and maintains its full lineage in case it must be reproduced.

Figure 3.4. Dask allows a single Pandas DataFrame to be worked on in parallel by multiple hosts

In the above figure, you can see the difference between how Pandas would handle the dataset and how Dask would handle the dataset. Using Pandas, the dataset would be loaded into memory and worked on sequentially one row at a time. Dask, on the other hand, can split the data into multiple partitions, allowing the workload to be parallelized. This means if we had a long-running function to apply over the DataFrame, Dask could complete the work more efficiently by spreading the work out over multiple machines. However, it should be noted that the DataFrame above is used only for the sake of example. As mentioned previously, the task scheduler does introduce some overhead into the process, so using Dask to process a DataFrame with only 10 rows would likely not be the fastest solution.

Figure 3.5. Processing data in parallel across several machines

In Figure 3.5, you can see the interaction between the two hosts in more detail. As Node 1 is driving the computation and telling Node 2 what to do, it is currently taking on the role of the task scheduler. Node 1 tells Node 2 to work on Partition 2 while Node 1 works on Partition 1. Each node finishes its processing tasks and sends its part of the result back to the client. The client then assembles the pieces of the results and displays the output.

3.2.1   Managing DataFrame Partitioning

Since partitioning can have such a significant impact on performance, you might be worried that managing partitioning will be a difficult and tedious part of constructing Dask workloads. However, fear not: Dask tries to help you get as much performance as possible without manual tuning by including some sensible defaults and heuristics for creating and managing partitions. For example, when reading in data using the read_csv method of Dask DataFrames, the default partition size is 64 megabytes each (this is also known as the default blocksize). While 64MB might seem quite small given that modern servers tend to have tens of gigabytes of RAM, it is an amount of data that is small enough that it can be quickly transported over the network if necessary, but large enough to minimize the likelihood that a machine will run out of things to do while waiting for the next partition to arrive. Using either the default or a user-specified blocksize, the data will be split into as many partitions as necessary so that each partition is no larger than the blocksize. If you desire to create a DataFrame with a specific number of partitions instead, you can specify that when creating the DataFrame by passing in the npartitions argument.
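
The blocksize rule described above lends itself to a quick back-of-the-envelope calculation. The sketch below is only an estimate based on that rule: `estimate_partitions` is a hypothetical helper rather than a Dask function, it treats 64MB as 64 × 1024 × 1024 bytes, and the reader Dask actually uses may round or balance chunks differently.

```python
import math

def estimate_partitions(file_size_bytes, blocksize_bytes=64 * 1024 * 1024):
    """Smallest number of chunks such that no chunk exceeds the blocksize."""
    return max(1, math.ceil(file_size_bytes / blocksize_bytes))

one_gigabyte = 1024 ** 3
print(estimate_partitions(one_gigabyte))       # a 1 GiB file in 64 MiB blocks
print(estimate_partitions(10 * 1024 * 1024))   # a file smaller than one block
```

A rough estimate like this is mainly useful for judging whether a dataset will produce a handful of partitions or several thousand before you load it.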

Listing 3.2. Creating a DataFrame with a specific number of partitions
>>> import pandas
>>> import dask.dataframe as daskDataFrame

# Create the data as four parallel lists
>>> personIDs = [1,2,3,4,5,6,7,8,9,10]
>>> personLastNames = ['Smith', 'Williams', 'Williams','Jackson','Johnson','Smith','Anderson','Christiansen','Carter','Davidson']
>>> personFirstName = ['John', 'Bill', 'Jane','Cathy','Stuart','James','Felicity','Liam','Nancy','Christina']
>>> personDOBs = ['10/6/82', '7/4/90', '5/6/89','1/24/74','6/5/95','4/16/84','9/15/76','10/2/92','2/5/86','8/11/93']

# Store the data in a Pandas DataFrame
>>> peoplePandasDataFrame = pandas.DataFrame({'Person ID': personIDs,
...               'Last Name': personLastNames,
...               'First Name': personFirstName,
...               'Date of Birth': personDOBs},
...             columns=['Person ID', 'Last Name', 'First Name', 'Date of Birth'])

# Convert the Pandas DataFrame to a Dask DataFrame with two partitions
>>> peopleDaskDataFrame = daskDataFrame.from_pandas(peoplePandasDataFrame, npartitions=2)

In Listing 3.2, we created a Dask DataFrame and explicitly split it into two partitions using the npartitions argument. Normally, Dask would have put this dataset into a single partition because it is quite small.

Listing 3.3. Inspecting partitioning of a Dask DataFrame
# Show the boundaries of the partitioning scheme
>>> peopleDaskDataFrame.divisions
(0, 5, 9)

# Show how many partitions the DataFrame has
>>> peopleDaskDataFrame.npartitions
2

Listing 3.3 shows a couple of useful attributes of Dask DataFrames that can be used to inspect how a DataFrame is partitioned. The first attribute, divisions, returns (0, 5, 9), which shows the boundaries of the partitioning scheme (remember that partitions are created on the index). This might look strange since there are two partitions but three boundaries. Each partition’s boundary consists of a pair of numbers from the list of divisions. The boundary for the first partition is “from 0 up to (but not including) 5”, meaning it will contain rows 0, 1, 2, 3, and 4. The boundary for the second partition is “from 5 through (and including) 9”, meaning it will contain rows 5, 6, 7, 8, and 9. The last partition always includes the upper boundary, whereas the other partitions go up to but don’t include their upper boundary.
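
The boundary rule can be written down as a short lookup. The following sketch is an illustration in plain Python, not Dask’s own code: `partition_for` is a hypothetical helper that takes a divisions tuple and an index value and returns the zero-based position of the partition that holds it, treating only the final boundary as inclusive.

```python
import bisect

def partition_for(divisions, index_value):
    """Return the position of the partition whose range covers index_value."""
    npartitions = len(divisions) - 1
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(index_value)
    # bisect_right counts the boundaries <= index_value; subtract 1 for a 0-based position
    position = bisect.bisect_right(divisions, index_value) - 1
    # The final boundary is inclusive, so it belongs to the last partition
    return min(position, npartitions - 1)

divisions = (0, 5, 9)
print([partition_for(divisions, i) for i in range(10)])
```

Index values 0 through 4 map to the first partition and 5 through 9 map to the second, matching the description of the two boundaries.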

The second attribute, npartitions, simply returns the number of partitions that exist in the DataFrame. Oftentimes in more complex workloads it will become necessary to change a DataFrame’s number of partitions. This normally arises after performing an operation that substantially changes the number of rows in the DataFrame, such as a filtering operation or a grouping operation. It’s also important to note that when performing filtering operations in particular, Dask doesn’t rebalance the DataFrame. Certain filtering operations could result in some partitions that have very few rows while other partitions have all of their original rows. This could potentially degrade performance, so collapsing some of those partitions after filtering could result in better performance.

Listing 3.4. Inspecting the rows in a DataFrame
# Count the number of rows in each partition
>>> peopleDaskDataFrame.map_partitions(len).compute()
0    5
1    5
dtype: int64

Listing 3.4 shows how to use the map_partitions method to count the number of rows in each partition. map_partitions generally applies a given function to each partition. This means that the result of the map_partitions call will be a Series equal in size to the number of partitions the DataFrame currently has. Since we have two partitions in this DataFrame, we get two items back in the result of the call. The output shows that each partition contains 5 rows, meaning Dask split the DataFrame into two equal pieces.

Sometimes it may be necessary to change the number of partitions in a Dask DataFrame. Particularly when your computations include a substantial amount of filtering, the size of each partition can become imbalanced, which can have negative performance consequences on subsequent computations. The reason is that if one partition suddenly contains a majority of the data, all of the advantages of parallelism are effectively lost.

Listing 3.5. Repartitioning a DataFrame
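# Remove everyone whose last name is Williams, then count the rows per partition
>>> people_filtered = peopleDaskDataFrame[peopleDaskDataFrame['Last Name'] != 'Williams']
>>> people_filtered.map_partitions(len).compute()
0    3
1    5
dtype: int64

# Collapse the two partitions into one, then count the rows again
>>> people_filtered_reduced = people_filtered.repartition(npartitions=1)
>>> people_filtered_reduced.map_partitions(len).compute()
0    8
dtype: int64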

In Listing 3.5, we derive a new DataFrame by applying a filter to our original DataFrame that removes all people with a last name of Williams. We then inspect the makeup of the new DataFrame by using the same map_partitions call to count the rows per partition. Notice what happened: the first partition now only contains three rows, and the second partition has the original five. People with the last name of Williams happened to be in the first partition, so our new DataFrame has become rather unbalanced.

The second two lines of code in the listing aim to fix the imbalance by using the repartition method on the filtered DataFrame. The npartitions argument here works the same way as the npartitions argument used earlier when we created the initial DataFrame. Simply specify the number of partitions you want and Dask will figure out what needs to be done to make it so. If you specify a lower number than the current number of partitions, Dask will combine existing partitions by concatenation. If you specify a higher number than the current number of partitions, Dask will split existing partitions into smaller pieces. You can call repartition at any time in your program to initiate this process. However, like all other Dask operations, it’s a lazy computation. No data will actually get moved around until you make a call such as compute, head, etc. Calling the map_partitions function again on the new DataFrame, we can see that the number of partitions has been reduced to one, and it contains all 8 of the rows. Note that if you repartition again, this time increasing the number of partitions, the old divisions (0, 5, 9) will be retained. If you want to split the partitions evenly, you will need to manually update the divisions to match your data.

3.2.2   What is the Shuffle?

Now that we’ve learned that partitioning is important, explored how Dask handles partitioning, and learned what you can do to influence it, we’ll round out this discussion by learning about a frequent challenge that arises in distributed computing: dealing with the shuffle. No, I’m not talking about the dance move; frankly, I wouldn’t be the best source of dance advice! In distributed computing, the shuffle is the process of broadcasting all partitions to all workers. Shuffling the data is necessary when performing sorting, grouping, and indexing operations, because each row needs to be compared to every other row in the entire DataFrame to determine its correct relative position. This is a time-expensive operation because it necessitates transferring large amounts of data over the network. Let’s take a look at an example of what this might look like.

Figure 3.6. A GroupBy operation that requires a shuffle

In Figure 3.6, we’re seeing what would happen with our DataFrame if we want to group our data by Last Name. For example, we might want to find the eldest person by last name. For the majority of the data, it’s no problem. Most of the last names in this dataset are unique. As you can see in the data in Figure 3.5, there are only two cases in which we have multiple people with the same last name: Williams and Smith. For the two people named Williams, they are in the same partition, so Server 1 has all of the information it needs locally to determine that the oldest Williams was born in 1989. However, for the people named Smith, there’s one Smith in partition 1 and one Smith in partition 2. Either Server 1 will have to send its Smith to Server 2 to make the comparison, or Server 2 will have to send Server 1 its Smith. In both cases, for Dask to be able to compare the birthdates of each Smith, one of them will have to be shipped over the network.
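
The reasoning about which groups can be resolved locally can be checked with a few lines of plain Python. This sketch does not use Dask at all: it lays out the last names from Listing 3.2 as they fall into the two partitions and reports which names appear in more than one of them. The variable names are assumptions made for this illustration.

```python
from collections import defaultdict

# Last names as they fall into the two partitions of the ten-row people dataset
partitions = [
    ['Smith', 'Williams', 'Williams', 'Jackson', 'Johnson'],      # partition 1
    ['Smith', 'Anderson', 'Christiansen', 'Carter', 'Davidson'],  # partition 2
]

# Record which partitions each last name appears in
locations = defaultdict(set)
for partition_number, last_names in enumerate(partitions, start=1):
    for last_name in last_names:
        locations[last_name].add(partition_number)

# A group confined to one partition can be resolved locally;
# a group spread over several partitions forces rows to cross the network
needs_transfer = sorted(name for name, where in locations.items() if len(where) > 1)
print(needs_transfer)
```

Only Smith spans both partitions, so it is the only group whose rows have to travel between servers in this example.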

Depending on what needs to be done with the data, completely avoiding shuffle operations might not be feasible. However, there are a few things you can do to minimize the need for shuffling the data. First, ensuring that the data is stored in a presorted order will eliminate the need to sort the data with Dask. If possible, sorting the data in a source system, such as a relational database, can be faster and more efficient than sorting the data in a distributed system. Second, using a sorted column as the DataFrame’s index will enable greater efficiency with joins. When the data is presorted, lookup operations are very fast because the partition where a certain row is kept can be easily determined by using the divisions defined on the DataFrame. Finally, if you have to use an operation that triggers a shuffle, persist the result if you have the resources to do so. This will prevent having to repeat shuffling the data again if the DataFrame needs to be recomputed.

3.3   Limitations of Dask DataFrames

Now that you have a good idea of what the DataFrame API is useful for, it will be helpful to close the chapter by covering a few limitations that the DataFrame API has.

First and foremost, Dask DataFrames do not expose the entire Pandas API. Even though Dask DataFrames are made up of smaller Pandas DataFrames, there are some functions that Pandas does well which are simply not conducive to a distributed environment. For example, functions that would alter the structure of the DataFrame, such as insert and pop, are not supported because Dask DataFrames are immutable. Some of the more complex window operations are also not supported, such as expanding and ewm methods, as well as complex transposition methods like stack/unstack and melt, because of their tendency to cause a lot of data shuffling. Oftentimes, these expensive operations don’t really need to be performed on the full, raw dataset. In those cases, you should use Dask to do all of your normal data prep, filtering, and transformation, then dump the final dataset into Pandas. You will then be able to perform the expensive operations on the reduced dataset. Dask’s DataFrame API makes it very easy to interoperate with Pandas DataFrames, so this pattern can be very useful when analyzing data using Dask DataFrames.

The second limitation is with relational-type operations, such as join/merge, groupby, and rolling. While these operations are supported, they are likely to involve a lot of shuffling, making them performance bottlenecks. This can be minimized, again, either by using Dask to prepare a smaller dataset that can be dumped into Pandas, or by limiting these operations to only use the index. For example, if we wanted to join a DataFrame of people to a DataFrame of transactions, that computation would be significantly faster if both datasets were sorted and indexed by the Person ID. This would minimize the likelihood that each person’s records are spread out across many partitions, in turn making shuffles more efficient.

Third, indexing has a few challenges due to the distributed nature of Dask. If you wish to use a column in a DataFrame as an index in lieu of the default numeric index, it will need to be sorted. If the data is stored presorted, this becomes no problem at all. If the data is not presorted, it can be very slow to sort the entire DataFrame because it requires a lot of shuffling. Effectively, each partition first needs to be sorted, then needs to be merged and sorted again with every other partition. Sometimes it may be necessary to do this, but if you can proactively store your data presorted for the computations you need, it will save you a lot of time.

Figure 3.7. The result of calling reset_index on a Dask DataFrame

The other significant difference you may notice with indexing is how Dask handles the reset_index method. Unlike Pandas, where this will recalculate a new sequential index across the entire DataFrame, the method in Dask DataFrames behaves like a map_partitions call. This means that each partition will be given its own sequential index that starts at 0, so the whole DataFrame will no longer have a unique sequential index. In Figure 3.7, you can see the effect of this: each partition contained five rows, so once we called reset_index, the index of the first five rows remains the same, but the next five rows, which are contained in the next partition, start over at 0. Unfortunately, there’s no easy way to reset the index in a partition-aware way. Therefore, use the reset_index method carefully and only if you don’t plan to use the resulting sequential index to join, group, or sort the DataFrame.
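
The per-partition behavior can be imitated with Pandas alone, which makes the effect easy to inspect. The sketch below is a simulation rather than a call into Dask: it slices a ten-row frame into two stand-in “partitions”, resets each one separately, and stacks them back together. The `person_id` column is a made-up placeholder.

```python
import pandas as pd

# Ten rows with the default sequential index 0..9
whole = pd.DataFrame({'person_id': range(1, 11)})

# Stand-ins for the two partitions (rows 0-4 and rows 5-9)
partitions = [whole.iloc[:5], whole.iloc[5:]]

# Pandas-style reset: one sequential index across the whole frame
global_reset = whole.reset_index(drop=True)

# Partition-wise reset: each piece restarts at 0, then the pieces are stacked
partition_wise = pd.concat([part.reset_index(drop=True) for part in partitions])

print(global_reset.index.tolist())
print(partition_wise.index.tolist())
```

The second index repeats 0 through 4, which is why it cannot safely serve as a join, group, or sort key afterward.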

Finally, since a Dask DataFrame is made up of many Pandas DataFrames, operations that are inefficient in Pandas will also be inefficient in Dask. For example, iterating over rows by using the apply or iterrows methods is notoriously inefficient in Pandas. Therefore, following Pandas best practices will give you the best performance possible when using Dask DataFrames. If you’re not well on your way to mastering Pandas yet, continuing to sharpen your skills will not only benefit you as you get more familiar with Dask and distributed workloads, but it will help you in general as a data scientist!
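
As a small illustration of that best practice, the following Pandas sketch computes the same result twice: once row by row with iterrows and once as a single column expression. The `quantity` and `unit_price` columns are hypothetical, and with only three rows the speed difference is invisible; the point is the shape of the code.

```python
import pandas as pd

# Hypothetical order data
orders = pd.DataFrame({'quantity': [2, 1, 5], 'unit_price': [3.0, 10.0, 0.5]})

# Row by row: every iteration builds a Series and runs Python-level arithmetic
totals_loop = [row['quantity'] * row['unit_price'] for _, row in orders.iterrows()]

# Vectorized: one expression evaluated over whole columns
totals_vectorized = orders['quantity'] * orders['unit_price']

print(totals_loop)
print(totals_vectorized.tolist())
```

Both approaches give identical totals, but the vectorized form avoids a Python-level loop, and the same expression style carries over unchanged to Dask DataFrames.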

3.4   Summary

In this chapter you learned

  • How to recognize the structure of Dask DataFrames
  • How to select different DataFrame partitioning schemes and the impact that your choices have on performance
  • How to repartition a DataFrame to improve performance after a filtering operation
  • How to navigate the limitations of the Dask DataFrame API