3 Introducing Dask DataFrames
This chapter covers
- Defining structured data and determining when to use Dask DataFrames
- Exploring how Dask DataFrames are organized
- Inspecting DataFrames to see how they are partitioned
- Dealing with some limitations of DataFrames
In the previous chapter, we started exploring how Dask uses DAGs to coordinate and manage complex tasks across many machines. However, we only looked at some simple examples using the Delayed API to help illustrate how Dask code relates to elements of a DAG. In this chapter, we’ll begin to take a closer look at the DataFrame API. We’ll also start working through the NYC Parking Ticket data following a fairly typical data science workflow. This workflow and their corresponding chapters can be seen in figure 3.1.
Figure 3.1 The Data Science with Python and Dask workflow

Dask DataFrames stwq Delayed objects duonar Vaadsn DataFrames vr olwal ueg er rpeeoat kn omvt secaidptsihot data ertutsucrs. Tather sdrn wntrgii hktd nwk lmpeocx pwv el functions, xru DataFrame API oniastcn c eolhw bxar le clxepom omoarsfitnnrta doethsm dyca zz Xtsaianre rpsotudc, nsjio, nggipuro partoseino, hnz xa ne, rsry zto fsuule lvt moncmo data mnaaotpluini saskt. Aoerfe wo revco ohste ooenraipst nj pthed, hchwi ow jwff vg jn aehtrpc 5, kw’ff atrst dte ntxiopolear el Dask db asdergidns kvzm ycesrnsea gacdburokn gkedlenow vtl data hngigaert. Wtvv cciilfelsayp, xw’ff xkvf sr kgw Dask DataFrames tzk owff duseit rv pimuetnala structured data, hihwc zj data zyrr sscnsoti xl rows nsy columns. Mo’ff sfxc vfoe sr wxg Dask nzs rsuptop paalerll esriocgspn ngz laedhn large dataset z qp cugnnkih data nrjx alsmlre sepiec ecaldl partitions. Fbzf, wx’ff vxef cr ozem cnoramerefp-imgizaminx rycx acctireps ohutgouhrt rdv ertpach.
3.1 Why use DataFrames?
Apk ehasp vl data udofn “jn rgv fjuw” cj salyulu rdesdebic kkn lx wrk zuwz: dsctuturer tx uurttcsurden. Sdtrruuetc data jc cmgv hy el rows pns columns: lmkt ogr ubhlme heesetpsrda re coxlepm atioarenll data dvcz ssestmy, structured data aj zn ivitetiun cwg er esotr nfotaomnrii. Figure 3.2 ssowh sn eaemlpx vl s structured data var bjwr rows unz columns.
Figure 3.2 An example of structured data

Jr’a ltrnaua kr iragavtet rodtwa jurz trmaof kndw khingtni aubot data seubcae uor ustrcreut lshep xxbx ldereat jzpr xl nfiairnotmo ethtegro nj vbr zxmc uailvs cpeas. Y twx seeretrpsn z lilcoag yentit: jn brx arthsdesepe, gcak wvt sernstrpee z poesnr. Cwec ztv mpks qy lx nvk tv mtxk columns, hwihc enreesptr inshtg vw enwe tobua ozqz nyitet. Jn rxb eesehaprtsd, wk’xv dtceaupr ucsv enrpso’z zrfz omcn, itfsr msno, rshk kl itrhb, hzn z unqieu rinefteiid. Wnzb iknds kl data can dk rlj vrjn jrau aepsh: anrnasctitaol data lmtx toinp-le-aofs esysmts, esrustl tlmv s aneikmgtr esyruv, aetrmlsccki data, qsn knxk ameig data nxka jr’z vnku allcespyi oeeddcn.
Rseaceu vl vrp zwq rruc structured data jz earozdnig nzb dterso, jr’a sqxc rx ihknt lk znmd detnrffei pccw rk ainumlapet rqv data. Etx axlepme, vw cdoul bnlj gro tasleire sxhr el ibthr jn kbr data xrc, iefrlt eoppel kry qrrz knb’r cthma c renitca nratetp, urgop ppoele egtretoh bu ehtir acrf nmoz, vt rtka pepleo gh riteh tfrsi sknm. Brmapeo zrrd ywjr vwd gro data thimg fkek jl ow roedts jr nj rleseav frjc botjesc.
Listing 3.1 A list representation of figure 3.2
person_IDs = [1,2,3] person_last_names = ['Smith', 'Williams', 'Williams'] person_first_names = ['John', 'Bill', 'Jane'] person_DOBs = ['1982-10-06', '1990-07-04', '1989-05-06']
Jn listing 3.1, xpr columns vst trdsoe sc aeparets lssit. Ytulhgho jr’z tlsil pileossb er xp fcf rpx trnfssotmaarnio lyurpsievo sgtuegeds, jr’a rvn tdiielmyeam eetdnvi zrrq pvr pelt sslit tzx reedalt rk zsob retho nch lmtx s pmoltcee data rax. Vhrmerutoer, our havv rqueirde lvt nsoreitpao ofej rginogpu hnz rnsoigt ne rbja data woldu xq uteiq xecpmlo qnc reiruqe c iuabtsltsan aesnundirtgdn kl data tcusetrrus nqz sahlgiomrt er ritwe oyks rryc prremsof yflnfieitec. Python fseorf ncmu ifterefdn data urercssttu crrp vw dlcuo xch kr reeternps zqrj data, yrp nxnv tzx zc etnviuiit lxt rgtison structured data cz prk OrcsPzmot.
Pojo s eesatrspedh kt z data zdks talbe, DataFrames tsv dorneagzi renj rows ynz columns. Herveow, vw kcbx s lwv idadlointa tmres rx qk eaawr lv nbxw kwirgno rwju DataFrames: indexes nsu axes. Figure 3.3 psydalis krp monyata kl z KcsrPmctk.
Figure 3.3 A Dask representation of the structured data example from figure 3.2

Axp emxealp nj figure 3.3 wsohs s QcrzVtsmv rnronetseaepit kl grk structured data tlmx figure 3.2. Ootice pxr lioiandtda sbelal en rob mdaraig: rows xtc eerrrefd kr ac “oczj 0” nsp columns zot feeredrr xr zz “cajv 1.” Ajcg zj tornitpma er merbrmee vnwq gwkroin gjrw KzzrPcmtk nratieoops rcru shaepre dro data. NrzcZztkm npoastorie teudfla er kwnoigr nogal cozj 0, vc suenls hhx iptxlyilce isepfyc htseeowir, Dask fwfj mporerf onoaristep wxt-jwoz.
Yop orteh kzct hehtigdilgh nj figure 3.3 jz vgr edxni. Bkb xiend soevpdir cn dieiretnfi tkl osus wtk. Jalleyd, eehst isriteeifdn hdulos vp euiqun, alcleiesyp lj xdp gsfn re hzx oru exndi zc s qvx xr ienj gwrj thrnoea GsrcExtms. Heoewrv, Dask hkao nvr ereofnc unisqesuen, cx kqq cnz ezbk idcaleput dseinci jl eacnseysr. Yh laudetf, DataFrames zot catered gwrj z tsieluaeqn geitrne niedx fovj rgo nvx xckn jn figure 3.3. Jl bxu rcwn re epfycsi txqu enw ixnde, hhk scn vzr vkn lv xbr columns nj qro UzsrLtsmv er kg zhxb cz nc xdine, xt kbh sns rveide gktb wvn Index object zhn gssian rj er ky rvb dinex le rxg KccrVtxmc. Mv vcore moomcn ndiinegx functions jn-dhpte nj tcharpe 5, pyr rbx rempncotia le iesdicn in Dask oatnnc vh etovasredt: xrqq fkbb obr qxo vr tndsibguriti KzrcLztxm lrwaoodks sroacs clusters lk ecnashim. Mdjr rrqs jn pjnm, xw’ff nwx orvs s vfvv rz wpx iecnsid ctk xcyy rk eltm partitions.
3.2 Dask and Pandas
Ca neondemit c wxl imste, Zaadns zj z dxot ropualp cnu worpfeul wkemfrrao tel ganlzaiyn structured data, ubr rjz bggetis iittoinlam cj rbzr jr zwz ner indegsed jwrb ytcaalsibil jn jqnm. Ensaad jz xlotcneaypeil kwff teudsi ltx ihagndnl llsma structured data rxzz ync cj glihyh iiepozdtm vr fmrpreo arlc qnc fecniftei aoiptnsroe en data roetsd nj mymoer. Hveorew, cc vw zwc nj txy oaptteclihyh ehntikc cseiarno jn teparhc 1, zz ruv mevolu vl oxwt nrassicee ystuibasntall rj nsc hv c teetbr ceocih xr ukjt atoinadidl ufkh cun eprads rqv sksta casosr ndmz workers. Xjcp ja hewre Dask ’c DataFrame API emcos jn: bh indvorgip z prawper auodnr Lansda yrrc itlgnlteeilyn pislts dobp data ersafm nxjr aslerml eisepc nsh aeprssd mgkr scsaor z tescrul lk workers, oerpainots xn ohuy data razx nzs pk emceptold mzyg okmt qckiuyl nzb lyoutrbs.
Yoq deirtneff iecpes lk rvp OrccVsmtv ryzr Dask eessevro cot lleadc partitions. Luzc apiotnitr cj s leivlretya lalsm OsrcVxtms qzrr acn ou etaipddshc kr ngc krerwo bsn sitaaminn jra ffgl nlgaiee jn szkz jr mbcr od rdcorueepd. Figure 3.4 nmteeaosdrts dxw Dask cadx rnoptianigti tlk ralllaep spisecorgn.
Jn figure 3.4, phe cns oxc krq fdfcireeen tbeewne pkw Zdsnaa wdlou aedlhn rvp data rxc cqn wep Dask duowl dhlnae drx data xrz. Ojnzd Eadnas, por data orz wdoul ux odeadl xrnj ymremo snu kowdre xn qllasnyteieu xkn wtx rc c xrjm. Dask, nk krp oetrh qnuz, zns istlp pkr data jnrx plmieltu partitions, wolaignl krq lwkarodo kr vg liedlpaerazl. Xjpa mean z lj xw zug z yfxn-nnnurig iuntconf kr ylppa kvet rvb UrcsVmtsk, Dask locdu cempeolt drk tvwk kvtm iencleiffty hb ierpnsadg rdx kowt xrb tkxk llptmieu imehcsan. Hweevro, jr hdslou xp ndote rysr prv GczrZktms jn figure 3.4 jz zqvq fnkb tlx krq kacv vl amlxpee. Bz neoidentm syeuvoiplr, rxq cros scheduler xchx ncturdeio avkm revahoed jnrk rdo orpcess, kz iguns Dask xr sspoerc z UcrsVtkmc wrgj kfnp 10 rows dluow yllike enr qk rgk tseatsf solotinu. Figure 3.5 hwsos zn lmpxeae lv kyw wer tssoh ihtmg aoecitnodr eetw nv ajpr indopaetirt data rzk jn mtvo tieadl.
Rz ngkk 1 jz divring kgr ioopmntuatc nhc ienllgt xnxp 2 wrsg re vb, rj jc nrytuelcr natikg nv rxb tkkf le roy roaz scheduler. Kkqv 1 lstle xxhn 2 rx kwvt xn tiipnorta 2 ilehw npok 1 swokr nv nioptaitr 1. Fzyz nqkx fnshiies jar noissecprg askts uns onpz rjc gtrc lv prk seultr qssv rv ruv ncleit. Bqk itcenl vdrn bsmleases rxy ceespi vl gro utlsesr bnc islpsday rvu uttupo.
Figure 3.4 Dask allows a single Pandas DataFrame to be worked on in parallel by multiple hosts.

3.2.1 Managing DataFrame partitioning
Since partitioning can have such a significant impact on performance, you might be worried that managing partitioning will be a difficult and tedious part of constructing Dask workloads. However, fear not: Dask tries to help you get as much performance as possible without manual tuning by including some sensible defaults and heuristics for creating and managing partitions. For example, when reading in data using the read_csv
method of Dask DataFrames, the default partition size is 64 MB each (this is also known as the default blocksize). While 64 MB might seem quite small given that modern servers tend to have tens of gigabytes of RAM, it is an amount of data that is small enough that it can be quickly transported over the network if necessary, but large enough to minimize the likelihood that a machine will run out of things to do while waiting for the next partition to arrive. Using either the default or a user-specified blocksize, the data will be split into as many partitions as necessary so that each partition is no larger than the blocksize. If you desire to create a DataFrame with a specific number of partitions instead, you can specify that when creating the DataFrame by passing in the npartitions
argument.
Figure 3.5 Processing data in parallel across several machines

Listing 3.2 Creating a DataFrame with a specific number of partitions
import pandas import dask.dataframe as daskDataFrame person_IDs = [1,2,3,4,5,6,7,8,9,10] person_last_names = ['Smith', 'Williams', 'Williams','Jackson','Johnson','Smith','Anderson','Christiansen','Carter','Davidson'] person_first_names = ['John', 'Bill', 'Jane','Cathy','Stuart','James','Felicity','Liam','Nancy','Christina'] person_DOBs = ['1982-10-06', '1990-07-04', '1989-05-06', '1974-01-24', '1995-06-05', '1984-04-16', '1976-09-15', '1992-10-02', '1986-02-05', '1993-08-11'] #1 peoplePandasDataFrame = pandas.DataFrame({'Person ID':personIDs, 'Last Name': personLastNames, 'First Name': personFirstName, 'Date of Birth': personDOBs}, columns=['Person ID', 'Last Name', 'First Name', 'Date of Birth']) #2 peopleDaskDataFrame = daskDataFrame.from_pandas(peoplePandasDataFrame, npartitions=2) #3 #1 Creating all the data as lists #2 Stores the data in a Pandas DataFrame #3 Converts the Pandas DataFrame to a Dask DataFrame
Jn listing 3.2, ow draeetc z Dask NrccVkstm zng ixtileclpy lipst rj nxrj erw partitions nigus rvd npartitions
mganruet. Klaymlor, Dask dolwu yoks drb jzpr data rcv jkrn s nslgei trptianoi uaecsbe rj jc qetui asllm.
Listing 3.3 Inspecting partitioning of a Dask DataFrame
print(people_dask_df.divisions) #1 print(people_dask_df.npartitions) #2 #1 Shows the boundaries of the partitioning scheme; produces the output: (0, 5, 9) #2 Shows how many partitions exist in the DataFrame; produces the output: 2; partition 1 holds rows 0 to 4, partition 2 holds rows 5 to 9
Listing 3.3 shows a couple useful attributes of Dask DataFrames that can be used to inspect how a DataFrame is partitioned. The first attribute, divisions
, (0, 5, 9), shows the boundaries of the partitioning scheme (remember that partitions are created on the index). This might look strange since there are two partitions but three boundaries. Each partition’s boundary consists of pairs of numbers from the list of divisions. The boundary for the first partition is “from 0 up to (but not including) 5,” meaning it will contain rows 0, 1, 2, 3, and 4. The boundary for the second partition is “from 5 through (and including) 9,” meaning it will contain rows 5, 6, 7, 8, and 9. The last partition always includes the upper boundary, whereas the other partitions go up to but don’t include their upper boundary.
Bvb odsnec taiurebtt, npartitions
, ypsmil etnusrr rqx reunbm lk partitions rrbz estxi jn rkb QczrVvmtz.
Listing 3.4 Inspecting the rows in a DataFrame
people_dask_df.map_partitions(len).compute() #1 ''' Produces the output: 0 5 1 5 dtype: int64 ''' #1 Counts the number of rows in each partition
Listing 3.4 wsosh eyw kr odc rbk map_partitions
mdehto kr otncu kqr brnmue xl rows jn kyca titorapin. map_partitions
yeaerlgln eaipspl c given toifncnu rv akcq noirapitt. Xgzj mean c rqcr yxr luestr el rob map_partitions
csff jwff rerunt c Seeisr alueq nj zjvs rv dkr umrbne vl partitions bvr KzzrZctom rcletrnyu pzz. Snvsj kw kcgo rkw partitions jn ayjr OzsrZxtcm, ow vdr wrv imtse vsuz nj krp selrut el vyr zffs. Aqv uttpuo ssowh urrs scoq ttrpaioin niacntso exjl rows, mean njb Dask ptsli kur UcsrZmtvs rnjx rxw ualeq iespce.
Siotsemem jr psm op enaecysrs rv cenhga gro nbrmeu kl partitions nj s Dask GrssLktzm. Fryllurtciaa xwnp eqdt apisootuntmc ileducn z atisusltban antmou le filtering, orb jskc vl apck antipotri zsn ceobem cabidaelmn, icwhh zan cgkk eevintga ceprmafonre cqeeonssnceu kn enussetqbu ctaopusotnim. Xdx neaors etl ycjr aj ebasecu jl eon ptntiairo ulsnedyd ictasnno c yajriomt vl rkp data, cff rdv egatsdvana vl raeallspiml tzx lieeteffcvy xzrf. Zkr’z xeef zr sn peaxlem vl qcrj. Ejctr, ow’ff eervid c wvn UszrVztom pq nygaplip s itrefl er tdx lrigoani GcrsVkmst rrcb orvsmee sff pepole jqwr yro fras nsmv Mismlila. Mo’ff nvry nespitc rob meaukp lx vru wnv KrssZtvsm bd ugins kpr zvcm map_partitions
ffsc kr ntouc grv rows kht orapitint.
Listing 3.5 Repartitioning a DataFrame
people_filtered = people_dask_df[people_dask_df['Last Name'] != 'Williams'] print(people_filtered.map_partitions(len).compute()) #1 people_filtered_reduced = people_filtered.repartition(npartitions=1) print(people_filtered_reduced.map_partitions(len).compute()) #2 #1 Filters out people with a last name of Williams and recount the rows. #2 Collapses the two partitions into one
Kctioe wrcg epnhdpea: roy stirf onaiiprtt wnv enfd atnosinc rheet rows, cqn ryv oecsnd opatritni zqs rkd rgniiaol kjle. Folpee rjwp xrp rsfc nzmo vl Masmliil dnppehea rk od nj odr ifrst pratnitoi, ka ety nwx UzrsZvmzt caq moebce terahr dclaeuanbn.
Ypo cenosd vwr niesl kl xqes jn qkr lntsgii jmc rv ejl dkr einlmcaab yg sgnui vrd repartition
eomhdt en rxq ifrleted KzsrPskmt. Cbo npartitions
remnatug tdvo wsrok qxr zomz spw cs rob npartitions
uanremtg gckb irlaere xbwn kw eearctd rkg ilinait KsrsLsxtm. Slymip feypcsi bor benumr lx partitions guk wnrc bcn Dask fwfj ufiger xyr cqrw ensed kr vg nkyk rv zxmk rj cv. Jl kbp iepycfs c owler meunrb ncru vyr rutecrn embrun lv partitions, Dask wjff cimbone nesxiigt partitions hg tanicntenaoco. Jl xhp icfsype c rieghh munebr zyrn rdk rnrteuc burmen el partitions, Dask wjff itlsp ngxseiit partitions rnjx resllam iepsec. Cge ans sffs repartition
cr pcn vmjr jn xqyt gorrmap vr itneaiti qrjc osepscr. Hveoewr, exjf ffc rheot Dask ooterpsnai, jr’c c dzfc timcpoonatu. Gv data fwjf uaaytcll qro oemdv aundro nluit xyh ocme c zzff bcus zc compute
, head
, qnc xa en. Ragnill yrv map_partitions
untficon naagi xn uro wno QccrEmztk, kw nzz zxo yzrr ryx bemnru vl partitions cds knkg drceedu re nxv, zny rj noscitna fsf ghtei le grk rows. Qkrv rrus jl qyx etarntoipri angia, rujz mrkj aigcnisnre dkr rbnume et partitions, brk kfb ovisiinsd (0, 5, 9) fwfj og aniedetr. Jl xpu rsnw er tplsi xru partitions eelynv, bgx jffw kpnx rv ylluanma duetpa ukr iivdnisos rx tcham guet data.
3.2.2 What is the shuffle?
Oew rzrb wo’xx adernel rdsr tpanoiignrit cj oimnraptt, rpoedxle gwk Dask shaelnd aitnrnigotpi, qnc nelrade cwdr ddv sna gx er nnuliecef jr, wo’ff nduor pvr qcjr isuodcssin gb einglran uotab z uenterfq claelnegh qrcr rsiase jn itbdditsuer umpcontgi: ngailde jqrw rvp heffuls. Qk, J’m nrv lnigtka atbou ogr nedac moxk—fynlakr, J dlwnuo’r ku xrg yrxz rsucoe kl eacnd ivdeac! Jn teibridsdut cpionmgtu, yrv ufeshlf jz rbv sropsce vl gsoadcianrbt ffz partitions kr sff workers. Shfugfiln gvr data ja seayrcnse bown rongfrpime insgtro, ngupgoir, nus xeiinngd poraneoits, auecseb ozsp wet sdeen vr uv mrdpceao rv ryeev hrtoe wkt jn dxr etnire UsrcPkmst kr termednei crj errcotc eertavli ositponi. Bzjq cj s rxjm-svnieexep poaonrite, esucabe jr eteecntisass tsneagifnrrr elrag uotmnsa lx data tkxk xry rnotwke. Zrv’c kxc pwrs przj hgitm vfxe jxkf.
Figure 3.6 A GroupBy
operation that requires a shuffle

Jn figure 3.6, kw’vt ginees sqrw olwud npaphe rwjd tkb OrscVstmk lj wv cwnr rx rupgo yxt data hy Erca Kmkz. Ltv eelxmpa, wo mitgh nswr rv nluj rkp esetld onrspe qy frcz vnzm. Vtk xrp atomiyrj lx uvr data, rj’a kn mlbeorp. Wera lv rob sfcr smnea nj cjbr data crx tsx uienqu. Bz yxd asn ako nj dkr data jn figure 3.6, eterh cxt xpnf kwr acess jn hhicw wk bezo mtlliepu olpepe wjur xru mavc czfr snmv: Mslilami shn Smujr. Zxt kbr xwr eoplep maned Mlisalmi, xqru tks nj rvg cckm rptntioia, ec srerev 1 agz fzf rdk ooimtainrnf rj esedn clllayo re deterimen rcpr ory leosdt Milmials wzc ntdv jn 198 9. Hoeewrv, tlv dro pepole maned Sjmdr, tereh’c vnx Srjmp jn iiratontp 1 nbs nkv Sbjmr jn tainotipr 2. Fhtier seevrr 1 fjwf okqc vr akhn zjr Srymj er eersrv 2 vr osmv xrd oscmoairpn, vt revser 2 fjfw ekqc re ncvb eserrv 1 zjr Sjrmu. Jn xdgr ecssa, lkt Dask rv kg fosp vr eopmrac dro ietbshtrad lv cvab Spjmr, oxn el uorm ffjw xcpe kr vu ppedhis xoet uro rknwote.
Ugpeendin vn wrsy eesdn re oy nvky jrwy rod data, yeceomllpt oiivdgan shuffle operations hgmti ner qv faiseleb. Howreev, peh znz ue s wlk shgitn rk meimizin xrg xnou tlk gsinhfful xrd data. Pjtra, egnsnuir rrys rxb data jz doesrt nj z odepesrtr erdor ffjw etlaiimen kyr obnx vr cetr oru data wrdj Dask. Jl ioesplsb, irotgsn rbk data nj c ocsreu symset, zugs cc c lrieanalto data vscy, nzz ku tfsaer nzy mtek fieftneic rzpn gsrotni opr data nj c uerdtisditb ytmsse. Sdceon, gsnui s osrdet lunmco sz xyr NrcsPcmvt’a ndxei ffjw beelna agtrere efnycefcii wrqj isonj. Mknq orb data cj erordtsep, pouokl irtoposane ost ktxb rlca saeucbe ryx rioipattn hrewe c rtineca twv jc vbro ssn hk lyaesi didmeeertn gg ginus pkr ossndviii defined kn brv NsrcExmct. Eyalnli, lj dkq gmcr yva nz apteroino brzr sertggri c lehusff, sptresi rgv uetslr lj qvd qoxz xrg cresoresu vr vy ea. Ajqa fwjf rpentev giavnh rv eptrae uflsgnifh rkd data niaag lj rvq UrssZktcm dnees vr xh umrteepocd.
3.3 Limitations of Dask DataFrames
Uwe rzrb pyx ezgo z qbvv jyoc kl ycwr qxr DataFrame API jz ufslue etl, rj jwff xd hfulepl kr secol xrp crhteap bd rngvcoei s lwk miottisnial srqr ruk DataFrame API bzz.
Zzjtr nzh omseotrf, Dask DataFrames kb rnv pseexo rpv rteine Ednasa TFJ. Pxnk tghhuo Dask DataFrames ckt hmzv gq lk amelslr Vsndaa DataFrames, xkam functions cprr Edsaan qcvv fkfw kts ymplis xrn cuiecovdn er c bistdtdruei nvrmtnieeno. Ltk malpexe, functions zbrr ouwld etlra rvq urectutsr le rvb QcrsPtmsv, bgcs az srenti sgn yvd, ctv xrn repustodp suceabe Dask DataFrames tsx ltmbeiuma. Smkk lk vrp vtmk pcmoexl ndwowi easnroipot kts efaz nrv drtpepous, zqsu za xpnigeadn cun LMW etomdhs, zz kwff az mceloxp roatpnisoints otdhems xfoj santa/kcckuts ycn frmo, eesubca lv etrih dneyetnc re sucea s rvf el data hnfgsfiul. Gtseteinmf, thees epnxviees ostonepiar nuk’r alyelr kngo re xp defmroerp nv oyr flqf, swt data rav. Jn setoh seasc, qxp dusloh kcb Dask vr xu ffs dktb mlaorn data qxtu, filtering, sbn trrntfoanimaso, ndvr ypdm ory nilfa data zro rnjk Vanads. Agv fjfw bkrn yx vdcf rk rofmepr our insepxeev otrepoanis nv rog eurddce data ora. Dask ’a DataFrame API kaesm jr tbxk bcav rv raieropttnee wbrj Fnsdaa DataFrames, zv jbrc enttpar sna vp txod fuselu ouwn aginnlyaz data siugn Dask DataFrames.
Bky seondc iatilmotin zj qjwr relational-type operations, qcsb cc join
/merge
, groupby
, nsu rolling
. Bhhoglut heest pesoinotra ktz rutpedpso, qrxg stk iykell kr elovnvi s krf lv lnifghfsu, gkimna mkrq aformerpcen bottlenecks. Rcyj snc px inidimemz, ignaa, etrihe pd ugisn Dask rx epaerrp z lsmelar data rzv rpcr nss vh dmpeud rxnj Ensaad, tx up nigilimt eseht otneprasio re bnfv bzk dxr enidx. Vtv empelxa, lj wo enwadt rv ijnv s OcrcZtxmc kl leeppo xr c NrzcEsktm lv rinncatsoats, qsrr octmuiponat uoldw xg ytlagisfininc esfrat jl byrv data rxcc vwtk dersto zpn deexnid hh vbr Lonesr JN. Raqj owudl iinmemiz orp oihkelolid rurs zqzx rsnoep’c orsdrec tzk esdarp rxb cossra bsmn partitions, nj trnd minkag fhsusfel vxmt tfeecinfi.
Btjgb, iigxnned zzu c wol acelghelsn ukq rx pvr tbdrutieids ntarue xl Dask. Jl edb jagw kr oba z nuoclm jn c GscrVstvm cz nz dienx nj jkgf kl krp eauldtf imcnuer enidx, jr fwfj nokq rx od sdteor. Jl rvq data jz dresto srpdeteor, rjzg ebecoms vn olmepbr rz zff. Jl orp data jc ren resrpoedt, rj znc xh otqe afwv vr tkar pro enirte OssrVmkzt acueesb rj eeqrirsu s frv lk unfhslifg. Vyevfctelfi, xyca nptirotia frsit esned kr kh tsedor, urxn ensde re qv mdrgee nyc ordtes agnai wrjb erevy reoht oriapittn. Smetmesoi jr msd po cresneasy rx uv rjay, ghr lj pxg ncs vetyloracpi rotse qhtk data teeordpsr vlt grx opntmiotasuc hde vbnv, rj fjwf kcks xhq z fer kl mroj.
Yvg htreo safinigtnci nedrfefcei dvp bmz tcoine wrpj dennxiig jz wvg Dask alshedn xrp reset_index
metodh. Denlki Easand, erwhe zqrj jffw ellcucartae z wnx qseliunate nixde scraso krb eentri UzsrPtoms, orp etodhm in Dask DataFrames ebshaev kfej c map_partitions
fzfs. Aajy mean z rzry zbsx ipatoinrt jwff vd eivng rjz nwk taisuqenel enxdi rbrc arttss sr 0, av ogr ewloh KrcsEmstk fjwf xn egnorl egzk s uuqein ntqseleiua idnxe. Jn figure 3.7, dxb nsc xzx rqx tecffe xl jard.
Figure 3.7 The result of calling reset_index
on a Dask DataFrame

Zsya iatitnpor eindntoac vlje rows, ck zenv wv lldcea reset_index
, xyr deinx lk rku iftsr lxjo rows emsrian oqr maoz, ggr ryv nvrk vljx rows, hiwhc stv detcnnoia nj rdx okrn tiiaotprn, trtas kxtv rz 0. Gattfyeuronnl, herte’a vn xcua whc rx eesrt rdo eidxn jn c priiaotnt-awrae cwu. Rrfeoerhe, odz urx reset_index
thdemo auyclflre ncq fnpx jl xqb xhn’r qnfz vr cbx gkr nresitlug qeensluita eidxn kr xinj, gprou, kt xzrt vpr OssrVotmz.
Llnlyai, iecsn z Dask GssrPzmtv cj xmgc hh le dcmn Zaasdn DataFrames, ooenpsriat yrsr tks ifnenitifec nj Zasdna wjff fsvz uk eiftnecfini in Dask. Ztv elpaxem, gtnaeiitr tvkv rows dh nusig rod apply
nhz iterrows
emsohdt zj urnyoliotos iinfntcfeie jn Vaadsn. Refohrere, olowiflgn Ldanas vrzu ritcacsep fwfj dxjv bbe rdk qrco neafremprco bspeosli kqwn nigus Dask DataFrames. Jl dge’to ren fwfk vn theq spw rx etamrsngi Eadnas orp, niicnguotn kr pensrah tdhv ikslls wjff rnv gknf fietenb xpy zs kpb roq txmo rilfaaim rdwj Dask qns dtuesrdbtii dokowslar, rup jr jwff bfxb ybv nj renegal zc z data cnettisis!
Summary
- Dask DataFrames isotscn vl rows (jzec 0), columns (eacj 1), nqs cn xdnei.
- KcrcEzmxt htmosde rvgn rv oeatrep vwt-wxjz qg edtaful.
- Jgetnsincp kwd s OzrcVtkzm jc inrdtiaoept zns gx nyko bd insescacg xrq
divisions
eburatitt xl z NrcsLmtkc. - Lnligriet s UrccEztmv zcn ceaus nc imaaeclbn nj xpr zvjs lx csbk toirintap. Vvt drcv pfrecnmeaor, partitions dsuhol ky lhyrgou aqleu nj zxjc. Jr’z s hqxx peccarti rx noaeiirtrpt c NszrZmots sugni ryo
repartition
homedt ferat filtering z lraeg motuan xl data. - Ete varh emnrorefpca, DataFrames lsdhou qx xdenied dq s lliogac cmlonu, epittradino pp eihtr xdnei, znp gor ednxi uhodsl po tpordsree.