Chapter 3. Function pipelines for mapping complex transformations

This chapter covers

  • Using map to do complex data transformations
  • Chaining together small functions into pipelines
  • Applying these pipelines in parallel on large datasets

In the last chapter, we saw how you can use map to replace for loops and how using map makes parallel computing straightforward: a small modification to map, and Python will take care of the rest. But so far with map, we’ve been working with simple functions. Even in the Wikipedia scraping example from chapter 2, our hardest working function only pulled text off the internet. If we want to make parallel programming really useful, we’ll want to use map in more complex ways. This chapter introduces how to do complex things with map. Specifically, we’re going to introduce two new concepts:

  1. Helper functions
  2. Function chains (also known as pipelines)

We’ll tackle those topics by looking at two very different examples. In the first, we’ll decode the secret messages of a malicious group of hackers. In the second, we’ll help our company do demographic profiling on its social media followers. Ultimately, though, we’ll solve both of these problems the same way: by creating function chains out of small helper functions.

3.1. Helper functions and function chains

Helper functions are small, simple functions that we rely on to do complex things. If you’ve heard the (rather gross) saying, “The best way to eat an elephant is one bite at a time,” then you’re already familiar with the idea of helper functions. With helper functions, we can break down large problems into small pieces that we can code quickly. In fact, let’s put forth this as a possible adage for programmers:

The best way to solve a complex problem is one helper function at a time.

J.T. Wolohan

Function chains or pipelines are the way we put helper functions to work. (The two terms mean the same thing, and different people favor one or the other; I’ll use both terms interchangeably to keep from overusing either one.) For example, if we were baking a cake (a complex task for the baking challenged among us), we’d want to break that process up into lots of small steps:

  1. Add flour.
  2. Add sugar.
  3. Add shortening.
  4. Mix the ingredients.
  5. Put the cake in the oven.
  6. Take the cake from the oven.
  7. Let the cake set.
  8. Frost the cake.

Each of these steps is small and easily understood. These would be our helper functions. None of these helper functions by themselves can take us from having raw ingredients to having a cake. We need to chain these actions (functions) together to bake the cake. Another way of saying that would be that we need to pass the ingredients through our cake-making pipeline, along which they will be transformed into a cake. To put this another way, let’s take a look at our simple map statement again, this time in figure 3.1.

Figure 3.1. The standard map statement shows how we can apply a single function to several values to return a sequence of values transformed by the function.

As we’ve seen several times, we have our input values on the top, a function that we’re passing these values through in the middle, and on the bottom, we have our output values. In this case, n+7 is our helper function. The n+7 function does the work in this situation, not map. map applies the helper function to all of our input values and provides us with output values, but on its own, it doesn’t do us too much good. We need a specific output, and for that we need n+7.
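As a minimal sketch of figure 3.1 (the name add_seven and the sample inputs are made up for illustration), the helper does the arithmetic and map only applies it to each input:

```python
def add_seven(n):
    """Helper function: this is the part that actually does the work."""
    return n + 7

inputs = [2, 4, 6]                        # hypothetical input values
outputs = list(map(add_seven, inputs))    # map is lazy, so list() forces evaluation
print(outputs)                            # [9, 11, 13]
```

Swapping add_seven for a different helper changes the output without touching the map call, which is the division of labor the figure describes.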

It’s also worth taking a look at function chains, sequences of (relatively) small functions that we apply one after another. They also have their basis in math. We get them from a rule that mathematicians call function composition.

Function composition says that a complex function like z(x) = ((x+7)^2 - 2) * 5 is the same as smaller functions chained together that each do one piece of the complex function. For example, we might have these four functions:

  1. f(x) = x + 7
  2. g(x) = x^2
  3. h(x) = x - 2
  4. j(x) = x * 5

We could chain them together as j(h(g(f(x)))) and have that equal z(x). We can see that play out in figure 3.2.

Figure 3.2. Function composition says that if we apply a series of functions in sequence, then it’s the same as if we applied them all together as a single function.

As we move through the pipeline in figure 3.2, we can see our four helper functions: f, g, h, and j. We can see what happens as we input 3 for x into this chain of functions. First, we apply f to x and get 10 (3+7). Then we apply g to 10 and get 100 (10^2). Then we apply h to 100 and get 98 (100-2). Then, lastly, we apply j to 98 and get 490 (98*5). The resulting value is the same as if we had input 3 into our original function z.
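The arithmetic above can be checked in a few lines of Python. This is only a sketch: the one-letter function names are labels chosen for the example, and the chain is written as plain nested calls rather than with any pipeline library.

```python
def f(x):
    return x + 7      # first step: add 7

def g(x):
    return x ** 2     # second step: square

def h(x):
    return x - 2      # third step: subtract 2

def j(x):
    return x * 5      # fourth step: multiply by 5

def z(x):
    """The same computation written as one complex function."""
    return ((x + 7) ** 2 - 2) * 5

steps = [f(3), g(f(3)), h(g(f(3))), j(h(g(f(3))))]
print(steps)   # [10, 100, 98, 490]
print(z(3))    # 490
```

Each intermediate value matches the walk-through, and the chained result equals the single-function result.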

With these two simple ideas, helper functions and pipelines, we can achieve complex results. In this chapter, you’ll learn how to implement these two ideas in Python. As I mentioned in the chapter introduction, we’ll explore the power of these ideas in two scenarios:

  1. Cracking a secret code
  2. Predicting the demographics of social media followers

3.2. Unmasking hacker communications

Now that we’re familiar with the concept of function pipelines, let’s explore their power with a scenario. Here, we’ll conquer a complex task by breaking it up into many smaller tasks.

Scenario

A malicious group of hackers has started using numbers in place of common characters and Chinese characters to separate words to foil automated attempts to spy on them. To read their communications (and find out what they’re saying) we need to write some code that will undo their trickery. Let’s write a script that turns their hacker speak into a list of English words.

We’ll solve this problem like we’ve solved the previous problems in the book: by starting with map. Specifically, we’ll use the idea of map to set up the big picture data transformation that we’re doing. For that, we’ll visualize the problem in figure 3.3.

Figure 3.3. We can express our hacker problem as a map transformation in which we start with hard-to-read hacker messages as input. Then, after we clean them with our hacker_translate function, they become plain English text.

On the top, we have our input values. We can see that they’re some pretty hard-to-read hacker communications, and at first glance they don’t make a lot of sense. In the middle, we have our map statement and our hacker_translate function. This will be our heavy lifter function. It will do the work of cleaning the texts. And finally, on the bottom, we have our outputs: plain English.

Now this problem is not a simple problem; it’s more like baking a cake. To accomplish it, let’s split it up into several smaller problems that we can solve easily. For example, for any given hacker string, we’ll want to do the following:

  • Replace all the 7s with t’s.
  • Replace all the 3s with e’s.
  • Replace all the 4s with a’s.
  • Replace all the 6s with g’s.
  • Replace all the Chinese characters with spaces.

If we can do these five things for each string of hacker text, we’ll have our desired result of plain English text. Before we write any code, let’s take a look at how these functions will transform our text. First, we’ll start with replacing the 7s with t’s in figure 3.4.

Figure 3.4. Part of our hacker translate pipeline will involve replacing 7s with t’s. We’ll accomplish that by mapping a function that performs that replacement on all of our inputs.

At the top of figure 3.4, we see our unchanged input texts: garbled, unreadable hacker communications. In the middle, we have our function replace_7t, which will replace all the 7s with t’s. And on the bottom, we have no 7s in our text anywhere. This makes our texts a little more readable.

Moving on, we’ll replace all the 3s in all the hacker communications with e’s. We can see that happening in figure 3.5.

Figure 3.5. The second step in our hacker translate pipeline will involve replacing 3s with e’s. We’ll accomplish that by mapping a function that performs that replacement on all of our inputs.

At the top of figure 3.5, we see our slightly cleaned hacker texts; we’ve already replaced the 7s with t’s. In the middle, we have our replace_3e function, which works to replace the 3s with e’s. And on the bottom, we have our now more readable text. All the 3s are gone, and we have some e’s in there.

Continuing on, we’ll do the same thing with 4s and a’s and 6s and g’s, until we’ve removed all our numbers. We’ll skip discussing those functions for the sake of avoiding repetition. Once we’ve completed those steps, we’re ready to tackle those Chinese characters. We can see that in figure 3.6.

Figure 3.6. Subbing on Chinese characters is going to be the last step in our hacker_translate function chain, and we can tackle it with a map statement.

In figure 3.6, we see at the top we have mostly English sentences with Chinese characters smooshing the words together. In the middle, we have our splitting function: sub_chinese. And on the bottom, finally, we have our fully cleaned sentences.

3.2.1. Creating helper functions

Now that we’ve got our solution sketched out, let’s start writing some code. First, we’ll write all our replacement helper functions.

We’ll write all of these functions at once because they all follow a similar pattern: we take a string, find all of some character (a number), and replace it with some other character (a letter). For example, in replace_7t, we find all of the 7s and replace them with t’s. We do this with the built-in Python string method .replace. The .replace method allows us to specify as parameters which characters we want to remove and the characters with which we want to replace them, as shown in the following listing.

Listing 3.1. Replacement helper functions
def replace_7t(s):             # replace every 7 with t
    return s.replace('7','t')
def replace_3e(s):             # replace every 3 with e
    return s.replace('3','e')
def replace_6g(s):             # replace every 6 with g
    return s.replace('6','g')
def replace_4a(s):             # replace every 4 with a
    return s.replace('4','a')

That takes care of the first handful of steps. Now we want to split where the Chinese text occurs. This task is a little more involved. Because the hackers are using different Chinese characters to represent spaces, not just the same one again and again, we can’t use replace here. We have to use a regular expression. Because we’re using a regular expression, we’re going to want to create a small class that can compile it for us ahead of time. In this case, our sub_chinese function is actually going to be a method of that class. We’ll see that play out in the following listing.

Listing 3.2. Split on Chinese characters function
import re

class chinese_matcher:                               # holds the precompiled pattern

    def __init__(self):
        self.r = re.compile(r'[\u4e00-\u9fff]+')     # one or more Chinese characters

    def sub_chinese(self,s):
        return self.r.sub(" ", s)                    # sub(replacement, string)

The first thing we do here is create a class called chinese_matcher. Upon initialization, that class is going to compile a regular expression that matches Chinese characters. That regular expression is going to be a range regular expression that looks up the Unicode characters between \u4e00 (the first character of Unicode’s CJK Unified Ideographs block, which holds the most common Chinese characters) and \u9fff (the last position in that block). If you’ve used regular expressions before, you should already be familiar with this concept for matching capital letters with regular expressions like [A-Z]+, which matches one or more uppercase English characters. We’re using the same concept here, except instead of matching uppercase characters, we’re matching Chinese characters. And instead of typing in the characters directly, we’re typing in their Unicode numbers.

Having set up that regular expression, we can use it in a method. In this case, we’ll use it in a method called .sub_chinese. This method will apply the regular expression method .sub to an arbitrary string and return the result. Because we know our regular expression matches one or more Chinese characters, the result will be that every run of Chinese characters in the string is changed to a single space.
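To make the substitution concrete, here is a small standalone sketch using only the standard library. The sample message is invented for illustration; note that a compiled pattern’s sub method takes the replacement first and the string to search second.

```python
import re

# Range covering the CJK Unified Ideographs block; "+" matches a run of them.
cjk_run = re.compile(r'[\u4e00-\u9fff]+')

# Hypothetical message: English words glued together by Chinese characters.
message = "hello\u4f60\u597dworld\u4e16again"

cleaned = cjk_run.sub(" ", message)   # sub(replacement, string)
print(cleaned)                        # hello world again
```

Because the pattern uses +, two adjacent Chinese characters collapse into one space instead of two.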

3.2.2. Creating a pipeline

Now we have all of our helper functions ready and we’re ready to bake our hacker-foiling cake. The next thing to do is to chain these helper functions together. Let’s take a look at three ways to do this:

  1. Using a sequence of maps
  2. Chaining functions together with compose
  3. Creating a function pipeline with pipe

A sequence of maps

For this method, we take all of our functions and map them across the results of one another.

  • We map replace_7t across our sample messages.
  • Then we map replace_3e across the results of that.
  • Then we map replace_6g across the results of that.
  • Then we map replace_4a across the results of that.
  • Finally, we map C.sub_chinese.

The solution shown in listing 3.3 isn’t pretty, but it works. If you print the results, you’ll see all of our garbled sample sentences translated into easily readable English, with the words split apart from one another: exactly what we wanted. Remember, you need to evaluate map before you can print it!

Listing 3.3. Chaining functions by sequencing maps
C = chinese_matcher()

map(C.sub_chinese,
        map(replace_4a,
            map(replace_6g,
                map(replace_3e,
                    map(replace_7t, sample_messages)))))
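The reminder about evaluating map deserves a quick illustration. In this sketch, two of the helpers are redefined so the block runs on its own, and sample_messages is a hypothetical stand-in for the real data.

```python
def replace_7t(s):
    return s.replace('7', 't')

def replace_3e(s):
    return s.replace('3', 'e')

sample_messages = ["7h3", "l3773r"]   # made-up sample data

# Nesting maps builds a lazy chain; nothing has been translated yet.
lazy = map(replace_3e, map(replace_7t, sample_messages))
print(lazy)                  # prints a map object, not the translated strings

translated = list(lazy)      # list() forces the evaluation
print(translated)            # ['the', 'letter']

leftover = list(lazy)        # a map object can only be consumed once
print(leftover)              # []
```

Wrapping the outermost map in list() is enough; the inner maps are evaluated as the outer one pulls values through them.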

Constructing a pipeline with compose

Although we certainly can chain our functions together this way, there are better ways. We’ll take a look at two functions that can help us do this:

  1. compose
  2. pipe

Each of these functions is in the toolz package, which you can install with pip like you would most Python packages: pip install toolz.

First, let’s look at compose. The compose function takes our helper functions in the reverse order that we would like them applied and returns a function that applies them in the desired order. For example, compose(foo, bar, bizz) would apply bizz, then bar, then foo. In the specific context of our problem, that would look like listing 3.4.

In listing 3.4, you can see that we call the compose function and pass it all the functions we want to include in our pipeline. We pass them in reverse order because compose is going to apply them backwards. We store the output of our compose function, which is itself a function, to a variable. And then we can call that variable or pass it along to map, which applies it to all the sample messages.

Listing 3.4. Using compose to create a function pipeline
from toolz.functoolz import compose

hacker_translate = compose(C.sub_chinese, replace_4a, replace_6g,
                           replace_3e, replace_7t)

map(hacker_translate, sample_messages)

If you print this, you’ll notice that the results are the same as when we chained our functions together with a sequence of map statements. The major difference is that we’ve cleaned up our code quite a bit, and here we only have one map statement.
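The right-to-left ordering is easy to get backwards, so here is a sketch that makes it visible. It does not use toolz: compose_sketch is a hypothetical stand-in written with functools.reduce that follows the ordering described above, and add_one and double are invented helpers whose order of application changes the answer.

```python
from functools import reduce

def compose_sketch(*funcs):
    """Illustrative stand-in: returns a function that applies funcs right to left."""
    def composed(value):
        # reversed() makes the last function listed run first
        return reduce(lambda acc, fn: fn(acc), reversed(funcs), value)
    return composed

def add_one(x):
    return x + 1

def double(x):
    return x * 2

double_after_add = compose_sketch(double, add_one)   # add_one runs first
add_after_double = compose_sketch(add_one, double)   # double runs first

print(double_after_add(3))   # (3 + 1) * 2 = 8
print(add_after_double(3))   # (3 * 2) + 1 = 7

# The composed function is an ordinary function, so it can be handed to map.
results = list(map(double_after_add, [1, 2, 3]))
print(results)               # [4, 6, 8]
```

Reading the argument list from right to left gives the order in which the data flows.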

Pipelines with pipe

Next, let’s look at pipe. The pipe function will pass a value through a pipeline. It expects the value to pass and the functions to apply to it. Unlike compose, pipe expects the functions to be in the order we want to apply them. So pipe(x, foo, bar, bizz) applies foo to x, then bar to that value, and finally bizz to that value. Another important difference between compose and pipe is that pipe evaluates each of the functions and returns a result, so if we want to pass it to map, we actually have to wrap it in a function definition. Again, turning to our specific example, that will look something like the following listing.

Listing 3.5. Using pipe to create a function pipeline
from toolz.functoolz import pipe

def hacker_translate(s):
    return pipe(s, replace_7t, replace_3e, replace_6g,
                   replace_4a, C.sub_chinese)

map(hacker_translate, sample_messages)

Here, we create a function that takes our input and returns that value after it has been piped through a sequence of functions that we pass to pipe as parameters. In this case, we’re starting with replace_7t, then applying replace_3e, replace_6g, replace_4a, and lastly C.sub_chinese, in that order. The result, as with compose, is the same as when we chained the functions together using a sequence of maps (you’re free to print out the results and prove this to yourself), but the way we get there is a lot cleaner.
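To see why pipe needs the wrapper function, consider this standard-library sketch. pipe_sketch is a hypothetical stand-in that mirrors the left-to-right ordering described above (it is not the toolz implementation), and the helpers are invented for the example.

```python
from functools import reduce

def pipe_sketch(value, *funcs):
    """Illustrative stand-in: pass value through funcs from left to right."""
    return reduce(lambda acc, fn: fn(acc), funcs, value)

def add_one(x):
    return x + 1

def double(x):
    return x * 2

# Calling the pipe runs it immediately and hands back a value, not a function.
single = pipe_sketch(3, add_one, double)
print(single)                 # (3 + 1) * 2 = 8

# map needs a function, so the pipe call is wrapped in a definition.
def add_then_double(x):
    return pipe_sketch(x, add_one, double)

piped = list(map(add_then_double, [1, 2, 3]))
print(piped)                  # [4, 6, 8]
```

The functions are listed in the order the data flows through them, which many readers find easier to scan than the reversed order compose expects.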

Creating pipelines of helper functions provides two major advantages. The code becomes

  • Very readable and clear
  • Modular and easy to edit

The former advantage, increasing readability, is especially true when we have to do complex data transformations or when we want to perform a sequence of possibly related, or possibly unrelated, actions. For example, having just been introduced to the notion of compose, I’m pretty confident you could make a guess at what this pipeline does:

my_pipeline = compose(reverse, remove_vowels, make_uppercase)

The latter advantage, making code modular and easy to edit, is a major perk when we’re dealing with dynamic situations. For example, let’s say our hacker adversaries change their ruse so they’re now replacing even more letters! We could simply add new functions into our pipeline to adjust. If we find that the hackers stop replacing a letter, we can remove that function from the pipeline.
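One way to picture that modularity is to keep the steps in a list and run whatever the list currently holds. This is a sketch under assumptions: run_steps and replace_0o are hypothetical names invented here, and the input string is made up.

```python
from functools import reduce

def replace_7t(s):
    return s.replace('7', 't')

def replace_3e(s):
    return s.replace('3', 'e')

def replace_0o(s):
    # Hypothetical new rule: suppose the hackers start writing 0 for o.
    return s.replace('0', 'o')

def run_steps(value, steps):
    """Apply each step in order to the running value."""
    return reduce(lambda acc, fn: fn(acc), steps, value)

steps = [replace_7t, replace_3e]
before = run_steps("n073", steps)
print(before)                  # n0te

steps.append(replace_0o)       # adjust to the new trick by adding one function
after_add = run_steps("n073", steps)
print(after_add)               # note

steps.remove(replace_3e)       # drop a rule the hackers no longer use
after_remove = run_steps("n073", steps)
print(after_remove)            # not3
```

None of the existing helpers had to change; only the list of steps did.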

A hacker translate pipeline

Lastly, let’s return to our map example of this problem. At the beginning, we’d hoped to have one function, hacker_translate, that took us from garbled hacker secrets to plain English. We can see what we really did in figure 3.7.

Figure 3.7. We can solve the hacker translation problem by constructing a chain of functions that each solve one part of the problem.

Figure 3.7 shows our input values up top and our output values on the bottom, and through the middle we see how our five helper functions change our inputs. Breaking our complicated problem up into several small problems made coding the solution to this problem rather straightforward, and with map, we can easily apply the pipeline to any number of inputs that we need.

3.3. Twitter demographic projections

In the previous section, we looked at how to foil a group of hackers by chaining small functions together and applying them across all the hackers’ messages. In this section, we’ll dive even deeper into what we can do using small, simple helper functions chained together.

Scenario

The head of marketing has a theory that male customers are more likely to engage with our product on social media than female customers and has asked us to write an algorithm to predict the gender of Twitter users mentioning our product based on the text of their posts. The marketing head has provided us with lists of Tweet IDs for each customer. We have to write a script that turns these lists of IDs into both a score representing how strongly we believe them to be of a given gender and a prediction about their gender.

To tackle this problem, again, we’re going to start with a big picture map diagram. We can see that in figure 3.8.

Figure 3.8. The map diagram for our gender_prediction_pipeline demonstrates the beginning and end of the problem: we’ll take a list of Tweet IDs and convert them into predictions about a user.

The map diagram in figure 3.8 allows us to see our input data on the top and our output data on the bottom, which will help us think about how to solve the problem. On the top, we can see that we have a sequence of lists of numbers, each representing a Tweet ID. That will be our input format. And on the bottom, we see that we have a sequence of dicts, each with a key for "score" and "gender". This gives us a sense of what we’ll have to do with our function gender_prediction_pipeline.

Now, predicting the gender of a Twitter user from several Tweet IDs is not one task; it’s actually several tasks. To accomplish this, we’re going to have to do the following:

  • Retrieve the tweets represented by those IDs
  • Extract the tweet text from those tweets
  • Tokenize the extracted text
  • Score the tokens
  • Score users based on their tweet scores
  • Categorize the users based on their score

Looking at the list of tasks, we can actually break down our process into two transformations: those that are happening at the user level and those that are happening at the tweet level. The user-level transformations include things like scoring the user and categorizing the user. The tweet-level transformations include things like retrieving the tweet, extracting the text, tokenizing the text, and scoring the text. If we were still working with for loops, this type of situation would mean that we would need a nested for loop. Since we’re working with map, we’ll have to have a map inside our map.
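The correspondence between a nested for loop and a map inside a map can be sketched with toy data. Everything here is a stand-in: the numbers play the role of tweets, score_item plays the role of the tweet-level work, and summing plays the role of the user-level work.

```python
users = [[1, 2, 3], [10, 20]]    # hypothetical: one inner list per user

def score_item(x):
    """Stand-in for the tweet-level transformation."""
    return x * 2

# Version 1: nested for loops
loop_result = []
for user in users:
    scored = []
    for item in user:
        scored.append(score_item(item))
    loop_result.append(sum(scored))

# Version 2: a map inside a map
def score_user_sketch(items):
    """Stand-in for the user-level transformation; contains the inner map."""
    return sum(map(score_item, items))

map_result = list(map(score_user_sketch, users))   # the outer map

print(loop_result)   # [12, 60]
print(map_result)    # [12, 60]
```

The inner map lives inside the user-level helper, so the outer map only ever sees one function.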

3.3.1. Tweet-level pipeline

Let’s look at our tweet-level transformation first. At the tweet level, we’ll convert a Tweet ID into a single score for that tweet, representing the gender score of that tweet. We’ll score the tweets by giving them points based on the words they use. Some words will make the tweet more of a “man’s tweet,” and some will make the tweet more of a “woman’s tweet.” We can see this process playing out in figure 3.9.

Figure 3.9. We can chain four functions together into a pipeline that will accomplish each of the subparts of our problem.

Text classification

Classifying a tweet by assigning scores to words it uses may seem simplistic, but it’s actually not too far from how both academia and industry approach the situation. Lexicon-based methods of classification, which assign words points and then roll those points up into an overall score, achieve remarkable performance given their simplicity. And because they are transparent, they offer the benefit of interpretability to practitioners.

In this chapter, we only approximate the real thing, but you can find a state-of-the-art classifier on my GitHub page: https://github.com/jtwool/TwitterGenderPredictor.

Figure 3.9 shows the several transformations that our tweets will undergo as we transform them from ID to score. Starting at the top left, we see that we start with Tweet IDs as an input, then we pass them through a get_tweet_from_id function and get tweet objects back. Next, we pass those tweet objects through a tweet_to_text function, which turns the tweet objects into the text of those tweets. Then, we tokenize the tweets by applying our tokenize_text function. After that, we score the tweets with our score_text function.

Turning our attention to user-level transformations, the process here is a little simpler:

  1. We apply the tweet-level process to each of the user’s tweets.
  2. We take the average of the resulting tweet scores to get our user-level score.
  3. We categorize the user as either "Male" or "Female".

Figure 3.10 shows the user-level process playing out.

Figure 3.10. We can chain small functions together to turn lists of users’ Tweet IDs into scores, then into averages, and, finally, into predictions about their demographics.

We can see that each user starts as a list of Tweet IDs. Applying our score_user function across all of these lists of Tweet IDs, we get back a single score for each user. Then, we can use our categorize_user function to turn this score into a dict that includes both the score and the predicted gender of the user, just like we wanted at the outset.

These map diagrams give us a roadmap for writing our code. They help us see what data transformations need to take place and where we’re able to construct pipelines. For example, we now know that we need two function chains: one for the tweets and one for the users. With that in mind, let’s start tackling the tweet pipeline.

Our tweet pipeline will consist of four functions. Let’s tackle them in this order:

  1. get_tweet_from_id
  2. tweet_to_text
  3. tokenize_text
  4. score_text

Our get_tweet_from_id function is responsible for taking a Tweet ID as input, looking up that Tweet ID on Twitter, and returning a tweet object that we can use. The easiest way to scrape Twitter data will be to use the python-twitter package. You can install python-twitter easily with pip:

pip install python-twitter

Once you have python-twitter set up, you’ll need to set up a developer account with Twitter. (See the “Twitter developer accounts” sidebar.) You can do that at https://developer.twitter.com/. If you have a Twitter account already, there’s no need to create another account; you can sign in with the account you already have. With your account set up, you’re ready to apply for what Twitter calls an app. You’ll need to fill out an application form, and if you tell Twitter that you’re using this book to learn parallel programming, they’ll be happy to give you an account. When you’re prompted to describe your use case, I suggest entering the following:

The core purpose of my app is to learn parallel programming techniques. I am following along with a scenario provided in chapter 3 of Mastering Large Datasets with Python, by JT Wolohan, published by Manning Publications.

I intend to do a lexical analysis of fewer than 1,000 Tweets.

I do not plan on using my app to Tweet, Retweet, or “like” content.

I will not display any Tweets anywhere online.

Twitter developer accounts

Because this scenario involves Twitter scraping, the automated collection of Twitter data, I would like to offer you the opportunity to do real Twitter scraping. Doing so requires you to request a Twitter developer account. These developer accounts used to be much easier to get. Twitter is beginning to restrict who can develop on its platform because it wants to crack down on bots. If you don’t want to sign up for Twitter, you don’t want to sign up for a developer account, or you don’t want to wait, you can proceed without signing up for a developer account.

In the repository for this book, I include text that can stand in for the tweets, and you can omit the first two functions (get_tweet_from_id and tweet_to_text) from your tweet-level pipeline.

Once you have your Twitter developer account set up and confirmed by Twitter (this may take an hour or two), you’ll navigate to your app and find your consumer key, your consumer secret, your access token key, and your access token secret (figure 3.11). These are the credentials for your app. They tell Twitter to associate your requests with your app.

Figure 3.11. The “Keys and Tokens” tab in your Twitter developer account provides you with API keys, access tokens, and access secrets for your project.

With your developer account set up and python-twitter installed, we’re finally ready to start coding our tweet-level pipeline. The first thing we do is import the python-twitter library. This is the library we just installed. It provides a whole host of convenient functions for working with the Twitter API. Before we can use any of those nice functions, however, we need to authenticate our app. We do so by initiating an Api class from the library. The class takes our application credentials, which we get from the Twitter developers website, and uses them when it makes calls to the Twitter API.

With this class ready to go, we can then create a function to return tweets from Tweet IDs. We’ll need to pass our API object to this function so we can use it to make the requests to Twitter. Once we do that, we can use the API object’s .GetStatus method to retrieve Tweets by their ID. Tweets retrieved in this way come back as Python objects, perfect for using in our script.

We’ll use that fact in our next function, tweet_to_text, which takes the tweet object and returns its text. This function is very short. It reads the text property of our tweet object and returns that value. The text property of tweet objects that python-twitter returns contains, as we would expect, the text of the tweets.

With the tweet text ready, we can tokenize it. Tokenization is a process in which we break text up into smaller units that we can analyze. In some cases, this can be pretty complicated, but for our purpose, we’ll split text wherever white space occurs to separate words from one another. For a sentence like "This is a tweet", we would get a list containing each word: ["This", "is", "a", "tweet"]. We’ll use the built-in string .split method to do that.
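A quick sketch of whitespace tokenization with the built-in .split method follows; the second string is a made-up example showing that, with no arguments, .split treats any run of spaces, tabs, or newlines as one separator.

```python
tokens = "This is a tweet".split()
print(tokens)    # ['This', 'is', 'a', 'tweet']

# Repeated spaces, tabs, and newlines never produce empty tokens.
messy = "two  spaces\tand\na newline".split()
print(messy)     # ['two', 'spaces', 'and', 'a', 'newline']
```

Punctuation stays attached to its word with this approach ("tweet!" remains one token), which is part of why real tokenizers get more complicated.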

Once we have our tokens, we need to score them. For that, we’ll use our score_text function. This function will look up each token in a lexicon, retrieve its score, and then add all of those scores together to get an overall score for the tweet. To do that, we need a lexicon, a list of words and their associated scores. We’ll use a dict to accomplish that here. To look up the scores for each word, we can map the dict’s .get method across the list of words.

The dict .get method allows us to look up a key and provide a default value in case we don’t find it. This is useful in our case because we want words that we don’t find in our lexicon to have a neutral value of zero.

To turn this method into a function, we use what’s called a lambda function. The lambda keyword allows us to specify variables and how we want to transform them. For example, lambda x: x+2 defines a function that adds two to whatever value is passed to it. The code lambda x: lexicon.get(x, 0) looks up whatever it is passed in our lexicon and returns either the value or 0 (if it doesn’t find anything). We’ll often use lambda for short functions.
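Here is the lookup pattern in isolation. The two-word lexicon and the token list are made up for the sketch; only dict.get, lambda, map, and sum are involved.

```python
lexicon = {"the": 1, "of": -1}           # tiny hypothetical lexicon
tokens = ["the", "cat", "of", "the"]     # hypothetical tokens

lookup = lambda x: lexicon.get(x, 0)     # words missing from the lexicon score 0

scores = list(map(lookup, tokens))
print(scores)        # [1, 0, -1, 1]

total = sum(scores)
print(total)         # 1
```

Without the default of 0, lexicon.get("cat") would return None and the sum would fail, so the default does real work here.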

Finally, with all of those helper functions written, we can construct our score_tweet pipeline. This pipeline will take a Tweet ID, pass it through all of these helper functions, and return the result. For this process, we’ll use the pipe function from the toolz library. This pipeline represents the entirety of what we want to do at the tweet level. We can see all of the code needed in the following listing.

Listing 3.6. Tweet-level pipeline
from toolz import pipe                                      # chains the helpers below
import twitter                                              # the python-twitter package

Twitter = twitter.Api(consumer_key="",                      # fill in your app's credentials
                      consumer_secret="",
                      access_token_key="",
                      access_token_secret="")

def get_tweet_from_id(tweet_id, api=Twitter):               # look up a tweet object by its ID
    return api.GetStatus(tweet_id, trim_user=True)

def tweet_to_text(tweet):                                   # pull the text out of a tweet object
    return tweet.text

def tokenize_text(text):                                    # split the text on whitespace
    return text.split()

def score_text(tokens):                                     # add up the lexicon score of every token
    lexicon = {"the":1, "to":1, "and":1,                    # toy lexicon: word -> points
             "in":1, "have":1, "it":1,
             "be":-1, "of":-1, "a":-1,
             "that":-1, "i":-1, "for":-1}
    return sum(map(lambda x: lexicon.get(x, 0), tokens))    # unknown words score 0

def score_tweet(tweet_id):                                  # Tweet ID in, tweet score out
    return pipe(tweet_id, get_tweet_from_id, tweet_to_text,
                          tokenize_text, score_text)

3.3.2. User-level pipeline

Having constructed our tweet-level pipeline, we’re ready to construct our user-level pipeline. As we laid out previously, we’ll need to do three things for our user-level pipeline:

  1. Apply the tweet pipeline to all of the user’s tweets
  2. Take the average of the scores of those tweets
  3. Categorize the user based on that average

For conciseness, we’ll collapse the first two actions into one function, and we’ll let the third action be a function all its own. When all is said and done, our user-level helper functions will look like the following listing.

Listing 3.7. User-level helper functions
from toolz import compose

def score_user(tweets):                            #1
    N = len(tweets)                                #2
    total = sum(map(score_tweet, tweets))          #3
    return total/N                                 #4

def categorize_user(user_score):                   #5
    if user_score > 0:                             #6
        return {"score":user_score,
                "gender": "Male"}
    return {"score":user_score,                    #7
            "gender":"Female"}

gender_prediction_pipeline = compose(categorize_user, score_user)    #8

In our first user-level helper function, we need to accomplish two things: score all of the user’s tweets, then find the average score. We already know how to score their tweets: we just built a pipeline for that exact purpose! To score the tweets, we’ll map that pipeline across all the tweets. However, we don’t need the scores themselves; we need the average score.

To find a simple average, we want to take the sum of the values and divide it by the number of values that we’re summing. To find the sum, we can use Python’s built-in sum function on the tweet scores. To find the number of tweets, we can find the length of the list with the len function. With those two values ready, we can calculate the average by dividing the sum by the length.
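As a minimal sketch of the averaging step just described, the snippet below divides a sum by a length. The name average_score and the choice to return 0.0 for an empty list are assumptions made for illustration; score_user in listing 3.7 has no such guard and would raise a ZeroDivisionError for a user with no tweets.

```python
def average_score(scores):
    """Return the mean of a list of numbers (0.0 for an empty list, by assumption)."""
    if not scores:                      # guard: avoid dividing by zero
        return 0.0
    return sum(scores) / len(scores)    # sum of the values over how many there are


print(average_score([3, -1, 1]))   # 1.0
print(average_score([]))           # 0.0
```

Whether an empty list should yield 0.0, None, or an error depends on how the downstream categorization should treat users with no data.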

This will give us an average tweet score for each user. With that, we can categorize the user as being either "Male" or "Female". To make that categorization, we’ll create another small helper function: categorize_user. This function will check to see if the user’s average score is greater than zero. If it is, it will return a dict with the score and a gender prediction of "Male". If their average score is zero or less, it will return a dict with the score and a gender prediction of "Female".

Those two quick helper functions are all we’ll need for our user-level pipeline. Now we can compose them, remembering to supply them in reverse order from how we want to apply them. That means we put our categorization function first, because we’re using it last, and our scoring function last, because we’re using it first. The result is a new function, gender_prediction_pipeline, that we can use to make gender predictions about a user.
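The argument order of compose is easy to get backward, so here is a small sketch contrasting it with pipe. The helpers double and add_one are hypothetical, and the fallback definitions of pipe and compose are stand-ins assumed only so the snippet runs where toolz is not installed.

```python
from functools import reduce

try:
    from toolz import compose, pipe
except ImportError:
    # Stand-ins used only if toolz is missing (an assumption for portability).
    def pipe(data, *funcs):
        return reduce(lambda acc, f: f(acc), funcs, data)

    def compose(*funcs):
        return lambda data: pipe(data, *reversed(funcs))


def double(x):       # hypothetical helper
    return x * 2


def add_one(x):      # hypothetical helper
    return x + 1


double_then_add = compose(add_one, double)   # rightmost function runs first
print(double_then_add(5))                    # 11
print(pipe(5, double, add_one))              # 11, same steps in reading order
print(compose(double, add_one)(5))           # 12, order swapped
```

pipe applies functions immediately to a value, while compose returns a reusable function, which is why compose suits a pipeline that will later be handed to map.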

3.3.3. Applying the pipeline

Now that we have both our user-level and tweet-level function chains ready, all that’s left to do is apply the functions to our data. To do so, we can either use Tweet IDs with our full tweet-level function chain, or (if you decided not to sign up for a Twitter developer account) we can use just the text of the tweets. If you’ll be using just the tweet text, make sure to create a tweet-level function chain (score_tweet) that omits the get_tweet_from_id and tweet_to_text functions.
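A text-only score_tweet might look like the sketch below. It reuses the lexicon and helper names from listing 3.6; the module-level LEXICON constant and the fallback pipe are assumptions made so the snippet runs on its own, without a Twitter account or toolz.

```python
from functools import reduce

try:
    from toolz import pipe
except ImportError:
    # Stand-in used only if toolz is missing (an assumption for portability).
    def pipe(data, *funcs):
        return reduce(lambda acc, f: f(acc), funcs, data)

# Same toy lexicon as listing 3.6, lifted to a constant for this sketch.
LEXICON = {"the": 1, "to": 1, "and": 1, "in": 1, "have": 1, "it": 1,
           "be": -1, "of": -1, "a": -1, "that": -1, "i": -1, "for": -1}


def tokenize_text(text):
    return text.split()


def score_text(tokens):
    return sum(LEXICON.get(token, 0) for token in tokens)


def score_tweet(text):
    # No get_tweet_from_id or tweet_to_text: the input is already text.
    return pipe(text, tokenize_text, score_text)


print(score_tweet("i have to throw product x in the trash"))   # 3
```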

Applying the pipeline to Tweet IDs

Applying our pipelines in the first instance might look something like listing 3.8. There, we start by initializing our data. The data we’re starting with is four lists of five Tweet IDs. Each of the four lists represents a user. The Tweet IDs don’t actually come from the same user; however, they are real tweets, randomly sampled from the internet.

Listing 3.8. Applying the gender prediction pipeline to Tweet IDs
from multiprocessing import Pool

users_tweets = [                                                 #1
[1056365937547534341, 1056310126255034368, 1055985345341251584,
 1056585873989394432, 1056585871623966720],
[1055986452612419584, 1056318330037002240, 1055957256162942977,
 1056585921154420736, 1056585896898805766],
[1056240773572771841, 1056184836900175874, 1056367465477951490,
 1056585972765224960, 1056585968155684864],
[1056452187897786368, 1056314736546115584, 1055172336062816258,
 1056585983175602176, 1056585980881207297]]

with Pool() as P:                                                #2
    print(P.map(gender_prediction_pipeline, users_tweets))

With our data initialized, we can now apply our gender_prediction_pipeline. We’ll do that in a way we introduced last chapter: with a parallel map. We first call Pool to gather up some processors, then we use the .map method of that Pool to apply our prediction function in parallel.

If we were doing this in an industry setting, this would be an excellent opportunity to use a parallel map for two reasons:

  1. We’re doing what amounts to the same task for each user.
  2. Both retrieving the data from the web and finding the scores of all those tweets are relatively time- and memory-consuming operations.

To the first point, whenever we find ourselves doing the same thing over and over again, we should think about using parallelization to speed up our work. This is especially true if we’re working on a dedicated machine (like our personal laptop or a dedicated compute cluster) and don’t need to concern ourselves with hoarding processing resources other people or applications may need.

To the second point, we’re best off using parallel techniques in situations in which the calculations are at least somewhat difficult or time-consuming. If the work we’re trying to do in parallel is too easy, we may spend more time dividing the work and reassembling the results than we would just doing it in a standard linear fashion.

Applying the pipeline to tweet text

Applying the pipeline to tweet text directly will look very similar to applying the pipeline to Tweet IDs, as shown in the following listing.

Listing 3.9. Applying the gender prediction pipeline to tweet text
from multiprocessing import Pool

users_tweets = [                                                           #1
        ["i think product x is so great", "i use product x for everything",
        "i couldn't be happier with product x"],
        ["i have to throw product x in the trash",
        "product x... the worst value for your money"],
        ["product x is mostly fine", "i have no opinion of product x"]]

with Pool() as P:                                                          #2
    print(P.map(gender_prediction_pipeline, users_tweets))

The only change in listing 3.9 versus listing 3.8 is our input data. Instead of having tweet IDs that we want to find on Twitter, retrieve, and score, we can score the tweet text directly. Because our score_tweet function chain removes the get_tweet_from_id and tweet_to_text helper functions, the gender_prediction_pipeline will work exactly as we want.
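Putting the text-only pieces together, an end-to-end version could be sketched as follows. The helper names follow listings 3.6 and 3.7; the LEXICON constant, the predict_gender wrapper (a plain top-level function, so Pool can pickle it by name even when the fallback compose is in use), and the fallback pipe and compose are assumptions added so the example runs without toolz or Twitter credentials.

```python
from functools import reduce
from multiprocessing import Pool

try:
    from toolz import compose, pipe
except ImportError:
    # Stand-ins used only if toolz is missing (an assumption for portability).
    def pipe(data, *funcs):
        return reduce(lambda acc, f: f(acc), funcs, data)

    def compose(*funcs):
        return lambda data: pipe(data, *reversed(funcs))

LEXICON = {"the": 1, "to": 1, "and": 1, "in": 1, "have": 1, "it": 1,
           "be": -1, "of": -1, "a": -1, "that": -1, "i": -1, "for": -1}


def tokenize_text(text):
    return text.split()


def score_text(tokens):
    return sum(LEXICON.get(token, 0) for token in tokens)


def score_tweet(text):
    # Tweet-level chain for text input: tokenize, then score.
    return pipe(text, tokenize_text, score_text)


def score_user(tweets):
    # Average tweet score for one user.
    return sum(map(score_tweet, tweets)) / len(tweets)


def categorize_user(user_score):
    if user_score > 0:
        return {"score": user_score, "gender": "Male"}
    return {"score": user_score, "gender": "Female"}


gender_prediction_pipeline = compose(categorize_user, score_user)


def predict_gender(tweets):
    # Top-level wrapper so worker processes can look the function up by name.
    return gender_prediction_pipeline(tweets)


users_tweets = [
    ["i think product x is so great", "i use product x for everything",
     "i couldn't be happier with product x"],
    ["i have to throw product x in the trash",
     "product x... the worst value for your money"],
    ["product x is mostly fine", "i have no opinion of product x"]]

if __name__ == "__main__":
    with Pool() as P:
        # Predicted genders, in order: Female, Male, Female
        print(P.map(predict_gender, users_tweets))
```

The `if __name__ == "__main__":` guard is a common precaution with multiprocessing, since some platforms start workers by re-importing the main module.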

That it is so easy to modify our pipelines is one of the major reasons why we want to assemble them in the first place. When conditions change, as they often do, we can quickly and easily modify our code to respond to them. We could even create two function chains if we envisioned having to handle both situations. One function chain could be score_tweet_from_text and would work on tweets provided in text form. Another function chain could be score_tweet_from_id and would categorize tweets provided in Tweet ID form.
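One way the two chains might sit side by side is sketched below. To keep it runnable without credentials, FAKE_TWEETS and get_tweet_text_from_id are hypothetical stand-ins that replace the real get_tweet_from_id and tweet_to_text steps; the fallback compose is likewise an assumption for environments without toolz.

```python
from functools import reduce

try:
    from toolz import compose
except ImportError:
    # Stand-in used only if toolz is missing (an assumption for portability).
    def compose(*funcs):
        return lambda data: reduce(lambda acc, f: f(acc), reversed(funcs), data)

LEXICON = {"the": 1, "to": 1, "and": 1, "in": 1, "have": 1, "it": 1,
           "be": -1, "of": -1, "a": -1, "that": -1, "i": -1, "for": -1}

# Hypothetical in-memory store standing in for the Twitter API.
FAKE_TWEETS = {1: "i have to throw product x in the trash",
               2: "product x is mostly fine"}


def get_tweet_text_from_id(tweet_id, store=FAKE_TWEETS):
    # Stand-in for get_tweet_from_id followed by tweet_to_text.
    return store[tweet_id]


def tokenize_text(text):
    return text.split()


def score_text(tokens):
    return sum(LEXICON.get(token, 0) for token in tokens)


# Chain for text input: tokenize, then score.
score_tweet_from_text = compose(score_text, tokenize_text)

# Chain for ID input: fetch the text, then reuse the text chain unchanged.
score_tweet_from_id = compose(score_tweet_from_text, get_tweet_text_from_id)

print(score_tweet_from_text("i have to throw product x in the trash"))   # 3
print(score_tweet_from_id(1))                                            # 3
print(score_tweet_from_id(2))                                            # 0
```

Building the ID chain on top of the text chain means a change to the scoring logic only has to be made in one place.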

Looking back throughout this example, we created six helper functions and two pipelines. For those pipelines, we used both the pipe function and the compose function from the toolz package. We also used these functions with a parallel map to pull down tweets from the internet in parallel. Using helper functions and function chains makes our code easy to understand and modify and plays nicely with our parallel map, which wants to apply the same function over and over again.

3.4. Exercises

3.4.1. Helper functions and function pipelines

In this chapter, you’ve learned about the interrelated ideas of helper functions and function pipelines. In your own words, define both of those terms, then describe how they are related.

3.4.2. Math teacher trick

A classic math teacher trick has students perform a series of arithmetic operations on an “unknown” number, and at the end, the teacher guesses the number the students are thinking of. The trick is that the final number is always a constant the teacher knows in advance. One such example is doubling a number, adding 10, halving it, and subtracting the original number. Using a series of small helper functions chained together, map this process across all numbers between 1 and 100. How does the teacher always know what number you’re thinking of?

Example
map(teacher_trick, range(1,101))
>>> [?,?,?,?,...,?]

3.4.3. Caesar’s cipher

Caesar’s cipher is an old way of constructing secret codes in which one shifts the position of a letter by 3 places, so A becomes D, B becomes E, C becomes F, and so on. Chain three functions together to create this cypher: one to convert a letter to an integer, one to add 3 to a number, and one to convert a number to a letter. Apply this cypher to a word by mapping the chained functions over a string. Create one new function and a new pipeline to reverse your cypher.

Example
map(caesars_cypher,["this","is","my","sentence"])
>>> ["wklv","lv","pb","vhqwhqfh"]

Summary

  • Designing programs with small helper functions makes hard problems easy to solve by breaking them up into bite-sized pieces.
  • When we pass data through a function pipeline with pipe, pipe expects the input data as its first argument and the functions, in the order we want to apply them, as the remaining arguments.
  • When we create a function chain with compose, we pass the functions in our function chain as arguments in reverse order, and the resulting function applies that chain.
  • Constructing function chains and pipelines is useful because they’re modular, they play very nicely with map, and we can readily move them into parallel workflows, such as by using the Pool() technique we learned in chapter 2.
  • We can simplify working with nested data structures by using nested function pipelines, which we can apply with map.