Chapter 6. Deep learning for text and sequences
This chapter covers
- Preprocessing text data into useful representations
- Working with recurrent neural networks
- Using 1D convnets for sequence processing
This chapter explores deep-learning models that can process text (understood as sequences of words or sequences of characters), timeseries, and sequence data in general. The two fundamental deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets, the one-dimensional version of the 2D convnets that we covered in the previous chapters. We’ll discuss both of these approaches in this chapter.
Applications of these algorithms include the following:
- Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
- Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
- Sequence-to-sequence learning, such as decoding an English sentence into French
- Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
- Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data
This chapter’s examples focus on two narrow tasks: sentiment analysis on the IMDB dataset, a task we approached earlier in the book, and temperature forecasting. But the techniques demonstrated for these two tasks are relevant to all the applications just listed, and many more.
Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it's most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can use text to produce a basic form of natural-language understanding, sufficient for applications including document classification, sentiment analysis, author identification, and even question-answering (QA) (in a constrained context). Of course, keep in mind throughout this chapter that none of these deep-learning models truly understands text in a human sense; rather, these models can map the statistical structure of written language, which is sufficient to solve many simple textual tasks. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.
Like all other neural networks, deep-learning models don't take as input raw text: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.
Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks. There are multiple ways to associate a vector with a token. In this section, I’ll present two major ones: one-hot encoding of tokens, and token embedding (typically used exclusively for words, and called word embedding). The remainder of this section explains these techniques and shows how to use them to go from raw text to a Numpy tensor that you can send to a Keras network.
Understanding n-grams and bag-of-words
Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.
Here's a simple example. Consider the sentence "The cat sat on the mat." It may be decomposed into the following set of 2-grams:
{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
It may also be decomposed into the following set of 3-grams:
{"The", "The cat", "cat", "cat sat", "The cat sat", "sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}
Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term bag here refers to the fact that you're dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words.
Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning. One-dimensional convnets and recurrent neural networks, introduced later in this chapter, are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences. For this reason, we won’t cover n-grams any further in this book. But do keep in mind that they’re a powerful, unavoidable feature-engineering tool when using lightweight, shallow text-processing models such as logistic regression and random forests.
One-hot encoding is the most common, most basic way to turn a token into a vector. You saw it in action in the initial IMDB and Reuters examples in chapter 3 (done with words, in that case). It consists of associating a unique integer index with every word and then turning this integer index i into a binary vector of size N (the size of the vocabulary); the vector is all zeros except for the ith entry, which is 1.
Of course, one-hot encoding can be done at the character level, as well. To unambiguously drive home what one-hot encoding is and how to implement it, listings 6.1 and 6.2 show two toy examples: one for words, the other for characters.
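In the spirit of the word-level version, here is a minimal sketch; the two sample sentences, the whitespace tokenization, and the cap of 10 words per sample are placeholder choices for illustration:

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    for word in sample.split():                       # crude tokenization: split on whitespace
        if word not in token_index:
            token_index[word] = len(token_index) + 1  # index 0 is deliberately left unused

# Vectorize the samples; only consider the first max_length words in each sample.
max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.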
Note that Keras has built-in utilities for doing one-hot encoding of text at the word level or character level, starting from raw text data. You should use these utilities, because they take care of a number of important features such as stripping special characters from strings and only taking into account the N most common words in your dataset (a common restriction, to avoid dealing with very large input vector spaces).
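A sketch of what using these utilities might look like with the Tokenizer class from keras.preprocessing.text; the sample sentences and the 1,000-word cap are placeholders:

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)    # only keep the 1,000 most common words
tokenizer.fit_on_texts(samples)          # builds the internal word index

sequences = tokenizer.texts_to_sequences(samples)                    # lists of integer indices
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # one-hot encoded matrix

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))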
A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, you can hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you've seen all of the available data). The one drawback of this approach is that it's susceptible to hash collisions: two different words may end up with the same hash, and subsequently any machine-learning model looking at these hashes won't be able to tell the difference between these words. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.
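A toy sketch of the hashing trick; Python's built-in hash function stands in for a dedicated lightweight hashing function, and the 1,000-dimensional hashing space is an arbitrary placeholder:

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

dimensionality = 1000   # size of the hashing space; collisions get likelier as the vocabulary grows
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = abs(hash(word)) % dimensionality   # hash the word into a slot between 0 and 999
        results[i, j, index] = 1.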
Another popular and powerful way to associate a vector with a word is the use of dense word vectors, also called word embeddings. Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros), and very high-dimensional (same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors (that is, dense vectors, as opposed to sparse vectors); see figure 6.2. Unlike the word vectors obtained via one-hot encoding, word embeddings are learned from data. It’s common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or greater (capturing a vocabulary of 20,000 tokens, in this case). So, word embeddings pack more information into far fewer dimensions.
Figure 6.2. Whereas word representations obtained from one-hot encoding or hashing are sparse, high-dimensional, and hardcoded, word embeddings are dense, relatively low-dimensional, and learned from data.
There are two ways to obtain word embeddings:
- Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
- Load into your model word embeddings that were precomputed using a different machine-learning task than the one you're trying to solve. These are called pretrained word embeddings.
Let’s look at both.
The simplest way to associate a dense vector with a word is to choose the vector at random. The problem with this approach is that the resulting embedding space has no structure: for instance, the words accurate and exact may end up with completely different embeddings, even though they're interchangeable in most sentences. It's difficult for a deep neural network to make sense of such a noisy, unstructured embedding space.
To get a bit more abstract, the geometric relationships between word vectors should reflect the semantic relationships between these words. Word embeddings are meant to map human language into a geometric space. For instance, in a reasonable embedding space, you would expect synonyms to be embedded into similar word vectors; and in general, you would expect the geometric distance (such as L2 distance) between any two word vectors to relate to the semantic distance between the associated words (words meaning different things are embedded at points far away from each other, whereas related words are closer). In addition to distance, you may want specific directions in the embedding space to be meaningful. To make this clearer, let's look at a concrete example.
In figure 6.3, four words are embedded on a 2D plane: cat, dog, wolf, and tiger. With the vector representations we chose here, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector allows us to go from cat to tiger and from dog to wolf: this vector could be interpreted as the "from pet to wild animal" vector. Similarly, another vector lets us go from dog to cat and from wolf to tiger, which could be interpreted as a "from canine to feline" vector.
In real-world word-embedding spaces, common examples of meaningful geometric transformations are "gender" vectors and "plural" vectors. For instance, by adding a "female" vector to the vector "king," we obtain the vector "queen." By adding a "plural" vector, we obtain "kings." Word-embedding spaces typically feature thousands of such interpretable and potentially useful vectors.
Is there some ideal word-embedding space that would perfectly map human language and could be used for any natural-language-processing task? Possibly, but we have yet to compute anything of the sort. Also, there is no such thing as human language: there are many different languages, and they aren't isomorphic, because a language is the reflection of a specific culture and a specific context. But more pragmatically, what makes a good word-embedding space depends heavily on your task: the perfect word-embedding space for an English-language movie-review sentiment-analysis model may look different from the perfect embedding space for an English-language legal-document-classification model, because the importance of certain semantic relationships varies from task to task.
It's thus reasonable to learn a new embedding space with every new task. Fortunately, backpropagation makes this easy, and Keras makes it even easier. It's about learning the weights of a layer: the Embedding layer.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It's effectively a dictionary lookup (see figure 6.4).
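For instance, instantiating an Embedding layer might look like the following; the vocabulary size of 1,000 and the embedding dimensionality of 64 are arbitrary illustration values:

from keras.layers import Embedding

# The first argument is the number of possible tokens (here, 1,000: 1 + the maximum word index),
# the second is the dimensionality of the embeddings (here, 64).
embedding_layer = Embedding(1000, 64)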
The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths: for instance, you could feed into the Embedding layer in the previous example batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a batch must have the same length, though (because you need to pack them into a single tensor), so sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.
This layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer (both will be introduced in the following sections).
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure: a kind of structure specialized for the specific problem for which you're training your model.
Let's apply this idea to the IMDB movie-review sentiment-prediction task that you're already familiar with. First, you'll quickly prepare the data. You'll restrict the movie reviews to the top 10,000 most common words (as you did the first time you worked with this dataset) and cut off the reviews after only 20 words. The network will learn 8-dimensional embeddings for each of the 10,000 words, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense layer on top for classification.
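A sketch of this setup, using the standard keras.datasets.imdb loader and the cutoffs just described (top 10,000 words, reviews truncated to 20 words, 8-dimensional embeddings); the training settings (10 epochs, batch size 32, 20% validation split) are assumptions:

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_features = 10000   # number of words to consider as features
maxlen = 20            # cut off the reviews after this many words

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 8, input_length=maxlen))  # output shape: (samples, maxlen, 8)
model.add(Flatten())                                        # flattens to (samples, maxlen * 8)
model.add(Dense(1, activation='sigmoid'))                   # binary classifier on top
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)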
You get to a validation accuracy of ~76%, which is pretty good considering that you're only looking at the first 20 words in every review. But note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (for example, this model would likely treat both "this movie is a bomb" and "this movie is the bomb" as being negative reviews). It's much better to add recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take into account each sequence as a whole. That's what we'll focus on in the next few sections.
Sometimes, you have so little training data available that you can't use your data alone to learn an appropriate task-specific embedding of your vocabulary. What do you do then?
Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that you know is highly structured and exhibits useful properties, one that captures generic aspects of language structure. The rationale behind using pretrained word embeddings in natural-language processing is much the same as for using pretrained convnets in image classification: you don't have enough data available to learn truly powerful features on your own, but you expect the features that you need to be fairly generic, that is, common visual features or semantic features. In this case, it makes sense to reuse features learned on a different problem.
Such word embeddings are generally computed using word-occurrence statistics (observations about what words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s,[1] but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.
1Yoshua Bengio et al., Neural Probabilistic Language Models (Springer, 2003).
There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
Let's look at how you can get started using GloVe embeddings in a Keras model. The same method is valid for Word2vec embeddings or any other word-embedding database. You'll also use this example to refresh the text-tokenization techniques introduced a few paragraphs ago: you'll start from raw text and work your way up.
You'll use a model similar to the one we just went over: embedding sentences in sequences of vectors, flattening them, and training a Dense layer on top. But you'll do so using pretrained word embeddings; and instead of using the pretokenized IMDB data packaged in Keras, you'll start from scratch by downloading the original text data.
First, head to http://mng.bz/0tIo and download the raw IMDB dataset. Uncompress it.
Now, let's collect the individual training reviews into a list of strings, one string per review. You'll also collect the review labels (negative/positive) into a labels list.
Listing 6.8. Processing the labels of the raw IMDB data
import os

imdb_dir = '/Users/fchollet/Downloads/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
Let's vectorize the text and prepare a training and validation split, using the concepts introduced earlier in this section. Because pretrained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we'll add the following twist: restricting the training data to the first 200 samples. So you'll learn to classify movie reviews after looking at just 200 examples.
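A sketch of that vectorization and split, reusing the Tokenizer and pad_sequences utilities mentioned earlier; the 200-sample training set follows the twist just described, while the 100-word cutoff, the 10,000-sample validation set, and the 10,000-word vocabulary cap are assumptions:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100                # cut off reviews after 100 words
training_samples = 200      # train on only 200 samples
validation_samples = 10000  # validate on 10,000 samples
max_words = 10000           # consider only the top 10,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)

# Shuffle the data, because the raw samples are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]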
Go to https://nlp.stanford.edu/projects/glove, and download the precomputed embeddings from 2014 English Wikipedia. It's an 822 MB zip file called glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it.
Let's parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representation (as number vectors).
Listing 6.10. Parsing the GloVe word-embeddings file
glove_dir = '/Users/fchollet/Downloads/glove.6B'

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
Next, you'll build an embedding matrix that you can load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization). Note that index 0 isn't supposed to stand for any word or token; it's a placeholder.
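A sketch of that preparation step, assuming embedding_dim = 100 to match the 100-dimensional GloVe vectors parsed above:

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the GloVe index will be left as all zeros.
            embedding_matrix[i] = embedding_vector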
You’ll use the same model architecture as before.
Listing 6.12. Model definition
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector meant to be associated with index i. Simple enough. Load the GloVe matrix you prepared into the Embedding layer, the first layer in the model.
Listing 6.13. Loading pretrained word embeddings into the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
Additionally, you'll freeze the Embedding layer (set its trainable attribute to False), following the same rationale you're already familiar with in the context of pretrained convnet features: when parts of a model are pretrained (like your Embedding layer) and parts are randomly initialized (like your classifier), the pretrained parts shouldn't be updated during training, to avoid forgetting what they already know. The large gradient updates triggered by the randomly initialized layers would be disruptive to the already-learned features.
Compile and train the model.
Listing 6.14. Training and evaluation
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Now, plot the model's performance over time (see figures 6.5 and 6.6).
Listing 6.15. Plotting the results
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
The model quickly starts overfitting, which is unsurprising given the small number of training samples. Validation accuracy has high variance for the same reason, but it seems to reach the high 50s.
Note that your mileage may vary: because you have so few training samples, performance is heavily dependent on exactly which 200 samples you choose, and you're choosing them at random. If this works poorly for you, try choosing a different random set of 200 samples, for the sake of the exercise (in real life, you don't get to choose your training data).
You can also train the same model without loading the pretrained word embeddings and without freezing the embedding layer. In that case, you'll learn a task-specific embedding of the input tokens, which is generally more powerful than pretrained word embeddings when lots of data is available. But in this case, you have only 200 training samples. Let's try it (see figures 6.7 and 6.8).
Listing 6.16. Training the same model without pretrained word embeddings
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
Validation accuracy stalls in the low 50s. So in this case, pretrained word embeddings outperform jointly learned embeddings. If you increase the number of training samples, this will quickly stop being the case; try it as an exercise.
Finally, let's evaluate the model on the test data. First, you need to tokenize the test data.
Listing 6.17. Tokenizing the data of the test set
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
Next, load and evaluate the first model.
Listing 6.18. Evaluating the model on the test set
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)
You get an appalling test accuracy of 56%. Working with just a handful of training samples is difficult!
Now you’re able to do the following:
- Turn raw text into something a neural network can process
- Use the Embedding layer in a Keras model to learn task-specific token embeddings
- Use pretrained word embeddings to get an extra boost on small natural-language-processing problems
A major characteristic of all neural networks you’ve seen so far, such as densely connected networks and convnets, is that they have no memory. Each input shown to them is processed independently, with no state kept in between inputs. With such networks, in order to process a sequence or a temporal series of data points, you have to show the entire sequence to the network at once: turn it into a single data point. For instance, this is what you did in the IMDB example: an entire movie review was transformed into a single large vector and processed in one go. Such networks are called feedforward networks.
In contrast, as you're reading the present sentence, you're processing it word by word (or rather, eye saccade by eye saccade) while keeping memories of what came before; this gives you a fluid representation of the meaning conveyed by this sentence. Biological intelligence processes information incrementally while maintaining an internal model of what it's processing, built from past information and constantly updated as new information comes in.
A recurrent neural network (RNN) adopts the same principle, albeit in an extremely simplified version: it processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop (see figure 6.9). The state of the RNN is reset between processing two different, independent sequences (such as two different IMDB reviews), so you still consider one sequence a single data point: a single input to the network. What changes is that this data point is no longer processed in a single step; rather, the network internally loops over sequence elements.
To make these notions of loop and state clear, let's implement the forward pass of a toy RNN in Numpy. This RNN takes as input a sequence of vectors, which you'll encode as a 2D tensor of size (timesteps, input_features). It loops over timesteps, and at each timestep, it considers its current state at t and the input at t (of shape (input_features,)), and combines them to obtain the output at t. You'll then set the state for the next step to be this previous output. For the first timestep, the previous output isn't defined; hence, there is no current state. So, you'll initialize the state as an all-zero vector called the initial state of the network.
In pseudocode, this is the RNN.
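In its simplest form, it looks like this, where f is a placeholder for the step function that the next listing spells out:

state_t = 0                        # the state at t: initially, the zero vector
for input_t in input_sequence:     # iterate over sequence elements
    output_t = f(input_t, state_t)
    state_t = output_t             # the previous output becomes the state for the next iteration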
You can even flesh out the function f: the transformation of the input and state into an output will be parameterized by two matrices, W and U, and a bias vector. It's similar to the transformation operated by a densely connected layer in a feedforward network.
Listing 6.20. More detailed pseudocode for the RNN
state_t = 0
for input_t in input_sequence:
    output_t = activation(dot(W, input_t) + dot(U, state_t) + b)
    state_t = output_t
To make these notions absolutely unambiguous, let's write a naive Numpy implementation of the forward pass of the simple RNN.
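A minimal runnable sketch along those lines; the dimensions are arbitrary and the inputs are random noise, purely for illustration:

import numpy as np

timesteps = 100        # number of timesteps in the input sequence
input_features = 32    # dimensionality of the input feature space
output_features = 64   # dimensionality of the output feature space

inputs = np.random.random((timesteps, input_features))   # input data: random noise for the example
state_t = np.zeros((output_features,))                    # initial state: an all-zero vector

W = np.random.random((output_features, input_features))   # random weight matrices
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

successive_outputs = []
for input_t in inputs:
    # Combine the input with the current state (the previous output) to obtain the current output.
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t                 # update the state for the next timestep

# The final output is a 2D tensor of shape (timesteps, output_features).
final_output_sequence = np.stack(successive_outputs, axis=0)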
Easy enough: in summary, an RNN is a for loop that reuses quantities computed during the previous iteration of the loop, nothing more. Of course, there are many different RNNs fitting this definition that you could build; this example is one of the simplest RNN formulations. RNNs are characterized by their step function, such as the following function in this case (see figure 6.10):
output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
Note
In this example, the final output is a 2D tensor of shape (timesteps, output_features), where each timestep is the output of the loop at time t. Each timestep t in the output tensor contains information about timesteps 0 to t in the input sequence, about the entire past. For this reason, in many cases, you don't need this full sequence of outputs; you just need the last output (output_t at the end of the loop), because it already contains information about the entire sequence.
The process you just naively implemented in Numpy corresponds to an actual Keras layer, the SimpleRNN layer:
from keras.layers import SimpleRNN
There is one minor difference: SimpleRNN processes batches of sequences, like all other Keras layers, not a single sequence as in the Numpy example. This means it takes inputs of shape (batch_size, timesteps, input_features), rather than (timesteps, input_features).
Like all recurrent layers in Keras, SimpleRNN can be run in two different modes: it can return either the full sequences of successive outputs for each timestep (a 3D tensor of shape (batch_size, timesteps, output_features)) or only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). These two modes are controlled by the return_sequences constructor argument. Let's look at an example that uses SimpleRNN and returns only the output at the last timestep:
>>> from keras.models import Sequential
>>> from keras.layers import Embedding, SimpleRNN
>>> model = Sequential()
>>> model.add(Embedding(10000, 32))
>>> model.add(SimpleRNN(32))
>>> model.summary()
________________________________________________________________
Layer (type)                     Output Shape          Param #
================================================================
embedding_22 (Embedding)         (None, None, 32)      320000
________________________________________________________________
simplernn_10 (SimpleRNN)         (None, 32)            2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
The following example returns the full state sequence:
>>> model = Sequential()
>>> model.add(Embedding(10000, 32))
>>> model.add(SimpleRNN(32, return_sequences=True))
>>> model.summary()
________________________________________________________________
Layer (type)                     Output Shape          Param #
================================================================
embedding_23 (Embedding)         (None, None, 32)      320000
________________________________________________________________
simplernn_11 (SimpleRNN)         (None, None, 32)      2080
================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
It's sometimes useful to stack several recurrent layers one after the other in order to increase the representational power of a network. In such a setup, you have to get all of the intermediate layers to return full sequences of outputs:
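A sketch of such a stack; the layer sizes are illustrative, every intermediate SimpleRNN sets return_sequences=True, and only the final one returns just its last output:

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))   # intermediate layers return the full sequence
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))                          # the last layer only returns the last output
model.summary()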
Now, let's use such a model on the IMDB movie-review-classification problem. First, preprocess the data.
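A sketch of that preprocessing step, assuming the cutoffs discussed around this example: the top 10,000 words as features and sequences truncated to 500 words:

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000   # number of words to consider as features
maxlen = 500           # cut texts after this number of words

(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)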
Let's train a simple recurrent network using an Embedding layer and a SimpleRNN layer.
Listing 6.23. Training the model with Embedding and SimpleRNN layers
from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
Now, let's display the training and validation loss and accuracy (see figures 6.11 and 6.12).
Listing 6.24. Plotting results
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
As a reminder, in chapter 3, the first naive approach to this dataset got you to a test accuracy of 88%. Unfortunately, this small recurrent network doesn't perform well compared to this baseline (only 85% validation accuracy). Part of the problem is that your inputs only consider the first 500 words, rather than full sequences; hence, the RNN has access to less information than the earlier baseline model. The remainder of the problem is that SimpleRNN isn't good at processing long sequences, such as text. Other types of recurrent layers perform much better. Let's look at some more-advanced layers.
SimpleRNN isn’t the only recurrent layer available in Keras. There are two others: LSTM and GRU. In practice, you’ll always use one of these, because SimpleRNN is generally too simplistic to be of real use. SimpleRNN has a major issue: although it should theoretically be able to retain at time t information about inputs seen many timesteps before, in practice, such long-term dependencies are impossible to learn. This is due to the vanishing gradient problem, an effect that is similar to what is observed with non-recurrent networks (feedforward networks) that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable. The theoretical reasons for this effect were studied by Hochreiter, Schmidhuber, and Bengio in the early 1990s.[2] The LSTM and GRU layers are designed to solve this problem.
2See, for example, Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning Long-Term Dependencies with Gradient Descent Is Difficult," IEEE Transactions on Neural Networks 5, no. 2 (1994).
Let's consider the LSTM layer. The underlying Long Short-Term Memory (LSTM) algorithm was developed by Hochreiter and Schmidhuber in 1997;[3] it was the culmination of their research on the vanishing gradient problem.
3Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation 9, no. 8 (1997).
This layer is a variant of the SimpleRNN layer you already know about; it adds a way to carry information across many timesteps. Imagine a conveyor belt running parallel to the sequence you're processing. Information from the sequence can jump onto the conveyor belt at any point, be transported to a later timestep, and jump off, intact, when you need it. This is essentially what LSTM does: it saves information for later, thus preventing older signals from gradually vanishing during processing.
To understand this in detail, let's start from the SimpleRNN cell (see figure 6.13). Because you'll have a lot of weight matrices, index the W and U matrices in the cell with the letter o (Wo and Uo) for output.
Let's add to this picture an additional data flow that carries information across timesteps. Call its values at different timesteps Ct, where C stands for carry. This information will have the following impact on the cell: it will be combined with the input connection and the recurrent connection (via a dense transformation: a dot product with a weight matrix followed by a bias add and the application of an activation function), and it will affect the state being sent to the next timestep (via an activation function and a multiplication operation). Conceptually, the carry dataflow is a way to modulate the next output and the next state (see figure 6.14). Simple so far.
Now the subtlety: the way the next value of the carry dataflow is computed. It involves three distinct transformations. All three have the form of a SimpleRNN cell:
y = activation(dot(state_t, U) + dot(input_t, W) + b)
But all three transformations have their own weight matrices, which you'll index with the letters i, f, and k. Here's what you have so far (it may seem a bit arbitrary, but bear with me).
Listing 6.25. Pseudocode details of the LSTM architecture (1/2)
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(C_t, Vo) + bo)

i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)
You obtain the new carry state (the next c_t) by combining i_t, f_t, and k_t.
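In the same pseudocode style, the combination looks like this (consistent with the interpretation given just below):

c_t+1 = i_t * k_t + c_t * f_t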
Add this as shown in figure 6.15. And that's it. Not so complicated; merely a tad complex.
If you want to get philosophical, you can interpret what each of these operations is meant to do. For instance, you can say that multiplying c_t and f_t is a way to deliberately forget irrelevant information in the carry dataflow. Meanwhile, i_t and k_t provide information about the present, updating the carry track with new information. But at the end of the day, these interpretations don't mean much, because what these operations actually do is determined by the contents of the weights parameterizing them; and the weights are learned in an end-to-end fashion, starting over with each training round, making it impossible to credit this or that operation with a specific purpose. The specification of an RNN cell (as just described) determines your hypothesis space (the space in which you'll search for a good model configuration during training), but it doesn't determine what the cell does; that is up to the cell weights. The same cell with different weights can be doing very different things. So the combination of operations making up an RNN cell is better interpreted as a set of constraints on your search, not as a design in an engineering sense.
To a researcher, it seems that the choice of such constraints (the question of how to implement RNN cells) is better left to optimization algorithms (like genetic algorithms or reinforcement learning processes) than to human engineers. And in the future, that's how we'll build networks. In summary: you don't need to understand anything about the specific architecture of an LSTM cell; as a human, it shouldn't be your job to understand it. Just keep in mind what the LSTM cell is meant to do: allow past information to be reinjected at a later time, thus fighting the vanishing-gradient problem.
Now let's switch to more practical concerns: you'll set up a model using an LSTM layer and train it on the IMDB data (see figures 6.16 and 6.17). The network is similar to the one with SimpleRNN that was just presented. You only specify the output dimensionality of the LSTM layer; leave every other argument (there are many) at the Keras defaults. Keras has good defaults, and things will almost always "just work" without you having to spend time tuning parameters by hand.
Listing 6.27. Using the LSTM layer in Keras
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
This time, you achieve up to 89% validation accuracy. Not bad: certainly much better than the SimpleRNN network (that's largely because LSTM suffers much less from the vanishing-gradient problem), and slightly better than the fully connected approach from chapter 3, even though you're looking at less data than you were in chapter 3. You're truncating sequences after 500 timesteps, whereas in chapter 3, you were considering full sequences.
But this result isn't groundbreaking for such a computationally intensive approach. Why isn't LSTM performing better? One reason is that you made no effort to tune hyperparameters such as the embeddings dimensionality or the LSTM output dimensionality. Another may be lack of regularization. But honestly, the primary reason is that analyzing the global, long-term structure of the reviews (what LSTM is good at) isn't helpful for a sentiment-analysis problem. Such a basic problem is well solved by looking at what words occur in each review, and at what frequency. That's what the first fully connected approach looked at. But there are far more difficult natural-language-processing problems out there, where the strength of LSTM will become apparent: in particular, question-answering and machine translation.
- What RNNs are and how they work
- What LSTM is, and why it works better on long sequences than a naive RNN
- How to use Keras RNN layers to process sequence data
Next, we'll review a number of more advanced features of RNNs, which can help you get the most out of your deep-learning sequence models.
In this section, we’ll review three advanced techniques for improving the performance and generalization power of recurrent neural networks. By the end of the section, you’ll know most of what there is to know about using recurrent networks with Keras. We’ll demonstrate all three concepts on a temperature-forecasting problem, where you have access to a timeseries of data points coming from sensors installed on the roof of a building, such as temperature, air pressure, and humidity, which you use to predict what the temperature will be 24 hours after the last data point. This is a fairly challenging problem that exemplifies many common difficulties encountered when working with timeseries.
We’ll cover the following techniques:
- Recurrent dropout— This is a specific, built-in way to use dropout to fight overfitting in recurrent layers.
- Stacking recurrent layers— This increases the representational power of the network (at the cost of higher computational loads).
- Bidirectional recurrent layers— These present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues.
Until now, the only sequence data we've covered has been text data, such as the IMDB dataset and the Reuters dataset. But sequence data is found in many more problems than just language processing. In all the examples in this section, you'll play with a weather timeseries dataset recorded at the Weather Station at the Max Planck Institute for Biogeochemistry in Jena, Germany.[4]
4Olaf Kolle, www.bgc-jena.mpg.de/wetter.
In this dataset, 14 different quantities (such as air temperature, atmospheric pressure, humidity, wind direction, and so on) were recorded every 10 minutes, over several years. The original data goes back to 2003, but this example is limited to data from 2009–2016. This dataset is perfect for learning to work with numerical timeseries. You'll use it to build a model that takes as input some data from the recent past (a few days' worth of data points) and predicts the air temperature 24 hours in the future.
Download and uncompress the data as follows:
cd ~/Downloads
mkdir jena_climate
cd jena_climate
wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
unzip jena_climate_2009_2016.csv.zip
Let’s look at the data.
Listing 6.28. Inspecting the data of the Jena weather dataset
import os

data_dir = '/users/fchollet/Downloads/jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

f = open(fname)
data = f.read()
f.close()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

print(header)
print(len(lines))
This outputs a count of 420,551 lines of data (each line is a timestep: a record of a date and 14 weather-related values), as well as the following header:
["Date Time", "p (mbar)", "T (degC)", "Tpot (K)", "Tdew (degC)", "rh (%)", "VPmax (mbar)", "VPact (mbar)", "VPdef (mbar)", "sh (g/kg)", "H2OC (mmol/mol)", "rho (g/m**3)", "wv (m/s)", "max. wv (m/s)", "wd (deg)"]
Now, convert all 420,551 lines of data into a Numpy array.
Listing 6.29. Parsing the data
import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
For instance, here is the plot of temperature (in degrees Celsius) over time (see figure 6.18). On this plot, you can clearly see the yearly periodicity of temperature.
Listing 6.30. Plotting the temperature timeseries
from matplotlib import pyplot as plt

temp = float_data[:, 1]    # temperature (in degrees Celsius)
plt.plot(range(len(temp)), temp)
Here is a more narrow plot of the first 10 days of temperature data (see figure 6.19). Because the data is recorded every 10 minutes, you get 144 data points per day.
Listing 6.31. Plotting the first 10 days of the temperature timeseries
plt.plot(range(1440), temp[:1440])
On this plot, you can see daily periodicity, especially evident for the last 4 days. Also note that this 10-day period must be coming from a fairly cold winter month.
If you were trying to predict average temperature for the next month given a few months of past data, the problem would be easy, due to the reliable year-scale periodicity of the data. But looking at the data over a scale of days, the temperature looks a lot more chaotic. Is this timeseries predictable at a daily scale? Let's find out.
The exact formulation of the problem will be as follows: given data going as far back as lookback timesteps (a timestep is 10 minutes) and sampled every steps timesteps, can you predict the temperature in delay timesteps? You'll use the following parameter values:
- lookback = 720—Observations will go back 5 days.
- steps = 6—Observations will be sampled at one data point per hour.
- delay = 144—Targets will be 24 hours in the future.
To get started, you need to do two things:
- Preprocess the data to a format a neural network can ingest. This is easy: the data is already numerical, so you don't need to do any vectorization. But each timeseries in the data is on a different scale (for example, temperature is typically between -20 and +30, but atmospheric pressure, measured in mbar, is around 1,000). You'll normalize each timeseries independently so that they all take small values on a similar scale.
- Write a Python generator that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future. Because the samples in the dataset are highly redundant (sample N and sample N + 1 will have most of their timesteps in common), it would be wasteful to explicitly allocate every sample. Instead, you'll generate the samples on the fly using the original data.
You'll preprocess the data by subtracting the mean of each timeseries and dividing by the standard deviation. You're going to use the first 200,000 timesteps as training data, so compute the mean and standard deviation only on this fraction of the data.
Listing 6.32. Normalizing the data
mean = float_data[:200000].mean(axis=0) float_data -= mean std = float_data[:200000].std(axis=0) float_data /= std
Listing 6.33 shows the data generator you'll use. It yields a tuple (samples, targets), where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:
- data—The original array of floating-point data, which you normalized in listing 6.32.
- lookback—How many timesteps back the input data should go.
- delay—How many timesteps in the future the target should be.
- min_index and max_index—Indices in the data array that delimit which timesteps to draw from. This is useful for keeping a segment of the data for validation and another for testing.
- shuffle—Whether to shuffle the samples or draw them in chronological order.
- batch_size—The number of samples per batch.
- step—The period, in timesteps, at which you sample data. You'll set it to 6 in order to draw one data point every hour.
Listing 6.33. Generator yielding timeseries samples and their targets
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
Now, let's use the abstract generator function to instantiate three generators: one for training, one for validation, and one for testing. Each will look at different temporal segments of the original data: the training generator looks at the first 200,000 timesteps, the validation generator looks at the following 100,000, and the test generator looks at the remainder.
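A sketch of those instantiations, using the parameter values stated earlier (lookback = 720, step = 6, delay = 144) and the split boundaries just described; the batch size of 128 matches the generator's default, and the exact boundary indices are a choice made here:

lookback = 720
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000,
                      shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000,
                    step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None,
                     step=step, batch_size=batch_size)

# How many steps to draw from val_gen and test_gen in order to see each entire split once.
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size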
Before you start using black-box deep-learning models to solve the temperature-prediction problem, let's try a simple, common-sense approach. It will serve as a sanity check, and it will establish a baseline that you'll have to beat in order to demonstrate the usefulness of more-advanced machine-learning models. Such common-sense baselines can be useful when you're approaching a new problem for which there is no known solution (yet). A classic example is that of unbalanced classification tasks, where some classes are much more common than others. If your dataset contains 90% instances of class A and 10% instances of class B, then a common-sense approach to the classification task is to always predict "A" when presented with a new sample. Such a classifier is 90% accurate overall, and any learning-based approach should therefore beat this 90% score in order to demonstrate usefulness. Sometimes, such elementary baselines can prove surprisingly hard to beat.
In this case, the temperature timeseries can safely be assumed to be continuous (the temperatures tomorrow are likely to be close to the temperatures today) as well as periodical with a daily period. Thus a common-sense approach is to always predict that the temperature 24 hours from now will be equal to the temperature right now. Let's evaluate this approach, using the mean absolute error (MAE) metric:
np.mean(np.abs(preds - targets))
Here’s the evaluation loop.
Listing 6.35. Computing the common-sense baseline MAE
def evaluate_naive_method():
    batch_maes = []
    for step in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]
        mae = np.mean(np.abs(preds - targets))
        batch_maes.append(mae)
    print(np.mean(batch_maes))

evaluate_naive_method()
This yields an MAE of 0.29. Because the temperature data has been normalized to be centered on 0 and have a standard deviation of 1, this number isn't immediately interpretable. It translates to an average absolute error of 0.29 × temperature_std degrees Celsius: 2.57°C.
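In code, that conversion is simply the following, where std holds the per-column standard deviations computed during normalization and column 1 is the temperature column:

celsius_mae = 0.29 * std[1]   # average absolute error expressed back in degrees Celsius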
That's a fairly large average absolute error. Now the game is to use your knowledge of deep learning to do better.
In the same way that it's useful to establish a common-sense baseline before trying machine-learning approaches, it's useful to try simple, cheap machine-learning models (such as small, densely connected networks) before looking into complicated and computationally expensive models such as RNNs. This is the best way to make sure any further complexity you throw at the problem is legitimate and delivers real benefits.
The following listing shows a fully connected model that starts by flattening the data and then runs it through two Dense layers. Note the lack of activation function on the last Dense layer, which is typical for a regression problem. You use MAE as the loss. Because you evaluate on the exact same data and with the exact same metric you did with the common-sense approach, the results will be directly comparable.
Listing 6.37. Training and evaluating a densely connected model
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Let's display the loss curves for validation and training (see figure 6.20).
Figure 6.20. Training and validation loss on the Jena temperature-forecasting task with a simple, densely connected network
Listing 6.38. Plotting results
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
Some of the validation losses are close to the no-learning baseline, but not reliably. This goes to show the merit of having this baseline in the first place: it turns out to be not easy to outperform. Your common sense contains a lot of valuable information that a machine-learning model doesn't have access to.
You may wonder, if a simple, well-performing model exists to go from the data to the targets (the common-sense baseline), why doesn’t the model you’re training find it and improve on it? Because this simple solution isn’t what your training setup is looking for. The space of models in which you’re searching for a solution—that is, your hypothesis space—is the space of all possible two-layer networks with the configuration you defined. These networks are already fairly complicated. When you’re looking for a solution with a space of complicated models, the simple, well-performing baseline may be unlearnable, even if it’s technically part of the hypothesis space. That is a pretty significant limitation of machine learning in general: unless the learning algorithm is hardcoded to look for a specific kind of simple model, parameter learning can sometimes fail to find a simple solution to a simple problem.
The first fully connected approach didn't do well, but that doesn't mean machine learning isn't applicable to this problem. The previous approach first flattened the timeseries, which removed the notion of time from the input data. Let's instead look at the data as what it is: a sequence, where causality and order matter. You'll try a recurrent-sequence-processing model; it should be the perfect fit for such sequence data, precisely because it exploits the temporal ordering of data points, unlike the first approach.
Instead of the LSTM layer introduced in the previous section, you'll use the GRU layer, developed by Chung et al. in 2014.[5] Gated recurrent unit (GRU) layers work using the same principle as LSTM, but they're somewhat streamlined and thus cheaper to run (although they may not have as much representational power as LSTM). This trade-off between computational expensiveness and representational power is seen everywhere in machine learning.
5Junyoung Chung et al., "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," Conference on Neural Information Processing Systems (2014), https://arxiv.org/abs/1412.3555.
Listing 6.39. Training and evaluating a GRU-based model
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Figure 6.21 shows the results. Much better! You can significantly beat the common-sense baseline, demonstrating the value of machine learning as well as the superiority of recurrent networks compared to sequence-flattening dense networks on this type of task.
The new validation MAE of ~0.265 (before you start significantly overfitting) translates to a mean absolute error of 2.35°C after denormalization. That's a solid gain on the initial error of 2.57°C, but you probably still have a bit of a margin for improvement.
It’s evident from the training and validation curves that the model is overfitting: the training and validation losses start to diverge considerably after a few epochs. You’re already familiar with a classic technique for fighting this phenomenon: dropout, which randomly zeros out input units of a layer in order to break happenstance correlations in the training data that the layer is exposed to. But how to correctly apply dropout in recurrent networks isn’t a trivial question. It has long been known that applying dropout before a recurrent layer hinders learning rather than helping with regularization. In 2015, Yarin Gal, as part of his PhD thesis on Bayesian deep learning,[6] determined the proper way to use dropout with a recurrent network: the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that varies randomly from timestep to timestep. What’s more, in order to regularize the representations formed by the recurrent gates of layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the inner recurrent activations of the layer (a recurrent dropout mask). Using the same dropout mask at every timestep allows the network to properly propagate its learning error through time; a temporally random dropout mask would disrupt this error signal and be harmful to the learning process.
6. See Yarin Gal, "Uncertainty in Deep Learning (PhD Thesis)," October 13, 2016, http://mlg.eng.cam.ac.uk/yarin/blog_2248.html.
Yarin Gal did his research using Keras and helped build this mechanism directly into Keras recurrent layers. Every recurrent layer in Keras has two dropout-related arguments: dropout, a float specifying the dropout rate for input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units. Let's add dropout and recurrent dropout to the GRU layer and see how doing so impacts overfitting. Because networks being regularized with dropout always take longer to fully converge, you'll train the network for twice as many epochs.
Listing 6.40. Training and evaluating a dropout-regularized GRU-based model
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.2,
                     recurrent_dropout=0.2,
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Figure 6.22 shows the results. Success! You're no longer overfitting during the first 30 epochs. But although you have more stable evaluation scores, your best scores aren't much lower than they were previously.
Figure 6.22. Training and validation loss on the Jena temperature-forecasting task with a dropout-regularized GRU
Because you're no longer overfitting but seem to have hit a performance bottleneck, you should consider increasing the capacity of the network. Recall the description of the universal machine-learning workflow: it's generally a good idea to increase the capacity of your network until overfitting becomes the primary obstacle (assuming you're already taking basic steps to mitigate overfitting, such as using dropout). As long as you aren't overfitting too badly, you're likely under capacity.
Increasing network capacity is typically done by increasing the number of units in the layers or adding more layers. Recurrent layer stacking is a classic way to build more-powerful recurrent networks: for instance, what currently powers the Google Translate algorithm is a stack of seven large LSTM layers; that's huge.
To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a 3D tensor) rather than their output at the last timestep. This is done by specifying return_sequences=True.
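As a quick shape check (a minimal sketch, not one of the original listings, assuming 14 input features as in the Jena data), you can compare the output shapes of a GRU layer with and without return_sequences:

from keras.models import Sequential
from keras import layers

# With return_sequences=True, the GRU returns one 32-dimensional output
# vector per timestep: a 3D tensor of shape (batch, timesteps, 32).
seq_model = Sequential()
seq_model.add(layers.GRU(32, return_sequences=True, input_shape=(None, 14)))
print(seq_model.output_shape)     # (None, None, 32)

# Without it, only the output at the last timestep is returned: a 2D tensor.
last_model = Sequential()
last_model.add(layers.GRU(32, input_shape=(None, 14)))
print(last_model.output_shape)    # (None, 32)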
Listing 6.41. Training and evaluating a dropout-regularized, stacked GRU model
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.1,
                     recurrent_dropout=0.5,
                     return_sequences=True,
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64,
                     activation='relu',
                     dropout=0.1,
                     recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Figure 6.23 shows the results. You can see that the added layer does improve the results a bit, though not significantly. You can draw two conclusions:
- Because you're still not overfitting too badly, you could safely increase the size of your layers in a quest for validation-loss improvement. This has a non-negligible computational cost, though.
- Adding a layer didn't help by a significant factor, so you may be seeing diminishing returns from increasing network capacity at this point.
Figure 6.23. Training and validation loss on the Jena temperature-forecasting task with a stacked GRU network
The last technique introduced in this section is called bidirectional RNNs. A bidirectional RNN is a common RNN variant that can offer greater performance than a regular RNN on certain tasks. It's frequently used in natural-language processing; you could call it the Swiss Army knife of deep learning for natural-language processing.
RNNs are notably order dependent, or time dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations the RNN extracts from the sequence. This is precisely the reason they perform well on problems where order is meaningful, such as the temperature-forecasting problem. A bidirectional RNN exploits the order sensitivity of RNNs: it consists of using two regular RNNs, such as the GRU and LSTM layers you're already familiar with, each of which processes the input sequence in one direction (chronologically and antichronologically), and then merging their representations. By processing a sequence both ways, a bidirectional RNN can catch patterns that may be overlooked by a unidirectional RNN.
Remarkably, the fact that the RNN layers in this section have processed sequences in chronological order (older timesteps first) may have been an arbitrary decision. At least, it's a decision we made no attempt to question so far. Could the RNNs have performed well enough if they processed input sequences in antichronological order, for instance (newer timesteps first)? Let's try this in practice and see what happens. All you need to do is write a variant of the data generator where the input sequences are reversed along the time dimension (replace the last line with yield samples[:, ::-1, :], targets); a sketch of such a variant follows. Training the same one-GRU-layer network that you used in the first experiment in this section, you get the results shown in figure 6.24.
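One way to write that variant, sketched here under the assumption that the train_gen and val_gen generators from earlier in the chapter are available, is to wrap the existing generators and flip each batch of samples along the time axis rather than editing the original generator function:

def reverse_order_generator(gen):
    # Yield the same batches as the wrapped generator, but with the
    # timesteps of each sample reversed (newest observations first).
    for samples, targets in gen:
        yield samples[:, ::-1, :], targets

train_gen_reverse = reverse_order_generator(train_gen)
val_gen_reverse = reverse_order_generator(val_gen)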
Figure 6.24. Training and validation loss on the Jena temperature-forecasting task with a GRU trained on reversed sequences
The reversed-order GRU strongly underperforms even the common-sense baseline, indicating that in this case, chronological processing is important to the success of the approach. This makes perfect sense: the underlying GRU layer will typically be better at remembering the recent past than the distant past, and naturally the more recent weather data points are more predictive than older data points for this problem (that's what makes the common-sense baseline fairly strong). Thus the chronological version of the layer is bound to outperform the reversed-order version. Importantly, this isn't true for many other problems, including natural language: intuitively, the importance of a word in understanding a sentence isn't usually dependent on its position in the sentence. Let's try the same trick on the LSTM IMDB example from section 6.2; a sketch of that reversed-sequence setup follows.
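The original listing for this experiment isn't reproduced here; the following is a minimal sketch of one way to set it up, reusing the standard Keras IMDB utilities and reversing each review before padding (hyperparameters mirror the earlier IMDB examples and aren't tuned):

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras import layers

max_features = 10000    # number of words to consider as features
maxlen = 500            # cut off reviews after this many words

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

x_train = [x[::-1] for x in x_train]    # reverse each sequence
x_test = [x[::-1] for x in x_test]

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(layers.Embedding(max_features, 128))
model.add(layers.LSTM(32))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)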
You get performance nearly identical to that of the chronological-order LSTM. Remarkably, on such a text dataset, reversed-order processing works just as well as chronological processing, confirming the hypothesis that, although word order does matter in understanding language, which order you use isn't crucial. Importantly, an RNN trained on reversed sequences will learn different representations than one trained on the original sequences, much as you would have different mental models if time flowed backward in the real world (if you lived a life where you died on your first day and were born on your last day). In machine learning, representations that are different yet useful are always worth exploiting, and the more they differ, the better: they offer a new angle from which to look at your data, capturing aspects of the data that were missed by other approaches, and thus they can help boost performance on a task. This is the intuition behind ensembling, a concept we'll explore in chapter 7.
A bidirectional RNN exploits this idea to improve on the performance of chronological-order RNNs. It looks at its input sequence both ways (see figure 6.25), obtaining potentially richer representations and capturing patterns that may have been missed by the chronological-order version alone.
To instantiate a bidirectional RNN in Keras, you use the Bidirectional layer, which takes as its first argument a recurrent layer instance. Bidirectional creates a second, separate instance of this recurrent layer and uses one instance for processing the input sequences in chronological order and the other instance for processing the input sequences in reversed order. Let's try it on the IMDB sentiment-analysis task.
Listing 6.43. Training and evaluating a bidirectional LSTM
from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
It performs slightly better than the regular LSTM you tried in the previous section, achieving over 89% validation accuracy. It also seems to overfit more quickly, which is unsurprising because a bidirectional layer has twice as many parameters as a chronological LSTM. With some regularization (a sketch follows), the bidirectional approach would likely be a strong performer on this task.
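For instance, here is a minimal sketch (not from the original text) of the same bidirectional model with the dropout and recurrent_dropout arguments discussed earlier in this section; the 0.2 rates are illustrative rather than tuned values:

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(
    layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)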
Now let’s try the same approach on the temperature-prediction task.
Listing 6.44. Training a bidirectional GRU
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Bidirectional(
    layers.GRU(32), input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)
This performs about as well as the regular GRU layer. It's easy to understand why: all the predictive capacity must come from the chronological half of the network, because the antichronological half is known to be severely underperforming on this task (again, because the recent past matters much more than the distant past in this case).
There are many other things you could try, in order to improve performance on the temperature-forecasting problem:
- Adjust the number of units in each recurrent layer in the stacked setup. The current choices are largely arbitrary and thus probably suboptimal.
- Adjust the learning rate used by the RMSprop optimizer.
- Try using LSTM layers instead of GRU layers.
- Try using a bigger densely connected regressor on top of the recurrent layers: that is, a bigger Dense layer or even a stack of Dense layers.
- Don't forget to eventually run the best-performing models (in terms of validation MAE) on the test set! Otherwise, you'll develop architectures that are overfitting to the validation set. (A minimal sketch of this last step follows the list.)
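A minimal sketch of that last point, assuming the test_gen generator and test_steps count were defined alongside train_gen and val_gen earlier in the chapter:

# Evaluate the best-performing model on held-out data it has never seen.
test_mae = model.evaluate_generator(test_gen, steps=test_steps)
print('Test MAE (normalized):', test_mae)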
As always, deep learning is more an art than a science. We can provide guidelines that suggest what is likely to work or not work on a given problem, but, ultimately, every problem is unique; you'll have to evaluate different strategies empirically. There is currently no theory that will tell you in advance precisely what you should do to optimally solve a problem. You must iterate.
Here’s what you should take away from this section:
- As you first learned in chapter 4, when approaching a new problem, it's good to first establish common-sense baselines for your metric of choice. If you don't have a baseline to beat, you can't tell whether you're making real progress.
- Try simple models before expensive ones, to justify the additional expense. Sometimes a simple model will turn out to be your best option.
- When you have data where temporal ordering matters, recurrent networks are a great fit and easily outperform models that first flatten the temporal data.
- To use dropout with recurrent networks, you should use a time-constant dropout mask and recurrent dropout mask. These are built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
- Stacked RNNs provide more representational power than a single RNN layer. They're also much more expensive and thus not always worth it. Although they offer clear gains on complex problems (such as machine translation), they may not always be relevant to smaller, simpler problems.
- Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren't strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.
Note
There are two important concepts we won't cover in detail here: recurrent attention and sequence masking. Both tend to be especially relevant for natural-language processing, and they aren't particularly applicable to the temperature-forecasting problem. We'll leave them for future study outside of this book.
Markets and machine learning
Some readers are bound to want to take the techniques we've introduced here and try them on the problem of forecasting the future price of securities on the stock market (or currency exchange rates, and so on). Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you're likely to waste your time and resources with nothing to show for it.
Always remember that when it comes to markets, past performance is not a good predictor of future returns; looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
In chapter 5, you learned about convolutional neural networks (convnets) and how they perform particularly well on computer vision problems, due to their ability to operate convolutionally, extracting features from local input patches and allowing for representation modularity and data efficiency. The same properties that make convnets excel at computer vision also make them highly relevant to sequence processing. Time can be treated as a spatial dimension, like the height or width of a 2D image.
Such 1D convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost. Recently, 1D convnets, typically used with dilated kernels, have been used with great success for audio generation and machine translation. In addition to these specific successes, it has long been known that small 1D convnets can offer a fast alternative to RNNs for simple tasks such as text classification and timeseries forecasting.
The convolution layers introduced previously were 2D convolutions, extracting 2D patches from image tensors and applying an identical transformation to every patch. In the same way, you can use 1D convolutions, extracting local 1D patches (subsequences) from sequences (see figure 6.26).
Figure 6.26. How 1D convolution works: each output timestep is obtained from a temporal patch in the input sequence.
Such 1D convolution layers can recognize local patterns in a sequence. Because the same input transformation is performed on every patch, a pattern learned at a certain position in a sentence can later be recognized at a different position, making 1D convnets translation invariant (for temporal translations). For instance, a 1D convnet processing sequences of characters using convolution windows of size 5 should be able to learn words or word fragments of length 5 or less, and it should be able to recognize these words in any context in an input sequence. A character-level 1D convnet is thus able to learn about word morphology.
You're already familiar with 2D pooling operations, such as 2D average pooling and max pooling, used in convnets to spatially downsample image tensors. The 2D pooling operation has a 1D equivalent: extracting 1D patches (subsequences) from an input and outputting the maximum value (max pooling) or average value (average pooling). Just as with 2D convnets, this is used for reducing the length of 1D inputs (subsampling).
In Keras, you use a 1D convnet via the Conv1D layer, which has an interface similar to Conv2D. It takes as input 3D tensors with shape (samples, time, features) and returns similarly shaped 3D tensors. The convolution window is a 1D window on the temporal axis: axis 1 in the input tensor.
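As a brief illustration (a sketch, not one of the original listings), a Conv1D layer with the default 'valid' padding shortens the time axis slightly and sets the feature dimension to the number of filters, and MaxPooling1D then downsamples the time axis further:

import numpy as np
from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Conv1D(16, 5, activation='relu', input_shape=(100, 8)))
model.add(layers.MaxPooling1D(2))

# A batch of 3 sequences, each 100 timesteps long with 8 features per step.
x = np.random.random((3, 100, 8))
print(model.predict(x).shape)    # (3, 48, 16): 100 - 5 + 1 = 96, then pooled by 2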
Let's build a simple two-layer 1D convnet and apply it to the IMDB sentiment-classification task you're already familiar with. As a reminder, this is the code for obtaining and preprocessing the data.
Listing 6.45. Preparing the IMDB data
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000
max_len = 500

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
1D convnets are structured in the same way as their 2D counterparts, which you used in chapter 5: they consist of a stack of Conv1D and MaxPooling1D layers, ending in either a global pooling layer or a Flatten layer, which turns the 3D outputs into 2D outputs, allowing you to add one or more Dense layers to the model for classification or regression.
One difference, though, is the fact that you can afford to use larger convolution windows with 1D convnets. With a 2D convolution layer, a 3 × 3 convolution window contains 3 × 3 = 9 feature vectors; but with a 1D convolution layer, a convolution window of size 3 contains only 3 feature vectors. You can thus easily afford 1D convolution windows of size 7 or 9.
Listing 6.46. Training and evaluating a simple 1D convnet on the IMDB data
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
Figures 6.27 and 6.28 show the training and validation results. Validation accuracy is somewhat less than that of the LSTM, but runtime is faster on both CPU and GPU (the exact increase in speed will vary greatly depending on your exact configuration). At this point, you could retrain this model for the right number of epochs (eight) and run it on the test set; a minimal sketch of that final step follows. This is a convincing demonstration that a 1D convnet can offer a fast, cheap alternative to a recurrent network on a word-level sentiment-classification task.
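A minimal sketch of that final step (not from the original text): rebuild the same architecture, train it for eight epochs on the training data, and evaluate on the test set. A sigmoid activation is added to the output layer here so the model produces proper probabilities for binary_crossentropy.

from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=RMSprop(lr=1e-4),
              loss='binary_crossentropy',
              metrics=['acc'])
model.fit(x_train, y_train, epochs=8, batch_size=128)

test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)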
Because 1D convnets process input patches independently, they aren't sensitive to the order of the timesteps (beyond a local scale, the size of the convolution windows), unlike RNNs. Of course, to recognize longer-term patterns, you can stack many convolution layers and pooling layers, resulting in upper layers that will see long chunks of the original inputs, but that's still a fairly weak way to induce order sensitivity. One way to evidence this weakness is to try 1D convnets on the temperature-forecasting problem, where order sensitivity is key to producing good predictions. The following example reuses these variables defined previously: float_data, train_gen, val_gen, and val_steps.
Listing 6.47. Training and evaluating a simple 1D convnet on the Jena data
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Figure 6.29 shows the training and validation MAEs.
Figure 6.29. Training and validation loss on the Jena temperature-forecasting task with a simple 1D convnet
The validation MAE stays in the 0.40s: you can't even beat the common-sense baseline using the small convnet. Again, this is because the convnet looks for patterns anywhere in the input timeseries and has no knowledge of the temporal position of a pattern it sees (toward the beginning, toward the end, and so on). Because more recent data points should be interpreted differently from older data points in the case of this specific forecasting problem, the convnet fails at producing meaningful results. This limitation of convnets isn't an issue with the IMDB data, because patterns of keywords associated with a positive or negative sentiment are informative independently of where they're found in the input sentences.
One strategy to combine the speed and lightness of convnets with the order sensitivity of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30). This is especially beneficial when you're dealing with sequences that are so long they can't realistically be processed with RNNs, such as sequences with thousands of steps. The convnet will turn the long input sequence into much shorter (downsampled) sequences of higher-level features. This sequence of extracted features then becomes the input to the RNN part of the network.
This technique isn't seen often in research papers and practical applications, possibly because it isn't well known. It's effective and ought to be more common. Let's try it on the temperature-forecasting dataset. Because this strategy allows you to manipulate much longer sequences, you can either look at data from longer ago (by increasing the lookback parameter of the data generator) or look at high-resolution timeseries (by decreasing the step parameter of the generator). Here, somewhat arbitrarily, you'll use a step that's half as large, resulting in a timeseries twice as long, where the temperature data is sampled at a rate of 1 point per 30 minutes. The example reuses the generator function defined earlier; a sketch of the adjusted generators follows.
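Here is a sketch of those adjusted generators. It assumes the generator helper defined earlier in the chapter (with lookback, delay, min_index, max_index, shuffle, step, and batch_size arguments) and carries over the earlier settings as assumptions: lookback = 1440, delay = 144, a batch size of 128, and the same train/validation/test index splits.

lookback = 1440    # assumed unchanged from the earlier setup: 10 days of observations
delay = 144        # assumed unchanged: predict the temperature 24 hours ahead
step = 3           # previously 6: one data point every 30 minutes instead of hourly

train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000,
                      shuffle=True, step=step)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000, step=step)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step)

val_steps = (300000 - 200001 - lookback) // 128
test_steps = (len(float_data) - 300001 - lookback) // 128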
This is the model, starting with two Conv1D layers and following up with a GRU layer. Figure 6.31 shows the results.
Figure 6.31. Training and validation loss on the Jena temperature-forecasting task with a 1D convnet followed by a GRU
Listing 6.49. Model combining a 1D convolutional base and a GRU layer
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))

model.summary()

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Judging from the validation loss, this setup isn't as good as the regularized GRU alone, but it's significantly faster. It looks at twice as much data, which in this case doesn't appear to be hugely helpful but may be important for other datasets.
Here’s what you should take away from this section:
- In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular natural-language processing tasks.
- Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and MaxPooling1D layers, ending in a global pooling operation or flattening operation.
- Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.
Chapter summary
- In this chapter, you learned the following techniques, which are widely applicable to any dataset of sequence data, from text to timeseries:
- How to tokenize text
- What word embeddings are, and how to use them
- What recurrent networks are, and how to use them
- How to stack RNN layers and use bidirectional RNNs to build more-powerful sequence-processing models
- How to use 1D convnets for sequence processing
- How to combine 1D convnets and RNNs to process long sequences
- You can use RNNs for timeseries regression (“predicting the future”), timeseries classification, anomaly detection in timeseries, and sequence labeling (such as identifying names or dates in sentences).
- Similarly, you can use 1D convnets for machine translation (sequence-to-sequence convolutional models, like SliceNet[a]), document classification, and spelling correction.
- If global order matters in your sequence data, then it’s preferable to use a recurrent network to process it. This is typically the case for timeseries, where the recent past is likely to be more informative than the distant past.
- If global ordering isn’t fundamentally meaningful, then 1D convnets will turn out to work at least as well and are cheaper. This is often the case for text data, where a keyword found at the beginning of a sentence is just as meaningful as a keyword found at the end.