This chapter covers
- An overview of natural language processing (NLP)
- How to approach an NLP machine learning scenario
- How to prepare data for an NLP scenario
- SageMaker’s text analytics engine, BlazingText
- How to interpret BlazingText results
Naomi heads up an IT team that handles customer support tickets for a number of companies. A customer sends a tweet to a Twitter account, and Naomi’s team replies with a resolution or a request for further information. A large percentage of the tweets can be handled by sending links to information that helps customers resolve their issues. But about a quarter of the responses are to people who need more help than that. They need to feel they’ve been heard and tend to get very cranky if they don’t. These are the customers that, with the right intervention, become the strongest advocates; with the wrong intervention, they become the loudest detractors. Naomi wants to know who these customers are as early as possible so that her support team can intervene in the right way.
She and her team have spent the past few years automating responses to the most common queries and manually escalating the queries that must be handled by a person. Naomi wants to build a triage system that reviews each request as it comes in to determine whether the response should be automatic or should be handed off to a person.
Fortunately for Naomi, she has a couple of years of historical tweets that her team has reviewed and decided whether they can be handled automatically or should be handled by a person. In this chapter, you'll take Naomi's historical data and use it to decide whether new tweets should be handled automatically or escalated to one of Naomi's team members.
As always, the first thing you want to look at is what you're making decisions about. In this chapter, the decision Naomi is making is, should this tweet be escalated to a person?
The approach that Naomi's team has taken over the past few years is to escalate tweets in which the customer appears frustrated. Her team did not apply any hard and fast rules when making this decision. They just had a feeling that the customer was frustrated, so they escalated the tweet. In this chapter, you are going to build a machine learning model that learns how to identify frustration based on the tweets that Naomi's team has previously escalated.
The process flow for this decision is shown in figure 4.1. It starts when a customer sends a tweet to the company's support account. Naomi's team reviews the tweet to determine whether they need to respond personally or whether it can be handled by their bot. The final step is a tweet response from either the bot or Naomi's team.
Naomi wants to replace the determination at step 1 of figure 4.1 with a machine learning application that can make a decision based on how frustrated the incoming tweet seems. This chapter shows you how to prepare this application.
In the previous two chapters, you prepared a synthetic data set from scratch. For this chapter, you are going to take a data set of tweets sent to companies in 2017. The data set is published by a company called Kaggle, which runs machine learning competitions.
Kaggle, competitions, and public datasets
Kaggle is a fascinating company. Founded in 2010, Kaggle gamifies machine learning by pitting teams of data scientists against each other to solve machine learning problems for prize money. In mid-2017, shortly before being acquired by Google, Kaggle announced it had reached a milestone of one million registered competitors.
Even if you have no intention of competing in data science competitions, Kaggle is a good resource to become familiar with because it has public datasets that you can use in your machine learning training and work.
To determine what data is required to solve a particular problem, you need to focus on the objective you are pursuing and, in this case, think about the minimum data Naomi needs to achieve her objective. Once you have that, you can decide whether you can achieve her objective using only that data, or whether you need to expand the data to allow Naomi to better achieve her objective.
As a reminder, Naomi's objective is to identify the tweets that should be handled by a person, based on her team's past history of escalating tweets. So Naomi's data set should contain an incoming tweet and a flag indicating whether it was escalated or not.
The data set we use in this chapter is based on a data set uploaded to Kaggle by Stuart Axelbrooke from Thought Vector. (The original data set can be viewed at https://www.kaggle.com/thoughtvector/customer-support-on-twitter/.) This data set contains over 3 million tweets sent to customer support departments for several companies ranging from Apple and Amazon to British Airways and Southwest Air.
Like every data set you'll find in your company, you can't just use this data as is. It needs to be formatted in a way that allows your machine learning algorithm to do its thing. The original data set on Kaggle contains both the original tweet and the response. In the scenario in this chapter, only the original tweet is relevant. To prepare the data for this chapter, we removed all the tweets except the original tweet and used the responses to label the original tweet as escalated or not escalated. The resulting data set contains a tweet with that label and these columns:
- tweet_id—Uniquely identifies the tweet
- author_id—Uniquely identifies the author
- created_at—Shows the time of the tweet
- in_reply_to—Shows which company is being contacted
- text—Contains the text in the tweet
- escalate—Indicates whether the tweet was escalated or not
Table 4.1 shows the first three tweets in the data set. Each of the tweets is to Sprint Care, the support team for the US phone company, Sprint. You can see that the first tweet (“and how do you propose we do that”) was not escalated by Naomi's team. But the second tweet (“I have sent several private messages and no one is responding as usual”) was escalated. Naomi's team let their automated response system handle the first tweet but escalated the second tweet to a member of her team for a personal response.
Table 4.1. Tweet dataset

| tweet_id | author_id | created_at | in_reply_to | text | escalate |
|---|---|---|---|---|---|
| 2 | 115712 | Tue Oct 31 22:11 2017 | sprintcare | @sprintcare and how do you propose we do that | False |
| 3 | 115712 | Tue Oct 31 22:08 2017 | sprintcare | @sprintcare I have sent several private messages and no one is responding as usual | True |
| 5 | 115712 | Tue Oct 31 21:49 2017 | sprintcare | @sprintcare I did. | False |
In this chapter, you'll build a machine learning application to handle the task of deciding whether to escalate the tweet. But this application will be a little different than the machine learning applications you built in previous chapters. In order to decide which tweets should be escalated, the machine learning application needs to know something about language and meaning, which you might think is pretty difficult to do. Fortunately, some very smart people have been working on this problem for a while. They call it natural language processing, or NLP.
The goal of NLP is to be able to use computers to work with language as effectively as computers can work with numbers or variables. This is a hard problem because of the richness of language. (The previous sentence is a good example of the difficulty of this problem.) The term rich means something slightly different when referring to language than it does when referring to a person. And the sentence “Well, that's rich!” can mean the opposite of how rich is used in other contexts.
Scientists have worked on NLP since the advent of computing, but it has only been recently that they have made significant strides in this area. NLP originally focused on getting computers to understand the structure of each language. In English, a typical sentence has a subject, verb, and an object, such as this sentence: “Sam throws the ball”; whereas in Japanese, a sentence typically follows a subject, object, verb pattern. But the success of this approach was hampered by the mind-boggling number and variety of exceptions and slowed by the necessity to individually describe each different language. The same code you use for English NLP won't work for Japanese NLP.
The big breakthrough in NLP occurred in 2013 when NIPS published a paper on word vectors.[1] With this approach, you don't look at parts of the language at all! You just apply a mathematical algorithm to a bunch of text and work with the output of the algorithm. This has two advantages:
1 See “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al. at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
- It naturally handles exceptions and inconsistencies in a language.
- It is language-agnostic and can work with Japanese text as easily as it can work with English text.
In SageMaker, working with word vectors is as easy as working with the data you worked with in chapters 2 and 3. But there are a few decisions you need to make when configuring SageMaker that require you to have some appreciation of what is happening under the hood.
Just as you used the pandas function get_dummies in chapter 2 to convert categorical data (such as desks, keyboards, and mice) to a wide data set, the first step in creating a word vector is to convert all the words in your text into wide datasets. As an example, the word queen is represented by the data set 0,1,0,0,0, as shown in figure 4.2. The word queen has 1 under it, while every other word in the row has a 0. This can be described as a single dimensional vector.
Using a single dimensional vector, you test for equality and nothing else. That is, you can determine whether the vector is equal to the word queen, and in figure 4.2, you can see that it is.
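A single dimensional (one-hot) vector and its equality test can be sketched in a few lines of plain Python; the five-word vocabulary and its ordering are invented for illustration:

```python
# A tiny vocabulary; the order of words fixes each word's position in the vector.
vocab = ["king", "queen", "man", "woman", "princess"]

def one_hot(word, vocab):
    """Return a vector with 1 in the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

print(one_hot("queen", vocab))  # [0, 1, 0, 0, 0]
```

All such a vector can tell you is whether two words are the same word; it carries no information about how their meanings relate.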
Mikolov's breakthrough was the realization that meaning can be captured by a multidimensional vector with the representation of each word distributed across each dimension. Figure 4.3 shows conceptually how dimensions look in a vector. Each dimension can be thought of as a group of related words. In Mikolov's algorithm, these groups of related words don't have labels, but to show how meaning can emerge from multidimensional vectors, we have provided four labels on the left side of the figure: Royalty, Masculinity, Femininity, and Elderliness.
Looking at the first dimension in figure 4.3, Royalty, you can see that the values in the King, Queen, and Princess columns are higher than the values in the Man and Woman columns; whereas, for Masculinity, the values in the King and Man columns are higher than in the others. From this you start to get the picture that a King is Masculine Royalty whereas a Queen is non-masculine Royalty. If you can imagine working your way through hundreds of vectors, you can see how meaning emerges.
Back to Naomi's problem: as each tweet comes in, the application breaks the tweet down into multidimensional vectors and compares it to the tweets labeled by Naomi's team. The application identifies which tweets in the training data set have similar vectors. It then looks at the labels of those trained tweets and assigns that label to the incoming tweet. For example, if an incoming tweet has the phrase “no one is responding as usual,” the tweets in the training data with similar vectors would likely have been escalated, and so the incoming tweet would be escalated as well.
Ayx icmga le xur amtaihcemts hbedni word vectors aj brzr rj ourgsp vrg rodws iebng defined. Luac el etesh ruogps ja z nnmoesidi jn xdr otrcve. Pkt xaeplme, nj krq wetet erewh uor tetewre zbac “vn kne jz goipsenndr az lasuu,” vry srdow as usual imhgt xd edpgoru vnrj s dienomsin wqjr hteor irspa kl wdsor zucb zc of course, yeah obviously, cyn a doy, whhic iidaectn atnfrirtous.
The King/Queen, Man/Woman example is used regularly in explanations of word vectors. Adrian Colyer's excellent blog, “the morning paper,” discusses word vectors in more detail at https://blog.acolyer.org/?s=the+amazing+power+of+word+vectors. Figures 4.2 and 4.3 are based on figures from the first part of this article. If you are interested in exploring this topic further, the rest of Adrian's article is a good place to start.
In order to work with vectors in SageMaker, the only decision you need to make is whether SageMaker should use single words, pairs of words, or word triplets when creating the groups. For example, if SageMaker uses the word pair as usual, it can get better results than if it uses the single word as and the single word usual, because the word pair expresses a different concept than do the individual words.
In our work, we normally use word pairs but have occasionally gotten better results from triplets. In one project where we were extracting and categorizing marketing terms, using triplets resulted in much higher accuracy, probably because marketing fluff is often expressed in word triplets such as world class results, high powered engine, and fat burning machine.
NLP uses the terms unigram, bigram, and trigram for single-, double-, and triple-word groups. Figures 4.4, 4.5, and 4.6 show examples of single-word (unigram), double-word (bigram), and triple-word (trigram) word groups, respectively.
As figure 4.4 shows, unigrams are single words. Unigrams work well when word order is not important. For example, if you were creating word vectors for medical research, unigrams do a good job of identifying similar concepts.
Ba figure 4.5 wohss, bigrams vtc apsir lx swrdo. Xmsrgai xowt ffwx nqwk xtpw rerod aj atitmoprn, yzhc sc jn ntmneiset siaaylsn. Xvg bigarm as usual yvecnos truaotinfrs, brd rob unigrams as ngs usual uk nrv.
As figure 4.6 shows, trigrams are groups of three words. In practice, we don't see much improvement in moving from bigrams to trigrams, but on occasion there can be. One project we worked on to identify marketing terms delivered significantly better results using trigrams, probably because the trigrams better captured the common pattern hyperbole noun noun (as in greatest coffee maker) and the pattern hyperbole adjective noun (as in fastest diesel car).
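The three n-gram sizes can be sketched in a few lines of plain Python (NLTK also provides a similar helper, nltk.ngrams):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "no one is responding as usual".split()
print(ngrams(tokens, 1))  # unigrams: single words
print(ngrams(tokens, 2))  # bigrams: note the pair ('as', 'usual')
print(ngrams(tokens, 3))  # trigrams: groups of three words
```

Only the bigram and trigram views keep as and usual together, which is why word order-sensitive tasks like sentiment analysis benefit from them.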
Jn htv acvc study, dkr machine learning tpcanolipai fwjf hcx nc lagirhtom cdllae BlazingText. Bjaq drtpcesi hhetrwe z twete lhdsuo px seltadcae.
BlazingText is a version of an algorithm, called fastText, developed by researchers at Facebook in 2017. And fastText is a version of the algorithm developed by Google's own Mikolov and others. Figure 4.7 shows the workflow after BlazingText is put into use. In step 1, a tweet is sent by a person requiring support. In step 2, BlazingText decides whether the tweet should be escalated to a person for a response. In step 3, the tweet is escalated to a person (step 3a) or handled by a bot (step 3b).
In order for BlazingText to decide whether a tweet should be escalated, it needs to determine whether the person sending the tweet is feeling frustrated or not. To do this, BlazingText doesn't actually need to know whether the person is feeling frustrated or even understand what the tweet was about. It just needs to determine how similar the tweet is to other tweets that have been labeled as frustrated or not frustrated. With that as background, you are ready to start building the model. If you like, you can read more about BlazingText on Amazon's site at https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html.
Refresher on how SageMaker is structured
Now that you're getting comfortable with using Jupyter Notebook, it is a good time to review how SageMaker is structured. When you first set up SageMaker, you created a notebook instance, which is a server that AWS configures to run your notebooks. In appendix C, we instructed you to select a medium-sized server instance because it has enough grunt to do anything we cover in this book. As you work with larger datasets in your own work, you might need to use a larger server.
When you run your notebook for the case studies in this book, AWS creates two additional servers. The first is a temporary server that is used to train the machine learning model. The second server AWS creates is the endpoint server. This server stays up until you remove the endpoint. To delete the endpoint in SageMaker, click the radio button to the left of the endpoint name, then click the Actions menu item, and click Delete in the menu that appears.
Now that you have a deeper understanding of how BlazingText works, you'll set up another notebook in SageMaker and make some decisions. You are going to do the following (as you did in chapters 2 and 3):
- Upload a data set to S3
- Set up a notebook on SageMaker
- Upload the starting notebook
- Run it against the data
Tip
If you're jumping into the book at this chapter, you might want to visit the appendixes, which show you how to do the following:
- Appendix A: sign up for AWS, Amazon's web service
- Appendix B: set up S3, AWS's file storage service
- Appendix C: set up SageMaker
To set up your data set for this chapter, you'll follow the same steps as you did in appendix B. You don't need to set up another bucket though. You can just go to the same bucket you created earlier. In our example, we called the bucket mlforbusiness, but your bucket will be called something different. When you go to your S3 account, you will see something like that shown in figure 4.8.
Click this bucket to see the ch02 and ch03 folders you created in the previous chapters. For this chapter, you'll create a new folder called ch04. You do this by clicking Create Folder and following the prompts to create a new folder.
Once you've created the folder, you are returned to the folder list inside your bucket. There you'll see you now have a folder called ch04.
Now that you have the ch04 folder set up in your bucket, you can upload your data file and start setting up the decision-making model in SageMaker. To do so, click the folder and download the data file at this link:
Then upload the CSV file into the ch04 folder by clicking Upload. Now you're ready to set up the notebook instance.
Like you did for chapters 2 and 3, you'll set up a notebook on SageMaker. If you skipped chapters 2 and 3, follow the instructions in appendix C on how to set up SageMaker.
When you go to SageMaker, you'll see your notebook instances. The notebook instance you created for chapters 2 and 3 (or that you've just created by following the instructions in appendix C) will either say Open or Start. If it says Start, click the Start link and wait a couple of minutes for SageMaker to start. Once it displays Open Jupyter, select that link to open your notebook list.
Once it opens, create a new folder for chapter 4 by clicking New and selecting Folder at the bottom of the dropdown list. This creates a new folder called Untitled Folder. To rename the folder, tick the checkbox next to Untitled Folder, and you will see the Rename button appear. Click Rename and change the name to ch04. Click the ch04 folder, and you will see an empty notebook list.
Just as we already prepared the CSV data you uploaded to S3, we've already prepared the Jupyter notebook you'll now use. You can download it to your computer by navigating to this URL:
Click Upload to upload the customer_support.ipynb notebook to the folder. After uploading the file, you'll see the notebook in your list. Click it to open it. Now, just like in chapters 2 and 3, you are a few keystrokes away from being able to run your machine learning model.
As in chapters 2 and 3, you will go through the code in six parts:
- Load and examine the data.
- Get the data into the right shape.
- Create training and validation datasets (there's no need for a test data set in this example).
- Train the machine learning model.
- Host the machine learning model.
- Test the model and use it to make decisions.
Refresher on running code in Jupyter notebooks
SageMaker uses Jupyter Notebook as its interface. Jupyter Notebook is an open-source data science application that allows you to mix code with text. As shown in the figure, the code sections of a Jupyter notebook have a gray background, and the text sections have a white background.
As in the previous two chapters, the first step is to say where you are storing the data. To do that, you need to change 'mlforbusiness' to the name of the bucket you created when you uploaded the data, and rename its subfolder to the name of the subfolder on S3 where you store the data (listing 4.1).
If you named the S3 folder ch04, then you don't need to change the name of the folder. If you kept the name of the CSV file that you uploaded earlier in the chapter, then you don't need to change the inbound.csv line of code. If you changed the name of the CSV file, then update inbound.csv to the name you changed it to. And, as always, to run the code in the notebook cell, click the cell and press Ctrl+Enter.
Listing 4.1. Say where you are storing the data
data_bucket = 'mlforbusiness'   #1
subfolder = 'ch04'              #2
dataset = 'inbound.csv'         #3
The Python modules and libraries imported in listing 4.2 are the same as the import code in chapters 2 and 3, with the exception of lines 6, 7, and 8. Lines 6 and 7 import Python's json and csv modules. The json module is used to work with data structured in JSON format (a structured markup language for describing data), and the csv module handles CSV data. These two formats define the data.
The next new library you import is NLTK (https://www.nltk.org/). This is a commonly used library for getting text ready to use in a machine learning model. In this chapter, you will use NLTK to tokenize words. Tokenizing text involves splitting the text and stripping out those things that make it harder for the machine learning model to do what it needs to do.
In this chapter, you use the standard word_tokenize function, which splits text into words in a way that consistently handles abbreviations and other anomalies. BlazingText often works better when you don't spend a lot of time preprocessing the text, so this is all you'll do to prepare each tweet (in addition to applying the labeling, of course, which you'll do in listing 4.8). To run the code, click in the notebook cell and press Ctrl+Enter.
Listing 4.2. Importing the modules
import pandas as pd                       #1
import boto3                              #2
import sagemaker                          #3
import s3fs                               #4
from sklearn.model_selection \
    import train_test_split               #5
import json                               #6
import csv                                #7
import nltk                               #8

role = sagemaker.get_execution_role()     #9
s3 = s3fs.S3FileSystem(anon=False)        #10
You've worked with CSV files throughout the book. JSON is a type of structured markup language similar to XML but simpler to work with. The following listing shows an example of an invoice described in JSON format.
Listing 4.3. Sample JSON format
{
  "Invoice": {
    "Header": {
      "Invoice Number": "INV1234833",
      "Invoice Date": "2018-11-01"
    },
    "Lines": [
      {
        "Description": "Punnet of strawberries",
        "Qty": 6,
        "Unit Price": 3
      },
      {
        "Description": "Punnet of blueberries",
        "Qty": 6,
        "Unit Price": 4
      }
    ]
  }
}
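Python's json module (imported in listing 4.2) turns text like this into nested dictionaries and lists. The invoice-total calculation below is just for illustration:

```python
import json

# The sample invoice from listing 4.3 as a JSON string.
invoice_json = """
{ "Invoice": {
    "Header": {"Invoice Number": "INV1234833", "Invoice Date": "2018-11-01"},
    "Lines": [
      {"Description": "Punnet of strawberries", "Qty": 6, "Unit Price": 3},
      {"Description": "Punnet of blueberries", "Qty": 6, "Unit Price": 4}
    ]
}}
"""

invoice = json.loads(invoice_json)  # parse the text into Python objects
total = sum(line["Qty"] * line["Unit Price"]
            for line in invoice["Invoice"]["Lines"])
print(total)  # 42
```

Once parsed, the nested structure is addressed with ordinary dictionary keys and list indexing, which is what makes JSON simpler to work with than XML.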
Next, you'll load and view the data. The data set you are loading has a half-million rows but loads in only a few seconds, even on the medium-sized server we are using in our SageMaker instance. To time and display how long the code in a cell takes to run, you can include the line %%time in the cell, as shown in the following listing.
Listing 4.4. Loading and viewing the data
%%time                                            #1
df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/{dataset}')  #2
display(df.head())                                #3
Table 4.2 shows the output of running display(df.head()). Note that using the .head() function on your DataFrame displays only the top five rows.
Table 4.2. The top five rows in the tweet dataset
You can see from the first five tweets that only one was escalated. At this point, we don't know if that is expected or unexpected. Listing 4.5 shows how many rows are in the data set and how many were escalated or not. To get this information, you run the pandas shape and value_counts functions.
Listing 4.5. Showing the number of escalated tweets in the dataset
print(f'Number of rows in dataset: {df.shape[0]}')  #1
print(df['escalate'].value_counts())                #2
The next listing shows the output from the code in listing 4.5.
Listing 4.6. Total number of tweets and the number of escalated tweets
Number of rows in dataset: 520793
False    417800
True     102993
Name: escalate, dtype: int64
Out of the data set of more than 500,000 tweets, just over 100,000 were manually escalated. If Naomi can have a machine learning algorithm read and escalate tweets, then her team will only have to read 20% of the tweets they currently review.
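The 20% figure falls straight out of the counts in listing 4.6:

```python
# Counts from listing 4.6.
total_tweets = 417800 + 102993   # 520793 rows in the data set
escalated = 102993

share = escalated / total_tweets * 100
print(round(share, 1))  # 19.8, i.e. roughly 20% of tweets need a person
```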
Now that you can see your data set in the notebook, you can start working with it. First, you create train and validation data for the machine learning model. As in the previous two chapters, you use scikit-learn's train_test_split function to create the datasets. With BlazingText, you can see the accuracy of the model in the log as it is validating the model, so there is no need to create a test data set.
Listing 4.7. Creating train and validate datasets
train_df, val_df, _, _ = train_test_split(
    df,
    df['escalate'],
    test_size=0.2,
    random_state=0)                                  #1
print(f'{train_df.shape[0]} rows in training data')  #2
print(f'{val_df.shape[0]} rows in validation data')  #3
Unlike the XGBoost algorithm that we worked with in previous chapters, BlazingText cannot work directly with CSV data. It needs the data in a different format, which you will create in listings 4.8 through 4.10.
Formatting data for BlazingText
BlazingText requires a label in the format __label__0 for a tweet that was not escalated and __label__1 for a tweet that was escalated. The label is then followed by the tokenized text of the tweet. Tokenizing is the process of taking text and breaking it into parts that are linguistically meaningful. This is a difficult task to perform but, fortunately for you, the hard work is handled by the NLTK library.
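As a rough sketch of the target format, the helper below builds one labeled line per tweet; a plain whitespace split stands in here for NLTK's word_tokenize, which handles punctuation and abbreviations more carefully:

```python
def to_blazingtext_line(text, escalate):
    """Build a '__label__N token token ...' line for one tweet.

    escalate: True for tweets a person should handle, False otherwise.
    A simple lowercase + whitespace split stands in for nltk.word_tokenize.
    """
    label = '__label__1' if escalate else '__label__0'
    tokens = text.lower().split()
    return ' '.join([label] + tokens)

print(to_blazingtext_line('No one is responding as usual', True))
# __label__1 no one is responding as usual
```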
Listing 4.8 defines two functions. The first function, preprocess, takes a DataFrame containing either the validation or training datasets you created in listing 4.7, turns it into a list, and then, for each row in the list, calls the second function, transform_instance, to convert the row to the format __label__0 or __label__1, followed by the text of the tweet. To run the preprocess function on the validation data, you call the function on the val_df DataFrame you created in listing 4.7.
You'll run this code first on the validation data set and then on the training data set. The validation data set has 100,000 rows, and this cell will take about 30 seconds to run on that data. The training data set has 400,000 rows and will take about 2 minutes to run. Most of the time is spent converting the data set to a DataFrame and back again. This is fine for a data set of a half-million rows. If you are working with a data set with millions of rows, you'll need to start working directly with the csv module rather than using pandas. To learn more about the csv module, visit https://docs.python.org/3/library/csv.html.
Listing 4.8. Transforming each row to the format used by BlazingText
def preprocess(df):
    all_rows = df.values.tolist()                    #1
    transformed_rows = list(
        map(transform_instance, all_rows))           #2
    transformed_df = pd.DataFrame(transformed_rows)  #3
    return transformed_df                            #4

def transform_instance(row):
    cur_row = []                                     #5
    label = '__label__1' if row[5] == True \
        else '__label__0'                            #6
    cur_row.append(label)                            #7
    cur_row.extend(
        nltk.word_tokenize(row[4].lower()))          #8
    return cur_row                                   #9

transformed_validation_rows = preprocess(val_df)     #10
display(transformed_validation_rows.head())          #11
The data shown in table 4.3 shows the first few rows of data in the format BlazingText requires. You can see that the first two tweets were labeled 1 (escalate), and the third row is labeled 0 (don't escalate).
Table 4.3. Validation data for Naomi's tweets

| Labeled preprocessed data |
|---|
| __label__1 @ 115990 no joke … this is one of the worst customer experiences i have had verizon . maybe time for @ 115714 @ 115911 @ att? https://t.co/vqmlkvvwxe |
| __label__1 @ amazonhelp neither man seems to know how to deliver a package . that is their entire job ! both should lose their jobs immediately. |
| __label__0 @ xboxsupport yes i see nothing about resolutions or what size videos is exported only quality i have a 34 '' ultrawide monitor 21:9 2560x1080 what i need https://t.co/apvwd1dlq8 |
Now that you have the text in the format BlazingText can work with, and that text is sitting in a DataFrame, you can use the pandas to_csv function to store the data on S3 so you can load it into the BlazingText algorithm. The code in the following listing writes out the validation data to S3.
Listing 4.9. Transforming the data for BlazingText
s3_validation_data = f's3://{data_bucket}/\
{subfolder}/processed/validation.csv'
data = transformed_validation_rows.to_csv(
    header=False,
    index=False,
    quoting=csv.QUOTE_NONE,
    sep='|',
    escapechar='^').encode()
with s3.open(s3_validation_data, 'wb') as f:
    f.write(data)
Next, you'll preprocess the training data by calling the preprocess function on the train_df DataFrame you created in listing 4.7.
Listing 4.10. Preprocessing and writing training data
%%time
transformed_train_rows = preprocess(train_df)
display(transformed_train_rows.head())

s3_train_data = f's3://{data_bucket}/{subfolder}/processed/train.csv'
data = transformed_train_rows.to_csv(
    header=False,
    index=False,
    quoting=csv.QUOTE_NONE,
    sep='|',
    escapechar='^').encode()
with s3.open(s3_train_data, 'wb') as f:
    f.write(data)
With that, the training and validation datasets are saved to S3 in a format ready for use in the model. The next section takes you through the process of getting the data into SageMaker so it's ready to kick off the training process.
Now that you have your data in a format that BlazingText can work with, you can create the training and validation datasets.
Listing 4.11. Creating the training and validation datasets
%%time
train_data = sagemaker.session.s3_input(
    s3_train_data,
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='S3Prefix')            #1
validation_data = sagemaker.session.s3_input(
    s3_validation_data,
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='S3Prefix')            #2
data_channels = {
    'train': train_data,
    'validation': validation_data}      #3
With that, the data is in a SageMaker session, and you are ready to start training the model.
Now that you have prepared the data, you can start training the model. This involves three steps:
- Setting up a container
- Setting the hyperparameters for the model
- Fitting the model
The hyperparameters are the interesting part of this code:
- epochs—Similar to the num_round parameter for XGBoost in chapters 2 and 3, it specifies how many passes BlazingText performs over the training data. We chose the value of 10 after trying lower values and seeing that more epochs were required. Depending on how the results converge or begin overfitting, you might need to shift this value up or down.
- vector_dim—Specifies the dimension of the word vectors that the algorithm learns; default is 100. We set this to 10 because experience has shown that a value as low as 10 is usually still effective and consumes less server time.
- early_stopping—Similar to early stopping in XGBoost. The number of epochs can be set to a high value, and early stopping ensures that the training finishes when it stops improving against the validation data set.
- patience—Sets how many epochs should pass without improvement before early stopping kicks in.
- min_epochs—Sets a minimum number of epochs that will be performed even if there is no improvement and the patience threshold is reached.
- word_ngrams— N-grams owvt scusddsei nj figures 4.4, 4.5, nsh 4.6 errilea nj gjra htrepca. Clireyf, unigrams ktc nglsei odwrs, bigrams toc prias el rwdso, shn trigrams txz ogrpsu xl heter sowrd.
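To make the n-gram idea concrete, here is a small illustrative sketch. The `ngrams` helper is hypothetical and not part of BlazingText; the algorithm builds these groupings internally when you set `word_ngrams`.

```python
# Hypothetical helper to illustrate unigrams, bigrams, and trigrams.
# BlazingText does this internally; you never build n-grams yourself.
def ngrams(tokens, n):
    """Return the list of n-grams (joined as strings) in a token list."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "help me i am very disappointed".split()

print(ngrams(tokens, 1))  # unigrams: single words
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent words
print(ngrams(tokens, 3))  # trigrams: groups of three adjacent words
```

Setting `word_ngrams=2`, as we do below, means the model considers both the single words and the adjacent word pairs when learning to classify a tweet.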
In listing 4.12, the first line sets up a container to run the model. A container is just the server that runs the model. The next group of lines configures the server. The set_hyperparameters function sets the hyperparameters for the model. The final line in the listing kicks off the training of the model.
Listing 4.12. Training the model
s3_output_location = f's3://{data_bucket}/{subfolder}/output'   #1
sess = sagemaker.Session()                                      #2
container = sagemaker.amazon.amazon_estimator.get_image_uri(
    boto3.Session().region_name, "blazingtext", "latest")       #3
estimator = sagemaker.estimator.Estimator(
    container,                                  #4
    role,                                       #5
    train_instance_count=1,                     #6
    train_instance_type='ml.c4.4xlarge',        #7
    train_max_run = 600,                        #8
    output_path=s3_output_location,             #9
    sagemaker_session=sess)                     #10
estimator.set_hyperparameters(
    mode="supervised",                          #11
    epochs=10,                                  #11
    vector_dim=10,                              #12
    early_stopping=True,                        #13
    patience=4,                                 #14
    min_epochs=5,                               #15
    word_ngrams=2)                              #16
estimator.fit(inputs=data_channels, logs=True)
Note
BlazingText can run in supervised or unsupervised mode. Because this chapter uses labeled text, we operate in supervised mode.
When you ran this cell in the current chapter (and in chapters 2 and 3), you saw a number of rows with red notifications pop up in the notebook. The red notifications that appear when you run this cell look very different from the XGBoost notifications.
Each type of machine learning model provides information that is relevant to understanding how the algorithm is progressing. For the purposes of this book, the most important information comes at the end of the notifications: the training and validation accuracy scores, which display when the training finishes. The model in the following listing shows a training accuracy of 98.88% and a validation accuracy of 92.28%. Each epoch is described by the validation accuracy.
Listing 4.13. Training rounds output
...
-------------- End of epoch: 9
Using 16 threads for prediction!
Validation accuracy: 0.922196
Validation accuracy improved! Storing best weights...
##### Alpha: 0.0005  Progress: 98.95%  Million Words/sec: 26.89 #####
-------------- End of epoch: 10
Using 16 threads for prediction!
Validation accuracy: 0.922455
Validation accuracy improved! Storing best weights...
##### Alpha: 0.0000  Progress: 100.00%  Million Words/sec: 25.78 #####
Training finished.
Average throughput in Million words/sec: 26.64
Total training time in seconds: 3.40
#train_accuracy: 0.9888               #1
Number of train examples: 416634
#validation_accuracy: 0.9228          #2
Number of validation examples: 104159

2018-10-07 06:56:20 Uploading - Uploading generated training model
2018-10-07 06:56:35 Completed - Training job completed
Billable seconds: 49
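If you want to pull the final scores out of this output programmatically rather than reading them off the screen, a small regex sketch works; it assumes the log keeps the `#train_accuracy:` and `#validation_accuracy:` line format shown above:

```python
import re

# A fragment of the training log output (as shown in listing 4.13)
log = """
#train_accuracy: 0.9888
Number of train examples: 416634
#validation_accuracy: 0.9228
Number of validation examples: 104159
"""

# Collect every "#<name>_accuracy: <value>" pair from the log text
scores = dict(re.findall(r'#(\w+_accuracy): ([\d.]+)', log))
train_acc = float(scores['train_accuracy'])       # 0.9888
val_acc = float(scores['validation_accuracy'])    # 0.9228
print(f"generalization gap: {train_acc - val_acc:.4f}")
```

The gap between training and validation accuracy is a quick check for overfitting: the wider it grows, the more the model is memorizing the training tweets rather than learning general patterns.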
Now that you have a trained model, you can host it on SageMaker so it is ready to make decisions (listings 4.14 and 4.15). We've covered a lot of ground in this chapter, so we'll delve into how the hosting works in a subsequent chapter. For now, just know that it is setting up a server that receives data and returns decisions.
Listing 4.14. Hosting the model
endpoint_name = 'customer-support'    #1
try:
    sess.delete_endpoint(
        sagemaker.predictor.RealTimePredictor(
            endpoint=endpoint_name).endpoint)    #2
    print(
        'Warning: Existing endpoint deleted to make way for new endpoint.')
except:
    pass
Next, in listing 4.15, you create and deploy the endpoint. SageMaker is highly scalable and can handle very large datasets. For the datasets we use in this book, you only need a t2.medium machine to host your endpoint.
Listing 4.15. Creating a new endpoint to host the model
print('Deploying new endpoint...')
text_classifier = estimator.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.t2.medium',
    endpoint_name=endpoint_name)    #1
Now that the endpoint is set up and hosted, you can start making decisions. In listing 4.16, you set a sample tweet, tokenize it, and then make a prediction.
Try changing the text in the first line and rerunning the cell to test different tweets. For example, changing the text disappointed to happy or ambivalent changes the label from 1 to 0. This means that the tweet "Help me I'm very disappointed" will be escalated, but the tweets "Help me I'm very happy" and "Help me I'm very ambivalent" will not be escalated.
Listing 4.16. Making predictions using the test data
tweet = "Help me I'm very disappointed!"    #1
tokenized_tweet = \
    [' '.join(nltk.word_tokenize(tweet))]   #2
payload = {"instances" : tokenized_tweet}   #3
response = \
    text_classifier.predict(json.dumps(payload))    #4
escalate = pd.read_json(response)           #5
escalate                                    #6
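The same tokenize-then-wrap steps work for a batch of tweets. Here is a sketch of just the payload construction, using a plain whitespace split instead of `nltk.word_tokenize` so it runs standalone (the whitespace split is cruder because it doesn't separate punctuation; the endpoint call itself is unchanged):

```python
import json

tweets = [
    "Help me I'm very disappointed!",
    "Help me I'm very happy!",
]

# Stand-in tokenizer; the notebook uses nltk.word_tokenize instead,
# which also splits off punctuation such as the trailing "!".
tokenized = [' '.join(tweet.split()) for tweet in tweets]

# This JSON string is what gets passed to text_classifier.predict(...)
payload = json.dumps({"instances": tokenized})
print(payload)
```

Sending several tweets in one `instances` list returns one prediction per tweet, so you can score a whole batch with a single call to the endpoint.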
It is important that you shut down your notebook instance and delete your endpoint. We don't want you to get charged for SageMaker services that you're not using.
Appendix D describes how to shut down your notebook instance and delete your endpoint using the SageMaker console, or you can do that with the code in the next listing.
Listing 4.17. Deleting the endpoint
# Remove the endpoint (optional)
# Comment out this cell if you want the endpoint to persist after Run All
sess.delete_endpoint(text_classifier.endpoint)
To delete the endpoint, uncomment the code in the listing, then run the code in the cell.
To shut down the notebook, go back to your browser tab where you have SageMaker open. Click the Notebook Instances menu item to view all of your notebook instances. Select the radio button next to the notebook instance name as shown in figure 4.9, then click Stop on the Actions menu. It takes a couple of minutes to shut down.
If you didn't delete the endpoint using the notebook (or if you just want to make sure it is deleted), you can do this from the SageMaker console. To delete the endpoint, click the radio button to the left of the endpoint name, then click the Actions menu item and click Delete in the menu that appears.
When you have successfully deleted the endpoint, you will no longer incur AWS charges for it. You can confirm that all of your endpoints have been deleted when you see the text "There are currently no resources" displayed at the bottom of the Endpoints page (figure 4.10).
Naomi is very pleased with your results. She can now run all the tweets received by her team through your machine learning application to determine whether they should be escalated. And it identifies frustration in about the same way her team members used to identify discontented tweets (because the machine learning algorithm was trained using her team's past decisions about whether to escalate a tweet). It's pretty amazing. Imagine how hard it would have been to try to establish rules to identify frustrated tweeters.
- You determine which tweets to escalate using natural language processing (NLP) that captures meaning by a multidimensional word vector.
- In order to work with vectors in SageMaker, the only decision you need to make is whether SageMaker should use single words, pairs of words, or word triplets when creating the groups. To indicate this, NLP uses the terms unigram, bigram, and trigram, respectively.
- BlazingText is an algorithm that learns to classify text from labeled examples, which is why you labeled your data when setting up the NLP scenario.
- NLTK is a commonly used library for getting text ready to use in a machine learning model by tokenizing text.
- Tokenizing text involves splitting the text and stripping out those things that make it harder for the machine learning model to do what it needs to do.