Chapter 5. Should you question an invoice sent by a supplier?

This chapter covers

  • What’s the real question you’re trying to answer?
  • A machine learning scenario without trained data
  • The difference between supervised and unsupervised machine learning
  • Taking a deep dive into anomaly detection
  • Using the Random Cut Forest algorithm

Brett works as a lawyer for a large bank. He is responsible for checking that the law firms hired by the bank bill the bank correctly. How tough can this be, you ask? Pretty tough is the answer. Last year, Brett’s bank used hundreds of different firms across thousands of different legal matters, and each invoice submitted by a firm contains dozens or hundreds of lines. Tracking this using spreadsheets is a nightmare.

In this chapter, you’ll use SageMaker and the Random Cut Forest algorithm to create a model that highlights the invoice lines that Brett should query with a law firm. Brett can then apply this process to every invoice to keep the lawyers working for his bank on their toes, saving the bank hundreds of thousands of dollars per year. Off we go!


5.1. What are you making decisions about?

As always, the first thing we want to look at is what we're making decisions about. In this chapter, at first glance, it appears the question Brett must decide is, should this invoice line be looked at more closely to determine if the law firm is billing us correctly? But if you build a machine learning algorithm to definitively answer that question correctly 100% of the time, you'll almost certainly fail. Fortunately for you and Brett, that is not the real question you are trying to answer.

To understand the true value that Brett brings to the bank, let's look at his process. Prior to Brett and his team performing their functions, the bank found that law firm costs were spiraling out of control. The approach that Brett's team has taken over the past few years is to manually review each invoice and use their instinct to determine whether they should query costs with the law firm. When Brett reads an invoice, he usually gets a pretty good feel for whether the costs are in line with the type of case the law firm is working on. He can tell pretty accurately whether the firm has billed an unusually high number of hours from partners rather than junior lawyers on the case, or whether a firm seems to be padding the number of hours their paralegals are spending on a matter.

When Brett comes across an apparent anomaly, an invoice that he feels has incorrect charges, he contacts the law firm and requests that they provide further information on their fees. The law firm responds in one of two ways:

  • They provide additional information to justify their fees.
  • They reduce their fees to an amount that is more in line with a typical matter of this type.

It is important to note that Brett really doesn't have a lot of clout in this relationship. If his bank instructs a law firm to work on a case, and they say that a particular piece of research took 5 hours of paralegal time, there is little Brett can do to dispute that. Brett can say that it seems like a lot of time. But the law firm can respond with, "Well, that's how long it took," and Brett has to accept that.

But this way of looking at Brett's job is too restrictive. The interesting thing about Brett's job is that Brett is effective not because he can identify 100% of the invoice lines that should be queried, but because the law firms know that Brett is pretty good at picking up anomalies. So, if the law firms charge more than they would normally charge for a particular type of service, they know they need to be prepared to justify it.

Lawyers really dislike justifying their costs, not because they can't, but because it takes time that they'd rather spend billing other clients. Consequently, when lawyers prepare their timesheets, if they know that there is a good chance that a line has more time on it than can easily be justified, they will weigh whether they should adjust their time downward or not. This decision, multiplied over the thousands of lines billed to the bank each year, results in hundreds of thousands of dollars in savings for the bank.

The real question you are trying to answer in this scenario is, which invoice lines does Brett need to query to encourage the law firms to bill the bank correctly?

And this question is fundamentally different from the original question about how to accurately determine which line is an anomaly. If you are trying to correctly identify anomalies, your success is determined by your accuracy. However, in this case, if you are simply trying to identify enough anomalies to encourage the law firms to bill the bank correctly, then your success is determined by how efficiently you can hit the threshold of enough anomalies.

What percentage of anomalies is enough anomalies?

A great deal of time and effort can be expended answering this question accurately. If a lawyer knew that one out of every thousand anomalous lines would be queried, their behavior might not change at all. But if they knew that 9 out of 10 anomalous lines would be queried, then they would probably prepare their timesheets with a little more consideration.

In an academic paper, you want to clearly identify this threshold. In the business world, you need to weigh the benefits of accuracy against the cost of not being able to work on another project because you are spending time identifying a threshold. In Brett's case, it is probably sufficient to compare the results of the algorithm against how well a member of Brett's team can perform the task. If it comes out about the same, then you've hit the threshold.


5.2. The process flow

The process flow for this decision is shown in figure 5.1. It starts when a lawyer creates an invoice and sends it to Brett (1). On receiving the invoice, Brett or a member of his team reviews the invoice (2) and then does one of two things, depending on whether the fees listed in the invoice seem reasonable:

Figure 5.1. Current workflow showing Brett’s process for reviewing invoices received from lawyers
  • The invoice is passed on to Accounts Payable for payment (3).
  • The invoice is sent back to the lawyer with a request for clarification of some of the charges (4).

With thousands of invoices to review annually, this is a full-time job for Brett and his two staff members.

Figure 5.2 shows the new workflow after you implement the machine learning application you'll build in this chapter. When the lawyer sends the invoice (1), instead of Brett or his team reviewing the invoice, it is passed through a machine learning model that determines whether the invoice contains any anomalies (2). If there are no anomalies, the invoice is passed through to Accounts Payable without further review by Brett's team (3). If an anomaly is detected, the application sends the invoice back to the lawyer and requests further information on the fees charged (4). The role Brett plays in this process is to review a certain number of these transactions to ensure the system is functioning as designed (5).

Figure 5.2. New workflow after implementing machine learning app to catch anomalies in invoices

Now that Brett is not required to review invoices, he is able to spend more time on other aspects of his role, such as maintaining and improving relationships with suppliers.


5.3. Preparing the dataset

The dataset you are using in this chapter is a synthetic dataset created by Richie. It contains 100,000 rows of invoice line data from law firms retained by Brett's bank.

Synthetic data vs. real data

Synthetic data is data created by you, the analyst, as opposed to data found in the real world. When you are working with data from your own company, your data will be real data rather than synthetic data.

A good set of real data is more fun to work with than synthetic data because it is typically more nuanced than synthetic data. With real data, there are interesting patterns you can find in the data that you weren't expecting to see. Synthetic data, on the other hand, is great in that it shows exactly the concept you want to show, but it lacks the element of surprise and the joy of discovery that working with real data provides.

In chapters 2 and 3, you worked with synthetic data (purchase order data and customer churn data). In chapter 4, you worked with real data (tweets to customer support teams). In chapter 6, you'll be back to working with real data (electricity usage data).

Law firm invoices are usually quite detailed and show how many minutes the firm spent performing each task. Law firms typically have a tiered fee structure, where junior lawyers and paralegals (staff who perform work that doesn't need to be performed by a qualified lawyer) are billed at a lower rate than senior lawyers and law firm partners. The important information on law firm invoices is the type of matter worked on (antitrust, for example), the resource that performed the work (paralegal, junior lawyer, partner, and so on), how many minutes were spent on the activity, and how much it cost. The dataset you'll use in this chapter contains the following columns:

  • Matter Number—An identifier for each invoice. If two lines have the same matter number, it means that they are on the same invoice.
  • Firm Name—The name of the law firm.
  • Matter Type—The type of matter the invoice relates to.
  • Resource—The type of resource that performs the activity.
  • Activity—The type of activity performed by the resource.
  • Minutes—How many minutes it took to perform the activity.
  • Fee—The hourly rate for the resource.
  • Total—The total fee.
  • Error—A column indicating whether the invoice line contains an error. Note that this column exists in this dataset to allow you to determine how successful the model was at picking the lines with errors. In a real-life dataset, you wouldn't have this field.

Table 5.1 shows three invoice lines in the dataset.

Table 5.1. Dataset invoice lines for the lawyers submitting invoices to the bank

Matter Number | Firm Name | Matter Type | Resource  | Activity       | Minutes | Fee | Total   | Error
0             | Cox Group | Antitrust   | Paralegal | Attend Court   | 110     | 50  | 91.67   | False
0             | Cox Group | Antitrust   | Junior    | Attend Court   | 505     | 150 | 1262.50 | True
0             | Cox Group | Antitrust   | Paralegal | Attend Meeting | 60      | 50  | 50.00   | False

In this chapter, you'll build a machine learning application to pick the lines that contain errors. In machine learning lingo, you are identifying anomalies in the data.


5.4. What are anomalies?

Anomalies are the data points that have something unusual about them. Defining unusual is not always easy. For example, the image in figure 5.3 contains an anomaly that is pretty easy to spot. All the characters in the image are capital S's with the exception of the single number 5.

Figure 5.3. A simple anomaly. It’s easy to spot the anomaly in this dataset.
Figure 5.4. A complex anomaly. It’s far more difficult to spot the second anomaly in this dataset.

But what about the image shown in figure 5.4? The anomaly is less easy to spot.

There are actually two anomalies in this dataset. The first anomaly is similar to the anomaly in figure 5.3. The number 5 in the bottom right of the image is the only number. Every other character is a letter. The last anomaly is difficult: the only characters that appear in pairs are vowels. Admittedly, the last anomaly would be almost impossible for a human to identify but, given enough data, a machine learning algorithm could find it.

Just like the images in figures 5.3 and 5.4, Brett's job is to identify anomalies in the invoices sent to his bank by law firms. Some invoices have anomalies that are easy to find. The invoice might contain a high fee for the resource, such as a law firm charging $500 per hour for a paralegal or junior lawyer, or the invoice might contain a high number of hours for a particular activity, such as a meeting being billed for 360 minutes.

But other anomalies are more difficult to find. For example, antitrust matters might typically involve longer court sessions than insolvency matters. If so, a 500-minute court session for an insolvency matter might be an anomaly, but the same court session for an antitrust matter might not.

One of the challenges you might have noticed in identifying anomalies in figures 5.3 and 5.4 is that you did not know what type of anomaly you were looking for. This is not dissimilar to identifying anomalies in real-world data. If you had been told that the anomaly had to do with numbers versus letters, you would have easily identified the 5 in figures 5.3 and 5.4. Brett, who is a trained lawyer and has been reviewing legal invoices for years, can pick out anomalies quickly and easily, but he might not consciously know why he feels that a particular line is an anomaly.

In this chapter, you will not define any rules to help the model determine which lines contain anomalies. In fact, you won't even tell the model which lines contain anomalies. The model will figure it out for itself. This is called unsupervised machine learning.


5.5. Supervised vs. unsupervised machine learning

In the example you are working through in this chapter, you could have had Brett label the invoice lines he would normally query and used that to train an XGBoost model in a manner similar to the XGBoost models you trained in chapters 2 and 3. But what if you didn't have Brett working for you? Could you still use machine learning to tackle this problem? It turns out you can.

The machine learning application in this chapter uses an unsupervised algorithm called Random Cut Forest to determine whether an invoice should be queried. The difference between a supervised algorithm and an unsupervised algorithm is that with an unsupervised algorithm, you don't provide any labeled data. You just provide the data, and the algorithm decides how to interpret it.

In chapters 2, 3, and 4, the machine learning algorithms you used were supervised. In this chapter, the algorithm you will use is unsupervised. In chapter 2, your dataset had a column called tech_approval_required that the model used to learn whether technical approval was required. In chapter 3, your dataset had a column called churned that the model used to learn whether a customer churned or not. In chapter 4, your dataset had a column called escalate to learn whether a particular tweet should be escalated.

In this chapter, you are not going to tell the model which invoices should be queried. Instead, you are going to let the algorithm figure out which invoices contain anomalies, and you will query the invoices that have anomalies over a certain threshold. This is unsupervised machine learning.
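The no-labels idea can be previewed locally before you get to SageMaker. This is a minimal sketch using scikit-learn's IsolationForest, a close relative of Random Cut Forest (not the algorithm you'll run later in the chapter); the minutes values are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up invoice minutes: most meetings run 30-90 minutes,
# but one line bills 505 minutes.
minutes = np.array([[60], [45], [75], [30], [90], [55], [505]])

# No labels are supplied -- the algorithm decides what "normal" looks like
model = IsolationForest(n_estimators=100, random_state=0).fit(minutes)

scores = model.score_samples(minutes)  # lower score = more anomalous
print(minutes[scores.argmin()][0])     # the 505-minute line scores lowest
```

The workflow is the same shape as what you'll do with SageMaker's Random Cut Forest: fit on unlabeled data, then rank points by anomaly score and query the ones above a threshold.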


5.6. What is Random Cut Forest and how does it work?

The machine learning algorithm you'll use in this chapter, Random Cut Forest, is a wonderfully descriptive name because the algorithm takes random data points (Random), cuts them to the same number of points and creates trees (Cut). It then looks at all of the trees together (Forest) to determine whether a particular data point is an anomaly—hence, Random Cut Forest.

A tree is an ordered way of storing numerical data. The simplest type of tree is called a binary tree. It's a great way to store data because it's easy and fast for a computer to work with. To create a tree, you randomly subdivide the data points until you have isolated the point you are testing to determine whether it is an anomaly. Each time you subdivide the data points, it creates a new level of the tree. The fewer times you need to subdivide the data points before you isolate the target data point, the more likely the data point is to be an anomaly for that sample of data.
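The subdivision idea can be sketched in a few lines of plain Python. This toy one-dimensional version (an illustration, not the SageMaker implementation) makes random cuts, keeps the side containing the target point, and counts the levels created before the target is alone:

```python
import random

def isolation_depth(sample, target, rng):
    """Count the random cuts needed to isolate `target` from a 1-D sample."""
    points = list(sample)
    depth = 0
    while points:  # loop until no other point shares the target's region
        lo = min(points + [target])
        hi = max(points + [target])
        cut = rng.uniform(lo, hi)
        # Keep only the points on the same side of the cut as the target
        points = [p for p in points if (p < cut) == (target < cut)]
        depth += 1
    return depth

rng = random.Random(0)
sample = [40, 45, 50, 55, 60, 65]
print(isolation_depth(sample, 62, rng))   # clustered target: tends to need many cuts
print(isolation_depth(sample, 500, rng))  # distant target: tends to be isolated quickly
```

A point far from the sample tends to be cut off in one or two splits, while a point inside the cluster needs several, which is exactly the depth signal the tree diagrams in the next two sections illustrate.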

In the two sections that follow, you'll look at two examples of trees with a target data point injected. In the first sample, the target data point will appear to be an anomaly. In the second sample, the target data point will not be an anomaly. When you look at the samples together as a forest, you'll see that the target point is not likely to be an anomaly.

5.6.1. Sample 1

Figure 5.5 shows six dark dots that represent six data points that have been pulled at random from the dataset. The white dot represents the target data point that you are testing to determine whether it is an anomaly. Visually, you can see that this white dot sits somewhat apart from the other values in this sample of data, so it might be an anomaly. But how do you determine this algorithmically? This is where the tree representation comes in.

Figure 5.5. Sample 1: The white dot represents an anomaly.

Figure 5.6 shows the top level of the tree. The top level is a single node that represents all of the data points in the sample (including the target data point you are testing). If the node contains any data points other than the target point you are testing for, the color of the node is shown as dark. (The top-level node is always dark because it represents all of the data points in the sample.)

Figure 5.6. Sample 1: Level-1 tree represents a node with all of the data points in one group.

Figure 5.7 shows the data points after the first subdivision. The dividing line is inserted at random through the data points. Each side of the subdivision represents a node in the tree.

Figure 5.7. Sample 1: Level-2 data points divided between two nodes after the first subdivision.

Figure 5.8 shows the next level of the tree. The left side of figure 5.7 becomes Node B on the left of the tree. The right side of figure 5.7 becomes Node C on the right of the tree. Both nodes in the tree are shown as dark because both sides of the subdivided diagram in figure 5.7 contain at least one dark dot.

Figure 5.8. Sample 1: Level-2 tree represents the data points split into two groups, where both nodes are shown as dark.

The next step is to further subdivide the part of the diagram that contains the target data point. This is shown in figure 5.9. You can see that Node C on the right is untouched, whereas the left side is subdivided into Nodes D and E. Node E contains only the target data point, so no further subdivision is required.

Figure 5.9. Sample 1: Level-3 data points separate the target data point from the values in the dataset.

Figure 5.10 shows the final tree. Node E is shown in white because it contains the target data point. The tree has three levels. The smaller the tree, the greater the likelihood that the point is an anomaly. A three-level tree is a pretty small tree, indicating that the target data point might be an anomaly.

Figure 5.10. Sample 1: Level-3 tree represents one of the level-2 groups split again to isolate the target data point.

Now, let's take a look at another sample of six data points that are clustered more closely around the target data point.

5.6.2. Sample 2

In the second data sample, the randomly selected data points are clustered more closely around the target data point. It is important to note that our target data point is the same data point that was used in sample 1. The only difference is that a different sample of data points was drawn from the dataset. You can see in figure 5.11 that the data points in the sample (dark dots) are more closely clustered around the target data point than they were in sample 1.

Figure 5.11. Sample 2: Level-1 data points and tree represent all of the data points in a single group.
Note

In figure 5.11 and the following figures in this section, the tree is displayed below the diagram of the data points.

Just as in sample 1, figure 5.12 splits the diagram into two sections, which we have labeled B and C. Because both sections contain dark dots, level 2 of the tree diagram is shown as dark.

Figure 5.12. Sample 2: Level-2 data points and tree represent the level-1 groups split into two groups.

Next, the section containing the target data point is split again. Figure 5.13 shows that section C has been split into two sections labeled D and E, and a new level has been added to the tree. Both of these sections contain one or more dark dots, so level 3 of the tree diagram is shown as dark.

Figure 5.13. Sample 2: Level-3 data points and tree represent one of the level-2 groups split into two groups.

The target data point is in section E, so that section is split into two sections labeled F and G as shown in figure 5.14.

Figure 5.14. Sample 2: Level-4 data points and tree represent one of the level-3 groups split into two groups.

The target data point is in section F, so that section is split into two sections labeled H and I as shown in figure 5.15. Section I contains only the target data point, so it is shown as white. No further splitting is required. The resulting diagram has 5 levels, which indicates that the target data point is not likely to be an anomaly.

Figure 5.15. Sample 2: Level-5 data points and tree represent one of the level-4 groups split into two groups, isolating the target data point.

The final step performed by the Random Cut Forest algorithm is to combine the trees into a forest. If lots of the samples have very small trees, then the target data point is likely to be an anomaly. If only a few of the samples have small trees, then it is likely to not be an anomaly.
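The forest step can be sketched by repeating the sampling-and-cutting process many times and averaging the tree depths. This is a toy one-dimensional illustration, not the actual SageMaker implementation; the num_trees and samples_per_tree names are chosen to mirror the hyperparameters discussed in the sidebar that follows:

```python
import random

def tree_depth(sample, target, rng):
    """Random cuts needed to isolate `target` from a 1-D sample."""
    points, depth = list(sample), 0
    while points:
        cut = rng.uniform(min(points + [target]), max(points + [target]))
        # Keep only the points on the same side of the cut as the target
        points = [p for p in points if (p < cut) == (target < cut)]
        depth += 1
    return depth

def forest_score(data, target, num_trees=100, samples_per_tree=6, seed=0):
    """Average tree depth over many random samples; smaller means more anomalous."""
    rng = random.Random(seed)
    depths = [tree_depth(rng.sample(data, samples_per_tree), target, rng)
              for _ in range(num_trees)]
    return sum(depths) / num_trees

data = [40, 45, 50, 52, 55, 58, 60, 62, 65, 70]
print(forest_score(data, 61))   # deeper trees on average: unlikely to be an anomaly
print(forest_score(data, 500))  # consistently shallow trees: likely an anomaly
```

Averaging over many samples is what makes the forest robust: any single sample might happen to isolate a normal point quickly, but across a hundred samples, only a genuine anomaly is consistently cut off in a shallow tree.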

You can read more about Random Cut Forest on the AWS site at https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html.

Richie’s explanation of the forest part of Random Cut Forest

Random Cut Forest partitions the dataset into the number of trees in the forest (specified by the num_trees hyperparameter). During training, a total of num_trees × num_samples_per_tree individual data points get sampled from the full dataset without replacement. For a small dataset, this can be equal to the total number of observations, but for large datasets, it need not be.

During inference, however, a brand new data point gets assigned an anomaly score by cycling through all the trees in the forest and determining what anomaly score to give it from each tree. This score then gets averaged to determine if this point should actually be considered an anomaly or not.


5.7. Getting ready to build the model

Now that you have a deeper understanding of how Random Cut Forest works, you can set up another notebook on SageMaker and make some decisions. As you did in chapters 2, 3, and 4, you are going to do the following:

  1. Upload a dataset to S3
  2. Set up a notebook on SageMaker
  3. Upload the starting notebook
  4. Run it against the data
Tip

If you're jumping into the book at this chapter, you might want to visit the appendixes, which show you how to do the following:

5.7.1. Uploading a dataset to S3

To set up the dataset for this chapter, you'll follow the same steps as you did in appendix B. You don't need to set up another bucket though. You can go to the same bucket you created earlier. In our example, we called it mlforbusiness, but your bucket will be called something different.

When you go to your S3 account, you will see the bucket you created to hold the data files for previous chapters. Click this bucket to see the ch02, ch03, and ch04 folders you created in previous chapters. For this chapter, you'll create a new folder called ch05. You do this by clicking Create Folder and following the prompts to create a new folder.

Once you've created the folder, you are returned to the folder list inside your bucket. There you'll see you now have a folder called ch05. Now that you have the ch05 folder set up in your bucket, you can upload your data file and start setting up the decision-making model in SageMaker. To do so, click the folder and download the data file at this link:

Then upload the CSV file into your ch05 folder by clicking Upload. Now you're ready to set up the notebook instance.

5.7.2. Setting up a notebook on SageMaker

Like you did in chapters 2, 3, and 4, you'll set up a notebook on SageMaker. If you skipped the earlier chapters, follow the instructions in appendix C on how to set up SageMaker.

When you go to SageMaker, you'll see your notebook instances. The notebook instance you created for earlier chapters (or that you've just created by following the instructions in appendix C) will either say Open or Start. If it says Start, click the Start link and wait a couple of minutes for SageMaker to start. Once the screen displays Open Jupyter, click the Open Jupyter link to open your notebook list.

Once it opens, create a new folder for chapter 5 by clicking New and selecting Folder at the bottom of the dropdown list. This creates a new folder called Untitled Folder. When you tick the checkbox next to Untitled Folder, you will see the Rename button appear. Click it and change the folder name to ch05. Click the ch05 folder, and you will see an empty notebook list.

Just as we already prepared the CSV data you uploaded to S3 (activities.csv), we've already prepared the Jupyter notebook you'll now use. You can download it to your computer by navigating to this URL:

Click Upload to upload the detect_suspicious_lines.ipynb notebook to the ch05 folder. After uploading the file, you'll see the notebook in your list. Click it to open it. Now, just like in the previous chapters, you are a few keystrokes away from being able to run your machine learning model.

5.8. Building the model

As in the previous chapters, you will go through the code in six parts:

  1. Load and examine the data.
  2. Get the data into the right shape.
  3. Create training and validation datasets (there's no need for a test dataset in this example).
  4. Train the machine learning model.
  5. Host the machine learning model.
  6. Test the model and use it to make decisions.
Refresher on running code in Jupyter notebooks

SageMaker uses Jupyter Notebook as its interface. Jupyter Notebook is an open-source data science application that allows you to mix code with text. As shown in the figure, the code sections of a Jupyter notebook have a gray background, and the text sections have a white background.

Sample Jupyter notebook showing text and code cells

To run the code in the notebook, click a code cell and press Ctrl+Enter. To run the entire notebook, you can select Run All from the Cell menu item at the top of the notebook. When you run the notebook, SageMaker loads the data, trains the model, sets up the endpoint, and generates decisions from the test data.

5.8.1. Part 1: Loading and examining the data

As in the previous three chapters, the first step is to say where you are storing the data. In listing 5.1, you need to change 'mlforbusiness' to the name of the bucket you created when you uploaded the data, then change the subfolder to the name of the subfolder on S3 where you want to store the data. If you named the S3 folder ch05, then you don't need to change the name of the folder. If you kept the name of the CSV file you uploaded earlier in the chapter, then you don't need to change the activities.csv line of code either. If you renamed the CSV file, then you need to update the filename with the name you changed it to. To run the code in the notebook cell, click the cell and press Ctrl+Enter.

Listing 5.1. Say where you are storing the data
data_bucket = 'mlforbusiness'    #1
subfolder = 'ch05'               #2
dataset = 'activities.csv'       #3

Next you'll import all of the Python libraries and modules that SageMaker uses to prepare the data, train the machine learning model, and set up the endpoint. The Python modules and libraries imported in listing 5.2 are the same as the imports you used in previous chapters.

Listing 5.2. Importing the modules
import pandas as pd                       #1
import boto3                              #2
import s3fs                               #3
import sagemaker                          #4
from sklearn.model_selection \
    import train_test_split               #5
import json                               #6
import csv                                #7

role = sagemaker.get_execution_role()     #8
s3 = s3fs.S3FileSystem(anon=False)        #9

The dataset contains invoice lines from all matters handled by your panel of lawyers over the past 3 months. The dataset has about 100,000 lines covering 2,000 invoices (50 lines per invoice). It contains the following columns:

  • Matter Number—An identifier for each invoice. If two lines have the same matter number, it means that they are on the same invoice.
  • Firm Name—The name of the law firm.
  • Matter Type—The type of matter the invoice relates to.
  • Resource—The resource that performs the activity.
  • Activity—The activity performed by the resource.
  • Minutes—How many minutes it took to perform the activity.
  • Fee—The hourly rate for the resource.
  • Total—The total fee.
  • Error—Indicates whether the invoice line contains an error.
Note

The Error column is not used during training because, in our scenario, this information is not known until you contact the law firm and determine whether the line was in error. This field is included here to allow you to determine how well your model is working.

Next, you'll load and view the data. In listing 5.3, you read the CSV data in activities.csv and display three of its rows in a pandas DataFrame. In this listing, you use a different way of displaying rows in the pandas DataFrame. Previously, you used the head() function to display the top 5 rows. In this listing, you use explicit numbers to display specific rows.

Listing 5.3. Loading and viewing the data
df = pd.read_csv(
     f's3://{data_bucket}/{subfolder}/{dataset}')       #1
display(df[5:8])                                        #2

In this example, the top 5 rows all show no errors. You can tell if a row shows an error by looking at the rightmost column, Error. Rows 5, 6, and 7 are displayed because they show two rows with Error = False and one row with Error = True. Table 5.2 shows the output of running display(df[5:8]).

Table 5.2. Dataset invoice lines display the three rows returned from running display(df[5:8]).

Row number | Matter Number | Firm Name | Matter Type | Resource  | Activity       | Minutes | Fee | Total   | Error
5          | 0             | Cox Group | Antitrust   | Paralegal | Attend Court   | 110     | 50  | 91.67   | False
6          | 0             | Cox Group | Antitrust   | Junior    | Attend Court   | 505     | 150 | 1262.50 | True
7          | 0             | Cox Group | Antitrust   | Paralegal | Attend Meeting | 60      | 50  | 50.00   | False

In listing 5.4, you use the pandas value_counts function to determine the error rate. You can see that out of about 100,000 rows, about 2,000 have errors, which gives a 2% error rate. Note that in a real-life scenario, you won't know the error rate, so you would have to run a small project to determine your error rate by sampling lines from invoices.
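If you want the rate itself rather than the raw counts, value_counts can normalize. This is a minimal sketch using a small stand-in DataFrame rather than the chapter's dataset:

```python
import pandas as pd

# Stand-in for the df loaded in listing 5.3: 98 clean lines, 2 errors
df = pd.DataFrame({'Error': [False] * 98 + [True] * 2})

# normalize=True returns proportions instead of counts
print(df['Error'].value_counts(normalize=True))  # 0.98 False, 0.02 True
```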

Listing 5.4. Displaying the error rate
df['Error'].value_counts()    #1

The following listing shows the output from the code in listing 5.4.

Listing 5.5. Total number of invoice lines and the number of lines with errors
False    103935
True     2030
Name: Error, dtype: int64

The next listing shows the types of matters, resources, and activities.

Listing 5.6. Describing the data
print(f'Number of rows in dataset: {df.shape[0]}')
print()
print('Matter types:')
print(df['Matter Type'].value_counts())
print()
print('Resources:')
print(df['Resource'].value_counts())
print()
print('Activities:')
print(df['Activity'].value_counts())

The results of the code in listing 5.6 are shown in listing 5.7. You can see that there are 10 different matter types, ranging from Antitrust to Securities litigation; four different types of resources, ranging from Paralegal to Partner; and four different activity types, such as Phone Call, Attend Meeting, and Attend Court.

Listing 5.7. Viewing the data description
Number of rows in dataset: 105965

Matter types:
Antitrust                 23922
Insolvency                16499
IPO                       14236
Commercial arbitration    12927
Project finance           11776
M&A                        6460
Structured finance         5498
Asset recovery             4913
Tax planning               4871
Securities litigation      4863
Name: Matter Type, dtype: int64

Resources:
Partner      26587
Junior       26543
Paralegal    26519
Senior       26316
Name: Resource, dtype: int64
Activities:
Prepare Opinion    26605
Phone Call         26586
Attend Court       26405
Attend Meeting     26369
Name: Activity, dtype: int64

The machine learning model uses these features to determine which invoice lines are potentially erroneous. In the next section, you'll work with these features to get them into the right shape for use in the machine learning model.

5.8.2. Part 2: Getting the data into the right shape

Now that you've loaded the data, you need to get the data into the right shape. This involves several steps:

  • Changing the categorical data to numerical data
  • Splitting the data set into training data and validation data
  • Removing unnecessary columns

The machine learning algorithm you'll use in this notebook is the Random Cut Forest algorithm. Just like the XGBoost algorithm you used in chapters 2 and 3, Random Cut Forest can't handle text values—everything needs to be a number. And, as you did in chapters 2 and 3, you'll use the pandas get_dummies function to convert each of the different text values in the Matter Type, Resource, and Activity columns and place a 0 or a 1 as the value in the column. For example, the rows shown in the three-column table 5.3 would be converted to a four-column table.

Table 5.3. Data before applying the get_dummies function

Matter Number | Matter Type | Resource
0             | Antitrust   | Paralegal
0             | Antitrust   | Partner

The converted table (table 5.4) has four columns because an additional column gets created for each unique value in any of the columns. Given that there are two different values in the Resource column in table 5.3, that column is split into two columns: one for each type of resource.

Table 5.4. Data after applying the get_dummies function

Matter Number | Matter_Type_Antitrust | Resource_Paralegal | Resource_Partner
0             | 1                     | 1                  | 0
0             | 1                     | 0                  | 1

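As a quick sanity check of the Table 5.3 to Table 5.4 conversion, here is a minimal sketch using made-up two-row data (not the chapter's dataset). Note that by default, pandas uses the original column name, space included, as the prefix of each new column:

```python
import pandas as pd

# Illustrative two-row sample matching Table 5.3 (not the real invoice data)
df_small = pd.DataFrame({
    'Matter Number': [0, 0],
    'Matter Type': ['Antitrust', 'Antitrust'],
    'Resource': ['Paralegal', 'Partner'],
})

# One-hot encode the two categorical columns, as in listing 5.8
encoded = pd.get_dummies(df_small, columns=['Matter Type', 'Resource'])
print(encoded.columns.tolist())
# ['Matter Number', 'Matter Type_Antitrust', 'Resource_Paralegal', 'Resource_Partner']
```

The second row gets a 1 in Resource_Partner and a 0 in Resource_Paralegal, matching Table 5.4.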
In listing 5.8, you create a pandas DataFrame called encoded_df by calling the get_dummies() function on the original pandas df DataFrame. Calling the head() function here returns the first three rows of the DataFrame.

Note that this can create very wide datasets, as every unique value becomes a column. The DataFrame you work with in this chapter increases from a 9-column table to a 24-column table. To determine how wide your table will be, you need to subtract the number of columns you are applying the get_dummies function to and then add the number of unique elements in each column. So, your original 9-column table becomes a 6-column table once you subtract the 3 columns you apply the get_dummies function to. Then it expands to a 24-column table once you add 10 columns for the unique elements in the Matter Type column and four columns each for the unique elements in the Resource and Activity columns.
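The column arithmetic can be checked with a quick back-of-the-envelope calculation, using the counts from listing 5.7:

```python
# Column-count check (values taken from the chapter's dataset description)
original_columns = 9
encoded_columns = 3            # Matter Type, Resource, and Activity are encoded
unique_values = 10 + 4 + 4     # unique matter types + resources + activities

after_get_dummies = original_columns - encoded_columns + unique_values
print(after_get_dummies)  # 24
```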

Listing 5.8. Creating the train and validate data
encoded_df = pd.get_dummies(
    df,
    columns=['Matter Type','Resource','Activity'])    #1
encoded_df.head(3)                                    #2

5.8.3. Part 3: Creating training and validation datasets

You now split the data set into train and validation data. Note that with this notebook, you don't have any test data. In a real-world situation, the best way to test the data is often to compare your success at identifying errors before using the machine learning model with your success after you use the machine learning algorithm.

A test size of 0.2 instructs the function to place 80% of the data into a train DataFrame and 20% into a validation DataFrame. If you are splitting a data set into training and validation data, you typically will place 70% of your data into a training data set, 20% into test, and 10% into validation. For the data set in this chapter, you are just splitting the data into training and test datasets as, in Brett's data, there will be no validation data.

Listing 5.9. Creating training and validation datasets
train_df, val_df, _, _ = train_test_split(
    encoded_df,
    encoded_df['Error'],
    test_size=0.2,
    random_state=0)                                  #1
print(
    f'{train_df.shape[0]} rows in training data')    #2

With that, the data is in a SageMaker session, and you are ready to start training the model.

5.8.4. Part 4: Training the model

In listing 5.10, you import the RandomCutForest function, set up the training parameters, and store the result in a variable called rcf. This all looks very similar to how you set up the training jobs in previous chapters, with the exception of the final two parameters in the RandomCutForest function.

The parameter num_samples_per_tree sets how many samples you include in each tree. Graphically, you can think of it as the number of data points in each tree. If you have lots of samples per tree, your trees will get very large before the function creates a slice that contains only the target point. Large trees take longer to calculate than small trees. AWS recommends you start with 100 samples per tree, as that provides a good middle ground between speed and size.

The parameter num_trees is the number of trees (groups of data points). This parameter should be set based on the approximate fraction of errors expected. In your data set, about 2% (or 1 in 50) of the lines are errors, so you'll set the number of trees to 50. The final line of code in the following listing runs the training job and creates the model.

Listing 5.10. Training the model
from sagemaker import RandomCutForest

session = sagemaker.Session()

rcf = RandomCutForest(role=role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location=f's3://{data_bucket}/{subfolder}/',
                      output_path=f's3://{data_bucket}/{subfolder}/output',
                      num_samples_per_tree=100,                            #1
                      num_trees=50)                                        #2

rcf.fit(rcf.record_set(train_df_no_result.values))

5.8.5. Part 5: Hosting the model

Now that you have a trained model, you can host it on SageMaker so it is ready to make decisions. If you have run this notebook already, you might already have an endpoint. To handle this, in the next listing, you delete any existing endpoints you have so you don't end up paying for a bunch of endpoints you aren't using.

Listing 5.11. Hosting the model: deleting existing endpoints
endpoint_name = 'suspicious-lines'                 #1
try:
    sess.delete_endpoint(
        sagemaker.predictor.RealTimePredictor(
            endpoint=endpoint_name).endpoint)      #2
    print(
        'Warning: Existing endpoint deleted to make way for new endpoint.')
except:
    pass

Next, in listing 5.12, you create and deploy the endpoint. SageMaker is highly scalable and can handle very large datasets. For the datasets we use in this book, you only need an ml.t2.medium machine to host your endpoint.

Listing 5.12. Hosting the model: setting machine size
rcf_endpoint = rcf.deploy(
    initial_instance_count=1,        #1
    instance_type='ml.t2.medium'     #2
)

You now need to set up the code that takes the results from the endpoint and puts them in a format you can easily work with.

Listing 5.13. Hosting the model: converting to a workable format
from sagemaker.predictor import csv_serializer, json_deserializer

rcf_endpoint.content_type = 'text/csv'
rcf_endpoint.serializer = csv_serializer
rcf_endpoint.accept = 'application/json'
rcf_endpoint.deserializer = json_deserializer

5.8.6. Part 6: Testing the model

You can now compute anomalies on the validation data as shown in listing 5.14. Here you use the val_df_no_result dataset because it does not contain the Error column (just as the training data did not contain the Error column). You then create a DataFrame called scores_df to hold the numerical values returned from the rcf_endpoint.predict function. Then you'll combine the scores_df DataFrame with the val_df DataFrame so you can see the score from the Random Cut Forest algorithm associated with each row in the validation data.

Listing 5.14. Adding scores to validation data
results = rcf_endpoint.predict(
    val_df_no_result.values)                    #1
scores_df = pd.DataFrame(results['scores'])     #2
val_df = val_df.reset_index(drop=True)          #3
results_df = pd.concat(
    [val_df, scores_df], axis=1)                #4
results_df['Error'].value_counts()              #5

To combine the data, we used the pandas concat function in listing 5.14. This function combines two DataFrames, using the index of the DataFrames. If the axis parameter is 0, it will concatenate rows. If it is 1, it will concatenate columns.

Because we have just created the scores_df DataFrame, the index for the rows starts at 0 and goes up to 21,192 (as there are 21,193 rows in the val_df and scores_df DataFrames). We then reset the index of the val_df DataFrame so that it also starts at 0. That way, when we concatenate the DataFrames, the scores line up with the correct rows in the val_df DataFrame.
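Why the reset matters can be shown with a toy example (made-up rows, not the invoice data): after a split, a DataFrame keeps its original row labels, while a freshly built scores DataFrame is numbered from 0, so concat on axis=1 would misalign them.

```python
import pandas as pd

# toy_val keeps the row labels it had before the split; toy_scores starts at 0
toy_val = pd.DataFrame({'Error': [False, True]}, index=[8, 3])
toy_scores = pd.DataFrame({'score': [1.1, 1.9]})

# Resetting the index makes both DataFrames line up row by row
aligned = pd.concat(
    [toy_val.reset_index(drop=True), toy_scores], axis=1)
print(aligned)
```

Without reset_index, concat would align on the labels 8 and 3 against 0 and 1, producing four rows padded with NaN.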

You can see from the following listing that there are 20,791 correct lines in the validation data set (val_df) and 402 errors (based on the Error column in the val_df DataFrame).

Listing 5.15. Reviewing erroneous lines
False    20791               #1
True       402               #2
Name: Error, dtype: int64

Brett believes that he and his team catch about half the errors made by law firms and that this is sufficient to generate the behavior the bank wants from their lawyers: to bill accurately because they know that if they don't, they will be asked to provide additional supporting information for their invoices.

To identify the errors with scores in the top half of the results, you use the pandas median function to find the median score of the errors and then create a DataFrame called results_above_cutoff to hold the results (listing 5.16). To confirm that you have the median, you can look at the value counts of the Error column in the DataFrame to determine that there are 201 rows in the DataFrame (half the total number of errors in the val_df DataFrame).

The next listing calculates the number of rows where the score is greater than the median score.

Listing 5.16. Calculating errors greater than the median score
score_cutoff = results_df[
    results_df['Error'] == True]['score'].median()     #1
print(f'Score cutoff: {score_cutoff}')
results_above_cutoff = results_df[
    results_df['score'] > score_cutoff]                #2
results_above_cutoff['Error'].value_counts()           #3
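The cutoff logic in listing 5.16 can be illustrated with a toy example (five made-up scores, not the real results): take the median anomaly score of the known errors, then flag every row whose score exceeds it.

```python
import pandas as pd

# Five toy rows: three known errors with scores 1.6, 1.8, and 2.1
toy = pd.DataFrame({
    'score': [0.9, 1.2, 1.6, 1.8, 2.1],
    'Error': [False, False, True, True, True],
})

# Median score among the errors becomes the cutoff
cutoff = toy[toy['Error'] == True]['score'].median()

# Rows strictly above the cutoff get flagged for review
flagged = toy[toy['score'] > cutoff]
print(cutoff, len(flagged))
```

By construction, a strict median cutoff flags about half the known errors (here, one of the three).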

And the next listing shows the number of true errors above the median score and the number of false positives.

Listing 5.17. Viewing false positives
Score cutoff: 1.58626156755      #1

True     201                     #2
False     67                     #3

Because you are looking at the value_counts of the Error column, you can also see that for the 67 rows that did not contain errors, you will query the law firm. Brett tells you that this is a better hit rate than his team typically gets. With this information, you are able to prepare the two key ratios that allow you to describe how your model is performing. These two key ratios are recall and precision:

  • Recall is the proportion of correctly identified errors over the total number of invoice lines with errors.
  • Precision is the proportion of correctly identified errors over the total number of invoice lines predicted to be errors.

These concepts are easier to understand with examples. The key numbers in this analysis that allow you to calculate recall and precision are the following:

  • There are 402 errors in the validation data set.
  • You set a cutoff to identify half the erroneous lines submitted by the law firms (201 lines).
  • When you set the cutoff at this point, you misidentify 67 correct invoice lines as being erroneous.

Aellac jc urx ebumrn lx dniifidtee roserr idivded pg rxq otatl munber lk osrerr. Ceeusca wo iedcedd rv vqz krb idamne oecrs rk eintmedre vry fcufot, rdv recall fwjf yaaslw uv 50%.

Precision is the number of correctly identified errors divided by the total number of errors predicted. The total number of errors predicted is 268 (201 + 67). The precision is 201 / 268, or 75%.
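These two ratios can be computed directly from the counts above:

```python
# Counts from the validation analysis in the text
true_positives = 201    # errors correctly flagged
false_positives = 67    # correct lines incorrectly flagged
total_errors = 402      # all errors in the validation set

recall = true_positives / total_errors
precision = true_positives / (true_positives + false_positives)
print(f'Recall: {recall:.0%}, Precision: {precision:.0%}')  # Recall: 50%, Precision: 75%
```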

Now that you have defined the cutoff, you can set a column in the results_df DataFrame that holds a value of True for rows with scores that exceed the cutoff and False for rows with scores that are less than the cutoff, as shown in the following listing.

Listing 5.18. Displaying the results in a pandas DataFrame
results_df['Prediction'] = \
    results_df['score'] > score_cutoff     #1
results_df.head()                          #2

The data set now shows the results for each invoice line in the validation data set.

Exercise:
  1. What is the score for row 356 of the val_df data set?
  2. How would you submit this single row to the prediction function to return the score for only that row?

5.9. Deleting the endpoint and shutting down your notebook instance

It is important that you shut down your notebook instance and delete your endpoint. We don't want you to get charged for SageMaker services that you're not using.

5.9.1. Deleting the endpoint

Appendix D describes how to shut down your notebook instance and delete your endpoint using the SageMaker console, or you can do that with the code in this listing.

Listing 5.19. Deleting the notebook
# Remove the endpoint (optional)
# Comment out this cell if you want the endpoint to persist after Run All
sagemaker.Session().delete_endpoint(rcf_endpoint.endpoint)

To delete the endpoint, uncomment the code in the listing, then click to run the code in the cell.

5.9.2. Shutting down the notebook instance

To shut down the notebook, go back to your browser tab where you have SageMaker open. Click the Notebook Instances menu item to view all of your notebook instances. Select the radio button next to the notebook instance name as shown in figure 5.16, then click Stop on the Actions menu. It takes a couple of minutes to shut down.

Figure 5.16. Shutting down the notebook

5.10. Checking to make sure the endpoint is deleted

If you didn't delete the endpoint using the notebook (or if you just want to make sure it is deleted), you can do this from the SageMaker console. To delete the endpoint, click the radio button to the left of the endpoint name, then click the Actions menu item and click Delete in the menu that appears.

When you have successfully deleted the endpoint, you will no longer incur AWS charges for it. You can confirm that all of your endpoints have been deleted when you see the text “There are currently no resources” displayed at the bottom of the Endpoints page (figure 5.17).

Figure 5.17. Verifying that you have successfully deleted the endpoint

Brett's team can now run each of the invoices they receive from their lawyers and determine within seconds whether they should query the invoice or not. Now Brett's team can focus on assessing the adequacy of the law firm's responses to their query rather than on whether an invoice should be queried. This will allow Brett's team to handle significantly more invoices with the same amount of effort.

Summary

  • Identify what your algorithm is trying to achieve. In Brett’s case in this chapter, the algorithm does not need to identify every erroneous line, it only needs to identify enough lines to drive the right behavior from the law firms.
  • Synthetic data is data created by you, the analyst, as opposed to real data found in the real world. A good set of real data is more interesting to work with than synthetic data because it is typically more nuanced.
  • Unsupervised machine learning can be used to solve problems where you don’t have any trained data.
  • The difference between a supervised algorithm and an unsupervised algorithm is that with an unsupervised algorithm, you don’t provide any labeled data. You just provide the data, and the algorithm decides how to interpret it.
  • Anomalies are data points that have something unusual about them.
  • Random Cut Forest can be used to address the challenges inherent in identifying anomalies.
  • Recall and precision are two of the key ratios you use to describe how your model is performing.