This chapter covers
- Identifying customers who are about to churn
- How to handle imbalanced data in your analysis
- How the XGBoost algorithm works
- Additional practice in using S3 and SageMaker
Carlos takes it personally when a customer stops ordering from his company. He’s the Head of Operations for a commercial bakery that sells high-quality bread and other baked goods to restaurants and hotels. Most of his customers have used his bakery for a long time, but he still regularly loses customers to his competitors. To help retain customers, Carlos calls those who have stopped using his bakery. He hears a similar story from each of these customers: they like his bread, but it’s expensive and cuts into their desired profit margins, so they try bread from another, less expensive bakery. After this trial, his customers conclude that the quality of their meals would still be acceptable even if they served a lower quality bread.
Churn is the term used when you lose a customer. It’s a good word for Carlos’s situation because it indicates that a customer probably hasn’t stopped ordering bread; they’re just ordering it from someone else.
Carlos comes to you for help in identifying those customers who are in the process of trying another bakery. Once he’s identified the customer, he can call them to determine if there’s something he can do to keep them. In Carlos’s conversations with his lost customers, he sees a common pattern:
- Customers place orders in a regular pattern, typically daily.
- A customer tries another bakery, thus reducing the number of orders placed with Carlos's bakery.
- The customer negotiates an agreement with the other bakery, which may or may not result in a temporary resurgence in orders placed with Carlos's bakery.
- The customer stops ordering from his bakery altogether.
In this chapter, you are going to help Carlos understand which customers are at risk of churning so he can call them and determine whether there is some way to address their move to another supplier. To help Carlos, you'll look at the business process in a similar way to how you looked at Karen's process in chapter 2.
For Karen's process, you looked at how orders moved from requester to approver, and at the features Karen used to make a decision about whether or not to send an order to a technical approver. You then built a SageMaker XGBoost application that automated that decision. Similarly, for Carlos's decision about whether to call a customer because they are at risk of churning, you'll build a SageMaker XGBoost application that looks at Carlos's customers each week and makes a decision about whether Carlos should call them.
At first glance, it looks like you are working with ordering data just as you did in chapter 2, where Karen looks at an order and decides whether or not to send it to a technical approver. In this chapter, however, Carlos reviews a customer's orders and decides whether to call that customer. The difference between Karen's process flow in chapter 2 and Carlos's process flow in this chapter is that, in chapter 2, you made decisions about orders: should Karen send this order to an approver? In this chapter, you make decisions about customers: should Carlos call a customer?
This means that instead of just taking the order data and using it as your data set, you first need to transform the order data into customer data. In later chapters, you'll learn how to use some automated tools to do this, but in this chapter, you'll learn about the process conceptually, and you'll be provided with the transformed data set. Before we look at the data, though, let's look at the process we want to automate.
Figure 3.1 shows the process flow. You start with the Orders database, which contains records of which customers have bought which products and when.
Carlos believes that there is a pattern to customers' orders before they decide to move to a competitor. This means that you need to turn order data into customer data. One of the easiest ways to think about this is to picture the data as a table like it might be displayed in Excel. Your order data has a single row for each of the orders. If there are 1,000 orders, there will be 1,000 rows in your table. If those 1,000 orders came from 100 customers, then when you turn your order data into customer data, your 1,000-row table becomes a table with 100 rows.
This is shown in step 1 in figure 3.1: transform the orders data set into a customer data set. You'll see how to do this in the next section, and a short code sketch follows below. Then we'll move on to step 2, which is the primary focus of this chapter. In step 2, you answer the question, should Carlos call a customer?
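To make step 1 concrete, here is a minimal pandas sketch (it is not part of this chapter's notebook) that collapses a small orders table shaped like table 3.1 into one row per customer. The total_spend name matches a column used later in the chapter; the order_count column is purely illustrative.

```python
import pandas as pd

# A few order rows shaped like table 3.1.
orders = pd.DataFrame({
    'customer_code': [393, 393, 840],
    'customer_name': ['Gibson Group', 'Gibson Group',
                      'Meadows, Carroll, and Cunningham'],
    'date': ['2018-08-18', '2018-08-17', '2018-08-18'],
    'amount': [264.18, 320.14, 284.12],
})

# Collapse to one row per customer: group the non-numerical fields
# and summarize the numerical field.
customers = (orders
             .groupby(['customer_code', 'customer_name'], as_index=False)
             .agg(total_spend=('amount', 'sum'),
                  order_count=('amount', 'count')))
print(customers)
```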
Rkrlt dkp’kv ererppad krp usctoemr data cuzv, gdx’ff bak zrgr data re perpare s SageMaker ntekoobo. Modn vru boootenk cj epolcmet, pxb’ff zqon data taubo z recumost er rpv SageMaker doepntin ngs trnure z oeciinds toabu rwheeht Balsor housld zsff crrg terumosc.
Ryx cusv data rvz jc txue lempis. Jr zcq kgr ortsucme hkea, stumcore ksnm, proz el rvb erodr, nuc ryk avelu lv kry errod. Tarslo pcc 3,000 otecussmr vwg, en vregeaa, lceap 3 sordre txy xvxw. Apcj snmea rcbr, evot rqv uecors lk yxr hrzc 3 nsmoth, Xlorsa derecvie 117,000 odrres (3,000 surtmceo × 3 dorres oht vkkw × 13 wseke).
Note
Throughout this book, the datasets you'll use are simplified examples of datasets you might encounter in your work. We have done this to highlight certain machine learning techniques rather than devote significant parts of each chapter to understanding the data.
To turn 117,000 rows into a 3,000-row table (one row per customer), you need to group the non-numerical data and summarize the numerical data. In the data set shown in table 3.1, the non-numerical fields are customer_code, customer_name, and date. The only numerical field is amount.
Table 3.1. The orders dataset

| customer_code | customer_name | date | amount |
|---|---|---|---|
| 393 | Gibson Group | 2018-08-18 | 264.18 |
| 393 | Gibson Group | 2018-08-17 | 320.14 |
| 393 | Gibson Group | 2018-08-16 | 145.95 |
| 393 | Gibson Group | 2018-08-15 | 280.59 |
| 840 | Meadows, Carroll, and Cunningham | 2018-08-18 | 284.12 |
| 840 | Meadows, Carroll, and Cunningham | 2018-08-17 | 232.41 |
| 840 | Meadows, Carroll, and Cunningham | 2018-08-16 | 235.95 |
| 840 | Meadows, Carroll, and Cunningham | 2018-08-15 | 184.59 |
Grouping customer_code and customer_name is easy. You want a single row per customer_code, and you can simply use the customer name associated with each customer code. In table 3.1, there are two different customer_codes in the rows, 393 and 840, and each has a company name associated with it: Gibson Group and Meadows, Carroll, and Cunningham.
Grouping the dates is the interesting part of the data set preparation in this chapter. In discussions with Carlos, you learned that he believes there is a pattern to the customers that stop using his bakery. The pattern looks like this:
- A customer believes they can use a lower quality product without impacting their business.
- They try another bakery's products.
- They set up a contract with the other bakery.
- They stop using Carlos's bakery.
The ordering pattern of a churning customer will therefore be stable over time, then drop while the customer tries a competitor's products, and then return to normal while a contract with the competitor is negotiated. Carlos believes that this pattern should be reflected in the customers' ordering behavior.
In this chapter, you'll use XGBoost to see if you can identify which customers will stop using Carlos's bakery. Although several tools exist to help you prepare the data, you won't use those tools in this chapter because the focus here is on machine learning rather than on data preparation. In a subsequent chapter, however, we'll show you how to use these tools to great effect. In this chapter, you'll take Carlos's advice that most of his customers follow a weekly ordering pattern, so you'll summarize the data by week.
You’ll apply two transformations to the data:
- Normalize the data
- Calculate the change from week to week
The first transformation calculates each week's spend as a percentage of the customer's average week. This normalizes all of the data so that instead of dollar amounts, you are looking at weekly figures relative to average sales. The second transformation shows the change from week to week. You do this because you want the machine learning algorithm to see the patterns in the weekly changes as well as the relative figures for the same time period.
Note that for this chapter, we've applied these transformations for you; later chapters will go into more detail on how to transform data. Because the purpose of this chapter is to learn more about XGBoost and machine learning, we'll simply tell you what the data looks like so you don't have to do the transformations yourself.
For our dataset, we’ll do the following:
- Take the sum of the total spent over the year for each of Carlos's customers, and call that total_spend.
- Find the average spend per week by dividing total_spend by 52.
- For each week, divide the total spent that week by the average spend per week to get the weekly spend as a percentage of the average spend.
- Create a column for each week (a short code sketch of these steps follows this list).
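If you're curious what these steps look like in code, here is a hedged pandas sketch using made-up weekly totals for a single customer. The real transformation, which was done for you in churn_data.csv, sums a full year of orders and divides by 52; the toy version below only has 4 weeks to work with.

```python
import pandas as pd

# Made-up weekly dollar totals for one customer (customer_code 393).
weekly_spend = pd.DataFrame(
    {'week_minus_4': [6800.0], 'week_minus_3': [7100.0],
     'week_minus_2': [2600.0], 'last_week': [12600.0]},
    index=[393])

total_spend = weekly_spend.sum(axis=1)     # step 1 (only 4 weeks here)
avg_weekly_spend = total_spend / 4         # step 2 (the book divides by 52)

# Steps 3 and 4: each week's spend as a multiple of the average week.
normalized = weekly_spend.div(avg_weekly_spend, axis=0)
print(normalized.round(2))
```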
Table 3.2 shows the results of this transformation.
Table 3.2. Customer dataset grouped by week after normalizing the data

| customer_code | customer_name | total_sales | week_minus_4 | week_minus_3 | week_minus_2 | last_week |
|---|---|---|---|---|---|---|
| 393 | Gibson Group | 6013.96 | 1.13 | 1.18 | 0.43 | 2.09 |
| 840 | Meadows, Carroll, and Cunningham | 5762.40 | 0.52 | 1.43 | 0.87 | 1.84 |
As table 3.3 shows, for each week from the column named week_minus_3 through last_week, you subtract the value for the preceding week and call the result the delta between the weeks. For example, in week_minus_3, the Gibson Group has sales that are 1.18 times their average week. In week_minus_4, their sales are 1.13 times their average sales. This means that their weekly sales rose by 0.05 of their normal sales from week_minus_4 to week_minus_3. This is the delta between week_minus_3 and week_minus_4, and it is recorded as 0.05 in the 4-3_delta column.
Table 3.3. Customer dataset grouped by week, showing changes per week

| customer_code | customer_name | total_sales | week_minus_4 | week_minus_3 | week_minus_2 | last_week | 4-3_delta | 3-2_delta | 2-1_delta |
|---|---|---|---|---|---|---|---|---|---|
| 393 | Gibson Group | 6013.96 | 1.13 | 1.18 | 0.43 | 2.09 | 0.05 | -0.75 | 1.66 |
| 840 | Meadows, Carroll, and Cunningham | 5762.40 | 0.52 | 1.43 | 0.87 | 1.84 | 0.91 | -0.56 | 0.97 |
The following week was a disaster in sales for the Gibson Group: sales decreased by 0.75 times their average weekly sales. This is shown by the -0.75 in the 3-2_delta column. Their sales rebounded in the last week though, as they rose to 2.09 times their average weekly sales. This is shown by the 1.66 in the 2-1_delta column.
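As a quick illustration (again, not part of the notebook), the delta columns can be computed from the normalized weekly figures with a single pandas diff call. The numbers below are the Gibson Group row from table 3.2.

```python
import pandas as pd

# Normalized weekly figures for the Gibson Group (from table 3.2).
normalized = pd.DataFrame(
    {'week_minus_4': [1.13], 'week_minus_3': [1.18],
     'week_minus_2': [0.43], 'last_week': [2.09]},
    index=[393])

# Each delta is a week's value minus the preceding week's value.
deltas = normalized.diff(axis=1).iloc[:, 1:]
deltas.columns = ['4-3_delta', '3-2_delta', '2-1_delta']
print(deltas.round(2))   # 0.05, -0.75, 1.66, matching table 3.3
```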
Now that you've prepared the data, let's move on to setting up the machine learning application by first looking at how XGBoost works.
In chapter 2, you used XGBoost to help Karen decide which approver to send an order to, but we didn't go into much detail on how it works. We'll cover this now.
XGBoost can be understood at a number of levels. How deep you go in your understanding depends on your needs. A high-level person will be satisfied with a high-level answer. A more detail-oriented person will require a more detailed understanding. Carlos and Karen will both need to understand the model well enough to show their managers they know what's going on. How deep they have to go really depends on their managers.
At the highest level, in the circle example from chapter 1 (reproduced in figure 3.2), we separated the dark circles from the light circles using two approaches:
Figure 3.2. Machine learning function to identify a group of similar items (reprinted from chapter 1)

- Rewarding the function for getting a dark circle on the right and punishing it for getting a dark circle on the left
- Rewarding the function for getting a light circle on the left and punishing it for getting a light circle on the right
This could be considered an ensemble machine learning model, which is a model that uses multiple approaches when it learns. In a way, XGBoost is also an ensemble machine learning model, which means it uses a number of different approaches to improve the effectiveness of its learning. Let's go another level deeper into the explanation.
XGBoost stands for Extreme Gradient Boosting. Consider the name in two parts:
- Gradient boosting
- Extreme
Gradient boosting is a technique where different learners are used to improve a function. You can think of this like ice hockey players stickhandling the puck down the ice. Instead of trying to push the puck straight ahead, they use small corrections to guide the puck in the right direction. Gradient boosting follows a similar approach.
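To make the idea of small corrections concrete, here is a from-scratch sketch of gradient boosting for a regression problem with squared error. It illustrates the principle rather than XGBoost's actual implementation: each shallow tree is fit to the residuals of the ensemble so far, and its prediction is added in with a small learning rate (the stickhandling correction).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: a noisy sine wave.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                      # size of each small correction
prediction = np.full_like(y, y.mean())   # start from a constant guess
trees = []

for _ in range(50):
    residuals = y - prediction           # where is the ensemble still wrong?
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge toward the target
    trees.append(tree)

print('training MSE after boosting:', np.mean((y - prediction) ** 2))
```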
The Extreme part of the name is in recognition that XGBoost has a number of other characteristics that make the model particularly accurate. For example, the model automatically handles regularization of the data so you're not inadvertently thrown off by a data set with big differences in the values you look at.
And finally, for the next level of depth, the following sidebar gives Richie's more detailed explanation.
Richie’s explanation of XGBoost
XGBoost is an incredibly powerful machine learning model. To begin with, it supports multiple forms of regularization. This is important because gradient boosting algorithms are known to have potential problems with overfitting. An overfit model is one that is very strongly tied to the unique features of the training data and does not generalize well to unseen data. As we add more rounds to XGBoost, we can see this happen when our validation accuracy starts deteriorating.
Apart from restricting the number of rounds with early stopping, XGBoost also controls overfitting with column and row subsampling and with the parameters eta, gamma, lambda, and alpha. These penalize specific aspects of the model that tend to make it fit the training data too tightly.
Another feature is that XGBoost builds each tree in parallel on all available cores. Although each step of gradient boosting needs to be carried out serially, XGBoost's use of all available cores for building each tree gives it a big advantage over other algorithms, particularly when solving more complex problems.
XGBoost also supports out-of-core computation. When data does not fit into memory, it divides that data into blocks and stores them on disk in a compressed form. It even supports sharding of these blocks across multiple disks. The blocks are then decompressed on the fly by an independent thread while loading into memory.
XGBoost has been extended to support massively parallel big data processing frameworks such as Spark, Flink, and Hadoop. This means it can be used to build extremely large and complex models with potentially billions of rows and millions of features that run at high speed.
XGBoost is sparsity aware, meaning that it handles missing values without any requirement for imputation. We have taken this for granted, but many machine learning algorithms require values for all attributes of all samples; in that case, we would have had to impute an appropriate value. This is not always easy to do without skewing the results of the model in some way. Furthermore, XGBoost handles missing values in a very efficient way: its performance is proportional to the number of present values and is independent of the number of missing values.
Finally, XGBoost implements a highly efficient algorithm for optimizing the objective, known as Newton boosting. Unfortunately, an explanation of this algorithm is beyond the scope of this book.
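As a small demonstration of the sparsity awareness Richie describes, the sketch below uses the open source xgboost Python package (not SageMaker) to train directly on data containing NaNs, with no imputation step; the data is made up.

```python
import numpy as np
import xgboost as xgb

# Four samples, two features, with NaNs left in place: no imputation.
X = np.array([[1.0, np.nan],
              [0.5, 2.0],
              [np.nan, 3.0],
              [0.7, 1.5]])
y = np.array([0, 1, 0, 1])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)  # NaN marks "missing"
params = {'objective': 'binary:logistic', 'eval_metric': 'auc'}
model = xgb.train(params, dtrain, num_boost_round=5)
print(model.predict(dtrain))
```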
You can read more about XGBoost on Amazon’s site: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
3.4.2. How the machine learning model determines whether the function is getting better or getting worse: AUC
XGBoost is good at learning. But what does it mean to learn? It simply means that the model gets punished less and rewarded more. And how does the machine learning model know whether it should be punished or rewarded? The area under the curve (AUC) is a metric that is commonly used in machine learning as the basis for rewarding or punishing the function. The guideline is simple: when the function gets a greater area under the curve, it is rewarded; when it gets a reduced AUC, it is punished.
To get a feel for how AUC works, imagine you are a celebrity at a fancy resort. You are used to being pampered, and you expect that to happen. One of the staff attending to your every whim is the shade umbrella adjuster. Let's call him Function. When Function fails to adjust the umbrella so that you are covered by its shade, you berate him. When he consistently keeps you in the shade, you give him a tip. That is how a machine learning model works: it rewards Function when he increases the AUC and punishes him when he decreases the AUC. Now over to Richie for a more technical explanation.
Richie’s explanation of the area under the curve (AUC)
When we tell XGBoost that our objective is binary:logistic, what we are asking it for is actually not a prediction of a positive or negative label. We are instead asking for the probability of a positive label. As a result, we get a continuous value between 0 and 1. It is then up to us to decide what probability will produce a positive prediction.
It might make sense to choose 0.5 (50%) as our cutoff, but at other times, we might want to be really certain of our prediction before predicting a positive. Typically, we would do this when the cost of the decision associated with a positive label is quite high. In other cases, the cost of missing a positive can be more important and can justify choosing a cutoff much less than 0.5.
The plot in this sidebar's figure shows true positives on the y-axis as a fraction between 0 and 1, and false positives on the x-axis as a fraction between 0 and 1:
- The true positive rate is the portion of all positives that are actually identified as positive by our model.
- The false positive rate is the portion of incorrect positive predictions as a fraction of all negatives.
This plot is known as an ROC curve.[a] When we use AUC as our evaluation metric, we are telling XGBoost to optimize our model by maximizing the area under the ROC curve to give us the best possible results when averaged across all cutoff probabilities between 0 and 1.
Whichever value you choose for your cutoff produces both TP (true positive) and corresponding FP (false positive) values. If you choose a probability that allows you to capture most or all of the true positives by picking a low cutoff (such as 0.1), you will also incidentally predict more negatives as positives. Whichever value you pick will be a trade-off between these two competing measures of model accuracy.
When the curve is well above the diagonal (as in this figure), you get an AUC value close to 1. A model that simply matches the TP and FP rates for each cutoff will have an AUC of 0.5 and will directly match the diagonal dotted line in the figure.
[a] ROC stands for Receiver Operating Characteristic. It was first developed by engineers during World War II for detecting enemy objects in battle, and the name has stuck.
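Richie's points are easy to reproduce with scikit-learn. The sketch below, using made-up labels and predicted probabilities, computes the AUC and then shows how two different cutoffs trade the true positive rate against the false positive rate.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and the probabilities a binary:logistic model might return.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.6])

# AUC summarizes performance over every possible cutoff.
print('AUC:', metrics.roc_auc_score(y_true, y_prob))

# Any single cutoff trades TP rate against FP rate.
for cutoff in (0.1, 0.5):
    y_pred = (y_prob >= cutoff).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    print(f'cutoff {cutoff}: TP rate {tp / (tp + fn):.2f}, '
          f'FP rate {fp / (fp + tn):.2f}')
```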
Now that you have a deeper understanding of how XGBoost works, you can set up another notebook on SageMaker and make some decisions. As you did in chapter 2, you are going to do the following:
- Upload a data set to S3.
- Set up a notebook on SageMaker.
- Upload the starting notebook.
- Run it against the data.
Along the way, we'll go into some details we glossed over in chapter 2.
Tip
If you're jumping into the book at this chapter, you might want to visit the appendixes, which show you how to do the following:
- Appendix A: sign up for AWS, Amazon's web service
- Appendix B: set up S3, AWS's file storage service
- Appendix C: set up SageMaker
To set up the data set for this chapter, you'll follow the same steps as you did in appendix B. You don't need to set up another bucket though; you can just use the same bucket you created earlier. In our example, we called the bucket mlforbusiness, but your bucket will be called something different. When you go to your S3 account, you will see something like that shown in figure 3.3.
Click this bucket to see the ch02 folder you created in the previous chapter. For this chapter, you'll create a new folder called ch03. You do this by clicking Create Folder and following the prompts to create a new folder.
Once you've created the folder, you are returned to the folder list inside your bucket. There you will see that you now have a folder called ch03.
Now that you have the ch03 folder set up in your bucket, you can upload your data file and start setting up the decision-making model in SageMaker. To do so, click the folder and download the data file at this link:
Then upload the CSV file into your ch03 folder by clicking the Upload button. Now you're ready to set up the notebook instance.
Just as you did in chapter 2, you'll set up a notebook on SageMaker. This process is much faster for this chapter because, unlike in chapter 2, you now have a notebook instance set up and ready to run. You just need to start it and upload the Jupyter notebook we prepared for this chapter. (If you skipped chapter 2, follow the instructions in appendix C on how to set up SageMaker.)
When you go to SageMaker, you'll see your notebook instances. The notebook instance you created for chapter 2 (or that you've just created by following the instructions in appendix C) will say either Open or Start. If it says Start, click the Start link and wait a couple of minutes for SageMaker to start. Once the screen displays Open Jupyter, select that link to open your notebook list.
Once it opens, create a new folder for chapter 3 by clicking New and selecting Folder at the bottom of the dropdown list (figure 3.4). This creates a new folder called Untitled Folder.
When you tick the checkbox next to Untitled Folder, the Rename button appears. Click it, and change the folder name to ch03 (figure 3.5).
Click the ch03 folder, and you will see an empty notebook list. Just as we already prepared the CSV data you uploaded to S3 (churn_data.csv), we've already prepared the Jupyter notebook you'll now use. You can download it to your computer by navigating to this URL:
Click Upload to upload the customer-churn.ipynb notebook to the ch03 folder (figure 3.6).
After uploading the file, you'll see the notebook in your list. Click it to open it. Now, just like in chapter 2, you are a few keystrokes away from being able to run your machine learning model.
As in chapter 2, you will go through the code in six parts:
- Load and examine the data.
- Get the data into the right shape.
- Create training, validation, and test datasets.
- Train the machine learning model.
- Host the machine learning model.
- Test the model and use it to make decisions.
First, you need to tell SageMaker where your data is. Update the code in the first cell of the notebook to point to your S3 bucket and folder (listing 3.1). If you called your S3 folder ch03 and did not rename the churn_data.csv file, then you only need to update the name of the data bucket to the name of the S3 bucket you uploaded the data to. Once you have done that, you can actually run the entire notebook. As you did in chapter 2, to run the notebook, click Cell in the toolbar at the top of the Jupyter notebook, then click Run All.
Listing 3.1. Setting up the notebook and storing the data
```
data_bucket = 'mlforbusiness'   # 1
subfolder = 'ch03'              # 2
dataset = 'churn_data.csv'      # 3
```
When you run the notebook, SageMaker loads the data, trains the model, sets up the endpoint, and generates decisions from the test data. SageMaker takes about 10 minutes to complete these actions, so you have time to get yourself a cup of coffee or tea while this is happening.
When you return with your hot beverage, if you scroll to the bottom of your notebook, you should see the decisions that were made on the test data. But before we get into that, let's work through the notebook.
Back at the top of the notebook, you'll see the cell that imports the Python libraries and modules used in this notebook. You'll hear more about these in a subsequent chapter. For now, let's move to the next cell. If you didn't click Run All in the notebook, click the cell and press Ctrl+Enter to run the code in that cell, as shown in listing 3.1.
Moving on to the next cell, you will now import all of the Python libraries and modules that SageMaker uses to prepare the data, train the machine learning model, and set up the endpoint.
As you learned in chapter 2, pandas is one of the most commonly used Python libraries in data science. In the code cell shown in listing 3.2, you'll import pandas as pd. When you see pd in the cell, it means you are using a pandas function. Other items that you import include these:
- boto3—Amazon's Python library that helps you work with AWS services in Python
- sagemaker—Amazon's Python module for working with SageMaker
- s3fs—A module that makes it easier to use boto3 to manage files on S3
- sklearn.metrics—A new import (it wasn't used in chapter 2). This module lets you generate summary reports on the output of the machine learning model.
Listing 3.2. Importing the modules
```
import pandas as pd                                    # 1
import boto3                                           # 2
import sagemaker                                       # 3
import s3fs                                            # 4
from sklearn.model_selection import train_test_split  # 5
import sklearn.metrics as metrics                      # 6

role = sagemaker.get_execution_role()                  # 7
s3 = s3fs.S3FileSystem(anon=False)                     # 8
```
In the cell in listing 3.3, we use the pandas read_csv function to read our data and the head function to display the top five rows. This is one of the first things you'll do in each of the chapters so you can see the data and understand its shape. To load and view the data, click the cell with your mouse to select it, and then press Ctrl+Enter to run the code.
Listing 3.3. Loading and viewing the data
```
df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/{dataset}')  # 1
df.head()                                         # 2
```
You can see that the data has a single customer per row and that it reflects the format of the data in table 3.3. The first column in table 3.4 indicates whether the customer churned or did not churn. If the customer churned, the first column contains a 1. If they remain a customer, it shows a 0. Note that these data rows are provided by way of example, and the rows of data you see might be different.
Table 3.4. Dataset for Carlos's customers displayed in Excel

| churned | id | customer_code | co_name | total_spend | week_minus_4 | week_minus_3 | week_minus_2 | last_week | 4-3_delta | 3-2_delta | 2-1_delta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1826 | Hoffman, Martinez, and Chandler | 68567.34 | 0.81 | 0.02 | 0.74 | 1.45 | 0.79 | -0.72 | -0.71 |
| 0 | 2 | 772 | Lee Martin and Escobar | 74335.27 | 1.87 | 1.02 | 1.29 | 1.19 | 0.85 | -0.27 | 0.10 |
You can see from the first five customers that none of them have churned. This is what you would expect, because Carlos doesn't lose that many customers.
To see how many rows are in the data set, you run the pandas shape function as shown in listing 3.4. To see how many customers in the data set have churned, you run the pandas value_counts function.
Listing 3.4. Number of churned customers in the dataset
```
print(f'Number of rows in dataset: {df.shape[0]}')  # 1
print(df['churned'].value_counts())                 # 2
```
You can see from this data that out of 2,999 rows of data, 166 customers have churned. This represents a churn rate of about 5% per week, which is higher than the rate that Carlos experiences. Carlos's true churn rate is about 0.5% per week (or about 15 customers per week).
We did something a little sneaky with the data in this instance to bring the churn rate up to this level. The data set actually contains the churned customers from the past three months plus a random selection of non-churned customers over that same period, to bring the total number of customers up to 2,999 (the actual number of customers that Carlos has). We did this because we are going to cover how to handle extremely rare events in a subsequent chapter, and for this chapter, we wanted to use a toolset similar to the one we used in chapter 2.
There are risks in the approach we took with the data in this chapter. If there are differences in the ordering patterns of churned customers over the past three months, then our results might be invalid. In discussions with Carlos, he said he believes that the patterns of churning and normal customers remain steady over time, so we felt confident we could use this approach.
The other point to note is that this approach might not be well received if we were writing an academic paper. One of the lessons you'll learn as you work with your own company's data is that you rarely get everything you want. You have to constantly assess whether you can make good decisions based on the data you have.
Now that you can see your data set in the notebook, you can start working with it. XGBoost can only work with numbers, so we need to either remove our categorical data or encode it.
Encoding data means that you set each distinct value in the data set as a column and then put a 1 in the rows that contain the value of that column and a 0 in the other rows of that column. This worked well for the products in Karen's data set, but it won't help you with Carlos's data set. That's because the categorical values here (co_name, customer_code, and id) are unique—each occurs only once in the data set. Turning these into columns would not improve the model.
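For comparison, here is what encoding would look like using the pandas get_dummies function on a couple of company names. Because every name in Carlos's data set is unique, each encoded column would contain a single 1, giving the model nothing useful to learn from.

```python
import pandas as pd

sample = pd.DataFrame({'co_name': [
    'Gibson Group', 'Meadows, Carroll, and Cunningham']})

# One 0/1 column per distinct value. With all-unique values, each
# column is 1 for exactly one row: no usable signal.
print(pd.get_dummies(sample['co_name']))
```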
Your best approach in this case is also the simplest approach: just remove the categorical data. To remove the data, use the pandas drop function, then display the first five rows of the data set again by using the head function. You use axis=1 to indicate that you want to remove columns rather than rows from the pandas DataFrame.
Listing 3.5. Removing the categorical data
```
encoded_data = df.drop(
    ['id', 'customer_code', 'co_name'], axis=1)  # 1
encoded_data.head()                              # 2
```
Removing the columns shows the data set without the categorical information (table 3.5).
Table 3.5. The transformed dataset without categorical information

| churned | total_spend | week_minus_4 | week_minus_3 | week_minus_2 | last_week | 4-3_delta | 3-2_delta | 2-1_delta |
|---|---|---|---|---|---|---|---|---|
| 0 | 68567.34 | 0.81 | 0.02 | 0.74 | 1.45 | 0.79 | -0.72 | -0.71 |
| 0 | 74335.27 | 1.87 | 1.02 | 1.29 | 1.19 | 0.85 | -0.27 | 0.10 |
Now that you have your data in a format that XGBoost can work with, you can split the data into test, validation, and training datasets as you did in chapter 2. One important difference in the approach we are taking going forward is that we are using the stratify parameter during the split.
The stratify parameter is particularly useful for datasets where the target variable you are predicting is relatively rare. The parameter works by shuffling the deck as the data is split, making sure that the train, validate, and test datasets contain similar ratios of target variables. This ensures that the model does not get thrown off course by an unrepresentative selection of customers in any of the datasets.
We glossed over this code in chapter 2. We'll go into more depth here and show you how to use stratify (listing 3.6). You create training and testing samples from a data set, with 70% allocated to the training data and 30% allocated to the testing and validation samples. The stratify argument tells the function to use y to stratify the data so that a random sample is balanced proportionally according to the distinct values in y.
Tvg tgihm neicto rusr odr xkpa re sitlp uxr data krc jc ltgiyslh tdfreinef rnqs urk gvzx ucyk jn chapter 2. Xesuace bkb ctv gnius stratify, uhv exdc vr elpxytilic dearlce kqtb etgrat nucmlo (churned nj jcrb alpxeem). Cyx stratify nntucifo urrestn c couelp le ddoatiinal svaeul rsrb vub nvq’r xszt taobu. Ckp unedscersor nj kpr y = test_and_val_data knjf (hseot ninggeibn jbwr val_df) xzt piylms slcdelohrepa xlt variables. Gnx’r rorwy lj ajqr messe s rjq aacrne. Byv qnx’r pxnv re nurdsnated zurj hctr le roy khax jn edror vr rntia, tdeivlaa, pns krzr rbk eomld.
You then split the testing and validation data, with two-thirds allocated to validation and one-third to testing (listing 3.6). Over the entire data set, 70% of the data is allocated to training, 20% to validation, and 10% to testing.
Listing 3.6. Creating the training, validation, and test datasets
```
y = encoded_data['churned']                     # 1
train_df, test_and_val_data, _, _ = train_test_split(
    encoded_data, y, test_size=0.3,
    stratify=y, random_state=0)                 # 2

y = test_and_val_data['churned']
val_df, test_df, _, _ = train_test_split(
    test_and_val_data, y, test_size=0.333,
    stratify=y, random_state=0)                 # 3

print(train_df.shape, val_df.shape, test_df.shape)
print()
print('Train')
print(train_df['churned'].value_counts())      # 4
print()
print('Validate')
print(val_df['churned'].value_counts())
print()
print('Test')
print(test_df['churned'].value_counts())
```
Just as you did in chapter 2, you convert the three datasets to CSV and save them to S3. The following listing creates the datasets that you'll save to the same S3 folder as your original churn_data.csv file.
Listing 3.7. Converting the datasets to CSV and saving to S3
```
train_data = train_df.to_csv(None, header=False, index=False).encode()
val_data = val_df.to_csv(None, header=False, index=False).encode()
test_data = test_df.to_csv(None, header=True, index=False).encode()

with s3.open(f'{data_bucket}/{subfolder}/processed/train.csv', 'wb') as f:
    f.write(train_data)   # 1
with s3.open(f'{data_bucket}/{subfolder}/processed/val.csv', 'wb') as f:
    f.write(val_data)     # 2
with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv', 'wb') as f:
    f.write(test_data)    # 3

train_input = sagemaker.s3_input(
    s3_data=f's3://{data_bucket}/{subfolder}/processed/train.csv',
    content_type='csv')
val_input = sagemaker.s3_input(
    s3_data=f's3://{data_bucket}/{subfolder}/processed/val.csv',
    content_type='csv')
```
Figure 3.7 shows the datasets you now have in S3.
Now you train the model. In chapter 2, we didn't go into much detail about what happens during training. Now that you have a better understanding of XGBoost, we'll explain the process a bit more.
The interesting parts of the following listing (listing 3.8) are the estimator hyperparameters. We'll discuss max_depth and subsample in a later chapter. For now, the hyperparameters of interest to us are:
- objective—As in chapter 2, you set this hyperparameter to binary:logistic. You use this setting when your target variable is 1 or 0. If your target variable is a multiclass variable or a continuous variable, you use other settings, as we'll discuss in later chapters.
- eval_metric—The evaluation metric you are optimizing for. The metric argument auc stands for area under the curve, as discussed by Richie earlier in the chapter.
- num_round—How many times you want to let the machine learning model run through the training data (the number of rounds). With each loop through the data, the function gets better at separating the dark circles from the light circles, for example (to refer back to the explanation of machine learning in chapter 1). After a while though, the model gets too good; it begins to find patterns in the training data that are not reflected in the real world. This is called overfitting. The larger the number of rounds, the more likely you are to be overfitting. To avoid this, you set early stopping rounds.
- early_stopping_rounds—The number of rounds after which the training stops if the algorithm fails to improve.
- scale_pos_weight—The scale positive weight is used with imbalanced datasets to make sure the model puts enough emphasis on correctly predicting the rare class during training. In the current data set, about 1 in 17 customers will churn, so we set scale_pos_weight to 17 to accommodate this imbalance. This tells XGBoost to focus more on customers who actually churn than on the customers who remain happy.
Note
If you have the time and interest, try training your model without setting scale_pos_weight and see what effect this has on your results.
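Rather than hardcoding 17, you could also derive the value from the training data itself. A common heuristic is the ratio of negative to positive examples, sketched below (assuming the train_df DataFrame created in listing 3.6).

```python
# Negative-to-positive ratio as a starting point for scale_pos_weight.
counts = train_df['churned'].value_counts()
print(counts[0] / counts[1])   # roughly 17 for this dataset
```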
Listing 3.8. Training the model
```
sess = sagemaker.Session()

container = sagemaker.amazon.amazon_estimator.get_image_uri(
    boto3.Session().region_name, 'xgboost', 'latest')

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',                     # 1
    output_path=f's3://{data_bucket}/{subfolder}/output',  # 2
    sagemaker_session=sess)

estimator.set_hyperparameters(
    max_depth=3,
    subsample=0.7,
    objective='binary:logistic',  # 3
    eval_metric='auc',            # 4
    num_round=100,                # 5
    early_stopping_rounds=10,     # 6
    scale_pos_weight=17)          # 7

estimator.fit({'train': train_input, 'validation': val_input})
```
When you ran this cell in the current chapter (and in chapter 2), you saw a number of rows of red notifications pop up in the notebook. We passed over these without comment, but they actually contain some interesting information. In particular, you can see whether the model is overfitting by looking at this data.
Richie’s explanation of overfitting
We touched on overfitting earlier in the XGBoost explanation. Overfitting is the process of building a model that maps too closely or exactly to the provided training data and fails to predict unseen data as accurately or reliably. This is also sometimes described as a model that does not generalize well. Unseen data includes test data, validation data, and the data that will be provided to your endpoint in production.
When you run the training, the model does a couple of things in each round: first it trains, and second it validates. The red notifications that you see are the result of that validation process. As you read through the notifications, you can see that the validation score improves over the early rounds and then starts getting worse (in the run shown in listing 3.9, the best score comes in round 6).
What you are seeing is overfitting. The algorithm keeps improving at building a function that separates the dark circles from the light circles in the training set (as in chapter 1), but it is getting worse at doing so in the validation data set. This means the model is starting to pick up patterns in the training data that do not exist in the real world (or at least not in our validation data set).
One of the great features of XGBoost is that it deftly handles overfitting for you. The early_stopping_rounds hyperparameter in listing 3.8 stops the training when there's been no improvement in the past 10 rounds.
The output shown in listing 3.9 is taken from the output of the Train the Model cell in the notebook. You can see that round 15 has an AUC of 0.976057, that round 16 has an AUC of 0.975683, and that neither of these is better than the previous best of 0.980493 from round 6. Because we set early_stopping_rounds=10, the training stops at round 16, which is 10 rounds past the best result in round 6.
Listing 3.9. Training rounds output
```
[15]#011train-auc:0.98571#011validation-auc:0.976057
[16]#011train-auc:0.986562#011validation-auc:0.975683
Stopping. Best iteration:
[6]#011train-auc:0.97752#011validation-auc:0.980493
```
Now that you have a trained model, you can host it on SageMaker so it is ready to make decisions (listing 3.10). We've covered a lot of ground in this chapter, so we'll delve into how the hosting works in a subsequent chapter. For now, just know that it sets up a server that receives data and returns decisions.
Listing 3.10. Hosting the model
```
endpoint_name = 'customer-churn'

try:
    sess.delete_endpoint(
        sagemaker.predictor.RealTimePredictor(
            endpoint=endpoint_name).endpoint)
    print(
        'Warning: Existing endpoint deleted to make way for new endpoint.')
except:
    pass

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',  # 1
    endpoint_name=endpoint_name)

from sagemaker.predictor import csv_serializer, json_serializer
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
```
Now that the endpoint is set up and hosted, you can start making decisions. Start by running your test data through the system to see how the model works on data it hasn't seen before.
The first three lines in listing 3.11 create a function that returns 1 if the customer is more likely to churn and 0 if they are less likely to churn. The next two lines open the test CSV file you created in listing 3.7. And the last two lines apply the get_prediction function to every row in the test data set and display the data.
Listing 3.11. Making predictions using the test data
```
def get_prediction(row):
    prob = float(predictor.predict(row[1:]).decode('utf-8'))
    return 1 if prob > 0.5 else 0   # 1

with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv') as f:
    test_data = pd.read_csv(f)

test_data['prediction'] = test_data.apply(get_prediction, axis=1)
# drop=False keeps the column so listing 3.12 can run value_counts on it.
test_data.set_index('prediction', inplace=True, drop=False)
test_data[:10]
```
In your results, you want to show only a 1 or a 0. If the predicted probability is greater than 0.5 (if prob > 0.5), get_prediction sets the prediction to 1. Otherwise, it sets the prediction to 0.
The results look pretty good (table 3.6). Every row that has a 1 in the churned column also has a 1 in the prediction column. There are some rows with a 1 in the prediction column and a 0 in the churned column, which means Carlos is going to call those customers even though they are not at risk of churning. But this is acceptable to Carlos. Far better to call a customer that's not going to churn than to not call a customer that will churn.
Table 3.6. Results of the test

| prediction | churned | total_spend | week_minus_4 | week_minus_3 | week_minus_2 | last_week | 4-3_delta | 3-2_delta | 2-1_delta |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17175.67 | 1.47 | 0.61 | 1.86 | 1.53 | 0.86 | -1.25 | 0.33 |
| 0 | 0 | 68881.33 | 0.82 | 2.26 | 1.59 | 1.72 | -1.44 | 0.67 | -0.13 |
| … | … | … | … | … | … | … | … | … | … |
| 1 | 1 | 71528.99 | 2.48 | 1.36 | 0.09 | 1.24 | 1.12 | 1.27 | -1.15 |
To see how the model performs overall, you can look at how many customers churned in the test data set compared to how many customers Carlos would have called. To do this, you use the value_counts function as shown in the next listing.
Listing 3.12. Checking the predictions made using the test data
```
print(test_data['churned'].value_counts())     # 1
print(test_data['prediction'].value_counts())  # 2
print(
    metrics.accuracy_score(
        test_data['churned'],
        test_data['prediction']))              # 3
```
The value_counts function shows that Carlos would have called 33 customers and that, if he did nothing, 17 would have churned:

```
0    283
1     17
Name: churned, dtype: int64
0    267
1     33
Name: prediction, dtype: int64
0.9467
```

But this isn't very helpful, for two reasons:

- The accuracy score tells us that 94.67% of our predictions are correct, but that's not as good as it sounds, because only about 6% of Carlos's customers churned. If we were to guess that none of our customers churned, we would be about 94% accurate.
- It doesn't tell us how many of the customers Carlos called would have churned.

For this, you need to create a confusion matrix.
A confusion matrix is one of the most confusingly named terms in machine learning. But because it is also one of the most helpful tools for understanding the performance of a model, we'll cover it here.
Although the term is confusing, creating a confusion matrix is easy. You use an sklearn function, as shown in the next listing.
Listing 3.13. Creating the confusion matrix
```
print(
    metrics.confusion_matrix(  # 1
        test_data['churned'],
        test_data['prediction']))
```
A confusion matrix is a table containing an equal number of rows and columns. The number of rows and columns corresponds to the number of possible values (classes) of the target variable. In Carlos's data set, the target variable can be a 0 or a 1, so the confusion matrix has two rows and two columns. In a more general sense, the rows of the matrix represent the actual class, and the columns represent the predicted class. (Note: Wikipedia currently has rows and columns reversed relative to this explanation; however, our description matches the way the sklearn confusion_matrix function works.)
In the following output, the first row represents happy customers (0), and the second row represents churns (1). The left column shows predicted happy customers, and the right column shows predicted churns. For Carlos, the right column also shows how many customers he called. You can see that Carlos called 16 customers who did not churn and 17 customers who did churn.
```
[[267  16]
 [  0  17]]
```
Importantly, the 0 at the bottom left shows how many customers who churned were predicted not to churn and so did not get a call. To Carlos's great satisfaction, that number is 0.
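If you find the unlabeled matrix hard to read, a small optional addition to the notebook wraps it in a labeled pandas DataFrame so the actual-versus-predicted layout is explicit.

```python
import pandas as pd

cm = metrics.confusion_matrix(
    test_data['churned'], test_data['prediction'])
print(pd.DataFrame(
    cm,
    index=['actual: no churn', 'actual: churn'],
    columns=['predicted: no churn', 'predicted: churn']))
```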
Richie’s note on interpretable machine learning
Throughout this book, we focus on providing examples of business problems that can be solved by machine learning using one of several algorithms. We also attempt to explain in high-level terms how these algorithms work. Generally, we use fairly simple metrics, such as accuracy, to indicate whether a model is working or not. But what if you were asked to explain why your model worked?
Which of your features really matter most in determining whether the model works, and why? For example, is the model biased in ways that can harm minority groups in your customer base or workforce? Questions like this are becoming increasingly prevalent, particularly due to the widespread use of neural networks, which are particularly opaque.
One advantage of XGBoost over neural networks (which we have not previously touched on) is that XGBoost supports the examination of feature importances to help address the explainability issue. At the time of writing, Amazon does not support this directly in the SageMaker XGBoost API; however, the model is stored on S3 as an archive named model.tar.gz. By accessing this file, we can view feature importances. The following listing provides sample code showing how to do this.
Listing 3.14. Sample code used to access SageMaker’s XGBoost model.tar.gz
```
import pickle    # imports added so the snippet is self-contained
import tarfile

model_path = f'{estimator.output_path}/' \
             f'{estimator._current_job_name}/output/model.tar.gz'
s3.get(model_path, 'xgb_tar.gz')
with tarfile.open('xgb_tar.gz') as tar:
    with tar.extractfile('xgboost-model') as m:
        xgb_model = pickle.load(m)

xgb_scores = xgb_model.get_score()
print(xgb_scores)
```
Note that we do not include this code in the notebook, as it is beyond the scope of what we want to cover here. But for those of you who want to dive deeper, you can do so using this code; for more details, see the XGBoost documentation.
It is important that you shut down your notebook instance and delete your endpoint. We don't want you to get charged for SageMaker services that you're not using.
Appendix D describes how to shut down your notebook instance and delete your endpoint using the SageMaker console, or you can do that with the code in the next listing.
Listing 3.15. Deleting the notebook
```
# Remove the endpoint (optional)
# Comment out this cell if you want the endpoint to persist after Run All
sess.delete_endpoint(predictor.endpoint)
```
To delete the endpoint, uncomment the code in the listing, then press Ctrl+Enter to run the code in the cell.
To shut down the notebook, go back to your browser tab where you have SageMaker open. Click the Notebook Instances menu item to view all of your notebook instances. Select the radio button next to the notebook instance name, as shown in figure 3.8, then click Stop on the Actions menu. It takes a couple of minutes to shut down.
If you didn't delete the endpoint using the notebook (or if you just want to make sure it's deleted), you can do this from the SageMaker console. To delete the endpoint, click the radio button to the left of the endpoint name, then click the Actions menu item, and click Delete in the menu that appears.
When you have successfully deleted the endpoint, you will no longer incur AWS charges for it. You can confirm that all of your endpoints have been deleted when you see the text "There are currently no resources" displayed at the bottom of the Endpoints page (figure 3.9).
Summary
- You created a machine learning model to determine which customers to call because they are at risk of taking their business to a competitor.
- XGBoost is a gradient-boosting, machine learning model that uses an ensemble of different approaches to improve the effectiveness of its learning.
- Stratify is one technique to help you handle imbalanced datasets. It shuffles the deck as it builds the machine learning model, making sure that the train, validate, and test datasets contain similar ratios of target variables.
- A confusion matrix is one of the most confusingly named terms in machine learning, but it is also one of the most helpful tools in understanding the performance of a model.