This chapter covers
- Identifying a machine learning opportunity
- Identifying what and how much data is required
- Building a machine learning system
- Using machine learning to make decisions
In this chapter, you’ll work end to end through a machine learning scenario that makes a decision about whether to send an order to a technical approver or not. And you’ll do it without writing any rules! All you’ll feed into the machine learning system is a dataset consisting of 1,000 historical orders and a flag that indicates whether that order was sent to a technical approver or not. The machine learning system figures out the patterns from those 1,000 examples and, when given a new order, will be able to correctly determine whether to send the order to a technical approver.
In the first chapter, you read about Karen, the person who works in the purchasing department. Her job is to receive requisitions from staff to buy a product or service. For each request, Karen decides which approver needs to review and approve the order; then, after getting approval, she sends the request to the supplier. Karen might not think of herself this way, but she's a decision maker. As requests to buy products and services come in, Karen decides who needs to approve each request. For some products, such as computers, Karen needs to send the request to a technical advisor, who determines if the specification is suitable for the person buying the computer. Does Karen need to send this order to a technical approver, or not? This is the decision you'll work on in this chapter.
By the end of this chapter, you'll be able to help Karen out. You'll be able to put together a solution that will look at the requests before they get to Karen and then recommend whether she should send the request to a technical approver. As you work through the examples, you'll become familiar with how to use machine learning to make a decision.
Figure 2.1 shows Karen's process from a requester placing an order to the supplier receiving the order. Each person icon in the workflow represents a person taking some action. If they have more than one arrow pointing away from them, they need to make a decision.
- The first decision is the one we are going to look at in this chapter: should Karen send this order to a technical approver?
- The second decision is made by the technical approver after Karen routes an order to them: should the technical approver accept the order and send it to finance, or should it be rejected and sent back to the requester?
- The third decision is made by the financial approver: should they approve the order and send it to the supplier, or should it be rejected and sent back to the requester?
Each of these decisions may be well suited for machine learning and, in fact, they are. Let's look at the first decision (Karen's decision) in more detail to understand why it's suitable for machine learning.
In discussions with Karen, you've learned that the approach she generally takes is that if a product looks like an IT product, she'll send it to a technical approver. The exception to her rule is that if it's something that can be plugged in and used, such as a mouse or a keyboard, she doesn't send it for technical approval. Nor does she send it for technical approval if the requester is from the IT department.
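To see what "without writing any rules" saves you, Karen's rule of thumb could be sketched as a hand-written function. This is only an illustration: the two product sets are assumptions built from product names that appear in this chapter, not a list Karen actually maintains.

```python
# Hypothetical product groupings, assumed for illustration only.
IT_PRODUCTS = {"Desktop Computer", "Laptop Computer", "Keyboard", "Mouse"}
PLUG_AND_PLAY = {"Keyboard", "Mouse"}


def needs_tech_approval(product: str, role: str) -> bool:
    """Hand-coded version of Karen's rule of thumb."""
    if role == "tech":              # requester is from the IT department
        return False
    if product in PLUG_AND_PLAY:    # can simply be plugged in and used
        return False
    return product in IT_PRODUCTS   # looks like an IT product


print(needs_tech_approval("Desktop Computer", "non-tech"))  # True
print(needs_tech_approval("Keyboard", "non-tech"))          # False
print(needs_tech_approval("Desktop Computer", "tech"))      # False
```

Every new product type would mean editing these sets by hand. The machine learning approach in this chapter learns the same pattern from historical orders instead.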
Table 2.1 shows the dataset you will work with in this scenario. This dataset contains the past 1,000 orders that Karen has processed.
It's good practice when preparing a labeled dataset for supervised machine learning to put the target variable in the first column. In this scenario, your target variable is this: should Karen send the order to a technical approver? In your dataset, if Karen sent the order for technical approval, you put a 1 in the tech_approval_required column. If she did not send the order for technical approval, you put a 0 in that column. The rest of the columns are features. These are things that you think are going to be useful in determining whether an item should be sent to an approver. Just like target variables come in two types, categorical and continuous, features also come in two types: categorical and continuous.
Categorical features in table 2.1 are those in the requester_id, role, and product columns. A categorical feature is something that can be divided into a number of distinct groups. These are often text rather than numbers, as you can see in the following columns:
- requester_id: ID of the requester.
- role: If the requester is from the IT department, these are labeled tech.
- product: The type of product.
Continuous features are those in the last three columns in table 2.1. Continuous features are always numbers. The continuous features in this dataset are quantity, price, and total.
Table 2.1. Technical Approval Required dataset contains information from prior orders received by Karen.
tech_approval_required | requester_id | role | product | quantity | price | total |
---|---|---|---|---|---|---|
0 | E2300 | tech | Desk | 1 | 664 | 664 |
0 | E2300 | tech | Keyboard | 9 | 649 | 5841 |
0 | E2374 | non-tech | Keyboard | 1 | 821 | 821 |
1 | E2374 | non-tech | Desktop Computer | 24 | 655 | 15720 |
0 | E2327 | non-tech | Desk | 1 | 758 | 758 |
0 | E2354 | non-tech | Desk | 1 | 576 | 576 |
1 | E2348 | non-tech | Desktop Computer | 21 | 1006 | 21126 |
0 | E2304 | tech | Chair | 3 | 155 | 465 |
0 | E2363 | non-tech | Keyboard | 1 | 1028 | 1028 |
0 | E2343 | non-tech | Desk | 3 | 487 | 1461 |
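As a rough sketch of how the target, the categorical features, and the continuous features in table 2.1 can be told apart in code, the first four rows of the table are typed in by hand below. The variable names are illustrative and are not part of the chapter's notebook.

```python
import pandas as pd

# First four rows of table 2.1, typed in by hand for illustration.
orders = pd.DataFrame({
    "tech_approval_required": [0, 0, 0, 1],
    "requester_id": ["E2300", "E2300", "E2374", "E2374"],
    "role": ["tech", "tech", "non-tech", "non-tech"],
    "product": ["Desk", "Keyboard", "Keyboard", "Desktop Computer"],
    "quantity": [1, 9, 1, 24],
    "price": [664, 649, 821, 655],
    "total": [664, 5841, 821, 15720],
})

target = orders.columns[0]                  # the target variable sits in the first column
features = orders.drop(columns=[target])

# Text columns are treated as categorical, numeric columns as continuous.
categorical = features.select_dtypes(exclude="number").columns.tolist()
continuous = features.select_dtypes(include="number").columns.tolist()

print(target)        # tech_approval_required
print(categorical)   # ['requester_id', 'role', 'product']
print(continuous)    # ['quantity', 'price', 'total']
```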
The fields selected for this dataset are those that will allow you to replicate Karen's decision-making process. There are many other fields that you could have selected, and there are some very sophisticated tools being released that help in selecting those features. But for the purposes of this scenario, you'll use your intuition about the problem you're solving to select your features. As you'll see, this approach can quickly lead to some excellent results. Now you're ready to do some machine learning:
- Your end goal is to be able to submit an order to the machine learning model and have it return a result that recommends sending the order to a technical approver or not.
- You have identified the features you'll use to make the decision (the type of product and whether the requester is from the IT department).
- You have created your labeled historical dataset (the dataset shown in table 2.1).
Now that you have your labeled dataset, you can train a machine learning model to make decisions. But what is a model and how do you train it?
You'll learn more about how machine learning works in the following chapters. For now, all you need to know is that a machine learning model is a mathematical function that is rewarded for guessing right and punished for guessing wrong. In order to get more guesses right, the function associates certain values in each feature with right guesses or wrong guesses. As it works through more and more samples, it gets better at guessing. When it's run through all the samples, you say that the model is trained.
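The idea of a function that is adjusted whenever it guesses wrong can be sketched in a few lines of plain Python. This is a toy perceptron-style learner on four made-up samples. It is not XGBoost and not the algorithm used later in the chapter, and the sample data and names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy samples: (feature values, label). Label 1 = send to a technical approver.
samples = [
    (("product=Desktop Computer", "role=non-tech"), 1),
    (("product=Desktop Computer", "role=tech"), 0),
    (("product=Desk", "role=non-tech"), 0),
    (("product=Keyboard", "role=tech"), 0),
]

weights = defaultdict(float)   # one weight per feature value
bias = 0.0


def guess(features):
    """Guess 1 when the combined weight of the feature values is positive."""
    score = bias + sum(weights[f] for f in features)
    return 1 if score > 0 else 0


for epoch in range(100):
    mistakes = 0
    for features, label in samples:
        error = label - guess(features)   # 0 when right, +1 or -1 when wrong
        if error:
            mistakes += 1
            bias += error
            for f in features:
                weights[f] += error       # adjust the values tied to the wrong guess
    if mistakes == 0:                     # a full pass with no wrong guesses
        break

print([guess(features) for features, _ in samples])  # [1, 0, 0, 0]
```

After a few passes over the samples the guesses match the labels, which is the sense in which the function has been trained.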
The mathematical function that underlies a machine learning model is called the machine learning algorithm. Each machine learning algorithm has a number of parameters you can set to get a better-performing model. In this chapter, you are going to accept all of the default settings for the algorithm you'll use. In subsequent chapters, we'll discuss how to fine tune the algorithm to get better results.
One of the most confusing aspects for machine learning beginners is deciding which machine learning algorithm to use. In the supervised machine learning exercises in this book, we focus on just one algorithm: XGBoost. XGBoost is a good choice because
- It is fairly forgiving; it works well across a wide range of problems without significant tuning.
- It doesn't require a lot of data to provide good results.
- It is easy to explain why it returns a particular prediction in a certain scenario.
- It is a high-performing algorithm and the go-to algorithm for many participants in machine learning competitions with small datasets.
In a later chapter, you'll learn how XGBoost works under the hood, but for now, let's discuss how to use it. If you want to read more about it on the AWS site, you can do so here: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
Note
If you haven't already installed and configured all the tools you'll need as you work through this chapter and the book, visit appendixes A, B, and C, and follow the instructions you find there. After working your way through the appendixes (to the end of appendix C), you'll have your dataset stored in S3 (AWS's file storage service) and a Jupyter notebook set up and running in SageMaker.
Let's step through the Jupyter notebook and make predictions about whether to send an order to a technical approver or not. In this chapter, we will look at the notebook in six parts:
- Load and examine the data.
- Get the data into the right shape.
- Create training, validation, and test datasets.
- Train the machine learning model.
- Host the machine learning model.
- Test the model and use it to make decisions.
To follow along, you should have completed two things:
- Load the dataset orders_with_predicted_value.csv into S3.
- Upload the Jupyter notebook tech_approval_required.ipynb to SageMaker.
Appendixes B and C take you through how to do each of these steps in detail for the dataset you'll use in this chapter. In summary:
- Download the dataset at https://s3.amazonaws.com/mlforbusiness/ch02/orders_with_predicted_value.csv.
- Upload the dataset to your S3 bucket that you have set up to hold the datasets for this book.
- Download the Jupyter notebook at https://s3.amazonaws.com/mlforbusiness/ch02/tech_approval_required.ipynb.
- Upload the Jupyter notebook to your SageMaker notebook instance.
Don't be frightened by the code in the Jupyter notebook. As you work through this book, you'll become familiar with each aspect of it. In this chapter, you'll run the code rather than edit it. In fact, in this chapter, you do not need to modify any of the code with the exception of the first two lines, where you tell the code which S3 bucket to use and which folder in that bucket contains your dataset.
To start, open the SageMaker service from the AWS console in your browser by logging into the AWS console at http://console.aws.amazon.com. Click Notebook Instances in the left-hand menu on SageMaker (figure 2.2). This takes you to a screen that shows your notebook instances.
If the notebook you uploaded in appendix C is not running, you will see a screen like the one shown in figure 2.3. Click the Start link to start the notebook instance.
Once you have started your notebook instance, or if your notebook was already started, you'll see a screen like that shown in figure 2.4. Click the Open link.
When you click the Open link, a new tab opens in your browser, and you'll see the ch02 folder you created in appendix C (figure 2.5).
Finally, when you click ch02, you'll see the notebook you uploaded in appendix C: tech-approval-required.ipynb. Click this notebook to open it in a new browser tab (figure 2.6).
Figure 2.7 shows a Jupyter notebook. Jupyter notebooks are an amazing coding environment that combines text with code in sections. An example of a text cell is the text for heading 2.4.1, Part 1: Load and examine the data. An example of a code cell is the following lines:
    data_bucket = 'mlforbusiness'
    subfolder = 'ch02'
    dataset = 'orders_with_predicted_value.csv'
To run the code in a Jupyter notebook, press Ctrl+Enter when your cursor is in a code cell.
The code in listings 2.1 through 2.4 loads the data so you can look at it. The only two values in this entire notebook that you need to modify are the data_bucket and the subfolder. You should use the bucket and subfolder names you set up as per the instructions in appendix B.
Note
This walkthrough will just familiarize you with the code so that, when you see it again in subsequent chapters, you'll know how it fits into the SageMaker workflow.
The following listing shows how to identify the bucket and subfolder where the data is stored.
Listing 2.1. Setting up the S3 bucket and subfolder
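    data_bucket = 'mlforbusiness'                # S3 bucket where the data is stored
    subfolder = 'ch02'                           # subfolder of the bucket that holds the data
    dataset = 'orders_with_predicted_value.csv'  # name of the dataset file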
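These three values only become useful once they are joined into a single S3 location, which is how listing 2.3 uses them. A minimal sketch of that step, using the book's sample values (substitute your own bucket and subfolder):

```python
data_bucket = 'mlforbusiness'   # replace with your own bucket name
subfolder = 'ch02'              # replace with your own subfolder
dataset = 'orders_with_predicted_value.csv'

# An f-string substitutes the variables into the S3 path.
s3_uri = f's3://{data_bucket}/{subfolder}/{dataset}'
print(s3_uri)  # s3://mlforbusiness/ch02/orders_with_predicted_value.csv
```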
As you'll recall, a Jupyter notebook is where you can write and run code. There are two ways you can run code in a Jupyter notebook. You can run the code in one of the cells, or you can run the code in more than one of the cells.
To run the code in one cell, click the cell to select it, and then press Ctrl+Enter. When you do so, you'll see an asterisk (*) appear to the left of the cell. This means that the code in the cell is running. When the asterisk is replaced by a number, the code has finished running. The number shows how many cells have been run since you opened the notebook.
If you want, after you have updated the name of the data bucket and the subfolder (listing 2.1), you can run the notebook. This loads the data, builds and trains the machine learning model, sets up the endpoint, and generates predictions from the test data. SageMaker takes about 10 minutes to complete these actions for the datasets you'll use in this book. It may take longer if you load large datasets from your company.
To run the entire notebook, click Cell in the toolbar at the top of the Jupyter notebook, then click Run All (figure 2.8).
Next, you'll set up the Python libraries required by the notebook (listing 2.2). To run the notebook, you don't need to change any of these values:
- pandas: A Python library commonly used in data science projects. In this book, we'll touch only the surface of what pandas can do. You'll load pandas as pd. As you'll see later in this chapter, this means that we will preface any use of any module in the pandas library with pd.
- boto3 and sagemaker: The libraries created by Amazon to help Python users interact with AWS resources: boto3 is used to interact with S3, and sagemaker, unsurprisingly, is used to interact with SageMaker. You will also use a module called s3fs, which makes it easier to use boto3 with S3.
- sklearn: The final library you'll import. It is short for scikit-learn, which is a comprehensive library of machine learning algorithms that is used widely in both the commercial and scientific communities. Here we only import the train_test_split function that we'll use later.
You'll also need to create a role on SageMaker that allows the sagemaker library to use the resources it needs to build and serve the machine learning application. You do this by calling the sagemaker function get_execution_role.
Listing 2.2. Importing modules
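    import pandas as pd                    # work with tabular data
    import boto3                           # AWS SDK for Python
    import sagemaker                       # SageMaker Python SDK
    import s3fs                            # file-like access to S3
    from sklearn.model_selection \
        import train_test_split            # splits a dataset into subsets
    role = sagemaker.get_execution_role()  # role that lets SageMaker use AWS resources
    s3 = s3fs.S3FileSystem(anon=False)     # connection to S3 using your credentials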
As a reminder, as you walk through each of the cells in the Jupyter notebook, to run the code in a cell, click the cell and press Ctrl+Enter.
Now that you've identified the bucket and subfolder and set up the notebook, you can take a look at the data. The best way to view the data is to use the pandas library you imported in listing 2.2.
The code in listing 2.3 creates a pandas data structure called a DataFrame. You can think of a DataFrame as a table like a spreadsheet. The first line assigns the name df to the DataFrame. The data in the DataFrame is the orders data from S3. It is read into the DataFrame by using the pandas function read_csv. The line df.head() displays the first five rows of the df DataFrame.
Listing 2.3. Viewing the dataset
    df = pd.read_csv(
        f's3://{data_bucket}/{subfolder}/{dataset}')  # read the CSV file from S3 into a DataFrame
    df.head()                                          # display the first five rows
Running the code displays the top five rows in the dataset. (To run the code, insert your cursor in the cell and press Ctrl+Enter.) The dataset will look similar to the dataset in table 2.2.
Table 2.2. Technical Approval Required dataset displayed in Excel
tech_approval_required | requester_id | role | product | quantity | price | total |
---|---|---|---|---|---|---|
0 | E2300 | tech | Desk | 1 | 664 | 664 |
0 | E2300 | tech | Keyboard | 9 | 649 | 5841 |
0 | E2374 | non-tech | Keyboard | 1 | 821 | 821 |
1 | E2374 | non-tech | Desktop Computer | 24 | 655 | 15720 |
0 | E2327 | non-tech | Desk | 1 | 758 | 758 |
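If you want to try read_csv and head() without an S3 bucket, the same calls work on CSV text held in memory. In this sketch, io.StringIO stands in for the file on S3, and the rows are copied from tables 2.1 and 2.2.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file on S3.
csv_text = """tech_approval_required,requester_id,role,product,quantity,price,total
0,E2300,tech,Desk,1,664,664
0,E2300,tech,Keyboard,9,649,5841
0,E2374,non-tech,Keyboard,1,821,821
1,E2374,non-tech,Desktop Computer,24,655,15720
0,E2327,non-tech,Desk,1,758,758
0,E2354,non-tech,Desk,1,576,576
"""

df = pd.read_csv(io.StringIO(csv_text))  # same function as listing 2.3, different source
print(df.head())                         # head() shows the first five rows by default
print(len(df), len(df.head()))           # 6 5
```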
To recap, the dataset you uploaded to S3 and are now displaying in the df DataFrame lists the last 1,000 orders that Karen processed. She sent some of those orders to a technical approver, and some she did not.
The code in listing 2.4 displays how many rows are in the dataset and how many of the rows were sent for technical approval. Running this code shows that out of 1,000 rows in the dataset, 807 were not sent to a technical approver, and 193 were sent.
Listing 2.4. Determining how many rows were sent for technical approval
    print(f'Number of rows in dataset: {df.shape[0]}')  # number of rows
    print(df[df.columns[0]].value_counts())             # rows per value in the first (target) column
In listing 2.4, the shape property of a DataFrame provides information about the number of rows and the number of columns. Here df.shape[0] shows the number of rows, and df.shape[1] shows the number of columns. The value_counts method, applied to the first column of the df DataFrame, counts how many rows hold each value in that column. The column contains a 1 if the order was sent for technical approval and a 0 if it was not.
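A small sketch of shape and value_counts on the ten orders from table 2.1 (only two columns are typed in here, to keep the example short):

```python
import pandas as pd

df = pd.DataFrame({
    "tech_approval_required": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "product": ["Desk", "Keyboard", "Keyboard", "Desktop Computer", "Desk",
                "Desk", "Desktop Computer", "Chair", "Keyboard", "Desk"],
})

print(df.shape)                             # (10, 2): 10 rows, 2 columns
counts = df[df.columns[0]].value_counts()   # rows per value in the target column
print(counts.loc[0], counts.loc[1])         # 8 2
```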
For this part of the notebook, you'll prepare the data for use in the machine learning model. You'll learn more about this topic in later chapters but, for now, it is enough to know that there are standard approaches to preparing data, and we are using one we'll apply to each of our machine learning exercises.
One important point to understand about most machine learning models is that they typically work with numbers rather than text-based data. We'll discuss why this is so in a subsequent chapter when we go into the details of the XGBoost algorithm. For now, it is enough to know that you need to convert the text-based data to numerical data before you can use it to train your machine learning model. Fortunately, there are easy-to-use tools that will help with this.
First, we'll use the pandas get_dummies function to convert all of the text data into numbers. It does this by creating a separate column for every unique text value. For example, the product column contains text values such as Desk, Keyboard, and Mouse. When you use the get_dummies function, it turns every value into a column and places a 0 or 1 in the row, depending on whether the row contains a value or not.
Table 2.3 shows a simple table with three rows. The table shows the price for a desk, a keyboard, and a mouse.
Table 2.3. Simple three-row dataset with prices for a desk, a keyboard, and a mouse
product | price |
---|---|
Desk | 664 |
Keyboard | 69 |
Mouse | 89 |
When you use the get_dummies function, it takes each of the unique values in the non-numeric columns and creates new columns from those. In our example, this looks like the values in table 2.4. Notice that the get_dummies function removes the product column and creates three columns from the three unique values in the dataset. It also places a 1 in the new column that contains the value from that row and zeros in every other column.
Table 2.4. The three-row dataset after applying the get_dummies function
price | product_Desk | product_Keyboard | product_Mouse |
---|---|---|---|
664 | 1 | 0 | 0 |
69 | 0 | 1 | 0 |
89 | 0 | 0 | 1 |
Listing 2.5 shows the code that creates table 2.4. To run the code, insert your cursor in the cell, and press Ctrl+Enter. You can see that this dataset is very wide (111 columns in our example).
Listing 2.5. Converting text values to columns
    encoded_data = pd.get_dummies(df)  # one new column per unique text value
    encoded_data.head()                # display the first five rows
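The three-row example in tables 2.3 and 2.4 can be reproduced directly. One detail in this sketch is an addition to the chapter's code: the dtype=int argument. Recent pandas versions return True/False in the new columns by default, and dtype=int asks for the 0/1 values shown in table 2.4.

```python
import pandas as pd

prices = pd.DataFrame({
    "product": ["Desk", "Keyboard", "Mouse"],
    "price": [664, 69, 89],
})

# The text column is replaced by one 0/1 column per unique value.
encoded = pd.get_dummies(prices, dtype=int)
print(list(encoded.columns))
# ['price', 'product_Desk', 'product_Keyboard', 'product_Mouse']
print(encoded.loc[1].tolist())  # keyboard row: [69, 0, 1, 0]
```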
A machine learning algorithm can now work with this data because it is all numbers. But there is a problem. Your dataset is now probably very wide. In our sample dataset in tables 2.3 and 2.4, the dataset went from 2 columns wide to 4 columns wide. In a real dataset, it can go to thousands of columns wide. Even our sample dataset in the SageMaker Jupyter notebook goes to 111 columns when you run the code in the cells.
This is not a problem for the machine learning algorithm. It can easily handle datasets with thousands of columns. It's a problem for you because it becomes more difficult to reason about the data. For this reason, and for the types of machine learning decision-making problems we look at in this book, you can often get results that are just as accurate by reducing the number of columns to only the most relevant ones. This is important for your ability to explain to others what is happening in the algorithm in a way that is convincing. For example, in the dataset you work with in this chapter, the most highly correlated columns are the ones relating to technical product types and the ones relating to whether the requester has a tech role or not. This makes sense, and it can be explained concisely to others.
A relevant column for this machine learning problem is a column that contains values that are correlated to the value you are trying to predict. You say that two values are correlated when a change in one value is accompanied by a change in another value. If these both increase or decrease together, you say that they are positively correlated: they both move in the same direction. And when one goes up and the other goes down (or vice versa), you say that they are negatively correlated: they move in opposite directions. For our purposes, the machine learning algorithm doesn't really care whether a column is positively or negatively correlated, just that it is correlated.
Correlation is important because the machine learning algorithm is trying to predict a value based on the values in other columns in the dataset. The values in the dataset that contribute most to the prediction are those that are correlated to the predicted value.
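To make positive and negative correlation concrete, here is a sketch with made-up numbers. The stock_left column is hypothetical and exists only to show a value that falls as another rises.

```python
import pandas as pd

data = pd.DataFrame({
    "quantity":   [1, 2, 3, 4, 5],
    "total":      [10, 20, 30, 40, 50],   # rises as quantity rises
    "stock_left": [50, 40, 30, 20, 10],   # falls as quantity rises (hypothetical)
})

corr = data.corr()["quantity"]             # correlation of every column with quantity
print(round(corr["total"], 6))             # positively correlated, close to 1
print(round(corr["stock_left"], 6))        # negatively correlated, close to -1
print(round(corr.abs()["stock_left"], 6))  # abs() keeps the strength and drops the sign
```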
You'll find the most correlated columns by applying another pandas function called corr. You apply the corr function by appending .corr() to the pandas DataFrame you named encoded_data in listing 2.5. After the function, you need to provide the name of the column you are attempting to predict. In this case, the column you are attempting to predict is the tech_approval_required column. Listing 2.6 shows the code that does this. Note that the .abs() function at the end of the listing is simply turning all of the correlations positive.
Listing 2.6. Identifying correlated columns
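    corrs = encoded_data.corr()[
        'tech_approval_required'
        ].abs()                        # absolute correlation of every column with the target
    columns = corrs[corrs > .1].index  # names of the columns with a correlation above 0.1
    corrs = corrs.filter(columns)      # keep only those correlations
    corrs                              # display them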
The code in listing 2.6 identifies the columns that have a correlation greater than 10%. You don't need to know exactly how this code works. You are simply finding all of the columns that have a correlation greater than 10% with the tech_approval_required column. Why 10%? It removes the irrelevant noise from the dataset. While this step does not help the machine learning algorithm, it improves your ability to talk about the data in a meaningful way. With fewer columns to consider, you can more easily prepare an explanation of what the algorithm is doing.
Table 2.5 shows the columns with a correlation greater than 10%.
Table 2.5. Correlation with predicted value
Column name | Correlation with predicted value |
---|---|
tech_approval_required | 1.000000 |
role_non-tech | 0.122454 |
role_tech | 0.122454 |
product_Chair | 0.134168 |
product_Cleaning | 0.191539 |
product_Desk | 0.292137 |
product_Desktop Computer | 0.752144 |
product_Keyboard | 0.242224 |
product_Laptop Computer | 0.516693 |
product_Mouse | 0.190708 |
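The threshold filter can be tried on a tiny encoded table. The values below are made up, and the noise column is a hypothetical column constructed to have no relationship with the target.

```python
import pandas as pd

encoded = pd.DataFrame({
    "tech_approval_required":   [0, 0, 0, 1, 0, 0, 1, 0],
    "product_Desktop Computer": [0, 0, 0, 1, 0, 0, 1, 0],
    "product_Desk":             [1, 0, 0, 0, 1, 1, 0, 0],
    "noise":                    [1, 0, 1, 1, 0, 1, 0, 0],   # hypothetical unrelated column
})

corrs = encoded.corr()["tech_approval_required"].abs()
columns = corrs[corrs > 0.1].index   # column names that pass the 10% threshold
filtered = encoded[columns]          # the same selection that listing 2.7 performs

print(list(columns))
# ['tech_approval_required', 'product_Desktop Computer', 'product_Desk']
```

The target column always passes the filter because its correlation with itself is 1, which is why tech_approval_required appears in the first row of table 2.5.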
Now that you have identified the most highly correlated columns, you need to filter the encoded_data table to contain just those columns. You do so with the code in listing 2.7. The first line filters the columns to just the correlated columns, and the second line displays the table when you run the code. (Remember, to run the code, insert your cursor in the cell and press Ctrl+Enter.)
Listing 2.7. Showing only correlated columns
    encoded_data = encoded_data[columns]  # keep only the correlated columns
    encoded_data.head()                   # display the first five rows
The next step in the machine learning process is to create a dataset that you'll use to train the algorithm. While you're at it, you'll also create the dataset you'll use to validate the results of the training and the dataset you'll use to test the results. To do this, you'll split the dataset into three parts:
- Train
- Validate
- Test
The machine learning algorithm uses the training data to train the model. You should put the largest chunk of data into the training dataset. The validation data is used by the algorithm to determine whether the algorithm is improving. This should be the next-largest chunk of data. Finally, the test data, the smallest chunk, is used by you to determine how well the algorithm performs. Once you have the three datasets, you will convert them to CSV format and then save them to S3.
In listing 2.8, you create two datasets: a dataset with 70% of the data for training and a dataset with 30% of the data for validation and testing. Then you split the validation and test data into two separate datasets: a validation dataset and a test dataset. The validation dataset will contain 20% of the total rows, which equals 66.7% of the validation and test dataset. The test dataset will contain 10% of the total rows, which equals 33.3% of the validation and test dataset. To run the code, insert your cursor in the cell and press Ctrl+Enter.
Listing 2.8. Splitting data into train, validate, and test datasets
    train_df, val_and_test_data = train_test_split(
        encoded_data,
        test_size=0.3,
        random_state=0)   # 70% for training, 30% held out
    val_df, test_df = train_test_split(
        val_and_test_data,
        test_size=0.333,
        random_state=0)   # held-out rows: two-thirds validation, one-third test
Note
The random_state argument ensures that repeating the command splits the data in the same way.
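The percentages in the two-stage split can be checked with plain arithmetic. This sketch only works out the row counts for the 1,000-order dataset. It does not shuffle or split any data, and rounding the held-out portions up is an assumption about how fractional row counts are resolved.

```python
import math

total_rows = 1000

# First split: hold out 30% of all rows for validation and testing.
held_out = math.ceil(0.3 * total_rows)
train_rows = total_rows - held_out

# Second split: about one-third of the held-out rows become the test set.
test_rows = math.ceil(0.333 * held_out)
val_rows = held_out - test_rows

print(train_rows, val_rows, test_rows)  # 700 200 100
```

So 0.333 of the 30% hold-out is the 10% test set, and the remaining two-thirds is the 20% validation set.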
In listing 2.8, you split the data into three DataFrames. In listing 2.9, you'll convert the datasets to CSV format.
Input and CSV formats
CSV is one of the two formats you can use as input to the XGBoost machine learning model. The code in this book uses CSV format. That's because if you want to view the data in a spreadsheet, it's easy to import into spreadsheet applications like Microsoft Excel. The drawback of using CSV format is that it takes up a lot of space if you have a dataset with lots of columns (like our encoded_data dataset after using the get_dummies function in listing 2.5).
The other format that XGBoost can use is libsvm. Unlike a CSV file, where even the columns containing zeros need to be filled out, the libsvm format only includes the columns that do not contain zeros. It does this by concatenating the column number and the value together. So the data you looked at in table 2.4 would look like this:
1:664 2:1
1:69 3:1
1:89 4:1
The first entry in each row shows the price (664, 69, or 89). The number in front of the price indicates that this is in the first column of the dataset. The second entry in each row contains the column number (2, 3, or 4) and the non-zero value of the entry (which in our case is always 1). So 1:89 4:1 means that the first column in the row contains the number 89, and the fourth column contains the number 1. All the other values are zero.
You can see that using libsvm over CSV has changed the width of our dataset from four columns to two columns. But don't get too hung up on this. SageMaker and XGBoost work just fine with CSV files with thousands of columns; but if your dataset is very wide (tens of thousands of columns), you might want to use libsvm instead of CSV. Otherwise, use CSV because it's easier to work with.
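The column:value idea can be sketched with a short helper. The function name to_sparse_row is made up for this illustration. It mirrors the simplified rows shown above, whereas a complete libsvm file also starts each line with the row's label.

```python
def to_sparse_row(values):
    """Return 'column:value' pairs for the non-zero entries, numbering columns from 1."""
    return " ".join(
        f"{column}:{value}"
        for column, value in enumerate(values, start=1)
        if value != 0
    )


rows = [
    [664, 1, 0, 0],   # desk
    [69, 0, 1, 0],    # keyboard
    [89, 0, 0, 1],    # mouse
]
for row in rows:
    print(to_sparse_row(row))
# 1:664 2:1
# 1:69 3:1
# 1:89 4:1
```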
Listing 2.9 shows how to use the pandas function to_csv to create a CSV dataset from the pandas DataFrames you created in listing 2.8. To run the code, insert your cursor in the cell and press Ctrl+Enter.
Listing 2.9. Converting the data to CSV
    train_data = train_df.to_csv(None, header=False, index=False).encode()
    val_data = val_df.to_csv(None, header=False, index=False).encode()
    test_data = test_df.to_csv(None, header=True, index=False).encode()
The None argument in the to_csv function indicates that you do not want to save to a file, so the function returns the CSV text instead. The header argument indicates whether the column names will be included in the CSV file or not. For the train_data and val_data datasets, you don't include the column headers (header=False) because the machine learning algorithm is expecting each column to contain only numbers. For the test_data dataset, it is best to include headers because you'll be running the trained algorithm against the test data, and it is helpful to have column names in the data when you do so. The index=False argument tells the function to not include a column with the row numbers. The encode() function converts the CSV text to bytes, which ensures that the text in the CSV file is in the right format.
Note
Encoding text in the right format can be one of the most frustrating parts of machine learning. Fortunately, a lot of the complexity of this is handled by the pandas library, so you generally won't have to worry about that. Just remember to always use the encode() function if you save the file to CSV.
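A short sketch of what the to_csv arguments and encode() produce, using a two-row DataFrame made up for illustration:

```python
import pandas as pd

small_df = pd.DataFrame({"tech_approval_required": [1, 0], "price": [655, 664]})

# Passing None as the first argument returns the CSV text instead of writing a file.
no_header = small_df.to_csv(None, header=False, index=False)
with_header = small_df.to_csv(None, header=True, index=False)

payload = no_header.encode()   # str -> bytes (UTF-8 by default)

print(type(no_header).__name__, type(payload).__name__)  # str bytes
print(no_header.splitlines())        # ['1,655', '0,664']
print(with_header.splitlines()[0])   # tech_approval_required,price
```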
In listing 2.9, you created CSV files from the train, val, and test DataFrames. However, the CSV files are not stored anywhere yet other than in the memory of the Jupyter notebook. In listing 2.10, you will save the CSV files to S3.
In listing 2.2 in the fourth line, you imported a Python module called s3fs. This module makes it easy to work with S3. In the last line of the same listing, you assigned the variable s3 to the S3 filesystem. You will now use this variable to work with S3. To do this, you'll use Python's with...open syntax to indicate the filename and location, and the write function to write the contents of a variable to that location (listing 2.10).
Remember to use 'wb' when creating the file to indicate that you are writing the contents of the file in binary mode rather than text mode. (You don't need to know how this works, just that it allows the file to be read back exactly as it was saved.) To run the code, insert your cursor in the cell and press Ctrl+Enter.
Listing 2.10. Saving the CSV file to S3
    with s3.open(f'{data_bucket}/{subfolder}/processed/train.csv', 'wb') as f:
        f.write(train_data)  # write the training data to S3
    with s3.open(f'{data_bucket}/{subfolder}/processed/val.csv', 'wb') as f:
        f.write(val_data)    # write the validation data to S3
    with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv', 'wb') as f:
        f.write(test_data)   # write the test data to S3
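The same with...open pattern can be tried locally. In this sketch, Python's built-in open and a temporary file stand in for s3.open and the S3 location, and the bytes being written are a made-up sample.

```python
import os
import tempfile

train_data = "1,655\n0,664\n".encode()   # bytes, like the output of listing 2.9

# A local temporary file stands in for the S3 location used in the chapter.
path = os.path.join(tempfile.mkdtemp(), "train.csv")

with open(path, "wb") as f:    # 'wb' = write in binary mode
    f.write(train_data)

with open(path, "rb") as f:    # read the bytes back unchanged
    round_trip = f.read()

print(round_trip == train_data)  # True
```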
Now you can start training the model. You don't need to understand in detail how this works at this point, just what it is doing, so the code in this part will not be annotated to the same extent as the code in the earlier listings.
First, you need to load your CSV data into SageMaker. This is done using a SageMaker function called s3_input. In listing 2.11, the s3_input files are called train_input and val_input. Note that you don't need to load the test.csv file into SageMaker because it is not used to train or validate the model. Instead, you will use it at the end to test the results. To run the code, insert your cursor in the cell and press Ctrl+Enter.
Listing 2.11. Preparing the CSV data for SageMaker
    train_input = sagemaker.s3_input(
        s3_data=f's3://{data_bucket}/{subfolder}/processed/train.csv',
        content_type='csv')
    val_input = sagemaker.s3_input(
        s3_data=f's3://{data_bucket}/{subfolder}/processed/val.csv',
        content_type='csv')
The next listing (listing 2.12) is truly magical. It is what allows a person with no systems engineering experience to do machine learning. In this listing, you train the machine learning model. This sounds simple, but the ease with which you can do this using SageMaker is a massive step forward from having to set up your own infrastructure to train a machine learning model. In listing 2.12, you
- Set up a variable called sess to store the SageMaker session.
- Define in which container AWS will store the model (use the containers given in listing 2.12).
- Create the model (which is stored in the variable estimator in listing 2.12).
- Set hyperparameters for the estimator.
You will learn more about each of these steps in subsequent chapters, so you don't need to understand this deeply at this point in the book. You just need to know that this code will create your model, start a server to run your model, and then train the model on the data. If you press Ctrl+Enter in this notebook cell, the model runs.
Listing 2.12. Training the model
sess = sagemaker.Session()

container = sagemaker.amazon.amazon_estimator.get_image_uri(
    boto3.Session().region_name,
    'xgboost',
    'latest')

estimator = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.large',  # Type of server used for training
    output_path=f's3://{data_bucket}/{subfolder}/output',  # Where the trained model is stored
    sagemaker_session=sess)

estimator.set_hyperparameters(
    max_depth=5,
    subsample=0.7,
    objective='binary:logistic',  # Binary target; output is a probability
    eval_metric='auc',  # Score the validation data by area under the curve
    num_round=100,  # Maximum number of training rounds
    early_stopping_rounds=10)  # Stop if no improvement for 10 rounds

estimator.fit({'train': train_input, 'validation': val_input})
It takes about 5 minutes to train the model, so you can sit back and reflect on how happy you are to not be manually configuring a server and installing the software to train your model. The server only runs for about a minute, so you will only be charged for about a minute of compute time. At the time of writing of this book, the m5-large server was priced at under US$0.10 per hour. Once you have stored the model on S3, you can use it again whenever you like without retraining the model. More on this in later chapters.
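A quick back-of-the-envelope check of that charge, assuming billing is simply proportional to run time. The rate is the upper bound quoted in the text, and the helper name is made up for this sketch.

```python
HOURLY_RATE_USD = 0.10   # upper bound: the text says "under US$0.10 per hour"
BILLED_MINUTES = 1       # the text says the server runs for about a minute

def compute_cost(hourly_rate_usd: float, minutes: float) -> float:
    """Return the cost of running a server for the given number of minutes."""
    return hourly_rate_usd * minutes / 60

training_cost = compute_cost(HOURLY_RATE_USD, BILLED_MINUTES)
print(f"Training costs at most about US${training_cost:.4f}")
```

Under these assumptions, a training run costs a fraction of a cent.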
The next section of the code is also magical. In this section, you will launch another server to host the model. This is the server that you will use to make predictions from the trained model.
Again, at this point in the book, you don't need to understand how the code in the listing in this section works, only that it's creating a server that you'll use to make predictions. The code in listing 2.13 calls this endpoint order-approval and uses Python's try-except block to clear away any existing endpoint with that name before creating a new one.
A try-except block tries to run some code, and if there is an error, it runs the code after the except line. You do this because an endpoint named order-approval may or may not already exist. The try code attempts to delete an endpoint called order-approval. If there is an order-approval endpoint, it is deleted and a warning is printed. If there is no endpoint named order-approval, the try code generates an error, and the except code runs. The except code in this case simply passes, so nothing further happens. Either way, the listing then sets up a new endpoint named order-approval.
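The pattern used in listing 2.13 can be sketched without AWS: try to delete something that may not exist, ignore the error if it doesn't, then create it fresh. The dictionary and function names below are hypothetical stand-ins, not SageMaker calls.

```python
# Hypothetical in-memory stand-in for the endpoints in an AWS account.
endpoints = {"order-approval": "old model"}

def delete_endpoint(name):
    """Remove an endpoint; raises KeyError if no endpoint has that name."""
    del endpoints[name]

def deploy(name, model):
    """Create an endpoint under the given name."""
    endpoints[name] = model

def redeploy(name, model):
    try:
        delete_endpoint(name)   # fails if the endpoint does not exist
        print("Warning: existing endpoint deleted")
    except KeyError:
        pass                    # nothing to delete, so carry on
    deploy(name, model)         # runs whether or not the delete succeeded

redeploy("order-approval", "new model")    # an endpoint exists, so it is replaced
redeploy("another-endpoint", "new model")  # no endpoint exists; the except branch runs
print(endpoints)
```

Catching only the specific error you expect (KeyError here) keeps unrelated problems from being silently ignored.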
Listing 2.13. Hosting the model
endpoint_name = 'order-approval'
try:
    sess.delete_endpoint(
        sagemaker.predictor.RealTimePredictor(
            endpoint=endpoint_name).endpoint)
    print('Warning: Existing endpoint deleted '
          'to make way for your new endpoint.')
except Exception:
    pass  # No existing endpoint to delete

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',  # Type of server used for hosting
    endpoint_name=endpoint_name)

from sagemaker.predictor import csv_serializer, json_serializer
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None
The code in listing 2.13 sets up a server sized as a t2.medium server. This is a smaller server than the m5.large server we used to train the model because making predictions from a model is less computationally intensive than creating the model. After the try-except block, the call to estimator.deploy creates a variable called predictor that you'll call on to test and use the model. The final four lines in the code set up the predictor to work with the CSV file input so you can work with it more easily.
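To make the CSV step concrete, here is a rough illustration of what sending a row as text/csv amounts to: the feature values are joined into one comma-separated line. This is a conceptual sketch with a hypothetical helper, not the SageMaker serializer itself.

```python
def row_to_csv(values):
    """Join feature values into a single comma-separated line."""
    return ",".join(str(v) for v in values)

# role_non-tech, role_tech, then the seven product columns from the chapter's data
features = [1, 0, 0, 0, 0, 1, 0, 0, 0]
payload = row_to_csv(features)
print(payload)  # 1,0,0,0,0,1,0,0,0
```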
Note that when you click in the notebook cell and run it, the code will take another 5 minutes or so to run the first time. It takes time because it is setting up a server to host the model and to create an endpoint so you can use the model.
Now that you have the model trained and hosted on a server (the endpoint referenced by the variable predictor), you can start using it to make predictions. The first three lines of the next listing create a function that you'll apply to each row of the test data.
Listing 2.14. Getting the predictions
def get_prediction(row):
    prediction = round(float(predictor.predict(row[1:]).decode('utf-8')))
    return prediction

with s3.open(f'{data_bucket}/{subfolder}/processed/test.csv') as f:
    test_data = pd.read_csv(f)

test_data['prediction'] = test_data.apply(get_prediction, axis=1)
test_data.set_index('prediction', inplace=True)
test_data
The function get_prediction in listing 2.14 takes every column in the test data (except for the first column, because that is the value you are trying to predict), sends it to the predictor, and returns the prediction. In this case, the prediction is 1 if the order should be sent to an approver, and 0 if it should not be sent to an approver.
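The steps inside get_prediction can be tried locally with a stand-in predictor. The StubPredictor class and its scoring rule are invented for this sketch; it only mimics an endpoint that returns a score between 0 and 1 as bytes.

```python
class StubPredictor:
    """Hypothetical stand-in for the hosted endpoint; returns a score as bytes."""

    def predict(self, features):
        # Pretend that any desktop or laptop computer order scores high.
        score = 0.93 if features[5] == 1 or features[7] == 1 else 0.04
        return str(score).encode("utf-8")

predictor = StubPredictor()

def get_prediction(row):
    # row[0] is the label, so only row[1:] is sent to the predictor.
    raw = predictor.predict(row[1:])
    return round(float(raw.decode("utf-8")))  # bytes -> str -> float -> 0 or 1

# label, role_non-tech, role_tech, then the seven product columns
rows = [
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],  # desktop computer ordered by a non-tech person
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # cleaning products ordered by a non-tech person
]
predictions = [get_prediction(r) for r in rows]
print(predictions)
```

Rounding turns the score into a yes/no decision with 0.5 as the cutoff.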
The next two lines open the test.csv file and read the contents into a pandas DataFrame called test_data. You can now work with this DataFrame in the same way you worked with the original dataset in listing 2.7. The final three lines apply the function created in the first three lines of the listing.
When you click in the cell containing the code in listing 2.14 and run it, you see the results of each row in the test file. Table 2.6 shows the top two rows of the test data. Each row represents a single order. For example, if an order for a desk was placed by a person in a technical role, the role_tech and the product_Desk columns would have a 1. All other columns would be 0.
Table 2.6. Test results from the predictor
prediction | tech_approval_required | role_non-tech | role_tech | product_Chair | product_Cleaning | product_Desk | product_Desktop Computer | product_Keyboard | product_Laptop Computer | product_Mouse |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
The 1 in the prediction column in the first row says that the model predicts that this order should be sent to a technical approver. The 1 in the tech_approval_required column says that in your test data, this order was labeled as requiring technical approval. This means that the machine learning model predicted this order correctly.
To see why, look at the values in the columns to the right of the tech_approval_required column. You can see that this order was placed by someone who is not in a technical role because there is a 1 in the role_non-tech column and a 0 in the role_tech column. And you can see that the product ordered was a desktop computer because the product_Desktop Computer column has a 1.
The 0 in the prediction column in the second row says that the model predicts this order does not require technical approval. The 0 in the tech_approval_required column, because it is the same value as that in the prediction column, says that the model predicted this correctly.
The 1 in the role_non-tech column says this order was also placed by a non-technical person. However, the 1 in the product_Cleaning column indicates that the order was for cleaning products; therefore, it does not require technical approval.
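Reading a row this way can be automated. The sketch below uses a hypothetical helper, describe_order, to turn the 0/1 columns of the first row of table 2.6 back into a role and a product name.

```python
def describe_order(row):
    """Return the (role, product) names whose 0/1 columns are set to 1."""
    roles = [c[len("role_"):] for c, v in row.items()
             if c.startswith("role_") and v == 1]
    products = [c[len("product_"):] for c, v in row.items()
                if c.startswith("product_") and v == 1]
    return roles[0], products[0]

# Feature columns of the first row in table 2.6
first_row = {
    "role_non-tech": 1, "role_tech": 0,
    "product_Chair": 0, "product_Cleaning": 0, "product_Desk": 0,
    "product_Desktop Computer": 1, "product_Keyboard": 0,
    "product_Laptop Computer": 0, "product_Mouse": 0,
}
print(describe_order(first_row))
```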
As you look through the results, you can see that your machine learning model got almost every test result correct! You have just created a machine learning model that can correctly decide whether to send orders to a technical approver, all without writing any rules. To determine how accurate the results are, you can test how many of the predictions match the test data, as shown in the following listing.
Listing 2.15. Testing the model
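# Listing 2.14 moved the predictions into the DataFrame index,
# so compare the index values with the labeled column.
(test_data.index.values ==
    test_data['tech_approval_required'].values).mean()  # Share of predictions that match the test data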
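The calculation in listing 2.15 is the fraction of rows where the prediction equals the label. Here is the same idea in plain Python, using made-up values rather than the chapter's test data.

```python
predictions = [1, 0, 1, 1, 0]   # made-up model outputs
labels = [1, 0, 0, 1, 0]        # made-up true values

matches = [p == y for p, y in zip(predictions, labels)]
accuracy = sum(matches) / len(matches)   # True counts as 1, False as 0
print(accuracy)  # 0.8
```

Taking the mean of a column of True/False values gives the same number, which is why the listing ends with .mean().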
It is important that you shut down your notebook instance and delete your endpoints when you are not using them. If you leave them running, you will be charged for each second they are up. The charges for the machines we use in this book are not large (if you were to leave a notebook instance or endpoint on for a month, it would cost about US$20). But there is no point in paying for something you are not using.
To delete the endpoint, click Endpoints on the left-hand menu you see when you are looking at the SageMaker tab (figure 2.9).
You will see a list of all of your running endpoints (figure 2.10). To ensure you are not charged for endpoints you are not using, you should delete all of the endpoints you are not using (remember that endpoints are easy to create, so even if you won't use an endpoint for just a few hours, you might want to delete it).
To delete the endpoint, click the radio button to the left of order-approval, click the Actions menu item, then click the Delete menu item that appears (figure 2.11).
You have now deleted the endpoint, so you'll no longer incur AWS charges for it. You can confirm that all of your endpoints have been deleted when you see the text “There are currently no resources” displayed on the Endpoints page (figure 2.12).
The final step is to shut down the notebook instance. Unlike endpoints, you do not delete the notebook. You just shut it down so you can start it again, and it will have all of the code in your Jupyter notebook ready to go. To shut down the notebook instance, click Notebook instances in the left-hand menu on SageMaker (figure 2.13).
To shut down the notebook, select the radio button next to the notebook instance name and click Stop on the Actions menu (figure 2.14). After your notebook instance has shut down, you can confirm that it is no longer running by checking the Status to ensure it says Stopped.
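The chapter does this cleanup through the console. As an alternative sketch, the same steps could be scripted with the boto3 SageMaker client, whose list_endpoints, delete_endpoint, and stop_notebook_instance calls are used below. The clean_up helper, the fake client, and the notebook name are assumptions made so the example runs without an AWS account. Note that list_endpoints returns results in pages, so an account with many endpoints would need more than one call.

```python
def clean_up(sm_client, notebook_name):
    """Delete all listed endpoints, then stop the named notebook instance."""
    deleted = []
    for endpoint in sm_client.list_endpoints()["Endpoints"]:
        name = endpoint["EndpointName"]
        sm_client.delete_endpoint(EndpointName=name)
        deleted.append(name)
    sm_client.stop_notebook_instance(NotebookInstanceName=notebook_name)
    return deleted


class FakeSageMakerClient:
    """Hypothetical stand-in so the sketch runs without an AWS account."""

    def __init__(self):
        self.endpoints = ["order-approval"]
        self.stopped = []

    def list_endpoints(self):
        return {"Endpoints": [{"EndpointName": n} for n in self.endpoints]}

    def delete_endpoint(self, EndpointName):
        self.endpoints.remove(EndpointName)

    def stop_notebook_instance(self, NotebookInstanceName):
        self.stopped.append(NotebookInstanceName)


client = FakeSageMakerClient()  # with AWS credentials: boto3.client("sagemaker")
deleted = clean_up(client, "my-notebook")  # "my-notebook" is a placeholder name
print(deleted, client.endpoints, client.stopped)
```

Because this deletes every endpoint it finds, run something like it only in an account where no endpoint needs to stay up.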
This chapter was all about helping Karen decide whether to send an order to a technical approver. In this chapter, you worked end to end through a machine learning scenario built around that decision. The skills you learned in this chapter will be used throughout the rest of the book as you work through other examples of using machine learning to make decisions in business automation.
- You can uncover machine learning opportunities by identifying decision points.
- It’s simple to set up AWS SageMaker and use it with Jupyter notebooks to build a machine learning system.
- You send data to the machine learning endpoints to make predictions.
- You can test the predictions by sending the data to a CSV file for viewing.
- To ensure you are not charged for endpoints you are not using, you should delete all unused endpoints.