This chapter covers
- Using the k-nearest neighbors algorithm for regression
- Using tree-based algorithms for regression
- Comparing k-nearest neighbors, random forest, and XGBoost models
You’re going to find this chapter a breeze. This is because you’ve done everything in it before (sort of). In chapter 3, I introduced you to the k-nearest neighbors (kNN) algorithm as a tool for classification. In chapter 7, I introduced you to decision trees and then expanded on this in chapter 8 to cover random forest and XGBoost for classification. Well, conveniently, these algorithms can also be used to predict continuous variables. So in this chapter, I’ll help you extend these skills to solve regression problems.
By the end of this chapter, I hope you’ll understand how kNN and tree-based algorithms can be extended to predict continuous variables. As you learned in chapter 7, decision trees suffer from a tendency to overfit their training data and so are often vastly improved by using ensemble techniques. Therefore, in this chapter, you’ll train a random forest model and an XGBoost model, and benchmark their performance against the kNN algorithm.
Note
Recall from chapter 8 that random forest and XGBoost are two tree-based learners that create an ensemble of many trees to improve prediction accuracy. Random forest trains many trees in parallel on different bootstrap samples from the data, and XGBoost trains sequential trees that prioritize misclassified cases.
In this section, I'll show you how you can use the kNN algorithm for regression, graphically and intuitively. Imagine that you're not a morning person (perhaps, like me, you don't have to imagine very hard), and you like to spend as much time in bed as possible. To maximize the amount of time you spend sleeping, you decide to train a machine learning model to predict how long it takes you to commute to work, based on the time you leave the house. It takes you 40 minutes to get ready in the morning, so you hope this model will tell you what time you need to leave the house to get to work on time, and therefore what time you need to wake up.
Every day for two weeks, you record the time you leave the house and how long your journey takes. Your journey time is affected by the traffic (which varies across the morning), so your journey length changes, depending on when you leave. An example of what the relationship between departure time and journey length might look like is shown in figure 12.1.
Figure 12.1. An example relationship for how long your commute to work takes, depending on what time you leave the house

Recall from chapter 3 that the kNN algorithm is a lazy learner. In other words, it doesn't do any work during model training (instead, it just stores the training data); it does all of its work when it makes predictions. When making predictions, the kNN algorithm looks in the training set for the k cases most similar to each of the new, unlabeled data values. Each of these k most similar cases votes on the predicted value of the new data. When using kNN for classification, these votes are for class memberships, and the winning vote selects the class the model outputs for the new data. To remind you how this process works, I've reproduced a modified version of figure 3.4 from chapter 3, in figure 12.2.
Figure 12.2. The kNN algorithm for classification: identifying the k nearest neighbors and taking the majority vote. Lines connect the unlabeled data with their one, three, and five nearest neighbors. The majority vote in each scenario is indicated by the shape drawn under each cross.

The voting process when using kNN for regression is very similar, except that we take the mean of these k votes as the predicted value for the new data.
This process is illustrated for our commuting example in figure 12.3. The crosses on the x-axis represent new data: times we left the house and for which we want to predict journey length. If we train a one-nearest neighbor model, the model finds the single case from the training set that is closest to the departure time of each of the new data points, and uses that value as the predicted journey length. If we train a three-nearest neighbor model, the model finds the three training cases with departure times most similar to each of the new data points, takes the mean journey length of those nearest cases, and outputs this as the predicted value for the new data. The same applies to any number of k we use to train the model.
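To make this concrete in code, here's a minimal sketch of kNN regression computed by hand with the tidyverse. The departure times (in hours) and journey lengths (in minutes) are hypothetical values invented for illustration, not data used elsewhere in the chapter.

# A minimal sketch of kNN regression by hand; the commute data are hypothetical
library(tidyverse)

commute <- tibble(
  departure = c(7, 7.25, 7.4, 7.6, 7.8, 8),   # departure time (hours)
  journey   = c(32, 35, 44, 52, 48, 39)       # journey length (minutes)
)

newDeparture <- 7.5   # a new departure time we want a prediction for
k <- 3

commute %>%
  mutate(distance = abs(departure - newDeparture)) %>%  # distance to the new case
  arrange(distance) %>%                                  # sort by similarity
  slice(1:k) %>%                                         # keep the k nearest neighbors
  summarize(predictedJourney = mean(journey))            # mean of their journey lengths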
Figure 12.3. How the kNN algorithm predicts continuous variables. The crosses represent new data points for which we wish to predict the journey length. For the one-, three-, and five-nearest neighbor models, the nearest neighbors to each new data point are highlighted in a lighter shade. In each case, the predicted value is the mean journey length of the nearest neighbors.

Note
Just like when we used kNN for classification, selecting the best-performing value of k is critical to model performance. If we select a k that is too low, we may produce a model that is overfitted and makes predictions with high variance. If we select a k that is too high, we may produce a model that is underfitted and makes predictions with high bias.
In this section, I'll show you how you can use tree-based algorithms to predict a continuous outcome variable. Back in chapter 7, I showed you how tree-based algorithms (such as the rpart algorithm) split a feature space into separate regions, one binary split at a time. The algorithm tries to partition the feature space such that each region contains only cases from a particular class. Put another way, the algorithm tries to learn binary splits that result in regions that are as pure as possible.
Note
To refresh your memory, I've reproduced figure 7.4 in figure 12.4, showing how a feature space of two predictor variables can be partitioned to predict the membership of three classes.
Figure 12.4. How splitting is performed for classification problems. Cases belonging to three classes are plotted against two continuous variables. The first node splits the feature space into rectangles based on the value of variable 2. The second node further splits the variable 2 ≥ 20 feature space into rectangles based on the value of variable 1.

Classification with tree-based algorithms is a bit like herding animals into their pens on a farm. It's quite obvious that we want one pen for the chickens, one for the cows, and one for the alpacas (I don't think you see many alpacas on farms, but I'm particularly fond of them). So conceptually, it's quite easy for us to picture splitting regions of the feature space into different pens for different categories. But perhaps it's not so easy to picture splitting the feature space to predict a continuous variable.
So how does this partitioning work for regression problems? In exactly the same way: the only difference is that instead of each region representing a class, it represents a value of the continuous outcome variable. Take a look at figure 12.5, where we're creating a regression tree using our journey length example. The nodes of the regression tree split the feature space (departure time) into distinct regions. Each region represents the mean of the outcome variable of the cases inside it. When making predictions on new data, the model will predict the value of the region the new data falls into. The leaves of the tree are no longer classes, but numbers. This is illustrated for situations with one and two predictor variables in figure 12.5, but it extends to any number of predictors.
Just as for classification, regression trees can handle both continuous and categorical predictor variables (with the exception of XGBoost, which requires categorical variables to be numerically encoded). The way splits are decided for continuous and categorical variables is the same as for classification trees, except that instead of finding the split with the highest Gini gain, the algorithm looks for the split with the lowest sum of squares.
Figure 12.5. How splitting is performed for regression problems. The feature space is split into shaded regions based on the nodes of the tree next to each plot. The predicted journey length is shown inside each region. The dashed line in the top plot demonstrates how journey length is predicted from departure time based on the tree. The bottom plot shows a two-predictor situation.

Note
Recall from chapter 7 that the Gini gain is the difference between the Gini indices of the parent node and of the split. The Gini index is a measure of impurity and is equal to 1 – (p(A)² + p(B)²), where p(A) and p(B) are the proportions of cases belonging to classes A and B, respectively.
For each candidate split, the algorithm calculates the sum of squared residuals for the left and right splits, and adds them together to form the sum of squares for the split as a whole. In figure 12.6, the algorithm is considering the candidate split of a departure time before 7:45. For each case where the departure time was before 7:45, the algorithm calculates the mean journey length, finds the residual error (the difference between each case's journey length and the mean), and squares it. The same is done for the cases where you left the house after 7:45, with their respective mean. These two sums of squared residual values are added together to give the sum of squares for the split. If you prefer to see this in mathematical notation, it's shown in equation 12.1.
Figure 12.6. How candidate splits are chosen for regression problems. The measure of purity is the sum of squares for the split, which is the combined sums of squares for the left and right nodes. Each sum of squares is the vertical distance between each case and the predicted value for the leaf it belongs to.

Equation 12.1.

SSsplit = Σi ∈ left(yi – ȳleft)² + Σi ∈ right(yi – ȳright)²

where i ∈ left and i ∈ right indicate cases belonging to the left and right splits, respectively, yi is the outcome value (journey length) of case i, and ȳleft and ȳright are the mean outcome values of the left and right splits.
The candidate split with the lowest sum of squares is chosen as the split for any particular point in the tree. So, for regression trees, purity refers to how spread out the data are around the mean of the node.
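If you'd like to see this calculation in code, here's a minimal sketch in base R. The departure times and journey lengths are hypothetical, and the candidate split at 7.75 hours corresponds to the 7:45 departure time considered in figure 12.6.

# Sum of squares for a candidate split, using hypothetical commute data
departure <- c(7, 7.25, 7.4, 7.6, 7.8, 8)   # departure time (hours)
journey   <- c(32, 35, 44, 52, 48, 39)      # journey length (minutes)

splitPoint <- 7.75   # candidate split: departure before/after 7:45

left  <- journey[departure <  splitPoint]
right <- journey[departure >= splitPoint]

# Squared residuals around each node's mean, summed across both nodes
sum((left - mean(left))^2) + sum((right - mean(right))^2)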
In this section, I'll teach you how to define a kNN learner for regression, tune the k hyperparameter, and train a model so you can use it to predict a continuous variable. Imagine that you're a chemical engineer trying to predict the amount of heat released by various batches of fuel, based on measurements you made on each batch. We're first going to train a kNN model on this task and then compare how it performs to a random forest and an XGBoost model, later in the chapter.
library(mlr)

library(tidyverse)
The mlr package, conveniently, comes with several predefined tasks to help you experiment with different learners and processes. The dataset we're going to work with in this chapter is contained inside mlr's fuelsubset.task. We load this task into our R session the same way we would any built-in dataset: using the data() function. We can then use mlr's getTaskData() function to extract the data from the task, so we can explore it. As always, we use the as_tibble() function to convert the data frame into a tibble.
Listing 12.1. Loading and exploring the fuel dataset
data("fuelsubset.task") fuel <- getTaskData(fuelsubset.task) fuelTib <- as_tibble(fuel) fuelTib # A tibble: 129 x 367 heatan h20 UVVIS.UVVIS.1 UVVIS.UVVIS.2 UVVIS.UVVIS.3 UVVIS.UVVIS.4 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 26.8 2.3 0.874 0.748 0.774 0.747 2 27.5 3 -0.855 -1.29 -0.833 -0.976 3 23.8 2.00 -0.0847 -0.294 -0.202 -0.262 4 18.2 1.85 -0.582 -0.485 -0.328 -0.539 5 17.5 2.39 -0.644 -1.12 -0.665 -0.791 6 20.2 2.43 -0.504 -0.890 -0.662 -0.744 7 15.1 1.92 -0.569 -0.507 -0.454 -0.576 8 20.4 3.61 0.158 0.186 0.0303 0.183 9 26.7 2.5 0.334 0.191 0.0777 0.0410 10 24.9 1.28 0.0766 0.266 0.0808 -0.0733 # ... with 119 more rows, and 361 more variables
We have a tibble containing 129 different batches of fuel and 367 variables/features! In fact, there are so many variables that I've truncated the printout of the tibble to remove the names of the variables that didn't fit on my console.
Tip
Run names(fuelTib) to return the names of all the variables in the dataset. This is useful when working with large datasets with too many columns to visualize on the console.
The heatan variable is the amount of energy released by a certain quantity of fuel when it is combusted (measured in megajoules). The h20 variable is the percentage of humidity in the fuel's container. The remaining variables show how much ultraviolet or near-infrared light of a particular wavelength each batch of fuel absorbs (each variable represents a different wavelength).
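If you'd like a quick numerical feel for the outcome and humidity variables before plotting, base R's summary() function is enough:

# Quick numeric summaries of the outcome (heatan) and humidity (h20) variables
summary(fuelTib$heatan)

summary(fuelTib$h20)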
Tip
To see all the tasks that come built into mlr, use data(package = "mlr").
Let's plot the data to get an idea of how the heatan variable correlates with the absorbance variable at various wavelengths of ultraviolet and near-infrared light. We'll flex our tidyverse muscles by doing some more-complicated operations, so let me take you step by step through the process in listing 12.2:
- Because we want to plot a separate geom_smooth() line for every case in the data, we first pipe the data into a mutate() function call, where we create an id variable that just acts as a row index. We use nrow(.) to specify the number of rows in the data object piped into mutate().
- We pipe the result of step 1 into a gather() function to create a key-value pair of variables containing the spectral information (wavelength as the key, absorbance at that wavelength as the value). We omit the heatan, h20, and id variables from the gathering process (c(-heatan, -h20, -id)).
- We pipe the result of step 2 into another mutate() function to create two new variables:
- A character vector that indicates whether the row shows absorbance of ultraviolet or near-infrared spectra
- A numeric vector that indicates the wavelength of that particular spectrum
I've introduced two functions here from the stringr tidyverse package: str_sub() and str_extract(). The str_sub() function splits a character string into its individual alphanumeric characters and symbols, and returns the ones that are between the start and end arguments. For example, str_sub("UVVIS.UVVIS.1", 1, 3) returns "UVV". We use this function to mutate a column with the value "UVV" when the spectrum is ultraviolet and "NIR" when the spectrum is near-infrared.
The str_extract() function looks for a particular pattern in a character string, and returns that pattern. In the example in listing 12.2, we asked the function to look for any numerical digits, using \\d. The + after \\d tells the function that the pattern may be matched more than once. For example, compare the output of str_extract("hello123", "\\d") and str_extract("hello123", "\\d+").
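If you want to convince yourself of what these two functions return before running the full pipeline, you can try them on their own. The expected outputs are shown as comments; the "NIR.NIR.1" string is just an illustrative example of a near-infrared variable name.

library(stringr)

str_sub("UVVIS.UVVIS.1", 1, 3)           # "UVV"
str_sub("NIR.NIR.1", 1, 3)               # "NIR"

str_extract("hello123", "\\d")           # "1" (a single digit)
str_extract("hello123", "\\d+")          # "123" (one or more digits)
str_extract("UVVIS.UVVIS.42", "(\\d)+")  # "42", the pattern used in listing 12.2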
Listing 12.2. Preparing the data for plotting
fuelUntidy <- fuelTib %>%
  mutate(id = 1:nrow(.)) %>%
  gather(key = "variable", value = "absorbance",
         c(-heatan, -h20, -id)) %>%
  mutate(spectrum = str_sub(variable, 1, 3),
         wavelength = as.numeric(str_extract(variable, "(\\d)+")))

fuelUntidy

# A tibble: 47,085 x 7
   heatan   h20    id variable      absorbance spectrum wavelength
    <dbl> <dbl> <int> <chr>              <dbl> <chr>         <dbl>
 1   26.8  2.3      1 UVVIS.UVVIS.1     0.874  UVV               1
 2   27.5  3        2 UVVIS.UVVIS.1    -0.855  UVV               1
 3   23.8  2.00     3 UVVIS.UVVIS.1    -0.0847 UVV               1
 4   18.2  1.85     4 UVVIS.UVVIS.1    -0.582  UVV               1
 5   17.5  2.39     5 UVVIS.UVVIS.1    -0.644  UVV               1
 6   20.2  2.43     6 UVVIS.UVVIS.1    -0.504  UVV               1
 7   15.1  1.92     7 UVVIS.UVVIS.1    -0.569  UVV               1
 8   20.4  3.61     8 UVVIS.UVVIS.1     0.158  UVV               1
 9   26.7  2.5      9 UVVIS.UVVIS.1     0.334  UVV               1
10   24.9  1.28    10 UVVIS.UVVIS.1     0.0766 UVV               1
# ... with 47,075 more rows
That was some reasonably complex data manipulation, so run the code, take a look at the resulting tibble, and make sure you understand how we created it.
Tip
We search for patterns in character vectors by specifying regular expressions, such as "\\d+" in listing 12.2. A regular expression is a special text string for describing a search pattern. Regular expressions are very useful tools for extracting (sometimes-complex) patterns from character strings. If I've piqued your interest in regular expressions, you can learn more about how to use them in R by running ?regex.
Now that we've formatted our data for plotting, we're going to draw three plots:
- absorbance versus heatan, with a separate curve for every wavelength
- wavelength versus absorbance, with a separate curve for every case
- Humidity (h20) versus heatan
In the plot for absorbance versus heatan, we wrap wavelength inside the as.factor() function, so that each wavelength will be drawn with a discrete color (rather than a gradient of colors from low to high wavelengths). To prevent the ggplot() function from drawing a huge legend showing the color of each of the lines, we suppress the legend by adding theme(legend.position = "none"). We facet by spectrum to create subplots for the ultraviolet and near-infrared spectra, allowing the x-axis to vary between subplots using the scales = "free_x" argument.
I don't know about you, but I was always told in school to add titles to my plots. We can do this in ggplot2 using the ggtitle() function, supplying the title we want in quotes.
Tip
The theme() function allows you to customize almost anything about the appearance of your ggplots, including font sizes and the presence/absence of grid lines. I won't discuss this in depth, but I recommend taking a look at the help page using ?theme to find out what you can do.
In the plot for wavelength versus absorbance, we set the group aesthetic equal to the id variable we created, so that the geom_smooth() layer will draw a separate curve for each batch of fuel.
Listing 12.3. Plotting the data
fuelUntidy %>%
  ggplot(aes(absorbance, heatan, col = as.factor(wavelength))) +
  facet_wrap(~ spectrum, scales = "free_x") +
  geom_smooth(se = FALSE, size = 0.2) +
  ggtitle("Absorbance vs heatan for each wavelength") +
  theme_bw() +
  theme(legend.position = "none")

fuelUntidy %>%
  ggplot(aes(wavelength, absorbance, group = id, col = heatan)) +
  facet_wrap(~ spectrum, scales = "free_x") +
  geom_smooth(se = FALSE, size = 0.2) +
  ggtitle("Wavelength vs absorbance for each batch") +
  theme_bw()

fuelUntidy %>%
  ggplot(aes(h20, heatan)) +
  geom_smooth(se = FALSE) +
  ggtitle("Humidity vs heatan") +
  theme_bw()
The resulting plots are shown in figure 12.7 (I've combined them into a single figure to save space). Data really is beautiful sometimes, isn't it? In the plots of absorbance against heatan, each line corresponds to a particular wavelength. The relationship between each predictor variable and the outcome variable is complex and nonlinear. There is also a nonlinear relationship between h20 and heatan.
In the plots of wavelength against absorbance, each line corresponds to a particular batch of fuel, and the lines show its absorbance of ultraviolet and near-infrared light. The shading of each line corresponds to the heatan value of that batch. It's difficult to identify patterns in these plots, but certain absorbance profiles seem to correlate with higher and lower heatan values.
Tip
While you can certainly overfit your data, you can never over-plot it. When starting an exploratory analysis, I will plot my dataset in multiple different ways to get a better understanding of it from different perspectives/angles.
Exercise 1
Add an additional geom_smooth() layer to the plot of absorbance versus heatan with these arguments:
- group = 1
- col = "blue"
Using the argument group = 1, create a single smoothing line that models all of the data, ignoring groups.
Figure 12.7. Plotting the relationships in the fuelTib dataset. The topmost plots show absorbance against heatan with separate lines drawn for each wavelength, faceted by near-infrared (NIR) or ultraviolet (UVV) light. The middle plots show wavelength against absorbance shaded by heatan with separate lines drawn for each batch of fuel, faceted by NIR or UVV light. The bottom plot shows h20 against heatan.

Modeling spectral data
The dataset we're working with is an example of spectral data. Spectral data contains observations made across a range of (usually) wavelengths. For example, we might measure how much a substance absorbs light from a range of different colors.
Statisticians and data scientists call this kind of data functional data, where there are many dimensions in the dataset (the wavelengths we measure across) and there is a particular order to those dimensions (starting by measuring the absorbance at the lowest wavelength and working our way up to the highest wavelength).
A branch of statistics called functional data analysis is dedicated to modeling data like this. In functional data analysis, each predictor variable is turned into a function (for example, a function that describes how absorbance changes over ultraviolet and near-infrared wavelengths). That function is then used in the model as a predictor, to predict the outcome variable. We won't apply this kind of technique to this data, but if you're interested in functional data analysis, check out Functional Data Analysis by James Ramsay (Springer, 2005).
Because the predefined fuelsubset.task defines the ultraviolet and near-infrared spectra as functional variables, we're going to define our own task, treating each wavelength as a separate predictor. We do this, as usual, with the makeRegrTask() function, setting the heatan variable as our target. We then define our kNN learner using the makeLearner() function.
Listing 12.4. Defining the task and kNN learner
fuelTask <- makeRegrTask(data = fuelTib, target = "heatan")

kknn <- makeLearner("regr.kknn")
Note
Notice that for regression, the name of the learner is "regr.kknn" with two k's, rather than the "classif.knn" we used in chapter 3. This is because this function is taken from the kknn package, which allows us to perform kernel k-nearest neighbors, where we use a kernel function (just like with SVMs in chapter 6) to find a linear decision boundary between classes.
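If you're ever unsure which regression learners are available, or which hyperparameters a learner exposes, mlr can tell you. As a quick check (output omitted here):

# List the regression learners mlr knows about, and inspect kknn's hyperparameters
listLearners("regr")$class

getParamSet(kknn)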
In this section, we're going to tune k to get the best-performing kNN model possible. Remember that for regression, the value of k determines how many of the nearest neighbors' outcome values to average when making predictions on new cases. We first define the hyperparameter search space using the makeParamSet() function, and define k as a discrete hyperparameter with possible values 1 through 12. Then we define our search procedure as a grid search (so that we will try every value in the search space), and define a 10-fold cross-validation strategy.
As we've done many times before, we run the tuning process using the tuneParams() function, supplying the learner, task, cross-validation method, hyperparameter space, and search procedure as arguments.
Listing 12.5. Tuning k
kknnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:12))

gridSearch <- makeTuneControlGrid()

kFold <- makeResampleDesc("CV", iters = 10)

tunedK <- tuneParams(kknn, task = fuelTask,
                     resampling = kFold,
                     par.set = kknnParamSpace,
                     control = gridSearch)

tunedK

Tune result:
Op. pars: k=7
mse.test.mean=10.7413
We can plot the hyperparameter tuning process by extracting the tuning data with the generateHyperParsEffectData() function and passing this to the plotHyperParsEffect() function, supplying our hyperparameter ("k") as the x-axis and MSE ("mse.test.mean") as the y-axis. Setting the plot.type argument equal to "line" connects the samples with a line.
Listing 12.6. Plotting the tuning process
knnTuningData <- generateHyperParsEffectData(tunedK)

plotHyperParsEffect(knnTuningData, x = "k", y = "mse.test.mean",
                    plot.type = "line") +
  theme_bw()
The resulting plot is shown in figure 12.8. We can see that the mean MSE starts to rise as k increases beyond 7, so it looks like our search space was appropriate.
Exercise 2
Let's make sure our search space was large enough. Repeat the tuning process, but search values of k from 1 to 50. Plot this tuning process just like we did in figure 12.8. Was our original search space large enough?
Now that we have our tuned value of k, we can define a learner using that value, with the setHyperPars() function, and train a model using it.
Figure 12.8. Plotting our hyperparameter tuning process. The average MSE (mse.test.mean) is shown for each value of k.

Listing 12.7. Training the final, tuned kNN model
tunedKnn <- setHyperPars(makeLearner("regr.kknn"), par.vals = tunedK$x)

tunedKnnModel <- train(tunedKnn, fuelTask)
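As a quick sanity check, you can make predictions with the tuned model and compute their MSE. Note that these are predictions on the data the model was trained on, so they will be optimistic; the cross-validated MSE from the tuning output is the honest performance estimate.

# Predictions on the training data: useful for inspecting the output, but optimistic
knnPreds <- predict(tunedKnnModel, newdata = fuelTib)

knnPreds

performance(knnPreds, measures = mse)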
In this section, I'll teach you how to define a random forest learner for regression, tune its many hyperparameters, and train a model for our fuel task.
Note
We can also use the rpart algorithm to build a regression tree, but as it is almost always outperformed by bagged and boosted learners, we're going to skip over it and dive straight in with random forest and XGBoost. Recall that bagged (bootstrap-aggregated) learners train multiple models on bootstrap samples of the data and return the majority vote (or, for regression, the mean prediction) of the individual models. Boosted learners train models sequentially, putting more emphasis on correcting the mistakes of the previous ensemble of models.
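If the idea of a bootstrap sample feels abstract, the following sketch draws one from the fuel data: sampling row indices with replacement means some cases appear more than once and others not at all, which is what keeps the individual trees in a bagged ensemble different from one another. The seed is arbitrary and only makes the example reproducible.

# Drawing a single bootstrap sample of row indices from the fuel data
set.seed(123)

bootstrapRows <- sample(nrow(fuelTib), replace = TRUE)

head(bootstrapRows)             # some rows appear more than once

length(unique(bootstrapRows))   # roughly a third of the 129 rows don't appear at all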
We'll start by defining our random forest learner. Notice that rather than "classif.randomForest" as in chapter 8, the regression equivalent is "regr.randomForest":
forest <- makeLearner("regr.randomForest")
Next, we're going to tune the hyperparameters of our random forest learner: ntree, mtry, nodesize, and maxnodes. I first defined what these hyperparameters do in chapter 8, but let's recap each one here:
- ntree controls the number of individual trees to train. More trees is usually better, until adding more doesn't improve performance further.
- mtry controls the number of predictor variables that are randomly sampled for each individual tree. Training each individual tree on a random selection of predictor variables helps keep the trees uncorrelated and therefore helps prevent the ensemble model from overfitting the training set.
- nodesize defines the minimum number of cases allowed in a leaf node. For example, setting nodesize equal to 1 would allow each case in the training set to have its own leaf.
- maxnodes defines the maximum number of nodes in each individual tree.
As usual, we create our hyperparameter search space using the makeParamSet() function, defining each hyperparameter as an integer with sensible lower and upper bounds.
We define a random search with 100 iterations and start the tuning procedure with our forest learner, fuel task, and the 10-fold cross-validation strategy we defined earlier (kFold).
Warning
This tuning process takes a little while, so let's use our good friends the parallel and parallelMap packages. Using parallelization, this takes 2 minutes on my four-core machine.
Listing 12.8. Hyperparameter tuning for random forest
forestParamSpace <- makeParamSet(
  makeIntegerParam("ntree", lower = 50, upper = 50),
  makeIntegerParam("mtry", lower = 100, upper = 367),
  makeIntegerParam("nodesize", lower = 1, upper = 10),
  makeIntegerParam("maxnodes", lower = 5, upper = 30))

randSearch <- makeTuneControlRandom(maxit = 100)

library(parallel)
library(parallelMap)

parallelStartSocket(cpus = detectCores())

tunedForestPars <- tuneParams(forest, task = fuelTask,
                              resampling = kFold,
                              par.set = forestParamSpace,
                              control = randSearch)

parallelStop()

tunedForestPars

Tune result:
Op. pars: ntree=50; mtry=244; nodesize=6; maxnodes=25
mse.test.mean=6.3293
Next, let's train the random forest model using our tuned hyperparameters. Once we've trained the model, it's a good idea to extract the model information and pass this to the plot() function to plot the out-of-bag error. Recall from chapter 8 that the out-of-bag error is the mean prediction error for each case, made by the trees that did not include that case in their bootstrap sample. The only difference between the out-of-bag error for classification and regression random forests is that in classification, the error was the proportion of cases that were misclassified; in regression, the error is the mean squared error.
Listing 12.9. Training the model and plotting the out-of-bag error
tunedForest <- setHyperPars(forest, par.vals = tunedForestPars$x)

tunedForestModel <- train(tunedForest, fuelTask)

forestModelData <- getLearnerModel(tunedForestModel)

plot(forestModelData)
The resulting plot is shown in figure 12.9. It looks like the out-of-bag error stabilizes after 30–40 bagged trees, so we can be satisfied that we have included enough trees in our forest.
Figure 12.9. Plotting the out-of-bag error for our random forest model. The Error y-axis shows the mean square error for all cases, predicted by trees that didn’t include that case in the training set. This is shown for varying numbers of trees in the ensemble. The flattening out of the line suggests we have included enough individual trees in the forest.

In this section, I'll teach you how to define an XGBoost learner for regression, tune its many hyperparameters, and train a model for our fuel task. We'll start by defining our XGBoost learner. Just like for the kNN and random forest learners, instead of using "classif.xgboost" as in chapter 8, the regression equivalent is "regr.xgboost":
xgb <- makeLearner("regr.xgboost")
Next, we're going to tune the hyperparameters of our XGBoost learner: eta, gamma, max_depth, min_child_weight, subsample, colsample_bytree, and nrounds. I first defined what these hyperparameters do in chapter 8, but again, let's recap each one here:
- eta is known as the learning rate. It takes a value between 0 and 1, which is multiplied by the model weight of each tree to slow down the learning process to prevent overfitting.
- gamma is the minimum amount of splitting by which a node must improve the loss function (MSE in the case of regression).
- max_depth is the maximum number of levels deep that each tree can grow.
- min_child_weight is the minimum degree of impurity needed in a node before attempting to split it (if a node is pure enough, don't try to split it again).
- subsample is the proportion of cases to be randomly sampled (without replacement) for each tree. Setting this to 1 uses all the cases in the training set.
- colsample_bytree is the proportion of predictor variables sampled for each tree. We could also tune colsample_bylevel and colsample_bynode, which instead sample predictors for each level of depth in a tree and at each node, respectively.
- nrounds is the number of sequentially built trees in the model.
Note
When we used XGBoost for classification problems, we could also tune the eval_metric hyperparameter to select between the log loss and classification error loss functions. For regression problems, we only have one loss function available to us (RMSE), so there is no need to tune this hyperparameter.
In listing 12.10, we define the type and the upper and lower bounds of each of these hyperparameters that we'll search over. We define max_depth and nrounds as integer hyperparameters, and all the others as numerics. I've chosen sensible starting values for the upper and lower bounds of each hyperparameter, but you may find in your own projects that you need to adjust your search space to find the optimal combination of values. I usually fix the nrounds hyperparameter as a single value that fits my computational budget to start with, and then plot the loss function (RMSE) against the tree number to see if the model error has flattened out. If it hasn't, I increase the nrounds hyperparameter until it does. We'll perform this check in listing 12.11.
Once the search space is defined, we start the tuning process just like we have the previous two times in this chapter.
Warning
This takes around 1.5 minutes on my four-core machine.
Listing 12.10. Hyperparameter tuning for XGBoost
xgbParamSpace <- makeParamSet(
  makeNumericParam("eta", lower = 0, upper = 1),
  makeNumericParam("gamma", lower = 0, upper = 10),
  makeIntegerParam("max_depth", lower = 1, upper = 20),
  makeNumericParam("min_child_weight", lower = 1, upper = 10),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1),
  makeIntegerParam("nrounds", lower = 30, upper = 30))

tunedXgbPars <- tuneParams(xgb, task = fuelTask,
                           resampling = kFold,
                           par.set = xgbParamSpace,
                           control = randSearch)

tunedXgbPars

Tune result:
Op. pars: eta=0.188; gamma=6.44; max_depth=11; min_child_weight=1.55;
  subsample=0.96; colsample_bytree=0.7; nrounds=30
mse.test.mean=6.2830
Now that we have our tuned combination of hyperparameters, let's train the final model using this combination. Once we've done this, we can extract the model information and use it to plot the iteration number (tree number) against the RMSE, to see if we included enough trees in our ensemble. The RMSE information for each tree number is contained in the $evaluation_log component of the model information, so we use this as the data argument for the ggplot() function, specifying iter and train_rmse to plot the tree number and its RMSE as the x and y aesthetics, respectively.
Listing 12.11. Training the model and plotting RMSE against tree number
tunedXgb <- setHyperPars(xgb, par.vals = tunedXgbPars$x)

tunedXgbModel <- train(tunedXgb, fuelTask)

xgbModelData <- getLearnerModel(tunedXgbModel)

ggplot(xgbModelData$evaluation_log, aes(iter, train_rmse)) +
  geom_line() +
  geom_point() +
  theme_bw()
The resulting plot is shown in figure 12.10. We can see that 30 iterations/trees is just about enough for the RMSE to have flattened out (including more iterations won't result in a better model).
I love a bit of healthy competition. In this section, we're going to benchmark the kNN, random forest, and XGBoost model-building processes against each other. We start by creating tuning wrappers that wrap together each learner with its hyperparameter tuning process. Then we create a list of these wrapper learners to pass into benchmark(). As this process will take some time, we're going to define and use a holdout cross-validation procedure to evaluate the performance of each wrapper (ideally we would use k-fold or repeated k-fold).
Figure 12.10. Plotting the average root mean square error (train_rmse) against the iteration of the boosting process. The curve flattens out just before 30 iterations, suggesting that we have included enough trees in our ensemble.

Warning
It's tea and cake time! This takes around 7 minutes to run on my four-core machine. Using the parallelMap package won't help, because we're training XGBoost models as part of the benchmark, and XGBoost works fastest if you allow it to perform its own internal parallelization.
Listing 12.12. Benchmarking kNN, random forest, and XGBoost
kknnWrapper <- makeTuneWrapper(kknn, resampling = kFold,
                               par.set = kknnParamSpace,
                               control = gridSearch)

forestWrapper <- makeTuneWrapper(forest, resampling = kFold,
                                 par.set = forestParamSpace,
                                 control = randSearch)

xgbWrapper <- makeTuneWrapper(xgb, resampling = kFold,
                              par.set = xgbParamSpace,
                              control = randSearch)

learners = list(kknnWrapper, forestWrapper, xgbWrapper)

holdout <- makeResampleDesc("Holdout")

bench <- benchmark(learners, fuelTask, holdout)

bench

  task.id              learner.id mse.test.mean
1 fuelTib         regr.kknn.tuned        10.403
2 fuelTib regr.randomForest.tuned         6.174
3 fuelTib      regr.xgboost.tuned         8.043
According to this benchmark result, the random forest algorithm is likely to give us the best-performing model, with a mean prediction error of 2.485 (the square root of 6.174).
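Because the benchmark reports the mean squared error, taking its square root puts the error back on the scale of heatan (megajoules):

sqrt(6.174)   # ~2.485, the typical size of a prediction error in megajoules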
The strengths and weaknesses of the kNN, random forest, and XGBoost algorithms are the same for regression as they were for classification.
Exercise 3
Get a more accurate estimate of each of our model-building processes by rerunning the benchmark experiment, changing our holdout cross-validation object to our kFold object. Warning: This took nearly an hour on my four-core machine! Save the benchmark result to an object, and pass that object as the only argument to the plotBMRBoxplots() function.
Exercise 4
Cross-validate the model-building process of the model that won the benchmark in exercise 3, but perform 2,000 iterations of the random search during hyperparameter tuning. Use holdout as the inner cross-validation loop and 10-fold cross-validation as the outer loop. Warning: I'd suggest you use parallelization and leave this running during lunch or overnight.
- The k-nearest neighbors (kNN) and tree-based algorithms can be used for regression as well as classification.
- When predicting a continuous outcome variable, the predictions made by kNN are the mean outcome values of the k-nearest neighbors.
- When predicting a continuous outcome variable, the predictions made at the leaves of tree-based models are the mean outcome value of the cases within each leaf.
- Out-of-bag error and RMSE can still be used to identify whether random forest and XGBoost ensembles have enough trees, respectively, in regression problems.
- Plot absorbance versus heatan with an additional geom_smooth() layer that models the whole dataset:
fuelUntidy %>%
  ggplot(aes(absorbance, heatan, col = as.factor(wavelength))) +
  facet_wrap(~ spectrum, scales = "free_x") +
  geom_smooth(se = FALSE, size = 0.2) +
  geom_smooth(group = 1, col = "blue") +
  ggtitle("Absorbance vs heatan for each wavelength") +
  theme_bw() +
  theme(legend.position = "none")
- Expand the kNN search space to include values between 1 and 50:
kknnParamSpace50 <- makeParamSet(makeDiscreteParam("k", values = 1:50))

tunedK50 <- tuneParams(kknn, task = fuelTask,
                       resampling = kFold,
                       par.set = kknnParamSpace50,
                       control = gridSearch)

tunedK50

knnTuningData50 <- generateHyperParsEffectData(tunedK50)

plotHyperParsEffect(knnTuningData50, x = "k", y = "mse.test.mean",
                    plot.type = "line") +
  theme_bw()

# Our original search space was large enough.
- Use 10-fold cross-validation as the outer cross-validation loop for the benchmark experiment:
benchKFold <- benchmark(learners, fuelTask, kFold)

plotBMRBoxplots(benchKFold)
- Cross-validate the model-building process for the algorithm that won the benchmark, performing 2,000 iterations of the random search and using holdout as the inner cross-validation strategy (inside the tuning wrapper):
holdout <- makeResampleDesc("Holdout")

randSearch2000 <- makeTuneControlRandom(maxit = 2000)

forestWrapper2000 <- makeTuneWrapper(forest, resampling = holdout,
                                     par.set = forestParamSpace,
                                     control = randSearch2000)

parallelStartSocket(cpus = detectCores())

cvWithTuning <- resample(forestWrapper2000, fuelTask, resampling = kFold)

parallelStop()