Chapter 3. Classifying based on similarities with k-nearest neighbors


This chapter covers

  • Understanding the bias-variance trade-off
  • Underfitting vs. overfitting
  • Using cross-validation to assess model performance
  • Building a k-nearest neighbors classifier
  • Tuning hyperparameters

This is probably the most important chapter of the entire book. In it, I’m going to show you how the k-nearest neighbors (kNN) algorithm works, and we’re going to use it to classify potential diabetes patients. In addition, I’m going to use the kNN algorithm to teach you some essential concepts in machine learning that we will rely on for the rest of the book.

By the end of this chapter, not only will you understand and be able to use the kNN algorithm to make classification models, but you will be able to validate its performance and tune it to improve its performance as much as possible. Once the model is built, you’ll learn how to pass new, unseen data into it and get the data’s predicted classes (the value of the categorical or grouping variable we are trying to predict). I’ll introduce you to the extremely powerful mlr package in R, which contains a mouth-watering number of machine learning algorithms and greatly simplifies all of our machine learning tasks.


3.1. What is the k-nearest neighbors algorithm?

I think the simple things in life are the best: playing Frisbee in the park, walking my dog, playing board games with my family, and using the kNN algorithm. Some machine learning practitioners look down on kNN a little because it's very simplistic. In fact, kNN is arguably the simplest machine learning algorithm, and this is one of the reasons I like it so much. In spite of its simplicity, kNN can provide surprisingly good classification performance, and its simplicity makes it easy to interpret.

Note

Remember that, because kNN uses labeled data, it is a supervised learning algorithm.

3.1.1. How does the k-nearest neighbors algorithm learn?

So how does kNN learn? Well, I'm going to use snakes to help me explain. I'm from the UK, where—some people are surprised to learn—we have a few native species of snake. Two examples are the grass snake and the adder, which is the only venomous snake in the UK. But we also have a legless reptile called a slow worm, which is commonly mistaken for a snake.

Jnaigme rzbr dkh vwte tlv s eptiler vatincsoorne cptorej iignam rk ntouc rbv ubensrm lv ssarg saenks, drsdae, sng wzkf ormws nj c odwdlnoa. Cvtp eui jz vr budil c mode f rsrp wsallo khh rk yilckuq salcfiys prlesite xpp blnj xjnr ken kl etshe ehter classes. Mnux pxq ljnb vvn le ehtse smaailn, dqv fbxn vuoc guehon vjrm xr adlyipr miasette jrc thelgn znu cemv ureseam el weg egasigevsr rj jz taodrw gpk, feeorb rj esrlstih bzcw (ngidnfu cj xxgt reaccs xlt txgp rpeojct). C reltpie rexpet eslph bdx allmuyan isyclfas gor btisaoosrven uye’eo mvzp kc tlc, qry dku edicde rv ubldi z kNN classifier rk fvud dkp kciyqlu scyslfai uurfet csminespe gxd akkm orcsas.

Look at the plot of the data before classification in figure 3.1. Each of our cases is plotted against body length and aggression, and the species identified by your expert is indicated by the shape of the datum. You go into the woodland again and collect data from three new specimens, which are shown by the black crosses.

Figure 3.1. Body length and aggression of reptiles. Labeled cases for adders, grass snakes, and slow worms are indicated by their shape. New, unlabeled data are shown by black crosses.

We can describe the kNN algorithm (and other machine learning algorithms) in terms of two phases:

  1. The training phase
  2. The prediction phase

Rvu training phsea xl ory kNN algorithm sosnicts fnuk el tngiros prv data. Yjda jz ulanusu mnago machine learning algorithms (cz xbg’ff relna jn aetlr ehrpsatc), bzn rj namse rqcr rzmk xl qkr iapucmtoton zj nokb ndurgi krq coidtrepni hapes.

During the prediction phase, the kNN algorithm calculates the distance between each new, unlabeled case and all the labeled cases. When I say "distance," I mean their nearness in terms of the aggression and body-length variables, not how far away in the woods you found them! This distance metric is often called Euclidean distance, which in two or even three dimensions is easy to visualize in your head as the straight-line distance between two points on a plot (this distance is shown in figure 3.2). It is calculated in as many dimensions as are present in the data.
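
To make the distance calculation concrete, here is a minimal sketch in base R (the numbers are made up for illustration and are not the reptile data): it computes the straight-line distance between one unlabeled case and every labeled case across both predictors.

labeled <- rbind(c(40, 2), c(52, 9), c(18, 1))            # body length, aggression
colnames(labeled) <- c("length", "aggression")
unlabeled <- c(length = 45, aggression = 3)

# Subtract the unlabeled case from every labeled case, square, sum, square root
sqrt(rowSums(sweep(labeled, 2, unlabeled)^2))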

Figure 3.2. The first step of the kNN algorithm: calculating distance. The lines represent the distance between one of the unlabeled cases (the cross) and each of the labeled cases.

Next, for each unlabeled case, the algorithm ranks the neighbors from the nearest (most similar) to the furthest (least similar). This is shown in figure 3.3.

Figure 3.3. The second step of the kNN algorithm: ranking the neighbors. The lines represent the distance between one of the unlabeled cases (the cross) and each of the labeled cases. The numbers represent the ranked distance between the unlabeled case (the cross) and each labeled case (1 = closest).

The algorithm identifies the k labeled cases (neighbors) nearest to each unlabeled case. k is an integer specified by us (I'll cover how we choose k later in this chapter). In other words, find the k labeled cases that are most similar in terms of their variables to the unlabeled case. Finally, each of the k-nearest neighbor cases "votes" on which class the unlabeled data belongs to, based on that neighbor's own class. In other words, whatever class most of the k-nearest neighbors belong to is what the unlabeled case is classified as.

Note

Because all of its computation is done during the prediction phase, kNN is said to be a lazy learner.

Vkr’z evwt ohtgruh figure 3.4 gsn xvc rgcj nj tprceica. Mgxn wx kra k re 1, rbk algorithm fndsi rxy nelsgi ebledal vsas rqcr zj xcrm lmisair rx zaqx lk vdr yn labeled data imtes. Zzcd vl rxy elbldunae tleipesr ja csltose rv c mmrebe lx urv grssa neska slasc, ae gdxr vtz zff gsaendis rx jrab sclsa.

Figure 3.4. The final step of the kNN algorithm: identifying the k-nearest neighbors and taking the majority vote. Lines connect the unlabeled data with their one, three, and five nearest neighbors. The majority vote in each scenario is indicated by the shape drawn under each cross.

When we set k to 3, the algorithm finds the three labeled cases that are most similar to each of the unlabeled data items. As you can see in the figure, two of the unlabeled cases have nearest neighbors belonging to more than one class. In this situation, each nearest neighbor "votes" for its own class, and the majority vote wins. This is very intuitive, because if a single unusually aggressive grass snake happens to be the nearest neighbor to an as-yet-unlabeled adder, it will be outvoted by the neighboring adders in the data.

Hopefully now you can see how this extends to other values of k. When we set k to 5, for example, the algorithm simply finds the five nearest cases to the unlabeled data and takes the majority vote as the class of the unlabeled case. Notice that in all three scenarios, the value of k directly impacts how each unlabeled case is classified.
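
To tie the three steps together, here is a minimal, from-scratch sketch of the vote for a single unlabeled case (made-up numbers, and not the implementation mlr uses later in the chapter):

# Distance, ranking, and majority vote for one unlabeled case
knnVote <- function(unlabeled, labeled, classes, k) {
  dists <- sqrt(rowSums(sweep(labeled, 2, unlabeled)^2))  # distance to every labeled case
  nearest <- order(dists)[1:k]                            # indices of the k nearest neighbors
  names(which.max(table(classes[nearest])))               # majority vote among those neighbors
}

labeled <- rbind(c(60, 2), c(95, 8), c(30, 1), c(62, 3), c(90, 9))
classes <- c("grass snake", "adder", "slow worm", "grass snake", "adder")
knnVote(c(58, 3), labeled, classes, k = 3)   # returns "grass snake"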

Tip

The kNN algorithm can actually be used for both classification and regression problems! I'll show you how in chapter 12, but the only difference is that instead of taking the majority class vote, the algorithm finds the mean or median of the nearest neighbors' values.

3.1.2. What happens if the vote is tied?

It may happen that all of the k-nearest neighbors belong to different classes and that the vote results in a tie. What happens in this situation? Well, one way we can avoid this in a two-class classification problem (when the data can only belong to one of two, mutually exclusive groups) is to ensure that we pick odd values of k. This way, there will always be a deciding vote. But what about in situations like our reptile classification problem, where we have more than two groups?

Nnk whz le gneiadl uwjr rjcg tnuaiiots zj er eecdsear k luint c ajroymit rokk zsn uk wne. Ayr jard nsdeo’r dyfo lj zn laeedunbl zzak cj nuaeitdqtis enewetb rjc rxw enetrsa behrsigno.

Instead, a more common (and pragmatic) approach is to randomly assign cases with no majority vote to one of the classes. In practice, the proportion of cases that have ties among their nearest neighbors is very small, so this has a limited impact on the classification accuracy of the model. However, if you have many ties in your data, your options are as follows:

  • Choose a different value of k.
  • Add a small amount of noise to the data (see the short sketch after this list).
  • Consider using a different algorithm! I'll show you how you can compare the performance of different algorithms on the same problem at the end of chapter 8.
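
As an illustration of the second option, a tiny amount of random noise added to the continuous predictors makes exact ties in distance vanishingly unlikely. A minimal sketch using base R's jitter() (not something we need to do for the diabetes data later):

# Perturb near-identical measurements very slightly so two neighbors are
# almost never exactly the same distance from an unlabeled case
lengths <- c(40, 40, 52)
jitter(lengths, amount = 0.001)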

3.2. Building your first kNN model

Imagine that you work in a hospital and are trying to improve the diagnosis of patients with diabetes. You collect diagnostic data over a few months from suspected diabetes patients and record whether they were diagnosed as healthy, chemically diabetic, or overtly diabetic. You would like to use the kNN algorithm to train a model that can predict which of these classes a new patient will belong to, so that diagnoses can be improved. This is a three-class classification problem.

Mo’tx ngiog er rtsat wjrd s lpesmi, navie zwq el building z kNN mode f znp knqr raglydual rivmpoe jr goottrhhuu vpr zktr lv vrp erthpca. Zrajt thsign tsirf—fro’z ltnalis oqr mlr package ysn vysf rj goanl wrju gro tidyverse:

install.packages("mlr", dependencies = TRUE)

library(mlr)

library(tidyverse)
Warning

Installing the mlr package could take several minutes. You only need to do this once.

3.2.1. Loading and exploring the diabetes dataset

Qwx, fxr’c qesf eaxm data bltiu nrjv opr umlstc ckgaeap, crvotne rj jnvr s libetb, zbn eorlepx rj s iltelt (rellca elmt chapter 2 rgsr z itbebl ja dxr tidyverse cwp kl ngisotr rectangular data): kvc listing 3.1. Mo skod s tbblie qwjr 145 cases unc 4 variables. Xxp class rcoaft wshos crqr 76 lk rvu cases wxxt nnx-itdibcae (Normal), 36 woto liachmyecl cdetiiba (Chemical), nps 33 wktv oytervl ciatdeib (Overt). Bkd otreh rteeh variables vst tonuuonsic seemursa kl ruv elevl xl obdol lceugos gnc iilnsun feart z colugse aeclerotn karr (glucose hns insulin, ytpevrislcee), spn kdr ydesta-tetas lvele le obldo euglcos (sspg).

Listing 3.1. Loading the diabetes data
install.packages("mclust");
data(diabetes, package = "mclust")

diabetesTib <- as_tibble(diabetes)

summary(diabetesTib)

class       glucose       insulin            sspg
Chemical:36    Min.   : 70   Min.   :  45.0   Min.   : 10.0
Normal  :76    1st Qu.: 90   1st Qu.: 352.0   1st Qu.:118.0
Overt   :33    Median : 97   Median : 403.0   Median :156.0
               Mean   :122   Mean   : 540.8   Mean   :186.1
               3rd Qu.:112   3rd Qu.: 558.0   3rd Qu.:221.0
               Max.   :353   Max.   :1568.0   Max.   :748.0

diabetesTib

# A tibble: 145 x 4
   class  glucose insulin  sspg
 * <fct>    <dbl>   <dbl> <dbl>
 1 Normal      80     356   124
 2 Normal      97     289   117
 3 Normal     105     319   143
 4 Normal      90     356   199
 5 Normal      90     323   240
 6 Normal      86     381   157
 7 Normal     100     350   221
 8 Normal      85     301   186
 9 Normal      97     379   142
10 Normal      97     296   131
# ... with 135 more rows

To show how these variables are related, they are plotted against each other in figure 3.5. The code to generate these plots is in listing 3.2.

Figure 3.5. Plotting the relationships between variables in diabetesTib. All three combinations of the continuous variables are shown, shaded by class.
Listing 3.2. Plotting the diabetes data
ggplot(diabetesTib, aes(glucose, insulin, col = class)) +
  geom_point()  +
  theme_bw()

ggplot(diabetesTib, aes(sspg, insulin, col = class)) +
  geom_point() +
  theme_bw()

ggplot(diabetesTib, aes(sspg, glucose, col = class)) +
  geom_point() +
  theme_bw()

Looking at the data, we can see there are differences in the continuous variables among the three classes, so let's build a kNN classifier that we can use to predict diabetes status from measurements of future patients.

Exercise 1

Reproduce the plot of glucose versus insulin shown in figure 3.5, but use shapes rather than colors to indicate which class each case belongs to. Once you've done this, modify your code to represent the classes using shape and color.

Our data set only consists of continuous predictor variables, but often we may be working with categorical predictor variables too. The kNN algorithm can't handle categorical variables natively; they need to first be encoded somehow, or distance metrics other than Euclidean distance must be used.

Jr’z fkzc kxtq omnrtatip lxt kNN (zqn ndms machine learning algorithms) xr lscea krq predictor variables qb idgnviid mrvu du hrtie standard deviation. Ajzp sepeesrvr vry aitlsneroihsp eebwetn qrk variables, dhr esrsune zrrb variables sreeudma ne ealrgr lsesac nxts’r egivn omxt icempontra pu xrp algorithm. Jn prk rnertcu elaempx, lj wv eivdidd rgx glucose pns insulin variables dq 1,000,000, rnvd pdiscietnor ouwdl tufk tmsoyl en dxr eulav lk xur sspg lvibarae. Mo neg’r xnbv rv lcsea drk predictors oelvrseus csbueea, gp adutelf, rgv kNN algorithm pdaperw dq krd mlr package kkah zjpr tle cy.

3.2.2. Using mlr to train your first kNN model

We understand the problem we're trying to solve (classifying new patients into one of three classes), and now we need to train the kNN algorithm to build a model that will solve that problem. Building a machine learning model with the mlr package has three main stages:

  1. Define the task. The task consists of the data and what we want to do with it. In this case, the data is diabetesTib, and we want to classify the data with the class variable as the target variable.
  2. Define the learner. The learner is simply the name of the algorithm we plan to use, along with any additional arguments the algorithm accepts.
  3. Train the model. This stage is what it sounds like: you pass the task to the learner, and the learner generates a model that you can use to make future predictions.
Tip

This may seem unnecessarily cumbersome, but splitting the task, learner, and model into different stages is very useful. It means we can define a single task and apply multiple learners to it, or define a single learner and test it with multiple different tasks.
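
For example, once the diabetes task is defined in the next section, the same task object can be passed to more than one learner. A minimal sketch (the LDA learner here is just an illustration and isn't used in this chapter):

# One task, two learners: the task is defined once and reused
knnLearner <- makeLearner("classif.knn", par.vals = list("k" = 2))
ldaLearner <- makeLearner("classif.lda")

knnModelA <- train(knnLearner, diabetesTask)
ldaModelA <- train(ldaLearner, diabetesTask)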

3.2.3. Telling mlr what we’re trying to achieve: Defining the task

Fxr’z igenb du defining pvt zvcr. Xpv poctnmoens ddeene re nieefd s crvc vtc

  • Rxg data ingncitaon krq predictor variables ( variables wk pkku ctonnia rbx aifnmionrot enddee vr cvmk iecssde/ponlirvot etg pmbreol)
  • Bxg trtega alirebav wx wrsn rx ectidrp

For supervised learning, the target variable will be categorical if we have a classification problem, and continuous if we have a regression problem. For unsupervised learning, we omit the target variable from our task definition, as we don't have access to labeled data. The components of a task are shown in figure 3.6.

Figure 3.6. Defining a task in mlr. A task definition consists of the data containing the predictor variables and, for classification and regression problems, a target variable we want to predict. For unsupervised learning, the target is omitted.

We want to build a classification model, so we use the makeClassifTask() function to define a classification task. When we build regression and clustering models in parts 3 and 5 of the book, we'll use makeRegrTask() and makeClusterTask(), respectively. We supply the name of our tibble as the data argument and the name of the factor that contains the class labels as the target argument:

diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
Note

You may notice a warning message from mlr when you build the task, stating that your data is not a pure data.frame (it's a tibble). This isn't a problem, because the function will convert the tibble into a data.frame for you.

If we call the task, we can see it's a classification task on the diabetesTib tibble, whose target is the class variable. We also get some information about the number of observations and the number of different types of variables (often called features in machine learning lingo). Some additional information includes whether we have missing data, the number of observations in each class, and which class is considered to be the "positive" class (only relevant for two-class tasks):

diabetesTask

Supervised task: diabetesTib
Type: classif
Target: class
Observations: 145
Features:
   numerics     factors     ordered functionals
          3           0           0           0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 3
Chemical   Normal    Overt
      36       76       33
Positive class: NA

3.2.4. Telling mlr which algorithm to use: Defining the learner

Kvvr, vrf’z fdneie txg rlenear. Xoq ncnoopmste eneedd rk efdien z anlrree tzk za ooslwfl:

  • The class of algorithm we are using:
    • "classif." for classification
    • "regr." for regression
    • "cluster." for clustering
    • "surv." and "multilabel." for predicting survival and multilabel classification, which I won't discuss
  • The name of the algorithm we are using
  • Any additional options we may wish to use to control the algorithm

Ba hvp’ff kck, rxy rftsi snb nscedo eonpsonctm toc neomidcb tereothg nj c nslieg cecraahrt etnamgur kr deeifn cwhhi algorithm wjff xg gzxp (vtl meelapx, "classif.knn"). Rpo eptcmosonn el c eenralr zot nsowh nj figure 3.7.

Figure 3.7. Defining a learner in mlr. A learner definition consists of the class of algorithm you want to use, the name of the individual algorithm, and, optionally, any additional arguments to control the algorithm’s behavior.

We use the makeLearner() function to define a learner. The first argument to the makeLearner() function is the algorithm that we're going to use to train our model. In this case, we want to use the kNN algorithm, so we supply "classif.knn" as the argument. See how this is the class ("classif.) joined to the name (knn") of the algorithm?

The argument par.vals stands for parameter values, which allows us to specify the number of k-nearest neighbors we want the algorithm to use. For now, we'll just set this to 2, but we'll discuss how to choose k soon:

knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
How to list all of mlr’s algorithms

The mlr package has a large number of machine learning algorithms that we can give to the makeLearner() function, more than I can remember without checking! To list all the available learners, simply use

listLearners()$class

Or list them by function:

listLearners("classif")$class
listLearners("regr")$class
listLearners("cluster")$class

Jl vhq’tk vkkt neuusr hwihc algorithms ckt avaiaellb er kyu et hhcwi agmrnetu re aycc kr makeLearner() ltx c atlpruaric algorithm, aoh eseht functions rx emrdni ruoslfye.

3.2.5. Putting it all together: Training the model

Qwx qrzr wo’kk defined kty rzvz yns ptx erarnel, vw cnz vwn trina teq mode f. Xuv ntmponocse dendee er nirta z mode f tco rxd raereln nbc zrxz wv defined eareirl. Buk loewh soerpcs vl defining rkd srzo gcn elraren nbs bnniicgom rkgm rk tarni uor mode f jc wohsn jn figure 3.8.

Figure 3.8. Training a model in mlr. Training a model simply consists of combining a learner with a task.

This is achieved with the train() function, which takes the learner as the first argument and the task as its second argument:

knnModel <- train(knn, diabetesTask)

We have our model, so let's pass the data through it to see how it performs. The predict() function takes unlabeled data and passes it through the model to get the predicted classes. The first argument is the model, and the data being passed to it is given as the newdata argument:

knnPred <- predict(knnModel, newdata = diabetesTib)

We can pass these predictions as the first argument of the performance() function. This function compares the classes predicted by the model to the true classes, and returns performance metrics of how well the predicted and true values match each other. Use of the predict() and performance() functions is illustrated in figure 3.9.

Figure 3.9. A summary of the predict() and performance() functions of mlr. predict() passes observations into a model and outputs the predicted values. performance() compares these predicted values to the cases’ true values and outputs one or more performance metrics summarizing the similarity between the two.

We specify which performance metrics we want the function to return by supplying them as a list to the measures argument. The two measures I've asked for are mmce, the mean misclassification error; and acc, or accuracy. MMCE is simply the proportion of cases classified as a class other than their true class. Accuracy is the opposite of this: the proportion of cases that were correctly classified by the model. You can see that the two sum to 1.00:

performance(knnPred, measures = list(mmce, acc))

      mmce        acc
0.04827586 0.95172414
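
If you want to convince yourself what these two measures mean, you can compute them by hand from the prediction object; a minimal sketch (not from the book) using the $data component that mlr stores in every prediction:

# mmce is the proportion of mismatches; acc is the proportion of matches
mean(knnPred$data$response != knnPred$data$truth)   # mean misclassification error
mean(knnPred$data$response == knnPred$data$truth)   # accuracy; the two sum to 1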

So our model is correctly classifying 95.2% of cases! Does this mean it will perform well on new, unseen patients? The truth is that we don't know. Evaluating model performance by asking it to make predictions on data you used to train it in the first place tells you very little about how the model will perform when making predictions on completely unseen data. Therefore, you should never evaluate model performance this way. Before we discuss why, I want to introduce an important concept called the bias-variance trade-off.

3.3. Balancing two sources of model error: The bias-variance trade-off

There is a concept in machine learning that is so important, and misunderstood by so many people, that I want to take the time to explain it well: the bias-variance trade-off. Let's start with an example. A colleague sends you data about emails your company has received and asks you to build a model to classify incoming emails as junk or not junk (this is, of course, a classification problem). The data set has 30 variables consisting of observations like the number of characters in the email, the presence of URLs, and the number of email addresses it was sent to, in addition to whether the email was junk or not.

You lazily build a classification model using only four of the predictor variables (because it's nearly lunch and they're serving katsu curry today). You send the model to your colleague, who implements it as the company's junk filter.

A week later, your colleague comes back to you, complaining that the junk filter is performing badly and is consistently misclassifying certain types of emails. You pass the data you used to train the model back into the model, and find it correctly classifies only 60% of the emails. You decide that you may have underfitted the data: in other words, your model was too simple and was biased toward misclassifying certain types of emails.

You go back to the data, and this time you include all 30 variables as predictors in your model. You pass the data back through your model and find that it correctly classifies 98% of the emails: an improvement, surely! You send this second model to your colleague and tell them you are certain it's better. Another week goes by, and again, your colleague comes to you and complains that the model is performing badly: it's misclassifying many emails, and in a somewhat unpredictable manner. You decide that you have overfitted the data: in other words, your model was too complex and is modeling noise in the data that you used to train it. Now, when you give new datasets to the model, there is a lot of variance in the predictions it gives. A model that is overfitted will perform well on the data used to train it, but poorly on new data.

Underfitting and overfitting are two important sources of error in model building. In underfitting, we have included too few predictors or too simple a model to adequately describe the relationships/patterns in the data. The result is a model that is said to be biased: a model that performs poorly on both the data we use to train it and on new data.

Note

Because we typically like to explain as much variation in our data as possible, and because we often have many more variables than are important for our problem, underfitting is less frequently a problem than overfitting.

Overfitting is the opposite of underfitting and describes the situation where we include too many predictors or too complex a model, such that we are modeling not only the relationships/patterns in our data, but also the noise. Noise in a data set is variation that is not systematically related to the variables we have measured, but rather is due to inherent variability and/or error in the measurement of our variables. The pattern of noise is very specific to an individual data set, so if we start to model the noise, our model may perform very well on the data we trained it on but give quite variable results for future datasets.

Underfitting and overfitting both introduce error and reduce the generalizability of the model: the ability of the model to generalize to future, unseen data. They are also opposed to each other: somewhere between a model that underfits and has bias, and a model that overfits and has variance, is an optimal model that balances the bias-variance trade-off; see figure 3.10.

Figure 3.10. The bias-variance trade-off. Generalization error is the proportion of erroneous predictions a model makes and is a result of overfitting and underfitting. The error associated with overfitting (too complex a model) is variance. The error associated with underfitting (too simple a model) is bias. An optimal model balances this trade-off.

Now, look at figure 3.11. Can you see that the underfit model poorly represents the patterns in the data, and the overfit model is too granular and models noise in the data instead of the real patterns?

Figure 3.11. Examples of underfitting, optimal fitting, and overfitting for a two-class classification problem. The dotted line represents a decision boundary.

In the case of our kNN algorithm, selecting a small value of k (where only a small number of very similar cases are included in the vote) is more likely to model the noise in our data, resulting in a more complex model that is overfit and will produce a lot of variance when we use it to classify future patients. In contrast, selecting a large value of k (where more neighbors are included in the vote) is more likely to miss local differences in our data, resulting in a less complex model that is underfit and is biased toward misclassifying certain types of patients. I promise you'll learn how to select k soon!

So the question you're probably asking now is, "How do I tell if I'm under- or overfitting?" The answer is a technique called cross-validation.


3.4. Using cross-validation to tell if we’re overfitting or underfitting

In the email example, once you had trained the second, overfit model, you tried to evaluate its performance by seeing how well it classified data you had used to train it. I mentioned that this is an extremely bad idea, and here is why: a model will almost always perform better on the data you trained it with than on new, unseen data. You can build a model that is extremely overfit, modeling all of the noise in the data set, and you would never know, because passing the data back through the model gives you good predictive accuracy.

The answer is to evaluate the performance of your model on data it hasn't seen yet. One way you could do this would be to train the model on all of the data available to you and then, over the next weeks and months, as you collect new data, pass it through your model and evaluate how the model performs. This approach is very slow and inefficient, and could make model building take years!

Instead, we typically split our data in two. We use one portion to train the model: this portion is called the training set. We use the remaining portion, which the algorithm never sees during training, to test the model: this portion is the test set. We then evaluate how close the model's predictions on the test set are to their true values. We summarize the closeness of these predictions with performance metrics (such as the MMCE and accuracy we met in section 3.2). Measuring how well the trained model performs on the test set helps us determine whether our model will perform well on unseen data, or whether we need to improve it further.

This process is called cross-validation (CV), and it is an extremely important approach in any supervised machine learning pipeline. Once we have cross-validated our model and are happy with its performance, we then use all the data we have (including the data in the test set) to train the final model (because typically, the more data we train our model with, the less bias it will have).

There are three common cross-validation approaches:

  • Holdout cross-validation
  • K-fold cross-validation
  • Leave-one-out cross-validation

3.5. Cross-validating our kNN model

Fkr’z trtas gd edignrinm uoleservs lk rkp racv sun nerealr kw ecdtare alrriee:

diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")

knn <- makeLearner("classif.knn", par.vals = list("k" = 2))

Great! Before we train the final model on all the data, let's cross-validate the learner. Ordinarily, you would decide on a CV strategy most appropriate for your data; but for the purposes of demonstration, I'm going to show you holdout, k-fold, and leave-one-out CV.

3.5.1. Holdout cross-validation

Holdout CV is the simplest method to understand: you simply "hold out" a random proportion of your data as your test set, and train your model on the remaining data. You then pass the test set through the model and calculate its performance metrics (we'll talk about these soon). You can see a scheme of holdout CV in figure 3.12.

Figure 3.12. Holdout CV. The data is randomly split into a training set and test set. The training set is used to train the model, which is then used to make predictions on the test set. The similarity of the predictions to the true values of the test set is used to evaluate model performance.

When following this approach, you need to decide what proportion of the data to use as the test set. The larger the test set is, the smaller your training set will be. Here's the confusing part: performance estimation by CV is also subject to error and the bias-variance trade-off. If your test set is too small, then the estimate of performance is going to have high variance; but if the training set is too small, then the estimate of performance is going to have high bias. A commonly used split is to use two-thirds of the data for training and the remaining one-third as a test set, but this depends on the number of cases in the data, among other things.
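
Under the hood, a holdout split is just a random partition of the row indices. Here is a minimal sketch of a two-thirds/one-third split (for illustration only; mlr makes the split for us):

# Randomly assign two-thirds of the rows to the training set; the rest form the test set
set.seed(123)   # only so the split is reproducible
trainIndex <- sample(nrow(diabetesTib), size = round(2/3 * nrow(diabetesTib)))
trainingSet <- diabetesTib[trainIndex, ]
testSet     <- diabetesTib[-trainIndex, ]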

Making a holdout resampling description

The first step when employing any CV in mlr is to make a resampling description, which is simply a set of instructions for how the data will be split into test and training sets. The first argument to the makeResampleDesc() function is the CV method we're going to use: in this case, "Holdout". For holdout CV, we need to tell the function what proportion of the data will be used as the training set, so we supply this to the split argument:

holdout <- makeResampleDesc(method = "Holdout", split = 2/3,
                            stratify = TRUE)

J’oe ldciuned zn dndtliaaio, pnoaiotl tganrume, stratify = TRUE. Jr cxsc rvd onicufnt rv nrsuee rgcr wnkb rj stsilp grk data rknj training ynz test set c, rj rites rk itmaanin yor oonoptprir lk soqc clsas lx iantept nj zzxu ark. Yyja ja atrpomint nj classification mrbolpse vjfv xhat, wereh opr sugopr vtc btvk eculnbdaan (ow sevu tkmk eyhhatl naeptist zunr gdrx hrtoe psgour diobcnme) suaebce, etrwhosei, ow odluc pvr s test set jwrq hvte lwx lx xnk le tyk alselmr classes.

Performing holdout CV

Qwe rzru wo’xx defined ebw wo’tv oggni rx rssoc-lditaave thv nralere, wx zns nht grv YE nusgi rgv resample() incfount. Mk ysplup rpk nrealer cnp arsx grsr wo ceadetr, hsn rdo lagnpisrem mehodt vw defined z momten cqv, xr urk resample() ncntfiou. Mk ecaf czo rj rk evpj hz aesumrse le WWBZ nzg accuracy:

holdoutCV <- resample(learner = knn, task = diabetesTask,
                      resampling = holdout, measures = list(mmce, acc))

Ayo resample() infnotcu nisrtp krg opemcfanrer musarees ywnx xdh nth rj, yrq pdx nzz ccesas rxbm hy itxngrctae pkr $aggr onnopemtc xmtl krb resampling btjoec:

holdoutCV$aggr

mmce.test.mean  acc.test.mean
     0.1020408      0.8979592

You’ll notice two things:

  • The accuracy of the model as estimated by holdout cross-validation is less than when we evaluated its performance on the data we used to train the full model. This exemplifies my point earlier that models will perform better on the data that trained them than on unseen data.
  • Your performance metrics will probably be different than mine. In fact, run the resample() function over and over again, and you'll get a somewhat different result each time! The reason for this variance is that the data is randomly split into the test and training sets. Sometimes the split is such that the model performs well on the test set; sometimes the split is such that it performs poorly. A short sketch after this list shows how to quantify this variance.
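
To see this variance for yourself, rerun the holdout CV a handful of times and look at the spread of the accuracy estimates; a minimal sketch (not from the book):

# Each run makes a new random split, so each run gives a different accuracy estimate
holdoutAccs <- replicate(10, resample(learner = knn, task = diabetesTask,
                                      resampling = holdout, measures = acc,
                                      show.info = FALSE)$aggr)
range(holdoutAccs)   # the width of this range illustrates holdout CV's variance
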
Exercise 2

Use the makeResampleDesc() function to create another holdout resampling description that uses 10% of the data as the test set and does not use stratified sampling (don't overwrite your existing resampling description).

Calculating a confusion matrix

To get a better idea of which groups are being correctly classified and which are being misclassified, we can construct a confusion matrix. A confusion matrix is simply a tabular representation of the true and predicted class of each case in the test set.

With mlr, we can calculate the confusion matrix using the calculateConfusionMatrix() function. The first argument is the $pred component of our holdoutCV object, which contains the true and predicted classes of the test set. The optional argument relative asks the function to show the proportion of each class in the true and predicted class labels:

calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)

Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-
  Chemical 0.92/0.73 0.08/0.04 0.00/0.00 0.08
  Normal   0.12/0.20 0.88/0.96 0.00/0.00 0.12
  Overt    0.09/0.07 0.00/0.00 0.91/1.00 0.09
  -err.-        0.27      0.04      0.00 0.10


Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical       11      1     0      1
  Normal          3     23     0      3
  Overt           1      0    10      1
  -err.-          4      1     0      5

The absolute confusion matrix is easier to interpret. The rows show the true class labels, and the columns show the predicted labels. The numbers represent the number of cases in every combination of true class and predicted class. For example, in this matrix, 11 patients were correctly classified as chemically diabetic, but one was erroneously classified as healthy. Correctly classified patients are found on the diagonal of the matrix (where true class == predicted class).

The relative confusion matrix looks a little more intimidating, but the principle is the same. This time, instead of the number of cases for each combination of true class and predicted class, we have the proportion. The number before the / is the proportion of the row in this column, and the number after the / is the proportion of the column in this row. For example, in this matrix, 92% of chemically diabetic patients were correctly classified, while 8% were misclassified as healthy. (Do you see that these are the proportions for the numbers I quoted from the absolute confusion matrix?)
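
If the relative matrix is confusing, you can reproduce its row proportions yourself from the absolute counts using base R's prop.table(); a minimal sketch (the counts are the ones from the absolute matrix above):

# Rebuild the row proportions of the relative confusion matrix by hand
absoluteCounts <- matrix(c(11, 1, 0,
                           3, 23, 0,
                           1, 0, 10),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(true = c("Chemical", "Normal", "Overt"),
                                         predicted = c("Chemical", "Normal", "Overt")))
round(prop.table(absoluteCounts, margin = 1), 2)   # each row now sums to 1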

Confusion matrices help us understand which classes our model classifies well and which ones it does worse at classifying. For example, based on this confusion matrix, it looks like our model struggles to distinguish healthy patients from chemically diabetic ones.

Note

Does your confusion matrix look different than mine? Of course it does! The confusion matrix is based on the predictions made on the test set; and because the test set is selected at random in holdout CV, the confusion matrix will change every time you rerun the CV.

As the performance metrics reported by holdout CV depend so heavily on how much of the data we use as the training and test sets, I try to avoid it unless my model is very expensive to train, and I generally prefer k-fold CV. The only real benefit of holdout is that it is computationally less expensive than the other forms of CV, which can make it the only viable CV method for computationally expensive algorithms. But the purpose of CV is to get as accurate an estimate of model performance as possible, and holdout CV may give you very different results each time you apply it, because not all of the data is used in the training set and test set. This is where the other forms of CV come in.

3.5.2. K-fold cross-validation

In k-fold CV, we randomly split the data into approximately equal-sized chunks called folds. Then we reserve one of the folds as a test set and use the remaining data as the training set (just like in holdout). We pass the test set through the model and make a record of the relevant performance metrics. Now, we use a different fold of the data as our test set and do the same thing. We continue until all the folds have been used once as the test set. We then take an average of the performance metric as an estimate of model performance. You can see a scheme of k-fold CV in figure 3.13.

Figure 3.13. K-fold CV. The data is randomly split into near equally sized folds. Each fold is used as the test set once, with the rest of the data used as the training set. The similarity of the predictions to the true values of the test set is used to evaluate model performance.
Note

Jr’z patirmnot rk vrxn grrs sozg ccav jn rqo data rspepaa jn xyr test set hnef vanx nj rpaj pcrerudeo.

This approach will typically give a more accurate estimate of model performance because every case appears in the test set once, and we are averaging the estimates over many runs. But we can improve this a little by using repeated k-fold CV, where, after the previous procedure, we shuffle the data around and perform it again.

For example, a commonly chosen value of k for k-fold is 10. Again, this depends on the size of the data, among other things, but it is a reasonable value for many datasets. This means we split the data into 10 nearly equal-sized chunks and perform the CV. If we repeat this procedure 5 times, then we have 10-fold CV repeated 5 times (this is not the same as 50-fold CV), and the estimate of model performance will be the average of 50 different runs.

Therefore, if you have the computational power, it is usually preferable to use repeated k-fold CV instead of ordinary k-fold. This is what we'll be using in many examples in this book.

Performing k-fold CV

We perform k-fold CV in the same way as holdout. This time, when we make our resampling description, we tell it we're going to use repeated k-fold cross-validation ("RepCV"), and we tell it how many folds we want to split the data into. The default number of folds is 10, which is often a good choice, but I want to show you how you can explicitly control the splits. Next, we tell the function that we want to repeat the 10-fold CV 50 times with the reps argument. This gives us 500 performance measures to average across! Again, we ask for the classes to be stratified among the folds:

kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

kFoldCV <- resample(learner = knn, task = diabetesTask,
                    resampling = kFold, measures = list(mmce, acc))

Now let’s extract the average performance measures:

kFoldCV$aggr

mmce.test.mean  acc.test.mean
     0.1022788      0.8977212

The model correctly classified 89.8% of cases on average—much lower than when we predicted the data we used to train the model! Rerun the resample() function a few times, and compare the average accuracy after each run. The estimate is much more stable than when we repeated holdout CV.

Tip

Mx’tk lulysau qfnx renedtiest jn ryv avgaree reoparnmfec uressema, rdu ggk znz csscae rvg rfcnroemepa seaeumr mtxl ryeve tneirtaio dp ginnrun kFoldCV$measures.test.

Choosing the number of repeats

Your goal when cross-validating a model is to get as accurate and stable an estimate of model performance as possible. Broadly speaking, the more repeats you can do, the more accurate and stable these estimates will become. At some point, though, having more repeats won't improve the accuracy or stability of the performance estimate.

So how do you decide how many repeats to perform? A sound approach is to choose a number of repeats that is computationally reasonable, run the process a few times, and see if the average performance estimate varies a lot. If not, great. If it does vary a lot, you should increase the number of repeats.

Exercise 3

Define two new resampling descriptions: one that performs 3-fold CV repeated 5 times, and one that performs 3-fold CV repeated 500 times (don't overwrite your existing descriptions). Use the resample() function to cross-validate the kNN algorithm using both of these resampling descriptions. Repeat the resampling five times for each method, and see which one gives more stable results.

Calculating a confusion matrix

Gwe, for’c bldui kbr confusion matrix dseba vn pkr repeated k-fold CV:

calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)

Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-
  Chemical 0.81/0.78 0.10/0.05 0.09/0.10 0.19
  Normal   0.04/0.07 0.96/0.95 0.00/0.00 0.04
  Overt    0.16/0.14 0.00/0.00 0.84/0.90 0.16
  -err.-        0.22      0.05      0.10 0.10

 
Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical     1463    179   158    337
  Normal        136   3664     0    136
  Overt         269      0  1381    269
  -err.-        405    179   158    742
Note

Notice that the number of cases is much larger. This is because we repeated the procedure 50 times.

3.5.3. Leave-one-out cross-validation

Leave-one-out CV can be thought of as the extreme of k-fold CV: instead of breaking the data into folds, we reserve a single observation as a test case, train the model on the whole of the rest of the data, and then pass the test case through it and record the relevant performance metrics. Next, we do the same thing but select a different observation as the test case. We continue doing this until every observation has been used once as the test case, and then we take the average of the performance metrics. You can see a scheme of leave-one-out CV in figure 3.14.

Because the test set is only a single observation, leave-one-out CV tends to give quite variable estimates of model performance (because the performance estimate of each iteration depends on correctly labeling that single test case). But it can give less-variable estimates of model performance than k-fold when your data set is small. When you have a small data set, splitting it into k folds will leave you with a very small training set. The variance of a model trained on a small data set tends to be higher because it will be more influenced by sampling error/unusual cases. Therefore, leave-one-out CV is useful for small datasets where splitting into k folds would give variable results. It is also computationally less expensive than repeated k-fold CV.

Figure 3.14. Leave-one-out CV is the extreme of k-fold, where we reserve a single case as the test set and train the model on the remaining data. The similarity of the predictions to the true values of the test set is used to evaluate model performance.
Note

A supervised learning model that has not been cross-validated is virtually useless, because you have no idea whether the predictions it makes on new data will be accurate or not.

Performing leave-one-out CV

Creating a resampling description for leave-one-out is just as simple as for holdout and k-fold CV. We specify leave-one-out CV when making the resample description by supplying LOO as the argument to the method. Because the test set is only a single case, we obviously can't stratify with leave-one-out. Also, because each case is used once as the test set, with all the other data used as the training set, there's no need to repeat the procedure:

LOO <- makeResampleDesc(method = "LOO")
Exercise 4

Try to create two new leave-one-out resampling descriptions: one that uses stratified sampling, and one that repeats the procedure five times. What happens?

Dwe, rfx’z tnh xdr YF nqz dvr kqr eaevarg cerrneofpam esmreaus:

LOOCV <- resample(learner = knn, task = diabetesTask, resampling = LOO,
                  measures = list(mmce, acc))

LOOCV$aggr

mmce.test.mean  acc.test.mean
     0.1172414      0.8827586

If you rerun the CV over and over again, you'll find that for this model and data, the performance estimate is more variable than for k-fold but less variable than for the holdout CV we ran earlier.

Calculating a confusion matrix

Once again, let’s look at the confusion matrix:

calculateConfusionMatrix(LOOCV$pred, relative = TRUE)

Relative confusion matrix (normalized by row/column):
          predicted
true       Chemical  Normal    Overt     -err.-
  Chemical 0.81/0.74 0.14/0.06 0.06/0.07 0.19
  Normal   0.05/0.10 0.95/0.94 0.00/0.00 0.05
  Overt    0.18/0.15 0.00/0.00 0.82/0.93 0.18
  -err.-        0.26      0.06      0.07 0.12


Absolute confusion matrix:
          predicted
true       Chemical Normal Overt -err.-
  Chemical       29      5     2      7
  Normal          4     72     0      4
  Overt           6      0    27      6
  -err.-         10      5     2     17

So you now know how to apply three commonly used types of cross-validation! If we've cross-validated our model and are happy that it will perform well enough on unseen data, then we would train the model on all of the data available to us, and use this to make future predictions.

But I think we can still improve our kNN model. Remember how earlier, we manually chose a value of 2 for k? Well, arbitrarily picking a value of k isn't very clever, and there are much better ways we can find the optimal value.


3.6. What algorithms can learn, and what they must be told: Parameters and hyperparameters

Machine learning models often have parameters associated with them. A parameter is a variable or value that is estimated from the data, is internal to the model, and controls how it makes predictions on new data. An example of a model parameter is the slope of a regression line.

In the kNN algorithm, k is not a parameter, because the algorithm doesn't estimate it from the data (in fact, the kNN algorithm doesn't actually learn any parameters). Instead, k is what's known as a hyperparameter: a variable or option that controls how a model makes predictions but is not estimated from the data. As data scientists, we don't have to provide parameters to our models; we simply provide the data, and the algorithms learn the parameters for themselves. We do, however, need to provide whatever hyperparameters they require. You'll see throughout this book that different algorithms require and use different hyperparameters to control how they learn their models.
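
To make the distinction concrete, here is a minimal sketch (made-up numbers, not the diabetes data): the slope and intercept of a regression line are parameters estimated from the data, whereas k for kNN is a hyperparameter we supply ourselves.

# Parameters are learned from the data: lm() estimates the intercept and slope
simpleData <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
coef(lm(y ~ x, data = simpleData))

# A hyperparameter is supplied by us: the algorithm never estimates k itself
knnLearnerK5 <- makeLearner("classif.knn", par.vals = list("k" = 5))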

So because k is a hyperparameter of the kNN algorithm, it can't be estimated by the algorithm itself, and it's up to us to choose a value. How do we decide? Well, there are three ways you can choose k or, in fact, any hyperparameter:

  • Pick a “sensible” or default value that has worked on similar problems before. This option is a bad idea. You have no way of knowing whether the value of k you've chosen is the best one. Just because a value worked on other datasets doesn't mean it will perform well on this data set. This is the choice of the lazy data scientist who doesn't care much about getting the most from their data.
  • Manually try a few different values, and see which one gives you the best performance. This option is a bit better. The idea here is that you pick a few sensible values of k, build a model with each of them, and see which model performs best. This is better because you're more likely to find the best-performing value of k, but you're still not guaranteed to find it, and doing this manually could be tedious and slow. This is the choice of the data scientist who cares but doesn't really know what they're doing.
  • Use a procedure called hyperparameter tuning to automate the selection process. This solution is the best. It maximizes the likelihood of you finding the best-performing value of k while also automating the process for you. This is the method we'll be using throughout the book.
Note

While the third option is generally the best if possible, some algorithms are so computationally expensive that they prohibit extensive hyperparameter tuning, in which case you may have to settle for manually trying different values.

But how does changing the value of k impact model performance? Well, values of k that are too low may start to model noise in the data. For example, if we set k = 1, then a healthy patient could be misclassified as chemically diabetic just because a single chemically diabetic patient with an unusually low insulin level was their nearest neighbor. In this situation, instead of just modeling the systematic differences between the classes, we're also modeling the noise and unpredictable variability in the data.

On the other hand, if we set k too high, a large number of dissimilar patients will be included in the vote, and the model will be insensitive to local differences in the data. This is, of course, the bias-variance trade-off we talked about earlier.


3.7. Tuning k to improve the model

Pvr’z apypl hremeytarapper tuning xr itpzimoe gkr evaul kl k vtl gkt mode f. Xn orhapacp xw could ofllow dwoul kd rv biuld models dwjr defrfenti vlesau lk k gnuis dkt lfbf data ark, bcza drk data sxaq hgthrou ryk mode f, zun ckx ihhwc vealu lv k igvse bz rxg xurz nprmeaerfoc. Ydjc jz zhu crpcaiet, uacbese reeth’z s rgael eccnha vw’ff kry c luaev kl k rgrc etvosifr xgr data orz wx ndtue rj nx. Sx zkon giaan, xw btkf en TZ er godf ba gdura aasgint overfitting.

The first thing we need to do is define a range of values over which mlr will search when tuning k:

knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

The makeDiscreteParam() function inside the makeParamSet() function allows us to specify that the hyperparameter we're going to be tuning is k, and that we want to search the values between 1 and 10 for the best value of k. As its name suggests, makeDiscreteParam() is used to define discrete hyperparameter values, such as k in kNN, but there are also functions to define continuous and logical hyperparameters that we'll explore later in the book. The makeParamSet() function defines the hyperparameter space we specified as a parameter set, and if we wanted to tune more than one hyperparameter during tuning, we would simply separate them by commas inside this function.

Next, we define how we want mlr to search the parameter space. There are a few options for this, and in later chapters we'll explore others, but for now we're going to use the grid search method. This is probably the simplest method: it tries every single value in the parameter space when looking for the best-performing value. For tuning continuous hyperparameters, or when we are tuning several hyperparameters at once, grid search becomes prohibitively expensive, so other methods like random search are preferred:

gridSearch <- makeTuneControlGrid()

Next, we define how we're going to cross-validate the tuning procedure, and we're going to use my favorite: repeated k-fold CV. The principle here is that for every value in the parameter space (integers 1 to 10), we perform repeated k-fold CV. For each value of k, we take the average performance measure across all those iterations and compare it with the average performance measures for all the other values of k we tried. This will hopefully give us the value of k that performs best:

cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

Now, we call the tuneParams() function to perform the tuning:

tunedK <- tuneParams("classif.knn", task = diabetesTask,
                     resampling = cvForTuning,
                     par.set = knnParamSpace, control = gridSearch)

The first and second arguments are the names of the algorithm and task we're applying, respectively. We give our CV strategy as the resampling argument, the hyperparameter space we defined as the par.set argument, and the search procedure to the control argument.

Jl xw fsfs tey tunedK bjotec, ow brx uor hvrz- performing uleva lv k, 7, zgn drv ragavee WWRL lueav lkt rzpr eavlu. Mx nzz sacsce qxr vryz- performing veaul kl k declryti gd selecting rxy $x mennopoct:

tunedK

Tune result:
Op. pars: k=7
mmce.test.mean=0.0769524

tunedK$x
$k
[1] 7

We can also visualize the tuning process (the result of this code is shown in figure 3.15):

knnTuningData <- generateHyperParsEffectData(tunedK)

plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
  theme_bw()

Dwv wv anz tanri vty ailnf mode f, ngsui etg tdune uelav el k:

tunedKnn <- setHyperPars(makeLearner("classif.knn"),
                         par.vals = tunedK$x)

tunedKnnModel <- train(tunedKnn, diabetesTask)
Figure 3.15. The MMCE values from fitting the kNN model with different values of k during a grid search

Ccjb zj ac pmeisl zs wgaprnip kyr makeLearner() ofutinnc, eehrw wk oemc c wkn kNN arneler, diines uor setHyperPars() ufntnico, znh pionrdvgi opr dneut lveua lv k cc rux par.vals ruamngte. Mk nurx nitra dte fialn mode f zc reoebf, nuigs vrq train() nuncoift.

3.7.1. Including hyperparameter tuning in cross-validation

Now, when we perform some kind of preprocessing on our data or model, such as tuning hyperparameters, it's important to include this preprocessing inside our CV, so that we cross-validate the whole model-training procedure. This takes the form of nested CV, where an inner loop cross-validates different values of our hyperparameter (just as we did earlier), and then the winning hyperparameter value gets passed to an outer CV loop. In the outer CV loop, the winning hyperparameters are used for each fold.

Nested CV proceeds like this:

  1. Split the data into training and test sets (this can be done using the holdout, k-fold, or leave-one-out method). This division is called the outer loop.
  2. The training set is used to cross-validate each value of our hyperparameter search space (using whatever method we decide). This is called the inner loop.
  3. The hyperparameter that gives the best cross-validated performance from each inner loop is passed to the outer loop.
  4. A model is trained on each training set of the outer loop, using the best hyperparameter from its inner loop. These models are used to make predictions on their test sets.
  5. The average performance metrics of these models across the outer loop are then reported as an estimate of how the model will perform on unseen data.

If you prefer a graphical explanation, take a look at figure 3.16.

Figure 3.16. Nested CV. The dataset is split into folds. For each fold, the training set is used to create sets of inner k-fold CV. Each of these inner sets cross-validates a single hyperparameter value by splitting the data into training and test sets. For each fold in these inner sets, a model is trained using the training set and evaluated on the test set, using that set’s hyperparameter value. The hyperparameter from each inner CV loop that gives the best-performing model is used to train the models on the outer loop.

In the example in figure 3.16, the outer loop is 3-fold CV. For each fold, inner sets of 4-fold CV are applied, using only the training set from the outer loop. This 4-fold cross-validation is used to evaluate the performance of each hyperparameter value we're searching over. The winning value of k (the one that gives the best performance) is then passed to the outer loop, where it is used to train the model, and the model's performance is evaluated on the test set. Can you see that we're cross-validating the whole model-building process, including hyperparameter tuning?

What's the purpose of this? It validates our entire model-building procedure, including the hyperparameter-tuning step. The cross-validated performance estimate we get from this procedure should be a good representation of how we expect our model to perform on completely new, unseen data.

This process looks pretty complicated, but it is extremely easy to perform with mlr. First, we define how we're going to perform the inner and outer CV:

inner <- makeResampleDesc("CV")

outer <- makeResampleDesc("RepCV", folds = 10, reps = 5)

I've chosen to perform ordinary k-fold cross-validation for the inner loop (10 is the default number of folds) and 10-fold CV, repeated 5 times, for the outer loop.
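
Nested CV trains a lot of models, so these settings can take a while on a modest machine. If you just want a quicker (though noisier) estimate while experimenting, you could shrink both loops; the innerQuick and outerQuick objects below are my own illustrative names, not ones used elsewhere in this chapter:

innerQuick <- makeResampleDesc("CV", iters = 5)               # 5-fold inner loop instead of the default 10

outerQuick <- makeResampleDesc("RepCV", folds = 10, reps = 2) # fewer outer-loop repeats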

Next, we make what's called a wrapper, which is basically a learner tied to some preprocessing step. In our case, this is hyperparameter tuning, so we create a tuning wrapper with makeTuneWrapper():

knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
                              par.set = knnParamSpace,
                              control = gridSearch)

Here, we supply the algorithm as the first argument and pass our inner CV procedure as the resampling argument. We supply our hyperparameter search space as the par.set argument and our gridSearch method as the control argument (remember that we created these two objects earlier). This "wraps" together the learning algorithm and the hyperparameter tuning procedure that will be applied inside the inner CV loop.
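
A wrapped learner behaves like any other mlr learner. For example, training it directly on the task runs the inner tuning on that data and then fits a model using the winning value of k. I show this only to illustrate what the wrapper does; it isn't a substitute for the nested CV we run next, and wrappedModel is just an illustrative name:

wrappedModel <- train(knnWrapper, diabetesTask)   # tunes k internally, then trains on the whole task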

Now that we've defined our inner and outer CV strategies and our tuning wrapper, we run the nested CV procedure:

cvWithTuning <- resample(knnWrapper, diabetesTask, resampling = outer)

The first argument is the wrapper we created a moment ago, the second argument is the name of the task, and we supply our outer CV strategy as the resampling argument. Now sit back and relax: this could take a while!
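
Because nested CV fits so many models, it can also be parallelized. mlr integrates with the parallelMap package, so one option (assuming parallelMap is installed; four CPUs is just an example) is to start a parallel backend before calling resample() and stop it afterward:

library(parallelMap)

parallelStartSocket(cpus = 4)                     # start 4 worker processes

cvWithTuning <- resample(knnWrapper, diabetesTask, resampling = outer)

parallelStop()                                    # shut the workers down again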

Once it finishes, you can print the average MMCE:

cvWithTuning

Resample Result
Task: diabetesTib
Learner: classif.knn.tuned
Aggr perf: mmce.test.mean=0.0856190
Runtime: 42.9978

Your MMCE value will probably be a little different than mine due to the random nature of the validation procedure, but the model is estimated to correctly classify 91.4% of cases on unseen data. That's not bad, and now that we've cross-validated our model properly, we can be confident we're not overfitting our data.
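
The printed result shows only the aggregated MMCE. If you're curious how much the error varied across the outer-loop iterations, you can inspect the components of the resample result directly (a quick check rather than a required step):

cvWithTuning$measures.test   # MMCE for each outer-loop iteration

cvWithTuning$aggr            # the aggregated (mean) MMCE shown in the printout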

3.7.2. Using our model to make predictions

We have our model, and we're free to use it to classify new patients! Let's imagine that some new patients come to the clinic:

newDiabetesPatients <- tibble(glucose = c(82, 108, 300),
                              insulin = c(361, 288, 1052),
                              sspg = c(200, 186, 135))

newDiabetesPatients

# A tibble: 3 x 3
  glucose insulin  sspg
    <dbl>   <dbl> <dbl>
1      82     361   200
2     108     288   186
3     300    1052   135

We can pass these patients into our model and get their predicted diabetes status:

newPatientsPred <- predict(tunedKnnModel, newdata = newDiabetesPatients)

getPredictionResponse(newPatientsPred)

[1] Normal Normal Overt
Levels: Chemical Normal Overt
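
If you'd like the predicted classes alongside the patients' measurements, you can bind them back onto the tibble. This uses dplyr's mutate() purely for convenience, and the predictedClass column name is my own choice:

library(dplyr)

newDiabetesPatients %>%
  mutate(predictedClass = getPredictionResponse(newPatientsPred))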

Congratulations! Not only have you built your first machine learning model, but we've covered some reasonably complex theory, too. In the next chapter, we're going to learn about logistic regression, but first I want to list the strengths and weaknesses of the k-nearest neighbors algorithm.

3.8. Strengths and weaknesses of kNN

While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether kNN will perform well for your task.

The strengths of the kNN algorithm are as follows:

  • The algorithm is very simple to understand.
  • There is no computational cost during the learning process; all the computation is done during prediction.
  • It makes no assumptions about the data, such as how it's distributed.

The weaknesses of the kNN algorithm are these:

  • It cannot natively handle categorical variables (they must be recoded first, or a different distance metric must be used).
  • When the training set is large, it can be computationally expensive to compute the distance between new data and all the cases in the training set.
  • The model can't be interpreted in terms of real-world relationships in the data.
  • Prediction accuracy can be strongly impacted by noisy data and outliers.
  • In high-dimensional datasets, kNN tends to perform poorly. This is due to a phenomenon you'll learn about in chapter 5, called the curse of dimensionality. In brief, in high dimensions the distances between the cases start to look the same, so finding the nearest neighbors becomes difficult.
Exercise 5

Load the iris dataset using the data() function, and build a kNN model to classify its three species of iris (including tuning the k hyperparameter).

Exercise 6

Cross-validate this iris kNN model using nested CV, where the outer CV is holdout with a two-thirds split.

Exercise 7

Repeat the nested CV as in the previous exercise, but using 5-fold, non-repeated CV as the outer loop. Which of these methods gives you a more stable MMCE estimate when you repeat them?

Summary

  • kNN is a simple supervised learning algorithm that classifies new data based on the class membership of its nearest k cases in the training set.
  • To create a machine learning model in mlr, we create a task and a learner, and then train the model using them.
  • MMCE is the mean misclassification error, which is the proportion of misclassified cases in a classification problem. It is simply 1 minus accuracy.
  • The bias-variance trade-off is the balance between two types of error in predictive accuracy. Models with high bias are underfit, and models with high variance are overfit.
  • Model performance should never be evaluated on the data used to train it; cross-validation should be used, instead.
  • Cross-validation is a set of techniques for evaluating model performance by splitting the data into training and test sets.
  • Three common types of cross-validation are holdout, where a single split is used; k-fold, where the data is split into k chunks and the validation performed on each chunk; and leave-one-out, where the test set is a single case.
  • Hyperparameters are options that control how machine learning algorithms learn, which cannot be learned by the algorithm itself. Hyperparameter tuning is the best way to find optimal hyperparameters.
  • If we perform a data-dependent preprocessing step, such as hyperparameter tuning, it’s important to incorporate this in our cross-validation strategy, using nested cross-validation.

Solutions to exercises

  1. Plot the glucose and insulin variables against each other, representing the class variable using shape, and then using shape and color:
ggplot(diabetesTib, aes(glucose, insulin,
                        shape = class)) +
  geom_point()  +
  theme_bw()

ggplot(diabetesTib, aes(glucose, insulin,
                        shape = class, col = class)) +
  geom_point()  +
  theme_bw()
  2. Create a holdout resampling description that uses 10% of the cases as the test set and does not use stratified sampling:
holdoutNoStrat <- makeResampleDesc(method = "Holdout", split = 0.9,
                            stratify = FALSE)
  3. Compare the stability of the performance estimates of 3-fold cross-validation repeated 5 times or 500 times:
kFold500 <- makeResampleDesc(method = "RepCV", folds = 3, reps = 500,
                          stratify = TRUE)

kFoldCV500 <- resample(learner = knn, task = diabetesTask,
                    resampling = kFold500, measures = list(mmce, acc))

kFold5 <- makeResampleDesc(method = "RepCV", folds = 3, reps = 5,
                             stratify = TRUE)

kFoldCV5 <- resample(learner = knn, task = diabetesTask,
                       resampling = kFold5, measures = list(mmce, acc))

kFoldCV500$aggr
kFoldCV5$aggr
  4. Attempt to make leave-one-out resampling descriptions that use stratified sampling and repeated sampling:
makeResampleDesc(method = "LOO", stratify = TRUE)

makeResampleDesc(method = "LOO", reps = 5)

# Both will result in an error as LOO cross-validation cannot
# be stratified or repeated.
  5. Load the iris dataset, and build a kNN model to classify its three species of iris (including tuning the k hyperparameter):
data(iris)

irisTask <- makeClassifTask(data = iris, target = "Species")

knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:25))

gridSearch <- makeTuneControlGrid()

cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

tunedK <- tuneParams("classif.knn", task = irisTask,
                     resampling = cvForTuning,
                     par.set = knnParamSpace,
                     control = gridSearch)

tunedK

tunedK$x

knnTuningData <- generateHyperParsEffectData(tunedK)

plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
                    theme_bw()

tunedKnn <- setHyperPars(makeLearner("classif.knn"), par.vals = tunedK$x)

tunedKnnModel <- train(tunedKnn, irisTask)
  6. Cross-validate this iris kNN model using nested cross-validation, where the outer cross-validation is holdout with a two-thirds split:
inner <- makeResampleDesc("CV")

outerHoldout <- makeResampleDesc("Holdout", split = 2/3, stratify = TRUE)

knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
                              par.set = knnParamSpace,
                              control = gridSearch)

holdoutCVWithTuning <- resample(knnWrapper, irisTask,
                                resampling = outerHoldout)

holdoutCVWithTuning
  7. Repeat the nested cross-validation using 5-fold, non-repeated cross-validation as the outer loop. Which of these methods gives you a more stable MMCE estimate when you repeat them?
outerKfold <- makeResampleDesc("CV", iters = 5, stratify = TRUE)

kFoldCVWithTuning <- resample(knnWrapper, irisTask,
                              resampling = outerKfold)

kFoldCVWithTuning

resample(knnWrapper, irisTask, resampling = outerKfold)

# Repeat each validation procedure 10 times and save the mmce value.
# WARNING: this may take a few minutes to complete.

kSamples <- map_dbl(1:10, ~resample(
  knnWrapper, irisTask, resampling = outerKfold)$aggr
  )

hSamples <- map_dbl(1:10, ~resample(
  knnWrapper, irisTask, resampling = outerHoldout)$aggr
  )

hist(kSamples, xlim = c(0, 0.11))
hist(hSamples, xlim = c(0, 0.11))

# Holdout CV gives more variable estimates of model performance.