Chapter 11. Preventing overfitting with ridge regression, LASSO, and elastic net


This chapter covers

  • Managing overfitting in regression problems
  • Understanding regularization
  • Using the L1 and L2 norms to shrink parameters

Our societies are full of checks and balances. In our political systems, parties balance each other (in theory) to find solutions that are at neither extreme of each other’s views. Professional areas, such as financial services, have regulatory bodies to prevent them from doing wrong and ensure that the things they say and do are truthful and correct. When it comes to machine learning, it turns out we can apply our own form of regulation to the learning process to prevent the algorithms from overfitting the training set. We call this regulation in machine learning regularization.


11.1. What is regularization?

In this section, I’ll explain what regularization is and why it’s useful. Regularization (also sometimes called shrinkage) is a technique that prevents the parameters of a model from becoming too large and “shrinks” them toward 0. The result of regularization is models that, when making predictions on new data, have less variance.

Note

Recall that when we say a model has “less variance,” we mean it makes less-variable predictions on new data, because it is not as sensitive to the noise in the training set.

While we can apply regularization to most machine learning problems, it is most commonly used in linear modeling, where it shrinks the slope parameter of each predictor toward 0. Three particularly well-known and commonly used regularization techniques for linear models are as follows:

  • Ridge regression
  • Least absolute shrinkage and selection operator (LASSO)
  • Elastic net

These three techniques can be thought of as extensions to linear models that reduce overfitting. Because they shrink model parameters toward 0, they can also automatically perform feature selection by forcing predictors with little information to have no or negligible impact on predictions.

Note

When I say "linear modeling," I'm referring to the modeling of data using the general linear model, generalized linear model, or generalized additive model that I showed you in chapters 9 and 10.

By the end of this chapter, I hope you'll have an intuitive understanding of what regularization is, how it works, and why it's important. You'll understand how ridge regression and LASSO work and why they're useful, and how elastic net is a mixture of them both. Finally, you'll build ridge regression, LASSO, and elastic net models, and use benchmarking to compare them to each other and to a linear regression model with no regularization.


11.2. What is ridge regression?

In this section, I'll show you what ridge regression is, how it works, and why it's useful. Take a look at the example in figure 11.1, which I've reproduced from chapter 3. I used this figure in chapter 3 to show you what underfitting and overfitting look like for classification problems. When we underfit the problem, we partition the feature space in a way that doesn't do a good job of capturing local differences near the decision boundary. When we overfit, we place too much importance on these local differences and end up with a decision boundary that captures much of the noise in the training set, resulting in an overly complex decision boundary.

Figure 11.1. Examples of underfitting, optimal fitting, and overfitting for a two-class classification problem. The dotted line represents a decision boundary.

Now take a look at figure 11.2, which shows an example of what underfitting and overfitting look like for regression problems. When we underfit the data, we miss local differences in the relationship and produce a model that has high bias (makes inaccurate predictions). When we overfit the data, our model is too sensitive to local differences in the relationship and has high variance (it will make very variable predictions on new data).

Figure 11.2. Examples of underfitting, optimal fitting, and overfitting for a single-predictor regression problem. The dotted line represents the regression line.
Note

The example I've used to labor this point is of a nonlinear relationship, but the same applies to models of linear relationships too.

The principal job of regularization is to prevent algorithms from learning models that are overfit, by discouraging complexity. This is achieved by penalizing model parameters that are large, shrinking them toward 0. This might sound counterintuitive: surely the model parameters learned by ordinary least squares (OLS from chapter 9) are the best, as they minimize the residual error. The problem is that this is only necessarily true for the training set, and not the test set.

Consider the example in figure 11.3. In the left-side plot, imagine that we only measured the two more darkly shaded cases. OLS would learn a line that passes through both cases, because this will minimize the sum of squares. We collect more cases in our study, and when we plot them on the right-side plot, we can see that the first model we trained doesn't generalize well to the new data. This is due to sampling error, which is the difference between the distribution of data in our sample of cases and the distribution of data in the wider population we're trying to make predictions on. In this (slightly contrived) case, because we only measured two cases, the sample doesn't do a good job of representing the wider population, and we learned a model that overfit the training set.

This is where regularization comes in. While OLS will learn the model that best fits the training set, the training set probably isn't perfectly representative of the wider population. Overfitting the training set is more likely to result in model parameters that are too large, so regularization adds a penalty to the least squares that grows bigger with larger estimated model parameters. This process usually adds a little bias to the model, because we're intentionally underfitting the training set, but the reduction in model variance often results in a better model anyway. This is especially true in situations where the ratio of predictors to cases is large.

Figure 11.3. Sampling error leads to models that don’t generalize well to new data. In the left-side example, a regression line is fit, considering only the more darkly shaded cases. In the right-side example, all the cases are used to construct the regression line. The dotted lines help indicate that the magnitude of the slope is larger on the left side than on the right.
Note

How representative your data set is of the wider population depends on carefully planning your data acquisition, avoiding introducing bias with experimental design (or identifying and correcting for it if the data already exists), and ensuring that your datasets are sufficiently large to learn real patterns. If your data set poorly represents the wider population, no machine learning technique, including cross-validation, will be able to help you!

So regularization can help prevent overfitting due to sampling error, but perhaps a more important use of regularization is in preventing the inclusion of spurious predictors. If we add predictors to an existing linear regression model, we're likely to get better predictions on the training set. This might lead us (falsely) to believe we are creating a better model by including more predictors. This is sometimes called kitchen-sink regression (because everything goes in, including the kitchen sink). For example, imagine that you want to predict the number of people in a park on a given day, and you include the value of the FTSE 100 that day as a predictor. It's unlikely (unless the park was near the London Stock Exchange, perhaps) that the value of the FTSE 100 has an influence on the number of people. Retaining this spurious predictor in the model has the potential to result in overfitting the training set. Because regularization will shrink this parameter, it will reduce the degree to which the model overfits the training set.

Regularization can also help in situations that are ill-posed. An ill-posed problem in mathematics is one that does not satisfy these three conditions: having a solution, having a unique solution, and having a solution that depends on the initial conditions. In statistical modeling, a common ill-posed problem is when there is not one optimal parameter value, often encountered when the number of parameters is higher than the number of cases. In situations like this, regularization can make estimating the parameters a more stable problem.

What does this penalty look like that we add to the least squares estimate? Two penalties are frequently used: the L1 norm and the L2 norm. I'll start by showing you what the L2 norm is and how it works, because this is the regularization method used in ridge regression. Then I'll extend this to show you how LASSO uses the L1 norm method, and how elastic net combines both the L1 and L2 norms.


11.3. What is the L2 norm, and how does ridge regression use it?

In this section, I'll show you a mathematical and graphical explanation of the L2 norm, how ridge regression uses it, and why you would use it. Imagine that you want to predict how busy your local park will be, depending on the temperature that day. An example of what this data might look like is shown in figure 11.4.

Figure 11.4. Calculating the sum of squares from a model that predicts the number of people in a park based on the temperature
Note

I realize that people may be reading this who are from countries that use Fahrenheit or Celsius to measure temperature, so I've shown the scale in Kelvin to irritate everyone equally.

When using OLS, the residuals for a particular combination of intercept and slope are calculated for each case and squared. These squared residuals are then all added up to give the sum of squares. We can represent this in mathematical notation as in equation 11.1.

equation 11.1.

\text{sum of squares} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

yi is the value of the outcome variable for case i, and ŷi is its value predicted by the model. This is the vertical distance of each case from the line. The Greek sigma simply means that we calculate this vertical distance and square it for every case from the first one (i = 1) to the last one (n) and then add up all these values.

Mathematical functions that are minimized by machine learning algorithms to select the best combination of parameters are called loss functions. Therefore, least squares is the loss function for the OLS algorithm.

Ridge regression modifies the least squares loss function slightly to include a term that makes the function's value larger, the larger the parameter estimates are. As a result, the algorithm now has to balance selecting the model parameters that minimize the sum of squares, and selecting parameters that minimize this new penalty. In ridge regression, this penalty is called the L2 norm, and it is very easy to calculate: we simply square all of the model parameters and add them up (all except the intercept). When we have only one continuous predictor, we have only one parameter (the slope), so the L2 norm is its square. When we have two predictors, we square the slopes for each and then add these squares together, and so on. This is illustrated for our park example in figure 11.5.

Figure 11.5. Calculating the sum of squares and the L2 norm for the slope between temperature and the number of people at the park.
Note

Can you see that, in general, the more predictors a model has, the larger its L2 norm will be, because we are adding their squares together? Ridge regularization therefore penalizes models that are too complex (because they have too many predictors).
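
As a quick numeric illustration of my own (the slope values are made up), adding another non-zero slope can only increase the L2 norm:

sum(c(1.5, -2.0)^2)        # two slopes: L2 norm = 6.25
sum(c(1.5, -2.0, 0.9)^2)   # three slopes: L2 norm = 7.06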

So that we can control how much we want to penalize model complexity, we multiply the L2 norm by a value called lambda (λ, because Greek letters always sound cool). Lambda can be any value from 0 to infinity and acts as a volume knob: large values of lambda strongly penalize model complexity, while small values of lambda weakly penalize model complexity. Lambda cannot be estimated from the data, so it is a hyperparameter that we need to tune to achieve the best performance by cross-validation. Once we calculate the L2 norm and multiply it by lambda, we then add this product to the sum of squares to get our penalized least squares loss function.

Note

If we set lambda to 0, this removes the L2 norm penalty from the equation and we get back to the OLS loss function. If we set lambda to a very large value, all the slopes will shrink close to 0.

If we're mathematically minded, then we can represent this in mathematical notation as in equation 11.2. Can you see that this is the same as the sum of squares in equation 11.1, but we've added the lambda and L2 norm terms?

equation 11.2.

\text{ridge loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

So ridge regression learns a combination of model parameters that minimizes this new loss function. Imagine a situation where we have many predictors. OLS might estimate a combination of model parameters that does a great job of minimizing the least squares loss function, but the L2 norm of this combination might be huge. In this situation, ridge regression would estimate a combination of parameters that has a slightly higher least squares value but a considerably lower L2 norm. Because the L2 norm gets smaller when model parameters are smaller, the slopes estimated by ridge regression will probably be smaller than those estimated by OLS.
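
To make this penalized loss concrete, here's a small sketch of my own in R (the function name ridgeLoss and its arguments are purely illustrative; this is not code from glmnet or mlr):

ridgeLoss <- function(intercept, slopes, x, y, lambda) {
  preds <- intercept + as.matrix(x) %*% slopes  # model predictions
  sumOfSquares <- sum((y - preds)^2)            # the ordinary least squares loss
  l2norm <- sum(slopes^2)                       # the intercept is excluded
  sumOfSquares + lambda * l2norm                # the penalized least squares loss
}

With lambda = 0 this reduces to the ordinary least squares loss, and larger values of lambda punish large slopes more and more heavily.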

Important

When using L2- or L1-penalized loss functions, it's critical that the predictor variables are scaled first (divided by their standard deviation to put them on the same scale). This is because we are adding the squared slopes (in the case of L2 regularization), and this value is going to be considerably larger for predictors on larger scales (millimeters versus kilometers, for example). If we don't scale the predictors first, they won't all be given equal importance.

If you prefer a more graphical explanation of the L2-penalized loss function (I know I do), take a look at figure 11.6. The x- and y-axes show values for two slope parameters (β1 and β2). The shaded contour lines represent different sum of squares values for different combinations of the two parameters, where the combination resulting in the smallest sum of squares is at the center of the contours. The dashed circles centered at 0 represent the L2 norm multiplied by different values of lambda, for the combinations of β1 and β2 the dashed lines pass through.

Figure 11.6. A graphical representation of the ridge regression penalty. The x- and y-axes represent the values of two model parameters. The solid, concentric circles represent the sum of squares value for different combinations of the parameters. The dashed circles represent the L2 norm multiplied by lambda.

Notice that when lambda = 0, the circle passes through the combination of β1 and β2 that minimizes the sum of squares. When lambda is increased, the circle shrinks symmetrically toward 0. Now the combination of parameters that minimizes the penalized loss function is the combination with the smallest sum of squares that lies on the circle. Put another way, the optimal solution when using ridge regression is always at the intersection of the circle and the ellipse around the OLS estimate. Can you see then that as we increase lambda, the circle shrinks and the selected combination of model parameters gets sucked toward 0?

Note

In this example, I've illustrated L2 regularization for two slope parameters. If we had only one slope, we would represent the same process on a number line. If we had three parameters, the same would apply in a three-dimensional space, and the penalty circle would become a penalty sphere. This continues in as many dimensions as you have non-intercept parameters (where the penalty becomes a hypersphere).

So, by using the L2-penalized loss function to learn the slope parameters, ridge regression prevents us from training models that overfit the training data.

Note

The intercept isn't included when calculating the L2 norm because it is defined as the value of the outcome variable when all the slope parameters are equal to 0.


11.4. What is the L1 norm, and how does LASSO use it?

Now that you know about ridge regression, learning how LASSO works will be a simple extension of what you've already learned. In this section, I'll show you what the L1 norm is, how it differs from the L2 norm, and how the least absolute shrinkage and selection operator (LASSO) uses it to shrink parameter estimates.

Let's remind ourselves what the L2 norm looks like, in equation 11.3. Recall that we square the value of each of the slope parameters and add them all up. We then multiply this L2 norm by lambda to get the penalty we add to the sum of squares loss function.

equation 11.3.

\text{L2 norm} = \sum_{j=1}^{p} \beta_j^2

The L1 norm is only slightly different than the L2 norm. Instead of squaring the parameter values, we take their absolute value instead and then sum them. This is shown in equation 11.4 by the vertical lines around βj.

equation 11.4.

\text{L1 norm} = \sum_{j=1}^{p} |\beta_j|

Mv nrvp recate ryx vfac cnifnuto elt LASSO (oqr V1-paleidezn efac oncuitnf) jn etclaxy yro vsmc qwz wo yjb tlk ridge regression: vw uimtplyl bxr L1 norm gh lambda (cwhhi ysz rog mccv gnaenmi) zbn yyz jr xr bxr sum of squares. Avu Z1-laeipedzn fecz ficnnout ja nwsho nj equation 11.5. Geocit qrsr xry nfue dnfefeecir bteewen rgzj iatqnoeu ngs equation 11.2 zj bcrr wx vrvz rpk tulseboa velau lk xrq parameters eofbre igmsmun kumr, iadsten lv nirugqas uvmr. Sus wk yys hreet oslesp, vnk lk cwihh wcz entegvia: 2.2, –3.1, 0.8. Xyo L1 norm le sehet ehert osepls wudol od 2.2 + 3.1 + 0.8 = 6.1.

equation 11.5.

\text{LASSO loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
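
A quick way to check the worked example for yourself (a snippet of my own, not one of the listings) is to compute both norms directly in R:

slopes <- c(2.2, -3.1, 0.8)

sum(abs(slopes))  # the L1 norm: 6.1
sum(slopes^2)     # the L2 norm of the same slopes, for comparison: 15.09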

I can already hear you thinking, "So what? What's the benefit/difference of using the L1 norm instead of the L2 norm?" Well, ridge regression can shrink parameter estimates toward 0, but they will never actually be 0 (unless the OLS estimate is 0 to begin with). So if you have a machine learning task where you believe all the variables should have some degree of predictive value, ridge regression is great because it won't remove any variables. But what if you have a large number of variables and/or you want an algorithm that will perform feature selection for you? LASSO is helpful here because unlike ridge regression, LASSO is able to shrink small parameter values to 0, effectively removing that predictor from the model.

Let's represent this graphically the same way we did for ridge regression. Figure 11.7 shows the contours of the sum of squares for the same two imaginary parameters as in figure 11.6. Instead of forming a circle, the LASSO penalty forms a square, rotated 45° such that its vertices lie along the axes (I guess you could call this a diamond). Can you see that, for the same lambda as in our ridge regression example, the combination of parameters with the smallest sum of squares that touches the diamond is one where parameter β2 is 0? This means the predictor represented by this parameter has been removed from the model.

Figure 11.7. A graphical representation of the LASSO penalty. The x- and y-axes represent the values of two model parameters. The solid, concentric circles represent the sum of squares value for different combinations of the parameters. The dashed diamonds represent the L1 norm multiplied by lambda.
Note

If we had three parameters, we could represent the LASSO penalty as a cube (with its vertices aligned with the axes). It's hard to visualize this in more than three dimensions, but the LASSO penalty would be a hypercube.

Just to make this extra clear, I've overlaid the LASSO and ridge penalties in figure 11.8, including dotted lines that highlight the parameter values chosen by each method.


11.5. What is elastic net?

In this section, I'll show you what elastic net is and how it mixes L2 and L1 regularization to find a compromise between ridge regression and LASSO parameter estimates. Sometimes you may have a prior justification for why you wish to use ridge regression or LASSO. If it's important that you include all your predictors in the model, however small their contribution, use ridge regression. If you want the algorithm to perform feature selection for you by shrinking uninformative slopes to 0, use LASSO. More often than not, though, the decision between ridge regression and LASSO isn't a clear one. In such situations, don't choose between them: use elastic net, instead.

Figure 11.8. Comparing the ridge regression and LASSO penalties
Note

One important limitation of LASSO is that if you have more predictors than cases, it will select at most a number of predictors equal to the number of cases in the data. Put another way, if your data set contains 100 predictors and 50 cases, LASSO will set the slopes of at least 50 predictors to 0!

Elastic net is an extension of linear modeling that includes both L2 and L1 regularization in its loss function. It finds a combination of parameter estimates somewhere between those found by ridge regression and LASSO. We're also able to control just how much importance we place on the L2 versus the L1 norms using the hyperparameter alpha.

Take a look at equation 11.6. We multiply the L2 norm by 1 – α, multiply the L1 norm by α, and add these values together. We then multiply this sum by lambda and add it to the sum of squares. Alpha here can take any value between 0 and 1:

  • When alpha is 0, the L1 norm term becomes 0, and we get pure ridge regression.
  • When alpha is 1, the L2 norm term becomes 0, and we get pure LASSO.
  • When alpha is between 0 and 1, we get a mixture of ridge regression and LASSO.

How do we choose alpha? We don't! We tune it as a hyperparameter and let cross-validation choose the best-performing value for us.

equation 11.6.

\text{elastic net penalty} = \lambda \left( (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right)

If you're more mathematically inclined, the full elastic net loss function is shown in equation 11.7. If you're not mathematically inclined, feel free to skip over this; but if you look carefully, I'm sure you'll be able to see how the elastic net loss function combines the ridge and LASSO loss functions.

equation 11.7.

\text{elastic net loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right)
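
To see how alpha blends the two penalties, here's a minimal sketch of my own following the formulation above (the helper elasticNetPenalty is purely illustrative; glmnet parameterizes its internal penalty slightly differently):

elasticNetPenalty <- function(slopes, lambda, alpha) {
  lambda * ((1 - alpha) * sum(slopes^2) + alpha * sum(abs(slopes)))
}

elasticNetPenalty(c(2.2, -3.1, 0.8), lambda = 1, alpha = 0)  # pure ridge penalty
elasticNetPenalty(c(2.2, -3.1, 0.8), lambda = 1, alpha = 1)  # pure LASSO penalty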

Prefer a graphical explanation? Me too. Figure 11.9 compares the shapes of the ridge, LASSO, and elastic net penalties. Because the elastic net penalty is somewhere between the ridge and LASSO penalties, it looks like a square with rounded sides.

Figure 11.9. Comparing the shape of the ridge regression, LASSO, and elastic net penalties

So why might we prefer elastic net over ridge regression or LASSO? Well, elastic net can shrink parameter estimates to 0, allowing it to perform feature selection like LASSO. But it also circumvents LASSO's limitation of not being able to select more variables than there are cases. Another limitation of LASSO is that if there is a group of predictors that are correlated with each other, LASSO will only select one of the predictors. Elastic net, on the other hand, is able to retain the group of predictors.

For these reasons, I usually dive straight in with elastic net as my regularization method of choice. Even if pure ridge or LASSO will result in the best-performing model, the ability to tune alpha as a hyperparameter still allows the possibility of selecting ridge or LASSO, although the optimal solution is usually somewhere between them. An exception to this is when we have prior knowledge about the effect of the predictors we've included in our model. If we have very strong domain knowledge that the predictors ought to be included in the model, then we may have a preference for ridge regression. Conversely, if we have a strong prior belief that there are variables that probably don't contribute anything (but we don't know which), we may prefer LASSO.

I hope I've conveyed how regularization can be used to extend linear models to avoid overfitting. You should now also have a conceptual understanding of ridge regression, LASSO, and elastic net, so let's turn concepts into experience by training a model of each!


11.6. Building your first ridge, LASSO, and elastic net models

In this section, we're going to build ridge, LASSO, and elastic net models on the same data set, and use benchmarking to compare how they perform against each other and against a vanilla (unregularized) linear model. Imagine that you're trying to estimate the market price of wheat for the coming year in Iowa. The market price depends on the yield for that particular year, so you're trying to predict the yield of wheat from rain and temperature measurements. Let's start by loading the mlr and tidyverse packages:

library(mlr)

library(tidyverse)

11.6.1. Loading and exploring the Iowa dataset

Now let's load the data, which is built into the lasso2 package, convert it into a tibble (with as_tibble()), and explore it.

Note

You may need to install the lasso2 package first with install.packages("lasso2").

We have a tibble containing only 33 cases and 10 variables of various rainfall and temperature measurements, the year, and the wheat yield.

Listing 11.1. Loading and exploring the Iowa dataset
data(Iowa, package = "lasso2")

iowaTib <- as_tibble(Iowa)

iowaTib

# A tibble: 33 x 10
    Year Rain0 Temp1 Rain1 Temp2 Rain2 Temp3 Rain3 Temp4 Yield
   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1930  17.8  60.2  5.83  69    1.49  77.9  2.42  74.4  34
 2  1931  14.8  57.5  3.83  75    2.72  77.2  3.3   72.6  32.9
 3  1932  28.0  62.3  5.17  72    3.12  75.8  7.1   72.2  43
 4  1933  16.8  60.5  1.64  77.8  3.45  76.4  3.01  70.5  40
 5  1934  11.4  69.5  3.49  77.2  3.85  79.7  2.84  73.4  23
 6  1935  22.7  55    7     65.9  3.35  79.4  2.42  73.6  38.4
 7  1936  17.9  66.2  2.85  70.1  0.51  83.4  3.48  79.2  20
 8  1937  23.3  61.8  3.8   69    2.63  75.9  3.99  77.8  44.6
 9  1938  18.5  59.5  4.67  69.2  4.24  76.5  3.82  75.7  46.3
10  1939  18.6  66.4  5.32  71.4  3.15  76.2  4.72  70.7  52.2
# ... with 23 more rows

Let's plot the data to get a better understanding of the relationships within it. We'll use our usual trick of gathering the data so we can facet by each variable, supplying "free_x" as the scales argument to allow the x-axis to vary between facets. To get an indication as to any linear relationships with Yield, I also applied a geom_smooth layer, using "lm" as the argument to method to get the linear fit.

Listing 11.2. Plotting the data
iowaUntidy <- gather(iowaTib, "Variable", "Value", -Yield)

ggplot(iowaUntidy, aes(Value, Yield)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()

The resulting plot is shown in figure 11.10. It looks like some of the variables correlate with Yield; but notice that because we don't have a large number of cases, the slopes of some of these relationships could drastically change if we only removed a couple of cases near the extremes of the x-axis. For example, would the slope between Rain2 and Yield be nearly as steep if we hadn't measured those three cases with the highest rainfall? We're going to need regularization to prevent overfitting for this data set.

Figure 11.10. Plotting each of the predictors against wheat yield for the Iowa dataset. Lines represent linear model fits between each predictor and yield.

11.6.2. Training the ridge regression model

In this section, I'll walk you through training a ridge regression model to predict Yield from our Iowa data set. We'll tune the lambda hyperparameter and train a model using its optimal value.

Let's define our task and learner, this time supplying "regr.glmnet" as the argument to makeLearner(). Handily, the glmnet function (from the package of the same name) allows us to create ridge, LASSO, and elastic net models using the same function. Notice that we set the value of alpha equal to 0 here. This is how we specify that we want to use pure ridge regression with the glmnet function. We also supply an argument that you haven't seen before: id. The id argument just lets us supply a unique name to every learner. The reason we need this now is that later in the chapter, we're going to benchmark our ridge, LASSO, and elastic net learners against each other. Because we create each of these with the same glmnet function, we would get an error if they didn't each have a unique identifier.

Listing 11.3. Creating the task and learner
iowaTask <- makeRegrTask(data = iowaTib, target = "Yield")

ridge <- makeLearner("regr.glmnet", alpha = 0, id = "ridge")

Let's get an idea of how much each predictor would contribute to a model's ability to predict Yield. We can use the generateFilterValuesData() and plotFilterValues() functions we used in chapter 9 when performing feature selection using the filter method.

Listing 11.4. Generating and plotting filter values
filterVals <- generateFilterValuesData(iowaTask)

plotFilterValues(filterVals) + theme_bw()

The resulting plot is shown in figure 11.11. We can see that Year contains the most predictive information about Yield; Rain3, Rain1, and Rain0 seem to contribute very little; and Temp1 seems to make a negative contribution, suggesting that including it in the model will be to the detriment of predictive accuracy.

But we're not going to perform feature selection. Instead, we're going to enter all the predictors and let the algorithm shrink the ones that contribute less to the model. The first thing we need to do is tune the lambda hyperparameter that controls just how big a penalty to apply to the parameter estimates.

Note

Remember that when lambda equals 0, we are applying no penalty and get the OLS parameter estimates. The larger lambda is, the more the parameters are shrunk toward 0.

Figure 11.11. Plotting the result of generateFilterValuesData(). Bar height represents how much information each predictor contains about wheat yield.

We'll start by defining the hyperparameter space we're going to search to find the optimal value of lambda. Recall that to do this, we use the makeParamSet() function, supplying each of the hyperparameters to search, separated by commas. Because we only have one hyperparameter to tune, and because lambda can take any numeric value between 0 and infinity, we use the makeNumericParam() function to specify that we want to search for numeric values of lambda between 0 and 15.

Note

Notice that I've called the hyperparameter "s" instead of "lambda". If you run getParamSet(ridge), you will indeed see a tunable hyperparameter called lambda, so what's with the "s"? The authors of glmnet helpfully wrote it so that it will build models for a range of lambdas for us. We can then plot the lambdas to see which one gives the best cross-validated performance. This is handy, but seeing as we're using mlr as a universal interface to many machine learning packages, it makes sense for us to tune lambda ourselves the way we're used to. The glmnet lambda hyperparameter is used for specifying a sequence of lambda values to try, and the authors specifically recommend not supplying a single value for this hyperparameter. Instead, the s hyperparameter is used to train a model with a single, specific lambda, so this is what we will tune when using mlr. For more information, I suggest reading the documentation for glmnet by running ?glmnet::glmnet.

Next, let's define our search method as a random search with 200 iterations using makeTuneControlRandom(), and define our cross-validation method as 3-fold cross-validation repeated 10 times, using makeResampleDesc(). Finally, we run our hyperparameter tuning process with the tuneParams() function. To speed things up a little, let's use parallelStartSocket() to parallelize the search.

Warning

This takes about 30 seconds on my four-core machine.

Listing 11.5. Tuning the lambda (s) hyperparameter
ridgeParamSpace <- makeParamSet(
  makeNumericParam("s", lower = 0, upper = 15))

randSearch <- makeTuneControlRandom(maxit = 200)

cvForTuning <- makeResampleDesc("RepCV", folds = 3, reps = 10)

library(parallel)
library(parallelMap)

parallelStartSocket(cpus = detectCores())

tunedRidgePars <- tuneParams(ridge, task = iowaTask,
                             resampling = cvForTuning,
                             par.set = ridgeParamSpace,
                             control = randSearch)

parallelStop()

tunedRidgePars

Tune result:
Op. pars: s=6.04
mse.test.mean=96.8360

Our tuning process selected 6.04 as the best-performing lambda (yours might be a little different due to the random search). But how can we be sure we searched over a large enough range of lambdas? Let's plot each value of lambda against the mean MSE of its models and see if it looks like there may be a better value outside of our search space (greater than 15).

First, we extract the lambda and mean MSE values for each iteration of the random search by supplying our tuning object as the argument to the generateHyperParsEffectData() function. Then, we supply this data as the first argument of the plotHyperParsEffect() function and tell it we want to plot the values of s on the x-axis and the mean MSE ("mse.test.mean") on the y-axis, and that we want a line that connects the data points.

Listing 11.6. Plotting the hyperparameter tuning process
ridgeTuningData <- generateHyperParsEffectData(tunedRidgePars)

plotHyperParsEffect(ridgeTuningData, x = "s", y = "mse.test.mean",
                    plot.type = "line") +
  theme_bw()

The resulting plot is shown in figure 11.12. We can see that the MSE is minimized for lambdas between 5 and 6, and it seems that increasing lambda beyond 6 results in models that perform worse. If the MSE seemed to be still decreasing at the edge of our search space, we would need to expand the search in case we're missing better hyperparameter values. Because we appear to be at the minimum, we're going to stop our search here.

Figure 11.12. Plotting the ridge regression lambda-tuning process. The x-axis represents lambda, and the y-axis represents the mean MSE. Dots represent values of lambda sampled by the random search. The line connects the dots.
Note

Maybe I've been too hasty, because it's possible we are only in a local minimum, the smallest MSE value compared to the values of lambda around it. When searching a hyperparameter space, there may be many local minima (plural of minimum); but we really want to find the global minimum, which is the lowest MSE value across all possible hyperparameter values. For example, imagine that if we kept increasing lambda, the MSE got higher but then started to come down again, forming a hill. It's possible that this hill continues to decrease even more than the minimum shown in figure 11.12. Therefore, it's a good idea to really search your hyperparameter space well to try to find that global minimum.
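
If you'd rather inspect the raw search results than read them off the plot, here's an optional aside of my own: the data behind plotHyperParsEffect() is just a data frame, so you can sort it to see the best few iterations directly.

head(dplyr::arrange(ridgeTuningData$data, mse.test.mean), 3)  # three lowest mean MSEs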

Exercise 1

Repeat the tuning process, but this time expand the search space to include values of s between 0 and 50 (don't overwrite anything). Did our original search find a local minimum or the global minimum?

Okay, now that we think we've selected the best-performing value of lambda, let's train a model using that value. First, we use the setHyperPars() function to define a new learner using our tuned lambda value. Then, we use the train() function to train the model on our iowaTask.

Listing 11.7. Training a ridge regression model using the tuned lambda
tunedRidge <- setHyperPars(ridge, par.vals = tunedRidgePars$x)

tunedRidgeModel <- train(tunedRidge, iowaTask)
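
As a quick aside of my own (not one of the numbered listings), the trained model can already be used to make predictions with mlr's predict() function; here we simply predict back onto the training data and measure its error:

ridgePreds <- predict(tunedRidgeModel, newdata = iowaTib)

performance(ridgePreds, measures = mse)  # training-set mean squared error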

One of the main motivations for using linear models is that we can interpret the slopes to get an idea of how much the outcome variable changes with each predictor. So let's extract the parameter estimates from our ridge regression model. First, we extract the model data using the getLearnerModel() function. Then, we use the coef() function (short for coefficients) to extract the parameter estimates. Note that because of the way glmnet works, we need to supply the value of lambda to get the parameters for that model.

When we print ridgeCoefs, we get a matrix containing the name of each parameter and its slope. The intercept is the estimated Yield when all the predictors are 0. Of course, it doesn't make much sense to have negative wheat yield, but because it doesn't make sense for all the predictors to be 0 (such as the year), we won't interpret this. We're more interested in interpreting the slopes, which are reported on the predictors' original scale. We can see that for every additional year, wheat yield increased by 0.533 bushels per acre. For a one-inch increase in Rain1, wheat yield decreased by 0.703, and so on.

Note

Recall that I mentioned how important it is to scale our predictors so that they are weighted equally when calculating the L1 and/or L2 norms. Well, glmnet does this for us by default, using its standardize = TRUE argument. This is handy, but it's important to remember that the parameter estimates are transformed back onto the variables' original scale.

Listing 11.8. Extracting the model parameters
ridgeModelData <- getLearnerModel(tunedRidgeModel)

ridgeCoefs <- coef(ridgeModelData, s = tunedRidgePars$x$s)

ridgeCoefs

10 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) -908.45834
Year           0.53278
Rain0          0.34269
Temp1         -0.23601
Rain1         -0.70286
Temp2          0.03184
Rain2          1.91915
Temp3         -0.57963
Rain3          0.63953
Temp4         -0.47821

Let's plot these parameter estimates against the estimates from unregularized linear regression, so you can see the effect of parameter shrinkage. First, we need to train a linear model using OLS. We could do this with mlr, but as we're not going to do anything fancy with this model, we can create one quickly using the lm() function. The first argument to lm() is the formula Yield ~ ., which means Yield is our outcome variable, and we want to model it (~) using all other variables in the data (.). We tell the function where to find the data, and wrap the whole lm() function inside the coef() function to extract its parameter estimates.

Next, we create a tibble containing three variables:

  • The parameter names
  • The ridge regression parameter values
  • The lm parameter values

Because we want to exclude the intercepts, we use [-1] to subset all the parameters except the first one (the intercept).

So that we can facet by model, we gather() the data and then plot it using ggplot(). Because it's nice to see things in ascending or descending order, we supply reorder(Coef, Beta), which will use the Coef variable as the x aesthetic, reordered by the Beta variable. By default, geom_bar() tries to plot frequencies, but because we want bar height to represent the actual value of each parameter, we set the stat = "identity" argument.

Listing 11.9. Plotting the model parameters
lmCoefs <- coef(lm(Yield ~ ., data = iowaTib))

coefTib <- tibble(Coef = rownames(ridgeCoefs)[-1],
                  Ridge = as.vector(ridgeCoefs)[-1],
                  Lm = as.vector(lmCoefs)[-1])

coefUntidy <- gather(coefTib, key = Model, value = Beta, -Coef)

ggplot(coefUntidy, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", col = "black") +
  facet_wrap(~Model) +
  theme_bw()  +
  theme(legend.position = "none")

The resulting plot is shown in figure 11.13. In the left facet, we have the parameter estimates for the unregularized model; and in the right facet, we have the estimates for our ridge regression model. Can you see that most of the ridge regression parameters (though not all) are smaller than those for the unregularized model? This is the effect of regularization.

Exercise 2

Create another plot exactly the same as in figure 11.13, but this time include the intercepts. Are they the same between the two models? Why?

Figure 11.13. Comparing the parameter estimates of our ridge regression model to our OLS regression model

11.6.3. Training the LASSO model

In this section, we'll repeat the model-building process of the previous section, but using LASSO instead. Once we've trained our model, we'll add to our figure, so we can compare parameter estimates between the models, to give you a better understanding of how the techniques differ.

We start by defining the LASSO learner, this time setting alpha equal to 1 (to make it pure LASSO). And we give the learner an ID, which we'll use when we benchmark the models later:

lasso <- makeLearner("regr.glmnet", alpha = 1, id = "lasso")

Now, let's tune lambda as we did before for ridge regression.

Warning

This takes about 30 seconds on my four-core machine.

Listing 11.10. Tuning lambda for LASSO
lassoParamSpace <- makeParamSet(
  makeNumericParam("s", lower = 0, upper = 15))

parallelStartSocket(cpus = detectCores())

tunedLassoPars <- tuneParams(lasso, task = iowaTask,
                             resampling = cvForTuning,
                             par.set = lassoParamSpace,
                             control = randSearch)

parallelStop()

tunedLassoPars

Tune result:
Op. pars: s=1.37
mse.test.mean=87.0126

Now we plot the tuning process to see if we need to expand our search.

Listing 11.11. Plotting the hyperparameter tuning process
lassoTuningData <- generateHyperParsEffectData(tunedLassoPars)

plotHyperParsEffect(lassoTuningData, x = "s", y = "mse.test.mean",
                    plot.type = "line") +
  theme_bw()

The resulting plot is shown in figure 11.14. Once again, we can see that the selected value of lambda falls at the bottom of the valley of mean MSE values. Notice that the mean MSE flat-lines after lambda values of about 10: this is because the penalty is so large here that all the predictors have been removed from the model, and we get the mean MSE of an intercept-only model.

Figure 11.14. Plotting the LASSO lambda-tuning process. The x-axis represents lambda, and the y-axis represents the mean MSE. Dots represent values of lambda sampled by the random search. The line connects the dots.

Let's train a LASSO model using our tuned value of lambda.

Listing 11.12. Training a LASSO model using the tuned lambda
tunedLasso <- setHyperPars(lasso, par.vals = tunedLassoPars$x)

tunedLassoModel <- train(tunedLasso, iowaTask)

Now let's look at the parameter estimates from our tuned LASSO model and see how they compare to the ridge and OLS estimates. Once again, we use the getLearnerModel() function to extract the model data and then the coef() function to extract the parameter estimates. Notice something unusual? Three of our parameter estimates are just dots. Well, those dots actually represent 0.0. Zilch. Nada. Nothing. The slopes of these parameters have been set to exactly 0. This means they have been removed from the model completely. This is how LASSO can be used for performing feature selection.

Listing 11.13. Extracting the model parameters
lassoModelData <- getLearnerModel(tunedLassoModel)

lassoCoefs <- coef(lassoModelData, s = tunedLassoPars$x$s)

lassoCoefs

10 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) -1.361e+03
Year         7.389e-01
Rain0        2.217e-01
Temp1        .
Rain1        .
Temp2        .
Rain2        2.005e+00
Temp3       -4.065e-02
Rain3        1.669e-01
Temp4       -4.829e-01
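
As a small follow-up sketch of my own (not one of the numbered listings), you can pull out just the predictors that survived the shrinkage, that is, those whose slopes were not set to exactly 0:

keptPredictors <- rownames(lassoCoefs)[as.vector(lassoCoefs) != 0]

setdiff(keptPredictors, "(Intercept)")  # drop the intercept from the list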

Let's plot these parameter estimates alongside those from our ridge and OLS models to give a more graphical comparison. To do this, we simply add a new column to our coefTib tibble using $LASSO; it contains the parameter estimates from our LASSO model (excluding the intercept). We then gather this data so we can facet by model, and plot it as before using ggplot().

Listing 11.14. Plotting the model parameters
coefTib$LASSO <- as.vector(lassoCoefs)[-1]

coefUntidy <- gather(coefTib, key = Model, value = Beta, -Coef)

ggplot(coefUntidy, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", col = "black") +
  facet_wrap(~ Model) +
  theme_bw() +
  theme(legend.position = "none")

The resulting plot is shown in figure 11.15. The plot nicely highlights the difference between ridge, which shrinks parameters toward 0 (but never actually to 0), and LASSO, which can shrink parameters to exactly 0.

Figure 11.15. Comparing the parameter estimates of our ridge regression model, LASSO model, and OLS regression model

11.6.4. Training the elastic net model

This section is going to look a lot like the previous two, but I'll show you how to train an elastic net model by tuning both lambda and alpha. We'll start by creating an elastic net learner; this time we won't supply a value of alpha, because we're going to tune it to find the best trade-off between L1 and L2 regularization. We also give it an ID that we can use later when benchmarking:

elastic <- makeLearner("regr.glmnet", id = "elastic")

Now let's define the hyperparameter space we're going to tune over, this time including alpha as a numeric hyperparameter bounded between 0 and 1. Because we're now tuning two hyperparameters, let's increase the number of iterations of our random search to get a little more coverage of the search space. Finally, we run the tuning process as before and print the optimal result.

Warning

This takes about a minute on my four-core machine.

Listing 11.15. Tuning lambda and alpha for elastic net
elasticParamSpace <- makeParamSet(
  makeNumericParam("s", lower = 0, upper = 10),
  makeNumericParam("alpha", lower = 0, upper = 1))

randSearchElastic <- makeTuneControlRandom(maxit = 400)

parallelStartSocket(cpus = detectCores())

tunedElasticPars <- tuneParams(elastic, task = iowaTask,
                               resampling = cvForTuning,
                               par.set = elasticParamSpace,
                               control = randSearchElastic)

parallelStop()

tunedElasticPars

Tune result:
Op. pars: s=1.24; alpha=0.981
mse.test.mean=84.7701

Now let's plot our tuning process to confirm that our search space was large enough. This time, because we are tuning two hyperparameters simultaneously, we supply lambda and alpha as the x- and y-axes, and mean MSE ("mse.test.mean") as the z-axis. Setting the plot.type argument equal to "heatmap" will draw a heatmap where the color is mapped to whatever we set as the z-axis. For this to work, though, we need to fill in the gaps between our 400 search iterations. To do this, we supply the name of any regression algorithm to the interpolate argument. Here, I've used "regr.kknn", which uses k-nearest neighbors to fill in the gaps based on the MSE values of the nearest search iterations. We add a single geom_point to the plot to indicate the combination of lambda and alpha that was selected by our tuning process.

Note

This interpolation is for visualization only, so while choosing different interpolation learners may change the tuning plot, it won't affect our selected hyperparameters.

Listing 11.16. Plotting the tuning process
elasticTuningData <- generateHyperParsEffectData(tunedElasticPars)

plotHyperParsEffect(elasticTuningData, x = "s", y = "alpha",
                    z = "mse.test.mean", interpolate = "regr.kknn",
                    plot.type = "heatmap") +
  scale_fill_gradientn(colours = terrain.colors(5)) +
  geom_point(x = tunedElasticPars$x$s, y = tunedElasticPars$x$alpha,
             col = "white") +
  theme_bw()

The resulting plot is shown in figure 11.16. Beautiful! You could hang this on your wall and call it art. Notice that the selected combination of lambda and alpha (the white dot) falls in a valley of mean MSE values, suggesting our hyperparameter search space was wide enough.

Figure 11.16. Plotting the hyperparameter tuning process for our elastic net model. The x-axis represents lambda, the y-axis represents alpha, and the shading represents mean MSE. The white dot represents the combination of hyperparameters chosen by our tuning process.
Exercise 3

Let's experiment with the plotHyperParsEffect() function. Change the plot.type argument to "contour", add the argument show.experiments = TRUE, and redraw the plot. Next, change plot.type to "scatter", remove the interpolate and show.experiments arguments, and remove the scale_fill_gradientn() layer.

Now let's train the final elastic net model using our tuned hyperparameters.

Listing 11.17. Training an elastic net model using tuned hyperparameters
tunedElastic <- setHyperPars(elastic, par.vals = tunedElasticPars$x)

tunedElasticModel <- train(tunedElastic, iowaTask)

Next, we can extract the model parameters and plot them alongside the other three models, as we did in listings 11.9 and 11.14.

Listing 11.18. Plotting the model parameters
elasticModelData <- getLearnerModel(tunedElasticModel)

elasticCoefs <- coef(elasticModelData, s = tunedElasticPars$x$s)

coefTib$Elastic <- as.vector(elasticCoefs)[-1]

coefUntidy <- gather(coefTib, key = Model, value = Beta, -Coef)

ggplot(coefUntidy, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  facet_wrap(~ Model) +
  theme_bw()

The resulting plot is shown in figure 11.17. Notice that our elastic net model's parameter estimates are something of a compromise between those estimated by ridge regression and those estimated by LASSO. The elastic net model's parameters are more similar to those estimated by pure LASSO, however, because our tuned value of alpha was close to 1 (remember that when alpha equals 1, we get pure LASSO).
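
You can confirm this directly with a quick check of my own (not one of the numbered listings) by printing the tuned value of alpha stored in the tuning result:

tunedElasticPars$x$alpha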

Exercise 4

Redraw the plot in figure 11.17, but remove the facet_wrap() layer and set the position argument of geom_bar() equal to "dodge". Which visualization do you prefer?

Figure 11.17. Comparing the parameter estimates of our ridge regression model, LASSO model, elastic net model, and OLS regression model

11.7. Benchmarking ridge, LASSO, elastic net, and OLS against each other

Let's use benchmarking to simultaneously cross-validate and compare the performance of our ridge, LASSO, elastic net, and OLS model-building processes. Recall from chapter 8 that benchmarking takes a list of learners, a task, and a cross-validation procedure. Then, for each iteration/fold of the cross-validation process, a model is trained using each learner on the same training set and evaluated on the same test set. Once the entire cross-validation process is complete, we get the mean performance metric (MSE, in this case) for each learner, allowing us to compare which would perform best.

Listing 11.19. Creating tuning wrappers for the regularized learners
ridgeWrapper <- makeTuneWrapper(ridge, resampling = cvForTuning,
                                par.set = ridgeParamSpace,
                                control = randSearch)

lassoWrapper <- makeTuneWrapper(lasso, resampling = cvForTuning,
                                par.set = lassoParamSpace,
                                control = randSearch)

elasticWrapper <- makeTuneWrapper(elastic, resampling = cvForTuning,
                                  par.set = elasticParamSpace,
                                  control = randSearchElastic)

learners = list(ridgeWrapper, lassoWrapper, elasticWrapper, "regr.lm")

We start by defining tuning wrappers for each learner so we can include hyperparameter tuning inside our cross-validation loop. For each wrapper (one each for ridge, LASSO, and elastic net), we supply the learner, the cross-validation strategy, the parameter space for that learner, and the search procedure for that learner (notice that we use a different search procedure for elastic net). OLS regression doesn't need hyperparameter tuning, so we don't make a wrapper for it. Because the benchmark() function requires a list of learners, we next create a list of these wrappers (and "regr.lm", our OLS regression learner).

To run the benchmarking experiment, let's define our outer resampling strategy to be 3-fold cross-validation. After starting parallelization, we run the benchmarking experiment by supplying the list of learners, the task, and the outer cross-validation strategy to the benchmark() function.

Warning

This took almost 6 minutes on my four-core machine.

Listing 11.20. Benchmarking the model-building processes
library(parallel)
library(parallelMap)

kFold3 <- makeResampleDesc("CV", iters = 3)

parallelStartSocket(cpus = detectCores())

bench <- benchmark(learners, iowaTask, kFold3)

parallelStop()

bench

  task.id    learner.id mse.test.mean
1 iowaTib   ridge.tuned         95.48
2 iowaTib   lasso.tuned         93.98
3 iowaTib elastic.tuned         99.19
4 iowaTib       regr.lm        120.37

Perhaps surprisingly, ridge and LASSO regression both outperformed elastic net, although all three regularization techniques outperformed OLS regression. Because elastic net has the potential to select either pure ridge or pure LASSO (based on the value of the alpha hyperparameter), increasing the number of iterations of the random search could end up putting elastic net on top.
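
If you want to dig into the benchmark results further, here's an optional aside of my own (not one of the numbered listings): mlr can return the aggregated and per-iteration performances as data frames for closer inspection.

getBMRAggrPerformances(bench, as.df = TRUE)  # mean MSE per learner

getBMRPerformances(bench, as.df = TRUE)      # MSE for every fold of every learner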

11.8. Strengths and weaknesses of ridge, LASSO, and elastic net

While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether ridge regression, LASSO, and elastic net will perform well for you.

The strengths of ridge, LASSO, and elastic net are as follows:

  • They produce models that are very interpretable.
  • They can handle both continuous and categorical predictors.
  • They are computationally inexpensive.
  • They often outperform OLS regression.
  • LASSO and elastic net can perform feature selection by setting the slopes of uninformative predictors equal to 0.
  • They can also be applied to generalized linear models (such as logistic regression).

The weaknesses of ridge, LASSO, and elastic net are these:

  • They make strong assumptions about the data, such as homoscedasticity (constant variance) and the distribution of residuals (performance may suffer if these are violated).
  • Ridge regression cannot perform feature selection automatically.
  • LASSO cannot estimate more parameters than there are cases in the training set.
  • They cannot handle missing data.
Exercise 5

Create a new tibble that contains only the Yield variable, and make a new regression task using this data, with Yield set as the target.

  1. Train an ordinary OLS model on this data (a model with no predictors).
  2. Train a LASSO model on the original iowaTask with a lambda value of 500.
  3. Cross-validate both models using leave-one-out cross-validation (makeResampleDesc("LOO")).
  4. How do the mean MSE values of both models compare? Why?
Exercise 6

Calling plot() on a glmnet model object doesn't plot model residuals. Install the plotmo package and use its plotres() function, passing the model data objects for the ridge, LASSO, and elastic net models as arguments.

Summary

  • Regularization is a set of techniques that prevents overfitting by shrinking model parameter estimates.
  • There are three regularization techniques for linear models: ridge regression, LASSO, and elastic net.
  • Ridge regression uses the L2 norm to shrink parameter estimates toward 0 (but never exactly to 0, unless they were 0 to begin with).
  • LASSO uses the L1 norm to shrink parameter estimates toward 0 (and possibly exactly to 0, resulting in feature selection).
  • Elastic net combines both L2 and L1 regularization, the ratio of which is controlled by the alpha hyperparameter.
  • For all three, the lambda hyperparameter controls the strength of shrinkage.

Solutions to exercises

  1. Expand the search space to include values of lambda from 0 to 50:
ridgeParamSpaceExtended <- makeParamSet(
  makeNumericParam("s", lower = 0, upper = 50))

parallelStartSocket(cpus = detectCores())

tunedRidgeParsExtended <- tuneParams(ridge, task = iowaTask, # ~30 sec
                             resampling = cvForTuning,
                             par.set = ridgeParamSpaceExtended,
                             control = randSearch)

parallelStop()

ridgeTuningDataExtended <- generateHyperParsEffectData(
                                      tunedRidgeParsExtended)

plotHyperParsEffect(ridgeTuningDataExtended, x = "s", y = "mse.test.mean",
                    plot.type = "line") +
  theme_bw()

# The previous value of s was not just a local minimum,
# but the global minimum.
  2. Plot the intercepts for the ridge and OLS models:
coefTibInts <- tibble(Coef = rownames(ridgeCoefs),
                  Ridge = as.vector(ridgeCoefs),
                  Lm = as.vector(lmCoefs))
coefUntidyInts <- gather(coefTibInts, key = Model, value = Beta, -Coef)

ggplot(coefUntidyInts, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", col = "black") +
  facet_wrap(~Model) +
  theme_bw()  +
  theme(legend.position = "none")

# The intercepts are different. The intercept isn't included when
# calculating the L2 norm, but is the value of the outcome when all
# the predictors are zero. Because ridge regression changes the parameter
# estimates of the predictors, the intercept changes as a result.
  3. Experiment with different ways of plotting the hyperparameter tuning process:
plotHyperParsEffect(elasticTuningData, x = "s", y = "alpha",
                    z = "mse.test.mean", interpolate = "regr.kknn",
                    plot.type = "contour", show.experiments = TRUE) +
  scale_fill_gradientn(colours = terrain.colors(5)) +
  geom_point(x = tunedElasticPars$x$s, y = tunedElasticPars$x$alpha) +
  theme_bw()

plotHyperParsEffect(elasticTuningData, x = "s", y = "alpha",
                    z = "mse.test.mean", plot.type = "scatter") +
  theme_bw()
  4. Plot the model coefficients using horizontally dodged bars instead of facets:
ggplot(coefUntidy, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  theme_bw()
  5. Compare the performance of a LASSO model with a high lambda, and an OLS model with no predictors:
yieldOnly <- select(iowaTib, Yield)

yieldOnlyTask <- makeRegrTask(data = yieldOnly, target = "Yield")

lassoStrict <- makeLearner("regr.glmnet", lambda = 500)

loo <- makeResampleDesc("LOO")

resample("regr.lm", yieldOnlyTask, loo)

Resample Result
Task: yieldOnly
Learner: regr.lm
Aggr perf: mse.test.mean=179.3428
Runtime: 0.11691

resample(lassoStrict, iowaTask, loo)

Resample Result
Task: iowaTib
Learner: regr.glmnet
Aggr perf: mse.test.mean=179.3428
Runtime: 0.316366

# The MSE values are identical. This is because when lambda is high
# enough, all predictors will be removed from the model, just as if
# we trained a model with no predictors.
  6. Use the plotres() function to plot model diagnostics for glmnet models:
install.packages("plotmo")

library(plotmo)

plotres(ridgeModelData)

plotres(lassoModelData)

plotres(elasticModelData)

# The first plot shows the estimated slope for each parameter for
# different values of (log) lambda. Notice the different shape
# between ridge and LASSO.