This chapter covers
- Working with the logistic regression algorithm
- Understanding feature engineering
- Understanding missing value imputation
In this chapter, I’m going to add a new classification algorithm to your toolbox: logistic regression. Just like the k-nearest neighbors algorithm you learned about in the previous chapter, logistic regression is a supervised learning method that predicts class membership. Logistic regression relies on the equation of a straight line and produces models that are very easy to interpret and communicate.
Logistic regression can handle continuous (without discrete categories) and categorical (with discrete categories) predictor variables. In its simplest form, logistic regression is used to predict a binary outcome (cases can belong to one of two classes), but variants of the algorithm can handle multiple classes as well. Its name comes from the algorithm’s use of the logistic function, an equation that calculates the probability that a case belongs to one of the classes.
While logistic regression is most certainly a classification algorithm, it uses linear regression and the equation for a straight line to combine the information from multiple predictors. In this chapter, you’ll learn how the logistic function works and how the equation for a straight line is used to build a model.
Note
If you're already familiar with linear regression, a key distinction between linear and logistic regression is that the former learns the relationship between predictor variables and a continuous outcome variable, whereas the latter learns the relationship between predictor variables and a categorical outcome variable.
By the end of this chapter, you will have applied the skills you learned in chapters 2 and 3 to prepare your data and build, interpret, and evaluate the performance of a logistic regression model. You will also have learned what missing value imputation is: a method for filling in missing data with sensible values when working with algorithms that cannot handle missing values. You will apply a basic form of missing value imputation as a strategy to deal with missing data.
Imagine that you're the curator of fifteenth-century art at a museum. When works of art, allegedly by famous painters, come to the museum, it's your job to determine whether they are genuine or fake (a two-class classification problem). You have access to the chemical analysis performed on each painting, and you are aware that many forgeries of this period used paints with lower copper content than the original paintings. You can use logistic regression to learn a model that tells you the probability of a painting being an original based on the copper content of its paint. The model will then assign the painting to the class with the highest probability (see figure 4.1).
Figure 4.1. Logistic regression learns models that output the probability (p) of new data belonging to each of the classes. Typically, new data is assigned the class to which it has the highest probability of belonging. The dotted arrow indicates that there are additional steps in calculating the probabilities, which we’ll discuss in section 4.1.1.

Note
The algorithm is commonly applied to two-class classification problems (this is referred to as binomial logistic regression), but a variant called multinomial logistic regression handles classification problems where you have three or more classes.
Logistic regression is a very popular classification algorithm, especially in the medical community, partly because of how interpretable the model is. For every predictor variable in our model, we get an estimate of just how the value of that variable impacts the probability that a case belongs to one class over another.
We know that logistic regression learns models that estimate the probability of new cases belonging to each class. Let's delve into how the algorithm learns the model.
Take a look at the (imaginary) data in figure 4.2. I've plotted the copper content of a sample of paintings we know to be real or forgeries against their class, as if class were a continuous variable between 0 and 1. We can see that, on average, the forgeries contain less copper in their paint than the originals. We could model this relationship with a straight line, as shown in the figure. This approach works well when your predictor variable has a linear relationship with a continuous variable that you want to predict (we'll cover this in chapter 9); but as you can see, it doesn't do a good job of modeling the relationship between a continuous variable and a categorical one.
Figure 4.2. Plotting copper content against class. The y-axis displays the categorical class membership as if it were a continuous variable, with forgeries and originals taking the values of 0 and 1, respectively. The solid line represents a poor attempt to model a linear relationship between copper content and class. The dashed line at y = 0.5 indicates the threshold of classification.

As shown in the figure, we could find the copper content at which the straight line passes halfway between 0 and 1, and classify paintings with copper content below this value as forgeries and paintings above the value as originals. This might result in many misclassifications, so a better approach is needed.
We can better model the relationship between copper content and class membership using the logistic function, which is shown in figure 4.3. The logistic function is an S-shaped curve that maps a continuous variable (copper content, in our case) onto values between 0 and 1. This does a much better job of representing the relationship between copper content and whether a painting is an original or a forgery. The figure shows a logistic function fit to the same data as in figure 4.2. We could find the copper content at which the logistic function passes halfway between 0 and 1, and classify paintings with copper content below this value as forgeries and paintings above the value as originals. This typically results in fewer misclassifications than when we do this using a straight line.
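To make that mapping concrete, here is a minimal R sketch (not code from this chapter) of the logistic function applied to some made-up copper values; the intercept and slope are arbitrary assumptions chosen only to center the curve somewhere sensible.
# The logistic function maps any real number onto the interval (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

copper <- seq(0, 25, by = 0.5)         # made-up copper content values

# Assumed intercept (-13) and slope (1), for illustration only
pOriginal <- logistic(-13 + 1 * copper)

plot(copper, pOriginal, type = "l")    # draws the S-shaped curve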
Figure 4.3. Modeling the data with the logistic function. The S-shaped curve represents the logistic function fitted to the data. The center of the curve passes through the mean of copper content and maps it between 0 and 1.

Importantly, because the logistic function maps our x variable onto values between 0 and 1, we can interpret its output as the probability of a case with a particular copper content being an original painting. Take another look at figure 4.3. Can you see that as copper content increases, the logistic function approaches 1? This represents the fact that, on average, original paintings have a higher copper content, so if you pick a painting at random and find that it has a copper content of 20, it has a ~0.99 or 99% probability of being an original.
Note
If I had coded the grouping variable the other way around (with forgeries being 1 and originals being 0), then the logistic function would approach 1 for low values of copper and approach 0 for high values. We would simply interpret the output as the probability of being a forgery, instead.
The opposite is also true: as copper content decreases, the logistic function approaches 0. This represents the fact that, on average, forgeries have lower copper content, so if you pick a painting at random and find it has a copper content of 7, it has a ~0.99 or 99% probability of being a forgery.
Great! We can estimate the probability of a painting being an original by using the logistic function. But what if we have more than one predictor variable? Because probabilities are bounded between 0 and 1, it's difficult to combine the information from two predictors. For example, say the logistic function estimates that a painting has a 0.6 probability of being an original for one predictor variable, and a 0.7 probability for the other predictor. We can't simply add these estimates together, because the result would be larger than 1, and this wouldn't make sense.
Instead, we can take these probabilities and convert them into their log odds (the "raw" output from logistic regression models). To introduce log odds, let me first explain what I mean by odds, and the difference between odds and probability.
The odds of a painting being an original are the probability of it being an original divided by the probability of it being a forgery: odds = p(original) / p(forgery). You may come across this written as odds = p / (1 - p), where p is the probability of the event of interest (here, being an original).
Odds are a convenient way of representing the likelihood of something occurring. They tell us how much more likely an event is to occur than not to occur.
In The Empire Strikes Back, C-3PO says that the odds of "successfully navigating an asteroid field are approximately 3,720 to 1!" What C-3PO was trying to tell Han and Leia was that the probability of successfully navigating an asteroid field is approximately 3,720 times smaller than the probability of unsuccessfully navigating it. Simply stating the odds is often a more convenient way of representing likelihood, because we know that for every 1 asteroid field that was successfully navigated, 3,720 were not! Additionally, whereas probability is bounded between 0 and 1, odds can take any positive value.
Note
Despite being a highly intelligent protocol droid, C-3PO got his odds the wrong way around (as many people do). He should have said the odds of successfully navigating an asteroid field are approximately 1 to 3,720!
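As a quick illustration (a sketch using C-3PO's numbers, not code from this chapter), converting between probability and odds in R is just arithmetic:
p    <- 1 / 3721             # probability of successfully navigating the field
odds <- p / (1 - p)          # odds = p / (1 - p) = 1/3720, i.e. 1 to 3,720

pBack <- odds / (1 + odds)   # converts the odds back into a probability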
Figure 4.4 shows copper content plotted against the odds of a painting being an original. Notice that the odds are not bounded between 0 and 1, and that they take on positive values.
As we can see, though, the relationship between the copper content of the paint and the odds of a painting being an original is not linear. Instead, if we take the natural logarithm (log with a base of e, abbreviated as ln) of the odds, we get the log odds:
- log odds = ln(odds) = ln(p / (1 - p))
Tip
Equation 4.3, which converts probabilities into log odds, is also called the logit function. You will often see logit regression and logistic regression used interchangeably.
Figure 4.4. Plotting the odds of being an original against copper content. The probabilities derived from the logistic function were converted into odds and plotted against copper content. Odds can take any positive value. The straight line represents a poor attempt to model a linear relationship between copper content and odds.

I've taken the natural logarithm of the odds shown in figure 4.4 to generate their log odds, and plotted these log odds against copper content in figure 4.5. Hurray! We have a linear relationship between our predictor variable and the log odds of a painting being an original. Also notice that log odds are completely unbounded: they can extend to positive and negative infinity. When interpreting log odds:
- A positive value means something is more likely to occur than not to occur.
- A negative value means something is less likely to occur than not to occur.
- Log odds of 0 means something is as likely to occur as not to occur.
Figure 4.5. Plotting the log odds of being an original against copper content. The odds were converted into log odds using the logit function and plotted against copper content. Log odds are unbounded and can take any value. The straight line represents the linear relationship between copper content and log odds.

When discussing figure 4.4, I highlighted that the relationship between copper content and the odds of being an original painting was not linear. Then I showed you in figure 4.5 that the relationship between copper content and the log odds was linear. In fact, linearizing this relationship is why we take the natural logarithm of the odds. Why did I make such a big deal about there being a linear relationship between our predictor variable and the log odds? Well, modeling a straight line is easy. Recall from chapter 1 that all an algorithm needs to learn to model a straight-line relationship is the y-intercept and the slope of the line. So logistic regression learns the log odds of a painting being an original when copper content is 0 (the y-intercept), and how the log odds change with increasing copper content (the slope).
Note
The more influence a predictor variable has on the log odds, the steeper its slope will be, while variables that have no predictive value will have a slope that is nearly horizontal.
Additionally, having a linear relationship means that when we have multiple predictor variables, we can add their contributions to the log odds together to get the overall log odds of a painting being an original, based on the information from all of its predictors.
Now, how do we get from the straight-line relationship between copper content and the log odds of being an original, to making predictions about new paintings? The model calculates the log odds of our new data being an original painting using
- log odds = y-intercept + slope * copper
where we add the y-intercept and the product of the slope and the value of copper in our new painting. Once we've calculated the log odds of the new painting, we convert it into the probability of being an original using the logistic function:
- p = 1 / (1 + e^(-z))
where p is the probability, e is Euler's number (a fixed constant ~ 2.718), and z is the log odds of a particular case.
Then, quite simply, if the probability of a painting being an original is > 0.5, it is classified as an original. If the probability is < 0.5, it is classified as a forgery. This conversion of log odds to odds to probabilities is illustrated in figure 4.6.
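Here is a small R sketch of that conversion chain (the log odds value is made up for illustration, not learned from any real data):
logOdds <- 2.2                 # assumed log odds for a new painting
odds    <- exp(logOdds)        # log odds -> odds
p       <- odds / (1 + odds)   # odds -> probability; same as 1 / (1 + exp(-logOdds))

ifelse(p > 0.5, "original", "forgery")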
Note
This threshold probability is 0.5 by default. In other words, if there is more than a 50% chance that a case belongs to the positive class, assign it to the positive class. We can alter this threshold, however, in situations where we need to be really sure before classifying a case as belonging to the positive class. For example, if we're using the model to predict whether a patient needs high-risk surgery, we want to be really sure before going ahead with the procedure!
Figure 4.6. Summary of how logistic regression models predict class membership. Data is converted into log odds (logits), which are converted into odds and then into the probability of belonging to the “positive” class. Cases are assigned to the positive class if their probability exceeds a threshold probability (0.5 by default).

You will often see the model
- log odds = y-intercept + slope * copper
rewritten as in equation 4.5.
- ln(p / (1 - p)) = β0 + βcopper * xcopper
Don't be scared by this! Look at equation 4.5 again. This is the way statisticians represent models that predict straight lines, and it is exactly the same as the equation describing log odds. The logistic regression model predicts the log odds (on the left of the equals sign) by adding the y-intercept (β0) and the slope of the line (βcopper) multiplied by the value of copper (xcopper).
You may be wondering: why are you showing me equations when you promised me you wouldn't? Well, in most situations, we won't have a single predictor; we'll have many. By representing the model in this way, you can see how it can be used to combine multiple predictors together linearly: in other words, by adding their effects together.
Let's say we also include the amount of the metal lead as a predictor for whether a painting is an original or not. The model will instead look like this:
- ln(p / (1 - p)) = β0 + βcopper * xcopper + βlead * xlead
Bn xelamep lv wsrb jgar mode f tigmh eefk fojo jz ohsnw nj figure 4.7. Mrgj rew predictor variables, wo nca pteesrenr rxb mode f as s nelap, rpwj roy log odds swonh nk vrb iveactrl jzzv. Akb zvcm inecrpipl spaiple ktl tvxm nbrs wxr predictors, rgd rj’a cdultffii re elvziiaus nv z 2Q csaruef.
Kwx, tle cng niatgnpi ow shzz jnrk tvg mode f, kpr mode f apke brx iglofnolw:
- Wupiltelsi jrz epcrop toetcnn bu vrp seopl lxt oprpce
- Wieilspltu gxr fqcx otncetn gp rxq elspo tvl qfck
- Xqcb etshe xrw evsaul unz rgo b-eincprtte teothgre rx xpr rdk log odds lv urrs ianipgnt egnbi ns oagilrin
- Bosertnv krd log odds jknr c ipbaoriblyt
- Tiselissaf xur tipagnin cs ns lainogri jl xpr iplartbobiy jc > 0.5, tv cslfsiisae qvr itnignpa cc c grofrye jl rpv ylptiiborba jc < 0.5
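The following sketch walks through those five steps by hand; the intercept, slopes, and the new painting's measurements are all made-up values for illustration.
b0      <- -10     # assumed y-intercept
bCopper <-  0.8    # assumed slope for copper
bLead   <- -0.5    # assumed slope for lead

newPainting <- c(copper = 16, lead = 4)   # made-up measurements

logOdds <- b0 + bCopper * newPainting["copper"] + bLead * newPainting["lead"]
p       <- 1 / (1 + exp(-logOdds))        # logistic function

ifelse(p > 0.5, "original", "forgery")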
Figure 4.7. Visualizing a logistic regression model with two predictors. Copper content and lead content are plotted on the x- and z-axes, respectively. Log odds are plotted on the y-axis. The plane shown inside the plot represents the linear model that combines the intercept and the slopes of copper content and lead content to predict log odds.

We can extend the model to include as many predictor variables as we want:
- ln(p / (1 - p)) = β0 + β1x1 + β2x2 + ... + βkxk
where k is the number of predictor variables in the data set and the ... represents all the variables in between.
Tip
Remember in chapter 3, when I explained the difference between parameters and hyperparameters? Well, β0, β1, and so on are model parameters, because they are learned by the algorithm from the data.
The whole procedure for classifying new paintings is summarized in figure 4.8. First, we convert the copper and lead values of our new data into their log odds (logits) by using the linear model learned by the algorithm. Next, we convert the log odds into their probabilities using the logistic function. Finally, if the probability is > 0.5, we classify the painting as an original; and if the probability is < 0.5, we classify it as a forgery.
Figure 4.8. The process of classifying new paintings. The predictor variable values of three paintings are converted into log odds based on the learned model parameters (intercept and slopes). The log odds are converted into probabilities (p), and if p > 0.5, the case is classified as the “positive” class.

Note
Although the first and third paintings in figure 4.8 were both classified as forgeries, they had very different probabilities. Because the probability of the third painting is much smaller than the probability of the first, we can be more confident that painting 3 is a forgery than we are that painting 1 is a forgery.
The previous scenario is an example of binomial logistic regression. In other words, the decision about which class to assign to new data can take on only one of two named categories (bi and nomos come from Latin and Greek, respectively). But we can use a variant of logistic regression to predict one of multiple classes. This is called multinomial logistic regression, because there are now multiple possible categories to choose from.
In multinomial logistic regression, instead of estimating a single logit for each case, the model estimates a logit for each case for each of the output classes. These logits are then passed into an equation called the softmax function, which turns them into probabilities for each class that sum to 1 (see figure 4.9). Then whichever class has the largest probability is selected as the output class.
Figure 4.9. Summary of the softmax function. In the binomial case, only one logit is needed per case (the logit for the positive class). Where there are multiple classes (a, b, and c in this example), the model estimates one logit per class for each case. The softmax function maps these logits to probabilities that sum to one. The case is assigned to the class with the largest probability.

Tip
The classif.logreg learner wrapped by mlr will only handle binomial logistic regression. There isn't currently an implementation of ordinary multinomial logistic regression wrapped by mlr. We can, however, use the classif.LiblineaRL1LogReg learner to perform multinomial logistic regression (although it has some differences I won't discuss).
The softmax function
It isn't necessary for you to memorize the softmax function, so feel free to skip this, but the softmax function is defined as
pa = e^logita / (e^logita + e^logitb + e^logitc)
where pa is the probability of a case belonging to class a, e is Euler's number (a fixed constant ~ 2.718), and logita, logitb, and logitc are the logits for this case for being in classes a, b, and c, respectively.
If you're a math fan, this can be generalized to any number of classes using the equation
pi = e^logiti / Σ(k = 1 to K) e^logitk
where pi is the probability of being in class i, and Σ(k = 1 to K) means we sum the exponentiated logits from class 1 to class K (where there are K classes in total).
Write your own implementation of the softmax function in R, and try plugging other vectors of numbers into it. You'll find that it always maps the input to an output where all the elements sum to 1.
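Here is one possible implementation (a sketch; the function and variable names are my own):
# Softmax: maps a vector of logits onto probabilities that sum to 1
softmax <- function(logits) {
  exp(logits) / sum(exp(logits))
}

softmax(c(2.1, -0.4, 0.8))        # returns roughly 0.74, 0.06, 0.20
sum(softmax(c(2.1, -0.4, 0.8)))   # always 1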
Now that you know how logistic regression works, you're going to build your first binomial logistic regression model.
Imagine that you're a historian interested in the RMS Titanic, which famously sank in 1912 after colliding with an iceberg. You want to know whether socioeconomic factors influenced a person's probability of surviving the disaster. Luckily, such socioeconomic data is publicly available!
Your aim is to build a binomial logistic regression model to predict whether a passenger would survive the Titanic disaster, based on data such as their gender and how much they paid for their ticket. You're also going to interpret the model to decide which variables were important in influencing the probability of a passenger surviving. Let's start by loading the mlr and tidyverse packages:
library(mlr)

library(tidyverse)
Now let's load the data, which is built into the titanic package, convert it into a tibble (with as_tibble()), and explore it a little. We have a tibble containing 891 cases and 12 variables of passengers of the Titanic. Our goal is to train a model that can use the information in these variables to predict whether a passenger would survive the disaster.
Listing 4.1. Loading and exploring the Titanic dataset
install.packages("titanic")

data(titanic_train, package = "titanic")

titanicTib <- as_tibble(titanic_train)

titanicTib

# A tibble: 891 x 12
   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket
         <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>
 1           1        0      3 Brau… male     22     1     0 A/5 2…
 2           2        1      1 Cumi… fema…    38     1     0 PC 17…
 3           3        1      3 Heik… fema…    26     0     0 STON/…
 4           4        1      1 Futr… fema…    35     1     0 113803
 5           5        0      3 Alle… male     35     0     0 373450
 6           6        0      3 Mora… male     NA     0     0 330877
 7           7        0      1 McCa… male     54     0     0 17463
 8           8        0      3 Pals… male      2     3     1 349909
 9           9        1      3 John… fema…    27     0     2 347742
10          10        1      2 Nass… fema…    14     1     0 237736
# … with 881 more rows, and 3 more variables: Fare <dbl>,
#   Cabin <chr>, Embarked <chr>
The tibble contains the following variables:
- PassengerId—An arbitrary number unique to each passenger
- Survived—An integer denoting survival (1 = survived, 0 = died)
- Pclass—Whether the passenger was housed in first, second, or third class
- Name—A character vector of the passengers' names
- Sex—A character vector containing "male" and "female"
- Age—The age of the passenger
- SibSp—The combined number of siblings and spouses on board
- Parch—The combined number of parents and children on board
- Ticket—A character vector with each passenger's ticket number
- Fare—The amount of money each passenger paid for their ticket
- Cabin—A character vector of each passenger's cabin number
- Embarked—A character vector of which port passengers embarked from
The first thing we're going to do is use tidyverse tools to clean and prepare the data for modeling.
Rarely will you be working with a data set that is ready for modeling straight away. Typically, we need to perform some cleaning first to ensure that we get the most from the data. This includes steps such as converting data to the correct types, correcting mistakes, and removing irrelevant data. The titanicTib tibble is no exception; we need to clean it up before we can pass it to the logistic regression algorithm. We'll perform three tasks:
- Convert the Survived, Sex, and Pclass variables into factors.
- Create a new variable called FamSize by adding SibSp and Parch together.
- Select the variables we believe to be of predictive value for our model.
If a variable should be a factor, it's important to let R know it's a factor, so that R treats it appropriately. We can see from the output of titanicTib in listing 4.1 that Survived and Pclass are both integer vectors (<int> is shown above their columns in the output) and that Sex is a character vector (<chr> above the column). Each of these variables should be treated as a factor because it represents discrete differences between cases that are repeated throughout the data set.
We might hypothesize that the number of family members a passenger had on board might impact their survival. For example, people with many family members may be reluctant to board a lifeboat that doesn't have enough room for their whole family. While the SibSp and Parch variables contain this information separated into siblings and spouses, and parents and children, respectively, it may be more informative to combine these into a single variable containing overall family size.
This is an extremely important machine learning task called feature engineering: the modification of variables in your data set to improve their predictive value. Feature engineering comes in two flavors:
- Feature extraction— Predictive information is held in a variable, but in a format that is not useful. For example, let's say you have a variable that contains the year, month, day, and time of day of certain events occurring. The time of day has important predictive value, but the year, month, and day do not. For this variable to be useful in your model, you would need to extract only the time-of-day information as a new variable.
- Feature creation— Existing variables are combined to create new ones. Merging the SibSp and Parch variables to create FamSize is an example.
Using feature extraction and feature creation allows us to extract predictive information present in our data set but not currently in a format that maximizes its usefulness.
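For instance, here is a minimal sketch of that kind of feature extraction (the eventTimes tibble and its timestamp column are hypothetical, and I assume the tidyverse is already loaded):
# Hypothetical data: full date-times, where only the time of day matters
eventTimes <- tibble(
  timestamp = as.POSIXct(c("2019-03-01 08:15:00", "2019-03-02 23:40:00"))
)

# Extract just the hour of day as a new, more useful variable
eventTimes <- mutate(eventTimes, hourOfDay = as.numeric(format(timestamp, "%H")))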
Finally, we will often have variables in our data that have no predictive value. For example, does knowing the passenger's name or cabin number help us predict survival? Possibly not, so let's remove them. Including variables with little or no predictive value adds noise to the data and will negatively impact how our models perform, so it's best to remove them.
This is another extremely important machine learning task called feature selection, and it is pretty much what it sounds like: keeping variables that have predictive value, and removing those that don't. Sometimes it's obvious to us as humans whether variables are useful predictors or not. Passenger name, for example, would not be useful because every passenger has a different name! In these situations, it's common sense to remove such variables. Often, however, it's not so obvious, and there are more sophisticated ways we can automate the feature-selection process. We'll explore this in later chapters.
All three of these tasks (converting to factors, feature engineering, and feature selection) are performed in listing 4.2. I've made our lives easier by defining a vector of the variables we wish to convert into factors, and then using the mutate_at() function to turn them all into factors. The mutate_at() function is like the mutate() function, but it allows us to mutate multiple columns at once. We supply the existing variables as a character vector to the .vars argument and tell it what we want to do to those variables using the .funs argument. In this case, we supply the vector of variables we defined, and the "factor" function to convert them into factors. We pipe the result of this into a mutate() function call that defines a new variable, FamSize, which is the sum of SibSp and Parch. Finally, we pipe the result of this into a select() function call, to select only the variables we believe may have some predictive value for our model.
Listing 4.2. Cleaning the Titanic data, ready for modeling
fctrs <- c("Survived", "Sex", "Pclass")

titanicClean <- titanicTib %>%
  mutate_at(.vars = fctrs, .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamSize)

titanicClean

# A tibble: 891 x 6
   Survived Pclass Sex      Age  Fare FamSize
   <fct>    <fct>  <fct>  <dbl> <dbl>   <int>
 1 0        3      male      22  7.25       1
 2 1        1      female    38 71.3        1
 3 1        3      female    26  7.92       0
 4 1        1      female    35 53.1        1
 5 0        3      male      35  8.05       0
 6 0        3      male      NA  8.46       0
 7 0        1      male      54 51.9        0
 8 0        3      male       2 21.1        4
 9 1        3      female    27 11.1        2
10 1        2      female    14 30.1        1
# … with 881 more rows
When we print our new tibble, we can see that Survived, Pclass, and Sex are now factors (<fct> is shown above their columns in the output); we have our new variable, FamSize; and we have removed the irrelevant variables.
Note
Have I been too hasty in removing the Name variable from the tibble? Hidden in this variable are the salutations for each passenger (Miss, Mrs., Mr., Master, and so on), which may have predictive value. Using this information would require feature extraction.
Now that we've cleaned our data a little, let's plot it to get better insight into the relationships in the data. Here's a little trick to simplify plotting multiple variables together using ggplot2. Let's convert the data into an untidy format, such that each of the predictor variable names is held in one column and its values are held in another column, using the gather() function (refresh your memory of this by looking at the end of chapter 2).
Note
The gather() function will warn that "attributes are not identical across measure variables; they will be dropped." This is simply warning you that the variables you are gathering together don't have the same factor levels. Ordinarily this might mean you've collapsed variables you didn't mean to, but in this case we can safely ignore the warning.
Listing 4.3. Creating an untidy tibble for plotting
titanicUntidy <- gather(titanicClean, key = "Variable", value = "Value",
                        -Survived)

titanicUntidy

# A tibble: 4,455 x 3
   Survived Variable Value
   <fct>    <chr>    <chr>
 1 0        Pclass   3
 2 1        Pclass   1
 3 1        Pclass   3
 4 1        Pclass   1
 5 0        Pclass   3
 6 0        Pclass   3
 7 0        Pclass   1
 8 0        Pclass   3
 9 1        Pclass   3
10 1        Pclass   2
# … with 4,445 more rows
We now have an untidy tibble with three columns: one containing the Survived factor, one containing the names of the predictor variables, and one containing their values.
Note
You may be wondering why we're doing this. Well, it allows us to use ggplot2's faceting system to plot our different variables together. In listing 4.4, I take the titanicUntidy tibble, filter for the rows that do not contain the Pclass or Sex variables (as these are factors, we'll plot them separately), and pipe this data into a ggplot() call.
Listing 4.4. Creating subplots for each continuous variable
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme_bw()
In the ggplot() function call, we supply Survived as the x aesthetic and Value as the y aesthetic (coercing it into a numeric vector with as.numeric() because it was converted into a character by our gather() function call earlier). Next (and here's the cool bit), we ask ggplot2 to facet by the Variable column, using the facet_wrap() function, and allow the y-axis to vary between the facets. Faceting allows us to draw subplots of our data, indexed by some faceting variable. Finally, we add a violin geometric object, which is similar to a box plot but also shows the density of the data along the y-axis. The resulting plot is shown in figure 4.10.
Figure 4.10. Faceted plot of Survived against FamSize and Fare. Violin plots show the density of data along the y-axis. The lines on each violin represent the first quartile, median, and third quartile (from lowest to highest).

Can you see how the faceting worked? Rows in the data with different values of Variable are plotted on different subplots! This is why we needed to gather the data into an untidy format: so we could supply a single variable for ggplot2 to facet by.
Exercise 1
Redraw the plot in figure 4.10, but add a geom_point() layer, setting the alpha argument to 0.05 and the size argument to 3. Does this make the violin plots make more sense?
Now let's do the same thing for the factors in our data set by filtering the data for rows that contain only the Pclass and Sex variables. This time, we want to see what proportion of passengers in each level of the factors survived. To do so, we plot the factor levels on the x-axis by supplying Value as the x aesthetic mapping; and we want to use different colors to denote survival versus non-survival, so we supply Survived as the fill aesthetic. We facet by Variable as before and add a bar geometric object with the argument position = "fill", which stacks the data for survivors and non-survivors such that they sum to 1, to show us the proportion of each. The resulting plot is shown in figure 4.11.
Listing 4.5. Creating subplots for each categorical variable
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "fill") +
  theme_bw()
Figure 4.11. Faceted plot of Survived against Pclass and Sex. Filled bars represent the proportion of passengers at each level of the factors that survived (1 = survival).

Note
In the filter() function calls in listings 4.4 and 4.5, I use the & and | operators to mean "and" and "or," respectively.
So it seems like passengers who survived tended to have slightly more family members on board (perhaps contradicting our hypothesis), although passengers with very large families on board tended not to survive. Age doesn't seem to have had an obvious impact on survival, but being female meant you would be much more likely to survive. Paying more for your fare increased your probability of survival, as did being in a higher class (though the two probably correlate).
Exercise 2
Redraw the plot in figure 4.11, but change the geom_bar() position argument to "dodge". Do this again, but make the position argument equal to "stack". Can you see the difference between the three methods?
Now that we have our cleaned data, let's create a task, learner, and model with mlr (specifying "classif.logreg" to use logistic regression as our learner). By setting the argument predict.type = "prob", the trained model will output the estimated probabilities of each class when making predictions on new data, rather than just the predicted class memberships.
Listing 4.6. Creating a task and learner, and training a model
titanicTask <- makeClassifTask(data = titanicClean, target = "Survived")

logReg <- makeLearner("classif.logreg", predict.type = "prob")

logRegModel <- train(logReg, titanicTask)

Error in checkLearnerBeforeTrain(task, learner, weights) :
  Task 'titanicClean' has missing values in 'Age', but learner
  'classif.logreg' does not support that!
Whoops! Something went wrong. What does the error message say? Hmm, it seems we have some missing data in the Age variable, and the logistic regression algorithm doesn't know how to handle that. Let's have a look at this variable. (I'm only displaying the first 60 elements to save room, but you can print the entire vector.)
Listing 4.7. Counting missing values in the Age variable
titanicClean$Age[1:60]

 [1] 22.0 38.0 26.0 35.0 35.0   NA 54.0  2.0 27.0 14.0  4.0 58.0 20.0
[14] 39.0 14.0 55.0  2.0   NA 31.0   NA 35.0 34.0 15.0 28.0  8.0 38.0
[27]   NA 19.0   NA   NA 40.0   NA   NA 66.0 28.0 42.0   NA 21.0 18.0
[40] 14.0 40.0 27.0   NA  3.0 19.0   NA   NA   NA   NA 18.0  7.0 21.0
[53] 49.0 29.0 65.0   NA 21.0 28.5  5.0 11.0

sum(is.na(titanicClean$Age))

[1] 177
There are two ways to handle missing data:
- Simply exclude cases with missing data from the analysis
- Apply an imputation mechanism to fill in the gaps
The first option may be valid when the ratio of cases with missing values to complete cases is very small. In that case, omitting cases with missing data is unlikely to have a large impact on the performance of our model. It is a simple, if not elegant, solution to the problem.
The second option, missing value imputation, is the process by which we use some algorithm to estimate what those missing values would have been, replace the NAs with these estimates, and use this imputed data set to train our model. There are many different ways of estimating the values of missing data, and we'll use more sophisticated ones throughout the book, but for now, we'll employ mean imputation, where we simply take the mean of the variable with missing data and replace the missing values with that.
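Conceptually, mean imputation is nothing more than the following base R sketch (mlr's impute() function, used shortly, handles this bookkeeping for us):
ageImputed <- titanicClean$Age

# Replace every NA with the mean of the non-missing ages
ageImputed[is.na(ageImputed)] <- mean(ageImputed, na.rm = TRUE)

sum(is.na(ageImputed))   # 0 missing values remain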
In listing 4.8, I use mlr's impute() function to replace the missing data. The first argument is the name of the data, and the cols argument asks us which columns we want to impute and what method we want to apply. We supply the cols argument as a list of the column names, separated by commas if we have more than one. Each column listed should be followed by an = sign and the imputation method (imputeMean() uses the mean of the variable to replace NAs). I save the imputed data structure as an object, imp, and use sum(is.na()) to count the number of missing values in the data.
Listing 4.8. Imputing missing values in the Age variable
imp <- impute(titanicClean, cols = list(Age = imputeMean()))

sum(is.na(titanicClean$Age))

[1] 177

sum(is.na(imp$data$Age))

[1] 0
We can see that those 177 missing values have all been imputed!
Okay, we've imputed those pesky missing values with the mean and created the new object imp. Now let's try again by creating a task using the imputed data. The imp object contains both the imputed data and a description of the imputation process we used. To extract the data, we simply use imp$data.
Listing 4.9. Training a model on imputed data
titanicTask <- makeClassifTask(data = imp$data, target = "Survived")

logRegModel <- train(logReg, titanicTask)
This time, no error messages. Next, let's cross-validate our model to estimate how it will perform.
Remember that when we cross-validate, we should cross-validate our entire model-building procedure. This should include any data-dependent preprocessing steps, such as missing value imputation. In chapter 3, we used a wrapper function to wrap together our learner and our hyperparameter tuning procedure. This time, we're going to create a wrapper for our learner and our missing value imputation.
The makeImputeWrapper() function wraps together a learner (given as the first argument) and an imputation method. Notice how we specify the imputation method in exactly the same way as for the impute() function in listing 4.8, by supplying a list of columns and their imputation methods.
Listing 4.10. Wrapping together the learner and the imputation method
logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))
Now let's apply stratified, 10-fold cross-validation, repeated 50 times, to our wrapped learner.
Note
Remember that we first define our resampling method using makeResampleDesc() and then use resample() to run the cross-validation.
Because we're supplying our wrapped learner to the resample() function, for each fold of the cross-validation, the mean of the Age variable in the training set will be used to impute any missing values.
Listing 4.11. Cross-validating our model-building process
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

logRegwithImpute <- resample(logRegWrapper, titanicTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))

logRegwithImpute

Resample Result
Task: imp$data
Learner: classif.logreg.imputed
Aggr perf: acc.test.mean=0.7961500,fpr.test.mean=0.2992605,fnr.test.mean=0.1444175
Runtime: 10.6986
As this is a two-class classification problem, we have access to a few extra performance metrics, such as the false positive rate (fpr) and false negative rate (fnr). In the cross-validation procedure in listing 4.11, we ask for accuracy, false positive rate, and false negative rate to be reported as performance metrics. We can see that although, on average across the repeats, our model correctly classified 79.6% of passengers, it incorrectly classified 29.9% of passengers who died as having survived (false positives), and incorrectly classified 14.4% of passengers who survived as having died (false negatives).
You might think that the accuracy of a model's predictions is the defining metric of its performance. Often, this is the case, but sometimes, it's not.
Imagine that you work for a bank as a data scientist in the fraud-detection department. It's your job to build a model that predicts whether credit card transactions are legitimate or fraudulent. Let's say that out of 100,000 credit card transactions, only 1 is fraudulent. Because fraud is relatively rare (and because they're serving pizza for lunch today), you decide to build a model that simply classifies all transactions as legitimate.
The model accuracy is 99.999%. Pretty good? Of course not! The model isn't able to identify any fraudulent transactions and has a false negative rate of 100%!
The lesson here is that you should evaluate model performance in the context of your particular problem. Another example could be building a model that will guide doctors to use an unpleasant treatment, or not, for a patient. In the context of this problem, it may be acceptable to incorrectly not give a patient the unpleasant treatment, but it is imperative that you don't incorrectly give a patient the treatment if they don't need it!
If positive events are rare (as in our fraudulent credit card example), or if it is particularly important that you don't misclassify positive cases as negative, you should favor models that have a low false negative rate. If negative events are rare, or if it is particularly important that you don't misclassify negative cases as positive (as in our medical treatment example), you should favor models that have a low false positive rate.
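If it helps to see how these metrics relate to one another, here is a small sketch computing them from made-up counts of true/false positives and negatives (these numbers are invented, not taken from our Titanic model):
tp <- 80; fp <- 20; tn <- 70; fn <- 30   # made-up confusion matrix counts

acc <- (tp + tn) / (tp + tn + fp + fn)   # accuracy = 0.75
fpr <- fp / (fp + tn)                    # false positive rate ~ 0.22
fnr <- fn / (fn + tp)                    # false negative rate ~ 0.27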
Take a look at https://mlr.mlr-org.com/articles/tutorial/measures.html to see all the performance measures currently wrapped by mlr and the situations in which they can be used.
I mentioned at the start of the chapter that logistic regression is very popular because of how interpretable the model parameters (the y-intercept and the slopes for each of the predictors) are. To extract the model parameters, we must first turn our mlr model object, logRegModel, into an R model object using the getLearnerModel() function. Next, we pass this R model object as the argument to the coef() function, which stands for coefficients (another term for parameters), so this function returns the model parameters.
Listing 4.12. Extracting model parameters
logRegModelData <- getLearnerModel(logRegModel)

coef(logRegModelData)
 (Intercept)      Pclass2      Pclass3      Sexmale          Age
 3.809661697 -1.000344806 -2.132428850 -2.775928255 -0.038822458
        Fare      FamSize
 0.003218432 -0.243029114
The intercept is the log odds of surviving the Titanic disaster when all continuous variables are 0 and the factors are at their reference levels. We tend to be more interested in the slopes than the y-intercept, but these values are in log odds units, which are difficult to interpret. Instead, people commonly convert them into odds ratios.
An odds ratio is, well, a ratio of odds. For example, if the odds of surviving the Titanic if you're female are about 7 to 10, and the odds of surviving if you're male are 2 to 10, then the odds ratio for surviving if you're female is 3.5. In other words, if you were female, you would have been 3.5 times more likely to survive than if you were male. Odds ratios are a very popular way of interpreting the impact of predictors on an outcome, because they are easily understood.
How do we get from log odds to odds ratios? By taking their exponent (e^log odds). We can also calculate 95% confidence intervals using the confint() function, to help us decide how strong the evidence is that each variable has predictive value.
Listing 4.13. Converting model parameters into odds ratios
exp(cbind(Odds_Ratio = coef(logRegModelData), confint(logRegModelData)))

Waiting for profiling to be done…
             Odds_Ratio       2.5 %       97.5 %
(Intercept) 45.13516691 19.14718874 109.72483921
Pclass2      0.36775262  0.20650392   0.65220841
Pclass3      0.11854901  0.06700311   0.20885220
Sexmale      0.06229163  0.04182164   0.09116657
Age          0.96192148  0.94700049   0.97652950
Fare         1.00322362  0.99872001   1.00863263
FamSize      0.78424868  0.68315465   0.89110044
Most of these odds ratios are less than 1. An odds ratio less than 1 means an event is less likely to occur. It's usually easier to interpret these if you divide 1 by them. For example, the odds ratio for surviving if you were male is 0.06, and 1 divided by 0.06 = 16.7. This means that, holding all other variables constant, men were 16.7 times less likely to survive than women.
For continuous variables, we interpret the odds ratio as how much more likely a passenger is to survive for every one-unit increase in the variable. For example, for every additional family member, a passenger was 1/0.78 = 1.28 times less likely to survive.
For factors, we interpret the odds ratio as how much more likely a passenger is to survive, compared to the reference level for that variable. For example, we have odds ratios for Pclass2 and Pclass3, which tell us how many times more likely passengers in classes 2 and 3 were to survive compared to those in class 1, respectively.
The 95% confidence intervals indicate the strength of the evidence that each variable has predictive value. An odds ratio of 1 means the odds are equal and the variable has no impact on prediction. Therefore, if a 95% confidence interval includes the value 1, such as the one for the Fare variable, this may suggest that the variable isn't contributing anything.
A one-unit increase often isn't easily interpretable. Say you get an odds ratio that says for every additional ant in an anthill, that anthill is 1.000005 times more likely to survive a termite attack. How can you comprehend the importance of such a small odds ratio?
When it doesn't make sense to think in one-unit increases, a popular technique is to log2 transform the continuous variables instead, before training the model with them. This won't impact the predictions made by the model, but now the odds ratio can be interpreted this way: every time the number of ants doubles, the anthill is x times more likely to survive. This will give much larger and much more interpretable odds ratios.
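As a sketch of what that transformation looks like in practice (using Fare purely as an illustration; the + 1 is an assumption to avoid taking log2 of the zero fares in the data):
# log2-transform a continuous predictor before training, so its odds ratio
# can be read as "per doubling" rather than "per one-unit increase"
titanicLog2 <- mutate(titanicClean, Fare = log2(Fare + 1))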
We've built, cross-validated, and interpreted our model, and now it would be nice to use the model to make predictions on new data. This scenario is a little unusual in that we've built a model based on a historical event, so (hopefully!) we won't be using it to predict survival of another Titanic disaster. Nevertheless, I want to illustrate how to make predictions with a logistic regression model, the same as you can for any other supervised algorithm. Let's load some unlabeled passenger data, clean it ready for prediction, and pass it through our model.
Listing 4.14. Using our model to make predictions on new data
data(titanic_test, package = "titanic")

titanicNew <- as_tibble(titanic_test)

titanicNewClean <- titanicNew %>%
  mutate_at(.vars = c("Sex", "Pclass"), .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Pclass, Sex, Age, Fare, FamSize)

predict(logRegModel, newdata = titanicNewClean)

Prediction: 418 observations
predict.type: prob
threshold: 0=0.50,1=0.50
time: 0.00
     prob.0     prob.1 response
1 0.9178036 0.08219636        0
2 0.5909570 0.40904305        0
3 0.9123303 0.08766974        0
4 0.8927383 0.10726167        0
5 0.4069407 0.59305933        1
6 0.8337609 0.16623907        0
… (#rows: 418, #cols: 3)
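If you wanted to apply a stricter decision threshold than the default 0.5 discussed earlier, mlr's setThreshold() function can reassign the predicted classes; the 0.7 cutoff below is an arbitrary example, not a recommendation.
pred <- predict(logRegModel, newdata = titanicNewClean)

# Require at least a 0.7 probability of class "1" (survived) before
# predicting survival; otherwise predict class "0"
predStrict <- setThreshold(pred, c("0" = 0.3, "1" = 0.7))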
While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether logistic regression will perform well for you.
The strengths of the logistic regression algorithm are as follows:
- It can handle both continuous and categorical predictors.
- The model parameters are very interpretable.
- Predictor variables are not assumed to be normally distributed.
The weaknesses of the logistic regression algorithm are these:
- It won't work when there is complete separation between the classes.
- It assumes that the classes are linearly separable. In other words, it assumes that a flat surface in n-dimensional space (where n is the number of predictors) can be used to separate the classes. If a curved surface is required to separate the classes, logistic regression will underperform compared to some other algorithms.
- It assumes a linear relationship between each predictor and the log odds. If, for example, cases with low and high values of a predictor belong to one class, but cases with medium values of the predictor belong to another class, this linearity will break down.
Exercise 3
Repeat the model-building process, but omit the Fare variable. Does it make a difference to model performance as estimated by cross-validation? Why?
Exercise 4
Extract the salutations from the Name variable, and convert any that aren't "Mr", "Dr", "Master", "Miss", "Mrs", or "Rev" to "Other". As a hint, you can extract the salutations with the str_split() function from the stringr tidyverse package.
Exercise 5
Build a model that includes Salutation as another predictor, and cross-validate it. Does this improve model performance?
- Logistic regression is a supervised learning algorithm that classifies new data by calculating the probabilities of the data belonging to each class.
- Logistic regression can handle continuous and categorical predictors, and models a linear relationship between the predictors and the log odds of belonging to the positive class.
- Feature engineering is the process by which we extract information from, or create new variables from, existing variables to maximize their predictive value.
- Feature selection is the process of choosing which variables in a dataset have predictive value for machine learning models.
- Imputation is a strategy for dealing with missing data, where some algorithm is used to estimate what the missing values would have been. You learned how to apply mean imputation for the Titanic dataset.
- Odds ratios are an informative way of interpreting the impact each of our predictors has on the odds of a case belonging to the positive class. They can be calculated by taking the exponent of the model slopes (e^log odds).
- Redraw the violin plots, adding a geom_point() layer with transparency:
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_point(alpha = 0.05, size = 3) +
  theme_bw()
- Redraw the bar plots, but use the "dodge" and "stack" position arguments:
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "dodge") +
  theme_bw()

titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "stack") +
  theme_bw()
- Build the model, but omit the Fare variable:
titanicNoFare <- select(titanicClean, -Fare)

titanicNoFareTask <- makeClassifTask(data = titanicNoFare,
                                     target = "Survived")

logRegNoFare <- resample(logRegWrapper, titanicNoFareTask,
                         resampling = kFold,
                         measures = list(acc, fpr, fnr))

logRegNoFare

Omitting the Fare variable makes little difference to model performance, because it adds little predictive value beyond the Pclass variable (look at the odds ratio and confidence interval for Fare in listing 4.13).
- Extract salutations from the Name variable (there are many ways of doing this, so don’t worry if your way is different than mine):
surnames <- map_chr(str_split(titanicTib$Name, "\\."), 1)

salutations <- map_chr(str_split(surnames, ", "), 2)

salutations[!(salutations %in% c("Mr", "Dr", "Master", "Miss",
                                 "Mrs", "Rev"))] <- "Other"
- Build a model using Salutation as a predictor:
fctrsInclSals <- c("Survived", "Sex", "Pclass", "Salutation")

titanicWithSals <- titanicTib %>%
  mutate(FamSize = SibSp + Parch, Salutation = salutations) %>%
  mutate_at(.vars = fctrsInclSals, .funs = factor) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamSize, Salutation)

titanicTaskWithSals <- makeClassifTask(data = titanicWithSals,
                                       target = "Survived")

logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))

kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

logRegWithSals <- resample(logRegWrapper, titanicTaskWithSals,
                           resampling = kFold,
                           measures = list(acc, fpr, fnr))

logRegWithSals
The feature extraction paid off! Including Salutation as a predictor improved model performance.