This chapter covers
- Working with the logistic regression algorithm
- Understanding feature engineering
- Understanding missing value imputation
In this chapter, I’m going to add a new classification algorithm to your toolbox: logistic regression. Just like the k-nearest neighbors algorithm you learned about in the previous chapter, logistic regression is a supervised learning method that predicts class membership. Logistic regression relies on the equation of a straight line and produces models that are very easy to interpret and communicate.
Logistic regression can handle continuous (without discrete categories) and categorical (with discrete categories) predictor variables. In its simplest form, logistic regression is used to predict a binary outcome (cases can belong to one of two classes), but variants of the algorithm can handle multiple classes as well. Its name comes from the algorithm’s use of the logistic function, an equation that calculates the probability that a case belongs to one of the classes.
While logistic regression is most certainly a classification algorithm, it uses linear regression and the equation for a straight line to combine the information from multiple predictors. In this chapter, you’ll learn how the logistic function works and how the equation for a straight line is used to build a model.
Note
If you're already familiar with linear regression, a key distinction between linear and logistic regression is that the former learns the relationship between predictor variables and a continuous outcome variable, whereas the latter learns the relationship between predictor variables and a categorical outcome variable.
By the end of this chapter, you will have applied the skills you learned in chapters 2 and 3 to prepare your data and build, interpret, and evaluate the performance of a logistic regression model. You will also have learned what missing value imputation is: a method for filling in missing data with sensible values when working with algorithms that cannot handle missing values. You will apply a basic form of missing value imputation as a strategy to deal with missing data.
Imagine that you're the curator of fifteenth-century art at a museum. When works of art, allegedly by famous painters, come to the museum, it's your job to determine whether they are genuine or fake (a two-class classification problem). You have access to the chemical analysis performed on each painting, and you are aware that many forgeries of this period used paints with lower copper content than the original paintings. You can use logistic regression to learn a model that tells you the probability of a painting being an original based on the copper content of its paint. The model will then assign the painting to the class with the highest probability (see figure 4.1).
Figure 4.1. Logistic regression learns models that output the probability (p) of new data belonging to each of the classes. Typically, new data is assigned the class to which it has the highest probability of belonging. The dotted arrow indicates that there are additional steps in calculating the probabilities, which we’ll discuss in section 4.1.1.

Note
The algorithm is commonly applied to two-class classification problems (this is referred to as binomial logistic regression), but a variant called multinomial logistic regression handles classification problems where you have three or more classes.
Logistic regression is a very popular classification algorithm, especially in the medical community, partly because of how interpretable the model is. For every predictor variable in our model, we get an estimate of just how the value of that variable impacts the probability that a case belongs to one class over another.
We know that logistic regression learns models that estimate the probability of new cases belonging to each class. Let's delve into how the algorithm learns the model.
Take a look at the (imaginary) data in figure 4.2. I've plotted the copper content of a sample of paintings we know to be real or forgeries against their class, as if class were a continuous variable between 0 and 1. We can see that, on average, the forgeries contain less copper in their paint than the originals. We could model this relationship with a straight line, as shown in the figure. This approach works well when your predictor variable has a linear relationship with a continuous variable that you want to predict (we'll cover this in chapter 9); but as you can see, it doesn't do a good job of modeling the relationship between a continuous variable and a categorical one.
Figure 4.2. Plotting copper content against class. The y-axis displays the categorical class membership as if it were a continuous variable, with forgeries and originals taking the values of 0 and 1, respectively. The solid line represents a poor attempt to model a linear relationship between copper content and class. The dashed line at y = 0.5 indicates the threshold of classification.

As shown in the figure, we could find the copper content at which the straight line passes halfway between 0 and 1, and classify paintings with copper content below this value as forgeries and paintings above the value as originals. This might result in many misclassifications, so a better approach is needed.
We can better model the relationship between copper content and class membership using the logistic function, which is shown in figure 4.3. The logistic function is an S-shaped curve that maps a continuous variable (copper content, in our case) onto values between 0 and 1. This does a much better job of representing the relationship between copper content and whether a painting is an original or a forgery. The figure shows a logistic function fit to the same data as in figure 4.2. We could find the copper content at which the logistic function passes halfway between 0 and 1, and classify paintings with copper content below this value as forgeries and paintings above the value as originals. This typically results in fewer misclassifications than when we do this using a straight line.
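To make that mapping concrete, here is a minimal R sketch (not code from this chapter) of the logistic function applied to some made-up copper values; the intercept and slope are arbitrary assumptions chosen only to center the curve somewhere sensible.
# The logistic function maps any real number onto the interval (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

copper <- seq(0, 25, by = 0.5)         # made-up copper content values

# Assumed intercept (-13) and slope (1), for illustration only
pOriginal <- logistic(-13 + 1 * copper)

plot(copper, pOriginal, type = "l")    # draws the S-shaped curve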
Figure 4.3. Modeling the data with the logistic function. The S-shaped curve represents the logistic function fitted to the data. The center of the curve passes through the mean of copper content and maps it between 0 and 1.

Importantly, because the logistic function maps our x variable onto values between 0 and 1, we can interpret its output as the probability of a case with a particular copper content being an original painting. Take another look at figure 4.3. Can you see that as copper content increases, the logistic function approaches 1? This represents the fact that, on average, original paintings have a higher copper content, so if you pick a painting at random and find that it has a copper content of 20, it has a ~0.99 or 99% probability of being an original.
Note
If I had coded the grouping variable the other way around (with forgeries being 1 and originals being 0), then the logistic function would approach 1 for low values of copper and approach 0 for high values. We would simply interpret the output as the probability of being a forgery, instead.
The opposite is also true: as copper content decreases, the logistic function approaches 0. This represents the fact that, on average, forgeries have lower copper content, so if you pick a painting at random and find it has a copper content of 7, it has a ~0.99 or 99% probability of being a forgery.
Great! We can estimate the probability of a painting being an original by using the logistic function. But what if we have more than one predictor variable? Because probabilities are bounded between 0 and 1, it's difficult to combine the information from two predictors. For example, say the logistic function estimates that a painting has a 0.6 probability of being an original for one predictor variable, and a 0.7 probability for the other predictor. We can't simply add these estimates together, because the result would be larger than 1, and this wouldn't make sense.
Instead, we can take these probabilities and convert them into their log odds (the "raw" output from logistic regression models). To introduce log odds, let me first explain what I mean by odds, and the difference between odds and probability.
The odds of a painting being an original are the probability of it being an original divided by the probability of it being a forgery: odds = p(original) / p(forgery). You may come across this written as odds = p / (1 - p), where p is the probability of the event of interest (here, being an original).
Odds are a convenient way of representing the likelihood of something occurring. They tell us how much more likely an event is to occur than not to occur.
In The Empire Strikes Back, C-3PO says that the odds of "successfully navigating an asteroid field are approximately 3,720 to 1!" What C-3PO was trying to tell Han and Leia was that the probability of successfully navigating an asteroid field is approximately 3,720 times smaller than the probability of unsuccessfully navigating it. Simply stating the odds is often a more convenient way of representing likelihood, because we know that for every 1 asteroid field that was successfully navigated, 3,720 were not! Additionally, whereas probability is bounded between 0 and 1, odds can take any positive value.
Note
Despite being a highly intelligent protocol droid, C-3PO got his odds the wrong way around (as many people do). He should have said the odds of successfully navigating an asteroid field are approximately 1 to 3,720!
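As a quick illustration (a sketch using C-3PO's numbers, not code from this chapter), converting between probability and odds in R is just arithmetic:
p    <- 1 / 3721             # probability of successfully navigating the field
odds <- p / (1 - p)          # odds = p / (1 - p) = 1/3720, i.e. 1 to 3,720

pBack <- odds / (1 + odds)   # converts the odds back into a probability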
Figure 4.4 shows copper content plotted against the odds of a painting being an original. Notice that the odds are not bounded between 0 and 1, and that they take on positive values.
As we can see, though, the relationship between the copper content of the paint and the odds of a painting being an original is not linear. Instead, if we take the natural logarithm (log with a base of e, abbreviated as ln) of the odds, we get the log odds:
- log odds = ln(odds) = ln(p / (1 - p))
Tip
Equation 4.3, which converts probabilities into log odds, is also called the logit function. You will often see logit regression and logistic regression used interchangeably.
Figure 4.4. Plotting the odds of being an original against copper content. The probabilities derived from the logistic function were converted into odds and plotted against copper content. Odds can take any positive value. The straight line represents a poor attempt to model a linear relationship between copper content and odds.

I've taken the natural logarithm of the odds shown in figure 4.4 to generate their log odds, and plotted these log odds against copper content in figure 4.5. Hurray! We have a linear relationship between our predictor variable and the log odds of a painting being an original. Also notice that log odds are completely unbounded: they can extend to positive and negative infinity. When interpreting log odds:
- A positive value means something is more likely to occur than not to occur.
- A negative value means something is less likely to occur than not to occur.
- Log odds of 0 means something is as likely to occur as not to occur.
Figure 4.5. Plotting the log odds of being an original against copper content. The odds were converted into log odds using the logit function and plotted against copper content. Log odds are unbounded and can take any value. The straight line represents the linear relationship between copper content and log odds.

When discussing figure 4.4, I highlighted that the relationship between copper content and the odds of being an original painting was not linear. Then I showed you in figure 4.5 that the relationship between copper content and the log odds was linear. In fact, linearizing this relationship is why we take the natural logarithm of the odds. Why did I make such a big deal about there being a linear relationship between our predictor variable and the log odds? Well, modeling a straight line is easy. Recall from chapter 1 that all an algorithm needs to learn to model a straight-line relationship is the y-intercept and the slope of the line. So logistic regression learns the log odds of a painting being an original when copper content is 0 (the y-intercept), and how the log odds change with increasing copper content (the slope).
Note
The more influence a predictor variable has on the log odds, the steeper its slope will be, while variables that have no predictive value will have a slope that is nearly horizontal.
Additionally, having a linear relationship means that when we have multiple predictor variables, we can add their contributions to the log odds together to get the overall log odds of a painting being an original, based on the information from all of its predictors.
Now, how do we get from the straight-line relationship between copper content and the log odds of being an original, to making predictions about new paintings? The model calculates the log odds of our new data being an original painting using
- log odds = y-intercept + slope * copper
where we add the y-intercept and the product of the slope and the value of copper in our new painting. Once we've calculated the log odds of the new painting, we convert it into the probability of being an original using the logistic function:
- p = 1 / (1 + e^(-z))
where p is the probability, e is Euler's number (a fixed constant ~ 2.718), and z is the log odds of a particular case.
Then, quite simply, if the probability of a painting being an original is > 0.5, it is classified as an original. If the probability is < 0.5, it is classified as a forgery. This conversion of log odds to odds to probabilities is illustrated in figure 4.6.
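Here is a small R sketch of that conversion chain (the log odds value is made up for illustration, not learned from any real data):
logOdds <- 2.2                 # assumed log odds for a new painting
odds    <- exp(logOdds)        # log odds -> odds
p       <- odds / (1 + odds)   # odds -> probability; same as 1 / (1 + exp(-logOdds))

ifelse(p > 0.5, "original", "forgery")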
Note
This threshold probability is 0.5 by default. In other words, if there is more than a 50% chance that a case belongs to the positive class, assign it to the positive class. We can alter this threshold, however, in situations where we need to be really sure before classifying a case as belonging to the positive class. For example, if we're using the model to predict whether a patient needs high-risk surgery, we want to be really sure before going ahead with the procedure!
Figure 4.6. Summary of how logistic regression models predict class membership. Data is converted into log odds (logits), which are converted into odds and then into the probability of belonging to the “positive” class. Cases are assigned to the positive class if their probability exceeds a threshold probability (0.5 by default).

You will often see the model
- log odds = y-intercept + slope * copper
rewritten as in equation 4.5.
- ln(p / (1 - p)) = β0 + βcopper * xcopper
Don't be scared by this! Look at equation 4.5 again. This is the way statisticians represent models that predict straight lines, and it is exactly the same as the equation describing log odds. The logistic regression model predicts the log odds (on the left of the equals sign) by adding the y-intercept (β0) and the slope of the line (βcopper) multiplied by the value of copper (xcopper).
You may be wondering: why are you showing me equations when you promised me you wouldn't? Well, in most situations, we won't have a single predictor; we'll have many. By representing the model in this way, you can see how it can be used to combine multiple predictors together linearly: in other words, by adding their effects together.
Let's say we also include the amount of the metal lead as a predictor for whether a painting is an original or not. The model will instead look like this:
- ln(p / (1 - p)) = β0 + βcopper * xcopper + βlead * xlead
Bn xelamep lv wsrb jgar mode f tigmh eefk fojo jz ohsnw nj figure 4.7. Mrgj rew predictor variables, wo nca pteesrenr rxb mode f as s nelap, rpwj roy log odds swonh nk vrb iveactrl jzzv. Akb zvcm inecrpipl spaiple ktl tvxm nbrs wxr predictors, rgd rj’a cdultffii re elvziiaus nv z 2Q csaruef.
Kwx, tle cng niatgnpi ow shzz jnrk tvg mode f, kpr mode f apke brx iglofnolw:
- Wupiltelsi jrz epcrop toetcnn bu vrp seopl lxt oprpce
- Wieilspltu gxr fqcx otncetn gp rxq elspo tvl qfck
- Xqcb etshe xrw evsaul unz rgo b-eincprtte teothgre rx xpr rdk log odds lv urrs ianipgnt egnbi ns oagilrin
- Bosertnv krd log odds jknr c ipbaoriblyt
- Tiselissaf xur tipagnin cs ns lainogri jl xpr iplartbobiy jc > 0.5, tv cslfsiisae qvr itnignpa cc c grofrye jl rpv ylptiiborba jc < 0.5
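The following sketch walks through those five steps by hand; the intercept, slopes, and the new painting's measurements are all made-up values for illustration.
b0      <- -10     # assumed y-intercept
bCopper <-  0.8    # assumed slope for copper
bLead   <- -0.5    # assumed slope for lead

newPainting <- c(copper = 16, lead = 4)   # made-up measurements

logOdds <- b0 + bCopper * newPainting["copper"] + bLead * newPainting["lead"]
p       <- 1 / (1 + exp(-logOdds))        # logistic function

ifelse(p > 0.5, "original", "forgery")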
Figure 4.7. Visualizing a logistic regression model with two predictors. Copper content and lead content are plotted on the x- and z-axes, respectively. Log odds are plotted on the y-axis. The plane shown inside the plot represents the linear model that combines the intercept and the slopes of copper content and lead content to predict log odds.

We can extend the model to include as many predictor variables as we want:
- ln(p / (1 - p)) = β0 + β1x1 + β2x2 + ... + βkxk
where k is the number of predictor variables in the data set and the ... represents all the variables in between.
Tip
Remember in chapter 3, when I explained the difference between parameters and hyperparameters? Well, β0, β1, and so on are model parameters, because they are learned by the algorithm from the data.
The whole procedure for classifying new paintings is summarized in figure 4.8. First, we convert the copper and lead values of our new data into their log odds (logits) by using the linear model learned by the algorithm. Next, we convert the log odds into their probabilities using the logistic function. Finally, if the probability is > 0.5, we classify the painting as an original; and if the probability is < 0.5, we classify it as a forgery.
Figure 4.8. The process of classifying new paintings. The predictor variable values of three paintings are converted into log odds based on the learned model parameters (intercept and slopes). The log odds are converted into probabilities (p), and if p > 0.5, the case is classified as the “positive” class.

Note
Although the first and third paintings in figure 4.8 were both classified as forgeries, they had very different probabilities. Because the probability of the third painting is much smaller than the probability of the first, we can be more confident that painting 3 is a forgery than we are that painting 1 is a forgery.
The previous scenario is an example of binomial logistic regression. In other words, the decision about which class to assign to new data can take on only one of two named categories (bi and nomos come from Latin and Greek, respectively). But we can use a variant of logistic regression to predict one of multiple classes. This is called multinomial logistic regression, because there are now multiple possible categories to choose from.
In multinomial logistic regression, instead of estimating a single logit for each case, the model estimates a logit for each case for each of the output classes. These logits are then passed into an equation called the softmax function, which turns them into probabilities for each class that sum to 1 (see figure 4.9). Then whichever class has the largest probability is selected as the output class.
Figure 4.9. Summary of the softmax function. In the binomial case, only one logit is needed per case (the logit for the positive class). Where there are multiple classes (a, b, and c in this example), the model estimates one logit per class for each case. The softmax function maps these logits to probabilities that sum to one. The case is assigned to the class with the largest probability.

Tip
The classif.logreg learner wrapped by mlr will only handle binomial logistic regression. There isn't currently an implementation of ordinary multinomial logistic regression wrapped by mlr. We can, however, use the classif.LiblineaRL1LogReg learner to perform multinomial logistic regression (although it has some differences I won't discuss).
The softmax function
It isn't necessary for you to memorize the softmax function, so feel free to skip this, but the softmax function is defined as
pa = e^logita / (e^logita + e^logitb + e^logitc)
where pa is the probability of a case belonging to class a, e is Euler's number (a fixed constant ~ 2.718), and logita, logitb, and logitc are the logits for this case for being in classes a, b, and c, respectively.
If you're a math fan, this can be generalized to any number of classes using the equation
pi = e^logiti / Σ(k = 1 to K) e^logitk
where pi is the probability of being in class i, and Σ(k = 1 to K) means we sum the exponentiated logits from class 1 to class K (where there are K classes in total).
Write your own implementation of the softmax function in R, and try plugging other vectors of numbers into it. You'll find that it always maps the input to an output where all the elements sum to 1.
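Here is one possible implementation (a sketch; the function and variable names are my own):
# Softmax: maps a vector of logits onto probabilities that sum to 1
softmax <- function(logits) {
  exp(logits) / sum(exp(logits))
}

softmax(c(2.1, -0.4, 0.8))        # returns roughly 0.74, 0.06, 0.20
sum(softmax(c(2.1, -0.4, 0.8)))   # always 1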
Now that you know how logistic regression works, you're going to build your first binomial logistic regression model.
Imagine that you're a historian interested in the RMS Titanic, which famously sank in 1912 after colliding with an iceberg. You want to know whether socioeconomic factors influenced a person's probability of surviving the disaster. Luckily, such socioeconomic data is publicly available!
Your aim is to build a binomial logistic regression model to predict whether a passenger would survive the Titanic disaster, based on data such as their gender and how much they paid for their ticket. You're also going to interpret the model to decide which variables were important in influencing the probability of a passenger surviving. Let's start by loading the mlr and tidyverse packages:
library(mlr)

library(tidyverse)
Now let's load the data, which is built into the titanic package, convert it into a tibble (with as_tibble()), and explore it a little. We have a tibble containing 891 cases and 12 variables of passengers of the Titanic. Our goal is to train a model that can use the information in these variables to predict whether a passenger would survive the disaster.
Listing 4.1. Loading and exploring the Titanic dataset
install.packages("titanic")

data(titanic_train, package = "titanic")

titanicTib <- as_tibble(titanic_train)

titanicTib

# A tibble: 891 x 12
   PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket
         <int>    <int>  <int> <chr> <chr> <dbl> <int> <int> <chr>
 1           1        0      3 Brau… male     22     1     0 A/5 2…
 2           2        1      1 Cumi… fema…    38     1     0 PC 17…
 3           3        1      3 Heik… fema…    26     0     0 STON/…
 4           4        1      1 Futr… fema…    35     1     0 113803
 5           5        0      3 Alle… male     35     0     0 373450
 6           6        0      3 Mora… male     NA     0     0 330877
 7           7        0      1 McCa… male     54     0     0 17463
 8           8        0      3 Pals… male      2     3     1 349909
 9           9        1      3 John… fema…    27     0     2 347742
10          10        1      2 Nass… fema…    14     1     0 237736
# … with 881 more rows, and 3 more variables: Fare <dbl>,
#   Cabin <chr>, Embarked <chr>
The tibble contains the following variables:
- PassengerId—An arbitrary number unique to each passenger
- Survived—An integer denoting survival (1 = survived, 0 = died)
- Pclass—Whether the passenger was housed in first, second, or third class
- Name—A character vector of the passengers' names
- Sex—A character vector containing "male" and "female"
- Age—The age of the passenger
- SibSp—The combined number of siblings and spouses on board
- Parch—The combined number of parents and children on board
- Ticket—A character vector with each passenger's ticket number
- Fare—The amount of money each passenger paid for their ticket
- Cabin—A character vector of each passenger's cabin number
- Embarked—A character vector of which port passengers embarked from
The first thing we're going to do is use tidyverse tools to clean and prepare the data for modeling.
Rarely will you be working with a data set that is ready for modeling straight away. Typically, we need to perform some cleaning first to ensure that we get the most from the data. This includes steps such as converting data to the correct types, correcting mistakes, and removing irrelevant data. The titanicTib tibble is no exception; we need to clean it up before we can pass it to the logistic regression algorithm. We'll perform three tasks:
- Convert the Survived, Sex, and Pclass variables into factors.
- Create a new variable called FamSize by adding SibSp and Parch together.
- Select the variables we believe to be of predictive value for our model.
If a variable should be a factor, it's important to let R know it's a factor, so that R treats it appropriately. We can see from the output of titanicTib in listing 4.1 that Survived and Pclass are both integer vectors (<int> is shown above their columns in the output) and that Sex is a character vector (<chr> above the column). Each of these variables should be treated as a factor because it represents discrete differences between cases that are repeated throughout the data set.
We might hypothesize that the number of family members a passenger had on board might impact their survival. For example, people with many family members may be reluctant to board a lifeboat that doesn't have enough room for their whole family. While the SibSp and Parch variables contain this information separated into siblings and spouses, and parents and children, respectively, it may be more informative to combine these into a single variable containing overall family size.
This is an extremely important machine learning task called feature engineering: the modification of variables in your data set to improve their predictive value. Feature engineering comes in two flavors:
- Feature extraction— Predictive information is held in a variable, but in a format that is not useful. For example, let's say you have a variable that contains the year, month, day, and time of day of certain events occurring. The time of day has important predictive value, but the year, month, and day do not. For this variable to be useful in your model, you would need to extract only the time-of-day information as a new variable.
- Feature creation— Existing variables are combined to create new ones. Merging the SibSp and Parch variables to create FamSize is an example.
Using feature extraction and feature creation allows us to extract predictive information present in our data set but not currently in a format that maximizes its usefulness.
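For instance, here is a minimal sketch of that kind of feature extraction (the eventTimes tibble and its timestamp column are hypothetical, and I assume the tidyverse is already loaded):
# Hypothetical data: full date-times, where only the time of day matters
eventTimes <- tibble(
  timestamp = as.POSIXct(c("2019-03-01 08:15:00", "2019-03-02 23:40:00"))
)

# Extract just the hour of day as a new, more useful variable
eventTimes <- mutate(eventTimes, hourOfDay = as.numeric(format(timestamp, "%H")))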
Finally, we will often have variables in our data that have no predictive value. For example, does knowing the passenger's name or cabin number help us predict survival? Possibly not, so let's remove them. Including variables with little or no predictive value adds noise to the data and will negatively impact how our models perform, so it's best to remove them.
This is another extremely important machine learning task called feature selection, and it is pretty much what it sounds like: keeping variables that have predictive value, and removing those that don't. Sometimes it's obvious to us as humans whether variables are useful predictors or not. Passenger name, for example, would not be useful because every passenger has a different name! In these situations, it's common sense to remove such variables. Often, however, it's not so obvious, and there are more sophisticated ways we can automate the feature-selection process. We'll explore this in later chapters.
All three of these tasks (converting to factors, feature engineering, and feature selection) are performed in listing 4.2. I've made our lives easier by defining a vector of the variables we wish to convert into factors, and then using the mutate_at() function to turn them all into factors. The mutate_at() function is like the mutate() function, but it allows us to mutate multiple columns at once. We supply the existing variables as a character vector to the .vars argument and tell it what we want to do to those variables using the .funs argument. In this case, we supply the vector of variables we defined, and the "factor" function to convert them into factors. We pipe the result of this into a mutate() function call that defines a new variable, FamSize, which is the sum of SibSp and Parch. Finally, we pipe the result of this into a select() function call, to select only the variables we believe may have some predictive value for our model.
Listing 4.2. Cleaning the Titanic data, ready for modeling
fctrs <- c("Survived", "Sex", "Pclass")

titanicClean <- titanicTib %>%
  mutate_at(.vars = fctrs, .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamSize)

titanicClean

# A tibble: 891 x 6
   Survived Pclass Sex      Age  Fare FamSize
   <fct>    <fct>  <fct>  <dbl> <dbl>   <int>
 1 0        3      male      22  7.25       1
 2 1        1      female    38 71.3        1
 3 1        3      female    26  7.92       0
 4 1        1      female    35 53.1        1
 5 0        3      male      35  8.05       0
 6 0        3      male      NA  8.46       0
 7 0        1      male      54 51.9        0
 8 0        3      male       2 21.1        4
 9 1        3      female    27 11.1        2
10 1        2      female    14 30.1        1
# … with 881 more rows
When we print our new tibble, we can see that Survived, Pclass, and Sex are now factors (<fct> is shown above their columns in the output); we have our new variable, FamSize; and we have removed the irrelevant variables.
Note
Have I been too hasty in removing the Name variable from the tibble? Hidden in this variable are the salutations for each passenger (Miss, Mrs., Mr., Master, and so on), which may have predictive value. Using this information would require feature extraction.
Now that we've cleaned our data a little, let's plot it to get better insight into the relationships in the data. Here's a little trick to simplify plotting multiple variables together using ggplot2. Let's convert the data into an untidy format, such that each of the predictor variable names is held in one column and its values are held in another column, using the gather() function (refresh your memory of this by looking at the end of chapter 2).
Note
The gather() function will warn that "attributes are not identical across measure variables; they will be dropped." This is simply warning you that the variables you are gathering together don't have the same factor levels. Ordinarily this might mean you've collapsed variables you didn't mean to, but in this case we can safely ignore the warning.
Listing 4.3. Creating an untidy tibble for plotting
titanicUntidy <- gather(titanicClean, key = "Variable", value = "Value",
                        -Survived)

titanicUntidy

# A tibble: 4,455 x 3
   Survived Variable Value
   <fct>    <chr>    <chr>
 1 0        Pclass   3
 2 1        Pclass   1
 3 1        Pclass   3
 4 1        Pclass   1
 5 0        Pclass   3
 6 0        Pclass   3
 7 0        Pclass   1
 8 0        Pclass   3
 9 1        Pclass   3
10 1        Pclass   2
# … with 4,445 more rows
We now have an untidy tibble with three columns: one containing the Survived factor, one containing the names of the predictor variables, and one containing their values.
Note
You may be wondering why we're doing this. Well, it allows us to use ggplot2's faceting system to plot our different variables together. In listing 4.4, I take the titanicUntidy tibble, filter for the rows that do not contain the Pclass or Sex variables (as these are factors, we'll plot them separately), and pipe this data into a ggplot() call.
Listing 4.4. Creating subplots for each continuous variable
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme_bw()
In the ggplot() function call, we supply Survived as the x aesthetic and Value as the y aesthetic (coercing it into a numeric vector with as.numeric() because it was converted into a character by our gather() function call earlier). Next (and here's the cool bit), we ask ggplot2 to facet by the Variable column, using the facet_wrap() function, and allow the y-axis to vary between the facets. Faceting allows us to draw subplots of our data, indexed by some faceting variable. Finally, we add a violin geometric object, which is similar to a box plot but also shows the density of the data along the y-axis. The resulting plot is shown in figure 4.10.
Figure 4.10. Faceted plot of Survived against FamSize and Fare. Violin plots show the density of data along the y-axis. The lines on each violin represent the first quartile, median, and third quartile (from lowest to highest).

Can you see how the faceting worked? Rows in the data with different values of Variable are plotted on different subplots! This is why we needed to gather the data into an untidy format: so we could supply a single variable for ggplot2 to facet by.
Exercise 1
Redraw the plot in figure 4.10, but add a geom_point() layer, setting the alpha argument to 0.05 and the size argument to 3. Does this make the violin plots make more sense?
Now let's do the same thing for the factors in our data set by filtering the data for rows that contain only the Pclass and Sex variables. This time, we want to see what proportion of passengers in each level of the factors survived. To do so, we plot the factor levels on the x-axis by supplying Value as the x aesthetic mapping; and we want to use different colors to denote survival versus non-survival, so we supply Survived as the fill aesthetic. We facet by Variable as before and add a bar geometric object with the argument position = "fill", which stacks the data for survivors and non-survivors such that they sum to 1, to show us the proportion of each. The resulting plot is shown in figure 4.11.
Listing 4.5. Creating subplots for each categorical variable
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "fill") +
  theme_bw()
Figure 4.11. Faceted plot of Survived against Pclass and Sex. Filled bars represent the proportion of passengers at each level of the factors that survived (1 = survival).

Note
In the filter() function calls in listings 4.4 and 4.5, I use the & and | operators to mean "and" and "or," respectively.
So it seems like passengers who survived tended to have slightly more family members on board (perhaps contradicting our hypothesis), although passengers with very large families on board tended not to survive. Age doesn't seem to have had an obvious impact on survival, but being female meant you would be much more likely to survive. Paying more for your fare increased your probability of survival, as did being in a higher class (though the two probably correlate).
Exercise 2
Redraw the plot in figure 4.11, but change the geom_bar() position argument to "dodge". Do this again, but make the position argument equal to "stack". Can you see the difference between the three methods?
Now that we have our cleaned data, let's create a task, learner, and model with mlr (specifying "classif.logreg" to use logistic regression as our learner). By setting the argument predict.type = "prob", the trained model will output the estimated probabilities of each class when making predictions on new data, rather than just the predicted class memberships.
Listing 4.6. Creating a task and learner, and training a model
titanicTask <- makeClassifTask(data = titanicClean, target = "Survived")

logReg <- makeLearner("classif.logreg", predict.type = "prob")

logRegModel <- train(logReg, titanicTask)

Error in checkLearnerBeforeTrain(task, learner, weights) :
  Task 'titanicClean' has missing values in 'Age', but learner
  'classif.logreg' does not support that!
Whoops! Something went wrong. What does the error message say? Hmm, it seems we have some missing data in the Age variable, and the logistic regression algorithm doesn't know how to handle that. Let's have a look at this variable. (I'm only displaying the first 60 elements to save room, but you can print the entire vector.)
Listing 4.7. Counting missing values in the Age variable
titanicClean$Age[1:60]

 [1] 22.0 38.0 26.0 35.0 35.0   NA 54.0  2.0 27.0 14.0  4.0 58.0 20.0
[14] 39.0 14.0 55.0  2.0   NA 31.0   NA 35.0 34.0 15.0 28.0  8.0 38.0
[27]   NA 19.0   NA   NA 40.0   NA   NA 66.0 28.0 42.0   NA 21.0 18.0
[40] 14.0 40.0 27.0   NA  3.0 19.0   NA   NA   NA   NA 18.0  7.0 21.0
[53] 49.0 29.0 65.0   NA 21.0 28.5  5.0 11.0

sum(is.na(titanicClean$Age))

[1] 177
There are two ways to handle missing data:
- Simply exclude cases with missing data from the analysis
- Apply an imputation mechanism to fill in the gaps
The first option may be valid when the ratio of cases with missing values to complete cases is very small. In that case, omitting cases with missing data is unlikely to have a large impact on the performance of our model. It is a simple, if not elegant, solution to the problem.
The second option, missing value imputation, is the process by which we use some algorithm to estimate what those missing values would have been, replace the NAs with these estimates, and use this imputed data set to train our model. There are many different ways of estimating the values of missing data, and we'll use more sophisticated ones throughout the book, but for now, we'll employ mean imputation, where we simply take the mean of the variable with missing data and replace the missing values with that.
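Conceptually, mean imputation is nothing more than the following base R sketch (mlr's impute() function, used shortly, handles this bookkeeping for us):
ageImputed <- titanicClean$Age

# Replace every NA with the mean of the non-missing ages
ageImputed[is.na(ageImputed)] <- mean(ageImputed, na.rm = TRUE)

sum(is.na(ageImputed))   # 0 missing values remain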
In listing 4.8, I use mlr's impute() function to replace the missing data. The first argument is the name of the data, and the cols argument asks us which columns we want to impute and what method we want to apply. We supply the cols argument as a list of the column names, separated by commas if we have more than one. Each column listed should be followed by an = sign and the imputation method (imputeMean() uses the mean of the variable to replace NAs). I save the imputed data structure as an object, imp, and use sum(is.na()) to count the number of missing values in the data.
Listing 4.8. Imputing missing values in the Age variable
imp <- impute(titanicClean, cols = list(Age = imputeMean()))

sum(is.na(titanicClean$Age))

[1] 177

sum(is.na(imp$data$Age))

[1] 0
We can see that those 177 missing values have all been imputed!
Okay, we've imputed those pesky missing values with the mean and created the new object imp. Now let's try again by creating a task using the imputed data. The imp object contains both the imputed data and a description of the imputation process we used. To extract the data, we simply use imp$data.
Listing 4.9. Training a model on imputed data
titanicTask <- makeClassifTask(data = imp$data, target = "Survived")

logRegModel <- train(logReg, titanicTask)
This time, no error messages. Next, let's cross-validate our model to estimate how it will perform.
Remember that when we cross-validate, we should cross-validate our entire model-building procedure. This should include any data-dependent preprocessing steps, such as missing value imputation. In chapter 3, we used a wrapper function to wrap together our learner and our hyperparameter tuning procedure. This time, we're going to create a wrapper for our learner and our missing value imputation.
The makeImputeWrapper() function wraps together a learner (given as the first argument) and an imputation method. Notice how we specify the imputation method in exactly the same way as for the impute() function in listing 4.8, by supplying a list of columns and their imputation methods.
Listing 4.10. Wrapping together the learner and the imputation method
logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))
Now let's apply stratified, 10-fold cross-validation, repeated 50 times, to our wrapped learner.
Note
Remember that we first define our resampling method using makeResampleDesc() and then use resample() to run the cross-validation.
Because we're supplying our wrapped learner to the resample() function, for each fold of the cross-validation, the mean of the Age variable in the training set will be used to impute any missing values.
Listing 4.11. Cross-validating our model-building process
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

logRegwithImpute <- resample(logRegWrapper, titanicTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))

logRegwithImpute

Resample Result
Task: imp$data
Learner: classif.logreg.imputed
Aggr perf: acc.test.mean=0.7961500,fpr.test.mean=0.2992605,fnr.test.mean=0.1444175
Runtime: 10.6986
As this is a two-class classification problem, we have access to a few extra performance metrics, such as the false positive rate (fpr) and false negative rate (fnr). In the cross-validation procedure in listing 4.11, we ask for accuracy, false positive rate, and false negative rate to be reported as performance metrics. We can see that although, on average across the repeats, our model correctly classified 79.6% of passengers, it incorrectly classified 29.9% of passengers who died as having survived (false positives), and incorrectly classified 14.4% of passengers who survived as having died (false negatives).
You might think that the accuracy of a model's predictions is the defining metric of its performance. Often, this is the case, but sometimes, it's not.
Imagine that you work for a bank as a data scientist in the fraud-detection department. It's your job to build a model that predicts whether credit card transactions are legitimate or fraudulent. Let's say that out of 100,000 credit card transactions, only 1 is fraudulent. Because fraud is relatively rare (and because they're serving pizza for lunch today), you decide to build a model that simply classifies all transactions as legitimate.
The model accuracy is 99.999%. Pretty good? Of course not! The model isn't able to identify any fraudulent transactions and has a false negative rate of 100%!
The lesson here is that you should evaluate model performance in the context of your particular problem. Another example could be building a model that will guide doctors to use an unpleasant treatment, or not, for a patient. In the context of this problem, it may be acceptable to incorrectly not give a patient the unpleasant treatment, but it is imperative that you don't incorrectly give a patient the treatment if they don't need it!
If positive events are rare (as in our fraudulent credit card example), or if it is particularly important that you don't misclassify positive cases as negative, you should favor models that have a low false negative rate. If negative events are rare, or if it is particularly important that you don't misclassify negative cases as positive (as in our medical treatment example), you should favor models that have a low false positive rate.
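If it helps to see how these metrics relate to one another, here is a small sketch computing them from made-up counts of true/false positives and negatives (these numbers are invented, not taken from our Titanic model):
tp <- 80; fp <- 20; tn <- 70; fn <- 30   # made-up confusion matrix counts

acc <- (tp + tn) / (tp + tn + fp + fn)   # accuracy = 0.75
fpr <- fp / (fp + tn)                    # false positive rate ~ 0.22
fnr <- fn / (fn + tp)                    # false negative rate ~ 0.27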
Take a look at https://mlr.mlr-org.com/articles/tutorial/measures.html to see all the performance measures currently wrapped by mlr and the situations in which they can be used.
I mentioned at the start of the chapter that logistic regression is very popular because of how interpretable the model parameters (the y-intercept and the slopes for each of the predictors) are. To extract the model parameters, we must first turn our mlr model object, logRegModel, into an R model object using the getLearnerModel() function. Next, we pass this R model object as the argument to the coef() function, which stands for coefficients (another term for parameters), so this function returns the model parameters.
Listing 4.12. Extracting model parameters
logRegModelData <- getLearnerModel(logRegModel)

coef(logRegModelData)
 (Intercept)      Pclass2      Pclass3      Sexmale          Age
 3.809661697 -1.000344806 -2.132428850 -2.775928255 -0.038822458
        Fare      FamSize
 0.003218432 -0.243029114
The intercept is the log odds of surviving the Titanic disaster when all continuous variables are 0 and the factors are at their reference levels. We tend to be more interested in the slopes than the y-intercept, but these values are in log odds units, which are difficult to interpret. Instead, people commonly convert them into odds ratios.
An odds ratio is, well, a ratio of odds. For example, if the odds of surviving the Titanic if you're female are about 7 to 10, and the odds of surviving if you're male are 2 to 10, then the odds ratio for surviving if you're female is 3.5. In other words, if you were female, you would have been 3.5 times more likely to survive than if you were male. Odds ratios are a very popular way of interpreting the impact of predictors on an outcome, because they are easily understood.
How do we get from log odds to odds ratios? By taking their exponent (e^log odds). We can also calculate 95% confidence intervals using the confint() function, to help us decide how strong the evidence is that each variable has predictive value.
Listing 4.13. Converting model parameters into odds ratios
exp(cbind(Odds_Ratio = coef(logRegModelData), confint(logRegModelData)))

Waiting for profiling to be done…
             Odds_Ratio       2.5 %       97.5 %
(Intercept) 45.13516691 19.14718874 109.72483921
Pclass2      0.36775262  0.20650392   0.65220841
Pclass3      0.11854901  0.06700311   0.20885220
Sexmale      0.06229163  0.04182164   0.09116657
Age          0.96192148  0.94700049   0.97652950
Fare         1.00322362  0.99872001   1.00863263
FamSize      0.78424868  0.68315465   0.89110044
Most of these odds ratios are less than 1. An odds ratio less than 1 means an event is less likely to occur. It's usually easier to interpret these if you divide 1 by them. For example, the odds ratio for surviving if you were male is 0.06, and 1 divided by 0.06 = 16.7. This means that, holding all other variables constant, men were 16.7 times less likely to survive than women.
For continuous variables, we interpret the odds ratio as how much more likely a passenger is to survive for every one-unit increase in the variable. For example, for every additional family member, a passenger was 1/0.78 = 1.28 times less likely to survive.
For factors, we interpret the odds ratio as how much more likely a passenger is to survive, compared to the reference level for that variable. For example, we have odds ratios for Pclass2 and Pclass3, which tell us how many times more likely passengers in classes 2 and 3 were to survive compared to those in class 1, respectively.
The 95% confidence intervals indicate the strength of the evidence that each variable has predictive value. An odds ratio of 1 means the odds are equal and the variable has no impact on prediction. Therefore, if a 95% confidence interval includes the value 1, such as the one for the Fare variable, this may suggest that the variable isn't contributing anything.
A one-unit increase often isn't easily interpretable. Say you get an odds ratio that says for every additional ant in an anthill, that anthill is 1.000005 times more likely to survive a termite attack. How can you comprehend the importance of such a small odds ratio?
When it doesn't make sense to think in one-unit increases, a popular technique is to log2 transform the continuous variables instead, before training the model with them. This won't impact the predictions made by the model, but now the odds ratio can be interpreted this way: every time the number of ants doubles, the anthill is x times more likely to survive. This will give much larger and much more interpretable odds ratios.
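As a sketch of what that transformation looks like in practice (using Fare purely as an illustration; the + 1 is an assumption to avoid taking log2 of the zero fares in the data):
# log2-transform a continuous predictor before training, so its odds ratio
# can be read as "per doubling" rather than "per one-unit increase"
titanicLog2 <- mutate(titanicClean, Fare = log2(Fare + 1))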
We've built, cross-validated, and interpreted our model, and now it would be nice to use the model to make predictions on new data. This scenario is a little unusual in that we've built a model based on a historical event, so (hopefully!) we won't be using it to predict survival of another Titanic disaster. Nevertheless, I want to illustrate how to make predictions with a logistic regression model, the same as you can for any other supervised algorithm. Let's load some unlabeled passenger data, clean it ready for prediction, and pass it through our model.
Listing 4.14. Using our model to make predictions on new data
data(titanic_test, package = "titanic")

titanicNew <- as_tibble(titanic_test)

titanicNewClean <- titanicNew %>%
  mutate_at(.vars = c("Sex", "Pclass"), .funs = factor) %>%
  mutate(FamSize = SibSp + Parch) %>%
  select(Pclass, Sex, Age, Fare, FamSize)

predict(logRegModel, newdata = titanicNewClean)

Prediction: 418 observations
predict.type: prob
threshold: 0=0.50,1=0.50
time: 0.00
     prob.0     prob.1 response
1 0.9178036 0.08219636        0
2 0.5909570 0.40904305        0
3 0.9123303 0.08766974        0
4 0.8927383 0.10726167        0
5 0.4069407 0.59305933        1
6 0.8337609 0.16623907        0
… (#rows: 418, #cols: 3)
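If you wanted to apply a stricter decision threshold than the default 0.5 discussed earlier, mlr's setThreshold() function can reassign the predicted classes; the 0.7 cutoff below is an arbitrary example, not a recommendation.
pred <- predict(logRegModel, newdata = titanicNewClean)

# Require at least a 0.7 probability of class "1" (survived) before
# predicting survival; otherwise predict class "0"
predStrict <- setThreshold(pred, c("0" = 0.3, "1" = 0.7))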
While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether logistic regression will perform well for you.
The strengths of the logistic regression algorithm are as follows:
- It can handle both continuous and categorical predictors.
- The model parameters are very interpretable.
- Predictor variables are not assumed to be normally distributed.
The weaknesses of the logistic regression algorithm are these:
- It won't work when there is complete separation between the classes.
- It assumes that the classes are linearly separable. In other words, it assumes that a flat surface in n-dimensional space (where n is the number of predictors) can be used to separate the classes. If a curved surface is required to separate the classes, logistic regression will underperform compared to some other algorithms.
- It assumes a linear relationship between each predictor and the log odds. If, for example, cases with low and high values of a predictor belong to one class, but cases with medium values of the predictor belong to another class, this linearity will break down.
Exercise 3
Repeat the model-building process, but omit the Fare variable. Does it make a difference to model performance as estimated by cross-validation? Why?
Exercise 4
Extract the salutations from the Name variable, and convert any that aren't "Mr", "Dr", "Master", "Miss", "Mrs", or "Rev" to "Other". As a hint, you can extract the salutations with the str_split() function from the stringr tidyverse package.
Exercise 5
Build a model that includes Salutation as another predictor, and cross-validate it. Does this improve model performance?
- Logistic regression is a supervised learning algorithm that classifies new data by calculating the probabilities of the data belonging to each class.
- Logistic regression can handle continuous and categorical predictors, and models a linear relationship between the predictors and the log odds of belonging to the positive class.
- Feature engineering is the process by which we extract information from, or create new variables from, existing variables to maximize their predictive value.
- Feature selection is the process of choosing which variables in a dataset have predictive value for machine learning models.
- Imputation is a strategy for dealing with missing data, where some algorithm is used to estimate what the missing values would have been. You learned how to apply mean imputation for the Titanic dataset.
- Odds ratios are an informative way of interpreting the impact each of our predictors has on the odds of a case belonging to the positive class. They can be calculated by taking the exponent of the model slopes (e^log odds).
- Redraw the violin plots, adding a geom_point() layer with transparency:
titanicUntidy %>%
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_point(alpha = 0.05, size = 3) +
  theme_bw()
- Redraw the bar plots, but use the "dodge" and "stack" position arguments:
titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "dodge") +
  theme_bw()

titanicUntidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "stack") +
  theme_bw()
- Build the model, but omit the Fare variable:
titanicNoFare <- select(titanicClean, -Fare)

titanicNoFareTask <- makeClassifTask(data = titanicNoFare,
                                     target = "Survived")

logRegNoFare <- resample(logRegWrapper, titanicNoFareTask,
                         resampling = kFold,
                         measures = list(acc, fpr, fnr))

logRegNoFare

Omitting the Fare variable makes little difference to model performance, because it adds little predictive value beyond the Pclass variable (look at the odds ratio and confidence interval for Fare in listing 4.13).
- Extract salutations from the Name variable (there are many ways of doing this, so don’t worry if your way is different than mine):
surnames <- map_chr(str_split(titanicTib$Name, "\\."), 1)

salutations <- map_chr(str_split(surnames, ", "), 2)

salutations[!(salutations %in% c("Mr", "Dr", "Master", "Miss",
                                 "Mrs", "Rev"))] <- "Other"
- Build a model using Salutation as a predictor:
fctrsInclSals <- c("Survived", "Sex", "Pclass", "Salutation")

titanicWithSals <- titanicTib %>%
  mutate(FamSize = SibSp + Parch, Salutation = salutations) %>%
  mutate_at(.vars = fctrsInclSals, .funs = factor) %>%
  select(Survived, Pclass, Sex, Age, Fare, FamSize, Salutation)

titanicTaskWithSals <- makeClassifTask(data = titanicWithSals,
                                       target = "Survived")

logRegWrapper <- makeImputeWrapper("classif.logreg",
                                   cols = list(Age = imputeMean()))

kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

logRegWithSals <- resample(logRegWrapper, titanicTaskWithSals,
                           resampling = kFold,
                           measures = list(acc, fpr, fnr))

logRegWithSals
The feature extraction paid off! Including Salutation as a predictor improved model performance.