This chapter covers
- Understanding linear and quadratic discriminant analysis
- Building discriminant analysis classifiers to predict which vineyard wines came from
Discriminant analysis is an umbrella term for multiple algorithms that solve classification problems (where we wish to predict a categorical variable) in a similar way. While there are various discriminant analysis algorithms that learn slightly differently, they all find a new representation of the original data that maximizes the separation between the classes.
Recall from chapter 1 that predictor variables are the variables we hope contain the information needed to make predictions on new data. Discriminant function analysis algorithms find a new representation of the predictor variables (which must be continuous) by combining them into new variables that best discriminate the classes. This combination of predictor variables often has the handy benefit of reducing a large set of predictors to a much smaller one. Because of this, despite discriminant analysis algorithms being classification algorithms, they are similar to some of the dimension-reduction algorithms we’ll encounter in part 4 of the book.
Note
In this section, you’ll learn why discriminant analysis is useful and how it works. Imagine that you want to find out if you can predict how patients will respond to a drug based on their gene expression. You measure the expression level of 1,000 genes and record whether patients respond positively, negatively, or not at all to the drug (a three-class classification problem).
A data set that has as many predictor variables as this (and it isn’t rare to find datasets this large) presents a few problems:
- The data is very difficult to explore and plot manually.
- There may be many predictor variables that contain no or very little predictive information.
- We have the curse of dimensionality to contend with (a problem algorithms encounter when trying to learn patterns in high-dimensional data).
In our gene expression example, it would be nearly impossible to plot all 1,000 genes in such a way that we could interpret the differences/similarities between the classes. Instead, we could use discriminant analysis to take all that information and condense it into a manageable number of discriminant functions, each of which is a combination of the original variables. Put another way, discriminant analysis takes the predictor variables as input and finds a new, lower-dimensional representation of those variables that maximizes the separation between the classes. Therefore, while discriminant analysis is a classification technique, it employs dimension reduction to achieve its goal. This is illustrated in figure 5.1.
Note
Due to their dimensionality reduction, discriminant analysis algorithms are popular techniques for classification problems where you have many continuous predictor variables.
Figure 5.1. Discriminant analysis algorithms take the original data and combine continuous predictor variables together into new variables that maximize the separation of the classes.

Discriminant analysis will learn a number of new variables (discriminant functions) equal to whichever of these is smaller:
- The number of classes minus 1
- The number of predictor variables
In the gene expression example, the information contained in those 1,000 predictor variables would be condensed into just 2 variables (three classes minus 1). We could now easily plot these two new variables against each other to see how separable our three classes are!
As you learned in chapter 4, including predictor variables that contain little to no predictive value adds noise, which can negatively impact how the learned model performs. When discriminant analysis algorithms learn their discriminant functions, greater weight or importance is given to predictors that better discriminate the classes. Predictors that contain little or no predictive value are given less weight and contribute less to the final model. To a degree, this lower weighting of uninformative predictors mitigates their impact on model performance.
Note
Despite mitigating the impact of weak predictors, a discriminant analysis model will still tend to perform better after performing feature selection (removing weakly predictive predictors).
The curse of dimensionality is a terrifying-sounding phenomenon that causes problems when working with high-dimensional data (data with many predictor variables). As the feature space (the set of all possible combinations of predictor variables) increases, the data in that space becomes more sparse. Put more plainly, for the same number of cases in a data set, if you increase the feature space, the cases get further apart from each other, and there is more empty space between them. This is demonstrated in figure 5.2 by going from a one-dimensional feature space to a three-dimensional one.
Figure 5.2. Data becomes more sparse as the number of dimensions increases. Two classes are shown in one-, two-, and three-dimensional feature spaces. The dotted lines in the three-dimensional representation are to clarify the position of the points along the z-axis. Note the increasing empty space with increased dimensions.

The consequence of this increase in dimensionality is that an area of the feature space may have very few cases occupying it, so an algorithm is more likely to learn from “exceptional” cases in the data. When algorithms learn from exceptional cases, the result is models that are overfit and have a lot of variance in their predictions. This is the curse of dimensionality.
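If you’d like to see this sparsity for yourself, here is a minimal sketch (not from the book) that generates the same number of random cases in feature spaces of increasing dimension and measures how far, on average, each case is from its nearest neighbor. The function name nearestNeighborDist and the chosen dimensions are purely illustrative.

# A minimal sketch of the sparsity effect: with the number of cases held
# constant, the average distance to a case's nearest neighbor grows as we
# add dimensions to the feature space.
set.seed(123)

nearestNeighborDist <- function(nDims, nCases = 100) {
  points <- matrix(runif(nCases * nDims), ncol = nDims)  # uniform feature space
  d <- as.matrix(dist(points))
  diag(d) <- Inf                     # ignore each case's distance to itself
  mean(apply(d, 1, min))             # mean nearest-neighbor distance
}

sapply(c(1, 2, 3, 10, 100), nearestNeighborDist)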
Note
As the number of predictor variables increases linearly, the number of cases would need to increase exponentially to maintain the same density in the feature space.
This isn’t to say that having more variables is bad, however! For most problems, adding predictors with valuable information improves the predictive accuracy of a model . . . until it doesn’t (until we get diminishing returns). So how do we guard against overfitting due to the curse of dimensionality? By performing feature selection (as we did in chapter 4) to include only variables that have predictive value, and/or by performing dimension reduction. You will learn about a number of specific dimension-reduction algorithms in part 4 of this book, but discriminant analysis actually performs dimension reduction as part of its learning procedure.
Note
The phenomenon of the predictive power of a model increasing as the number of predictor variables increases, but then decreasing again as we continue to add more predictors, is called the Hughes phenomenon, after the statistician G. Hughes.
Discriminant analysis isn’t one algorithm but instead comes in many flavors. I’m going to teach you the two most fundamental and commonly used algorithms:
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
In the next section, you’ll learn how these algorithms work and how they differ. For now, suffice it to say that LDA and QDA learn linear (straight-line) and curved decision boundaries between classes, respectively.
I’ll start by explaining how LDA works, and then I’ll generalize this to QDA. Imagine that we have two predictor variables we are trying to use to separate two classes in our data (see figure 5.3). LDA aims to learn a new representation of the data that separates the centroid of each class, while keeping the within-class variance as low as possible. A centroid is simply the point in the feature space that is the mean of all the predictors (a vector of means, one for each dimension). Then LDA finds a line through the origin that, when the data is projected onto it, simultaneously does the following:
- Maximizes the difference between the class centroids along the line
- Minimizes the within-class variance along the line
To choose this line, the algorithm maximizes the expression in equation 5.1 over all possible axes:
- (x̄1 – x̄2)² / (s1² + s2²)
The numerator is the difference between the class means (x̄1 and x̄2 for the means of class 1 and class 2, respectively), squared to ensure that the value is positive (because we don’t know which will be bigger). The denominator is the sum of the variances of each class along the line (s1² and s2² for the variances of class 1 and class 2, respectively). The intuition behind this is that we want the means of the classes to be as separated as possible, with the scatter/variance within each class as small as possible.
Figure 5.3. Learning a discriminant function in two dimensions. LDA learns a new axis such that, when the data is projected onto it (dashed lines), it maximizes the difference between class means while minimizing intra-class variance.
x̄ and s² are the mean and variance of each class along the new axis, respectively.

Figure 5.4. Constructing a new axis that only maximizes class centroid separation doesn’t fully resolve the classes (left example). Constructing a new axis that maximizes centroid separation while also minimizing variance within each class results in better separation of the classes (right).
x̄ and s² are the mean and variance of each class along the new axis, respectively.

Why not simply find the line that maximizes the separation of the centroids? Because the line that best separates the centroids doesn’t guarantee the best separation of the cases in the different classes. This is illustrated in figure 5.4. In the example on the left, a new axis is drawn that simply maximizes the separation of the centroids of the two classes. When we project the data onto this new axis, the classes are not fully resolved because the relatively high variance means they overlap with each other. In the example on the right, however, the new axis tries to maximize centroid separation while minimizing the variance of each class along that axis. This results in centroids that are slightly closer together, but much smaller variances, such that the cases from the two classes are fully separated.
This new axis is called a discriminant function, and it is a linear combination of the original variables. For example, a discriminant function could be described by this equation:
- DF = –0.5 × var1 + 1.2 × var2 + 0.85 × var3
In this way, the discriminant function (DF) in this equation is a linear combination of variables var1, var2, and var3. The combination is linear because we are simply adding together the contributions from each variable. The values that each variable is multiplied by are called the canonical discriminant function coefficients and weight each variable by how much it contributes to class separation. In other words, variables that contribute most to class separation will have larger absolute canonical DF coefficients (positive or negative). Variables that contain little or no class-separation information will have canonical DF coefficients closer to zero.
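To see what this means in practice, here is a tiny sketch that applies the (made-up) coefficients from the equation above to one hypothetical case; the result is that case’s position along the DF.

# A minimal sketch: a case's discriminant score is just the weighted sum of its
# predictor values, using the canonical DF coefficients.
coefs <- c(var1 = -0.5, var2 = 1.2, var3 = 0.85)   # coefficients from the example DF
case  <- c(var1 =  1.3, var2 = 0.4, var3 = 2.0)    # hypothetical predictor values

sum(coefs * case)                                  # the case's discriminant score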
Linear discriminant analysis vs. principal component analysis
If you’ve come across principal component analysis (PCA) before, you might be wondering how it differs from linear discriminant analysis (LDA). PCA is an unsupervised learning algorithm for dimension reduction, meaning that, unlike LDA, it doesn’t rely on labeled data.
While both algorithms can be used to reduce the dimensionality of the data set, they do so in different ways and to achieve different goals. Whereas LDA creates new axes that maximize class separation, so that we can classify new data using these new axes, PCA creates new axes that maximize the variance of the data projected onto them. Rather than classification, the goal of PCA is to explain as much of the variation and information in the data as possible, using only a small number of new axes. This new, lower-dimensional representation can then be fed into other machine learning algorithms. (If you’re unfamiliar with PCA, don’t worry! You’ll learn about it in depth in chapter 13.)
If you want to reduce the dimensionality of data with labeled class membership, you should typically favor LDA over PCA. If you want to reduce the dimensionality of unlabeled data, you should favor PCA (or one of the many other dimension-reduction algorithms we’ll discuss in part 4 of the book).
Discriminant analysis can handle classification problems with more than two classes. But how does it learn the best axis in this situation? Instead of trying to maximize the separation between class centroids, it maximizes the separation between each class centroid and the grand centroid of the data (the centroid of all the data, ignoring class membership). This is illustrated in figure 5.5, where we have two continuous measurements made on cases from three classes. The class centroids are shown with triangles, and the grand centroid is indicated by a cross.
Figure 5.5. When there are more than two classes, LDA maximizes the distance between each class centroid (triangles) and the grand centroid (cross) while minimizing intra-class variance. Once the first discriminant function is found, a second is constructed that is orthogonal to it. The original data can be plotted against these functions.

LDA first finds the axis that best separates the class centroids from the grand centroid while minimizing the variance of each class along it. Then, LDA constructs a second DF that is orthogonal to the first. This simply means the second DF must be perpendicular to the first (at a right angle in this 2D example).
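If you want to see what these centroids look like for a real data set, here is a minimal sketch using R’s built-in iris data (three classes, four continuous predictors); the object names are just illustrative.

# Class centroids: the per-class mean of every predictor.
classCentroids <- aggregate(iris[, 1:4], by = list(Class = iris$Species), FUN = mean)

# Grand centroid: the mean of every predictor, ignoring class membership.
grandCentroid <- colMeans(iris[, 1:4])

classCentroids
grandCentroid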
Note
The number of DFs will be whichever is smaller: the number of classes minus 1, or the number of predictor variables.
The data is then projected onto these new axes such that each case gets a discriminant score for each function (its value along the new axis). These discriminant scores can be plotted against each other to form a new representation of the original data.
But what’s the big deal? We’ve gone from having two predictor variables to having . . . two predictor variables! In fact, can you see that all we’ve done is center and scale the data, and rotate it around zero? When we only have two variables, discriminant analysis cannot perform any dimension reduction because the number of DFs is the smaller of the number of classes minus 1 and the number of variables (and we only have two variables).
But what about when we have more than two predictor variables? Figure 5.6 shows an example where we have three predictor variables (x, y, and z) and three classes. Just as in figure 5.5, LDA finds the DF that maximizes the separation between each class centroid and the grand centroid, while minimizing the variance along it. This line extends through a three-dimensional space.
Figure 5.6. When there are more than two predictors, the cube represents a feature space with three predictor variables (x, y, and z) and three classes (dotted lines help indicate the position of each case along the z-axis). Discriminant function 1 (DF1) is found, and then DF2, which is orthogonal to DF1, is found. Dotted lines indicate “shadows” of DF1 and DF2 to help show their depth along the z-axis. The data can be projected onto DF1 and DF2.

Next, LDA finds the second DF (which is orthogonal to the first), which also tries to maximize separation while minimizing variance. Because we only have three classes (and the number of DFs is the smaller of the number of classes minus 1 or the number of predictors), we stop at two DFs. By taking the discriminant scores of each case in the data (the values of each case along the two DFs), we can plot our data in only two dimensions.
Note
The first DF always does the best job of separating the classes, followed by the second, the third, and so on.
LDA has taken a three-dimensional data set and combined the information in those three variables into two new variables that maximize the separation between the classes. That’s pretty cool. But if instead of just three predictor variables we had 1,000 (as in the example I used earlier), LDA would still condense all this information into only 2 variables! That’s super cool.
LDA performs well if the data within each class is normally distributed across all the predictor variables, and the classes have similar covariances. Covariance simply means how much one variable increases/decreases when another variable increases/decreases. So LDA assumes that, for each class in the data set, the predictor variables covary with each other by the same amount.
This often isn’t the case, and classes have different covariances. In this situation, QDA tends to perform better than LDA because it doesn’t make this assumption (though it still assumes the data is normally distributed). Instead of learning straight lines that separate the classes, QDA learns curved lines. It is also well suited, therefore, to situations in which classes are best separated by a nonlinear decision boundary. This is illustrated in figure 5.7.
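One quick way to get a feel for whether the equal-covariance assumption is plausible for your own data is to compare the covariance matrices computed separately within each class. Here is a minimal sketch using the built-in iris data (not the wine data we’ll meet later in the chapter).

# Compute the covariance matrix of the predictors within each class; if these
# matrices look very different, QDA may be a better choice than LDA.
by(iris[, 1:4], iris$Species, cov)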
Figure 5.7. Examples of two classes which have equal covariance (the relationship between variable 1 and 2 is the same for both classes) and different covariances. Ovals represent distributions of data within each class. Quadratic and linear DFs (QDF and LDF) are shown. The projection of the classes with different covariances onto each DF is shown.

In the example on the left in the figure, the two classes are normally distributed across both variables and have equal covariances. We can see that the covariances are equal because, for both classes, as variable 1 increases, variable 2 decreases by the same amount. In this situation, LDA and QDA will find similar DFs, although LDA is slightly less prone to overfitting than QDA because it is less flexible.
In the example on the right in the figure, the two classes are normally distributed, but their covariances are different. In this situation, QDA will find a curved DF that, when the data is projected onto it, will tend to do a better job of separating the classes than a linear DF.
Whichever method you’ve chosen, the DFs have been constructed, and you’ve reduced your high-dimensional data into a small number of discriminants. How do LDA and QDA use this information to classify new observations? They use an extremely important statistical theorem called Bayes’ rule.
Bayes’ rule provides us with a way of answering the following question: given the values of the predictor variables for any case in our data, what is the probability of that case belonging to class k? This is written as p(k|x), where k represents membership in class k, and x represents the values of the predictor variables. We would read this as “the probability of belonging to class k, given the data, x.” This is given by Bayes’ rule:
- p(k|x) = ( p(x|k) × p(k) ) / p(x)
Don’t be scared by this! There are only four terms in the equation, and I’m going to walk you through them. You already know p(k|x) is the probability of a case belonging to class k given the data. This is called the posterior probability.
p(x|k) is the same thing, but flipped around: what is the probability of observing this data, given that the case belongs to class k? Put another way: if this case was in class k, what is the likelihood of it having these values of the predictor variables? This is called the likelihood.
p(k) is called the prior probability and is simply the probability of any case belonging to class k. This is the proportion of all cases in the data that belong to class k. For example, if 30% of cases were in class k, p(k) would equal 0.3.
Finally, p(x) is the probability of observing a case with exactly these predictor values in the data set. This is called the evidence. Estimating the evidence is often very difficult (because each case in the data set may have a unique combination of predictor values), and it only serves to make all the posterior probabilities sum to 1. Therefore, we can omit the evidence from the equation and say that
- p(k|x) ∝ p(x|k) × p(k)
where the ∝ symbol means the values on either side of it are proportional to each other instead of equal to each other. In a more digestible way,
- posterior ∝ likelihood × prior
The prior probability for a case (p(k)) is easy to work out: it’s the proportion of cases in the data set that belong to class k. But how do we calculate the likelihood (p(x|k))? The likelihood is calculated by projecting the data onto its DFs and estimating its probability density. The probability density is the relative probability of observing a case with a particular combination of discriminant scores.
Discriminant analysis assumes that the data is normally distributed, so it estimates the probability density by fitting a normal distribution to each class across each DF. The center of each normal distribution is the class centroid, and its standard deviation is one unit on the discriminant axis. This is illustrated in figure 5.8 for a single DF and for two DFs (the same thing happens in more than two dimensions but is difficult to visualize). You can see that cases near the class centroid along the discriminant axes have a high probability density for that class, and cases far away have a lower probability density.
Figure 5.8. The probability density of each class is assumed to be normally distributed, where the center of each distribution is the centroid of the class. This is shown for one DF (for classes k and j) and for two.

Once the probability density is estimated for a case for a given class, it can be passed into the equation:
- posterior ∝ likelihood × prior
The posterior probability is estimated for each class, and the class that has the highest probability is what the case is classified as.
Note
The prior probability (the proportion of cases in that class) is important because if the classes are severely imbalanced, then despite a case being far from the centroid of a class, the case could be more likely to belong to that class simply because there are so many more cases in it.
Bayes’ rule is very important in statistics and machine learning. Don’t worry if you don’t quite understand it yet; that’s by design. I want to introduce you to it gently now, and we’ll cover it in more depth in chapter 6.
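If it helps to see the arithmetic, here is a minimal, hypothetical sketch of the classification step along a single DF. The class centroids, priors, and the new case’s discriminant score are all made-up numbers, not values from any real model.

# Posterior is proportional to likelihood x prior, for two classes k and j
# along one DF. All numbers below are made up for illustration.
centroids <- c(k = -2, j = 1.5)   # class centroids along the DF
priors    <- c(k = 0.3, j = 0.7)  # proportion of cases in each class
score     <- 0.4                  # a new case's discriminant score

# Likelihood: normal density centered on each class centroid (sd = 1 on the DF axis)
likelihood <- sapply(centroids, function(m) dnorm(score, mean = m, sd = 1))

posterior <- likelihood * priors  # proportional to the posterior probability
posterior / sum(posterior)        # normalize so the posteriors sum to 1
names(which.max(posterior))       # the predicted class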
Now that you know how discriminant analysis works, you’re going to build your first LDA model. If you haven’t already, load the mlr and tidyverse packages:
library(mlr)

library(tidyverse)
In this section, you’ll learn how to build and evaluate the performance of linear and quadratic discriminant analysis models. Imagine that you’re a detective in a murder mystery. A local wine producer, Ronald Fisher, was poisoned at a dinner party when someone replaced the wine in the carafe with wine poisoned with arsenic.
Three other (rival) wine producers were at the party and are your prime suspects. If you can trace the wine to one of those three vineyards, you’ll find your murderer. As luck would have it, you have access to some previous chemical analysis of the wines from each of the vineyards, and you order an analysis of the poisoned carafe at the scene of the crime. Your task is to build a model that will tell you which vineyard the wine with the arsenic came from and, therefore, the guilty party.
Let’s load the wine data built into the HDclassif package (after installing it), convert it into a tibble, and explore it a little. We have a tibble containing 178 cases and 14 variables of measurements made on various wine bottles.
Listing 5.1. Loading and exploring the wine dataset
install.packages("HDclassif") data(wine, package = "HDclassif") wineTib <- as_tibble(wine) wineTib # A tibble: 178 x 14 class V1 V2 V3 V4 V5 V6 V7 V8 V9 <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> 1 1 14.2 1.71 2.43 15.6 127 2.8 3.06 0.28 2.29 2 1 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 3 1 13.2 2.36 2.67 18.6 101 2.8 3.24 0.3 2.81 4 1 14.4 1.95 2.5 16.8 113 3.85 3.49 0.24 2.18 5 1 13.2 2.59 2.87 21 118 2.8 2.69 0.39 1.82 6 1 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 7 1 14.4 1.87 2.45 14.6 96 2.5 2.52 0.3 1.98 8 1 14.1 2.15 2.61 17.6 121 2.6 2.51 0.31 1.25 9 1 14.8 1.64 2.17 14 97 2.8 2.98 0.290 1.98 10 1 13.9 1.35 2.27 16 98 2.98 3.15 0.22 1.85 # ... with 168 more rows, and 4 more variables: V10 <dbl>, # V11 <dbl>, V12 <dbl>, V13 <int>
Often, as data scientists, we receive data that is messy or not well curated. In this case, the names of the variables are missing! We could continue working with V1, V2, and so on, but it would be hard to keep track of which variable is which. So we’re going to manually add the variable names. Who said the life of a data scientist was glamorous? Then, we’ll convert the class variable to a factor.
Listing 5.2. Cleaning the dataset
names(wineTib) <- c("Class", "Alco", "Malic", "Ash", "Alk", "Mag",
                    "Phe", "Flav", "Non_flav", "Proan", "Col", "Hue",
                    "OD", "Prol")

wineTib$Class <- as.factor(wineTib$Class)

wineTib

# A tibble: 178 x 14
   Class  Alco Malic   Ash   Alk   Mag   Phe  Flav Non_flav Proan
   <fct> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>    <dbl> <dbl>
 1 1      14.2  1.71  2.43  15.6   127  2.8   3.06    0.28   2.29
 2 1      13.2  1.78  2.14  11.2   100  2.65  2.76    0.26   1.28
 3 1      13.2  2.36  2.67  18.6   101  2.8   3.24    0.3    2.81
 4 1      14.4  1.95  2.5   16.8   113  3.85  3.49    0.24   2.18
 5 1      13.2  2.59  2.87  21     118  2.8   2.69    0.39   1.82
 6 1      14.2  1.76  2.45  15.2   112  3.27  3.39    0.34   1.97
 7 1      14.4  1.87  2.45  14.6    96  2.5   2.52    0.3    1.98
 8 1      14.1  2.15  2.61  17.6   121  2.6   2.51    0.31   1.25
 9 1      14.8  1.64  2.17  14      97  2.8   2.98    0.290  1.98
10 1      13.9  1.35  2.27  16      98  2.98  3.15    0.22   1.85
# ... with 168 more rows, and 4 more variables: Col <dbl>,
#   Hue <dbl>, OD <dbl>, Prol <int>
That’s much better. We can see that we have 13 continuous measurements made on 178 bottles of wine, where each measurement is the amount of a different compound/element in the wine. We also have a single categorical variable, Class, which tells us which vineyard the bottle comes from.
Note
Lots of people consider it good form to keep variable names lowercase. I don’t mind so much, as long as my style is consistent. Therefore, notice that I changed the name of the grouping variable class to Class.
Let’s plot the data to get an idea of how the compounds vary between the vineyards. As for the Titanic data set in chapter 4, we’re going to gather the data into an untidy format so we can facet by each of the variables.
Listing 5.3. Creating an untidy tibble for plotting
wineUntidy <- gather(wineTib, "Variable", "Value", -Class)

ggplot(wineUntidy, aes(Class, Value)) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot() +
  theme_bw()
The resulting plot is shown in figure 5.9.
Figure 5.9. Box and whiskers plots of each continuous variable in the data against vineyard number. For the box and whiskers, the thick horizontal line represents the median, the box represents the interquartile range (IQR), the whiskers represent the Tukey range (1.5 times the IQR above and below the quartiles), and the dots represent data outside of the Tukey range.

A data scientist (and detective working the case) looking at this data would jump for joy! Look at how many obvious differences there are between wines from the three different vineyards. We should easily be able to build a well-performing classification model because the classes are so separable.
Let’s define our task and learner, and build a model as usual. This time, we supply "classif.lda" as the argument to makeLearner() to specify that we’re going to use LDA.
Tip
LDA and QDA have no hyperparameters to tune and are therefore said to have a closed-form solution. In other words, all the information LDA and QDA need is in the data. Their performance is also unaffected by variables being on different scales: they will give the same result whether the data is scaled or not!
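As a quick check of that last claim, here is a minimal sketch using MASS::lda directly (the implementation behind mlr’s "classif.lda" learner) on R’s built-in iris data, comparing the predicted classes with and without scaling the predictors. This is an illustrative aside, not part of the wine analysis.

# A minimal sketch: LDA's predicted classes are the same whether or not the
# predictors are scaled first.
library(MASS)

rawModel    <- lda(Species ~ ., data = iris)
scaledModel <- lda(Species ~ ., data = data.frame(scale(iris[, 1:4]),
                                                  Species = iris$Species))

identical(predict(rawModel)$class, predict(scaledModel)$class)  # TRUE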
Listing 5.4. Creating the task and learner, and training the model
wineTask <- makeClassifTask(data = wineTib, target = "Class")

lda <- makeLearner("classif.lda")

ldaModel <- train(lda, wineTask)
Note
Recall from chapter 3 that the makeClassifTask() function warns us that our data is a tibble and not a pure data.frame. This warning can be safely ignored.
Let’s extract the model information using the getLearnerModel() function, and get the DF values for each case using the predict() function. By printing head(ldaPreds), we can see that the model has learned two DFs, LD1 and LD2, and that the predict() function has indeed returned the values of these functions for each case in our wineTib data set.
Listing 5.5. Extracting DF values for each case
ldaModelData <- getLearnerModel(ldaModel)

ldaPreds <- predict(ldaModelData)$x

head(ldaPreds)
        LD1       LD2
1 -4.700244 1.9791383
2 -4.301958 1.1704129
3 -3.420720 1.4291014
4 -4.205754 4.0028715
5 -1.509982 0.4512239
6 -4.518689 3.2131376
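If you’re curious how much each of the 13 predictors contributes to each DF (the canonical discriminant function coefficients discussed earlier), they are stored in the scaling component of the underlying MASS model object we just extracted; a quick sketch:

# One row per predictor, one column per DF; larger absolute values indicate
# predictors that contribute more to that discriminant function.
ldaModelData$scaling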
To visualize how well these two learned DFs separate the bottles of wine from the three vineyards, let’s plot them against each other. We start by piping the wineTib data set into a mutate() call where we create a new column for each of the DFs. We pipe this mutated tibble into a ggplot() call and set LD1, LD2, and Class as the x, y, and color aesthetics, respectively. Finally, we add a geom_point() layer to add dots, and a stat_ellipse() layer to draw 95% confidence ellipses around each class.
Listing 5.6. Plotting the DF values against each other
wineTib %>%
  mutate(LD1 = ldaPreds[, 1],
         LD2 = ldaPreds[, 2]) %>%
  ggplot(aes(LD1, LD2, col = Class)) +
  geom_point() +
  stat_ellipse() +
  theme_bw()
The resulting plot is shown in figure 5.10.
Figure 5.10. Plotting the DFs against each other. The values for LD1 and LD2 for each case are plotted against each other, shaded by their class.

Looking good. Can you see that LDA has reduced our 13 predictor variables into just two DFs that do an excellent job of separating the wines from each of the vineyards?
Next, let’s use exactly the same procedure to build a QDA model.
Listing 5.7. Creating the QDA learner and training the model
qda <- makeLearner("classif.qda")

qdaModel <- train(qda, wineTask)
Note
Sadly, it isn’t easy to extract the DFs from the implementation of QDA that mlr uses, to plot them as we did for LDA.
Now, let’s cross-validate our LDA and QDA models together to estimate how they’ll perform on new data.
Listing 5.8. Cross-validating the LDA and QDA models
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)

ldaCV <- resample(learner = lda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))

qdaCV <- resample(learner = qda, task = wineTask, resampling = kFold,
                  measures = list(mmce, acc))

ldaCV$aggr

mmce.test.mean  acc.test.mean
    0.01177012     0.98822988

qdaCV$aggr

mmce.test.mean  acc.test.mean
   0.007977296    0.992022704
Great! Our LDA model correctly classified 98.8% of wine bottles on average. There isn’t much room for improvement here, but our QDA model managed to correctly classify 99.2% of cases! Let’s also look at the confusion matrices (interpreting them is part of the chapter’s exercises):
calculateConfusionMatrix(ldaCV$pred, relative = TRUE)

Relative confusion matrix (normalized by row/column):
        predicted
true     1           2           3           -err.-
  1      1e+00/1e+00 3e-04/3e-04 0e+00/0e+00 3e-04
  2      8e-03/1e-02 1e+00/1e+00 1e-02/2e-02 2e-02
  3      0e+00/0e+00 1e-02/7e-03 1e+00/1e+00 1e-02
  -err.-       0.010       0.007       0.021  0.01

Absolute confusion matrix:
        predicted
true       1    2    3 -err.-
  1     2949    1    0      1
  2       29 3470   51     80
  3        0   23 2377     23
  -err.-  29   24   51    104

calculateConfusionMatrix(qdaCV$pred, relative = TRUE)

Relative confusion matrix (normalized by row/column):
        predicted
true     1           2           3           -err.-
  1      0.993/0.984 0.007/0.006 0.000/0.000 0.007
  2      0.014/0.016 0.986/0.991 0.000/0.000 0.014
  3      0.000/0.000 0.005/0.003 0.995/1.000 0.005
  -err.-       0.016       0.009       0.000 0.009

Absolute confusion matrix:
        predicted
true       1    2    3 -err.-
  1     2930   20    0     20
  2       49 3501    0     49
  3        0   12 2388     12
  -err.-  49   32    0     81
Now, detective, the chemical analysis of the poisoned wine is in. Let’s use our QDA model to predict which vineyard it came from:
poisoned <- tibble(Alco = 13, Malic = 2, Ash = 2.2, Alk = 19, Mag = 100,
                   Phe = 2.3, Flav = 2.5, Non_flav = 0.35, Proan = 1.7,
                   Col = 4, Hue = 1.1, OD = 3, Prol = 750)

predict(qdaModel, newdata = poisoned)

Prediction: 1 observations
predict.type: response
threshold:
time: 0.00
  response
1        1
The model predicts that the poisoned bottle came from vineyard 1. Time to go and make an arrest!
Ronald Fisher
You may be happy to know that, in the real world, Ronald Fisher wasn’t poisoned at a dinner party. This is, perhaps, unfortunate for you, because Sir Ronald Fisher (1890-1962) was a famous biostatistician who went on to be called the father of statistics. Fisher developed many statistical tools and concepts we use today, including discriminant analysis. In fact, linear discriminant analysis is commonly confused with Fisher’s discriminant analysis, the original form of discriminant analysis that Fisher developed (but which is slightly different).
However, Fisher was also a proponent of eugenics, the belief that some races are superior to others. In fact, he shared his opinion in a 1952 UNESCO statement called “The Race Question,” in which he said that “the groups of mankind differ profoundly in their innate capacity for intellectual and emotional development” (https://unesdoc.unesco.org/ark:/48223/pf0000073351). Perhaps now you don’t feel so sorry for our murder mystery victim.
While it often isn’t easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether LDA and QDA will perform well for you.
The strengths of the LDA and QDA algorithms are as follows:
- They can reduce a high-dimensional feature space into a much more manageable number of dimensions.
- They can be used for classification or as a preprocessing (dimension reduction) technique for other classification algorithms that may perform better on the data set.
- QDA can learn curved decision boundaries between classes (this isn’t the case for LDA).
The weaknesses of the LDA and QDA algorithms are these:
- They can only handle continuous predictors (although recoding a categorical variable as numeric may help in some cases).
- They assume the data is normally distributed across the predictors. If the data is not, performance will suffer.
- LDA can only learn linear decision boundaries between classes (this isn’t the case for QDA).
- LDA assumes the classes have equal covariances, and performance will suffer if this assumption isn’t met (QDA does not make this assumption).
- QDA is more flexible than LDA and so can be more prone to overfitting.
Exercise 1
Interpret the confusion matrices shown in the previous section.
- Which model is better at identifying wines from vineyard 3?
- Does our LDA model misclassify more wines from vineyard 2 as being from vineyard 1 or vineyard 3?
Exercise 2
Extract the discriminant scores from our LDA model, and use only these as the predictors for a kNN model (including tuning k). Experiment with your own cross-validation strategy. Take a look at chapter 3 if you need a refresher on training a kNN model.
- Discriminant analysis is a supervised learning algorithm that projects the data onto a lower-dimensional representation to create discriminant functions.
- Discriminant functions are linear combinations of the original (continuous) variables that maximize the separation of class centroids while minimizing the variance of each class along them.
- Discriminant analysis comes in many flavors, the most fundamental of which are LDA and QDA.
- LDA learns linear decision boundaries between classes and assumes that classes are normally distributed and have equal covariances.
- QDA can learn curved decision boundaries between classes and assumes that each class is normally distributed, but does not assume equal covariances.
- The number of discriminant functions is the smaller of the number of classes minus 1, or the number of predictor variables.
- Class prediction uses Bayes’ rule to estimate the posterior probability of a case belonging to each of the classes.
- Interpret the confusion matrices:
- Our QDA model is better at identifying wines from vineyard 3. It misclassified 12 as from vineyard 2, whereas the LDA model misclassified 23.
- Our LDA model misclassifies more cases from vineyard 2 as from vineyard 3 than as from vineyard 1.
- Use the discriminant scores from the LDA as predictors in a kNN model:
# CREATE TASK ----
wineDiscr <- wineTib %>%
  mutate(LD1 = ldaPreds[, 1], LD2 = ldaPreds[, 2]) %>%
  select(Class, LD1, LD2)

wineDiscrTask <- makeClassifTask(data = wineDiscr, target = "Class")

# TUNE K ----
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

gridSearch <- makeTuneControlGrid()

cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

tunedK <- tuneParams("classif.knn", task = wineDiscrTask,
                     resampling = cvForTuning,
                     par.set = knnParamSpace, control = gridSearch)

knnTuningData <- generateHyperParsEffectData(tunedK)

plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
  theme_bw()

# CROSS-VALIDATE MODEL-BUILDING PROCESS ----
inner <- makeResampleDesc("CV")

outer <- makeResampleDesc("CV", iters = 10)

knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
                              par.set = knnParamSpace, control = gridSearch)

cvWithTuning <- resample(knnWrapper, wineDiscrTask, resampling = outer)

cvWithTuning

# TRAINING FINAL MODEL WITH TUNED K ----
tunedKnn <- setHyperPars(makeLearner("classif.knn"), par.vals = tunedK$x)

tunedKnnModel <- train(tunedKnn, wineDiscrTask)