Chapter 13. Maximizing variance with principal component analysis


This chapter covers

  • Understanding dimension reduction
  • Dealing with high dimensionality and collinearity
  • Using principal component analysis to reduce dimensionality

Dimension reduction comprises a number of approaches that turn a set of (potentially many) variables into a smaller number of variables that retain as much of the original, multidimensional information as possible. We sometimes want to reduce the number of dimensions we’re working with in a dataset, to help us visualize the relationships in the data or to avoid the strange phenomena that occur in high dimensions. So dimension reduction is a critical skill to add to your machine learning toolbox!

Our first step into dimension reduction brings us to a very well-known and useful technique: principal component analysis (PCA). PCA, which has been around since the turn of the twentieth century, creates new variables that are linear combinations of the original variables. In this way, PCA is similar to discriminant analysis, which we encountered in chapter 5; but instead of constructing new variables that separate classes, PCA constructs new variables that explain most of the variation/information in the data. In fact, there are no labels for PCA, because it is unsupervised and learns patterns in the data itself without a ground truth. We can then use the two or three of these new variables that capture most of the information as inputs to regression, classification, or clustering algorithms, as well as use them to better understand how the variables in our data are related to each other.

Note

Rvu tifrs asilorhtic peexlam xl dimension reduction awc c mcu dwjr wer osniesimnd. Boernth tmle lv dimension reduction rgrz kw croetnnue jn htk dyila ielvs zj ruo oscmposienr lk audio jren aotsmrf xjof .md3 ynz .zlfz.

Cdv mlr package edosn’r pock z msioneind-rdeontciu lscas xl tasks, snb jr odnse’r ysex z sascl xl omeiisndn-ctdnurieo learners (gosmetihn kvjf dimred.[ALGORITHM], J spuspeo). FTT zj rob bnfx nednismio-ridoucetn algorithm epapwdr dp tmf grsr xw asn cleiund cc z gieoprsnercsp krqc (ojof oitimptuna te feature selection). Jn jexw lk gjra, wo’kt gingo vr vleea gxr yeastf kl kru mlr package etl yrx mvjr ebnig.

By the end of this chapter, I hope you'll understand what dimension reduction is and why we sometimes need it. I will show you how the PCA algorithm works and how you can use it to reduce the dimensions of a dataset to help identify counterfeit banknotes.


13.1. Why dimension reduction?

In this section, I'll show you the main reasons for applying dimension reduction:

  • Making it easier to visualize a dataset with many variables
  • Mitigating the curse of dimensionality
  • Mitigating the effects of collinearity

I'll expand on what the curse of dimensionality and collinearity are and why they cause problems for machine learning, as well as why dimension reduction can reduce the impact of both when we're searching for patterns in data.

13.1.1. Visualizing high-dimensional data

When starting an exploratory analysis, one of the first things you should always do is plot your data. It's important that we, as data scientists, have an intuitive understanding of the structure of our data, the relationships between variables, and how the data is distributed. But what if we have a dataset containing thousands of variables? Where do we even start? Plotting each of these variables against each other isn't really an option anymore, so how can we get a feel for the overall structure in our data? Well, we can reduce the dimensions down to a more manageable number, and plot these instead. We won't get all the information of the original dataset when doing this, but it will help us identify patterns in our data, like clusters of cases that might suggest a grouping structure in the data.

13.1.2. Consequences of the curse of dimensionality

In chapter 5, I discussed the curse of dimensionality. This slightly dramatic-sounding phenomenon describes a set of challenges we encounter when trying to identify patterns in a dataset with many variables. One aspect of the curse of dimensionality is that for a fixed number of cases, as we increase the number of dimensions in the dataset (increase the feature space), the cases get further and further apart. To reiterate this point in figure 13.1, I've reproduced figure 5.2 from chapter 5. In this situation, the data is said to become sparse. Many machine learning algorithms struggle to learn patterns from sparse data and may start to learn from the noise in the dataset instead.

Figure 13.1. Data becomes more sparse as the number of dimensions increases. Two classes are shown in one-, two-, and three-dimensional feature spaces. The dotted lines in the three-dimensional representation are to clarify the position of the points along the z-axis. Note the increasing empty space with increased dimensions.

Another aspect of the curse of dimensionality is that as the number of dimensions increases, the distances between the cases begin to converge to a single value. Put another way, for a particular case, the ratio between the distance to its nearest neighbor and its farthest neighbor tends toward 1 in high dimensions. This presents a challenge to algorithms that rely on measuring distances (particularly Euclidean distance), such as k-nearest neighbors, because distance starts to become meaningless.
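
This tendency is easy to demonstrate with a small simulation. The sketch below is my own illustration (the function name and the uniform sampling scheme are arbitrary choices): it draws random points in an increasing number of dimensions and computes the ratio between the nearest and farthest Euclidean distances from one point to all the others. The ratio creeps toward 1 as the number of dimensions grows.

set.seed(123)

ratioNearFar <- function(p, n = 100) {
  X <- matrix(runif(n * p), ncol = p)  # n random points in a p-dimensional unit cube
  d <- as.matrix(dist(X))[1, -1]       # Euclidean distances from the first point to the rest
  min(d) / max(d)                      # nearest distance divided by farthest distance
}

sapply(c(2, 10, 100, 1000), ratioNearFar)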

Finally, it's quite common to encounter situations in which we have many more variables than we have cases in the data. This is referred to as the p >> n problem, where p is the number of variables and n is the number of cases. This, again, results in sparse regions of the feature space, making it difficult for many algorithms to converge on an optimal solution.

13.1.3. Consequences of collinearity

Variables in a dataset often have varying degrees of correlation with each other. Sometimes we may have two variables that correlate very highly with each other, such that one basically contains the information of the other (say, with a Pearson correlation coefficient > 0.9). In such situations, these variables are said to be collinear or to exhibit collinearity. An example of two variables that might be collinear are annual income and the maximum amount of money a bank is willing to loan someone; you could probably predict one from the other with a high degree of accuracy.

Tip

When more than two variables are collinear, we say we have multicollinearity in our dataset. When one variable can be perfectly predicted from another variable or combination of variables, we are said to have perfect collinearity.

So what's the problem with collinearity? Well, it depends on the goal of your analysis and what algorithms you are using. The most commonly encountered negative impact of collinearity is on the parameter estimates of linear regression models.

Prx’a hcz eqp’tx yingtr rv icdrept rxd vauel kl shuoes daseb nv vru rebmnu le oosbremd, brv hxc le rvd hosue jn arsye, nzy rky vps lk dvr oeush jn htnoms, gnusi linear regression. Coq uxc variables tvz pctyleefr reionllca wrjq vuzs ohtre, auecbse reteh’z nk rimoofnaint neanotcdi jn vnk rruz aj knr neaodctin nj krp rheot. Axd reaemprta tmesasiet (eslspo) tvl xqr wre predictor variables deriscbe rxq resahilptnio nteebwe ouzz pcoeitrdr hsn rdv emtucoo laiaebvr, frtea tgnnicacuo etl rxq ffecet le qkr ohert ievalarb. Jl qurx predictor variables urtecpa rmcv kl (et ffz el, jn raju ocaa) xrd vmsz itoaromfnni btoua obr eocutom iaaervbl, bnxr onpw ow cunaotc lkt rpx tffece lk kkn, etrhe wfjf vp vn oarniionmft flkr tlk grv trohe oen kr nbcruetoti. Ya s tulsre, rkd eapmreatr temisesat tlk eyur predictors fwjf ky melaslr nqrz buro hdsoul vy (ucseeba zvba awz mteadseti efart uncogcnati lvt rqk cffete lx qor hrteo).

So collinearity makes the parameter estimates more variable and more sensitive to small changes in the data. This is mostly a problem if you're interested in interpreting and making inferences about the parameter estimates. If all you care about is predictive accuracy, and not interpreting the model parameters, then collinearity may not be a problem for you at all.

It's worth mentioning, however, that collinearity is particularly problematic when working with the naive Bayes algorithm you learned about in chapter 6. Recall that the "naive" in naive Bayes refers to the fact that this algorithm assumes independence between predictors. This assumption is often invalid in the real world, but naive Bayes is usually resistant to small correlations between predictor variables. When predictors are highly correlated, however, the predictive performance of naive Bayes will suffer considerably, though this is usually easy to identify when you cross-validate your model.

13.1.4. Mitigating the curse of dimensionality and collinearity by using dimension reduction

How can you mitigate the impacts of the curse of dimensionality and/or collinearity on the predictive performance of your models? Why, with dimension reduction, of course! If you can compress most of the information from 100 variables into just 2 or 3, then the problems of data sparsity and near-equal distances disappear. If you turn two collinear variables into one new variable that captures all the information of both, then the problem of dependence between the variables disappears.

But we've already encountered another set of techniques that can mitigate the curse of dimensionality and collinearity: regularization. As we saw in chapter 11, regularization can be used to shrink the parameter estimates and even completely remove weakly contributing predictors. Regularization can therefore reduce sparsity resulting from the curse of dimensionality, and remove variables that are collinear with others.

Note

For most people, tackling the curse of dimensionality is a more important use of dimension reduction than reducing collinearity.


13.2. What is principal component analysis?

In this section, I'll show you what PCA is, how it works, and why it's useful. Imagine that we measure two variables on seven people, and we want to compress this information down into a single variable using PCA. The first thing we need to do is center the variables by subtracting each variable's mean from its corresponding value for each case.

In addition to centering our variables, we can also scale them by dividing each variable by its standard deviation. This is important if the variables are measured on different scales; otherwise, those on large scales will be weighted more heavily. If our variables are on similar scales, this standardization step isn't necessary.
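
As a minimal illustration of these two steps (the toy vector here is just for demonstration), centering subtracts the mean and scaling divides by the standard deviation; base R's scale() function does both in one go.

x <- c(2, 4, 6, 8, 10)

xCentered <- x - mean(x)        # centering: the mean of xCentered is 0
xScaled   <- xCentered / sd(x)  # scaling: the standard deviation of xScaled is 1

scale(x, center = TRUE, scale = TRUE)  # the same two steps in a single call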

With our centered and (possibly) scaled data, PCA now finds a new axis that satisfies two conditions:

  • The axis passes through the origin.
  • The axis maximizes the variance of the data along itself.

The new axis that satisfies these conditions is called the first principal axis. When the data is projected onto this principal axis (moved at a right angle onto the nearest point on the axis), this new variable is called the first principal component, often abbreviated PC1. This process of centering the data and finding PC1 is shown in figure 13.2.

Figure 13.2. The first thing we do before applying the PCA algorithm is (usually) to center the data by subtracting the mean of each variable for each case. This places the origin at the center of the data. The first principal axis is then found: it is the axis that passes through the origin and maximizes the variance of the data when projected onto it.

The first principal axis is the line through the origin of the data that, once the data is projected onto it, has the greatest variance along it and is said to "maximize the variance." This is illustrated in figure 13.3. This axis is chosen because if it is the line that accounts for the majority of the variance in the data, then it is also the line that accounts for the majority of the information in the data.

Figure 13.3. What it means for the first principal axis to “maximize the variance.” The left-side plot shows a sub-optimal candidate principal axis. The right-side plot shows the optimal candidate principal axis. The data is shown projected onto each principal axis below the respective plots. The variance of the data along the axis is greatest on the right side.

This new principal axis is actually a linear combination of the predictor variables. Look again at figure 13.3. The first principal axis extends through the two clusters of cases to form a negative slope between var 1 and var 2. Just like in linear regression, we can express this line in terms of how one variable changes when the other variable changes (as the line passes through the origin, the intercept is 0). Take a look at figure 13.4, where I've highlighted how much var 2 changes when var 1 increases by two units along the principal axis. For every two-unit change in var 1, var 2 decreases by 0.68 units.

It's useful to have a standardized way of describing the slope through our feature space. In linear regression, we can define a slope in terms of how much y changes with a one-unit increase in x. But we often don't have any notion of predictor variables and outcome variables when performing PCA: we just have a set of variables we wish to compress. Instead, we define the principal axis in terms of how far we need to go along each variable (the x- and y-axes in the two-dimensional example in figure 13.4) so that the distance from the origin is equal to 1.

Have another look at figure 13.4. We're trying to calculate the lengths of sides b and c of the triangle when length a is equal to 1. This will then tell us how far along var 1 and var 2 we need to go, to be one unit away from the origin along the principal axis. How do we calculate the length of a? Why, our good friend Pythagoras's theorem can help! By applying a² = b² + c², we can work out that if we go along var 1 2.00 units and along var 2 –0.68 units, the length of a is equal to 2.11. To normalize this such that the length of a is equal to 1, we simply divide all three sides of the triangle by 2.11. We now define our principal axis as follows: for every 0.95 unit increase in var 1, we decrease along var 2 by 0.32.
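
The arithmetic in this paragraph is easy to check in R (a quick sketch using the values read off figure 13.4):

steps <- c(2.00, -0.68)    # how far we go along var 1 and var 2

a <- sqrt(sum(steps ^ 2))  # Pythagoras: the length of the hypotenuse
a                          # 2.11

steps / a                  # 0.95 and -0.32: the normalized distances (the eigenvector)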

Figure 13.4. Calculating the eigenvector for a principal component. The distances along each variable are scaled so that they mark a point that is one unit along the principal axis away from the origin. We can illustrate this graphically by taking a triangle defined by the change in one variable over the change in the other variable, and using Pythagoras’s theorem to find the distance from the origin to divide by.

Note that this transformation doesn't change the direction of the line; all it does is normalize everything so that the distance from the origin is 1. These normalized distances along each variable that define a principal axis are called an eigenvector. The formula for the principal component that results from the principal axis is therefore

equation 13.1.

PC1 = 0.95 × var 1 + (−0.32) × var 2

So for any particular case, we center it (subtract the mean of each variable), take its value of var 1 and multiply by 0.95, and then add the result to the value of var 2 multiplied by –0.32, to get this case's value of PC1. The value of a principal component for a case is called its component score.
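
Here is a small sketch of this calculation, using prcomp() (introduced properly in section 13.3) on two made-up variables: the manual "center, then weight by the eigenvector" recipe reproduces the component scores that prcomp() returns, and the variance of those scores is the eigenvalue discussed next.

set.seed(123)

var1 <- rnorm(7)
var2 <- -0.7 * var1 + rnorm(7, sd = 0.3)  # two correlated toy variables
toy  <- data.frame(var1, var2)

toyPca <- prcomp(toy, center = TRUE, scale = FALSE)

centered  <- scale(toy, center = TRUE, scale = FALSE)  # center each variable
manualPc1 <- centered %*% toyPca$rotation[, 1]         # weight by the PC1 eigenvector

all.equal(as.numeric(manualPc1), as.numeric(toyPca$x[, 1]))  # TRUE: same component scores

var(toyPca$x[, 1])  # the variance of the scores along PC1 ...
toyPca$sdev[1] ^ 2  # ... equals the first eigenvalue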

Once we've found the first principal axis, we need to find the next one. PCA will find as many principal axes as there are variables, or one fewer than the number of cases in the dataset, whichever is smaller. The first principal component is always the one that explains most of the variance in the data. Consequently, if we calculate the variance of the cases along each principal component, PC1 will have the largest value. The variance of the data along a particular principal component is called its eigenvalue.

Note

If eigenvectors define the direction of a principal axis through the original feature space, eigenvalues define the magnitude of spread along the principal axis.

Once the first principal axis is found, the next one must be orthogonal to it. When we have only two dimensions in our dataset, this means the second principal axis will form a right angle with the first. The example in figure 13.5 shows a cloud of cases being projected onto their first and second principal axes. When converting only two variables into two principal components, plotting the component scores of the data amounts to rotating the data around the origin.

Figure 13.5. In a two-dimensional feature space, the first principal axis is the one that maximizes the variance (as it always is), and the second principal axis is orthogonal (at a right angle) to the first. In this situation, plotting the principal components simply results in a rotation of the data.
Note

This imposed orthogonality is one of the reasons PCA is good at removing collinearity between variables: it can turn a set of correlated variables into a set of uncorrelated (orthogonal) variables.
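
We can check this with a one-liner on any all-numeric dataset; here is a hedged example using R's built-in mtcars data (chosen purely for convenience): the correlation matrix of the component scores is, up to rounding error, the identity matrix.

mtcarsPca <- prcomp(mtcars, center = TRUE, scale = TRUE)

round(cor(mtcarsPca$x), 10)  # all off-diagonal correlations are zero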

After rotating the data in figure 13.5, the majority of the variance in the data is explained by PC1, and PC2 is orthogonal to it. But PCA is usually used to reduce dimensions, not just rotate bivariate data, so how are the principal axes calculated when we have a higher-dimensional space? Take a look at figure 13.6. We have a cloud of data in three dimensions that is closest to us at the bottom right of the feature space and gets further from us at the top left (notice that the points get smaller). The first principal axis is still the one that explains most of the variance in the data, but this time it extends through three-dimensional space (from front right to top left). The same process occurs in a feature space that has more than three dimensions, but it's difficult to visualize that!

The second principal axis is still orthogonal to the first, but as we now have three dimensions to play around with, it is free to rotate around the first in a plane that still maintains a right angle between them. I've illustrated this rotational freedom with a circle around the origin that gets fainter, the further away from us it is. The second principal axis is the one that is orthogonal to the first but explains the majority of the remaining variance in the data. The third principal axis must be orthogonal to the preceding axes (at right angles to both of them) and therefore has no freedom to move. The first principal component always explains the most variance, followed by the second, the third, and so on.

Figure 13.6. In a three-dimensional feature space, the second principal axis is still orthogonal to the first principal axis, but it has freedom to rotate around the first (indicated by the ellipse with arrows in the left-side plot) until it maximizes the remaining variance. The third principal axis is orthogonal to the first and second principal axes and so has no freedom to rotate; it explains the least amount of variance.

At this point you might be asking, if PCA calculates principal components for the smaller of the number of variables or the number of cases minus one, how exactly does it reduce the number of dimensions? Well, simply calculating the principal components isn't dimension reduction at all! Dimension reduction comes into it regarding how many of the principal components we decide to keep in the remainder of our analysis. In the example in figure 13.6, we have three principal components, but the first two account for 79% + 12% = 91% of the variation in the dataset. If these two principal components capture enough of the information in the original dataset to make the dimension reduction worthwhile (perhaps we get better results from a clustering or classification algorithm), then we can happily discard the remaining 9% of the information. Later in the chapter, I'll show you some ways to decide how many principal components to keep.


13.3. Building your first PCA model

In this section, we'll turn the PCA theory we just covered into skills by reducing the dimensions of a dataset, using PCA. Imagine that you work for the Swiss Federal Department of Finance (due to your love of money, chocolate, cheese, and political neutrality). The department believes that a large number of counterfeit Swiss banknotes are in circulation, and it's your job to find a way of identifying them. Nobody has looked into this before, and there is no labeled data to go on. So you ask 200 of your colleagues to each give you a banknote (you promise to give them back), and you measure the dimensions of each note. You hope that there will be some discrepancies between genuine notes and counterfeit ones that you may be able to identify using PCA.

In this section, we’ll tackle this problem by

  1. Exploring and plotting the original dataset before PCA
  2. Using the prcomp() function to learn the principal components from the data
  3. Exploring and plotting the result of the PCA model

13.3.1. Loading and exploring the banknote dataset

We'll start by loading the tidyverse packages, loading the data from the mclust package, and converting the data frame into a tibble. We have a tibble containing 200 banknotes with 7 variables.

Listing 13.1. Loading the banknote dataset
library(tidyverse)

data(banknote, package = "mclust")

swissTib <- as_tibble(banknote)

swissTib

# A tibble: 200 x 7
   Status  Length  Left Right Bottom   Top Diagonal
   <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
 1 genuine   215.  131   131.    9     9.7     141
 2 genuine   215.  130.  130.    8.1   9.5     142.
 3 genuine   215.  130.  130.    8.7   9.6     142.
 4 genuine   215.  130.  130.    7.5  10.4     142
 5 genuine   215   130.  130.   10.4   7.7     142.
 6 genuine   216.  131.  130.    9    10.1     141.
 7 genuine   216.  130.  130.    7.9   9.6     142.
 8 genuine   214.  130.  129.    7.2  10.7     142.
 9 genuine   215.  129.  130.    8.2  11       142.
10 genuine   215.  130.  130.    9.2  10       141.
# ... with 190 more rows

The keen-eyed among you may have noticed that this tibble is, in fact, labeled. We have the variable Status telling us whether each note is genuine or counterfeit. This is purely for teaching purposes; we're going to exclude it from the PCA analysis but map the labels onto the final principal components later, to see whether the PCA model separates the classes.

In situations where I have a clear outcome variable, I often plot each of my predictor variables against the outcome (as we've done in previous chapters). In unsupervised learning situations, we don't have an outcome variable, so I prefer to plot all variables against each other (provided I don't have so many variables as to prohibit doing so). We can do this easily using the ggpairs() function from the GGally package, which you may need to install first. We pass our tibble as the first argument to the ggpairs() function, and then we supply any additional aesthetic mappings by passing ggplot2's aes() function to the mapping argument. Finally, we add a theme_bw() layer to apply the black-and-white theme.

Listing 13.2. Plotting the data with ggpairs()
install.packages("GGally")

library(GGally)
ggpairs(swissTib, mapping = aes(col = Status)) +
  theme_bw()

The resulting plot is shown in figure 13.7. The output from ggpairs() takes a little getting used to, but it draws a different kind of plot for each combination of variable types. For example, along the top row of facets are box plots showing the distribution of each continuous variable against the categorical variable. We get the same thing in histogram form down the left column of facets. The diagonal facets show the distributions of values for each variable, ignoring all others. Finally, dot plots show the bivariate relationships between pairs of continuous variables.

Figure 13.7. The result of calling the ggpairs() function on our banknote dataset. Each variable is plotted against every other variable, with different plot types drawn depending on the combination of variable types.

Looking at the plots, we can see that some of the variables seem to differentiate between the genuine and counterfeit banknotes, such as the Diagonal variable. The Length variable, however, contains little information that discriminates the two classes of banknotes.

Note

You can see that if we had many more variables, visualizing them against each other in this way would start to become difficult!

13.3.2. Performing PCA

In this section, we're going to use the PCA algorithm to learn the principal components of our banknote dataset. To do this, I'll introduce you to the prcomp() function from the stats package that comes with your base R installation. Once we've done this, we'll inspect the output of this function to interpret the component scores of the principal components. I'll then show you how to extract and interpret variable loadings from the principal components, which tell us how much each original variable correlates with each principal component.

Listing 13.3. Performing the PCA
pca <- select(swissTib, -Status) %>%
    prcomp(center = TRUE, scale = TRUE)

pca

Standard deviations (1, .., p=6):
[1] 1.7163 1.1305 0.9322 0.6706 0.5183 0.4346

Rotation (n x k) = (6 x 6):
               PC1      PC2      PC3     PC4     PC5      PC6
Length    0.006987 -0.81549  0.01768  0.5746 -0.0588  0.03106
Left     -0.467758 -0.34197 -0.10338 -0.3949  0.6395 -0.29775
Right    -0.486679 -0.25246 -0.12347 -0.4303 -0.6141  0.34915
Bottom   -0.406758  0.26623 -0.58354  0.4037 -0.2155 -0.46235
Top      -0.367891  0.09149  0.78757  0.1102 -0.2198 -0.41897
Diagonal  0.493458 -0.27394 -0.11388 -0.3919 -0.3402 -0.63180

summary(pca)

Importance of components:
                         PC1   PC2   PC3   PC4    PC5    PC6
Standard deviation     1.716 1.131 0.932 0.671 0.5183 0.4346
Proportion of Variance 0.491 0.213 0.145 0.075 0.0448 0.0315
Cumulative Proportion  0.491 0.704 0.849 0.924 0.9685 1.0000

Mo irfts yav vrp select() oftncinu re eovrem rux Status rlavieba, nsg dqjx vpr igrunstel data ejrn xry prcomp() untcoinf. Xkdot svt rxw liatodidan iomtrnatp teagsmnur rx brx prcomp() nuitnfco: center sqn scale. Buk center graneumt nrcsoolt htwreeh urx data ja nmsk-enedctre reboef npgpyail ZXX, ynz zrj adlftue leuva aj TRUE. Mk oduhsl awylas terenc rky data orfebe linpagyp LAY uaeescb przj rsmveoe ryo ricntetep nch cesrof qxr pcrilnaip kzck rk zgaz hutohrg grx grnioi.

The scale argument controls whether the variables are divided by their standard deviations to put them all on the same scale as each other, and its default value is FALSE. There isn't a clear consensus on whether you should standardize your variables before running PCA. A common rule of thumb is that if your original variables are measured on a similar scale, standardization isn't necessary; but if you have one variable measuring grams and another measuring kilograms, you should standardize them by setting scale = TRUE to put them on the same scale. This is important because if you have one variable measured on a much larger scale, this variable will dominate the eigenvectors, and the other variables will contribute much less information to the principal components. In this example, we'll set scale = TRUE, but one of the exercises for this chapter is to set scale = FALSE and compare the results.

Note

In this example, we're not interested in including the Status variable in our dimension-reduction model; but even if we were, PCA cannot handle categorical variables. If you have categorical variables, your options are to encode them as numeric (which may or may not work), use a different approach for dimension reduction (there are some that handle categorical variables that I won't discuss here), or extract the principal components from the continuous variables and then recombine these with the categorical variables in the final dataset.

When we print the pca object, we get a printout of some information from our model. The Standard deviations component is a vector of the standard deviations of the data along each of the principal components. Because the variance is the square of the standard deviation, to convert these standard deviations into the eigenvalues for the principal components, we can simply square them. Notice that the values get smaller from left to right? This is because the principal components explain sequentially less of the variance in the data.

The Rotation component contains the six eigenvectors. Remember that these eigenvectors describe how far along each original variable we go, such that we're one unit along the principal axis away from the origin. These eigenvectors describe the direction of the principal axes.

If we pass our PCA results to the summary() function, we get a breakdown of the importance of each of the principal components. The Standard deviation row is the same as we saw a moment ago and contains the square root of the eigenvalues. The Proportion of Variance row tells us how much of the total variance is accounted for by each principal component. This is calculated by dividing each eigenvalue by the sum of the eigenvalues. The Cumulative Proportion row tells us how much variance is accounted for by the principal components so far. For example, we can see that PC1 and PC2 account for 49.1% and 21.3% of the total variance, respectively; cumulatively, they account for 70.4%. This information is useful when we're deciding how many principal components to retain for our downstream analysis.
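
We can reproduce these rows ourselves from the standard deviations stored in the model (a quick check using the pca object from listing 13.3):

eigenvalues <- pca$sdev ^ 2             # square the standard deviations

eigenvalues / sum(eigenvalues)          # the Proportion of Variance row
cumsum(eigenvalues) / sum(eigenvalues)  # the Cumulative Proportion row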

If we're interested in interpreting our principal components, it's useful to extract the variable loadings. The variable loadings tell us how much each of the original variables correlates with each of the principal components. The formula for calculating the variable loadings for a particular principal component is

equation 13.2.

variable loading = eigenvector × √(eigenvalue)

We can calculate all of the variable loadings simultaneously for all principal components and return them as a tibble using the map_dfc() function.

Listing 13.4. Calculating variable loadings
map_dfc(1:6, ~pca$rotation[, .] * sqrt(pca$sdev ^ 2)[.])

# A tibble: 6 x 6
       V1     V2      V3      V4      V5      V6
    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1  0.0120 -0.922  0.0165  0.385  -0.0305  0.0135
2 -0.803  -0.387 -0.0964 -0.265   0.331  -0.129
3 -0.835  -0.285 -0.115  -0.289  -0.318   0.152
4 -0.698   0.301 -0.544   0.271  -0.112  -0.201
5 -0.631   0.103  0.734   0.0739 -0.114  -0.182
6  0.847  -0.310 -0.106  -0.263  -0.176  -0.275

We can interpret these values as Pearson correlation coefficients, so we can see that the Length variable has very little correlation with PC1 (0.012) but a very strong negative correlation with PC2 (–0.922). This helps us conclude that, on average, cases with a small component score for PC2 have a larger Length.
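
Because the loadings are correlations, we can verify one of them directly against the value quoted above. Correlation is unaffected by the centering and scaling applied inside prcomp(), so we can use the raw variable:

cor(swissTib$Length, pca$x[, 2])  # approximately -0.922, the loading of Length on PC2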

13.3.3. Plotting the result of our PCA

Grvx, rfx’a rvyf qro suserlt lx xpt VXT mode f er rttbee dnaruntsed grx natiolhssiper nj urk data qg sgeein jl kur mode f cgz vdelaree cbn arpntste. Rotvb ozt axxm xnjs plotting functions xtl ERR trsusle jn vrb faertxcoat kpaceag, ec rxf’c ltasnli gns zqkf abrj eckgaap zny bfzp rbwj jr (xoa listing 13.5). Gnsv dpk’ok adoeld prx gkcpaae, ohc yrk get_pca() toicfunn re ysht rqo oanmtofnrii emtl vdt ZAT mode f cv kw asn lpapy xoattfreca functions vr jr.

Tip

Although we manually calculated the variable loadings in listing 13.4, an easier way of extracting this information is by printing the $coord component of the pcaDat object we create in listing 13.5.

The fviz_pca_biplot() function draws a biplot. A biplot is a common method of simultaneously plotting the component scores and the variable loadings for the first two principal components. You can see the biplot in the top left of figure 13.8. The dots show the component scores for each of the banknotes against the first two principal components, and the arrows indicate the variable loadings of each variable. This plot helps us identify that we seem to have two distinct clusters of banknotes, and the arrows help us to see which variables tend to correlate with each of the clusters. For example, the rightmost cluster in this plot tends to have higher values for the Diagonal variable.

Tip

The label = "var" argument tells the function to only label the variables; otherwise, it labels each case with its row number, and this makes me go cross-eyed.

The fviz_pca_var() function draws a variable loading plot. You can see the variable loading plot at the right in figure 13.8. Notice that this shows the same variable loading arrows as in the biplot, but now the axes represent the correlation of each of the variables with each principal component. If you look again at the variable loadings calculated in listing 13.4, you'll see that this plot is showing the same information: how much each original variable correlates with the first two principal components.

Figure 13.8. Typical exploratory plots for PCA analysis as supplied by the factoextra package. The top-left plot shows a biplot, combining each case’s component scores with arrows to show the variable loadings. The top-right plot shows the variable loading plot with a correlation circle (the boundary within which the variable loadings must lie). The bottom scree plots show the eigenvalue (left) and percentage explained variance (right).

The fviz_screeplot() function draws a scree plot. A scree plot is a common way of plotting the principal components against the amount of variance they explain in the data, as a graphical way to help identify how many principal components to retain. The function allows us to plot either the eigenvalue or the percentage variance accounted for by each principal component, using the choice argument. You can see scree plots with these two different y-axes in the bottom two plots in figure 13.8.

Note

Scree plots are so named because they resemble a scree slope, the collection of rocks and rubble that accumulates at the foot of a cliff due to weathering and erosion.

Listing 13.5. Plotting the PCA results
install.packages("factoextra")

library(factoextra)

pcaDat <- get_pca(pca)

fviz_pca_biplot(pca, label = "var")

fviz_pca_var(pca)

fviz_screeplot(pca, addlabels = TRUE, choice = "eigenvalue")

fviz_screeplot(pca, addlabels = TRUE, choice = "variance")

I've condensed the four plots from listing 13.5 into a single figure (figure 13.8) to save space.

When deciding how many principal components to retain, there are a few rules of thumb. One is to keep the principal components that cumulatively explain at least 80% of the variance. Another is to retain all principal components with eigenvalues of at least 1; the mean of all the eigenvalues is always 1, so this results in retaining principal components that contain more information than the average. A third rule of thumb is to look for an "elbow" in the scree plot and exclude principal components beyond the elbow (although there is no obvious elbow in our example). Instead of relying too much on these rules of thumb, I look at my data projected onto the principal components, and consider how much information I can tolerate losing for my application. If I'm applying PCA to my data before applying a machine learning algorithm to it, I prefer to use automated feature-selection methods, as we did in previous chapters, to select the combination of principal components that results in the best performance.
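
The first two rules of thumb are easy to apply in code (a minimal sketch using the pca object from listing 13.3):

eigenvalues <- pca$sdev ^ 2
propVar     <- eigenvalues / sum(eigenvalues)

which(cumsum(propVar) >= 0.8)[1]  # number of components needed to reach 80% of the variance
sum(eigenvalues >= 1)             # the eigenvalues-of-at-least-1 rule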

Finally, let's plot our first two principal components against each other and see how well they're able to separate the genuine and counterfeit banknotes. We first mutate the original dataset to include columns of component scores for PC1 and PC2 (extracted from our pca object using $x). We then plot the principal components against each other and add a color aesthetic for the Status variable.

Listing 13.6. Mapping genuine and counterfeit labels
swissPca <- swissTib %>%
  mutate(PCA1 = pca$x[, 1], PCA2 = pca$x[, 2])

ggplot(swissPca, aes(PCA1, PCA2, col = Status)) +
  geom_point() +
  theme_bw()
Figure 13.9. The PCA component scores are plotted for each case, shaded by whether they were genuine or counterfeit.

The resulting plot is shown in figure 13.9. We started with six continuous variables and condensed most of that information into just two principal components that contain enough information to separate the two clusters of banknotes! If we didn't have labels, having identified different clusters of data, we would now try to understand what those two clusters were, and perhaps come up with a way of discriminating genuine banknotes from counterfeits.

Exercise 1

Add a stat_ellipse() layer to the plot in figure 13.9 to add 95% confidence ellipses to each class of banknote.

13.3.4. Computing the component scores of new data

We have our PCA model, but what do we do when we get new data? Well, because the eigenvectors describe exactly how much each variable contributes to the value of each principal component, we can simply calculate the component scores of new data (including centering and scaling, if we performed this as part of the model).

Let's generate some new data to see how this works in practice. In listing 13.7, we first define a tibble consisting of two new cases and all the same variables entered into our PCA model. To calculate the component scores of these new cases, we simply use the predict() function, passing the model as the first argument and the new data as the second argument. As we can see, the predict() function returns both cases' component scores for each of the principal components.

Listing 13.7. Computing the component scores of new data
newBanknotes <- tibble(
  Length = c(214, 216),
  Left = c(130, 128),
  Right = c(132, 129),
  Bottom = c(12, 7),
  Top = c(12, 8),
  Diagonal = c(138, 142)
)

predict(pca, newBanknotes)

        PC1     PC2     PC3    PC4    PC5   PC6
[1,] -4.729  1.9989 -0.1058 -1.659 -3.203 1.623
[2,]  6.466 -0.8918 -0.8215  3.469 -1.838 2.339
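
Under the hood, predict() simply centers and scales the new data using the means and standard deviations stored in the model, and then multiplies by the eigenvector matrix. Here is a quick sketch showing the equivalence, using the newBanknotes tibble defined above (this assumes the columns are in the same order as when the model was trained):

# reproduces the output of predict(pca, newBanknotes)
scale(newBanknotes, center = pca$center, scale = pca$scale) %*% pca$rotation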

You've learned how to apply PCA to your data and interpret the information it provides. In the next chapter, I'll introduce two nonlinear dimension-reduction techniques. I suggest that you save your .R file, because we're going to continue using the same dataset in the next chapter. This is so we can compare the performance of these nonlinear algorithms to the representation we created here using PCA.


13.4. Strengths and weaknesses of PCA

While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether PCA will perform well for you.

The strengths of PCA are as follows:

  • PCA creates new axes that are directly interpretable in terms of the original variables.
  • New data can be projected onto the principal axes.
  • PCA is really a mathematical transformation and so is computationally inexpensive.

The weaknesses of PCA are these:

  • The mapping from high dimensions to low dimensions cannot be nonlinear.
  • It cannot handle categorical variables natively.
  • The final number of principal components to retain must be decided by us for the application at hand.
Exercise 2

Rerun the PCA on our Swiss banknote dataset, but this time set the scale argument to FALSE. Compare the following to the PCA we trained on scaled data:

  1. Eigenvalues
  2. Eigenvectors
  3. Biplot
  4. Variable loading plot
  5. Scree plot
Exercise 3

Do the same as in exercise 2 again, but this time set the arguments center = FALSE and scale = TRUE.

Summary

  • Dimension reduction is a class of unsupervised learning that learns a low-dimensional representation of a high-dimensional dataset while retaining as much information as possible.
  • PCA is a linear dimension-reduction technique that finds new axes that maximize the variance in the data. The first of these principal axes maximizes the most variance, followed by the second, and the third, and so on, which are all orthogonal to the previously computed axes.
  • When data is projected onto these principal axes, the new variables are called principal components.
  • In PCA, eigenvalues represent the variance along a principal component, and the eigenvector represents the direction of the principal axis through the original feature space.

Solutions to exercises

  1. Add 95% confidence ellipses to the plot of PCA1 versus PCA2:
ggplot(swissPca, aes(PCA1, PCA2, col = Status)) +
  geom_point() +
  stat_ellipse() +
  theme_bw()
  2. Compare the PCA results when scale = FALSE:
pcaUnscaled <- select(swissTib, -Status) %>%
  prcomp(center = TRUE, scale = FALSE)

pcaUnscaled

fviz_pca_biplot(pcaUnscaled, label = "var")

fviz_pca_var(pcaUnscaled)

fviz_screeplot(pcaUnscaled, addlabels = TRUE, choice = "variance")
  3. Compare the PCA results when center = FALSE and scale = TRUE:
pcaUncentered <- select(swissTib, -Status) %>%
  prcomp(center = FALSE, scale = TRUE)

pcaUncentered

fviz_pca_biplot(pcaUncentered, label = "var")

fviz_pca_var(pcaUncentered)

fviz_screeplot(pcaUncentered, addlabels = TRUE, choice = "variance")