Chapter 14. Maximizing similarity with t-SNE and UMAP


This chapter covers

  • Understanding nonlinear dimension reduction
  • Using t-distributed stochastic neighbor embedding
  • Using uniform manifold approximation and projection

In the last chapter, I introduced you to PCA as our first dimension-reduction technique. While PCA is a linear dimension-reduction algorithm (it finds linear combinations of the original variables), sometimes the information in a set of variables can’t be extracted as a linear combination of these variables. In such situations, there are a number of nonlinear dimension-reduction algorithms we can turn to, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).

t-SNE is one of the most popular nonlinear dimension-reduction algorithms. It measures the distance between each observation in the dataset and every other observation, and then randomizes the observations across (usually) two new axes. The observations are then iteratively shuffled around these new axes until their distances to each other in this two-dimensional space are as similar to the distances in the original high-dimensional space as possible.

UMAP is another nonlinear dimension-reduction algorithm that overcomes some of the limitations of t-SNE. It works similarly to t-SNE (finds distances in a feature space with many variables and then tries to reproduce these distances in low-dimensional space), but differs in the way it measures distances.

By the end of this chapter, I hope you’ll understand what nonlinear dimension reduction is and why it can be beneficial compared to linear dimension reduction. I will show you how the t-SNE and UMAP algorithms work and how they differ from each other, and we’ll apply each of them to our banknote data set from chapter 13 so we can compare their performance with PCA. If you no longer have the swissTib and newBanknotes objects defined in your global environment, just rerun listings 13.1 and 13.7.

14.1. What is t-SNE?

In this section, I’ll show you what t-distributed stochastic neighbor embedding is, how it works, and why it’s useful. t-distributed stochastic neighbor embedding is such a mouthful that I’m glad people shorten it to t-SNE (usually pronounced “tee-snee,” or occasionally “tis-nee”), not least because when you hear someone say it, you can say “bless you,” and everyone laughs (at least the first few times).

Whereas PCA is a linear dimension-reduction algorithm (because it finds new axes that are linear combinations of the original variables), t-SNE is a nonlinear dimension-reduction algorithm. It is nonlinear because instead of finding new axes that are linear combinations of the original variables, it focuses on the similarities between nearby cases in a data set and tries to reproduce these similarities in a lower-dimensional space. The main benefit of this approach is that t-SNE will almost always do a better job than PCA of highlighting patterns in the data (such as clusters). One of the downsides of this approach is that the axes are no longer interpretable, because they don’t represent linear combinations of the original variables.

The first step in the t-SNE algorithm is to compute the distance between each case and every other case in the data set. By default, this distance is the Euclidean distance, which is the straight-line distance between any two points in the feature space (but we can use other measures of distance instead). These distances are then converted into probabilities. This is illustrated in figure 14.1.

For a particular case in the data set, the distance between this case and all other cases is measured. Then a normal distribution is centered on this case, and the distances are converted into probabilities by mapping them onto the probability density of the normal distribution. The standard deviation of this normal distribution is inversely related to the density of cases around the case in question. Put another way, if there are lots of cases nearby (more dense), the standard deviation of the normal distribution is smaller; but if there are few cases nearby (less dense), the standard deviation is larger.

After converting the distances to probabilities, the probabilities for each case are scaled by dividing them by their sum. This makes the probabilities sum to 1 for every case in the data set. Using different standard deviations for different densities, and then normalizing the probabilities to 1 for every case, means that if there are dense clusters and sparse clusters of cases in the data set, t-SNE will expand the dense clusters and compress the sparse ones so they can be visualized more easily together. The exact relationship between data density and the standard deviation of the normal distribution depends on a hyperparameter called perplexity, which we’ll discuss shortly.
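To make this concrete, here is a minimal sketch of the conversion in R. It is an illustration, not the Rtsne internals: gaussian_affinities is a hypothetical helper, and for simplicity it uses one fixed sigma for every case, whereas t-SNE tunes sigma per case to match the perplexity hyperparameter. The usage line assumes the swissTib tibble from chapter 13 is defined.

gaussian_affinities <- function(data, sigma = 1) {
  d <- as.matrix(dist(data))       # pairwise Euclidean distances
  p <- exp(-d^2 / (2 * sigma^2))   # normal density (up to a constant that cancels below)
  diag(p) <- 0                     # a case is not its own neighbor
  p / rowSums(p)                   # scale each case's probabilities to sum to 1
}

p <- gaussian_affinities(select(swissTib, -Status))
rowSums(p)[1:3]                    # each case's probabilities sum to 1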

Figure 14.1. t-SNE measures the distance from each case to every other case, converted into a probability by fitting a normal distribution over the current case. These probabilities are scaled by dividing them by their sum, so that they add to 1.

Once the scaled probabilities have been calculated for each case in the data set, we have a matrix of probabilities that describes how similar each case is to each of the other cases. This is visualized in figure 14.2 as a heatmap, which is a useful way of thinking about it.

Our matrix of probabilities is now our reference, or template, for how the data values relate to each other in the original, high-dimensional space. The next step in the t-SNE algorithm is to randomize the cases along (usually) two new axes (this is where the “stochastic” bit of the name comes from).

Note

It doesn’t need to be two axes, but it commonly is. This is because humans struggle to visualize data in more than two dimensions at once, and because, beyond three dimensions, the computational cost of t-SNE becomes more and more prohibitive.

t-SNE calculates the distances between the cases in this new, randomized, low-dimensional space and converts them into probabilities just like before. The only difference is that instead of using the normal distribution, it now uses Student’s t distribution. The t distribution looks a bit like a normal distribution, except that it’s not quite as tall in the middle, and its tails are flatter and extend further out (see figure 14.3). It’s a bit like if someone sat on a normal distribution and squashed it. This is where the “t” in t-SNE comes from. I’ll explain why we use the t distribution momentarily.
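The low-dimensional counterpart can be sketched the same way. With one degree of freedom, the Student’s t density is proportional to 1 / (1 + distance^2), so similarity decays much more slowly in the tails. Again, this is an illustration, with t_affinities a hypothetical helper:

t_affinities <- function(embedding) {
  d <- as.matrix(dist(embedding))  # distances in the low-dimensional space
  q <- 1 / (1 + d^2)               # heavy-tailed Student's t kernel (1 df)
  diag(q) <- 0
  q / sum(q)                       # normalized over all pairs of cases
}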

Figure 14.2. The scaled probabilities for each case are stored as a matrix of values. This is visualized here as a heatmap: the closer two cases are, the darker the box is that represents their distance in the heatmap.
Figure 14.3. When converting distances in the lower-dimensional representation into probabilities, t-SNE fits a Student’s t distribution over the current case instead of a normal distribution. The Student’s t distribution has longer tails, meaning dissimilar cases are pushed further away to achieve the same probability as in the high-dimensional representation.

The job for t-SNE now is to “shuffle” the data points around these new axes, step by step, to make the matrix of probabilities in the lower-dimensional space look as close as possible to the matrix of probabilities in the original, high-dimensional space. The intuition here is that if the matrices are as similar as possible, then the cases each case was close to in the original feature space will still be close by in the low-dimensional space. You can think of this as a game of attraction and repulsion.

To make the probability matrix in low-dimensional space look like the one in high-dimensional space, each case needs to move closer to cases that were close to it in the original data, and away from cases that were far away. So cases that should be nearby will pull their neighbors toward them, but cases that should be far away will push non-neighbors away from them. The balance of these attractive and repulsive forces causes each case in the data set to move in a direction that makes the two probability matrices a little more similar. Now, in this new position, the low-dimensional probability matrix is calculated again, and the cases move again, making the low- and high-dimensional matrices look a little more similar again. This process continues until we reach a predetermined number of iterations, or until the divergence (difference) between the matrices stops improving. This whole process is illustrated in figure 14.4.

Figure 14.4. Cases are randomly initialized over the new axes (one axis is shown here). The probability matrix is computed for this axis, and the cases are shuffled around to make this matrix resemble the original, high-dimensional matrix by minimizing the Kullback-Leibler (KL) divergence. During shuffling, cases are attracted toward cases that are similar to them (lines with circles) and repulsed away from cases that are dissimilar (lines with triangles).
Note

The difference between the two matrices is measured using a statistic called the Kullback-Leibler divergence, which is large when the matrices are very different and zero when the matrices are perfectly identical.
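A minimal sketch of this statistic, assuming p and q are two probability matrices on the same scale (in the full algorithm, the high-dimensional probabilities are symmetrized and rescaled so that both matrices sum to 1 before being compared):

kl_divergence <- function(p, q, eps = 1e-12) {
  sum(p * log((p + eps) / (q + eps)))  # eps avoids dividing by or taking log of 0
}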

Why do we use the t distribution to convert distances into probabilities in the low-dimensional space? Well, notice again from figure 14.3 that the tails of the t distribution are wider than those of the normal distribution. This means that, in order to get the same probability as from the normal distribution, dissimilar cases need to be pushed further away from the case the t distribution is centered over. This helps spread out the clusters that might be present in the data, helping us to identify them more easily. A major consequence of this, however, is that t-SNE is often said to retain local structure in the low-dimensional representation, but it doesn’t usually retain global structure. Practically, this means we can interpret cases that are close to each other in the final representation as being similar to each other, but we can’t easily say which clusters of cases were more similar to other clusters of cases in the original data.
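You can check the tails directly in R: three units from the center, a t distribution with one degree of freedom has roughly seven times the density of the standard normal, so a dissimilar case must sit much further out to earn the same (low) probability:

dnorm(3)       # standard normal density at 3: ~0.0044
dt(3, df = 1)  # Student's t density (1 df) at 3: ~0.0318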

Dnka bjrc eritvaite eporcss cpa vednercog sr c wvf NV edegcrniev, kw ohduls oskp z wxf-lmidnoaisne rsaetenrnoteip vl ktp niogalir data rurs eeverspsr qrx iastilirmeis neetweb bynaer cases. Mujfv r-SQF pltyyilca moueorrspft LBB lkt iihnhgtiglhg tnsraetp nj data, rj gxae eqxz zevm ngtisinifca isinmlotita:

  • It is infamously computationally expensive: its computation time grows rapidly (quadratically, for exact t-SNE) with the number of cases in the data set. There is a multicore implementation (see https://github.com/RGLab/Rtsne.multicore), but for extremely large datasets, t-SNE could take hours to run.
  • It cannot project new data onto the embedding. By this I mean that, because the initial placement of the data onto the new axes is random, rerunning t-SNE on the same data set repeatedly will give you slightly different results. Thus we can’t use the predict() function to map new data onto the lower-dimensional representation as we can with PCA. This prohibits us from using t-SNE as part of a machine learning pipeline and pretty much relegates its use to data exploration and visualization.
  • Distances between clusters often don’t mean anything. Say we have three clusters of data in our final t-SNE representation: two are close, and a third is far away from the other two. Because t-SNE focuses on local, not global, structure, we cannot say that the first two clusters are more similar to each other than they are to the third cluster.
  • t-SNE doesn’t necessarily preserve the distances or density of the data in the final representation, so passing the output of t-SNE into clustering algorithms that rely on distances or densities tends not to work as well as you might expect.
  • We need to select sensible values for a number of hyperparameters, which can be difficult if the t-SNE algorithm takes minutes to hours to run on a data set.

14.2. Building your first t-SNE embedding

In this section, I’m going to show you how to use the t-SNE algorithm to create a low-dimensional embedding of our Swiss banknote data set, to see how it compares with the PCA model we created in the previous chapter. First, we’ll install and load the Rtsne package in R, and then I’ll explain the various hyperparameters that control how t-SNE learns. Then, we’ll create a t-SNE embedding using an optimal combination of hyperparameters. Finally, we’ll plot the new, lower-dimensional representation learned by the t-SNE algorithm and compare it to the PCA representation we plotted in chapter 13.

14.2.1. Performing t-SNE

Let’s start by installing and loading the Rtsne package:

install.packages("Rtsne")

library(Rtsne)

t-SNE has four important hyperparameters that can drastically change the resulting embedding:

  • perplexity— Controls the width of the distributions used to convert distances into probabilities. High values place more focus on global structure, whereas small values place more focus on local structure. Typical values lie in the range 5 to 50. The default value is 30.
  • theta— Controls the trade-off between speed and accuracy. Because t-SNE is slow, people commonly use an implementation called Barnes-Hut t-SNE, which allows us to perform the embedding much faster but with some loss of accuracy. The theta hyperparameter controls this trade-off, with 0 being “exact” t-SNE and 1 being the fastest but least accurate t-SNE. The default value is 0.5.
  • eta— How far each data point moves at each iteration (also called the learning rate). Lower values need more iterations to reach convergence but may result in a more accurate embedding. The default value is 200, and this is usually fine.
  • max_iter— The maximum number of iterations allowed before computation stops. This will depend on your computational budget, but it’s important to have enough iterations to reach convergence. The default value is 1,000.
Tip

The most important hyperparameters to tune are usually perplexity and max_iter.

Our approach to tuning hyperparameters thus far has been to allow an automated tuning process to choose the best combination for us, through either a grid search or a random search. But due to its computational cost, most people will run t-SNE with its default hyperparameter values and change them if the embedding doesn’t look sensible. If this sounds very subjective, that’s because it is; but people are usually able to identify visually whether t-SNE is pulling apart clusters of observations nicely.
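If you’d like something slightly more systematic than pure eyeballing, the sketch below runs a small manual grid over perplexity so each embedding can be plotted and judged by eye. It assumes the swissTib tibble from chapter 13 is defined; perplexities and tsneGrid are just illustrative names:

library(Rtsne)

perplexities <- c(5, 30, 50)

tsneGrid <- lapply(perplexities, function(p) {
  select(swissTib, -Status) %>%
    Rtsne(perplexity = p, theta = 0, max_iter = 1000)
})

plot(tsneGrid[[2]]$Y)  # inspect the perplexity = 30 embedding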

To give you a visual aid for how each of these hyperparameters affects the final embedding, I’ve run t-SNE on our Swiss banknote data using a grid of hyperparameter values. Figure 14.5 shows the final embeddings with different combinations of theta (rows) and perplexity (columns) using the default values of eta and max_iter. Notice that the clusters become tighter with larger values of perplexity and are lost with very low values. Also notice that for reasonable values of perplexity, the clusters are best resolved when theta is set to 0 (exact t-SNE).

Figure 14.5. The effect on the final t-SNE embedding of the banknote dataset of changing theta (row facets) and perplexity (column facets) using the default values of eta and max_iter
Figure 14.6. The effect on the final t-SNE embedding of the banknote dataset of changing max_iter (row facets) and eta (column facets) using the default values of theta and perplexity

Figure 14.6 shows the final embeddings with different combinations of max_iter (rows) and eta (columns). The effect here is a little more subtle, but smaller values of eta need a larger number of iterations in order to converge (because the cases move in smaller steps at each iteration). For example, for an eta of 100, 1,000 iterations is sufficient to separate the clusters; but with an eta of 1, the clusters remain poorly resolved after 1,000 iterations. If you would like to see the code I used to generate these figures, the code for this chapter is available at www.manning.com/books/machine-learning-with-r-the-tidyverse-and-mlr.

Now that you’re a little more tuned in to how t-SNE’s hyperparameters affect its performance, let’s run t-SNE on our Swiss banknote data set. Just like for PCA, we first select all the columns except the categorical variable (t-SNE also cannot handle categorical variables) and pipe this data into the Rtsne() function. We manually set the values of the perplexity, theta, and max_iter hyperparameters (honestly, I rarely alter eta) and set the argument verbose = TRUE so the algorithm prints a running commentary on what the KL divergence is at each iteration.

Listing 14.1. Running t-SNE
swissTsne <- select(swissTib, -Status) %>%
  Rtsne(perplexity = 30, theta = 0, max_iter = 5000, verbose = TRUE)
Tip

By default, the Rtsne() function reduces the data set to two dimensions. If you want to return another number of dimensions, you can set this using the dims argument.
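For example, a three-dimensional embedding of the banknote data could be requested like this (a sketch, assuming swissTib is defined):

swissTsne3d <- select(swissTib, -Status) %>%
  Rtsne(dims = 3, perplexity = 30, theta = 0, max_iter = 5000)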

That didn’t take too long, did it? For a small data set like this, t-SNE takes only a few seconds. But it quickly gets slow (see what I did there?) as the data set increases in size.

14.2.2. Plotting the result of t-SNE

Next, let’s plot the two t-SNE dimensions against each other to see how well they separated the genuine and counterfeit banknotes. Because we can’t interpret the axes in terms of how much each variable correlates with them, it’s common for people to color their t-SNE plots by the values of each of their original variables, to help identify which clusters have higher and lower values. To do this, we first use the mutate_if() function to center the numeric variables in our original data set (by setting .funs = scale and .predicate = is.numeric). We include scale = FALSE to only center the variables, not divide them by their standard deviations. The reason we center the variables is that we’re going to shade points by their values on the plots, and we don’t want variables with larger values dominating the color scales (omit this line and see the difference in the final plot for yourself).

Next, we mutate two new columns that contain the t-SNE axis values for each case. Finally, we gather the data so that we can facet by each of the original variables. We plot this data, mapping the value of each original variable to the color aesthetic and the status of each banknote (genuine versus counterfeit) to the shape aesthetic, and facet by the original variables. We add a custom color scale gradient to make the color scale more readable in print.

Listing 14.2. Plotting the t-SNE embedding
swissTibTsne <- swissTib %>%
  mutate_if(.funs = scale, .predicate = is.numeric, scale = FALSE) %>%
  mutate(tSNE1 = swissTsne$Y[, 1], tSNE2 = swissTsne$Y[, 2]) %>%
  gather(key = "Variable", value = "Value", c(-tSNE1, -tSNE2, -Status))

ggplot(swissTibTsne, aes(tSNE1, tSNE2, col = Value, shape = Status)) +
  facet_wrap(~ Variable) +
  geom_point(size = 3) +
  scale_color_gradient(low = "dark blue", high = "cyan") +
  theme_bw()
Figure 14.7. tSNE1 and tSNE2 axes plotted against each other, faceted and shaded by the original variables, and shaped by whether each case was a genuine or counterfeit banknote

The resulting plot is shown in figure 14.7. Wow! Notice how much better t-SNE does than PCA at representing the differences between the two clusters in a feature space with only two dimensions. The clusters are well resolved, although if you look closely, you can see a couple of cases that seem to be in the wrong cluster. Shading the points by the value of each variable also helps us identify that counterfeit notes tend to have lower values of the Diagonal variable and higher values of the Bottom and Top variables. It also seems as though there might be a small second cluster of counterfeit notes: this could be a set of notes made by a different counterfeiter, or an artifact of an imperfect combination of hyperparameters. More investigation would be needed to tell if these are actually a distinct cluster.

Note

Do your plots look a little different than mine? Of course they do! Remember that the initial embedding is random (stochastic), so each time you run t-SNE on the same data and with the same hyperparameters, you’ll get a slightly different embedding.

Exercise 1

Recreate the plot in figure 14.7, but this time don’t center the variables (just remove the mutate_if() layer). Can you see why scaling was necessary?

14.3. What is UMAP?

In this section, I’ll show you what UMAP is, how it works, and why it’s useful. Uniform manifold approximation and projection, fortunately shortened to UMAP, is a nonlinear dimension-reduction algorithm like t-SNE. UMAP is state of the art, having only been published in 2018, and it has a few benefits over the t-SNE algorithm.

First, it’s considerably faster than t-SNE: the length of time it takes to run increases less steeply than the square of the number of cases in the data set. To put this in perspective, a data set that might take t-SNE hours to compress will take UMAP minutes.

The second benefit (and the main benefit, in my view) is that UMAP is a deterministic algorithm. In other words, given the same input, it will always give the same output. This means that, unlike with t-SNE, we can project new data onto the lower-dimensional representation, allowing us to incorporate UMAP into our machine learning pipelines.

The third benefit is that UMAP preserves both local and global structure. Practically, this means that not only can we interpret two cases close to each other in lower dimensions as being similar to each other in high dimensions, but we can also interpret two clusters of cases close to each other as being more similar to each other in high dimensions.

So how does UMAP work? Well, UMAP assumes the data is distributed along a manifold. A manifold is an n-dimensional smooth geometric shape where, for every point on the manifold, there exists a small neighborhood around that point that looks like a flat, two-dimensional plane. If that doesn’t make sense to you, consider that the world is a three-dimensional manifold, parts of which can be mapped onto a flat representation literally called a map. UMAP searches for a surface, or a space with many dimensions, along which the data is distributed. The distances between cases along the manifold can then be calculated, and a lower-dimensional representation of the data can be optimized iteratively to reproduce these distances.

Prefer a visual representation? Me too. Have a look at figure 14.8. I’ve drawn a question mark as a manifold and randomly seeded 15 cases along the manifold across 2 variables. UMAP’s job is to learn the question mark manifold so that it can measure the distances between cases along the manifold instead of the ordinary Euclidean distance, like t-SNE does. It achieves this by searching a region around each case for other cases. Where these regions encapsulate another case, the cases get connected by an edge. This is what I’ve done in the top row of figure 14.8; but can you see that the manifold is incomplete? There are gaps in my question mark. This is because the regions I searched around each case had the same radius, and the data wasn’t uniformly distributed along the manifold. If the cases had been spaced out along the question mark at regular intervals, then this approach would have worked, provided I selected an appropriate radius for the search regions.

Figure 14.8. How UMAP learns a manifold. UMAP expands a search region around each case. A naive form of this is shown in the top row, where the radius of each search region is the same. When cases with overlapping search regions are connected by edges, there are gaps in the manifold. In the bottom row, the search region extends to the nearest neighbor and then extends outward in a fuzzy way, with a radius inversely related to the density of data in that region. This results in a complete manifold.

Real-world data is rarely evenly distributed, and UMAP solves this problem in two ways. First, it expands the search region for each case until it meets its nearest neighbor. This ensures that there are no orphan cases: while there can be multiple, disconnected manifolds in a data set, every case must connect to at least one other case. Second, UMAP creates an additional search region that has a larger radius in lower-density areas and a smaller radius in high-density regions. These search regions are described as fuzzy, in that the further from the center another case finds itself, the lower the probability that an edge exists between those cases. This forces an artificially uniform distribution of the cases (and is where the “uniform” in UMAP comes from). This process is represented in the lower row of figure 14.8; notice that we now get a more complete estimation of the underlying manifold.
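To see why a single, fixed radius fails on unevenly distributed data, look at how much the nearest-neighbor distance varies from case to case. This is a minimal sketch using base R, assuming swissTib from chapter 13 is defined; it is not how the umap package computes its search regions:

d <- as.matrix(dist(select(swissTib, -Status)))
diag(d) <- Inf                # ignore each case's zero distance to itself
nearest <- apply(d, 1, min)   # distance from each case to its nearest neighbor
summary(nearest)              # a wide spread means uneven density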

The next step is to place the data onto a new manifold in (usually) two new dimensions. Then the algorithm iteratively shuffles this new manifold around until the distances between the cases along it look like the distances between the cases along the original, high-dimensional manifold. This is similar to the optimization step of t-SNE, except that UMAP minimizes a different loss function called cross-entropy (whereas t-SNE minimizes KL divergence).

Note

Just like for t-SNE, we can create more than two new dimensions if we want to.

Once UMAP has learned the lower-dimensional manifold, new data can be projected onto this manifold to get its values on the new axes, whether for visualization or as input for another machine learning algorithm.

Note

UMAP can also be used to perform supervised dimension reduction, which really just means that given high-dimensional, labeled data, it learns a manifold that can be used to classify cases into groups.

14.4. Building your first UMAP model

In this section, I’m going to show you how to use the UMAP algorithm to create a low-dimensional embedding of our Swiss banknote data set. Remember that we’re trying to see if we can find a lower-dimensional representation of this data set to help us identify patterns, such as different types of banknotes. We’ll start by installing and loading the umap package in R. Just as we did for t-SNE, we’ll discuss UMAP’s hyperparameters and how they affect the embedding. Then we’ll train a UMAP model on the banknote data set and plot it to see how it compares with our PCA model and t-SNE embedding.

14.4.1. Performing UMAP

In this section, we’ll install and load the umap package and then tune and train our UMAP model. Let’s start by installing and loading the umap package:

install.packages("umap")

library(umap)

Just like t-SNE, UMAP has four important hyperparameters that control the resulting embedding:

  • n_neighbors— Controls the radius of the fuzzy search region. Larger values will include more neighboring cases, forcing the algorithm to focus on more global structure. Smaller values will include fewer neighbors, forcing the algorithm to focus on more local structure.
  • min_dist— Defines the minimum distance apart that cases are allowed to be in the lower-dimensional representation. Low values result in “clumpy” embeddings, whereas larger values result in cases being spread further apart.
  • metric— Defines which distance metric UMAP will use to measure distances along the manifold. By default, UMAP uses ordinary Euclidean distance, but other (sometimes crazy) distance metrics can be used instead. A common alternative to Euclidean distance is Manhattan distance (also called taxicab distance): instead of measuring the distance between two points as a single (possibly diagonal) distance, it measures the distance between two points one variable at a time and adds up these little journeys, just like a taxi cab driving around blocks in a city (see the short comparison after this list). We can also apply t-SNE with distance metrics other than Euclidean, but we first need to calculate these distances manually ourselves. The UMAP implementation just lets us specify the distance we want, and it takes care of the rest.
  • n_epochs— Defines the number of iterations of the optimization step.
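Base R’s dist() function makes the difference between the two metrics mentioned in the metric bullet easy to see. For two points separated by 3 units on one variable and 4 units on another:

pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "euclidean")  # 5: the single straight-line distance
dist(pts, method = "manhattan")  # 7: 3 blocks across plus 4 blocks up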

Once again, to give you a visual aid for how each of these hyperparameters affects the final embedding, I’ve run UMAP on our Swiss banknote data using a grid of hyperparameter values. Figure 14.9 shows the final embeddings with different combinations of n_neighbors (rows) and min_dist (columns) using the default values of metric and n_epochs. Notice that the cases are more spread out for smaller values of n_neighbors and min_dist and that the clusters begin to break apart with low values of the n_neighbors hyperparameter.

Figure 14.9. The effect on the final UMAP embedding of the banknote dataset of changing n_neighbors (row facets) and min_dist (column facets) using the default values of metric and n_epochs
Figure 14.10. The effect on the final UMAP embedding of the swissTib dataset of changing metric (row facets) and n_epochs (column facets) using the default values of n_neighbors and min_dist

Figure 14.10 shows the final embeddings with different combinations of metric (rows) and n_epochs (columns). The effect here is a little more subtle, but the clusters tend to be farther apart with more iterations. It also looks as though Manhattan distance does a slightly better job of breaking up those three smaller clusters (which we’ve not seen before!). If you would like to see the code I used to generate these figures, the code for this chapter is available at www.manning.com/books/machine-learning-with-r-the-tidyverse-and-mlr.

I hope that demystifies UMAP’s hyperparameters a little. Now let’s run UMAP on our Swiss banknote data set. Just like before, we first select all the columns except the categorical variable (UMAP cannot currently handle categorical variables, but this may change in the future) and pipe this data into the as.matrix() function (just to prevent an irritating warning message). This matrix is then piped into the umap() function, within which we manually set the values of all four hyperparameters and set the argument verbose = TRUE so the algorithm prints a running commentary on the number of epochs (iterations) that have passed.

Listing 14.3. Performing UMAP
swissUmap <- select(swissTib, -Status) %>%
             as.matrix() %>%
             umap(n_neighbors = 7, min_dist = 0.1,
                  metric = "manhattan", n_epochs = 200, verbose = TRUE)

14.4.2. Plotting the result of UMAP

Next, let’s plot the two UMAP dimensions against each other to see how well they separated the genuine and counterfeit banknotes. We go through exactly the same process as we did in listing 14.2 to reshape the data so it’s ready for plotting.

Listing 14.4. Plotting the UMAP embedding
swissTibUmap <- swissTib %>%
  mutate_if(.funs = scale, .predicate = is.numeric, scale = FALSE) %>%
  mutate(UMAP1 = swissUmap$layout[, 1], UMAP2 = swissUmap$layout[, 2]) %>%
  gather(key = "Variable", value = "Value", c(-UMAP1, -UMAP2, -Status))

ggplot(swissTibUmap, aes(UMAP1, UMAP2, col = Value, shape = Status)) +
  facet_wrap(~ Variable) +
  geom_point(size = 3) +
  scale_color_gradient(low = "dark blue", high = "cyan") +
  theme_bw()

The resulting plot is shown in figure 14.11. The UMAP embedding seems to suggest the existence of three different clusters of counterfeit banknotes! Perhaps there are three different counterfeiters at large.

Figure 14.11. UMAP1 and UMAP2 axes plotted against each other, faceted and shaded by the original variables, and shaped by whether each case was a genuine or counterfeit banknote

14.4.3. Computing the UMAP embeddings of new data

Remember I said that, unlike t-SNE, new data can be projected reproducibly onto a UMAP embedding? Well, let’s do this for the newBanknotes tibble we defined when predicting PCA component scores in chapter 13 (rerun listing 13.7 if you no longer have this defined). In fact, the process is exactly the same: we use the predict() function with the model as the first argument and the new data as the second argument. This outputs a matrix, where the rows represent the two cases and the columns represent the UMAP axes:

predict(swissUmap, newBanknotes)

     [,1]   [,2]
1 -6.9516 -7.777
2  0.1213  6.160

14.5. Strengths and weaknesses of t-SNE and UMAP

While it often isn’t easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether t-SNE and UMAP will perform well for you.

The strengths of t-SNE and UMAP are as follows:

  • They can learn nonlinear patterns in the data.
  • They tend to separate clusters of cases better than PCA.
  • UMAP can make predictions on new data.
  • UMAP is computationally inexpensive.
  • UMAP preserves both local and global distances.

The weaknesses of t-SNE and UMAP are these:

  • The new axes of t-SNE and UMAP are not directly interpretable in terms of the original variables.
  • t-SNE cannot make predictions on new data (you get a different result each time).
  • t-SNE is computationally expensive.
  • t-SNE doesn’t necessarily preserve global structure.
  • They cannot handle categorical variables natively.
Exercise 2

Rerun UMAP on our Swiss banknote data set, but this time include the argument n_components = 3 (feel free to experiment by changing the values of the other hyperparameters). Pass the $layout component of the UMAP object to the GGally::ggpairs() function. (Tip: You’ll need to wrap this object in as.data.frame(), or ggpairs() will have a hissy fit.)

Summary

  • t-SNE and UMAP are nonlinear dimension-reduction algorithms.
  • t-SNE converts the distances between all cases in the data into probabilities based on the normal distribution and then iteratively shuffles the cases around in a lower-dimensional space to reproduce these distances.
  • In the lower-dimensional space, t-SNE uses Student’s t distribution to convert distances to probabilities to better separate clusters of data.
  • UMAP learns a manifold that the data are arranged along and then iteratively shuffles the data around in a lower-dimensional space to reproduce the distances between cases along the manifold.

Solutions to exercises

  1. Recreate the plot of t-SNE1 versus t-SNE2 without scaling the variables first:
swissTib %>%
  mutate(tSNE1 = swissTsne$Y[, 1], tSNE2 = swissTsne$Y[, 2]) %>%
  gather(key = "Variable",
         value = "Value",
         c(-tSNE1, -tSNE2, -Status)) %>%
  ggplot(aes(tSNE1, tSNE2, col = Value, shape = Status)) +
  facet_wrap(~ Variable) +
  geom_point(size = 3) +
  scale_color_gradient(low = "dark blue", high = "cyan") +
  theme_bw()

# Scaling is necessary because the scales of the variables are different
# from each other.
  2. Rerun UMAP, but output and plot three new axes instead of two:
umap3d <- select(swissTib, -Status) %>%
  as.matrix() %>%
  umap(n_neighbors = 7, min_dist = 0.1, n_components = 3,
       metric = "manhattan", n_epochs = 200, verbose = TRUE)

library(GGally)

ggpairs(as.data.frame(umap3d$layout), mapping = aes(col = swissTib$Status))