Chapter 17. Hierarchical clustering

This chapter covers

  • Understanding hierarchical clustering
  • Using linkage methods
  • Measuring the stability of a clustering result

In the previous chapter, we saw how k-means clustering finds k centroids in the feature space and iteratively updates them to find a set of clusters. Hierarchical clustering takes a different approach and, as its name suggests, can learn a hierarchy of clusters in a dataset. Instead of getting a “flat” output of clusters, hierarchical clustering gives us a tree of clusters within clusters. As a result, hierarchical clustering provides more insight into complex grouping structures than flat clustering methods like k-means.

The tree of clusters is built iteratively by calculating, at each step, the distance between each case or cluster and every other case or cluster in the dataset. Depending on the algorithm, either the pair of cases/clusters that are most similar to each other is merged into a single cluster, or the cases/clusters that are most dissimilar from each other are split into separate clusters. I’ll introduce both approaches to you later in the chapter.

By the end of this chapter, I hope you’ll understand how hierarchical clustering works. We’ll apply this method to the GvHD data from the last chapter to help you understand how hierarchical clustering differs from k-means. If you no longer have the gvhdScaled object defined in your global environment, just rerun listings 16.1 and 16.2.

17.1. What is hierarchical clustering?

In this section, I’ll give you a deeper understanding of what hierarchical clustering is and how it differs from k-means. I’ll show you the two different approaches we can take to perform hierarchical clustering, how to interpret a graphical representation of the learned hierarchy, and how to choose the number of clusters to retain.

When we looked at k-means clustering in the last chapter, we only considered a single level of clustering. But sometimes, hierarchies exist in our data set that clustering at a single, flat level is unable to highlight. For example, imagine that we were looking at clusters of instruments in an orchestra. At the highest level, we could place each instrument into one of four different clusters:

  • Percussion
  • Brass
  • Woodwinds
  • Strings

But we could then further split each of these clusters into sub-clusters based on the way they are played:

  • Percussion
    • Played with a mallet
    • Played by hand
  • Brass
    • Valve
    • Slide
  • Woodwinds
    • Reeded
    • Non-reeded
  • Strings
    • Plucked
    • Bowed

Next, we could further split this level of clusters into sub-clusters based on the sounds they make:

  • Percussion
    • Played with a mallet
      • Timpani
      • Gong
    • Played by hand
      • Hand cymbals
      • Tambourine
  • Brass
    • Valve
      • Trumpet
      • French horn
    • Slide
      • Trombone
  • Woodwinds
    • Reeded
      • Clarinet
      • Bassoon
    • Non-reeded
      • Flute
      • Piccolo
  • Strings
    • Plucked
      • Harp
    • Bowed
      • Violin
      • Cello

Notice that we have formed a hierarchy where there are clusters of instruments within other clusters, going all the way from a very high-level clustering down to each individual instrument. A common way to visualize hierarchies like this is with a graphical representation called a dendrogram. A possible dendrogram for our orchestra hierarchy is shown in figure 17.1.

Figure 17.1. Dendrogram showing an imaginary clustering of instruments in an orchestra. Horizontal lines indicate the merging of separate clusters. The height of a merge indicates the similarity between the clusters (lower merge, higher similarity).

Notice that at the bottom of the dendrogram, each instrument is represented by its own vertical line, and at this level, each instrument is considered to be in a cluster of its own. As we move up the hierarchy, instruments in the same cluster are connected by a horizontal line. The height at which clusters merge like this is inversely proportional to how similar the clusters are to each other. For example, I have (subjectively) drawn this dendrogram to suggest that the piccolo and flute are more similar to each other than the bassoon and clarinet are to each other.

Typically, when finding a hierarchy in data like this, one end of the dendrogram displays every case in its own cluster; these clusters merge upward until eventually, all the cases are placed into a single cluster. As such, I’ve indicated the positions of our strings, woodwinds, brass, and percussion clusters, but I have continued clustering these clusters until there is only one cluster containing all the cases.

The purpose of hierarchical clustering algorithms, therefore, is to learn this hierarchy of clusters in a data set. The main benefit of hierarchical clustering over k-means is that we get a much finer-grained understanding of the structure of our data, and this approach is often able to reconstruct real hierarchies in nature. For example, imagine that we sequence the genomes (all the DNA) of all breeds of dog. We can safely assume that the genome of a breed will be more similar to the genome of the breed(s) it was derived from than it is to the genomes of breeds it was not derived from. If we apply hierarchical clustering to this data, the hierarchy, which can be visualized as a dendrogram, can be directly interpreted as showing which breeds were derived from other breeds.

Bgk rraychhei ja otdx uulfes, rbd web ue vw ritanpito rxg genrardmdo rjne z feniit rvz lx clusters? Mvff, rz npc tghihe nx rvq grdoreandm, wv zns cut xry krtx ozntrolahiyl pns rovc vry numbre lk clusters rc rrcq lvele. Bnoehrt cwd el naingiimg jr jc ysrr lj wv tkvw er zrh z cesli ugtohrh vyr adeormrgdn, woreevh ncmh iiuvailndd rasbnech lowdu flfs xll aj urx bnemru vl clusters. Zxve cezq cr figure 17.1. Jl ow ahr vrb otro heewr J’ek lelabed rqx snigtsr, owdndiwos, rsasb, ycn sirnepsouc, wo wudol odr lvtp aiuvidndli clusters, nyc cases owdul yx gsasdine rv cervhwhei lv etshe letp clusters bqro oflf witinh. J’ff wzkb qxy xwg wo nzs ctlese z gzr ptoin earlt jn jgar teicosn.

Note

If we cut the tree closer to the top, we get fewer clusters. If we cut the tree closer to the bottom, we get more clusters.

Okay, we have an understanding of what hierarchical clustering algorithms try to achieve. Now let’s talk about how they achieve it. There are two approaches we can take while trying to learn hierarchies in data:

  • Agglomerative
  • Divisive

Agglomerative hierarchical clustering is where we start with every case isolated (and lonely) in its own cluster, and sequentially merge clusters until all the data resides within a single cluster. Divisive hierarchical clustering does the opposite: it starts with all the cases in a single cluster and recursively splits them into smaller clusters until each case resides in its own cluster.

17.1.1. Agglomerative hierarchical clustering

In this section, I’ll show you how agglomerative hierarchical clustering learns the structure in the data. The steps of the algorithm are quite simple:

  1. Calculate some distance metric (defined by us) between each cluster and all other clusters.
  2. Merge the most similar clusters together into a single cluster.
  3. Repeat steps 1 and 2 until all cases reside in a single cluster.

An example of how this might look is shown in figure 17.2. We start with nine cases (and therefore nine clusters). The algorithm calculates a distance metric (more about this soon) between each of the clusters and merges the clusters that are most similar to each other. This continues until all the cases are gobbled up by the final supercluster.
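
Here is a minimal, runnable sketch of those three steps, using R’s built-in hclust() function (which we’ll meet properly in section 17.2) on nine made-up points, mirroring figure 17.2:

set.seed(123)
toy <- matrix(rnorm(18), ncol = 2)                # nine toy cases, two features

toyDist <- dist(toy)                              # step 1: pairwise distances
toyClust <- hclust(toyDist, method = "complete")  # steps 2 and 3: merge repeatedly

plot(toyClust)                                    # the learned hierarchy as a tree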

Figure 17.2. Agglomerative hierarchical clustering merges clusters that are closest to each other at each iteration. Ellipses indicate the formation of clusters at each iteration, going from top left to bottom right.

So how do we calculate the distance between clusters? The first choice we need to make is what kind of distance we want to compute. As usual, the Euclidean and Manhattan distances are the most popular choices. The second choice is how to calculate this distance metric between clusters. Calculating the distance between two cases (two vectors) is reasonably obvious, but a cluster contains multiple cases; how do we calculate, say, Euclidean distance between two clusters? Well, we have a few options available to us, called linkage methods:

  • Centroid linkage
  • Single linkage
  • Complete linkage
  • Average linkage
  • Ward’s method

Each of these linkage methods is illustrated in figure 17.3. Centroid linkage calculates the distance (Euclidean or Manhattan, for example) between each cluster’s centroid and every other cluster’s centroid. Single linkage takes the distance between the nearest cases of two clusters as the distance between those clusters. Complete linkage takes the distance between the furthest cases of two clusters as the distance between those clusters. Average linkage takes the average distance between all the cases of two clusters as the distance between those clusters.
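
To make these definitions concrete, here is a small sketch (with made-up coordinates, not the GvHD data) that computes the single, complete, average, and centroid linkage distances between two tiny clusters by hand:

clusterA <- matrix(c(1, 2,
                     2, 1,
                     1.5, 1.5), ncol = 2, byrow = TRUE)
clusterB <- matrix(c(6, 7,
                     7, 6), ncol = 2, byrow = TRUE)

# Euclidean distances between every case in cluster A and every case in cluster B
pairwise <- as.matrix(dist(rbind(clusterA, clusterB)))[1:3, 4:5]

min(pairwise)   # single linkage: distance between the nearest cases
max(pairwise)   # complete linkage: distance between the furthest cases
mean(pairwise)  # average linkage: mean of all the pairwise distances

dist(rbind(colMeans(clusterA), colMeans(clusterB)))  # centroid linkage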

Figure 17.3. Different linkage methods to define the distance between clusters. Centroid linkage calculates the distance between cluster centroids. Single linkage calculates the smallest distance between clusters. Complete linkage calculates the largest distance between clusters. Average linkage calculates all pairwise distances between cases in two clusters and finds the mean. Ward’s method calculates the within-cluster sum of squares for each candidate merge and chooses the one with the smallest value.

Ward’s method is a little more complex. For every possible combination of clusters, Ward’s method (sometimes called Ward’s minimum variance method) calculates the within-cluster sum of squares. Take a look at the examples for Ward’s method in figure 17.3. The algorithm has three clusters to consider merging. For each candidate merge, the algorithm calculates the sum of squared differences between each case and its cluster’s centroid, and then adds these sums of squares together. The candidate merge that results in the smallest sum of squared differences is chosen at each step.
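
And here is a minimal sketch (using the same kind of toy clusters as above) of the quantity Ward’s method evaluates: the within-cluster sum of squares of a candidate merged cluster:

# Sum of squared differences between each case and the cluster's centroid
within_ss <- function(cluster) {
  centroid <- colMeans(cluster)
  sum(sweep(cluster, 2, centroid)^2)
}

clusterA <- matrix(c(1, 2,
                     2, 1,
                     1.5, 1.5), ncol = 2, byrow = TRUE)
clusterB <- matrix(c(6, 7,
                     7, 6), ncol = 2, byrow = TRUE)

# The within-cluster sum of squares if A and B were merged; adding this to the
# sums of squares of the remaining clusters gives the total that Ward's method
# compares across candidate merges, choosing the smallest
within_ss(rbind(clusterA, clusterB))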

17.1.2. Divisive hierarchical clustering

In this section, I’ll show you how divisive hierarchical clustering works. Unlike agglomerative clustering, divisive clustering starts with all cases in a single cluster and recursively divides this into smaller and smaller clusters, until each case resides in its own cluster. Finding the optimal split at each stage of clustering is a difficult task, so divisive clustering uses a heuristic approach.

At each stage of clustering, the cluster with the largest diameter is chosen. Recall from figure 16.5 that a cluster’s diameter is the largest distance between any two cases within the cluster. The algorithm then finds the case in this cluster that has the largest average distance to all the other cases in the cluster. This most-dissimilar case starts its own splinter group (like a rebel without a cause). The algorithm then iterates through every case in the cluster and assigns cases to either the splinter group or the original cluster, depending on which they are most similar to. In essence, divisive clustering applies k-means clustering (with k = 2) at each level of the hierarchy, in order to split each cluster. This process repeats until all cases reside in their own cluster.

Rtvog zj xdnf xxn etmlmoitepainn lk divisive clustering: qkr DIANA (DIvisive ANAlysis) algorithm. Xlavmoriteegg clustering cj xktm moclynom qkzq sng zj faxc ompliyattluoacn vxiesepne runs rpx KJRGB algorithm. Hrowvee, kssaitem sxhm larye jn hierarchical clustering ntncoa oh ixdfe hfrurte gwnk rop rktv; kz eeshraw agglomerative clustering pzm uk tebetr rs indnifg lslma clusters, KJROC mbz qe etrteb rs nniidgf egalr clusters. Jn ory ratx lk rbo rctheap, J’ff xwfs bhe troughh wdv xr rrofpem agglomerative clustering nj X, gry nxx kl uvr rscsexiee zj xr tepera pvr clustering iugsn KJCKY znu pcmraeo grv utsrles.

17.2. Building your first agglomerative hierarchical clustering model

In this section, I’ll show you how to build an agglomerative hierarchical clustering model in R. Sadly, there isn’t an implementation of hierarchical clustering wrapped by the mlr package, so we’re going to use the hclust() function from the built-in stats package.

The hclust() function that we’ll use to perform agglomerative hierarchical clustering expects a distance matrix as input, rather than the raw data. A distance matrix contains the pairwise distances between each combination of elements. This distance can be any distance metric we specify, and in this situation, we’ll use the Euclidean distance. Because computing the distances between cases is the first step of hierarchical clustering, you might expect hclust() to do this for us. But this two-step process of creating our own distance matrix and then supplying it to hclust() allows us the flexibility of using a variety of distance metrics.

We create a distance matrix in R using the dist() function, supplying the data we want to compute distances for as the first argument and the type of distance we want to use as the second. Notice that we’re using our scaled data set, because hierarchical clustering is also sensitive to differences in scale between variables (as is any algorithm that relies on distance between continuous variables):

gvhdDist <- dist(gvhdScaled, method = "euclidean")
Tip

If you want a more visual example of what a distance matrix looks like, run dist(c(4, 7, 11, 30, 16)). Don’t try to print the distance matrix we create in this section—it contains more than 2.3 × 10^7 elements!
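
For reference, that small example returns the lower triangle of pairwise absolute differences between the five values (output shown as comments):

dist(c(4, 7, 11, 30, 16))
#    1  2  3  4
# 2  3
# 3  7  4
# 4 26 23 19
# 5 12  9  5 14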

Now that we have our distance matrix, we can run the algorithm to learn the hierarchy in our data. The first argument to the hclust() function is the distance matrix, and the method argument allows us to specify the linkage method we wish to use to define the distance between clusters. The options available are "ward.D", "ward.D2", "single", "complete", "average", "centroid", and a few less commonly used ones that I haven’t defined (see ?hclust if you’re interested in these). Notice that there seem to be two options for Ward’s method: the option "ward.D2" is the correct implementation of Ward’s method, as I described earlier. In this example, we’re going to start by using Ward’s method ("ward.D2"), but I’ll get you to compare the result of this to other methods as part of this chapter’s exercises:

gvhdHclust <- hclust(gvhdDist, method = "ward.D2")

Now that hclust() has learned the hierarchical clustering structure of the data, let’s represent this as a dendrogram. We can do this by simply calling plot() on our clustering model object, but the tree is a little clearer if we first convert our model into a dendrogram object and plot that. We can convert our clustering model into a dendrogram object using the as.dendrogram() function. To plot the dendrogram, we pass it to the plot() function. By default, the plot will draw a label for each case in the original data. Because we have such a large data set, let’s suppress these labels using the argument leaflab = "none".

Listing 17.1. Plotting the dendrogram
gvhdDend <- as.dendrogram(gvhdHclust)

plot(gvhdDend, leaflab = "none")

The resulting plot is shown in figure 17.4. The y-axis here represents the distance between clusters, based on whatever linkage method (and distance metric) we used. Because we used Ward’s method, the values of this axis are the within-cluster sums of squares. When two clusters are merged together, they are connected by a horizontal line, the position of which along the y-axis corresponds to the distance between those clusters. Therefore, clusters of cases that merge lower down the tree (which is earlier in agglomerative clustering) are more similar to each other than clusters that merge further up the tree. The ordering of cases along the x-axis is optimized such that similar cases are drawn near each other to aid interpretation (otherwise, the branches would cross). As we can see, the dendrogram recursively joins clusters, from each case being in its own cluster to all the cases belonging to a supercluster.

Figure 17.4. The resulting dendrogram representing our hierarchical clustering model. The y-axis represents the distances between cases. Horizontal lines indicate the positions at which cases/clusters merge with each other. The higher the merge, the less similar the clusters are to each other.
Exercise 1

Repeat the clustering process, but this time specify method = "manhattan" when creating the distance matrix (don’t overwrite any existing objects). Plot a dendrogram of the cluster hierarchy, and compare it to the dendrogram we got using the Euclidean distance.

The hierarchical clustering algorithm has done its job: it’s learned the hierarchy, and what we do with it is up to us. We may want to directly interpret the structure of the tree to make some inference about a hierarchy that might exist in nature, though in our (large) data set, that could be quite challenging.

Another common use of hierarchical clustering is to order the rows and columns of heatmaps, for example, for gene expression data. Ordering the rows and columns of a heatmap using hierarchical clustering helps researchers identify clusters of genes and clusters of patients simultaneously.

Finally, our primary motivation may be to identify a finite number of clusters within our data set that are most interesting to us. This is what we will do with our clustering result.

17.2.1. Choosing the number of clusters

In this section, I’ll show you ways of deciding how many clusters to extract from a hierarchy. Another way of thinking about this is that we’re deciding what level of the hierarchy to use for clustering.

To define a finite number of clusters following hierarchical clustering, we need to define a cut point on our dendrogram. If we cut the tree near the top, we’ll get fewer clusters; and if we cut the tree near the bottom, we’ll get more clusters. So how do we choose a cut point? Well, our friends the Davies-Bouldin index, the Dunn index, and the pseudo F statistic can help us here. For k-means clustering, we performed a cross-validation-like procedure for estimating the performance of different numbers of clusters. Sadly, we can’t use this approach for hierarchical clustering because, unlike k-means, hierarchical clustering cannot predict cluster membership of new cases.

Note

The hierarchical clustering algorithms themselves can’t predict the cluster membership of new cases, but you could do something like assigning new data to the cluster with the nearest centroid. You could use this approach to create separate training and test sets to evaluate internal cluster metrics on.
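
Here is a minimal sketch of that nearest-centroid idea (purely illustrative, not from the book’s code; newCases stands for a hypothetical matrix of new, already-scaled cases with the same columns as the clustered data):

assign_to_nearest_centroid <- function(newCases, data, clusters) {
  # Centroid of each existing cluster: one row per cluster, one column per variable
  centroids <- apply(data, 2, function(col) tapply(col, clusters, mean))
  # Assign each new case to the cluster whose centroid is closest (Euclidean)
  apply(newCases, 1, function(case) {
    which.min(sqrt(rowSums(sweep(centroids, 2, case)^2)))
  })
}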

Instead, we can make use of bootstrapping. Recall from chapter 8 that bootstrapping is the process of taking bootstrap samples, applying some computation to each sample, and returning a statistic(s). The mean of our bootstrapped statistic(s) tells us the most likely value, and the distribution gives us an indication as to the stability of the statistic(s).

Note

Remember that to get a bootstrap sample, we randomly select cases from a data set, with replacement, to create a new sample the same size as the old. Sampling with replacement simply means that once we sample a particular case, we put it back, such that there is a possibility it will be drawn again.
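
For example, this base R one-liner draws one bootstrap sample of the numbers 1 to 5; because we sample with replacement, some values may appear more than once and others not at all (the exact draw depends on your random seed):

sample(1:5, size = 5, replace = TRUE)  # e.g. 2 2 5 1 3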

In the context of hierarchical clustering, we can use bootstrapping to generate multiple samples from our data and generate a separate hierarchy for each sample. We can then select a range of cluster numbers from each hierarchy and calculate the internal cluster metrics for each. The advantage of using bootstrapping is that calculating the internal cluster metrics on the full data set doesn’t give us an indication of the stability of the estimate, whereas the bootstrap samples do. The bootstrap sample of cluster metrics will have some variation around its mean, so we can choose the number of clusters with the most optimal and stable metrics.

Let’s start by defining our own function that takes our data and a vector of cluster memberships and returns our three familiar internal cluster metrics for the data: the Davies-Bouldin index, the Dunn index, and the pseudo F statistic. Because the function we’ll use to calculate the Dunn index expects a distance matrix, we’ll include an additional argument in our function to which we’ll supply a precomputed distance matrix.

Listing 17.2. Defining the cluster_metrics function
cluster_metrics <- function(data, clusters, dist_matrix) {
  list(db       = clusterSim::index.DB(data, clusters)$DB,
       G1       = clusterSim::index.G1(data, clusters),
       dunn     = clValid::dunn(dist_matrix, clusters),
       clusters = length(unique(clusters))
  )
}

Follow the function body with me so what we’re doing makes sense. We use function() to define a function, assigning it to the name cluster_metrics (this will allow us to call the function using cluster_metrics()). We define three mandatory arguments for the function:

  • data, to which we will pass the data we’re clustering
  • clusters, a vector containing the cluster membership of every case in data
  • dist_matrix, to which we will pass the precomputed distance matrix for data

The body of the function (the instructions that tell the function what to do) is defined inside curly brackets ({}). Our function will return a list with four elements: the Davies-Bouldin index (db), the pseudo F statistic (G1), the Dunn index (dunn), and the number of clusters. Rather than define them from scratch, we’re using predefined functions from other packages to compute the internal cluster metrics. The Davies-Bouldin index is computed using the index.DB() function from the clusterSim package, which takes the data and clusters arguments (the statistic itself is contained in the $DB component). The pseudo F statistic is computed using the index.G1() function, also from the clusterSim package, and takes the same arguments as index.DB(). The Dunn index is computed using the dunn() function from the clValid package, which takes the dist_matrix and clusters arguments.

Our motivation for defining this function is that we’re going to take bootstrap samples from our data set, learn the hierarchy in each, select a range of cluster numbers from each, and use our function to calculate these three metrics for each number of clusters within each bootstrap sample. So now, let’s create our bootstrap samples. We’ll create 10 bootstrap samples from our gvhdScaled data set. We’re using the map() function to repeat the sampling process 10 times, to return a list where each element is a different bootstrap sample.

Listing 17.3. Creating bootstrap samples
gvhdBoot <- map(1:10, ~ {
  gvhdScaled %>%
    as_tibble() %>%
    sample_n(size = nrow(.), replace = TRUE)
})
Note

Remember that ~ is just shorthand for function().

We’re using the sample_n() function from the dplyr package to create the samples. This function randomly samples rows from a data set. Because this function cannot handle matrices, we first need to pipe our gvhdScaled data into the as_tibble() function. By setting the argument size = nrow(.), we’re asking sample_n() to randomly draw a number of cases equal to the number of rows in the original data set (the . is shorthand for “the data set that was piped in”). By setting the replace argument equal to TRUE, we’re telling the function to sample with replacement. Creating simple bootstrap samples really is as easy as this!

Now let’s use our cluster_metrics() function to calculate those three internal metrics for a range of cluster numbers, for each bootstrap sample we just generated. Take a look at the following listing, and don’t go cross-eyed! I’ll take you through the code step by step.

Listing 17.4. Calculating performance metrics of our clustering model
metricsTib <- map_df(gvhdBoot, function(boot) {
  d <- dist(boot, method = "euclidean")
  cl <- hclust(d, method = "ward.D2")

  map_df(3:8, function(k) {
    cut <- cutree(cl, k = k)
    cluster_metrics(boot, clusters = cut, dist_matrix = d)
  })
})
Tip

The map_df() function is just like map(), but instead of returning a list, it combines each element row-wise to return a data frame.

We start by calling the map_df() function so that we can apply a function to every element of our list of bootstrap samples. We define an anonymous function that takes boot (the current element being considered) as its only argument.

For each element in gvhdBoot, the anonymous function computes its Euclidean distance matrix, stores it as the object d, and performs hierarchical clustering using that matrix and Ward’s method. Once we have the hierarchy for each bootstrap sample, we use another map_df() function call to select between three and eight clusters to partition the data into, and then calculate the three internal cluster metrics on each result. We’re going to use this process to see which number of clusters, between three and eight, gives us the best internal cluster metric values.

Selecting the number of clusters to retain from a hierarchical clustering model is done using the cutree() function. We use this function to cut our dendrogram at a place that returns a number of clusters. We can do this either by specifying a height at which to cut, using the h argument, or by specifying a specific number of clusters to retain, using the k argument (as done here). The first argument is the result of calling the hclust() function. The output of the cutree() function is a vector indicating the cluster number assigned to each case in the data set. Once we have this vector, we can call our cluster_metrics() function, supplying the bootstrap data, the vector of cluster memberships, and the distance matrix.

Warning

This took nearly 3 minutes to run on my machine!
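
If the cutree() step is still a little abstract, here is a small standalone illustration of the two ways of cutting the tree we built earlier (the height of 200 is an arbitrary value chosen only to demonstrate the h argument, not a recommended cut point):

cutByK <- cutree(gvhdHclust, k = 4)   # ask for exactly four clusters
cutByH <- cutree(gvhdHclust, h = 200) # cut at a fixed height instead

table(cutByK)                         # how many cases fall in each cluster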

If what we just did is a little unclear to you, print the metricsTib tibble to see what the output looks like. We have a tibble with one column for each of the internal cluster metrics, and a column indicating the number of clusters for which the metrics were calculated.

Let’s plot the result of our bootstrapping experiment. We’re going to create a separate subplot for each internal cluster metric (using faceting). Each subplot will show the number of clusters on the x-axis, the value of the internal cluster metric on the y-axis, a separate line for each individual bootstrap sample, and a line that connects the mean value across all bootstraps.

Listing 17.5. Transforming the data, ready for plotting
metricsTib <- metricsTib %>%
  mutate(bootstrap = factor(rep(1:10, each = 6))) %>%
  gather(key = "Metric", value = "Value", -clusters, -bootstrap)

Mo ifrts vnuk er taemut c nkw clmoun, igndcnitia rvg traposbot mespla zoyc cvac bgonels er. Tcuease ehetr tcx 10 bootstrap samples, daevatlue lkt 6 nfedetrfi msbrnue le clusters caqx (3 re 8), vw etacre ujrc rbvalaie bg niusg kdr rep() iunofntc rx ptreea spzx ebnrmu tvlm 1 rx 10, jka smtei. Mo twhs jbzr senidi vrq factor() coinnutf rv sueern rj njz’r edtaert zs c nnuoutiosc lvaaibre dxwn plotting. Qrko, wv aegrth dvr data ae rrgs rux occhei lx tilnaner crmtie jc icoeanndt tiinhw z egnils oucmln pnc grk evlua lx bcrr meirtc jz dxuf nj hnoatre lncoum. Mv ipyscfe -clusters ysn -bootstrap re ffxr rkg cutoinfn not vr ergtha htees variables. Lnjtr jpra nvw bebtil, bsn oy tzop gyx nuaendtsdr dew kw bre rheet.

Now that our data is in this format, we can create the plot.

Listing 17.6. Calculating metrics
ggplot(metricsTib, aes(as.factor(clusters), Value)) +
  facet_wrap(~ Metric, scales = "free_y") +
  geom_line(size = 0.1, aes(group = bootstrap)) +
  geom_line(stat = "summary", fun.y = "mean", aes(group = 1)) +
  stat_summary(fun.data="mean_cl_boot",
               geom="crossbar", width = 0.5, fill = "white") +
  theme_bw()

We map the number of clusters (as a factor) to the x aesthetic and the value of the internal cluster metric to the y aesthetic. We add a facet_wrap() layer to facet by internal cluster metric, setting the scales = "free_y" argument because the metrics are on different scales. Next, we add a geom_line() layer, using the size argument to make these lines less prominent, and map the bootstrap sample number to the group aesthetic. This layer will therefore draw a separate, thin line for each bootstrap sample.

Tip

Notice that when you specify an aesthetic mapping inside the ggplot() function layer, the mapping is inherited by all additional layers that use that aesthetic. However, you can specify aesthetic mappings using the aes() function inside each geom function, and the mapping will apply to that layer only.

We then add another geom_line() layer that will connect the mean across all bootstrap samples. By default, the geom_line() function likes to connect individual values. If we want the function to connect a summary statistic (like a mean), we need to specify the stat = "summary" argument and then use the fun.y argument to tell the function what summary statistic we want to plot. Here we’ve used "mean", but you can supply the name of any function that returns a single value of y for its input.

Finally, it would be nice to visualize the 95% confidence interval for the bootstrap samples. The 95% confidence intervals tell us that, if we were to repeat this experiment 100 times, 95 of the constructed confidence intervals would be expected to contain the true value of the metric. The more the estimates agree with each other between bootstrap samples, the smaller the confidence interval will be. We visualize the confidence intervals using the flexible stat_summary() function. This function can be used to visualize multiple summary statistics in many different ways. To draw the mean ± 95% confidence intervals, we use the fun.data argument to specify that we want "mean_cl_boot". This will draw bootstrap confidence intervals (95% by default).

Note

The other option would be to use "mean_cl_normal" to construct the confidence intervals, but this assumes the data is normally distributed, and this may not be true.

Now that we’ve defined our summary statistics, let’s specify the geom that we’re going to use to represent them, using the geom argument. The geom "crossbar" draws what looks like the box part of a box and whiskers plot, where a solid line is drawn through the measure of central tendency that we specified (the mean, in this case) and the upper and lower limits of the box extend to the range of the measure of dispersion we asked for (95% confidence limits, in this case). Then, according to my preference, we set the width of the crossbars to 0.5 and the fill color to white.

The resulting plot is shown in figure 17.5. Take a moment to appreciate how beautiful the result is after all the hard work we just put in. Look back at listing 17.6 to make sure you understand how we created this plot (stat_summary() is probably the most confusing bit). It seems that the number of clusters resulting in the smallest mean Davies-Bouldin index and the largest mean Dunn index and mean pseudo F statistic is four. Take a look at the thin lines representing each individual bootstrap. Can you see that some of them might have led us to conclude that a different number of clusters was optimal? This is why bootstrapping these metrics is better than calculating each metric only once using a single data set.

Figure 17.5. Plotting the result of our bootstrap experiment. Each subplot shows the result of a different internal cluster metric. The x-axis shows the cluster number, and the y-axis shows the value of each metric. Faint lines connect the results of each individual bootstrap sample, while the bold line connects the mean. The top and bottom of each crossbar indicate the 95% confidence interval for that particular value, and the horizontal line represents the mean.
Exercise 2

Let’s experiment with another way we could visualize these results. Start with the following operations using dplyr (piping each step into the next):

  1. Group the metricsTib object by Metric.
  2. Use mutate() to replace the Value variable with scale(Value).
  3. Group by both Metric and clusters.
  4. Mutate a new column, Stdev, equal to sd(Value).

Then pipe this tibble into a ggplot() call with the following aesthetic mappings:

  • x = clusters
  • y = Metric
  • fill = Value
  • height = Stdev

Finally, add a geom_tile() layer. Look back at your code and make sure you understand how you created this plot and how to interpret it.

17.2.2. Cutting the tree to select a flat set of clusters

In this section, I’ll show you how we can finally cut the dendrogram to return the cluster labels for our desired number of clusters. Our bootstrapping experiment has led us to conclude that four is the optimal number of clusters with which to represent the structure in our GvHD data set. To extract a vector of cluster memberships representing these four clusters, we use the cutree() function, supplying our clustering model and k (the number of clusters we want to return). We can visualize how our dendrogram is cut to generate these four clusters by plotting the dendrogram as before and calling the rect.hclust() function with the same arguments we gave to cutree().

Listing 17.7. Cutting the tree
gvhdCut <- cutree(gvhdHclust, k = 4)

plot(gvhdDend, leaflab = "none")

rect.hclust(gvhdHclust, k = 4)

This function draws rectangles on an existing dendrogram plot to show which branches are cut to result in the number of clusters we specified. The resulting plot is shown in figure 17.6.

Figure 17.6. The same plot as in figure 17.4, but this time with rectangles indicating the clusters resulting from cutting the tree

Next, let’s plot the clusters using ggpairs() like we did for our k-means model in chapter 16.

Listing 17.8. Plotting the clusters
gvhdTib <- mutate(gvhdTib, hclustCluster = as.factor(gvhdCut))

 ggpairs(gvhdTib, aes(col = hclustCluster),
        upper = list(continuous = "density"),
        lower = list(continuous = wrap("points", size = 0.5))) +
  theme_bw()
Figure 17.7. ggpairs() plot showing the result of our hierarchical clustering model. Compare these clusters to the ones obtained by k-means in figure 16.8.

The resulting figure is shown in figure 17.7. Compare these clusters with the ones returned by the k-means model in figure 16.8. Both methods result in similar cluster memberships, and the clusters from our hierarchical clustering also seem to undercluster cluster 3.

17.3. How stable are our clusters?

In this section, I’ll show you one more tool to evaluate the performance of our clustering model. In addition to calculating internal cluster metrics on each bootstrap sample in a bootstrapping experiment, we can also quantify how well the cluster memberships agree with each other between bootstrap samples. This agreement is called the cluster stability. A common way to quantify cluster stability is with a similarity metric called the Jaccard index (named after the botany professor who published it).

The Jaccard index quantifies the similarity between two sets of discrete values. Its value can be interpreted as the percentage of the total values that are present in both sets, and it ranges from 0% (no common values) to 100% (all values common to both sets). The Jaccard index is defined in equation 17.1.

equation 17.1. Jaccard index = |a ∩ b| / |a ∪ b|, the number of values present in both sets divided by the total number of unique values across both sets.

For example, if we have two sets

a = {3, 3, 5, 2, 8}

b = {1, 3, 5, 6}

then the Jaccard index is |{3, 5}| / |{1, 2, 3, 5, 6, 8}| = 2 / 6 ≈ 0.33, or about 33%.
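
If you want to check that arithmetic yourself, here is a minimal sketch of equation 17.1 as an R function, applied to the two sets above (the clusterboot() function introduced next computes this for us on real cluster memberships):

jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))  # shared values / unique values
}

jaccard(c(3, 3, 5, 2, 8), c(1, 3, 5, 6))  # 0.333..., i.e. about 33%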

If we cluster on multiple bootstrap samples, we can calculate the Jaccard index between the “original” clusters (the clusters on all the data) and each of the bootstrap samples, and take the mean. If the mean Jaccard index is low, then cluster membership is changing considerably between bootstrap samples, indicating our clustering result is unstable and may not generalize well. If the mean Jaccard index is high, then cluster membership is changing very little, indicating a stable clustering result.

Luckily for us, the clusterboot() function from the fpc package has been written to do just this! Let’s first load the fpc package into our R session. Because clusterboot() produces a series of base R plots as a side effect, let’s split the plotting device into three rows and four columns to accommodate the output, using par(mfrow = c(3, 4)).

Listing 17.9. Using clusterboot() to calculate the Jaccard index
library(fpc)

par(mfrow = c(3, 4))

clustBoot <- clusterboot(gvhdDist, B = 10,
                         clustermethod = disthclustCBI,
                         k = 4, cut = "number", method = "ward.D2",
                         showplots = TRUE)

clustBoot

Number of resampling runs:  10

Number of clusters found in data:  4

 Clusterwise Jaccard bootstrap (omitting multiple points) mean:
[1] 0.9728 0.9208 0.8348 0.9624

The first argument to the clusterboot() function is the data. This argument will accept either the raw data or a distance matrix of class dist (it will handle either appropriately). The argument B is the number of bootstrap samples we wish to calculate, which I’ve set to 10 for the sake of reducing running time. The clustermethod argument is where we specify which type of clustering model we wish to build (see ?clusterboot for a list of available methods; many common methods are included). For hierarchical clustering, we set this argument equal to disthclustCBI. The k argument specifies the number of clusters we want to return, method lets us specify the linkage method to use for clustering, and showplots gives us the opportunity to suppress the printing of the plots if we wish. The function may take a couple of minutes to run.

I’ve truncated the output from printing the result of clusterboot() to show the most important information: the clusterwise Jaccard bootstrap means. These four values are the mean Jaccard indices for each cluster, between the original clusters and each bootstrap sample. We can see that all four clusters have good agreement (> 83%) across different bootstrap samples, suggesting high stability of the clusters.

The resulting plot is shown in figure 17.8. The first (top-left) and last (bottom-right) plots show the clustering on the original, full data set. Each plot between these shows the clustering on a different bootstrap sample. This plot is a useful way of graphically evaluating the stability of the clusters.

Figure 17.8. The graphical output of the clusterboot() function. The first and last plots show the full, original clusters of data, while the plots in between show the clusters on the bootstrap samples. The cluster membership of each case is indicated by a number. Notice the relatively high stability of the clusters.
17.4. Strengths and weaknesses of hierarchical clustering

While it often isn’t easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether hierarchical clustering will perform well for you.

The strengths of hierarchical clustering are as follows:

  • It learns a hierarchy that may in and of itself be interesting and interpretable.
  • It is quite simple to implement.

The weaknesses of hierarchical clustering are these:

  • It cannot natively handle categorical variables. This is because calculating the Euclidean distance on a categorical feature space isn’t meaningful.
  • It cannot select the optimal number of “flat” clusters.
  • It is sensitive to data on different scales.
  • It cannot predict cluster membership of new data.
  • Once cases have been assigned to a cluster, they cannot be moved.
  • It can become computationally expensive with large datasets.
  • It is sensitive to outliers.
Exercise 3

Use the clusterboot() function to bootstrap the Jaccard index for k-means clustering (with four clusters), just like we did for hierarchical clustering. This time, the clustermethod should be equal to kmeansCBI (to use k-means), and you should replace the method argument with algorithm = "Lloyd". Which method results in more stable clusters: k-means or hierarchical clustering?

Exercise 4

Use the diana() function from the cluster package to perform divisive hierarchical clustering on the GvHD data. Save the output as an object, and plot the dendrogram by passing it into as.dendrogram() %>% plot(). Compare this to the dendrogram we got from agglomerative hierarchical clustering. Warning: this took nearly 15 minutes on my machine!

Exercise 5

Xaeetp gtx bootstrapping pneieextrm wrgj agglomerative hierarchical clustering, hrq jcrb jvmr jvl rkd mrebun lk clusters er tpxl npz macpoer gxr rinfedtfe linkage methods nx zuzk opstaotrb. Mjdqs igknael oedtmh froeprms drx grak?

Exercise 6

Recluster the data using hclust(), using the linkage method indicated as the best from exercise 5. Plot these clusters using ggpairs(), and compare them to those we generated using Ward’s method. Does this new linkage method do a good job of finding clusters?

Summary

  • Hierarchical clustering uses the distances between cases to learn a hierarchy of clusters.
  • How these distances are calculated is controlled by our choice of linkage method.
  • Hierarchical clustering can be bottom-up (agglomerative) or top-down (divisive).
  • A flat set of clusters can be returned from a hierarchical clustering model by “cutting” the dendrogram at a particular height.
  • Cluster stability can be measured by clustering on bootstrap samples and using the Jaccard index to quantify the agreement of cluster membership between samples.

Solutions to exercises

  1. Create a hierarchical clustering model using the Manhattan distance, plot the dendrogram, and compare it:
gvhdDistMan <- dist(gvhdScaled, method = "manhattan")

gvhdHclustMan <- hclust(gvhdDistMan, method = "ward.D2")

gvhdDendMan <- as.dendrogram(gvhdHclustMan)

plot(gvhdDendMan, leaflab = "none")
  2. Plot the bootstrap experiment in an alternate way:
group_by(metricsTib, Metric) %>%
  mutate(Value = scale(Value)) %>%
  group_by(Metric, clusters) %>%
  mutate(Stdev = sd(Value)) %>%

  ggplot(aes(as.factor(clusters), Metric, fill = Value, height = Stdev)) +
  geom_tile() +
  theme_bw() +
  theme(panel.grid = element_blank())
  3. Use clusterboot() to evaluate the stability of our k-means model:
par(mfrow = c(3, 4))

clustBoot <- clusterboot(gvhdScaled,
                         B = 10,

                         clustermethod = kmeansCBI,
                         k = 4, algorithm = "Lloyd",
                         showplots = TRUE)

clustBoot

# k-means seems to give more stable clusters.
  4. Cluster the data using the diana() function:
library(cluster)

gvhdDiana <- as_tibble(gvhdScaled) %>% diana()

as.dendrogram(gvhdDiana) %>% plot(leaflab = "none")
  5. Repeat the bootstrap experiment, comparing different linkage methods:
cluster_metrics <- function(data, clusters, dist_matrix, linkage) {
  list(db   = clusterSim::index.DB(data, clusters)$DB,
       G1   = clusterSim::index.G1(data, clusters),
       dunn = clValid::dunn(dist_matrix, clusters),
       clusters = length(unique(clusters)),
       linkage = linkage
  )
}

metricsTib <- map_df(gvhdBoot, function(boot) {
  d <- dist(boot, method = "euclidean")
  linkage <- c("ward.D2", "single", "complete", "average", "centroid")

  map_df(linkage, function(linkage) {
    cl <- hclust(d, method = linkage)
    cut <- cutree(cl, k = 4)
    cluster_metrics(boot, clusters = cut, dist_matrix = d, linkage)
  })
})

metricsTib

metricsTib <- metricsTib %>%
  mutate(bootstrap = factor(rep(1:10, each = 5))) %>%
  gather(key = "Metric", value = "Value", -clusters, -bootstrap, -linkage)

ggplot(metricsTib, aes(linkage, Value)) +
  facet_wrap(~ Metric, scales = "free_y") +
  geom_line(size = 0.1, aes(group = bootstrap)) +
  geom_line(stat = "summary", fun.y = "mean", aes(group = 1)) +
  stat_summary(fun.data="mean_cl_boot",
               geom="crossbar", width = 0.5, fill = "white") +
  theme_bw()

# Single linkage seems the best, indicated by DB and Dunn,
# though pseudo F disagrees.
  6. Cluster the data using the winning linkage method from exercise 5:
gvhdHclustSingle <- hclust(gvhdDist, method = "single")

gvhdCutSingle <- cutree(gvhdHclustSingle, k = 4)

gvhdTib <- mutate(gvhdTib, gvhdCutSingle = as.factor(gvhdCutSingle))

select(gvhdTib, -hclustCluster) %>%
  ggpairs(aes(col = gvhdCutSingle),
          upper = list(continuous = "density"),
          lower = list(continuous = wrap("points", size = 0.5))) +
  theme_bw()

# Using single linkage on this dataset does a terrible job of finding
# clusters! This is why visual evaluation of clusters is important:
# don't blindly rely on internal metrics only!