Chapter 16. Clustering by finding centers with k-means

This chapter covers

  • Understanding the need for clustering
  • Understanding over- and underfitting for clustering
  • Validating the performance of a clustering algorithm

Our first stop in clustering brings us to a very commonly used technique: k-means clustering. I’ve used the word technique here rather than algorithm because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll talk about these individual algorithms later in the chapter.

Note

Don’t confuse k-means with k-nearest neighbors! K-means is for unsupervised learning, whereas k-nearest neighbors is a supervised algorithm for classification.

K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for; if we set k to 3, we will identify three clusters (whether these represent a real grouping structure or not). Arguably, this is a weakness for k-means, because we may not have any prior knowledge as to how many clusters to search for, but I’ll show you ways to select a sensible value of k.

Once we have defined how many clusters, k, we want to search for, k-means will initialize (usually randomly) k centers, or centroids, in the data set. Each centroid may not be an actual case from the data but has a random value for every variable in the data. Each of these centroids represents a cluster, and cases are assigned to the cluster of the centroid closest to them. Iteratively, the centroids move around the feature space in a way that attempts to minimize the variance of the data within each cluster but maximize the separation of different clusters. At each iteration, cases are assigned to the cluster of the centroid that is closest to them.

By the end of this chapter, I hope you'll understand a general approach to clustering and what over- and underfitting look like for clustering tasks. I'll show you how to apply k-means clustering to a data set and ways of evaluating clustering performance.


16.1. What is k-means clustering?

In this section, I'll show you how the general procedure for k-means clustering works and then explain the various algorithms that implement it and how they differ. K-means algorithms partition the cases in a data set into k clusters, where k is an integer defined by us. The clusters returned by k-means algorithms tend to be n-dimensionally spherical (where n is the number of dimensions of the feature space). This means the clusters tend to form a circle in two dimensions, a sphere in three dimensions, and a hypersphere in more than three dimensions. K-means clusters also tend to have a similar diameter. These are traits that may not be true of the underlying structure in the data.

Bxtxq skt s nrumeb lx v-easnm algorithms, yrh mvzx nmolcymo yvcp nkvc zto cc oofwsll:

  • Lloyd algorithm (fxzc dlcela Fgyfk-Zuedt algorithm)
  • MacQueen algorithm
  • Hartigan-Wong algorithm

The Lloyd, MacQueen, and Hartigan-Wong algorithms are conceptually quite similar but have some differences that affect both their computational cost and their performance on a particular problem. Let's go through each algorithm to explain how it works.

16.1.1. Lloyd’s algorithm

In this section, I'll show you the easiest of these three algorithms to understand: Lloyd's algorithm. Imagine that you're a sports scientist, interested in the physiological differences among runners. You measure the resting heart rate and maximum oxygen consumption of a cohort of runners, and you want to use k-means to identify clusters of runners that might benefit from different training regimes.

Let's say you have prior reason to believe there may be three distinct clusters of athletes in the data set. The first step in Lloyd's algorithm is to randomly initialize k (three in this case) centroids in the data (see figure 16.1). Next, the distance between each case and each centroid is calculated. This distance is commonly the Euclidean distance (straight-line distance) but can be another distance metric, such as the Manhattan distance (taxicab distance).

Figure 16.1. Five iterations of k-means clustering. In the top-left plot, three initial centers are randomly generated in the feature space (crosses). Cases are assigned to the cluster of their nearest center. At each iteration, each center moves to the mean of the cases in its cluster (indicated by arrows). The feature space can be partitioned up into Voronoi cells (I'll discuss these shortly), indicated by the shaded regions, that show regions of the feature space closest to a particular centroid.
Note

Because k-means relies on a distance metric, it's important to scale the variables if they are measured on different scales; otherwise, variables on larger scales will disproportionately influence the result.

Each case is assigned to the cluster represented by its nearest centroid. In this way, each centroid serves as a prototype case for its cluster. Next, the centroids are moved such that they are placed at the mean of the cases that were assigned to their cluster in the previous step (this is why the approach is called k-means).

The process now repeats itself: the distance between each case and each centroid is calculated, and cases are assigned to the cluster of the nearest centroid. Can you see that, because the centroids update and move around the feature space, the centroid nearest to a particular case may change over time? This process continues until no cases change clusters from one iteration to the next, or until a maximum number of iterations is reached. Notice that between iterations 4 and 5 in figure 16.1, no cases change clusters, so the algorithm stops.

Note

Because the initial centers are usually randomly selected, it's important that we repeat the procedure several times, with new random initial centers each time. We can then use the set of centers that starts with the lowest within-cluster sum of squared error.

Let’s summarize the steps of Lloyd’s algorithm:

  1. Select k.
  2. Randomly initialize k centers in the feature space.
  3. For each case:
    1. Calculate the distance between the case and each center.
    2. Assign the case to the cluster of the nearest center.
  4. Place each center at the mean of the cases assigned to its cluster.
  5. Repeat steps 3 and 4 until no cases change clusters or a maximum number of iterations is reached.
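To make these steps concrete, here is a minimal, from-scratch sketch of Lloyd's algorithm in base R. It is for illustration only: the lloyd_kmeans() function and its arguments are invented for this example (this is not how mlr's cluster.kmeans learner is implemented internally), and it doesn't handle edge cases such as a cluster losing all of its members.

lloyd_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # step 2: randomly initialize k centers (here, by sampling k cases)
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  assignment <- integer(nrow(data))
  for (iter in seq_len(max_iter)) {
    # step 3: squared Euclidean distance from every case to every center
    dists <- sapply(seq_len(k),
                    function(j) rowSums(sweep(data, 2, centers[j, ])^2))
    newAssignment <- max.col(-dists, ties.method = "first")
    # step 5: stop when no cases change clusters
    if (identical(newAssignment, assignment)) break
    assignment <- newAssignment
    # step 4: move each center to the mean of the cases in its cluster
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(data[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}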

In figure 16.1, can you see how at each iteration, the positions of the centroids are updated (the arrows) such that they move toward the center of genuine clusters? At each iteration, we can partition the feature space into polygonal (or polytopal, in more than two dimensions) regions around each centroid that show us the regions that "belong" to a particular cluster. These regions are called Voronoi cells; if a case falls inside one of them, it is closest to that cell's centroid and will be assigned to its cluster. Visualizing Voronoi cells on a plot (sometimes called a Voronoi map) is a useful way of visualizing how a clustering algorithm has partitioned the feature space.

16.1.2. MacQueen’s algorithm

MacQueen's algorithm is extremely similar to Lloyd's algorithm, varying just subtly in when the centroids get updated. Lloyd's algorithm is called a batch or offline algorithm, meaning it updates the centroids together at the end of an iteration. MacQueen's algorithm, on the other hand, updates the centroids each time a case changes clusters, and again once the algorithm has passed through all the cases in the data.

Note

Whereas Lloyd's algorithm is said to be a batch or offline algorithm, MacQueen's is said to be an incremental or online algorithm, because it updates the centroids each time a case moves clusters, rather than only after a pass through all the data.

Just like Lloyd's algorithm, MacQueen's algorithm initializes k centers, assigns each case to the cluster of the nearest centroid, and updates the position of each centroid to match the mean of the cases assigned to it. Then the algorithm considers each case in turn and calculates its distance to each centroid. If the case changes clusters (because it's now closer to a different centroid), both the new and old centroid positions are updated. The algorithm continues through the data set, considering each case in turn. Once all cases have been considered, the centroid positions are updated again. If no cases changed clusters, the algorithm stops; otherwise, it performs another pass.

A benefit of MacQueen's algorithm over Lloyd's algorithm is that it tends to converge more quickly to an optimal solution. However, it may be slightly more computationally expensive for very large datasets.

Let’s summarize the steps of MacQueen’s algorithm:

  1. Select k.
  2. Randomly initialize k centers in the feature space.
  3. Assign each case to the cluster of its nearest center.
  4. Place each center at the mean of the cases assigned to its cluster.
  5. For each case:
    1. Calculate the distance between the case and each centroid.
    2. Assign the case to the cluster of the nearest centroid.
    3. If the case changed clusters, update the positions of the new and old centroids.
  6. Once all cases have been considered, update all centroids.
  7. If no cases changed clusters, stop; otherwise, repeat step 5.
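To make the "online" part concrete, here is a minimal sketch of the incremental centroid update MacQueen's algorithm performs when a case moves between clusters. The macqueen_move() helper and its arguments are invented for illustration (it assumes we track cluster sizes in n, and it doesn't guard against a cluster being emptied); it simply recomputes the two affected cluster means incrementally.

# x is the case that just moved from cluster `old` to cluster `new`;
# `centers` is a k x p matrix of centroids, `n` is a vector of cluster sizes.
macqueen_move <- function(centers, n, x, old, new) {
  n[old] <- n[old] - 1   # x leaves its old cluster ...
  centers[old, ] <- centers[old, ] + (centers[old, ] - x) / n[old]
  n[new] <- n[new] + 1   # ... and joins the new one
  centers[new, ] <- centers[new, ] + (x - centers[new, ]) / n[new]
  list(centers = centers, n = n)
}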

16.1.3. Hartigan-Wong algorithm

The third k-means algorithm is a little different from the Lloyd and MacQueen algorithms. The Hartigan-Wong algorithm starts by initializing k random centers and assigning each case to the cluster of its nearest center, just as we saw in the other two algorithms. Here's the different bit: for each case in the data set, the algorithm calculates the sum of squared error of that case's current cluster if that case was removed, and the sum of squared error of each of the other clusters if that case was included in those clusters. Recall from previous chapters that the sum of squared error (or simply the sum of squares) is the difference between each case's values and its predicted values (in this context, its centroid), squared and summed across all the cases. If you prefer this in mathematical notation, have a look at equation 16.1.

equation 16.1.

SSE_k = \sum_{i \in k} (x_{ik} - c_k)^2

where x_{ik} is the ith case belonging to cluster k, and c_k is the centroid of cluster k.

The cluster with the smallest sum of squared error (when including the case currently under consideration) is assigned as the cluster for that case. If a case changed clusters, then the centroids of the old and new clusters are updated to the mean of the cases in their clusters. The algorithm continues until no cases change clusters. As a result, a case could be assigned to a particular cluster (because doing so reduces the sum of squared error) even though it is closer to the centroid of another cluster.
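Because the within-cluster sum of squared error crops up again when we evaluate clustering performance, here is a small sketch of how you could compute equation 16.1 in base R for a given cluster assignment. The cluster_sse() and total_sse() helpers are invented names for this example, not functions from any package.

# sum of squared error for one cluster, given its cases and centroid
cluster_sse <- function(x, centroid) {
  sum(sweep(as.matrix(x), 2, centroid)^2)
}

# total within-cluster sum of squared error for a cluster assignment
total_sse <- function(data, assignment) {
  data <- as.matrix(data)
  sum(sapply(unique(assignment), function(k) {
    members <- data[assignment == k, , drop = FALSE]
    cluster_sse(members, colMeans(members))
  }))
}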

Let’s summarize the steps of the Hartigan-Wong algorithm:

  1. Select k.
  2. Randomly initialize k centers in the feature space.
  3. Assign each case to the cluster of its nearest center.
  4. Place each center at the mean of the cases assigned to its cluster.
  5. For each case:
    1. Calculate the sum of squared error of its cluster, omitting the case under consideration.
    2. Calculate the sum of squared error of the other clusters, as if that case were included.
    3. Assign the case to the cluster with the smallest sum of squared error.
    4. If the case changed clusters, update the positions of the new and old centroids.
  6. If no cases changed clusters, stop; otherwise, repeat step 5.
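For a single case, the reassignment test in step 5 can be written using the usual shortcut for how the sum of squared error changes when one case is removed from, or added to, a cluster. The sketch below is an illustration under that assumption; hw_best_cluster() is an invented helper, not how the stats or mlr implementations are written. Note how the n/(n + 1) and n/(n - 1) weights mean a case can stay in (or join) a cluster even when another centroid is slightly closer.

# x: one case; centers: k x p matrix of centroids; n: cluster sizes;
# current: the index of x's current cluster
hw_best_cluster <- function(x, centers, n, current) {
  sqDist <- colSums((t(centers) - x)^2)
  # reduction in SSE if x leaves its current cluster
  loss <- n[current] / (n[current] - 1) * sqDist[current]
  # increase in SSE if x joins each of the other clusters
  gain <- n / (n + 1) * sqDist
  gain[current] <- Inf
  if (min(gain) < loss) which.min(gain) else current
}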

The Hartigan-Wong algorithm tends to find a better clustering structure than either the Lloyd or MacQueen algorithms, although we are always subject to the "no free lunch" theorem. Hartigan-Wong is also more computationally expensive than the other two algorithms, so it will be considerably slower for large datasets.

Which algorithm do we choose? Well, the choice is a discrete hyperparameter, so we can use hyperparameter tuning to help us choose the best-performing method and make sure we don't make the Wong choice!


16.2. Building your first k-means model

In this section, I'll show you how to build a k-means model in R, using the mlr package. I'll cover creating a cluster task and learner, and some methods we can use to evaluate the performance of a clustering algorithm.

Imagine that you're looking for clusters of white blood cells from patients with graft versus host disease (GvHD). GvHD is an unpleasant disease where residual white blood cells in transplanted tissue attack the body of the patient receiving the transplant. You take a biopsy from each patient and measure different proteins on the surface of each cell. You hope to create a clustering model that will help you identify different cell types from the biopsy, to help you better understand the disease. Let's start by loading the mlr and tidyverse packages:

library(mlr)

library(tidyverse)

16.2.1. Loading and exploring the GvHD dataset

Now let's load the data, which is built into the mclust package, convert it into a tibble (with as_tibble()), and explore it a little. We have a tibble containing 6,809 cases and 4 variables, each of which is a different protein measured on the surface of each cell.

Listing 16.1. Loading and exploring the GvHD dataset
data(GvHD, package = "mclust")

gvhdTib <- as_tibble(GvHD.control)
gvhdTib

# A tibble: 6,809 x 4
     CD4  CD8b   CD3   CD8
   <dbl> <dbl> <dbl> <dbl>
 1   199   420   132   226
 2   294   311   241   164
 3    85    79    14   218
 4    19     1   141   130
 5    35    29     6   135
 6   376   346   138   176
 7    97   329   527   406
 8   200   342   145   189
 9   422   433   163    47
10   391   390   147   190
# ... with 6,799 more rows
Note

Calling data(GvHD, package = "mclust") actually loads two datasets: GvHD.control and GvHD.pos. We're going to work with the GvHD.control data set, but at the end of this section, I'll get you to build a clustering model on the GvHD.pos data set too.

Because k-means algorithms use a distance metric to assign cases to clusters, it's important that our variables are scaled so that variables on different scales are given equal weight. All of our variables are continuous, so we can simply pipe our entire tibble into the scale() function. Remember that this will center and scale each variable by subtracting the mean and dividing by the standard deviation.

Listing 16.2. Scaling the GvHD dataset
gvhdScaled <- gvhdTib %>% scale()

Next, let's plot the data using our good friend ggpairs() from the GGally package. This time, we modify the way ggpairs() draws the facets. We use the upper, lower, and diag arguments to specify what kind of plots should be drawn above, below, and on the diagonal, respectively. Each of these arguments takes a list, where each list element can be used to specify a different type of plot for continuous variables, discrete variables, and combinations of the two. Here, I've chosen to draw 2D density plots on the upper plots, scatter plots on the lower plots, and density plots on the diagonal.

To prevent overcrowding, we want to reduce the size of the points on the lower plots. To change any of the graphical options of the plots (such as the size and color of the geoms), we just need to wrap the name of the plot type (literally) inside the wrap() function, along with the options we're changing. Here, we use wrap("points", size = 0.5) to draw scatter plots on the lower panels, with a smaller point size than the default.

Note

Remember that geom stands for geometric object, referring to graphical elements like lines, bars, and dots on a plot.

Listing 16.3. Creating pairwise plots with ggpairs()
library(GGally)

ggpairs(GvHD.control,
        upper = list(continuous = "density"),
        lower = list(continuous = wrap("points", size = 0.5)),
        diag = list(continuous = "densityDiag")) +
  theme_bw()
Note

The default diagonal plot for continuous variables is a density plot. I explicitly defined it as such here anyway, so you can see how you can control the upper, lower, and diagonal plots independently.

The resulting plot is shown in figure 16.2. Can you see different clusters of cases in the data? The human brain is pretty good at identifying clusters in two or even three dimensions, and it looks as though there are at least four clusters in the data set. The density plots are useful to help us see dense regions of cases, which simply appear black in the scatter plots.

Figure 16.2. ggpairs() plot of each variable against every other variable in our GvHD dataset. Scatterplots are shown below the diagonal, 2D density plots are shown above the diagonal, and 1D density plots are drawn on the diagonal. It appears as if there are multiple clusters in the data.

16.2.2. Defining our task and learner

In this section, I'll show you how to define a clustering task and clustering learner. In mlr, we create a clustering task using the makeClusterTask() function (no surprises there). We supply our scaled data (converted into a data frame) as the data argument.

Important

Notice that, unlike when creating a supervised learning task (for classification or regression), we no longer need to supply the target argument. This is because in an unsupervised learning task, there is no label variable to use as a target.

Let's use the listLearners() function that you learned about all the way back in chapter 3 to see what algorithms have been implemented by the mlr package so far. At the time of writing, only nine clustering algorithms are available to us. Admittedly, this is far fewer than the number of algorithms available for classification and regression, but mlr still provides some useful tools for clustering. If you want to use an algorithm that mlr doesn't currently wrap, you can always implement it yourself (visit the mlr website to see how: http://mng.bz/E1Pj).

Now let's define our k-means learner. We do this using the familiar makeLearner() function, this time supplying "cluster.kmeans" as the name of the learner. We use the par.vals argument to supply two arguments to the learner: iter.max and nstart.

Note

Just as the prefixes for classification and regression learners are classif. and regr., respectively, the prefix for clustering learners is cluster..

Listing 16.4. Creating a cluster task and learner with mlr
gvhdTask <- makeClusterTask(data = as.data.frame(gvhdScaled))

listLearners("cluster")$class

[1] "cluster.cmeans"        "cluster.Cobweb"        "cluster.dbscan"
[4] "cluster.EM"            "cluster.FarthestFirst" "cluster.kkmeans"
[7] "cluster.kmeans"        "cluster.SimpleKMeans"  "cluster.XMeans"

kMeans <- makeLearner("cluster.kmeans",
                      par.vals = list(iter.max = 100, nstart = 10))

The iter.max argument sets an upper limit for the number of times the algorithm will cycle through the data (the default is 10). The k-means algorithms will all stop once cases stop moving clusters, but setting a maximum can be useful for large datasets that take a long time to converge. Later in this section, I'll show you how to tell whether your clustering model has converged before reaching this limit.

The nstart argument controls how many times the function will randomly initialize the centers. Recall that the initial centers are usually randomly initialized somewhere in the feature space: this can have an impact on the final centroid positions and, therefore, on the final cluster memberships. Setting the nstart argument higher than the default value of 1 will randomly initialize this number of sets of centers. For each set of initial centers, the cases are assigned to the cluster of their nearest center in that set, and the set with the smallest within-cluster sum of squared error is then used for the rest of the clustering algorithm. In this way, the algorithm selects the set of centers that is already most similar to the real cluster centroids in the data. Increasing nstart is arguably more important than increasing the number of iterations.

Tip

If you have a data set with very clearly separable clusters, setting nstart higher than 1 might be a waste of computational resources. However, unless your data set is very large, it's usually a good idea to set nstart > 1; in listing 16.4, I set mine to 10.
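If you want to see the effect of nstart for yourself, you can call base R's kmeans() function directly on the scaled data (as far as I know, this is the function the cluster.kmeans learner wraps). This is just an illustrative sketch: the choice of four centers and the seed are arbitrary assumptions for this example.

set.seed(42)  # arbitrary seed so the comparison is reproducible

oneStart  <- kmeans(gvhdScaled, centers = 4, nstart = 1)
tenStarts <- kmeans(gvhdScaled, centers = 4, nstart = 10)

# more random starts should give an equal or lower total
# within-cluster sum of squares
oneStart$tot.withinss
tenStarts$tot.withinss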

16.2.3. Choosing the number of clusters

In this section, I'll show you how we can sensibly choose the value of k, which defines the number of centers, and therefore clusters, that our model will identify. The need to choose k is often cited as a weakness of k-means clustering. This is because choosing k can be subjective. If you have prior domain knowledge as to how many clusters should theoretically be present in a data set, then you should use this knowledge to guide your selection. If you're using clustering as a preprocessing step before a supervised learning algorithm (classification, for example), then the choice is quite easy: tune k as a hyperparameter of the whole model-building process, and compare the predictions of the final model against the original labels.

But what if we have no prior knowledge and no labeled data to compare against? And what happens if we get our selection wrong? Well, just like classification and regression, clustering is subject to the bias-variance trade-off. If we want to generalize a clustering model to the wider population, it's important that we neither overfit nor underfit the training data. Figure 16.3 illustrates what under- and overfitting might look like for a clustering problem. When we underfit, we fail to identify and separate real clusters in the data; but when we overfit, we split real clusters into smaller, nonsensical clusters that simply don't exist in the wider population.

Figure 16.3. What under- and overfitting looks like for clustering tasks. In the left-side plot, the clusters are underfit (fewer clusters have been identified than actually exist). In the right-side plot, the clusters are overfit (real clusters are broken up into smaller clusters). In the center plot, an optimal clustering model has been found that faithfully represents the structure in the data.

Avoiding over- and underfitting a clustering problem is not straightforward. People have proposed many different methods for avoiding over- and underfitting, and they won't all agree with one another for a particular problem. Many of these methods rely on the calculation of internal cluster metrics, which are statistics that aim to quantify the "quality" of a clustering result.

Note

What constitutes "good-quality" clusters is poorly defined and somewhat subjective, but people typically mean that each cluster is as compact as possible, while the distances between clusters are as large as possible.

These metrics are "internal" because they are calculated from the clustered data itself, rather than by comparing the result to any external label or ground truth. A common approach to selecting the number of clusters is to train multiple clustering models over a range of cluster numbers and compare the cluster metrics for each model to help choose the best-fitting one. Three commonly used internal cluster metrics are as follows:

  • Davies-Bouldin index
  • Dunn index
  • Pseudo F statistic
Using the Davies-Bouldin index to evaluate clustering performance

The Davies-Bouldin index (named after its creators, David Davies and Donald Bouldin) quantifies the average separability of each cluster from its nearest counterpart. It does this by calculating the ratio of the within-cluster variance (also called the scatter) to the separation between cluster centroids (see figure 16.4).

Figure 16.4. The Davies-Bouldin index calculates the intracluster (within-cluster) variance (left-side plot) and the distance between the centroids of each cluster (right-side plot). For each cluster, its nearest neighboring cluster is identified, and the sum of their intracluster variances is divided by the difference between their centroids. This value is calculated for each cluster, and the Davies-Bouldin index is the mean of these values.

If we fix the distance between clusters but make the cases within each cluster more spread out, the Davies-Bouldin index will get larger. Conversely, if we fix the within-cluster variance but move the clusters farther apart from each other, the index will get smaller. In theory, the smaller the value (which is bounded between zero and infinity), the better the separation between clusters. Boiled down into plain English, the Davies-Bouldin index quantifies the mean separability between each cluster and its most similar counterpart.

Calculating the Davies-Bouldin index

It's not necessary for you to memorize the formula for the Davies-Bouldin index (in fact, it's reasonably complex). If you are interested, we can define the scatter within clusters as

scatter_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \lVert x_i - c_k \rVert

where scatter_k is a measure of the scatter within cluster k, n_k is the number of cases in cluster k, x_i is the ith case in cluster k, and c_k is the centroid of cluster k.

The separation between clusters can be defined as

separation_{i,k} = \lVert c_i - c_k \rVert

where separation_{i,k} is a measure of the separation between clusters i and k, c_i and c_k are their respective centroids, and N is the total number of clusters.

The ratio between the within-cluster scatter and the separation between two clusters is then calculated as

R_{i,k} = \frac{scatter_i + scatter_k}{separation_{i,k}}

This ratio is calculated for all pairs of clusters; and for each cluster, the largest ratio between it and the other clusters is defined to be D_k. The Davies-Bouldin index is then simply the mean of these largest ratios:

DB = \frac{1}{N} \sum_{k=1}^{N} D_k
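If you'd like to see the index computed rather than just defined, here is a from-scratch sketch for a data matrix and a vector of cluster assignments. The davies_bouldin() helper is an invented name for illustration only; during tuning later in the chapter, we let mlr calculate this index for us (which is why the clusterSim package is needed there).

davies_bouldin <- function(data, assignment) {
  data <- as.matrix(data)
  clusters <- sort(unique(assignment))
  centroids <- t(sapply(clusters, function(k)
    colMeans(data[assignment == k, , drop = FALSE])))
  # scatter: mean distance of each cluster's cases to its centroid
  scatter <- sapply(seq_along(clusters), function(i) {
    members <- data[assignment == clusters[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(members, 2, centroids[i, ])^2)))
  })
  # separation: distances between every pair of centroids
  separation <- as.matrix(dist(centroids))
  ratios <- outer(scatter, scatter, "+") / separation
  diag(ratios) <- NA   # ignore each cluster paired with itself
  mean(apply(ratios, 1, max, na.rm = TRUE))
}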

Using the Dunn index to evaluate clustering performance

The Dunn index is another internal cluster metric that quantifies the ratio between the smallest distance between points in different clusters, and the largest distance within any of the clusters, referred to as the cluster's diameter (see figure 16.5). These can be any distance metric but are commonly the Euclidean distance.

Figure 16.5. The Dunn index quantifies the ratio between the smallest distance between cases in different clusters (left-side plot) and the largest distance within a cluster (right-side plot).

The intuition here is that if we maintain the same diameter of our clusters but move the closest pair apart, the Dunn index will get larger. Conversely, if we maintain the same distance between cluster centroids but shrink the diameter of the clusters (by making the clusters denser), the Dunn index will also increase. As such, the number of clusters resulting in the largest Dunn index is the one that results in the largest minimum distance between clusters and the smallest maximum distance between cases within a cluster.

Calculating the Dunn index

It's not necessary for you to memorize the formula for the Dunn index. If you are interested, we can define the Dunn index as

Dunn = \frac{\min_{i \neq j} \delta(c_i, c_j)}{\max_k \Delta(c_k)}

where δ(c_i, c_j) represents the pairwise distances between cases in clusters i and j, and Δ(c_k) represents the pairwise distances between cases within cluster k.
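Here is a from-scratch sketch of that calculation for a data matrix and a vector of cluster assignments. The dunn_index() helper is an invented name for illustration; the clValid package's dunn() function (mentioned again later in the chapter) does the same job. Note that this sketch builds the full pairwise distance matrix, so it's only practical for modestly sized data.

dunn_index <- function(data, assignment) {
  d <- as.matrix(dist(data))                    # all pairwise distances
  same <- outer(assignment, assignment, "==")   # TRUE if same cluster
  diag(d) <- NA                                 # ignore self-distances
  minBetween <- min(d[!same])                   # closest cases in different clusters
  maxWithin  <- max(d[same], na.rm = TRUE)      # largest cluster diameter
  minBetween / maxWithin
}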

Using the pseudo F statistic to evaluate clustering performance

The pseudo F statistic is a ratio of the between-cluster sum of squares to the within-cluster sum of squares (see figure 16.6). The between-cluster sum of squares is the squared difference between each cluster centroid and the grand centroid (the centroid of the data as if it were all in one big cluster), weighted by the number of cases in that cluster, added up across the clusters. This is another way of measuring how separated the clusters are from each other (the farther the cluster centroids are from each other, the larger the between-cluster sum of squares will be). The within-cluster sum of squares is the squared difference between each case and its cluster's centroid, added up across the clusters. This is another way of measuring the variance or dispersion within each cluster (the denser each cluster is, the smaller the within-cluster sum of squares will be).

Figure 16.6. The pseudo F statistic is the ratio of the between-cluster sum of squares (right-side plot) to the within-cluster sum of squares (left-side plot). The grand centroid is shown as a square in the right-side plot.

Because the pseudo F statistic is also a ratio, if we maintain the same cluster variance but move the clusters farther apart, the pseudo F statistic will increase. Conversely, if we maintain the same separation between the cluster centroids but make the clusters more spread out, the pseudo F statistic will decrease. As such, the number of clusters that results in the largest pseudo F statistic is, in theory, the one that maximizes the separation of the clusters.

Calculating the pseudo F statistic

It's not necessary for you to memorize the formula for the pseudo F statistic. If you are interested, we can define the pseudo F statistic as

pseudo\ F = \frac{SS_{between} / (N - 1)}{SS_{within} / (n - N)}

where SS_between and SS_within are calculated as

SS_{between} = \sum_{k=1}^{N} n_k \lVert c_k - c_g \rVert^2

SS_{within} = \sum_{k=1}^{N} \sum_{i \in k} \lVert x_{ik} - c_k \rVert^2

where there are N clusters and n cases in total, n_k is the number of cases in cluster k, c_k is the centroid of cluster k, and c_g is the grand centroid of all the cases.
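And here is the matching from-scratch sketch. The pseudo_f() helper is an invented name for illustration; during tuning, mlr's G1 measure (via the clusterSim package) computes this statistic for us.

pseudo_f <- function(data, assignment) {
  data <- as.matrix(data)
  n <- nrow(data)
  clusters <- unique(assignment)
  N <- length(clusters)
  grand <- colMeans(data)   # the grand centroid
  ssBetween <- 0
  ssWithin <- 0
  for (k in clusters) {
    members <- data[assignment == k, , drop = FALSE]
    centroid <- colMeans(members)
    ssBetween <- ssBetween + nrow(members) * sum((centroid - grand)^2)
    ssWithin  <- ssWithin + sum(sweep(members, 2, centroid)^2)
  }
  (ssBetween / (N - 1)) / (ssWithin / (n - N))
}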

These are just three among the many commonly used internal cluster metrics, and at this point you might be wondering why there isn't just one metric that tells us how well separated our clusters are. The reason is that these metrics will tend to agree with each other when we have very clear, well-defined clusters, but will start to disagree with each other as the solution becomes more ambiguous, with some of the metrics performing better than others in certain circumstances. For example, internal cluster metrics that rely on calculating sums of squares may prefer to select a number of clusters that results in clusters of equal diameter. This may not be an optimal number of clusters if the real clusters have very unequal diameters. As such, it's often a good idea to consider multiple internal cluster metrics as evidence when choosing the number of clusters.

So internal cluster metrics like these can help us find the optimal number of clusters. But there is still always a danger that we might overfit the training data by overclustering. One approach to avoid overclustering is to take multiple bootstrap samples (sampling cases with replacement) of the data, apply the clustering algorithm to each sample, and compare how well the cluster memberships agree between samples. If there is high stability (in other words, the clustering result is stable between samples), then we have more confidence that we are not fitting the noise in the data.
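As a sketch of what such a bootstrap stability check could look like, the code below re-runs kmeans() on 20 bootstrap samples and compares each bootstrap clustering (projected back onto the original cases via their nearest centroid) to a reference clustering, using the adjusted Rand index from the mclust package. Everything here is illustrative: the choice of four clusters, 20 samples, and the assign_nearest() helper are assumptions for this example, not part of mlr.

library(mclust)   # for adjustedRandIndex()

reference <- kmeans(gvhdScaled, centers = 4, nstart = 10)

# assign each case to the cluster of its nearest centroid
assign_nearest <- function(data, centers) {
  apply(data, 1, function(x) which.min(colSums((t(centers) - x)^2)))
}

stability <- replicate(20, {
  boot <- gvhdScaled[sample(nrow(gvhdScaled), replace = TRUE), ]
  bootModel <- kmeans(boot, centers = 4, nstart = 10)
  adjustedRandIndex(reference$cluster,
                    assign_nearest(gvhdScaled, bootModel$centers))
})

summary(stability)   # values near 1 suggest a stable clustering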

For clustering algorithms that are able to predict the clusters of new data, like the k-means algorithms, another approach is to use a cross-validation-like procedure. This involves splitting the data into training and test sets (using k-fold, for example), training the clustering algorithm on the training set, predicting the cluster membership of the cases in the test set, and calculating internal cluster metrics for the predicted clusters. This approach has the benefit of allowing us both to test cluster stability and to calculate the metric on data the algorithm never saw. This is the approach we'll use to select the optimal number of clusters with k-means in this chapter.

Note

With k-means clustering, new data can be projected onto an existing clustering model by simply assigning the new cases to the cluster of the nearest centroid.

16.2.4. Tuning k and the algorithm choice for our k-means model

In this section, I'll show you how we can tune k (the number of clusters) and our choice of k-means algorithm, using a cross-validation-like approach with internal cluster metrics applied to the predicted clusters. Let's start by defining our hyperparameter search space using the makeParamSet() function. We define two discrete hyperparameters over which to search for values: centers, which is the number of clusters the algorithm will search for (k), and algorithm, which specifies which of the three algorithms we will use to fit the model.

Tip

Just as we've done before, we can use getParamSet(kMeans) to find all the hyperparameters available to us.

We then define our search method as a grid search (to try every combination of hyperparameters) and define our cross-validation approach as 10-fold.

Listing 16.5. Defining how the hyperparameters will be tuned
kMeansParamSpace <- makeParamSet(
  makeDiscreteParam("centers", values = 3:8),
  makeDiscreteParam("algorithm",
                    values = c("Hartigan-Wong", "Lloyd", "MacQueen")))

gridSearch <- makeTuneControlGrid()

kFold <- makeResampleDesc("CV", iters = 10)

Now that we've defined our search space, let's perform the tuning. To use the Davies-Bouldin index and the pseudo F statistic performance measures, you'll first need to install the clusterSim package.

Tip

Two other internal cluster metrics are implemented by mlr: silhouette and G2 (use listMeasures("cluster") to list the available metrics). Both metrics are more computationally expensive to compute. Sadly, the dunn index no longer works in mlr since this book was first published, so you should omit it in the subsequent examples. Use clValid::dunn() to calculate it yourself.

To perform the tuning, we use the tuneParams() function. Because we didn't use this function during the dimension-reduction part of the book, let's refresh our memory of the arguments:

  • The first argument is the name of the learner.
  • The task argument is the name of our clustering task.
  • The resampling argument is the name of our cross-validation strategy.
  • The par.set argument is our hyperparameter search space.
  • The control argument is our search method.
  • The measures argument allows us to define which performance measures we want to calculate for each iteration of the search. Here, we ask for the Davies-Bouldin index (db), Dunn index (dunn), and pseudo F statistic (G1), in that order.
Tip

We can supply a list of as many performance metrics as we want. All of them will be calculated for each iteration of the search, but the combination of hyperparameters that optimizes the value of the first metric in the list will always be the one returned from tuning. The mlr package also "knows" which metrics should be maximized and which should be minimized for best performance.

Just to reiterate: when we perform the tuning, for each combination of hyperparameters, the data will be split into 10 folds, and the k-means algorithm will be trained on the training set of each fold. The cases in each test set will be assigned to their nearest cluster centroid, and the internal cluster metrics will be calculated on these test set clusters. Calling the result of the tuning shows us that Lloyd's algorithm with four clusters gave the lowest (most optimal) Davies-Bouldin index.

Listing 16.6. Performing the tuning experiment
install.packages("clusterSim")

tunedK <- tuneParams(kMeans, task = gvhdTask,
                     resampling = kFold,
                     par.set = kMeansParamSpace,
                     control = gridSearch,
                     measures = list(db, dunn, G1))

tunedK

Tune result:
Op. pars: centers=4; algorithm=Lloyd
db.test.mean=0.8010,dunn.test.mean=0.0489,G1.test.mean=489.5331
Note

At the end of the tuning process, did you get the warning did not converge in 100 iterations? This is how to tell whether you've set the iter.max argument too low in your learner definition. Your choices are to either accept the result as is, which may or may not be a near-optimal solution, or, if you have the computational budget, increase iter.max.

Exercise 1

Change our kMeans definition (created in listing 16.4) such that the value of iter.max is 200. Rerun the tuning procedure in listing 16.6. Does the error about not converging disappear?

To get a better understanding of how our three internal metrics vary with both cluster number and algorithm choice, let's plot the tuning process. Recall that to do this, we first need to extract the tuning data from our tuning result using the generateHyperParsEffectData() function. Call the $data component of the kMeansTuningData object so you can see how it's structured (I won't print it here, for the sake of space).

Note

Notice that we have a metric we didn't ask for: exec.time, which records how long it took to train a model with each combination of hyperparameters, in seconds.

Let's plot this data such that we have a different facet per performance metric and a different line per algorithm. To do this, we first need to gather the data so that the name of each performance metric is in one column and the value of the metric is in another column. We do this using the gather() function, naming the key column "Metric" and the value column "Value". Because we only want these columns gathered, we supply a vector of the columns we don't want to gather. Print the new, gathered data set to make sure you understand what we did. Having the data in this format allows us to facet by metric and plot separate lines for each algorithm.

To plot the data, we use the ggplot() function, mapping centers (the number of clusters) and Value to the x and y aesthetics, respectively. By mapping algorithm to the col aesthetic, separate geom_line() and geom_point() layers will be drawn for each algorithm (with different colors). We use the facet_wrap() function to draw a separate subplot for each performance metric, setting the scales = "free_y" argument to allow different y-axes for each facet (as they have different scales). Finally, we add the geom_line() and geom_point() layers and a theme.

Listing 16.7. Plotting the tuning experiment
kMeansTuningData <- generateHyperParsEffectData(tunedK)

kMeansTuningData$data

gatheredTuningData <- gather(kMeansTuningData$data,
                             key = "Metric",
                             value = "Value",
                             c(-centers, -iteration, -algorithm))

ggplot(gatheredTuningData, aes(centers, Value, col = algorithm)) +
  facet_wrap(~ Metric, scales = "free_y") +
  geom_line() +
  geom_point() +
  theme_bw()

The resulting plot is shown in figure 16.7. Each facet shows a different performance metric, and each separate line shows one of the three algorithms. Notice that for the clustering models with four clusters (centers), the Davies-Bouldin index is minimized, and the Dunn index and pseudo F statistic (G1) are maximized. Because lower values of the Davies-Bouldin index and higher values of the Dunn index and pseudo F statistic indicate (in theory) better-separated clusters, all three of the internal metrics agree with each other that four is the optimal number of clusters for this data set. There is also very little disagreement between the different algorithms, particularly at the optimal value of four clusters.

Figure 16.7. Plotting our tuning process. Each subplot shows the values of a different internal cluster metric. The different lines indicate the performance of each of the three different algorithms.

The greatest difference between the algorithms is their training time. Notice that MacQueen's algorithm is consistently faster than either of the others. This is due to the algorithm updating its centroids more frequently than Lloyd's, and having to recompute distances less often than Hartigan-Wong. The Hartigan-Wong algorithm seems to be the most computationally intense at low cluster numbers but overtakes Lloyd's algorithm as the number of clusters increases beyond seven.

Note

The tuning process selected Lloyd's algorithm because its Davies-Bouldin index was slightly smaller than for the other algorithms. For very large datasets, computation speed may be more important to you than a performance increase this small, in which case you might prefer to select MacQueen's algorithm due to its shorter training time.

16.2.5. Training the final, tuned k-means model

In this section, we'll use our tuned hyperparameters to train our final clustering model. You'll notice that we're not going to use nested cross-validation to cross-validate the whole model-building process. While the k-means algorithm is able to predict cluster membership for new data, it isn't typically used as a predictive technique. Instead, we might use k-means to help us better define classes in our data set, which we can later use to build classification models.

Let's start by creating a k-means learner that uses our tuned hyperparameter values, using the setHyperPars() function. We then train this tuned model on our gvhdTask using the train() function and use the getLearnerModel() function to extract the model data so we can plot the clusters. Print the model data by calling kMeansModelData, and examine the output; it contains a lot of useful information. By extracting the $iter component of the object, we can see that it took only three iterations for the algorithm to converge (far fewer than iter.max).

Listing 16.8. Training a model with the tuned hyperparameters
tunedKMeans <- setHyperPars(kMeans, par.vals = tunedK$x)

tunedKMeansModel <- train(tunedKMeans, gvhdTask)

kMeansModelData <- getLearnerModel(tunedKMeansModel)

kMeansModelData$iter

[1] 3
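A few other components of the model data are worth inspecting; these are the standard components of the object returned by base R's kmeans() function, which is what I believe the learner model stores:

kMeansModelData$centers        # the final centroid positions (in scaled units)
kMeansModelData$size           # the number of cases assigned to each cluster
kMeansModelData$tot.withinss   # the total within-cluster sum of squares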

Finding the optimal number of clusters is not a well-defined problem; so, although internal metrics give evidence as to the correct number of clusters, you should still always try to validate your cluster model visually, to understand whether the result you're getting is sensible (at the very least). This may seem subjective, and it is, but it's much better for you to use your expert judgment than to rely solely on internal metrics. We can do this by plotting the data (as in figure 16.2) but coloring each case by its cluster membership.

Tip

If the correct number of clusters is difficult for you to determine, it could be that there simply aren't well-defined clusters in the data, or you may need to do further exploration, including generating more data. It may also be worth trying a different clustering method: for example, one that doesn't find spherical clusters like k-means does, or one that can exclude outliers (like DBSCAN, which you'll meet in chapter 18).

To do this, we first add the cluster membership of each case as a new column in our gvhdTib tibble, using the mutate() function. We extract the vector of cluster memberships from the $cluster component of the model data and turn it into a factor using the as.factor() function, to ensure that a discrete color scheme is applied during plotting.

We then use ggpairs() to plot all variables against each other, mapping kMeansCluster to the color aesthetic. We use the upper argument to draw density plots above the diagonal, and apply the black-and-white theme.

Listing 16.9. Plotting the clusters using ggpairs()
gvhdTib <- mutate(gvhdTib,
                  kMeansCluster = as.factor(kMeansModelData$cluster))

ggpairs(gvhdTib, aes(col = kMeansCluster),
        upper = list(continuous = "density")) +
  theme_bw()
Figure 16.8. ggpairs() plot with the k-means cluster membership mapped to the color aesthetic. Box plots and histograms show how the values of the continuous variables vary between clusters.

The resulting plot is shown in figure 16.8. To the eye, it looks like our k-means model does a pretty good job of capturing the structure in the data overall. But look at the plot of CD8 versus CD4: cluster three appears to be split. This suggests that either we have underclustered our data or these cases have been assigned to the wrong cluster; or perhaps they are simply outlying cases, the importance of which is overstated by the density plot.

16.2.6. Using our model to predict clusters of new data

In this section, I'll show you how we can use an existing k-means model to predict cluster membership for new data. As I mentioned already, clustering techniques are not intended to be used for predicting classes of data; we have classification algorithms that excel at that. But the k-means algorithm can take new data and output the clusters to which the new cases are closest. This can be useful when you're still exploring and trying to understand the structure in your data, so let me demonstrate how.

Let's start by creating a tibble containing the data for a new case, including a value for each variable in the data set on which we trained the model. Because we scaled the training data, we need to scale the values for this new case too. Remember that it's important to scale new data passed through a model according to the mean and standard deviation of the data used to train the model. The easiest way to do this is to use the attr() function to extract the center and scale attributes from the scaled data. Because the scale() function returns an object of class matrix (and the predict() function will throw an error if we give it a matrix), we need to pipe the scaled data into the as_tibble() function to turn it back into a tibble.

To predict which cluster the new case belongs to, we simply call the predict() function, supplying the model as the first argument and the new case as the newdata argument. We can see from the output that this new case is closest to the centroid of cluster 2.

Listing 16.10. Predicting cluster membership of new data
newCell <- tibble(CD4 = 510,
                  CD8b = 26,
                  CD3 = 500,
                  CD8 = 122) %>%
  scale(center = attr(gvhdScaled, "scaled:center"),
        scale = attr(gvhdScaled, "scaled:scale")) %>%
  as_tibble()

predict(tunedKMeansModel, newdata = newCell)

Prediction: 1 observations
predict.type: response
threshold:
time: 0.01
  response
1        2

You've now learned how to apply k-means clustering to your data. In the next chapter, I'll introduce hierarchical clustering, a set of clustering methods that help reveal a hierarchy in our data. I suggest that you save your .R file, because we're going to continue using the same data set in the next chapter. This is so we can compare the performance of k-means and hierarchical clustering on the same data set.

16.3. Strengths and weaknesses of k-means clustering

While it often isn't easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether k-means clustering will perform well for you.

The strengths of k-means clustering are as follows:

  • Cases can move between clusters at each iteration until a stable result is found.
  • It may be faster to compute than other algorithms when there are many variables.
  • It is quite simple to implement.

The weaknesses of k-means clustering are these:

  • Jr ocantn yaivtlne lnadeh categorical variables. Auzj aj cusebea ctangulaicl rkd Euclidean distance nk s otrlacgciea feature space ajn’r ineaulmngf.
  • Jr contna lsetec brv omailpt bnemru lx clusters.
  • Jr jc seteniivs vr data vn tferfdein lsaecs.
  • Ooy rk xrb nnrosedams xl rxu itinlia centroids, clusters gcm dskt yitlsglh eewebnt tnyz.
  • Jr zj inetvisse rv outliers.
  • Jr naeflltrepreiy idsnf slrihecpa clusters le aelqu itedrame, nxkx lj vrg lnndugriye data ondes’r jrl crjp eidnpirsotc.
Exercise 2

Cluster the GvHD.pos data set in the same way we did with the GvHD.control data set. Is the choice of cluster number as straightforward? You may need to manually supply a value for the centers argument, rather than rely on the output of the tuning procedure.

Summary

  • Clustering is an unsupervised machine learning technique concerned with finding sets of cases in a dataset that are more similar to each other than to cases in other sets.
  • K-means clustering involves the creation of randomly placed centroids that iteratively move toward the center of clusters in a dataset.
  • The three most commonly used k-means algorithms are Lloyd's, MacQueen's, and Hartigan-Wong.
  • The number of clusters for k-means needs to be selected by the user. This can be done graphically, and by combining internal cluster metrics with cross-validation and/or bootstrapping.

Solutions to exercises

  1. Increase the iter.max of our k-means learner to 200:
kMeans <- makeLearner("cluster.kmeans",
                      par.vals = list(iter.max = 200, nstart = 10))

tunedK <- tuneParams(kMeans, task = gvhdTask,
                     resampling = kFold,
                     par.set = kMeansParamSpace,
                     control = gridSearch,
                     measures = list(db, dunn, G1))

# The error about not converging disappears when we set iter.max to 200.
  2. Use k-means to cluster the GvHD.pos dataset:
gvhdPosTib <- as_tibble(GvHD.pos)

gvhdPosScaled <- scale(gvhdPosTib)

gvhdPosTask <- makeClusterTask(data = as.data.frame(gvhdPosScaled))

tunedKPos <- tuneParams(kMeans, task = gvhdPosTask,
                        resampling = kFold,
                        par.set = kMeansParamSpace,
                        control = gridSearch,
                        measures = list(db, dunn, G1))

kMeansTuningDataPos <- generateHyperParsEffectData(tunedKPos)

gatheredTuningDataPos <- gather(kMeansTuningDataPos$data,
                                key = "Metric",
                                value = "Value",
                                c(-centers, -iteration, -algorithm))

ggplot(gatheredTuningDataPos, aes(centers, Value, col = algorithm)) +
  facet_wrap(~ Metric, scales = "free_y") +
  geom_line() +
  geom_point() +
  theme_bw()

tunedKMeansPos <- setHyperPars(kMeans, par.vals = list("centers" = 4))

tunedKMeansModelPos <- train(tunedKMeansPos, gvhdPosTask)

kMeansModelDataPos <- getLearnerModel(tunedKMeansModelPos)

mutate(gvhdPosTib,
       kMeansCluster = as.factor(kMeansModelDataPos$cluster)) %>%
  ggpairs(mapping = aes(col = kMeansCluster),
          upper = list(continuous = "density")) +
  theme_bw()

# The optimal number of clusters is less clear than for GvHD.control.