Appendix. Refresher on statistical concepts


If you don’t come from a statistical background, or perhaps just want to refresh your memory about some statistical concepts, this appendix aims to get you up to speed with the basic knowledge you’ll need to get the most out of this book. If you’re unsure whether you need to use this refresher, flick through the section headings and make sure there’s nothing you don’t feel confident with. You won’t need to memorize any of this material, only be aware of the important concepts. Also feel free to reference any of the definitions here as you progress through the book.


A.1. Data vocabulary

Let’s start with some basic vocabulary we’ll be using to describe data. There are some variations in the way data scientists and statisticians use the terminology, so I’ll try to make it clear which terms are equivalent and which I opt to use throughout the book. In this section, we’ll discuss

  • The difference between a sample and a population
  • What we mean by rows, columns, cases, and variables
  • What the different types of variables are and how they differ

A.1.1. Sample vs. population


In data science and statistics, we're usually trying to learn something about, or predict something in, the real world. Let's say we're interested in the tusk length of hippos. It would be impossible to measure the tusk length of every hippo in the world: there are simply too many, and they aren't keen on us putting a ruler inside their mouth. So instead, we measure as many hippos' tusks as is feasible, both in terms of finance and hours of work. This smaller, more manageable number of hippos is called our sample. We hope that the tusk lengths in our sample do a good job of representing the tusk lengths of all the hippos in the world, which is the population to which we are trying to generalize our findings. This distinction between the sample and the population is illustrated in figure A.1.

Figure A.1. The difference between the population and the sample. The population is the set of all units we would like to generalize our results to. The population is often considered nearly infinite in size. The sample is a more manageable subset we measure, which we hope will represent the population.

A difference between the sample and the population is referred to as sampling error and arises because the sample is almost never a perfect representation of the population. We hope to make sampling error as small as possible by using a sample that is as large as possible and by not introducing bias when creating our sample (for example, not selecting smaller hippos because they are less scary). If sampling error is too large, we won't be able to generalize our findings to the wider population.
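As a rough illustration of sampling error, the following sketch simulates an imaginary population of tusk lengths and compares the means of a small and a large random sample against the population mean. The population parameters (mean 33, standard deviation 10), the seed, and the variable names are assumptions made purely for this example.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 100,000 tusk lengths (cm); parameters are made up
population <- rnorm(100000, mean = 33, sd = 10)

# Draw a small and a large random sample without bias
small_sample <- sample(population, size = 10)
large_sample <- sample(population, size = 1000)

# Sampling error: difference between each sample mean and the population mean
error_small <- mean(small_sample) - mean(population)
error_large <- mean(large_sample) - mean(population)

c(small = error_small, large = error_large)
```

Larger samples tend to give smaller errors, although any single pair of samples is not guaranteed to show this.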

A.1.2. Rows and columns

Once we've collected our data, most of the time we can structure it into a tabular format with rows and columns. A common way of representing data of this type in R is by using a data frame.

As explained in chapter 2, we often need to rearrange how tabular data is structured, depending on our goal. But most of the time, it's desirable to format data such that each row represents a single unit of our sample, and each column represents a different variable. For our hippo example, each hippo would be a single unit in our data set, so each row would correspond to measurements made on a single hippo, as shown in table A.1.

Table A.1. An example of data arranged in tabular format, where each row corresponds to a single hippo and each column corresponds to a different variable. Note that, culturally, hippos give their children names beginning with H.

Name      TuskLength  Female
Harry     32          FALSE
Hermione  15          TRUE
Hector    45          FALSE
Heidi     20          TRUE

We can create a data frame like this in R using the data.frame() function, as shown in the following listing.

Listing A.1. Creating our hippo data frame
hippos <- data.frame(
  Name = c("Harry", "Hermione", "Hector", "Heidi"),
  TuskLength = c(32, 15, 45, 20),
  Female = c(FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = TRUE  # store Name as a factor (the default before R 4.0.0)
)

In statistics, when the data is formatted like this, each row is said to correspond to one subject in the data, with a subject here being a single hippo. In data science and machine learning, it's more common to use the term case to describe a single unit in the data, so this is the term I use throughout the book.

Columns that contain measurements made on each case are referred to as variables. When we're trying to predict the value of one variable based on its relationships with the others, we use terms to distinguish between the variable we want to predict and the ones we are using to make the predictions. Statisticians call the variable we're trying to predict the dependent variable, while the variables we use to make these predictions are called the independent variables. In data science, you might be more likely to read the term outcome or response variable for the dependent variable and predictor variables or features for the independent variables. I use the data science terminology throughout the book.

A.1.3. Variable types

Different variables might be measured using different types of scales, meaning we need to handle them differently. Throughout the book, I mention continuous variables, categorical variables, and sometimes logical variables.

Continuous variables represent some measurement on a numeric continuum. For example, the length of a hippo's tusk would be represented as a continuous variable. We can apply mathematical transformations to continuous variables. In R, continuous variables are most commonly represented as integers or as doubles. An integer variable can only have whole numbers, whereas a double can also include non-zero digits after a decimal point. In the data shown in table A.1, the TuskLength variable is numeric.

Categorical variables have levels, each of which represents a different group or category of subject. For example, let's say we were comparing tusk length between common hippos and pygmy hippos. Our data would contain a categorical variable indicating which species of hippo each case in the data belonged to. In R, it's common to represent categorical variables as factors, where the possible levels of the factor are predefined. In the data shown in table A.1, the Name variable is categorical.

Logical variables can take a value of TRUE or FALSE to indicate a binary outcome. For example, we could include a logical variable to indicate whether the hippo tried to bite us. Logical variables are most useful as arguments to functions, to control the way they behave, or to select cases that are most interesting to us. In the data shown in table A.1, the Female variable is logical.

The next listing shows how we can use the class() function to determine what kind of variable we're working with.

Listing A.2. Using class() to determine variable types
class(hippos$Name)
[1] "factor"

class(hippos$TuskLength)
[1] "numeric"

class(hippos$Female)
[1] "logical"
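To make the idea of predefined factor levels concrete, here is a minimal sketch that builds a categorical variable for the common-versus-pygmy comparison mentioned earlier. The `species` vector and its values are hypothetical and are not part of the hippos data frame.

```r
# Hypothetical species labels for four hippos
species <- factor(
  c("common", "pygmy", "common", "pygmy"),
  levels = c("common", "pygmy")  # the predefined set of allowed categories
)

class(species)   # "factor"
levels(species)  # "common" "pygmy"
table(species)   # number of cases in each level
```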
Get Machine Learning with R, the tidyverse, and mlr

A.2. Vectors

Figure A.2. An example of a two-dimensional vector at point x = 3, y = 5. The arrow shows how vectors encode magnitude, which we can represent as their distance from the origin (or another vector). The curved line representing the angle between the x-axis and the arrow indicates how vectors encode direction.

A vector is a set of numbers that encodes both magnitude and direction. Imagine a coordinate system with an x-axis and a y-axis, as shown in figure A.2. If we pick a point on the coordinate system, that point will have a value for each axis: let's say x = 3 and y = 5. We can represent this point as the vector (3,5). The vector encodes magnitude, because we can calculate the distance between the point defined by this vector and the origin of the coordinate system (0,0). The vector also encodes direction, because if we draw a line connecting the origin (0,0) to this point (3,5), we can calculate the angle between this line and the axes of the coordinate space. Figure A.2 is an example of a two-dimensional vector, but vectors can have as many dimensions as we like.

We can perform operations with vectors, such as addition, subtraction, and multiplication, to create new vectors. We don't do any complex mathematics using vectors throughout the book, but I sometimes refer to vectors when we're dealing with concepts in more than two dimensions. For example, in some parts of the book I refer to a vector of means, where each element in the vector is the mean of a different variable.

Confusingly, R has a data structure called an atomic vector that may or may not represent a mathematical vector. An atomic vector in R contains a set of values that must all be the same type (this is where the word atomic comes from in the name). If the atomic vector's elements are numeric, then it will also be a vector in the mathematical sense, because the values encode magnitude and direction. But if we have atomic vectors with character or logical elements, neither of these can encode magnitude and direction; so while we refer to them as vectors within R, they are not vectors in the mathematical sense. Here is how we can create numeric, character, and logical atomic vectors using the c() function.

Listing A.3. Creating atomic vectors in R
numericVector <- c(1, 31, 10)

characterVector <- c("common hippo", "pygmy hippo")

logicalVector <- c(TRUE, TRUE, FALSE)
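As a small sketch of how a numeric vector encodes magnitude and direction, the code below takes the vector (3,5) from figure A.2 and computes its distance from the origin and its angle to the x-axis. The variable names are chosen only for this example.

```r
v <- c(3, 5)  # the two-dimensional vector from figure A.2

# Magnitude: Euclidean distance between the point and the origin (0,0)
magnitude <- sqrt(sum(v^2))

# Direction: angle between the x-axis and the line from the origin to the point
angle_deg <- atan2(v[2], v[1]) * 180 / pi

magnitude  # about 5.83
angle_deg  # about 59.04 degrees
```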

A.3. Distributions

When we measure a variable, it's often desirable to examine the range of values taken on by the variable. We can do this, for example, using a histogram, where we plot the possible values of our variable against the frequency with which we observed each of them. The shape we get from plotting such a histogram represents the distribution of our variable and tells us information such as where our variable is centered, how dispersed it is, whether its values are symmetrically distributed around its center, and how many peaks it has.

We can summarize distributions of variables using a variety of statistics, such as those that summarize the central tendency of the distribution, those that summarize the dispersion, and those that summarize the shape and symmetry. Visually inspecting the distributions of our variables is important, however, to help us decide the best way to handle different variables.

Some distributions occur so frequently in nature that mathematicians have formally defined them and studied their properties. This is useful, because if we find that our variable approximates one of these well-defined distributions, we can simplify our statistical modeling by assuming the variable in the underlying population follows this distribution. Common examples of well-defined distributions are the Gaussian (also called the normal) distribution, which is one of many bell-shaped distributions, and the Poisson distribution, which variables representing discrete counts often follow.

If we measured 1,000 hippo tusks and plotted a histogram of their lengths, we might get something like the distribution shown in figure A.3. The bars of the histogram represent the frequency with which a particular tusk length occurs in the data set. I've overlaid a theoretical normal distribution (the smooth line) over the histogram, whose mean and standard deviation correspond to those of the data.

Figure A.3. A histogram showing the distribution of an imaginary sample of hippo tusk lengths. The distribution approximates a Gaussian distribution. The curved line represents the probability density function of a Gaussian distribution with the same mean and standard deviation as the sample.

Distributions that are mathematically defined are often called probability distributions, and they have a defined probability density function. The probability density function for a particular distribution is an equation that we can use to calculate how probable a particular value is under that distribution. For example, let's say we measured a hippo tusk as being 32 cm long. If we knew the distribution that best represents the lengths of all hippo tusks, we could use the probability density function to estimate the probability of finding a hippo with a 32-cm tusk. You don't need to know or memorize any probability density functions before reading the book, but I refer to them on occasion, so it's useful for you to know what they are. Look back at figure A.3: the smooth line I overlaid onto the histogram is the probability density function for the Gaussian distribution with the same mean and standard deviation as the data.
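The following sketch shows one way to evaluate a Gaussian probability density function in R with dnorm(). The mean (33.2) and standard deviation (16.42) are assumptions borrowed from the five example tusks used later in this appendix, not estimates from a real population. Strictly speaking, for a continuous variable the density at a single value is not itself a probability, so the sketch also uses pnorm() to get the probability of a tusk falling in a small interval around 32 cm.

```r
# Assumed Gaussian parameters for tusk length (cm); illustrative only
tusk_mean <- 33.2
tusk_sd <- 16.42

# Height of the probability density function at 32 cm
density_at_32 <- dnorm(32, mean = tusk_mean, sd = tusk_sd)

# Probability that a tusk is between 30 and 34 cm long
prob_30_to_34 <- pnorm(34, mean = tusk_mean, sd = tusk_sd) -
  pnorm(30, mean = tusk_mean, sd = tusk_sd)

density_at_32
prob_30_to_34
```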


A.4. Sigma notation

Mathematical notation can look intimidating to those not formally trained in its use. But mathematical notation is really there to make our lives easier. While there are some equations in this book, not one of them uses anything more complicated than addition, subtraction, multiplication, and division. I do, however, use one symbol that makes my life a lot easier; and once you get the hang of it, it will make your life easier too (and make lots of equations seem less impenetrable). That symbol is the capital Greek letter sigma, which looks like a strange “E” (Σ).

In equations, capital sigma simply means to sum whatever is on the right-hand side of it. You'll usually see indices above and below the sigma that tell us where to start and stop summing from. For example, instead of writing 1 + 2 + 3 + 4 + 5 = 15, we can use the sigma notation shown in equation A.1.

equation A.1.

$$\sum_{i=1}^{5} i = 15$$

We can do this in R using the sum() function.

Listing A.4. Using the sum() function in R
sum(1:5)

[1] 15

We can write more complicated expressions using sigma notation, and the indices give us control over the range of values we want to sum over. Take a look at equation A.2 and try to work out what the value of x is.

equation A.2.

$$x = \sum_{i=3}^{6} \left(2^{i} - i\right)$$

If the answer isn't clear to you, perhaps thinking like a programmer will help. You can think of sigma notation as a for loop for addition. If I was going to read equation A.2 aloud as a for loop, I would say, “For all values of i between 3 and 6, take the ith power of 2 and subtract i, and then add up all these values.” This then becomes

  • 2^3 – 3 = 5
  • 2^4 – 4 = 12
  • 2^5 – 5 = 27
  • 2^6 – 6 = 58

and 5 + 12 + 27 + 58 = 102.

We can do this in R by creating a function that calculates the value on the right-hand side of the sigma sign, and passing it to the sum() function.

Listing A.5. Using sum() for more complex functions
fun <- function(i) (2^i) - i

sum(fun(3:6))

[1] 102
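The "for loop for addition" reading of the same sum can also be written out literally. This is only a sketch to show the correspondence; the vectorized sum() call above is the more idiomatic R.

```r
total <- 0

# For all values of i between 3 and 6, add (2^i - i) to the running total
for (i in 3:6) {
  total <- total + (2^i - i)
}

total  # 102
```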

Using sigma notation means that when we are summing dozens, hundreds, or even thousands of numbers, we don't have to write them all. So I hope you can see how sigma notation makes our lives easier! I'm introducing it to you here because I'm going to use it in the next section to remind you how to calculate the arithmetic mean.


A.5. Central tendency

When working with variables, it's often important to get a sense of the center of their distributions. There are multiple statistics we can use to summarize the center of a distribution; they give different information and are appropriate in different situations. Statistics that provide such information are referred to as measures of central tendency, and the three most common are the arithmetic mean, the median, and the mode.

A.5.1. Arithmetic mean

Much to the surprise of spreadsheet users, there is no formal mathematical concept of an “average.” But when people colloquially speak of an “average,” what they usually mean (pun intended) is the arithmetic mean. The arithmetic mean (or just the mean) is simply the sum of all the values in a vector, divided by the number of elements. For example, if I measure the tusk lengths of 5 hippos as being 32, 15, 45, 20, and 54, then the mean is (32 + 15 + 45 + 20 + 54) / 5 = 33.2.

Writing that out for just five hippo tusks is cumbersome enough, but imagine if I had to write this for dozens of tusks! Instead, we can use our new friend, sigma notation. The arithmetic mean in sigma notation is shown in equation A.3.

equation A.3.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

For our hippo example, x represents our vector of tusk lengths, i is an index telling us which element of that vector to consider, and n is the total number of elements in the vector. Then we can read equation A.3 aloud as “For each element of x between the first and the last element, add the values of x. Then divide this value by the number of elements in x.” We can do this in R using the mean() function.

Listing A.6. Using the mean() function in R
mean(c(32, 15, 45, 20, 54))

[1] 33.2
Note

Wonder why I'm bothering to specify that this is the arithmetic mean? That's because there are two other types of mean, appropriate in other situations, called the geometric mean and the harmonic mean. I don't mention them in the book, so I won't elaborate on them, but I suggest you find out more about their uses.

The arithmetic mean is useful for summarizing the center of symmetrical distributions with a single peak, such as the Gaussian distribution. For distributions that are asymmetric, have multiple peaks, or have outliers, however, the mean may not be a good representative of the distribution's central tendency.

Note

The term outlier is used to describe a case that is considerably different from the majority of the cases. It is a case that has an unusually high or low value for one or more variables. There are many methods for identifying whether a case is an outlier, but it really depends on the task at hand.

A.5.2. Median

The median is a robust measure of central tendency, meaning it is not severely influenced by asymmetry or outlying cases in a distribution like the mean is. The median also has a very simple interpretation: it is the value for which 50% of the cases are larger and 50% of the cases are smaller. To calculate the median, we simply arrange the elements of a vector in order of their size and pick the middle value.

Let's look back at the tusk lengths from earlier: 32, 15, 45, 20, and 54. Rearranging the tusks in order of size gives us 15, 20, 32, 45, and 54, so the median is 32 because it's the middle value. If the vector has an even number of elements, there are two middle values, and the median is the value that is midway between them. So if we measure another hippo tusk to be length 5, arranging the elements in order now gives 5, 15, 20, 32, 45, and 54. This means the median is midway between 20 and 32, which is 26. We can calculate the median in R using the median() function.

Listing A.7. Using the median() function in R
median(c(32, 15, 45, 20, 54))

[1] 32

median(c(32, 15, 45, 20, 54, 5))

[1] 26
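To see what "robust" means in practice, this sketch adds one made-up, implausibly long tusk (500) to the example data and compares how far the mean and the median move. The outlying value is an assumption chosen only to exaggerate the effect.

```r
tusks <- c(32, 15, 45, 20, 54)
with_outlier <- c(tusks, 500)  # hypothetical outlying case

mean(tusks)           # 33.2
mean(with_outlier)    # 111: dragged far toward the outlier

median(tusks)         # 32
median(with_outlier)  # 38.5: barely changes
```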

A.5.3. Mode

The mode is generally used in slightly different situations than the mean and median. Whereas the mean and median summarize the center of the distribution, the mode tells us which individual value is most commonly observed across the distribution.

Note

There is no function for calculating the mode in base R, but you can write one yourself if you need to.
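One possible way to write such a function is sketched below. The name `stat_mode` is hypothetical (base R's own mode() function exists but reports an object's storage mode, not the statistical mode), and this version returns every tied value when there is more than one mode.

```r
# Hypothetical helper: returns the most frequently observed value(s) in x
stat_mode <- function(x) {
  values <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, values))     # how often each distinct value occurs
  values[counts == max(counts)]            # keep the value(s) with the highest count
}

stat_mode(c(1, 2, 2, 3, 3, 3))        # 3
stat_mode(c("a", "b", "b", "a"))      # "a" "b" (a tie)
```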


A.6. Measures of dispersion

In addition to summarizing the center of a distribution, it's often important to also summarize how dispersed or spread out the values of the distribution are. There are many different measures of dispersion that tell us slightly different information and are appropriate in different situations, but they all give us an indication as to how skinny or wide our distribution of values is. I'm going to remind you of four of these: mean absolute deviation, standard deviation, variance, and the interquartile range.

A.6.1. Mean absolute deviation

Let's start by talking about what I mean by the word deviation (this isn't the same meaning as your grandparents might use when chastising immoral behavior). The deviation of an element in a distribution is how far that element's value is from the mean of the distribution. So if the mean length of our hippo tusks is 33.2 cm, the deviation of a 16.1-cm tusk is –17.1 cm. Notice that this deviation is signed: it's negative if the element is smaller than the mean, and it's positive if the element is larger than the mean.

Note

The deviation between a value and an estimated value is called a residual, and I elaborate more on this in the body of the book.

To get an idea of the average (there's that nondescript word again) difference between all the elements and the mean of the distribution, we could take the mean of all the deviations. The problem with this is that the positive and negative deviations cancel each other out, so the mean of the deviations from the mean is always zero, however spread out the data is.

Instead, we can take the absolute deviations by changing the sign of the negative deviations to positive, and take the mean of these. This gives us the mean absolute deviation, which will be larger when the data is spread out and smaller when the data is concentrated around the center of the distribution. The equation for the mean absolute deviation is shown in equation A.4, where the vertical lines indicate the absolute value of the expression between them, and x̄ indicates the mean.

equation A.4.

$$\text{mean absolute deviation} = \frac{\sum_{i=1}^{n} \left| x_i - \bar{x} \right|}{n}$$

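We can calculate the mean absolute deviation in R directly from its definition, using the mean() and abs() functions. R also has a mad() function, but it calculates the median absolute deviation (which is also commonly used) and, by default, multiplies the result by a scaling constant of 1.4826. Its center argument only changes the value that the deviations are measured from, so mad() doesn't return the mean absolute deviation even if we set center to the mean.

Listing A.8. Calculating the mean absolute deviation in R
tusks <- c(32, 15, 45, 20, 54)

mean(abs(tusks - mean(tusks)))

[1] 13.04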

A.6.2. Standard deviation

While the mean absolute deviation is a very intuitive and sensible measure of dispersion, you won't see it reported very often. That's because people more commonly use and report the standard deviation. The standard deviation is similar to the mean absolute deviation, except for a few differences. First, instead of summing the absolute deviations from the mean, we sum the squared deviations. We then divide this sum by n – 1 (one fewer than the number of elements in the vector) and take the square root. You can see this shown in equation A.5, where s is the standard deviation.

equation A.5.

$$s = \sqrt{\frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n - 1}}$$

We can calculate this in R using the sd() function.

Listing A.9. Using the sd() function in R
sd(c(32, 15, 45, 20, 54))

[1] 16.42
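As a check on the description above, this sketch computes the standard deviation step by step (squared deviations, divide by n – 1, square root) and compares it with sd(). The intermediate variable names are chosen only for this example.

```r
tusks <- c(32, 15, 45, 20, 54)
n <- length(tusks)

squared_devs <- (tusks - mean(tusks))^2        # squared deviations from the mean
s_manual <- sqrt(sum(squared_devs) / (n - 1))  # divide by n - 1, then square root

s_manual                        # about 16.42
all.equal(s_manual, sd(tusks))  # TRUE
```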

Why use the standard deviation when the mean absolute deviation is much more intuitive? Because the standard deviation has some nice mathematical properties that make it more convenient to work with. One important consequence of using the standard deviation rather than the mean absolute deviation is that because the differences are squared, it is more greatly influenced by cases that are far from the mean. Another convenience of the standard deviation is that if the data follows a Gaussian (normal) distribution, then known proportions of the data will fall within certain standard deviations away from the mean. This is elaborated on in figure A.4, which shows that for a perfect Gaussian distribution, 68%, 95%, and 99.7% of the cases fall within one, two, and three standard deviations of the mean, respectively.

A.6.3. Variance

The variance is very easy to calculate: it is simply the square of the standard deviation. Its formula is the same as for the standard deviation, except of course that we drop the square root symbol. This is shown in equation A.6, where s^2 is the variance.

equation A.6.

$$s^2 = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}{n - 1}$$

If the variance and standard deviation are transformations of each other, why do we need both? We don't, really; but while the variance makes some statistical computations slightly simpler, the standard deviation has the advantage of having the same units as the variable for which it's calculated.

Figure A.4. For a perfect, Gaussian-distributed variable, 68% of the cases lie within one standard deviation’s distance of the mean. 95% and 99.7% of cases lie within two and three standard deviations, respectively.

We can calculate the variance in R using the var() function or by taking the square of the standard deviation.

Listing A.10. Using the var() function in R
var(c(32, 15, 45, 20, 54))

[1] 269.7

sd(c(32, 15, 45, 20, 54))^2

[1] 269.7

A.6.4. Interquartile range

While the standard deviation and variance, in particular, are well suited for summarizing the dispersion of symmetrical distributions with no outliers, we need ways to summarize the dispersion of distributions that don't play by these rules. The interquartile range (IQR) is a good choice in such situations, because it is a robust statistic that isn't heavily influenced by outliers and asymmetry. Simply put, the IQR is the difference between the first quartile and the third quartile.

If we arrange the elements of a vector in order of their values, the quartiles of the vector are the elements for which 25%, 50%, 75%, and 100% of the other elements have smaller values. The first quartile is a middle value between the smallest element and the median: it splits the vector such that 25% of the elements are below it and 75% are above it. The second quartile is the median, splitting the vector such that 50% of the elements are above it and 50% are below it. The third quartile is a middle value between the median and the largest element and splits the vector such that 75% of the elements are below it and 25% of the elements are above it. The zeroth and fourth quartiles are the smallest and largest elements, respectively.

Note

I've left the definition of the first and third quartiles relatively ambiguous, because there are no less than nine different methods of calculating their exact values! These methods don't always agree with each other, but they always divide the elements of the vector into 25% and 75%, so we won't split hairs about them here.

A common graphical method of displaying the quartiles is using a box and whiskers plot (sometimes just called a box plot). An example of a box and whiskers plot is shown in figure A.5, with some imaginary data for the tusk length of three different hippo species. The thick horizontal line shows the second quartile (the median) for each hippo species. The lower and upper edges of the boxes represent the first and third quartiles, respectively. The whiskers (the vertical lines extending out of the boxes) connect the lowest and highest value for each species and therefore represent the full range of the data.

Figure A.5. Box and whiskers plots for imaginary hippo tusk data. The thick horizontal lines are the medians, the edges of the boxes represent the first and third quartiles, and the vertical whiskers represent the full range.
Note

Sometimes the whiskers don't represent the full range. Often they indicate the Tukey range, which is 1.5 times the IQR below and above the first and third quartiles, respectively. Any cases outside this range are drawn as a dot to highlight them as potential outliers.

The IQR is the difference between the first and third quartiles of the vector and thus tells us the range of the middle 50% of the elements in the vector. It is useful in situations where we have outlying cases and/or non-Gaussian-distributed data.

We can calculate the IQR in R using the IQR() function (which is unusual in that the function name is capitalized).

Listing A.11. Using the IQR() function in R
IQR(c(32, 15, 45, 20, 54))

[1] 25
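The quartiles behind this result, and the Tukey range mentioned earlier, can be computed with the quantile() function. This is a sketch that relies on quantile()'s default method, which is only one of the nine available; the variable names are chosen for this example.

```r
tusks <- c(32, 15, 45, 20, 54)

# First and third quartiles using quantile()'s default method
q <- quantile(tusks, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])  # 45 - 20 = 25, matching IQR(tusks)

# Tukey range: 1.5 times the IQR below the first and above the third quartile
lower_fence <- unname(q[1] - 1.5 * iqr)  # -17.5
upper_fence <- unname(q[2] + 1.5 * iqr)  # 82.5

# Cases outside the Tukey range would be flagged as potential outliers
outliers <- tusks[tusks < lower_fence | tusks > upper_fence]
outliers  # none for this data
```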

A.7. Measures of the relationships between variables

It's very common to find that there are relationships between pairs of variables we are working with. Even if two variables have no causal relationship, it's not uncommon that they will have a relationship. This could be a positive relationship, such that when one variable increases in value, so does the other; or a negative relationship, where, as one variable increases, the other decreases.

It's important to be able to summarize relationships between pairs of variables in terms of both their direction (positive, negative, or no relationship) and magnitude (no relationship to perfect relationship). The two most common statistics used to summarize the direction and magnitude of the relationship between two variables are the covariance and the Pearson correlation coefficient.

A.7.1. Covariance

The covariance between two variables tells us how they covary. If the pair of variables increase and decrease together, the covariance is positive; and if one variable increases as the other decreases, the covariance is negative. If there is no relationship between the pair of variables, the covariance is zero (but this practically never happens in the real world).

It's possible for two variables to have a covariance of zero (or near zero) but actually have a nonlinear relationship. Run the following code and see for yourself (note how small the covariance value is):

x <- seq(-1, 1, length = 1e6)

y <- x^4

plot(x, y, type = "l")

cov(x, y)

To calculate the covariance, we consider a single case and find its deviation from the mean of the first variable and then also from the mean of the second variable. We then find the product of these deviations. This is done for all of the cases in the data set, and these products of deviations are added up and divided by n – 1 (one fewer than the number of elements in the vector). This process is illustrated in equation A.7.

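equation A.7.

$$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} \left( x_i - \bar{x} \right)\left( y_i - \bar{y} \right)}{n - 1}$$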

We can calculate the covariance between two vectors in R using the cov() function.

Listing A.12. Using the cov() function in R
tusks <- c(32, 15, 45, 20, 54)

weight <- c(18, 11, 19, 15, 18)

cov(tusks, weight)

[1] 44.7
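The steps described for the covariance (deviations, products, sum, divide by n – 1) can be sketched by hand and compared with cov(). The name `cov_manual` is chosen only for this example.

```r
tusks <- c(32, 15, 45, 20, 54)
weight <- c(18, 11, 19, 15, 18)
n <- length(tusks)

# Product of each case's deviations from the two means, summed, over n - 1
cov_manual <- sum((tusks - mean(tusks)) * (weight - mean(weight))) / (n - 1)

cov_manual                                # 44.7
all.equal(cov_manual, cov(tusks, weight)) # TRUE
```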

The covariance is very useful mathematically, but as its units are the product of the units of both variables, its magnitude can be difficult to interpret. Covariance is therefore said to be an unstandardized measure of the relationship between variables, which means we cannot compare covariances between pairs of variables measured on different scales. A standardized version of covariance is correlation or, more formally, the Pearson correlation coefficient.

A.7.2. Pearson correlation coefficient

The Pearson correlation coefficient (or just the correlation coefficient) is a standardized version of the covariance that is unitless and is bounded between –1 and +1. A correlation of –1 indicates a perfect negative relationship between the pair of variables, a correlation of +1 indicates a perfect positive relationship, and a correlation of zero indicates no linear relationship at all. These three extremes rarely occur in the real world (if you get +1, check that you haven't calculated the correlation between a variable and itself), and a value somewhere between them is much more likely.

Note

I've made it a point to call this the Pearson correlation coefficient (after the statistician Karl Pearson) to distinguish it from other, perhaps less commonly used, types: Kendall rank, Spearman, and point-biserial correlation. These other types are useful in situations when the variables are not both continuous and Gaussian distributed, as is assumed by the Pearson correlation coefficient, but we don't consider them in this book.

Calculating the Pearson correlation coefficient is simple if we know how to calculate the covariance; we simply divide the covariance by the product of the standard deviations of the variables. This is shown in equation A.8.

equation A.8.

$$r = \frac{\text{cov}(x, y)}{s_x s_y}$$

Because the correlation coefficient (often represented by r) is standardized and unitless, we can compare its value between pairs of variables on different scales. We can calculate the Pearson correlation coefficient between two vectors in R using the cor() function.

Listing A.13. Using the cor() function in R
tusks <- c(32, 15, 45, 20, 54)

weight <- c(18, 11, 19, 15, 18)

cor(tusks, weight)

[1] 0.8321
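To connect this result back to the covariance, the sketch below divides the covariance by the product of the two standard deviations and compares the result with cor(). The name `r_manual` is chosen only for this example.

```r
tusks <- c(32, 15, 45, 20, 54)
weight <- c(18, 11, 19, 15, 18)

# Covariance divided by the product of the standard deviations
r_manual <- cov(tusks, weight) / (sd(tusks) * sd(weight))

r_manual                                # about 0.8321
all.equal(r_manual, cor(tusks, weight)) # TRUE
```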

A.8. Logarithms

Logarithms, or logs, are mathematical functions that are the opposite of exponentiation. For example, if 2^5 = 32, then log2(32) = 5. In this example, the base of the logarithm is 2. In other words, the result of log2(32) is the exponent to which 2 must be raised to get 32. Logarithms can have any base we like, depending on our reasons for wanting to use a logarithmic function. The three most common choices are logs with bases 2, 10, and Euler's number (e), which is an important constant with a value of approximately 2.718. The base of a logarithm is usually denoted as a subscript after the log symbol (for example, log2 or log10); but when the base is e, the log is called the natural logarithm and is usually denoted as ln.

Note

You may see something like log(x) with no subscript. Depending on the intended audience, this may be interpreted as log10(x) or loge(x). It's much better to be explicit about which one you mean.

Logarithms have many useful properties in mathematics and statistics. One is that they can be used to compress extremely large values on a scale together with much smaller values. For example, the log10 of the vector 1, 10, 100, 1,000, 10,000, 100,000 is 0, 1, 2, 3, 4, 5. So if we have a variable containing both very small and very large numbers, this variable can be made easier to work with if we log10-transform it.

Another useful property of logs, particularly the natural logarithm, is that if there is an exponential relationship between two variables (say, time and bacterial growth), taking the log of one of the variables can linearize the relationship. Working with a linear relationship between variables is often mathematically simpler.

Take a look at the example in figure A.6. The left-hand plot shows a y variable with both very small and very large values, where the relationship between the x and y variables is exponentially increasing. The right-hand plot shows the same data, but after log10-transforming the y variable. You can see that, after the transformation, the y variable can now be more easily visualized on a plot, and the relationship between the x and y variables has been linearized.

Figure A.6. The impact of log10 transformation on variables. The left-side plot shows a y variable with both very small and very large values. In the right-side plot, the y variable has been log10-transformed.
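The properties described in this section can be tried out with R's log2(), log10(), and log() functions (log() uses base e unless told otherwise). The x and y vectors below are made-up data with an exactly exponential relationship; they are not the data plotted in figure A.6.

```r
log2(32)                                   # 5
log10(c(1, 10, 100, 1000, 10000, 100000))  # 0 1 2 3 4 5
log(exp(1))                                # 1: log() is the natural logarithm

# Hypothetical exponentially increasing variable
x <- 1:10
y <- 10^x

# After log10-transforming y, it increases by a constant amount per unit of x,
# which is what a linear relationship looks like
log_steps <- diff(log10(y))
log_steps
```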
