This chapter covers
- Understanding the tidyverse
- What is meant by tidy data
- Installing and loading the tidyverse
- Using the tibble, dplyr, ggplot2, tidyr, and purrr packages
I’m really excited to start teaching machine learning to you. But before we dive into that, I want to teach you some skills that are going to make your learning experience simpler and more effective. These skills will also improve your general data science and R programming skills.
Imagine that I asked you to build me a car (a typical request between friends). You could go old-fashioned: you could purchase the metal, glass, and other components; hand-cut all the pieces; hammer them into shape; and rivet them together. The car might look beautiful and work perfectly, but it would take a very long time to build, and it would be hard for you to remember exactly what you did if you had to make another one.
Instead, you could take a modern approach and use robotic arms in your factory. You could program them to cut and bend the pieces into predefined shapes and assemble the pieces for you. In this scenario, building a car would be much faster and simpler for you, and it would be easy for you to reproduce the same process in the future.
Now imagine that I have a more reasonable request and ask you to reorganize and plot a data set, ready to be passed through a machine learning pipeline. You could use base R functions for this, and they would work fine. But the code would be long, it wouldn't be very human-readable (so in a month you'd struggle to remember what you did), and the plots would be cumbersome to produce.
Instead, you could take a more modern approach and use functions from the tidyverse family of packages. These functions help simplify the data-manipulation process, are very human-readable, and allow you to produce very attractive graphics with minimal typing.
The purpose of this book is to give you the skills to apply machine learning approaches to your data. While it isn't my intention to cover all other aspects of data science (nor could I, in a single book), I do want to introduce you to the tidyverse. Before you can input your data into a machine learning algorithm, it needs to be in a format that the algorithm is happy to work with.
The tidyverse is an “opinionated collection of R packages designed for data science,” created for the purpose of making data science tasks in R simpler, more human-readable, and more reproducible. The packages are “opinionated” because they are designed to make tasks the package authors consider to be good practice, easy, and make tasks they consider to be bad practice, difficult. The name comes from the concept of tidy data, a data structure in which
- Each row represents a single observation.
- Each column represents a variable.
Take a look at the data in table 2.1. Imagine that we take four runners and put them on a new training regime. We want to know if the regime is improving their running times, so we record their best times just before the new training starts (month 0), and for three months thereafter.
Table 2.1. An example of untidy data. This table contains the running times for four runners, taken immediately before starting a new training regime and then for three months thereafter.
Athlete | Month 0 | Month 1 | Month 2 | Month 3 |
---|---|---|---|---|
Joana | 12.50 | 12.1 | 11.98 | 11.99 |
Debi | 14.86 | 14.9 | 14.70 | 14.30 |
Sukhveer | 12.10 | 12.1 | 12.00 | 11.80 |
Kerol | 19.60 | 19.7 | 19.30 | 19.00 |
This is an example of untidy data. Can you see why? Well, let's go back to our rules. Does each row represent a single observation? Nope. In fact, we have four observations per row (one for each month). Does each column represent a variable? Nope. There are only three variables in this data: the athlete, the month, and the best time, and yet we have five columns!
How would the same data look in tidy format? Table 2.2 shows you.
Table 2.2. This table contains the same data as table 2.1, but in tidy format.
Athlete | Month | Best |
---|---|---|
Joana | 0 | 12.50 |
Debi | 0 | 14.86 |
Sukhveer | 0 | 12.10 |
Kerol | 0 | 19.60 |
Joana | 1 | 12.10 |
Debi | 1 | 14.90 |
Sukhveer | 1 | 12.10 |
Kerol | 1 | 19.70 |
Joana | 2 | 11.98 |
Debi | 2 | 14.70 |
Sukhveer | 2 | 12.00 |
Kerol | 2 | 19.30 |
Joana | 3 | 11.99 |
Debi | 3 | 14.30 |
Sukhveer | 3 | 11.80 |
Kerol | 3 | 19.00 |
This time, we have the column Month that contains the month identifiers that were previously used as separate columns, and the Best column, which holds the best time for each athlete for each month. Does each row represent a single observation? Yes! Does each column represent a variable? Yes! So this data is in tidy format.
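To make the benefit of the tidy layout concrete, here is a minimal sketch in base R. The object names `runners` and `meanByMonth` are my own choices for illustration; the values are copied from table 2.2. Because the month is stored in a single column, a per-month summary needs no reshaping.

```r
# Table 2.2 typed in as a data frame (values copied from the table)
runners <- data.frame(
  Athlete = rep(c("Joana", "Debi", "Sukhveer", "Kerol"), times = 4),
  Month   = rep(0:3, each = 4),
  Best    = c(12.50, 14.86, 12.10, 19.60,
              12.10, 14.90, 12.10, 19.70,
              11.98, 14.70, 12.00, 19.30,
              11.99, 14.30, 11.80, 19.00)
)

# One row per observation, one column per variable:
# summarizing the best times by month is a single call
meanByMonth <- aggregate(Best ~ Month, data = runners, FUN = mean)
meanByMonth
```

With the untidy layout of table 2.1, the same summary would require working across four separate month columns.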
Ensuring that your data is in tidy format is an important early step in any machine learning pipeline, and so the tidyverse includes the package tidyr, which helps you achieve this. The other packages in the tidyverse work with tidyr and each other to help you do the following:
- Organize and display your data in a sensible way (tibble)
- Manipulate and subset your data (dplyr)
- Plot your data (ggplot2)
- Replace for loops with a functional programming approach (purrr)
All of the operations available to you in the tidyverse are achievable using base R code, but I strongly suggest that you incorporate the tidyverse in your work: it will help you keep your code simpler, more human-readable, and reproducible.
Core and optional packages of the tidyverse
I'm going to teach you to use the tibble, dplyr, ggplot2, tidyr, and purrr packages of the tidyverse. These are part of the “core” tidyverse packages, along with these:
- readr, for reading data into R from external files
- forcats, for working with factors
- stringr, for working with strings
In addition to these core packages that can be loaded together, the tidyverse includes a number of optional packages that need to be loaded individually.
To learn more about the other tools of the tidyverse, see R for Data Science by Garrett Grolemund and Hadley Wickham (O'Reilly Media, 2016).
The packages of the tidyverse can all be installed and loaded together (recommended):

    install.packages("tidyverse")
    library(tidyverse)

or installed and loaded individually as needed:

    install.packages(c("tibble", "dplyr", "ggplot2", "tidyr", "purrr"))
    library(tibble)
    library(dplyr)
    library(ggplot2)
    library(tidyr)
    library(purrr)

If you have been doing any form of data science or analysis in R, you will surely have come across data frames as a structure for storing rectangular data. Data frames work fine and, for a long time, were the only way to store rectangular data with columns of different types (in contrast to matrices, which can only handle data of the same type), but very little has been done to improve the aspects of data frames that data scientists dislike.
Note
Data is rectangular if each row has a number of elements equal to the number of columns, and each column has a number of elements equal to the number of rows. Data isn't always of this kind!
The tibble package introduces a new data structure, the tibble, to “keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating” (http://mng.bz/1wxj). Let's see what's meant by this.
Creating tibbles with the tibble() function works the same as creating data frames:

    myTib <- tibble(x = 1:4,
                    y = c("london", "beijing", "las vegas", "berlin"))

    myTib

    # A tibble: 4 x 2
          x y
      <int> <chr>
    1     1 london
    2     2 beijing
    3     3 las vegas
    4     4 berlin

If you're used to working with data frames, you will immediately notice two differences in how tibbles are printed:
- When you print a tibble, it tells you that it's a tibble and its dimensions.
- Tibbles tell you the type of each variable.
This second feature is particularly useful in avoiding errors due to incorrect variable types.
Tip
When printing a tibble, <int> denotes an integer variable, <chr> denotes a character variable, <dbl> denotes a floating-point number (decimal), and <lgl> denotes a logical variable.
Just as you can coerce objects into data frames using the as.data.frame() function, you can coerce objects into tibbles using the as_tibble() function:

    myDf <- data.frame(x = 1:4,
                       y = c("london", "beijing", "las vegas", "berlin"))

    dfToTib <- as_tibble(myDf)

    dfToTib

    # A tibble: 4 x 2
          x y
      <int> <fct>
    1     1 london
    2     2 beijing
    3     3 las vegas
    4     4 berlin

Note
In this book, we'll be working with data already built into R. Often, we need to read data into our R session from a .csv file. To load the data as a tibble, you use the read_csv() function. read_csv() comes from the readr package, which is loaded when you call library(tidyverse), and is the tidyverse version of read.csv().
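As a minimal sketch of reading a .csv file into a tibble, the following writes a small file to a temporary location and reads it back with read_csv(). The file contents and the names `csvPath` and `citiesTib` are hypothetical, chosen only for illustration.

```r
library(readr)

# Hypothetical example file: write a small CSV to a temporary location
csvPath <- tempfile(fileext = ".csv")
write.csv(
  data.frame(x = 1:4, y = c("london", "beijing", "las vegas", "berlin")),
  file = csvPath,
  row.names = FALSE
)

# read_csv() returns a tibble and leaves text columns as character vectors
citiesTib <- read_csv(csvPath)

class(citiesTib)
class(citiesTib$y)
```

read_csv() also prints a short message describing the column types it guessed, which is a useful early check that the data was read as intended.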
If you're used to working with data frames, you'll notice a few differences with tibbles. I've summarized the most notable differences between data frames and tibbles in this section.
A common frustration people have when creating data frames is that they convert string variables to factors by default (prior to R 4.0.0). This can be annoying because it may not be the best way to handle the variables. To prevent this conversion, you must supply the stringsAsFactors = FALSE argument when creating a data frame.
In contrast, tibbles don't convert string variables to factors by default. This behavior is desirable because automatic conversion of data to certain types can be a frustrating source of bugs:

    myDf <- data.frame(x = 1:4,
                       y = c("london", "beijing", "las vegas", "berlin"))

    myDfNotFactor <- data.frame(x = 1:4,
                                y = c("london", "beijing", "las vegas", "berlin"),
                                stringsAsFactors = FALSE)

    myTib <- tibble(x = 1:4,
                    y = c("london", "beijing", "las vegas", "berlin"))

    class(myDf$y)
    [1] "factor"

    class(myDfNotFactor$y)
    [1] "character"

    class(myTib$y)
    [1] "character"

If you want a variable to be a factor in a tibble, you simply wrap the c() function inside factor():

    myTib <- tibble(x = 1:4,
                    y = factor(c("london", "beijing", "las vegas", "berlin")))

    myTib

When you print a data frame, all the columns are printed to the console (by default), making it difficult to view early variables and cases. When you print a tibble, it only prints the first 10 rows and the number of columns that fit on your screen (by default), making it easier to get a quick understanding of the data. Note that the names of variables that aren't printed are listed at the bottom of the output. Run the following code, and contrast the output of the starwars tibble (which is included with dplyr and available when you call library(tidyverse)) with how it looks when converted into a data frame.
Listing 2.1. The starwars data as a tibble and a data frame

    data(starwars)

    starwars

    as.data.frame(starwars)

Tip
The data() function loads into your global environment a data set that is included with base R or an R package. Use data() with no arguments to list all the datasets available for your currently loaded packages.
When subsetting a data frame, the [ operator will return another data frame if you keep more than one column, or a vector if you keep only one. When subsetting a tibble, the [ operator will always return another tibble. If you wish to explicitly return a tibble column as a vector, use either the [[ or $ operator instead. This behavior is desirable because we should be explicit in whether we want a vector or rectangular data structure, to avoid bugs:

    myDf[, 1]
    [1] 1 2 3 4

    myTib[, 1]
    # A tibble: 4 x 1
          x
      <int>
    1     1
    2     2
    3     3
    4     4

    myTib[[1]]
    [1] 1 2 3 4

    myTib$x
    [1] 1 2 3 4

Note
An exception to this is if you subset a data frame using a single index with no comma (such as myDf[1]). In this case, the [ operator will return a single-column data frame, but this method doesn't allow us to combine row and column subsetting.
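If you need to stay with base R data frames, one way to keep a rectangular result while still combining row and column subsetting is the drop = FALSE argument of the [ operator. The sketch below is a base R aside rather than part of the tibble package; the object names `asVector` and `asDataFrame` are my own.

```r
myDf <- data.frame(x = 1:4,
                   y = c("london", "beijing", "las vegas", "berlin"))

# Default behavior: selecting one column with [row, column] collapses to a vector
asVector <- myDf[2:3, 1]
is.data.frame(asVector)      # FALSE

# drop = FALSE keeps the result as a data frame, even with row subsetting
asDataFrame <- myDf[2:3, 1, drop = FALSE]
is.data.frame(asDataFrame)   # TRUE
```

Tibbles make this the default, which is why you never need drop = FALSE with them.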
When building a tibble, variables are created sequentially so that later variables can reference those defined earlier. This means we can create variables on the fly that refer to other variables in the same function call:

    sequentialTib <- tibble(nItems = c(12, 45, 107),
                            cost = c(0.5, 1.2, 1.8),
                            totalWorth = nItems * cost)

    sequentialTib

    # A tibble: 3 x 3
      nItems  cost totalWorth
       <dbl> <dbl>      <dbl>
    1     12   0.5          6
    2     45   1.2         54
    3    107   1.8        193

Exercise 1
Load the mtcars data set using the data() function, convert it into a tibble, and explore it using the summary() function.
When working with data, we often need to perform operations on it such as the following:
- Selecting only the rows and/or columns of interest
- Creating new variables
- Arranging the data in ascending or descending order of certain variables
- Getting summary statistics
There may also be a natural grouping structure in the data that we would like to maintain when performing these operations. The dplyr package allows us to perform these operations in a very intuitive way. Let's work through an example.
Let's load the built-in CO2 data set in R. We have a tibble with 84 cases and 5 variables, documenting the uptake of carbon dioxide by different plants under various conditions. I'm going to use this data set to teach you some fundamental dplyr skills.
Listing 2.2. Exploring the CO2 dataset

    library(tibble)

    data(CO2)

    CO2tib <- as_tibble(CO2)

    CO2tib

    # A tibble: 84 x 5
       Plant Type   Treatment   conc uptake
     * <ord> <fct>  <fct>      <dbl>  <dbl>
     1 Qn1   Quebec nonchilled    95   16
     2 Qn1   Quebec nonchilled   175   30.4
     3 Qn1   Quebec nonchilled   250   34.8
     4 Qn1   Quebec nonchilled   350   37.2
     5 Qn1   Quebec nonchilled   500   35.3
     6 Qn1   Quebec nonchilled   675   39.2
     7 Qn1   Quebec nonchilled  1000   39.7
     8 Qn2   Quebec nonchilled    95   13.6
     9 Qn2   Quebec nonchilled   175   27.3
    10 Qn2   Quebec nonchilled   250   37.1
    # ... with 74 more rows

Let's say we want to select only columns 1, 2, 3, and 5. We can do this using the select() function. In the select() function call in the following listing, the first argument is the data; then we supply either the numbers or names of the columns we wish to select, separated by commas.
Listing 2.3. Selecting columns using select()

    library(dplyr)

    selectedData <- select(CO2tib, 1, 2, 3, 5)

    selectedData

    # A tibble: 84 x 4
       Plant Type   Treatment  uptake
     * <ord> <fct>  <fct>       <dbl>
     1 Qn1   Quebec nonchilled   16
     2 Qn1   Quebec nonchilled   30.4
     3 Qn1   Quebec nonchilled   34.8
     4 Qn1   Quebec nonchilled   37.2
     5 Qn1   Quebec nonchilled   35.3
     6 Qn1   Quebec nonchilled   39.2
     7 Qn1   Quebec nonchilled   39.7
     8 Qn2   Quebec nonchilled   13.6
     9 Qn2   Quebec nonchilled   27.3
    10 Qn2   Quebec nonchilled   37.1
    # ... with 74 more rows

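Since select() accepts column names as well as numbers, the same selection can be written in a way that survives a change in column order. The sketch below shows two name-based alternatives; the object names `byName`, `byDrop`, and `byNumber` are my own.

```r
library(tibble)
library(dplyr)

data(CO2)
CO2tib <- as_tibble(CO2)

# Select by column number (as in the listing), by name, or by dropping a column
byNumber <- select(CO2tib, 1, 2, 3, 5)
byName   <- select(CO2tib, Plant, Type, Treatment, uptake)
byDrop   <- select(CO2tib, -conc)

# All three keep the same four columns
names(byNumber)
names(byName)
names(byDrop)
```

Prefixing a column name with a minus sign drops that column and keeps everything else, which is often shorter when you want most of the columns.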
Exercise 2
Select all of the columns of your mtcars tibble except the qsec and vs variables.
Now let's suppose we wish to filter our data to include only cases whose uptake was greater than 16. We can do this using the filter() function. The first argument of filter() is, once again, the data, and the second argument is a logical expression that will be evaluated for each row. We can include multiple conditions here by separating them with commas.
Listing 2.4. Filtering rows using filter()

    filteredData <- filter(selectedData, uptake > 16)

    filteredData

    # A tibble: 66 x 4
       Plant Type   Treatment  uptake
       <ord> <fct>  <fct>       <dbl>
     1 Qn1   Quebec nonchilled   30.4
     2 Qn1   Quebec nonchilled   34.8
     3 Qn1   Quebec nonchilled   37.2
     4 Qn1   Quebec nonchilled   35.3
     5 Qn1   Quebec nonchilled   39.2
     6 Qn1   Quebec nonchilled   39.7
     7 Qn2   Quebec nonchilled   27.3
     8 Qn2   Quebec nonchilled   37.1
     9 Qn2   Quebec nonchilled   41.8
    10 Qn2   Quebec nonchilled   40.6
    # ... with 56 more rows

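As a small illustration of supplying more than one condition, the sketch below keeps only the Quebec plants with uptake above 16. Conditions separated by commas are combined with a logical AND, so the second call, which uses & explicitly, should return the same rows. The object names `quebecHigh` and `sameRows` are my own.

```r
library(tibble)
library(dplyr)

data(CO2)
CO2tib <- as_tibble(CO2)

# Comma-separated conditions must all be TRUE for a row to be kept
quebecHigh <- filter(CO2tib, uptake > 16, Type == "Quebec")

# Equivalent form with an explicit logical AND
sameRows <- filter(CO2tib, uptake > 16 & Type == "Quebec")

nrow(quebecHigh) == nrow(sameRows)
```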
Exercise 3
Filter your mtcars tibble to include only cases with a number of cylinders (cyl) not equal to 8.
Next, we would like to group by individual plants and summarize the data to get the mean and standard deviation of uptake within each group. We can achieve this using the group_by() and summarize() functions, respectively.
In the group_by() function, the first argument is (you guessed it) the data (see the pattern here?), followed by the grouping variable. We can group by more than one variable by separating them with commas. When we print groupedData, not much has changed except that we get an indication above the data saying that they are grouped, the variable by which they are grouped, and how many groups there are. This tells us that any further operations we apply will be performed on a group-by-group basis.
Listing 2.5. Grouping data with group_by()

    groupedData <- group_by(filteredData, Plant)

    groupedData

    # A tibble: 66 x 4
    # Groups:   Plant [11]
       Plant Type   Treatment  uptake
       <ord> <fct>  <fct>       <dbl>
     1 Qn1   Quebec nonchilled   30.4
     2 Qn1   Quebec nonchilled   34.8
     3 Qn1   Quebec nonchilled   37.2
     4 Qn1   Quebec nonchilled   35.3
     5 Qn1   Quebec nonchilled   39.2
     6 Qn1   Quebec nonchilled   39.7
     7 Qn2   Quebec nonchilled   27.3
     8 Qn2   Quebec nonchilled   37.1
     9 Qn2   Quebec nonchilled   41.8
    10 Qn2   Quebec nonchilled   40.6
    # ... with 56 more rows

In the summarize() function, the first argument is the data; in the second argument, we name the new variables we're creating, followed by an = sign, followed by a definition of that variable. We can create as many new variables as we like by separating them by commas. In listing 2.6, we create two summary variables: the mean of the uptake for each group (meanUp) and the standard deviation of the uptake for each group (sdUp). Now, when we print summarizedData, we can see that aside from our grouping variable, our original variables have been replaced with the summary variables we just created.
Listing 2.6. Creating summaries of variables using summarize()

    summarizedData <- summarize(groupedData, meanUp = mean(uptake),
                                sdUp = sd(uptake))

    summarizedData

    # A tibble: 11 x 3
       Plant meanUp   sdUp
       <ord>  <dbl>  <dbl>
     1 Qn1     36.1  3.42
     2 Qn2     38.8  6.07
     3 Qn3     37.6 10.3
     4 Qc1     32.6  5.03
     5 Qc3     35.5  7.52
     6 Qc2     36.6  5.14
     7 Mn3     26.2  3.49
     8 Mn2     29.9  3.92
     9 Mn1     29.0  5.70
    10 Mc3     18.4  0.826
    11 Mc1     20.1  1.83

Finally, we will mutate a new variable from the existing ones to calculate the coefficient of variation for each group, and then we'll arrange the rows in the data so that the row with the smallest value of the new variable is at the top, and the row with the largest value is at the bottom. We can do this with the mutate() and arrange() functions.
For the mutate() function, the first argument is the data. The second argument is the name of the new variable to be created, followed by an = sign, followed by its definition. We can create as many new variables as we like by separating them with commas.
Listing 2.7. Creating new variables using mutate()

    mutatedData <- mutate(summarizedData, CV = (sdUp / meanUp) * 100)

    mutatedData

    # A tibble: 11 x 4
       Plant meanUp   sdUp    CV
       <ord>  <dbl>  <dbl> <dbl>
     1 Qn1     36.1  3.42   9.48
     2 Qn2     38.8  6.07  15.7
     3 Qn3     37.6 10.3   27.5
     4 Qc1     32.6  5.03  15.4
     5 Qc3     35.5  7.52  21.2
     6 Qc2     36.6  5.14  14.1
     7 Mn3     26.2  3.49  13.3
     8 Mn2     29.9  3.92  13.1
     9 Mn1     29.0  5.70  19.6
    10 Mc3     18.4  0.826  4.48
    11 Mc1     20.1  1.83   9.11

Tip
Argument evaluation in dplyr functions is sequential, meaning we could have defined the CV variable in the summarize() function by referencing the meanUp and sdUp variables, even though they hadn't been created yet!
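A minimal sketch of this tip follows: the coefficient of variation is defined inside the same summarize() call that creates meanUp and sdUp, so no separate mutate() step is needed. The object names `filtered` and `cvData` are my own, and the steps are nested rather than piped so the example stands on its own.

```r
library(tibble)
library(dplyr)

data(CO2)
CO2tib <- as_tibble(CO2)

filtered <- filter(CO2tib, uptake > 16)

# CV refers to meanUp and sdUp, which are defined earlier in the same call
cvData <- summarize(group_by(filtered, Plant),
                    meanUp = mean(uptake),
                    sdUp = sd(uptake),
                    CV = (sdUp / meanUp) * 100)

cvData
```

The order of the arguments matters here: CV must come after meanUp and sdUp, because each argument can only refer to variables defined before it.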
The arrange() function takes the data as the first argument, followed by the variable(s) we wish to arrange the cases by. We can arrange by multiple columns by separating them with commas: doing so will arrange the cases in the order of the first variable, and any ties will be ordered based on their value of the second variable, and so on with subsequent ties.
Listing 2.8. Arranging tibbles by variables using arrange()

    arrangedData <- arrange(mutatedData, CV)

    arrangedData

    # A tibble: 11 x 4
       Plant meanUp   sdUp    CV
       <ord>  <dbl>  <dbl> <dbl>
     1 Mc3     18.4  0.826  4.48
     2 Mc1     20.1  1.83   9.11
     3 Qn1     36.1  3.42   9.48
     4 Mn2     29.9  3.92  13.1
     5 Mn3     26.2  3.49  13.3
     6 Qc2     36.6  5.14  14.1
     7 Qc1     32.6  5.03  15.4
     8 Qn2     38.8  6.07  15.7
     9 Mn1     29.0  5.70  19.6
    10 Qc3     35.5  7.52  21.2
    11 Qn3     37.6 10.3   27.5

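To illustrate arranging by more than one variable, the sketch below orders the CO2 data by Type and then, within each type, by uptake from largest to smallest using dplyr's desc() helper. This goes slightly beyond the listing, which arranges by a single variable in ascending order; the object names `byTypeThenUptake` and `quebecRows` are my own.

```r
library(tibble)
library(dplyr)

data(CO2)
CO2tib <- as_tibble(CO2)

# First order by Type; within each Type, order by uptake in descending order
byTypeThenUptake <- arrange(CO2tib, Type, desc(uptake))

# The Quebec rows come first (Quebec is the first factor level),
# and their uptake values never increase as we move down the rows
quebecRows <- filter(byTypeThenUptake, Type == "Quebec")
all(diff(quebecRows$uptake) <= 0)
```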
Tip
Everything we did in section 2.4.1 could be achieved using base R, but I hope you can see that the dplyr functions (or verbs, as they're often called, because they are human-readable and clearly imply what they do) help make the code simpler and more human-readable. But the power of dplyr really comes from the ability to chain these functions together into intuitive, sequential processes.
At each stage of our CO2 data manipulation, we saved the intermediate data and applied the next function to it. This is tedious, creates lots of unnecessary data objects in our R environment, and is not as human-readable. Instead, we can use the pipe operator, %>%, which becomes available when we load dplyr. The pipe passes the output of the function on its left as the first argument to the function on its right. Let's look at a basic example:

    library(dplyr)

    c(1, 4, 7, 3, 5) %>% mean()
    [1] 4

The %>% operator takes the output of the c() function on the left (a vector of length 5), and “pipes” it into the first argument of the mean() function. We can use the %>% operator to chain multiple functions together to make the code more concise and human-readable.
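The sketch below shows that a piped expression and its nested equivalent give the same result, and that any further arguments are simply written inside the call on the right-hand side. The object names `x`, `piped`, and `nested` are my own.

```r
library(dplyr)

x <- c(1, 4, 7, 3, 5)

# Piped: read left to right ("take x, and then mean, and then round")
piped <- x %>% mean() %>% round(digits = 1)

# Nested: the same computation, read from the inside out
nested <- round(mean(x), digits = 1)

identical(piped, nested)
```

The piped value fills the first argument of round(), and digits = 1 is passed as usual.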
Remember how I made a point of saying that the first argument of each dplyr function is the data? Well, the reason this is so important and useful is that it allows us to pipe the data from the previous operation into the next one. The entire process of data manipulation we went through in section 2.4.1 becomes the following listing.
Listing 2.9. Chaining dplyr operations together with %>%

    arrangedData <- CO2tib %>%
      select(c(1:3, 5)) %>%
      filter(uptake > 16) %>%
      group_by(Plant) %>%
      summarize(meanUp = mean(uptake), sdUp = sd(uptake)) %>%
      mutate(CV = (sdUp / meanUp) * 100) %>%
      arrange(CV)

    arrangedData

    # A tibble: 11 x 4
       Plant meanUp   sdUp    CV
       <ord>  <dbl>  <dbl> <dbl>
     1 Mc3     18.4  0.826  4.48
     2 Mc1     20.1  1.83   9.11
     3 Qn1     36.1  3.42   9.48
     4 Mn2     29.9  3.92  13.1
     5 Mn3     26.2  3.49  13.3
     6 Qc2     36.6  5.14  14.1
     7 Qc1     32.6  5.03  15.4
     8 Qn2     38.8  6.07  15.7
     9 Mn1     29.0  5.70  19.6
    10 Qc3     35.5  7.52  21.2
    11 Qn3     37.6 10.3   27.5

Read the code from top to bottom, and every time you come to a %>% operator, say “and then.” You would read it as “Take the CO2 data, and then select these columns, and then filter these rows, and then group by this variable, and then summarize with these variables, and then mutate this new variable, and then arrange in order of this variable and save the output as arrangedData.” Can you see that this is how you might explain your data-manipulation process to a colleague in plain English? This is the power of dplyr: being able to perform complex data manipulations in a logical, human-readable way.
Tip
It is conventional to start a new line after a %>% operator to help make the code easier to read.
Exercise 4
Group the mtcars tibble by the gear variable, summarize the medians of the mpg and disp variables, and mutate a new variable that is the mpg median divided by the disp median, all chained together with the %>% operator.
In R, there are three main plotting systems:
- Base graphics
- Lattice
- ggplot2
Arguably, ggplot2 is the most popular system among data scientists; and as it's part of the tidyverse, we will use this system to plot our data throughout this book. The “gg” in ggplot2 stands for grammar of graphics, a school of thought that says any data graphic can be created by combining data with layers of plot components such as axes, tickmarks, gridlines, dots, bars, and lines. By layering plot components like this, you can use ggplot2 to create communicative, attractive plots in a very intuitive way.
Let's load the iris data set that comes with R and create a scatter plot of two of its variables. This data was collected and published by Edgar Anderson in 1935 and contains length and width measurements of the petals and sepals of three species of iris plant.
Figure 2.1. A scatter plot created with ggplot2. The Sepal.Length variable is mapped to the x aesthetic, and the Sepal.Width variable is mapped to the y aesthetic. A black-and-white theme was applied by adding the theme_bw() layer.

The code to create the plot in figure 2.1 is shown in listing 2.10. The function ggplot() takes the data you supply as the first argument and the function aes() as the second (more about this in a moment). This creates a plotting environment, axis, and axis labels based on the data.
The aes() function is short for aesthetic mappings, which, if you're used to base R plotting, may be new to you. An aesthetic is a feature of a plot that can be controlled by a variable in the data. Examples of aesthetics include the x-axis, y-axis, color, shape, size, and even transparency of the data points drawn on the plot. In the function call in listing 2.10, we have asked ggplot() to map the Sepal.Length and Sepal.Width variables to the x- and y-axes, respectively.
Listing 2.10. Plotting data with the ggplot() function

    library(ggplot2)

    data(iris)

    myPlot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point() +
      theme_bw()

    myPlot

Tip
Notice that we don't need to wrap the variable names in quotes; ggplot() is clever!
We finish the line with the + symbol, which we use to add additional layers to our plot (we can add as many layers as it takes to create our desired plot). Convention states that when we add additional layers to our plots, we finish the current layer with + and place the next layer on a new line. This helps maintain readability.
Note
When adding layers to the initial ggplot() function call, each line needs to finish with +; you cannot put the + on a new line.
The next layer is a function called geom_point(). Geom stands for geometric object, which is a graphical element used to represent data points, such as bars, lines, box and whiskers, and so on; the functions to produce these layers are all named geom_[graphical element]. For example, let's add two new layers to our plot: geom_density_2d(), which adds density contours; and geom_smooth(), which fits a smoothed line with confidence bands to the data (see figure 2.2).
Figure 2.2. The same scatter plot as in figure 2.1, with 2D density contours and a smoothed line added as layers using the geom_density_2d() and geom_smooth() functions, respectively.

The plot is reasonably complex, and to achieve the same in base R would take many lines of code. Here's how easy this is to achieve with ggplot2!
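The layered plot described for figure 2.2 can be sketched as follows. This is a reconstruction based only on the layer names given in the text (geom_density_2d() and geom_smooth() added to the scatter plot), so treat it as an illustration rather than the exact listing; the object name `layeredPlot` is my own.

```r
library(ggplot2)

data(iris)

# Start from the scatter plot, then add one layer per line
layeredPlot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +       # the data points
  geom_density_2d() +  # 2D density contours
  geom_smooth() +      # smoothed line with confidence band
  theme_bw()

layeredPlot
```

When printed, geom_smooth() reports which smoothing method it chose by default; this message is informational and does not indicate a problem.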
Note
Finally, it's often important to highlight a grouping structure within the data, and we can do this by adding a color or shape aesthetic mapping, as shown in figure 2.3. The code to produce these plots is shown in listing 2.12. The only difference between them is that species is given as the argument to the shape or col (color) aesthetic.
Figure 2.3. The same scatter plot as in figure 2.1, with the Species variable mapped to the shape and col aesthetics

Listing 2.12. Mapping species to the shape and color aesthetics

    ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) +
      geom_point() +
      theme_bw()

    ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
      geom_point() +
      theme_bw()

Note
Notice how ggplot() automatically produces a legend when you add aesthetic mappings other than x and y. With base graphics, you would have to produce these manually!
One final thing I want to teach you about ggplot() is its extremely powerful faceting functionality. Sometimes we may wish to create subplots of our data where each subplot, or facet, displays data belonging to some group present in the data.
For example, figure 2.4 shows the same iris data, but this time faceted by the Species variable. The code to create this plot is shown in listing 2.13: I've simply added a facet_wrap() layer to the ggplot call, and specified I want it to facet by (~Species).
Figure 2.4. The same data is shown, but with different iris species plotted on separate subplots or facets.

Listing 2.13. Grouping subplots with the facet_wrap() function

    ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
      facet_wrap(~ Species) +
      geom_point() +
      theme_bw()

While there is much more you can do with ggplot2 than is presented here (including customizing the appearance of virtually everything), I just want to give you an understanding of how to create the basic plots needed to replicate those you'll find throughout the book. If you want to take your data-visualization skills to the next level, I strongly recommend ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (Springer International Publishing, 2016).
Tip
The order of plot elements on a ggplot is important! Plot elements are layered on sequentially, so elements added later in a ggplot() call will be on top of all the others. Reorder the geom_density_2d() and geom_point() functions used to create figure 2.2, and look closely to see what happens (the plot might look the same, but it's not!).
Exercise 5
In section 2.1, we looked at an example of data that was not tidy and then at the same data after restructuring it in tidy format. Quite often, as data scientists, we don't have much control over the format data is in when it comes to us; we commonly have to restructure untidy data into a tidy format so that we can pass it into our machine learning pipelines. Let's make an untidy tibble and convert it into its tidy format.
Listing 2.14 shows a tibble of fictitious patient data, where patients' body mass index (BMI) was measured at month 0, month 3, and month 6 after the start of some imaginary intervention. Is this data tidy? Well, no. There are only three variables in the data:
- Patient ID
- The month the measurement was taken
- The BMI measurement
But we have four columns! Also, each row doesn't contain the data for a single observation: it contains all the observations made on that patient.
Listing 2.14. Untidy tibble

    library(tibble)
    library(tidyr)

    patientData <- tibble(Patient = c("A", "B", "C"),
                          Month0 = c(21, 17, 29),
                          Month3 = c(20, 21, 27),
                          Month6 = c(21, 22, 23))

    patientData

    # A tibble: 3 x 4
      Patient Month0 Month3 Month6
      <chr>    <dbl>  <dbl>  <dbl>
    1 A           21     20     21
    2 B           17     21     22
    3 C           29     27     23

To convert this untidy tibble into its tidy counterpart, we can use tidyr's gather() function. The gather() function takes the data as its first argument. The key argument defines the name of the new variable that will represent the columns we are “gathering.” In this case, the columns we are gathering are named Month0, Month3, and Month6, so we call the new column that will hold these keys Month. The value argument defines the name of the new variable that will represent the data from the columns we are gathering. In this case, the values were BMI measurements, so we call the new column that will represent these values BMI. The final argument is a vector defining which variables to gather and convert into the key-value pairs. By using -Patient, we are telling gather() to use all the variables except the identifying variable, Patient.
Listing 2.15. Tidying data with the gather() function

    tidyPatientData <- gather(patientData, key = Month,
                              value = BMI, -Patient)

    tidyPatientData

    # A tibble: 9 x 3
      Patient Month    BMI
      <chr>   <chr>  <dbl>
    1 A       Month0    21
    2 B       Month0    17
    3 C       Month0    29
    4 A       Month3    20
    5 B       Month3    21
    6 C       Month3    27
    7 A       Month6    21
    8 B       Month6    22
    9 C       Month6    23

We could have achieved the same result by typing the following, instead (note that the tibbles returned by the two listings are identical).
Listing 2.16. Different ways to select columns for gathering

    gather(patientData, key = Month, value = BMI, Month0:Month6)

    # A tibble: 9 x 3
      Patient Month    BMI
      <chr>   <chr>  <dbl>
    1 A       Month0    21
    2 B       Month0    17
    3 C       Month0    29
    4 A       Month3    20
    5 B       Month3    21
    6 C       Month3    27
    7 A       Month6    21
    8 B       Month6    22
    9 C       Month6    23

    gather(patientData, key = Month, value = BMI, c(Month0, Month3, Month6))

    # A tibble: 9 x 3
      Patient Month    BMI
      <chr>   <chr>  <dbl>
    1 A       Month0    21
    2 B       Month0    17
    3 C       Month0    29
    4 A       Month3    20
    5 B       Month3    21
    6 C       Month3    27
    7 A       Month6    21
    8 B       Month6    22
    9 C       Month6    23

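In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(). As a side note that is not part of the original listings, the same tidying step could be sketched with pivot_longer() as follows; the object name `longPatientData` is my own.

```r
library(tibble)
library(tidyr)

patientData <- tibble(Patient = c("A", "B", "C"),
                      Month0 = c(21, 17, 29),
                      Month3 = c(20, 21, 27),
                      Month6 = c(21, 22, 23))

# cols chooses the columns to gather; names_to and values_to play the
# roles of gather()'s key and value arguments (given here as strings)
longPatientData <- pivot_longer(patientData,
                                cols = -Patient,
                                names_to = "Month",
                                values_to = "BMI")

longPatientData
```

The rows contain the same patient, month, and BMI combinations as the gather() result, although pivot_longer() returns them grouped by patient rather than by month.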
Converting data to wide format
The data structure in the patientData tibble is called wide format, where observations for a single case are placed in the same row, across multiple columns. Mostly we want to work with tidy data because it makes our lives simpler: we can see immediately which variables we have, grouping structures are made clear, and most functions are designed to work easily with tidy data. There are, however, some rare occasions where we need to convert our tidy data into wide format, perhaps because a function we need expects the data in this format. We can convert tidy data into its wide format using the spread() function.
Its use is the opposite of gather(): we supply the key and value arguments as the names of the key and value columns we created using the gather() function, and the function converts these into wide format for us.
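As a minimal sketch of that conversion, the code below rebuilds the tidy tibble by hand from the values printed in listing 2.15 (so it runs on its own) and spreads it back into wide format. The name widePatientData is illustrative only.

```r
library(tibble)
library(tidyr)

# Tidy version of the patient data, typed in from the printed output
tidyPatientData <- tibble(
  Patient = rep(c("A", "B", "C"), times = 3),
  Month   = rep(c("Month0", "Month3", "Month6"), each = 3),
  BMI     = c(21, 17, 29, 20, 21, 27, 21, 22, 23)
)

# key:   the column whose values become the new column names
# value: the column whose values fill those new columns
widePatientData <- spread(tidyPatientData, key = Month, value = BMI)

widePatientData
```

The result has one row per patient and one column per month. In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(), which are worth knowing about if you are starting new code.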
Exercise 6
Gather the vs, am, gear, and carb variables from your mtcars tibble into a single key-value pair.
The last tidyverse package I’m going to show you is purrr (with three r’s). R gives us the tools to use it as a functional programming language. This means it gives us the tools to treat all computations like mathematical functions that return their values, without altering anything in the workspace.
Note
When a function does something other than return a value (such as draw a plot or alter an environment), it’s called a side effect of the function. A function that does not produce any side effects is said to be a pure function.
A simple example of functions that do and do not produce side effects is shown in listing 2.17. The pure() function returns the value of a + 1 but does not alter anything in the global environment. The side_effect() function uses the super-assignment operator <<- to reassign the object a in the global environment. Each time you run the pure() function, it gives the same output; but running the side_effect() function gives a new value each time (and will impact the output of subsequent pure() function calls as well).
Listing 2.17. Functions with and without side effects
a <- 20

pure <- function() {
  a <- a + 1
  a
}

side_effect <- function() {
  a <<- a + 1
  a
}

c(pure(), pure())

[1] 21 21

c(side_effect(), side_effect())

[1] 21 22

Calling functions without side effects is usually desirable because it’s easier to predict what the function will do. If a function has no side effects, it can be substituted with a different implementation without breaking anything in your code.
An important consequence is that for loops, which when used on their own can create unwanted side effects (such as modifying existing variables), can be wrapped inside other functions. Functions that wrap for loops inside them allow us to iterate over each element of a vector/list (including columns and rows of data frames or tibbles), apply a function to that element, and return the result of the whole iterative process.
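To make the idea of wrapping a for loop concrete, here is a minimal base R sketch. The helper applyToEach() is hypothetical (it is not part of purrr or base R) and only illustrates the principle: the loop counter and the result container live inside the function, so nothing in the global environment is changed.

```r
# Hypothetical helper, for illustration only: applies f to each element of x
applyToEach <- function(x, f) {
  out <- vector("list", length(x))  # preallocate the result list
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]])           # i and out exist only inside the function
  }
  names(out) <- names(x)            # keep the names of the input elements
  out
}

elementSizes <- applyToEach(list(a = 1:3, b = 1:5), length)
elementSizes
```

This simplified version assumes f never returns NULL; the purrr functions introduced next handle such details for us.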
Note
If you’re familiar with the apply() family of base R functions, functions from the purrr package help us achieve the same thing, but using a consistent syntax and some convenient features.
The purrr package provides a set of functions that allow us to apply a function to each element of a list. Which purrr function to use depends on the number of inputs and what we want our output to be; in this section, I’ll demonstrate the importance of the most commonly used functions from this package.
Imagine that we have a list of three numeric vectors:
listOfNumerics <- list(a = rnorm(5),
                       b = rnorm(9),
                       c = rnorm(10))

listOfNumerics

$a
[1] -1.4617 -0.3948  2.1335 -0.2203  0.3429

$b
[1]  0.2438 -1.3541  0.6164 -0.5524  0.4519  0.3592 -1.3415 -1.7594  1.2160

$c
 [1] -1.1325  0.2792  0.5152 -1.1657 -0.7668  0.1778  1.4004  0.6492 -1.6320
[10] -1.0986

Now, let’s say we want to apply a function to each of the three list elements separately, such as the length() function to return the length of each element. We could use a for loop to do this, iterating over each list element and saving the length as an element of a new list that we predefine to save time:

elementLengths <- vector("list", length = 3)

for(i in seq_along(listOfNumerics)) {
  elementLengths[[i]] <- length(listOfNumerics[[i]])
}

elementLengths

[[1]]
[1] 5

[[2]]
[1] 9

[[3]]
[1] 10

This code is difficult to read, requires us to predefine an empty vector to prevent the loop from being slow, and has a side effect: if we run the loop again, it will overwrite the elementLengths list.
Instead, we can replace the for loop with the map() function. The first argument of all the functions in the map family is the data we’re iterating over. The second argument is the function we’re applying to each list element. Take a look at figure 2.5, which illustrates how the map() function applies a function to every element of a vector/list and returns a list containing the outputs.
In this example, the map() function applies the length() function to each element of the listOfNumerics list and returns these values as a list. Notice that the map() function also uses the names of the input elements as the names of the output elements (a, b, and c):

map(listOfNumerics, length)

$a
[1] 5

$b
[1] 9

$c
[1] 10
Note
If you’re familiar with the apply family of functions, map() is the purrr equivalent of lapply().
Figure 2.5. The map() function takes a vector or list as input, applies a function to each element individually, and returns a list of the returned values.

I hope you can immediately see how much simpler this is to code, and how much easier it is to read, than the for loop!
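For readers coming from base R, the following sketch compares map() with lapply() on a small list made up for the purpose (the values in x are illustrative, not from the chapter’s data).

```r
library(purrr)

# Small illustrative list with elements of different types and lengths
x <- list(a = c(1.5, 2.5), b = 1:4, c = letters)

withMap    <- map(x, length)     # purrr
withLapply <- lapply(x, length)  # base R

# Both return a named list of element lengths
identical(withMap, withLapply)
```

For a simple call like this one, the comparison should return TRUE; the differences between the two families show up mainly in purrr’s typed variants and shorthand syntax, covered next.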
So the map() function always returns a list. But what if, instead of returning a list, we wanted to return an atomic vector? The purrr package provides a number of functions to do just that:
- map_dbl() returns a vector of doubles (decimals).
- map_chr() returns a character vector.
- map_int() returns a vector of integers.
- map_lgl() returns a logical vector.
Each of these functions returns an atomic vector of the type specified by its suffix. In this way, we are forced to think about and predetermine what type of data our output should be. For example, as shown in listing 2.18, we can return the lengths of each of our listOfNumerics list elements just as before, using the map_int() function. Just like map(), the map_int() function applies the length() function to each element of our list, but it returns the output as a vector of integers. We can do the same thing using the map_chr() function, which coerces the output into a character vector, but the map_lgl() function throws an error because it can’t coerce the output into a logical vector.
Note
Forcing us to explicitly state the type of output we want to return prevents bugs from unexpected types of output.
Listing 2.18. Returning atomic vectors
map_int(listOfNumerics, length)

 a  b  c
 5  9 10

map_chr(listOfNumerics, length)

   a    b    c
 "5"  "9" "10"

map_lgl(listOfNumerics, length)

Error: Can't coerce element 1 from a integer to a logical
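Listing 2.18 does not show map_dbl() or a successful map_lgl() call, so here is a small sketch of both. The measurements list is made-up data chosen so that the results are easy to verify by hand.

```r
library(purrr)

# Illustrative data, not from the chapter
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = c(0.5, 1.5))

# mean() returns a double, so map_dbl() is the matching variant
elementMeans <- map_dbl(measurements, mean)

# A comparison returns TRUE/FALSE, so map_lgl() is the matching variant
hasMoreThanTwo <- map_lgl(measurements, function(v) length(v) > 2)

elementMeans
hasMoreThanTwo
```

Both results are named atomic vectors, with names taken from the input list.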
Exercise 7
Use a function from the purrr package to return a logical vector indicating whether the sum of the values in each column of the mtcars data set is greater than 1,000.
Finally, we can use the map_df() function to return a tibble instead of a list.
Listing 2.19. Returning a tibble with map_df()
map_df(listOfNumerics, length)

# A tibble: 1 x 3
      a     b     c
  <int> <int> <int>
1     5     9    10

Sometimes we want to apply a function to each element of a list that we haven’t defined yet. Functions that we define on the fly are called anonymous functions and can be useful when the function we’re applying isn’t going to be used often enough to warrant assigning it to an object. Using base R, we define an anonymous function by simply calling the function() function.
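map_df() can also stack one small tibble per element into rows, which is often a convenient shape for summaries. The sketch below assumes the dplyr package is installed (map_df() relies on it for row binding) and uses made-up data; the column names n and mean and the .id label "element" are illustrative choices.

```r
library(purrr)
library(tibble)

# Illustrative data, not from the chapter
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = c(0.5, 1.5))

# Each element yields a one-row tibble; .id stores the element name
summaryTib <- map_df(
  measurements,
  function(v) tibble(n = length(v), mean = mean(v)),
  .id = "element"
)

summaryTib
```

In purrr 1.0.0 and later, map_df() still works but is marked as superseded in favor of combining map() with list_rbind(), so treat this as a sketch of the idea rather than the only way to do it.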
Listing 2.20. Defining an anonymous function with function()
map(listOfNumerics, function(.) . + 2)

$a
[1] 0.5383 1.6052 4.1335 1.7797 2.3429

$b
[1] 2.2438 0.6459 2.6164 1.4476 2.4519 2.3592 0.6585 0.2406 3.2160

$c
 [1] 0.8675 2.2792 2.5152 0.8343 1.2332 2.1778 3.4004 2.6492 0.3680 0.9014
Note
Notice the . in the anonymous function. This represents the element that map() is currently iterating over.
The expression after function(.) is the body of the function. There is nothing wrong with this syntax (it works perfectly fine), but purrr provides a shorthand for function(.): the ~ (tilde) symbol. Therefore, we could simplify the map() call to
map(listOfNumerics, ~. + 2)
by substituting ~ for function(.).
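A quick way to convince yourself that the shorthand is equivalent is to run both forms on the same input and compare. The list x below is made-up data; .x is another placeholder that purrr’s formula shorthand accepts for the current element, alongside the plain dot.

```r
library(purrr)

# Illustrative data, not from the chapter
x <- list(a = c(1, 2), b = c(10, 20, 30))

withFunction <- map(x, function(.) . + 2)  # base R anonymous function
withTilde    <- map(x, ~ . + 2)            # purrr shorthand
withDotX     <- map(x, ~ .x + 2)           # same shorthand, using .x

identical(withFunction, withTilde)
identical(withFunction, withDotX)
```

Both comparisons should return TRUE. R 4.1.0 and later also offers a native lambda syntax for anonymous functions, which newer purrr documentation tends to prefer.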
Sometimes we want to iterate over a function for its side effects. Probably the most common example is when we want to produce a series of plots. In this situation, we can use the walk() function to apply a function to each element of a list to produce the function’s side effects. The walk() function also returns the original input data we pass it, so it’s useful for plotting an intermediate step in a series of piped operations. Here’s an example of walk() being used to create a separate histogram for each element of our list:

par(mfrow = c(1, 3))

walk(listOfNumerics, hist)
Note
The par(mfrow = c(1, 3)) function call simply splits the plotting device into one row and three columns for base plots.
The resulting plot is shown in figure 2.6.
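Because walk() hands its input back unchanged, it can sit in the middle of a pipeline. The sketch below uses cat() instead of hist() as the side effect so that it runs anywhere without a plotting device; the data and the name sizes are illustrative.

```r
library(purrr)

# Illustrative data, not from the chapter
x <- list(a = 1:3, b = 1:5)

sizes <- x %>%
  walk(~ cat("Element with", length(.), "values\n")) %>%  # side effect only
  map_int(length)                                          # receives x unchanged

sizes
```

The messages are printed as a side effect, and map_int() still sees the original list, so sizes holds the two element lengths.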
But what if we want to use the name of each list element as the title for its histogram? We can do this using the iwalk() function, which makes the name or index of each element available to us. In the function we supply to iwalk(), we can use .x to reference the list element we’re iterating over and .y to reference its name/index:
iwalk(listOfNumerics, ~hist(.x, main = .y))
Note
Each of the map() functions has an i version that lets us reference each element’s name/index.
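As a small sketch of one of these i versions, imap_chr() below builds a text label for each element from its name (.y) and its contents (.x). The list x and the label wording are made up for illustration.

```r
library(purrr)

# Illustrative data, not from the chapter
x <- list(a = 1:3, b = 1:5)

# .x is the element, .y is its name (or its position if the list is unnamed)
elementLabels <- imap_chr(x, ~ paste0(.y, " has ", length(.x), " values"))

elementLabels
```

The same .x/.y convention applies to iwalk(), which is why .y can be used as a plot title.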
The resulting plot is shown in figure 2.7. Notice that now each histogram’s title shows the name of the list element it’s plotting.
Sometimes the data we wish to iterate over isn’t contained in a single list. Imagine that we want to multiply each element in our list by a different value. We can store these values in a separate list and use the map2() function to iterate over both lists simultaneously, multiplying the element in the first list by the element in the second. This time, instead of referencing our data with ., we specifically reference the first and second lists using .x and .y, respectively:

multipliers <- list(0.5, 10, 3)

map2(.x = listOfNumerics, .y = multipliers, ~.x * .y)
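Because listOfNumerics is random, the effect of map2() is easier to see on fixed values. The following sketch pairs a small made-up list with the same multipliers, so each result can be checked by hand.

```r
library(purrr)

# Illustrative fixed data in place of the random listOfNumerics
values      <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)
multipliers <- list(0.5, 10, 3)

# First element is multiplied by 0.5, second by 10, third by 3
scaled <- map2(values, multipliers, ~ .x * .y)

scaled
```

The output keeps the names of the first list, and both inputs must have the same length (or one of them length 1) for the pairing to work.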
Now, imagine that instead of iterating over just two lists, we want to iterate over three or more. The pmap() function allows us to iterate over multiple lists simultaneously. I use pmap() when I want to test multiple combinations of arguments for a function. The rnorm() function draws a random sample from the normal distribution and has three arguments: n (the number of samples), mean (the center of the distribution), and sd (the standard deviation). We can create a list of values for each and then use pmap() to iterate over each list to run the function on each combination.
We start by using the expand.grid() function to create a data frame containing every combination of the input vectors. Because data frames are really just lists of columns, supplying one to pmap() will iterate a function over each column in the data frame. Essentially, the function we ask pmap() to iterate over will be run using the arguments contained in each row of the data frame. Therefore, pmap() will return eight different random samples, one corresponding to each combination of arguments in the data frame.
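The row-by-row behavior described above can be checked without plotting anything. In this sketch, each row of the data frame supplies one set of named arguments to rnorm(), and we then confirm that the size of every sample follows the n column; the name sampleSizes is illustrative.

```r
library(purrr)

# Every combination of the three argument vectors (2 x 2 x 2 = 8 rows)
arguments <- expand.grid(n = c(100, 200), mean = c(1, 10), sd = c(1, 10))

# Column names are matched to rnorm()'s arguments: n, mean, and sd
samples <- pmap(arguments, rnorm)

length(samples)                         # one sample per row of arguments
sampleSizes <- map_int(samples, length)
all(sampleSizes == arguments$n)         # sizes follow the n column
```

Matching is by name, so the columns of the data frame need to be named after the arguments of the function being called.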
Because the first argument of all map family functions is the data we wish to iterate over, we can chain them together using the %>% operator. The following code pipes the random samples returned by pmap() into the iwalk() function to draw a separate histogram for each sample, labeled with its index.
Listing 2.21. Using pmap() to iterate over multiple lists
arguments <- expand.grid(n = c(100, 200),
                         mean = c(1, 10),
                         sd = c(1, 10))

arguments

    n mean sd
1 100    1  1
2 200    1  1
3 100   10  1
4 200   10  1
5 100    1 10
6 200    1 10
7 100   10 10
8 200   10 10

par(mfrow = c(2, 4))

pmap(arguments, rnorm) %>%
  iwalk(~hist(.x, main = paste("Element", .y)))
The resulting plot is shown in figure 2.8.
Figure 2.8. The pmap() function was used to iterate the rnorm() function over three vectors of arguments. The output from pmap() was piped into iwalk() to iterate the hist() function over each random sample.

Don’t worry if you haven’t memorized all of the tidyverse functions I just covered; we’ll be using these tools throughout the book in our machine learning pipelines. There’s also much more we can do with tidyverse tools than I’ve covered here, but this will certainly be enough for you to solve the most common data-manipulation problems you’ll encounter. Now that you’re armed with the knowledge of how to use this book, in the next chapter we’ll dive into the theory of machine learning.
- The tidyverse is a collection of R packages that simplifies the organization, manipulation, and plotting of data.
- Tidy data is rectangular data where each row is a single observation and each column is a variable. It’s often important to ensure that data is in tidy format before passing it into machine learning functions.
- Tibbles are a modern take on data frames that have better rules for printing rectangular data, never change variable types, and always return another tibble when subsetted using [.
- The dplyr package provides human-readable, verb-like functions for data-manipulation processes, the most important of which are select(), filter(), group_by(), summarize(), and arrange().
- The most powerful aspect of dplyr is the ability to pipe functions together using the %>% operator, which passes the output of the function on its left as the first argument of the function on its right.
- The ggplot2 package is a modern and popular plotting system for R that lets you create effective plots in a simple, layered way.
- The tidyr package provides the important function gather(), which lets you easily convert untidy data into tidy format. The opposite of this function is spread(), which converts tidy data into wide format.
- The purrr package provides a simple, consistent way to iteratively apply functions over each element in a list.
- Load mtcars, convert it to a tibble, and explore it with summary():
library(tidyverse)

data(mtcars)

mtcarsTib <- as_tibble(mtcars)

summary(mtcarsTib)
- Select all columns except qsec and vs:
select(mtcarsTib, c(-qsec, -vs))
# or
select(mtcarsTib, c(-7, -8))
- Filter for rows with cylinder numbers not equal to 8:
filter(mtcarsTib, cyl != 8)
- Group by gear, summarize the medians of mpg and disp, and mutate a new variable that is the mpg median divided by the disp median:
mtcarsTib %>%
  group_by(gear) %>%
  summarize(mpgMed = median(mpg), dispMed = median(disp)) %>%
  mutate(mpgOverDisp = mpgMed / dispMed)
- Create a scatter plot of the drat and wt variables, and color by carb:
ggplot(mtcarsTib, aes(drat, wt, col = carb)) +
  geom_point()

ggplot(mtcarsTib, aes(drat, wt, col = as.factor(carb))) +
  geom_point()
- Gather vs, am, gear, and carb into a single key-value pair:
gather(mtcarsTib, key = "variable", value = "value", c(vs, am, gear, carb))
# or
gather(mtcarsTib, key = "variable", value = "value", c(8:11))
- Iterate over each column of mtcars, returning a logical vector:
map_lgl(mtcars, ~sum(.) > 1000)
# or
map_lgl(mtcars, function(.) sum(.) > 1000)