This chapter covers
- How to use the tf.data API to train models using large datasets
- Exploring your data to find and fix potential issues
- How to use data augmentation to create new “pseudo-examples” to improve model quality
The wide availability of large volumes of data is a major factor leading to today’s machine-learning revolution. Without easy access to large amounts of high-quality data, the dramatic rise in machine learning would not have happened. Datasets are now available all over the internet—freely shared on sites like Kaggle and OpenML, among others—as are benchmarks for state-of-the-art performance. Entire branches of machine learning have been propelled forward by the availability of “challenge” datasets, setting a bar and a common benchmark for the community.[1] If machine learning is our generation’s Space Race, then data is clearly our rocket fuel;[2] it’s potent, it’s valuable, it’s volatile, and it’s absolutely critical to a working machine-learning system. Not to mention that polluted data, like tainted fuel, can quickly lead to systemic failure. This chapter is about data. We will cover best practices for organizing data, how to detect and clean out issues, and how to use it efficiently.
1See how ImageNet propelled the field of object recognition or what the Netflix challenge did for collaborative filtering.
2Credit for the analogy to Edd Dumbill, “Big Data Is Rocket Fuel,” Big Data, vol. 1, no. 2, pp. 71–72.
“But haven’t we been working with data all along?” you might retort. It’s true: in previous chapters we worked with all sorts of data sources. We’ve trained image models using both synthetic and webcam image datasets. We’ve used transfer learning to build a spoken-word recognizer from a dataset of audio samples, and we accessed tabular datasets to predict prices. So what’s left to discuss? Aren’t we already proficient in handling data?
Recall the pattern of our data usage in previous examples. We’ve typically needed first to download our data from a remote source. Then we (usually) applied some transformation to convert the data into the correct format, for instance, by converting strings into one-hot vocabulary vectors or by normalizing the means and variances of tabular sources. We have then almost always needed to batch our data and convert it into a standard block of numbers, represented as a tensor, before connecting it to our model. All this before we even run our first training step.
This download-transform-batch pattern is very common, and TensorFlow.js comes packaged with tooling to make these types of manipulations easier, more modular, and less error prone. This chapter will introduce the tools in the tf.data namespace: most importantly, tf.data.Dataset, which can be used to lazily stream data. The lazy-streaming approach allows for downloading, transforming, and accessing data on an as-needed basis rather than downloading the data source in its entirety and holding it in memory as it is accessed. Lazy streaming makes it much easier to work with data sources that are too large to fit in a single browser tab or even too large to fit within the RAM of a single machine.
We will first introduce the tf.data.Dataset API and show how to configure it and connect it to a model. We will then introduce some theory and tooling to help you review and explore your data and resolve problems you might discover. The chapter wraps up by introducing data augmentation, a method for expanding a dataset to improve model quality by creating synthetic pseudo-examples.
How would you train a spam filter if your email database were hundreds of gigabytes in size and required special credentials to access? How could you construct an image classifier if your database of training images were too large to fit on a single machine?
Accessing and manipulating large volumes of data is a key skill for the machine-learning engineer, but so far we have been dealing with applications in which the data could conceivably fit within the memory available to our application. Many applications require working with large, cumbersome, and possibly privacy-sensitive data sources for which that technique is not suitable. Such applications require technology for accessing data from a remote source, piece by piece, on demand.
TensorFlow.js comes packaged with an integrated library designed just for this sort of data management. It is built to enable users to ingest, preprocess, and route data in a concise and readable way, inspired by the tf.data API in the Python version of TensorFlow. Assuming your code imports TensorFlow.js using an import statement like
import * as tf from '@tensorflow/tfjs';
this functionality will be available under the tf.data namespace.
Most interaction with tfjs-data comes through a single object type called Dataset. The tf.data.Dataset object provides a simple, configurable, and performant way to iterate over and process large (possibly unlimited) lists of data elements.[3] In the coarsest abstraction, you can imagine a dataset as an iterable collection of arbitrary elements, not unlike the Stream in Node.js. Whenever the next element is requested from the dataset, the internal implementation will download it, access it, or execute a function to create it, as needed. This abstraction makes it easy for the model to train on more data than can conceivably be held in memory at once. It also makes it convenient to share and organize datasets as first-class objects when there is more than one dataset to keep track of. Dataset provides a memory benefit by streaming only the required bits of data rather than accessing the whole thing monolithically. The Dataset API also provides performance optimizations over the naive implementation by prefetching values that are about to be needed.
3In this chapter, we will use the term elements frequently to refer to the items in the Dataset. In most cases, element is synonymous with example or datapoint; that is, in the training dataset, each element is an (x, y) pair. When reading from a CSV source, each element is a row of the file. Dataset is flexible enough to handle heterogeneous types of elements, but this is not recommended.
As of TensorFlow.js version 1.2.7, there are three ways to connect a tf.data.Dataset to a data provider. We will go through each in some detail, but table 6.1 contains a brief summary.
Table 6.1. Creating a tf.data.Dataset object from a data source (view table figure)
How to get a new tf.data.Dataset |
API |
How to use it to build a dataset |
---|---|---|
From a JavaScript array of elements; also works for typed arrays like Float32Array | tf.data.array(items) | const dataset = tf.data.array([1,2,3,4,5]); See listing 6.1 for more. |
From a (possibly remote) CSV file, where each row is an element | tf.data.csv( source, csvConfig) | const dataset = tf.data.csv("https://path/to/my.csv"); See listing 6.2 for more. The only required parameter is the URL from which to read the data. Additionally, csvConfig accepts an object with keys to help guide the parsing of the CSV file. For instance, its columnConfigs field can be used to mark certain columns as the label. |
From a generic generator function that yields elements | tf.data.generator( generatorFunction) |
function* countDownFrom10() { for (let i=10; i>0; i--) { yield(i); } } const dataset = tf.data.generator(countDownFrom10); See listing 6.3 for more. Note that the argument passed to tf.data.generator() is a generator function which, when called with no arguments, returns a Generator object. |
The simplest way to create a new tf.data.Dataset is to build one from a JavaScript array of elements. Given an array already in memory, you can create a dataset backed by the array using the tf.data.array() function. Of course, it won’t bring any training-speed or memory-usage benefit over using the array directly, but accessing an array via a dataset offers other important benefits. For instance, using datasets makes it easier to set up preprocessing and makes our training and evaluation easier through the simple model.fitDataset() and model.evaluateDataset() APIs, as we will see in section 6.2. In contrast to model.fit(x, y), model.fitDataset(myDataset) does not immediately move all of the data into GPU memory, meaning that it is possible to work with datasets larger than the GPU can hold. Note that the memory limit of the V8 JavaScript engine (1.4 GB on 64-bit systems) is usually larger than the amount TensorFlow.js can hold in WebGL memory at a time. Using the tf.data API is also good software-engineering practice, as it makes it easy to swap in another type of data in a modular fashion without changing much code. Without the dataset abstraction, it is easy to let the details of the implementation of the dataset source leak into its usage in the training of the model, an entanglement that will need to be unwound as soon as a different implementation is used.
To build a dataset from an existing array, use tf.data.array(itemsAsArray), as shown in the following listing.
Listing 6.1. Building a tf.data.Dataset from an array
const myArray = [{xs: [1, 0, 9], ys: 10},
                 {xs: [5, 1, 3], ys: 11},
                 {xs: [1, 1, 9], ys: 12}];
const myFirstDataset = tf.data.array(myArray);   #1
await myFirstDataset.forEachAsync(
    e => console.log(e));                        #2
// Yields output like
// {xs: Array(3), ys: 10}
// {xs: Array(3), ys: 11}
// {xs: Array(3), ys: 12}
We iterate over the elements of the dataset using the forEachAsync() function, which yields each element in turn. See more details about the Dataset.forEachAsync function in section 6.1.3.
Elements of datasets may be JavaScript primitives[4] (such as numbers and strings) as well as tuples, arrays, and nested objects of such structures, in addition to tensors. In this tiny example, the three elements of the dataset all have the same structure. They are all objects with the same keys and the same type of values at those keys. tf.data.Dataset can in general support a mixture of types of elements, but the common use case is that the dataset elements are meaningful semantic units of the same type. Typically, they should represent examples of the same kind of thing. Thus, except in very unusual use cases, each element should have the same type and structure.
4If you are familiar with the Python TensorFlow implementation of tf.data, you may be surprised that tf.data.Dataset can contain JavaScript primitives in addition to tensors.
A very common type of dataset element is a key-value object representing one row of a table, such as one row of a CSV file. The next listing shows a very simple program that will connect to and list out the Boston-housing dataset, the one we first used in chapter 2.
Listing 6.2. Building a tf.data.Dataset from a CSV file
const myURL = "https://storage.googleapis.com/tfjs-examples/" +
    "multivariate-linear-regression/data/train-data.csv";
const myCSVDataset = tf.data.csv(myURL);                 #1
await myCSVDataset.forEachAsync(e => console.log(e));    #2
// Yields output of 333 rows like
// {crim: 0.327, zn: 0, indus: 2.18, chas: 0, nox: 0.458, rm: 6.998,
//  age: 45.8, tax: 222}
// ...
Instead of tf.data.array(), here we use tf.data.csv() and point to a URL of a CSV file. This will create a dataset backed by the CSV file, and iterating over the dataset will iterate over the CSV rows. In Node.js, we can connect to a local CSV file by using a URL handle with the file:// prefix, like the following:
> const data = tf.data.csv( 'file://./relative/fs/path/to/boston-housing-train.csv');
When iterating, we see that each CSV row is transformed into a JavaScript object. The elements returned from the dataset are objects with one property for each column of the CSV, and the properties are named according to the column names in the CSV file. This is convenient for interacting with the elements, in that it is no longer necessary to remember the order of the fields. Section 6.3.1 will go into more detail describing how to work with CSVs and will go through an example.
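To make the row-to-object conversion concrete, here is a plain-JavaScript sketch of the kind of mapping tf.data.csv() performs; csvRowsToObjects and parseCsvLine are illustrative helpers of our own, not the library’s implementation:

```javascript
// Split one CSV line into trimmed cell strings.
function parseCsvLine(line) {
  return line.split(',').map(s => s.trim());
}

// Turn CSV text into one object per row, keyed by the header's column names.
function csvRowsToObjects(csvText) {
  const [headerLine, ...rowLines] = csvText.trim().split('\n');
  const columns = parseCsvLine(headerLine);
  return rowLines.map(line => {
    const values = parseCsvLine(line);
    const element = {};
    columns.forEach((name, i) => {
      element[name] = Number(values[i]);  // CSV cells arrive as strings
    });
    return element;
  });
}

const rows = csvRowsToObjects('crim,zn,indus\n0.327,0,2.18');
console.log(rows[0]);  // { crim: 0.327, zn: 0, indus: 2.18 }
```

Just as with the real API, consumers address fields by name rather than by column position.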
The third and most flexible way to create a tf.data.Dataset is to build one from a generator function. This is done using the tf.data.generator() method. tf.data.generator() takes a JavaScript generator function (or function*)[5] as its argument. If you are not familiar with generator functions, which are relatively new to JavaScript, you may wish to take a moment to read their documentation. The purpose of a generator function is to “yield” a sequence of values as they are needed, either forever or until the sequence is exhausted. The values that are yielded from the generator function flow through to become the values of the dataset. A very simple generator function might, for instance, yield random numbers or extract snapshots of data from a piece of attached hardware. A sophisticated generator might be integrated with a video game, yielding screen captures, scores, and control input-output. In the following listing, the very simple generator function yields samples of dice rolls.
5Learn more about ECMAscript generator functions at http://mng.bz/Q0rj.
Listing 6.3. Building a tf.data.Dataset for random dice rolls
let numPlaysSoFar = 0;                                #1
function rollTwoDice() {
  numPlaysSoFar++;
  return [Math.ceil(Math.random() * 6), Math.ceil(Math.random() * 6)];
}
function* rollTwoDiceGeneratorFn() {                  #2
  while (true) {                                      #2
    yield rollTwoDice();                              #2
  }
}
const myGeneratorDataset = tf.data.generator(         #3
    rollTwoDiceGeneratorFn);                          #3
await myGeneratorDataset.take(1).forEachAsync(        #4
    e => console.log(e));                             #4
// Prints to the console a value like
// [4, 2]
A couple of interesting notes regarding the game-simulation dataset created in listing 6.3. First, note that the dataset created here, myGeneratorDataset, is infinite. Since the generator function never returns, we could conceivably take samples from the dataset forever. If we were to execute forEachAsync() or toArray() (see section 6.1.3) on this dataset, it would never end and would probably crash our server or browser, so watch out for that. In order to work with such objects, we need to create some other dataset that is a limited sample of the unlimited one, using take(n). More on this in a moment.
Second, note that the dataset closes over a local variable. This is helpful for logging and debugging, to determine how many times the generator function has been executed.
Third, note that the data does not exist until it is requested. In this case, we only ever access exactly one sample of the dataset, and this would be reflected in the value of numPlaysSoFar.
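This laziness is easy to verify with plain JavaScript, independent of tf.data: the body of a generator function does not run until a value is actually pulled.

```javascript
// A counter incremented inside the generator body lets us observe
// exactly when that body executes.
let calls = 0;
function* countingGenerator() {
  while (true) {
    calls += 1;
    yield calls;
  }
}

const gen = countingGenerator();  // nothing has executed yet
console.log(calls);               // 0
gen.next();                       // pulls exactly one value
console.log(calls);               // 1
```

The same holds for a generator-backed dataset: numPlaysSoFar advances only when elements are requested.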
Generator datasets are powerful and tremendously flexible and allow developers to connect models to all sorts of data-providing APIs, such as data from a database query, from data downloaded piecemeal over the network, or from a piece of connected hardware. More details about the tf.data.generator() API are provided in info box 6.1.
tf.data.generator() argument specification
The tf.data.generator() API is flexible and powerful, allowing the user to hook the model up to many sorts of data providers. The argument passed to tf.data.generator() must meet the following specification:
- It must be callable with zero arguments.
- When called with zero arguments, it must return an object that conforms to the iterator and iterable protocol. This means that the returned object must have a next() method. When next() is called with no arguments, it should return a JavaScript object {value: ELEMENT, done: false} in order to pass forward the value ELEMENT. When there are no more values to return, it should return {value: undefined, done: true}.
JavaScript’s generator functions return Generator objects, which meet this specification and are thus the easiest way to use tf.data.generator(). The function may close over local variables, access local hardware, connect to network resources, and so on.
Table 6.1 contains the following code illustrating how to use tf.data.generator():
If you wish to avoid using generator functions for some reason and would rather implement the iterable protocol directly, you can also write the previous code in the following, equivalent way:
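As a concrete sketch (ours, not taken from the library documentation), the dice-roll stream from listing 6.3 can be rewritten as a plain function that returns an object meeting the iterator protocol; passing makeDiceIterator to tf.data.generator() would then play the same role as passing rollTwoDiceGeneratorFn:

```javascript
function rollTwoDice() {
  return [Math.ceil(Math.random() * 6), Math.ceil(Math.random() * 6)];
}

// A hand-written iterator instead of a function*: each next() call
// produces one element of the stream.
function makeDiceIterator() {
  return {
    next() {
      // The stream is infinite, so done is never true.
      return {value: rollTwoDice(), done: false};
    }
  };
}

const it = makeDiceIterator();
const first = it.next();
console.log(first.done);          // false
console.log(first.value.length);  // 2
```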
Gsnk gkg svge ykht data as z data rkc, btelyiianv qkg zot oingg rk wrzn er scecas orb data jn jr. Qrzz uestsuctrr yhv sns eceatr hhr eernv yvct ltvm ztx xnr erayll esuluf. Rotxy kzt rew APIs xr scaecs vqr data mtle s data zor, rhd tf.data sseur huolds fnkp bnxo re cxg sheet qunlereitfny. Wvtv apllycyti, hgrhie-elvle APIs fjfw essacc rkp data thnwii s data akr tlv vqb. Ztk ennistca, gknw training s mdleo, xw odz rpo model.fitDataset() XEJ, diebsderc jn section 6.2, whihc cesassce bkr data nj bxr data vzr txl ya, zgn vw, rxq sures, enrve novy re ascsec xdr data ietydlrc. Dseehsvtreel, knwp gendgbigu, nitetgs, nyc goicmn kr eadrudsntn vgw uro Dataset ejotcb sorkw, jr’a onrtatmpi er enwx wvq er obvx vjnr yro enstocnt.
Yvg ftirs uwc vr csecas data vmlt z data zrx cj rx matesr jr ffc rxp jnrx zn ayrar ignsu Dataset.toArray(). Ajcd function geax alxteyc pwsr rj snodsu fojo. Jr ateetsri hrgohtu kru erient data rax, snghupi cff xrp nsmetlee krnj nz raray cnu irneungrt psrr ryraa xr bor qtva. Ydk ozht udsohl hzx aocutin dwnv gixceunte jrda function kr rne alereynvindtt rcedopu ns aayrr rsrp jc vvr aerlg etl rop IcezSritpc umrntie. Aqjc sitekma jz pcvz re osem jl, tvl sniaectn, rqx data ckr ja ndcnoceet vr c aerlg eomrte data ueorsc et cj sn unelditim data rax agndrie ktml z oerssn.
The second way to access data from a dataset is to execute a function on each example of the dataset using dataset.forEachAsync(f). The argument provided to forEachAsync() will apply to each element in turn, in a way similar to the forEach() construct in JavaScript arrays and sets; that is, the native Array.forEach() and Set.forEach().
It is important to note that Dataset.forEachAsync() and Dataset.toArray() are both async functions. This is in contrast to Array.forEach(), which is synchronous, so it might be easy to make a mistake here. Dataset.toArray() returns a promise and will in general require await or .then() if synchronous behavior is required. Take care that if await is forgotten, the promise might not resolve in the order you expect, and bugs will arise. A typical bug is for the dataset to appear empty because the contents are iterated over before the promise resolves.
The reason why Dataset.forEachAsync() is asynchronous while Array.forEach() is not is that the data being accessed by the dataset might, in general, need to be created, calculated, or fetched from a remote source. Asynchronicity here allows us to make efficient use of the available computation while we wait. These methods are summarized in table 6.2.
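The forgotten-await bug can be demonstrated without tfjs at all. In this sketch, makeDataset is a toy stand-in of our own for a real Dataset whose toArray() is async:

```javascript
// A tiny async "dataset": toArray() returns a promise, mimicking
// Dataset.toArray(). Forgetting `await` hands you a pending Promise,
// not the elements.
function makeDataset(elements) {
  return {
    async toArray() {
      const out = [];
      for (const e of elements) {
        // Pretend each element needs an asynchronous fetch.
        out.push(await Promise.resolve(e));
      }
      return out;
    }
  };
}

async function main() {
  const ds = makeDataset([1, 2, 3]);
  const missingAwait = ds.toArray();         // a Promise, not an array!
  console.log(Array.isArray(missingAwait));  // false
  const arr = await ds.toArray();            // correct usage
  console.log(arr);                          // [ 1, 2, 3 ]
}
main();
```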
Table 6.2. Methods that iterate over a dataset (view table figure)
Method |
What it does |
Example |
---|---|---|
.toArray() | Asynchronously iterates over the entire dataset and pushes each element into an array, which is returned | const a = tf.data.array([1, 2, 3, 4, 5, 6]); const arr = await a.toArray(); console.log(arr); // 1,2,3,4,5,6 |
.forEachAsync(f) | Asynchronously iterates over all the elements of the dataset and executes f on each | const a = tf.data.array([1, 2, 3]); await a.forEachAsync(e => console.log("hi " + e)); // hi 1 // hi 2 // hi 3 |
It certainly is very nice when we can use data directly as it has been provided, without any cleanup or processing. But in the experience of the authors, this almost never happens outside of examples constructed for educational or benchmarking purposes. In the more common case, the data must be transformed in some way before it can be analyzed or used in a machine-learning task. For instance, often the source contains extra elements that must be filtered out; or data at certain keys needs to be parsed, deserialized, or renamed; or the data was stored in sorted order and thus needs to be randomly shuffled before being used to train or evaluate a model. Perhaps the dataset must be split into nonoverlapping sets for training and testing. Preprocessing is nearly inevitable. If you come across a dataset that is clean and ready to use out of the box, chances are that someone already did the cleanup and preprocessing for you!
tf.data.Dataset provides a chainable API of methods to perform these sorts of operations, described in table 6.3. Each of these methods returns a new Dataset object, but don’t be misled into thinking that all the elements of the dataset are copied or that all the elements are iterated over for each method call! The tf.data.Dataset API loads and transforms elements only in a lazy fashion. A dataset that was created by chaining together several of these methods can be thought of as a small program that will execute only once elements are requested from the end of the chain. It is only at that point that the Dataset instance crawls back up the chain of operations, possibly all the way to requesting data from the remote source.
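The lazy-evaluation behavior can be illustrated without tfjs. The LazyDataset class below is a toy stand-in of our own for tf.data.Dataset (not its actual implementation); it builds a chain of generators and touches elements only when iteration begins:

```javascript
// Each chained method wraps the upstream generator in a new one;
// no element is produced until the final iteration starts.
class LazyDataset {
  constructor(gen) { this.gen = gen; }
  map(f) {
    const src = this.gen;
    return new LazyDataset(function* () {
      for (const e of src()) yield f(e);
    });
  }
  take(n) {
    const src = this.gen;
    return new LazyDataset(function* () {
      if (n <= 0) return;
      let i = 0;
      for (const e of src()) {
        yield e;
        if (++i >= n) return;
      }
    });
  }
  toArray() { return [...this.gen()]; }
}

let mapCalls = 0;
const pipeline = new LazyDataset(function* () {
  for (let i = 1; i <= 1000; i++) yield i;
}).map(x => { mapCalls += 1; return x * x; }).take(3);

console.log(mapCalls);            // 0: nothing has run yet
console.log(pipeline.toArray());  // [ 1, 4, 9 ]
console.log(mapCalls);            // 3: only the requested elements
```

Notice that only 3 of the 1,000 source elements ever pass through the map function; the real Dataset API behaves analogously.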
Table 6.3. Chainable methods on the tf.data.Dataset object (view table figure)
These operations can be chained together to create simple but powerful processing pipelines. For instance, to split a dataset randomly into training and testing datasets, you can follow the recipe in the following listing (see tfjs-examples/iris-fitDataset/data.js).
Listing 6.4. Creating a train/test split using tf.data.Dataset
const seed = Math.floor(
    Math.random() * 10000);                    #1
const trainData = tf.data.array(IRIS_RAW_DATA)
    .shuffle(IRIS_RAW_DATA.length, seed)       #1
    .take(N)                                   #2
    .map(preprocessFn);
const testData = tf.data.array(IRIS_RAW_DATA)
    .shuffle(IRIS_RAW_DATA.length, seed)       #1
    .skip(N)                                   #3
    .map(preprocessFn);
There are some important considerations to attend to in this listing. We would like to randomly assign samples into the training and testing splits, and thus we shuffle the data first. We take the first N samples for the training data. For the testing data, we skip those samples, taking the rest. It is very important that the data is shuffled the same way when we are taking the samples, so we don’t end up with the same example in both sets; thus we use the same random seed in both sampling pipelines.
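The same-seed requirement can be checked in plain JavaScript. Here, seededShuffle with a small seeded PRNG (mulberry32, an illustrative choice) stands in for Dataset.shuffle(bufferSize, seed), and slice() stands in for take()/skip():

```javascript
// mulberry32: a tiny deterministic PRNG; the same seed always yields
// the same sequence, so the same permutation.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG.
function seededShuffle(items, seed) {
  const rand = mulberry32(seed);
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
const seed = 42;
const N = 7;
const trainSplit = seededShuffle(data, seed).slice(0, N);  // like take(N)
const testSplit = seededShuffle(data, seed).slice(N);      // like skip(N)
const overlap = trainSplit.filter(x => testSplit.includes(x));
console.log(overlap.length);                      // 0: disjoint splits
console.log(trainSplit.length, testSplit.length); // 7 3
```

With two different seeds, the two permutations would disagree, and some examples would land in both splits while others landed in neither.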
It’s also important to notice that we apply the map() function after the skip operation. It would also be possible to call .map(preprocessFn) before the skip, but then preprocessFn would be executed even for the examples we discard, a waste of computation. This behavior can be verified with the following listing.
Listing 6.5. Illustrating Dataset.skip() and map() interactions
let count = 0;
// Identity function which also increments count.
function identityFn(x) {
  count += 1;
  return x;
}
console.log('skip before map');
await tf.data.array([1, 2, 3, 4, 5, 6])
    .skip(6)                                   #1
    .map(identityFn)
    .forEachAsync(x => undefined);
console.log(`count is ${count}`);
console.log('map before skip');
await tf.data.array([1, 2, 3, 4, 5, 6])
    .map(identityFn)                           #2
    .skip(6)
    .forEachAsync(x => undefined);
console.log(`count is ${count}`);
// Prints:
// skip before map
// count is 0
// map before skip
// count is 6
Another common use for dataset.map() is to normalize our input data. We can imagine a scenario in which we wish to normalize our input to be zero mean, but we have an unlimited number of input samples. In order to subtract the mean, we would need to first calculate the mean of the distribution, but calculating the mean of an unlimited set is not tractable. We could also consider taking a representative sample and calculating the mean of that sample, but we could be making a mistake if we don’t know what the right sample size is. Consider a distribution in which nearly all values are 0, but every one-in-a-million example has a value of 1e9. This distribution has a mean value of 1,000, but if you calculate the mean on the first 1 million examples, you will likely be quite far off.
We can perform a streaming normalization using the dataset API in the following way (listing 6.6). In this listing, we keep a running tally of how many samples we’ve seen and what the sum of those samples has been. In this way, we can perform a streaming normalization. This listing operates on scalars (not tensors), but a version designed for tensors would have a similar structure.
Listing 6.6. Streaming normalization using tf.data.map()
function newStreamingZeroMeanFn() {            #1
  let samplesSoFar = 0;
  let sumSoFar = 0;
  return (x) => {
    samplesSoFar += 1;
    sumSoFar += x;
    const estimatedMean = sumSoFar / samplesSoFar;
    return x - estimatedMean;
  };
}
const normalizedDataset1 =
    unNormalizedDataset1.map(newStreamingZeroMeanFn());
const normalizedDataset2 =
    unNormalizedDataset2.map(newStreamingZeroMeanFn());
Note that we generate a new mapping function, which closes over its own copy of the sample counter and accumulator. This is to allow for multiple datasets to be normalized independently. Otherwise, both datasets would use the same variables to count invocations and sums. This solution is not without its own limitations, especially the possibility of numeric overflow in sumSoFar or samplesSoFar, so some care is warranted.
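Because listing 6.6’s mapping function is plain JavaScript, the closure behavior is easy to check directly, with Array.map standing in for Dataset.map:

```javascript
// Same function as in listing 6.6: each call returns a mapper with its
// own private counters.
function newStreamingZeroMeanFn() {
  let samplesSoFar = 0;
  let sumSoFar = 0;
  return (x) => {
    samplesSoFar += 1;
    sumSoFar += x;
    const estimatedMean = sumSoFar / samplesSoFar;
    return x - estimatedMean;
  };
}

const normA = [10, 10, 10].map(newStreamingZeroMeanFn());
const normB = [2, 4].map(newStreamingZeroMeanFn());
console.log(normA);  // [ 0, 0, 0 ]  (running mean tracks the stream)
console.log(normB);  // [ 0, 1 ]     (unaffected by the first dataset)
```

Had both arrays shared one mapper, the counters from the first stream would have skewed the running mean of the second.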
The streaming dataset API is nice, and we’ve seen that it allows us to do some elegant data manipulation, but the main purpose of the tf.data API is to simplify connecting data to our model for training and evaluation. How is tf.data going to help us here?
Ever since chapter 2, whenever we’ve wanted to train a model, we’ve used the model.fit() API. Recall that model.fit() takes at least two mandatory arguments, xs and ys. As a reminder, the xs variable must be a tensor that represents a collection of input examples. The ys variable must be bound to a tensor that represents a corresponding collection of output targets. For example, in the previous chapter’s listing 5.11, we trained and fine-tuned our synthetic object-detection model with calls like
model.fit(images, targets, modelFitArgs)
where images was, by default, a rank-4 tensor of shape [2000, 224, 224, 3], representing a collection of 2,000 images. The modelFitArgs configuration object specified the batch size for the optimizer, which was by default 128. Stepping back, we see that TensorFlow.js was given an in-memory[6] collection of 2,000 examples, representing the entirety of the data, and then looped through that data 128 examples at a time to complete each epoch.
6In GPU memory, which is usually more limited than the system RAM!
What if this wasn’t enough data, and we wanted to train with a much larger dataset? In this situation, we are faced with a pair of less-than-ideal options. Option 1 is to load a much larger array and see if it works. At some point, however, TensorFlow.js is going to run out of memory and emit a helpful error indicating that it was unable to allocate storage for the training data. Option 2 is for us to instead upload our data to the GPU in separate chunks and call model.fit() on each chunk. We would need to perform our own orchestration of model.fit(), training our model on pieces of our training data iteratively, whenever each piece is ready. If we wanted to perform more than one epoch, we would need to go back and re-download our chunks again in some (presumably shuffled) order. Not only is this orchestration cumbersome and error prone, but it also interferes with TensorFlow’s own reporting of the epoch counter and the reported metrics, which we would be forced to stitch back together ourselves.
TensorFlow.js provides us with a much more convenient tool for this task in the model.fitDataset() API:
model.fitDataset(dataset, modelFitDatasetArgs)
model.fitDataset() accepts a dataset as its first argument, but the dataset must conform to a certain pattern to work. Specifically, the dataset must yield objects with two properties. The first property is named xs and has a value of type Tensor, representing the features for a batch of examples; it is similar to the xs argument to model.fit(), but the dataset yields elements one batch at a time rather than the whole array at once. The second required property is named ys and contains the corresponding target tensor.[7] Compared to model.fit(), model.fitDataset() provides a number of advantages. Foremost, we don’t need to write code to manage and orchestrate the downloading of pieces of our dataset; this is handled for us in an efficient, as-needed streaming manner. Caching structures built into the dataset allow for prefetching data that is anticipated to be needed, making efficient use of our computational resources. This API call is also more powerful, allowing us to train on much larger datasets than can fit on our GPU. In fact, the size of the dataset we can train on is now limited only by how much time we have, because we can continue to train for as long as we are able to get new training examples. This behavior is illustrated by the data-generator example in the tfjs-examples repository.
7For models with multiple inputs, an array of tensors is expected instead of the individual feature tensors. The pattern is similar for models fitting multiple targets.
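The element contract can be sketched without tensors. In this toy generator (ours, for illustration), plain nested arrays stand in for the xs and ys tensors so that the batch shapes are easy to inspect:

```javascript
// Yields {xs, ys} objects, one batch at a time, as model.fitDataset()
// expects: xs has shape [BATCH, numFeatures], ys has shape [BATCH, 1].
const BATCH = 4;
function* trainingBatches() {
  while (true) {
    const xs = [];
    const ys = [];
    for (let i = 0; i < BATCH; i++) {
      const feature = [Math.random(), Math.random()];
      xs.push(feature);
      ys.push([feature[0] > feature[1] ? 1 : 0]);
    }
    yield {xs, ys};
  }
}

const batch = trainingBatches().next().value;
console.log(batch.xs.length, batch.xs[0].length);  // 4 2
console.log(batch.ys.length, batch.ys[0].length);  // 4 1
```

In real tfjs code, xs and ys would be tensors of those shapes, typically produced by map() and batch() as in listing 6.8 below.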
In this example, we will train a model to learn how to estimate the likelihood of winning a simple game of chance. As usual, you can use the following commands to check out and run the demo:
git clone https://github.com/tensorflow/tfjs-examples.git
cd tfjs-examples/data-generator
yarn
yarn watch
The game used here is a simplified card game, somewhat like poker. Both players are given N cards, where N is a positive integer, and each card is represented by a random integer between 1 and 13. The rules of the game are as follows:
- The player with the largest group of same-valued cards wins. For example, if player 1 has three of a kind, and player 2 has only a pair, player 1 wins.
- If both players have the same-sized maximal group, then the player whose group has the larger face value wins. For example, a pair of 5s beats a pair of 4s.
- If neither player even has a pair, the player with the highest single card wins.
- Ties are settled randomly, 50/50.
It should be easy to convince yourself that each player has an equal chance of winning. Thus, if we know nothing about our cards, we should be able to guess whether we will win only half of the time. We will build and train a model that takes as input player 1’s cards and predicts whether that player will win. In the screenshot in figure 6.1, you should see that we were able to achieve approximately 75% accuracy on this problem after training on about 250,000 examples (50 epochs * 50 batches per epoch * 100 samples per batch). Five cards per hand were used in this simulation, but similar accuracies are achieved for other counts. Higher accuracies are achievable by running with larger batches and for more epochs, but even at 75%, our intelligent player has a significant advantage over the naive player at estimating the likelihood that they will win.
Figure 6.1. The UI of the data-generator example. A description of the rules of the game and a button to run simulations are at top-left. Below that are the generated features and the data pipeline. The Dataset-to-Array button runs the chained dataset operations that will simulate the game, generate features, batch samples together, take N such batches, convert them to an array, and print the array out. At top-right, there are affordances to train a model using this data pipeline. When the user clicks the Train-Model-Using-Fit-Dataset button, the model.fitDataset() operation takes over and pulls samples from the pipeline. Loss and accuracy curves are printed below this. At bottom-right, the user may enter values for player 1’s hand and press a button to make predictions from the model. Larger predictions indicate that the model believes the hand is more likely to win. Values are drawn with replacement, so five of a kind can happen.

Jl wx kkwt rk oefrpmr jcqr itoneparo gnsiu model.fit(), ow louwd ngxk rv reaetc gcn etors s oertsn lv 250,000 lasmexpe rich rk rpteenser brv input features. Cdv data nj ajgr emxaelp otc tyeptr sllma—fnhe xrnc xl oftasl tvb intescan—ryh tlv vht bectoj-eetncdoti rzcx nj rdv oprsuvie cparhte, 250,000 sepexaml ludow kksb rqedruei 150 UA lx OFN mmeroy,[8] lts edybon gswr cj albievaal nj zvmr browsers nj 2019.
8numExamples × width × height × colorDepth × sizeOfInt32 = 250,000 × 224 × 224 × 3 × 4 bytes.
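The footnote’s arithmetic is easy to check directly; this small sketch just multiplies the quantities out:

```javascript
// Sanity check of the footnote's arithmetic: memory needed to hold
// 250,000 object-detection examples (224x224 RGB, 4 bytes per value).
const numExamples = 250000;
const bytesPerExample = 224 * 224 * 3 * 4;  // width * height * colorDepth * sizeOfInt32
const totalBytes = numExamples * bytesPerExample;
console.log((totalBytes / 1e9).toFixed(1) + ' GB');  // → "150.5 GB"
```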
Let’s take a dive into the relevant portions of this example. First, let’s look at how we generate our dataset. The code in the following listing (simplified from tfjs-examples/data-generator/index.js) is similar to the die-rolling generator dataset in listing 6.3, with a bit more complexity since we are storing more information.
Listing 6.7. Building a tf.data.Dataset for our card game
import * as game from './game';                          #1

let numSimulationsSoFar = 0;

function runOneGamePlay() {
  const player1Hand = game.randomHand();                 #2
  const player2Hand = game.randomHand();                 #2
  const player1Win = game.compareHands(                  #3
      player1Hand, player2Hand);                         #3
  numSimulationsSoFar++;
  return {player1Hand, player2Hand, player1Win};         #4
}

function* gameGeneratorFunction() {
  while (true) {
    yield runOneGamePlay();
  }
}

export const GAME_GENERATOR_DATASET =
    tf.data.generator(gameGeneratorFunction);

await GAME_GENERATOR_DATASET.take(1).forEach(
    e => console.log(e));
// Prints
// {player1Hand: [11, 9, 7, 8],
//  player2Hand: [10, 9, 5, 1],
//  player1Win: 1}
Once we have our basic generator dataset connected up to the game logic, we want to format the data in a way that makes sense for our learning task. Specifically, our task is to attempt to predict the player1Win bit from the player1Hand. In order to do so, we are going to need to make our dataset return elements of the form [batchOfFeatures, batchOfTargets], where the features are calculated from player 1’s hand. The following code is simplified from tfjs-examples/data-generator/index.js.
Listing 6.8. Building a dataset of player features
function gameToFeaturesAndLabel(gameState) {             #1
  return tf.tidy(() => {
    const player1Hand =
        tf.tensor1d(gameState.player1Hand, 'int32');
    const handOneHot = tf.oneHot(
        tf.sub(player1Hand, tf.scalar(1, 'int32')),
        game.GAME_STATE.max_card_value);
    const features = tf.sum(handOneHot, 0);              #2
    const label = tf.tensor1d([gameState.player1Win]);
    return {xs: features, ys: label};
  });
}

let BATCH_SIZE = 50;

export const TRAINING_DATASET =
    GAME_GENERATOR_DATASET.map(gameToFeaturesAndLabel)   #3
        .batch(BATCH_SIZE);                              #4

await TRAINING_DATASET.take(1).forEach(
    e => console.log([e.xs.shape, e.ys.shape]));
// Prints the shape of the tensors:
// [[50, 13], [50, 1]]
Now that we have a dataset in the proper form, we can connect it to our model using model.fitDataset(), as shown in the following listing (simplified from tfjs-examples/data-generator/index.js).
Listing 6.9. Building and training a model on the dataset
// Construct model.
model = tf.sequential();
model.add(tf.layers.dense({
  inputShape: [game.GAME_STATE.max_card_value],
  units: 20,
  activation: 'relu'
}));
model.add(tf.layers.dense({units: 20, activation: 'relu'}));
model.add(tf.layers.dense({units: 1, activation: 'sigmoid'}));

// Train model
await model.fitDataset(TRAINING_DATASET, {               #1
  batchesPerEpoch: ui.getBatchesPerEpoch(),              #2
  epochs: ui.getEpochsToTrain(),
  validationData: TRAINING_DATASET,                      #3
  validationBatches: 10,                                 #4
  callbacks: {
    onEpochEnd: async (epoch, logs) => {
      // trainLogs is an array declared elsewhere in the example.
      trainLogs.push(logs);
      tfvis.show.history(
          ui.lossContainerElement, trainLogs, ['loss', 'val_loss']);
      tfvis.show.history(                                #5
          ui.accuracyContainerElement, trainLogs, ['acc', 'val_acc'],
          {zoomToFitAccuracy: true});
    },
  }
});
As we see in the previous listing, fitting a model to a dataset is just as simple as fitting a model to a pair of x, y tensors. As long as our dataset yields tensor values in the right format, everything just works, and we get the benefit of streaming data from a possibly remote source, without needing to manage the orchestration on our own. Beyond passing in a dataset instead of a tensor pair, there are a few differences in the configuration object that merit discussion:
- batchesPerEpoch—As we saw in listing 6.9, the configuration for model.fitDataset() takes an optional field for specifying the number of batches that constitute an epoch. When we handed the entirety of the data to model.fit(), it was easy to calculate how many examples there are in the whole dataset. It’s just data.shape[0]! When using fitDataset(), we can tell TensorFlow.js when an epoch ends in one of two ways. The first way is to use this configuration field, and fitDataset() will execute onEpochEnd and onEpochStart callbacks after that many batches. The second way is to have the dataset itself end as a signal that the dataset is exhausted. In listing 6.7, we could change
while (true) { ... }
- to
for (let i = 0; i<ui.getBatchesPerEpoch(); i++) { ... }
- to mimic this behavior.
- validationData—When using fitDataset(), the validationData may be a dataset also. But it doesn’t have to be. You can continue to use tensors for validationData if you want to. The validation dataset needs to meet the same specification with respect to the format of returned elements as the training dataset does.
- validationBatches—If your validation data comes from a dataset, you need to tell TensorFlow.js how many samples to take from the dataset to constitute a complete evaluation. If no value is specified, then TensorFlow.js will continue to draw from the dataset until it returns a done signal. Because the code in listing 6.7 uses a never-ending generator to generate the dataset, this would never happen, and the program would hang.
The rest of the configuration is identical to that of the model.fit() API, so no changes are necessary.
All developers need some solution for connecting their data to their model. These connections range from common stock connectors, to well-known experimental datasets like MNIST, to completely custom connectors, to proprietary data formats within an enterprise. In this section, we will review how tf.data can help to make these connections simple and maintainable.
Beyond working with common stock datasets, the most common way to access data involves loading prepared data stored in some file format. Data files are often stored in CSV (comma-separated value) format[9] due to its simplicity, human readability, and broad support. Other formats have advantages in storage efficiency and access speed, but CSV might be considered the lingua franca of datasets. In the JavaScript community, we typically want to be able to conveniently stream data from some HTTP endpoint. This is why TensorFlow.js provides native support for streaming and manipulating data from CSV files. In section 6.1.2, we briefly described how to construct a tf.data.Dataset backed by a CSV file. In this section, we will dive deeper into the CSV API to show how tf.data makes working with these data sources very easy. We will describe an example application that connects to remote CSV datasets, prints their schema, counts the elements of the dataset, and offers the user an affordance to select and print individual examples. Check out the example using the familiar commands:
9As of January 2019, the data science and machine-learning challenge site kaggle.com/datasets boasts 13,971 public datasets, of which over two-thirds are hosted in the CSV format.
git clone https://github.com/tensorflow/tfjs-examples.git cd tfjs-examples/data-csv yarn && yarn watch
This should open a site that instructs us to enter the URL of a hosted CSV file or to use one of the four suggested URLs by clicking, for example, Boston Housing CSV. See figure 6.2 for an illustration. Underneath the URL entry input box, buttons are provided to perform three actions: 1) count the rows in the dataset, 2) retrieve the column names of the CSV, if they exist, and 3) access and print a specified sample row of the dataset. Let’s go through how these work and how the tf.data API makes them very easy.
Figure 6.2. Web UI for our tfjs-data CSV example. Click one of the preset CSV buttons at the top or enter a path to your own hosted CSV, if you have one. Be sure to enable CORS access for your CSV if you go with your own hosted file.

We saw earlier that creating a tfjs-data dataset from a remote CSV is very simple, using a command like
const myData = tf.data.csv(url);
where url is either a string identifier using the http://, https://, or file:// protocol, or a RequestInfo. This call does not actually issue any requests to the URL to check whether, for example, the file exists or is accessible, because of the lazy iteration. In listing 6.10, the CSV is first fetched at the asynchronous myData.forEach() call. The function we call in the forEach() will simply stringify and print elements in the dataset, but we could imagine doing other things with this iterator, such as generating UI elements for every element in the set or computing statistics for a report.
Listing 6.10. Printing the first 10 records in a remote CSV file
const url = document.getElementById('queryURL').value;
const myData = tf.data.csv(url);                         #1
await myData.take(10).forEach(
    x => console.log(JSON.stringify(x)));                #2
// Output is like
// {"crim":0.26169,"zn":0,"indus":9.9,"chas":0,"nox":0.544,"rm":6.023, ...
//  ,"medv":19.4}
// {"crim":5.70818,"zn":0,"indus":18.1,"chas":0,"nox":0.532,"rm":6.75, ...
//  ,"medv":23.7}
// ...
CSV datasets often use the first row as a metadata header containing the names associated with each column. By default, tf.data.csv() assumes this to be the case, but it can be controlled using the csvConfig object passed in as the second argument. If column names are not provided by the CSV file itself, they can be provided manually in the constructor like so:
const myData = tf.data.csv(url, { hasHeader: false, columnNames: ["firstName", "lastName", "id"] });
If you provide a manual columnNames configuration to the CSV dataset, it will take precedence over the header row read from the data file. By default, the dataset will assume the first line is a header row. If the first row is not a header, its absence must be configured and columnNames provided manually.
Once the CSVDataset object exists, it is possible to query it for the column names using dataset.columnNames(), which returns an ordered string list of the column names. The columnNames() method is specific to the CSVDataset subclass and is not generally available from datasets built from other sources. The Get Column Names button in the example is connected to a handler that uses this API. Requesting the column names results in the Dataset object making a fetch call to the provided URL to access and parse the first row; thus the async call in the following listing (condensed from tfjs-examples/data-csv/index.js).
Listing 6.11. Accessing column names from a CSV
const url = document.getElementById('queryURL').value;
const myData = tf.data.csv(url);
const columnNames = await myData.columnNames();          #1
console.log(columnNames);
// Outputs something like [
//   "crim", "zn", "indus", ..., "tax",
//   "ptratio", "lstat"] for Boston Housing
Now that we have the column names, let’s get a row from our dataset. In listing 6.12, we show how the web app prints out a single selected row of the CSV file, where the user selects which row via an input element. In order to fulfill this request, we will first use the Dataset.skip() method to create a new dataset, the same as the original one, but skipping the first n − 1 elements. We will then use the Dataset.take() method to create a dataset that ends after one element. Finally, we will use Dataset.toArray() to extract the data into a standard JavaScript array. If everything goes right, our request will produce an array containing exactly one element at the specified position. This sequence is put together in the following listing (condensed from tfjs-examples/data-csv/index.js).
Listing 6.12. Accessing a selected row from a remote CSV
const url = document.getElementById('queryURL').value;
const sampleIndex = document.getElementById(             #1
    'whichSampleInput').valueAsNumber;                   #1
const myData = tf.data.csv(url);                         #2
const sample = await myData
    .skip(sampleIndex)                                   #3
    .take(1)                                             #4
    .toArray();                                          #5
console.log(sample);
// Outputs something like: [{crim: 0.3237, zn: 0, indus: 2.18, ..., tax:
// 222, ptratio: 18.7, lstat: 2.94}]
// for Boston Housing.
We can now take the output of the row, which—as you can see from the output of the console.log in listing 6.12 (repeated in a comment)—comes in the form of an object mapping the column name to the value, and style it for insertion into our document. Something to watch out for: if we ask for a row that doesn’t exist, perhaps the 400th element of a 300-element dataset, we will end up with an empty array.
It’s pretty common when connecting to remote datasets to make a mistake and use a bad URL or improper credentials. In these circumstances, it’s best to catch the error and provide the user with a reasonable error message. Since the Dataset object does not actually contact the remote resource until the data is needed, it’s important to take care to write the error handling in the right place. The following listing shows a short snippet of how error handling is done in our CSV example web app (condensed from tfjs-examples/data-csv/index.js). For more details about how to connect to CSV files guarded by authentication, see info box 6.2.
Listing 6.13. Handling errors arising from failed connections
const url = 'http://some.bad.url';
const sampleIndex = document.getElementById(
    'whichSampleInput').valueAsNumber;
const myData = tf.data.csv(url);                         #1

let columnNames;
try {
  columnNames = await myData.columnNames();              #2
} catch (e) {
  ui.updateColumnNamesMessage(`Could not connect to ${url}`);
}
In section 6.2, we learned how to use model.fitDataset(). We saw that the method requires a dataset that yields elements in a very particular form. Recall that the form is an object with two properties, {xs, ys}, where xs is a tensor representing a batch of the input, and ys is a tensor representing a batch of the associated targets. By default, the CSV dataset will return elements as JavaScript objects, but we can configure the dataset to instead return elements closer to what we need for training. For this, we will need to use the csvConfig.columnConfigs field of tf.data.csv(). Consider a CSV file about golf with three columns: “club,” “strength,” and “distance.” If we wished to predict distance from club and strength, we could apply a map function on the raw output to arrange the fields into xs and ys; or, more easily, we could configure the CSV reader to do this for us. Table 6.4 shows how to configure the CSV dataset to separate the feature and label properties, and perform batching so that the output is suitable for entry into model.fitDataset().
Table 6.4. Configuring a CSV dataset to work with model.fitDataset() (view table figure)

| How the dataset is built and configured | Code for building the dataset | Result of dataset.take(1).toArray()[0] (the first element returned from the dataset) |
|---|---|---|
| Raw CSV default | dataset = tf.data.csv(csvURL) | {club: 1, strength: 45, distance: 200} |
| CSV with label configured in columnConfigs | columnConfigs = {distance: {isLabel: true}}; dataset = tf.data.csv(csvURL, {columnConfigs}); | {xs: {club: 1, strength: 45}, ys: {distance: 200}} |
| CSV with columnConfigs and then batched | columnConfigs = {distance: {isLabel: true}}; dataset = tf.data.csv(csvURL, {columnConfigs}).batch(128); | {xs: {club: Tensor, strength: Tensor}, ys: {distance: Tensor}} Each of these three tensors has shape = [128]. |
| CSV with columnConfigs and then batched and mapped from object to array | columnConfigs = {distance: {isLabel: true}}; dataset = tf.data.csv(csvURL, {columnConfigs}).map(({xs, ys}) => ({xs: Object.values(xs), ys: Object.values(ys)})).batch(128); | {xs: Tensor, ys: Tensor} The mapping function returns items of the form {xs: [number, number], ys: [number]}. The batch operation automatically converts numeric arrays to tensors, so the first tensor (xs) has shape = [128, 2] and the second (ys) has shape = [128, 1]. |
Fetching CSV data guarded by authentication
In the previous examples, we have connected to data available from remote files by simply providing a URL. This works well both in Node.js and from the browser and is very easy, but sometimes our data is protected, and we need to provide Request parameters. The tf.data.csv() API allows us to provide RequestInfo in place of a raw string URL, as shown in the following code. Other than the additional authorization parameter, there is no change in the dataset:
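A minimal sketch of such a call might look like the following; the URL and bearer token are hypothetical placeholders, and the tf.data.csv() call (commented out here) assumes TensorFlow.js is loaded in your page or Node.js process:

```javascript
// Hypothetical protected endpoint and token; substitute your own.
const url = 'https://example.com/private/data.csv';
const requestInfo = new Request(url, {
  headers: {Authorization: 'Bearer <your-token-here>'}
});
// With TensorFlow.js loaded, pass the Request instead of a string URL:
// const myData = tf.data.csv(requestInfo);
// await myData.columnNames();  // now fetches with the auth header attached
console.log(requestInfo.headers.get('authorization'));  // → "Bearer <your-token-here>"
```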
One of the most exciting applications for TensorFlow.js projects is to train and apply machine-learning models to the sensors directly available on mobile devices. Motion recognition using the mobile’s onboard accelerometer? Sound or speech understanding using the onboard microphone? Visual assistance using the onboard camera? There are so many good ideas out there, and we’ve just begun.
In chapter 5, we explored working with the webcam and microphone in the context of transfer learning. We saw how to use the camera to control a game of Pac-Man, and we used the microphone to fine-tune a speech-understanding system. While not every modality is available as a convenience API call, tf.data does have a simple and easy API for working with the webcam. Let’s explore how that works and how to use it to predict from trained models.
With the tf.data API, it is very simple to create a dataset iterator yielding a stream of images from the webcam. Listing 6.14 shows a basic example from the documentation. The first thing we notice is the call to tf.data.webcam(). This constructor returns a webcam iterator, taking an optional HTML element as its input argument. The constructor works only in the browser environment. If the API is called in the Node.js environment, or if there is no available webcam, the constructor will throw an exception indicating the source of the error. Furthermore, the browser will request permission from the user before opening the webcam. The constructor will throw an exception if the permission is denied. Responsible development should cover these cases with user-friendly messages.
Listing 6.14. Creating a dataset using tf.data.webcam() and an HTML element
const videoElement = document.createElement('video');    #1
videoElement.width = 100;
videoElement.height = 100;

const webcam = await tf.data.webcam(videoElement);       #2
const img = await webcam.capture();                      #3
img.print();
webcam.stop();                                           #4
When creating a webcam iterator, it is important that the iterator knows the shape of the tensors to be produced. There are two ways to control this. The first way, shown in listing 6.14, uses the shape of the provided HTML element. If the shape needs to be different, or perhaps the video isn’t to be shown at all, the desired shape can be provided via a configuration object, as shown in listing 6.15. Note that the provided HTML element argument is undefined, meaning that the API will create a hidden element in the DOM to act as a handle to the video.
Listing 6.15. Creating a basic webcam dataset using a configuration object
const videoElement = undefined;
const webcamConfig = {
  facingMode: 'user',
  resizeWidth: 100,
  resizeHeight: 100
};
const webcam = await tf.data.webcam(
    videoElement, webcamConfig);                         #1
It is also possible to use the configuration object to crop and resize portions of the video stream. Using the HTML element and the configuration object in tandem, the API allows the caller to specify a location to read from and a desired output size. The output tensor will be interpolated to the desired size. See the next listing for an example of selecting a rectangular portion of a square video and then reducing the size to fit a small model.
Listing 6.16. Cropping and resizing data from a webcam
const videoElement = document.createElement('video');
videoElement.width = 300;
videoElement.height = 300;                               #1

const webcamConfig = {
  resizeWidth: 150,
  resizeHeight: 100,                                     #2
  centerCrop: true                                       #3
};

const webcam = await tf.data.webcam(
    videoElement, webcamConfig);                         #4
It is important to point out some obvious differences between this type of dataset and the datasets we’ve been working with so far. For example, the values yielded from the webcam depend on when you extract them. Contrast this with the CSV dataset, which will yield its rows in order no matter how fast or slowly they are drawn. Furthermore, samples can be drawn from the webcam for as long as the user desires more. API callers must explicitly tell the stream to end when they are done with it.
Data is accessed from the webcam iterator using the capture() method, which returns a tensor representing the most recent frame. API users should use this tensor for their machine-learning work but must remember to dispose of it to prevent a memory leak. Because of the intricacies involved in asynchronous processing of the webcam data, it is better to apply necessary preprocessing functions directly to the captured frame rather than use the deferred map() functionality provided by tf.data.
That is to say, rather than processing data using data.map(),
// No:
let webcam = await tfd.webcam(myElement);
webcam = webcam.map(myProcessingFunction);
const imgTensor = webcam.capture();
// use imgTensor here.
tf.dispose(imgTensor);
apply the function directly to the image:
// Yes:
let webcam = await tfd.webcam(myElement);
const imgTensor = myPreprocessingFunction(webcam.capture());
// use imgTensor here.
tf.dispose(imgTensor);
The forEach() and toArray() methods should not be used on a webcam iterator. For processing long sequences of frames from the device, users of the tf.data.webcam() API should define their own loop using, for example, tf.nextFrame() and call capture() at a reasonable frame rate. The reason is that if you were to call forEach() on your webcam, frames would be drawn as fast as the browser’s JavaScript engine can possibly request them from the device. This will typically create tensors faster than the frame rate of the device, resulting in duplicated frames and wasted computation. For similar reasons, a webcam iterator should not be passed as an argument to the model.fit() method.
Listing 6.17 shows the abbreviated prediction loop from the webcam-transfer-learning (Pac-Man) example we saw in chapter 5. Note that the outer loop will continue for as long as isPredicting is true, which is controlled by a UI element. Internally, the rate of the loop is moderated by a call to tf.nextFrame(), which is pinned to the UI’s refresh rate. The following code is from tfjs-examples/webcam-transfer-learning/index.js.
Listing 6.17. Using tf.data.webcam() in a prediction loop
async function getImage() {                              #1
  return (await webcam.capture())                        #2
      .expandDims(0)
      .toFloat()
      .div(tf.scalar(127))
      .sub(tf.scalar(1));
}

while (isPredicting) {
  // Capture the frame from the webcam.
  const img = await getImage();                          #3
  const predictedClass = tf.tidy(() => {
    // Process the image and make predictions...
    ...
  });
  await tf.nextFrame();                                  #4
}
Kvn lnfia novr: nvyw sugin kdr acwmbe, rj jz often z dege jbsx rk zwht, opcsrse, nch dascidr cn agemi berefo agmkin nopdreiscit en roq lxhk. Rtvdv xst ewr uxye esaosnr vtl rjpz. Vratj, gainsps gxr ieagm hrugoth dkr emldo nresuse rryz qor etenvarl omlde iehsgwt ckqo novq oaeldd rv qor OVD, irntgevenp nzp ntisrtteug oelsnssw kn tustrpa. Sonedc, zrjq egisv xrp cemabw hardware mjxr re twmz dg snp neigb nedgsin taulca mefasr. Nngienped xn rpo hardware, mostemsie mabswec fjwf nyak lbkan smeafr ihlew rbv evdiec jz regwpion py. See kru rken nlgitsi txl c sipptne sgnwioh vyw ajrp aj xnbx jn orq bwmace-snarrtef-nglreani xpeamle (tlkm amcbwe-tfsreanr-xiglin/neradne.ci).
Listing 6.18. Creating a video dataset from tf.data.webcam()
async function init() {
  try {
    webcam = await tfd.webcam(
        document.getElementById('webcam'));              #1
  } catch (e) {
    console.log(e);
    document.getElementById('no-webcam').style.display = 'block';
  }
  truncatedMobileNet = await loadTruncatedMobileNet();
  ui.init();

  // Warm up the model. This uploads weights to the GPU and compiles the
  // WebGL programs so the first time we collect data from the webcam it
  // will be quick.
  const screenShot = await webcam.capture();
  truncatedMobileNet.predict(screenShot.expandDims(0));  #2
  screenShot.dispose();                                  #3
}
Along with image data, tf.data also includes specialized handling to collect audio data from the device microphone. Similar to the webcam API, the microphone API creates a lazy iterator, allowing the caller to request frames as needed, packaged neatly as tensors suitable for consumption directly by a model. The typical use case here is to collect frames to be used for prediction. While it’s technically possible to produce a training stream using this API, zipping it together with the labels would be challenging.
Listing 6.19 shows an example of how to collect one second of audio data using the tf.data.microphone() API. Note that executing this code will trigger the browser to request that the user grant access to the microphone.
Listing 6.19. Collecting one second of audio data using the tf.data.microphone() API
const mic = await tf.data.microphone({                   #1
  fftSize: 1024,
  columnTruncateLength: 232,
  numFramesPerSpectrogram: 43,
  sampleRateHz: 44100,
  smoothingTimeConstant: 0,
  includeSpectrogram: true,
  includeWaveform: true
});
const audioData = await mic.capture();                   #2
const spectrogramTensor = audioData.spectrogram;         #3
const waveformTensor = audioData.waveform;               #4
mic.stop();                                              #5
The microphone includes a number of configurable parameters to give users fine control over how the fast Fourier transform (FFT) is applied to the audio data. Users may want more or fewer frames of frequency-domain audio data per spectrogram, or they may be interested in only a certain frequency range of the audio spectrum, such as those frequencies necessary for audible speech. The fields in listing 6.19 have the following meaning:
- sampleRateHz: 44100
- The sampling rate of the microphone waveform. This must be exactly 44,100 or 48,000 and must match the rate specified by the device itself. An error will be thrown if the specified value doesn’t match the value made available by the device.
- fftSize: 1024
- Controls the number of samples used to compute each nonoverlapping “frame” of audio. Each frame undergoes an FFT, and larger frames give more frequency sensitivity but have less time resolution, as time information within the frame is lost.
- Must be a power of 2 between 16 and 8,192, inclusive. Here, 1024 means that energy within a frequency band is calculated over a span of 1,024 samples.
- Note that the highest measurable frequency is equal to half the sample rate, or approximately 22 kHz.
- columnTruncateLength: 232
- Controls how much frequency information is retained. By default, each audio frame contains fftSize points, or 1,024 in our case, covering the entire spectrum from 0 to the maximum (22 kHz). However, we are typically interested only in the lower frequencies. Human speech is generally only up to 5 kHz, and thus we keep only the part of the data representing zero to 5 kHz.
- Here, 232 ≈ (5 kHz / 22 kHz) × 1,024.
- numFramesPerSpectrogram: 43
- The FFT is calculated on a series of nonoverlapping windows (or frames) of the audio samples to create a spectrogram. This parameter controls how many are included in each returned spectrogram. The returned spectrogram will be of shape [numFramesPerSpectrogram, columnTruncateLength, 1], or [43, 232, 1] in our case.
- The duration of each frame is equal to fftSize / sampleRate. In our case, 1,024 / 44,100 Hz is about 0.023 seconds.
- There is no delay between frames, so the entire spectrogram duration is about 43 × 0.023 = 0.98, or just about 1 second.
- smoothingTimeConstant: 0
- How much to blend the previous frame’s data with this frame. It must be between 0 and 1.
- includeSpectrogram: true
- If true, the spectrogram will be calculated and made available as a tensor. Set this to false if the application does not actually need to calculate the spectrogram. This can happen only if the waveform is needed.
- includeWaveform: true
- If true, the waveform is kept and made available as a tensor. This can be set to false if the caller will not need the waveform. Note that at least one of includeSpectrogram and includeWaveform must be true. It is an error if they are both false. Here we have set them both to true to show that this is a valid option, but in a typical application, only one of the two will be necessary.
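The arithmetic behind these settings can be verified in a few lines; the values below simply restate the configuration from listing 6.19:

```javascript
// Checking the spectrogram arithmetic from the bullets above.
const sampleRateHz = 44100;
const fftSize = 1024;
// Highest measurable frequency is half the sample rate (the Nyquist limit).
const nyquistHz = sampleRateHz / 2;                               // 22050
// Each nonoverlapping frame spans fftSize samples.
const frameSeconds = fftSize / sampleRateHz;                      // ~0.023 s
// 43 frames back to back span just about one second.
const spectrogramSeconds = 43 * frameSeconds;                     // ~0.998 s
// Keeping frequencies up to 5 kHz retains about 232 of the 1,024 columns.
const truncateLength = Math.round(5000 / nyquistHz * fftSize);    // 232
console.log(nyquistHz, frameSeconds.toFixed(3),
    spectrogramSeconds.toFixed(2), truncateLength);
// → 22050 0.023 1.00 232
```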
Similar to the video stream, the audio stream sometimes takes some time to start, and data from the device might be nonsense to begin with. Zeros and infinities are commonly encountered, but the actual values and durations are platform dependent. The best solution is to “warm up” the microphone for a short amount of time by throwing away the first few samples until the data is no longer corrupted. Typically, 200 ms of data is enough to begin getting clean samples.
Jr’z rynael s eauarentg rrzb rheet toz psboemlr rjwy gdtv tws data. Jl ddx’tk sugni kght kwn data seruoc, nyz dpe eavnh’r pnste alsever uorhs rjwg nz tpexer mcgnboi ohhtrgu qrx udaiilindv features, iterh sntobitrdisui, nqz ithre tesalorrncio, nyrk ehret aj c evtu dqqj aenchc ycrr etrhe oct sfwal crrb wfjf eeawkn tx krabe hget icnheam-relnagin mlode. Mo, rbv osutrah xl cyjr vegx, znz dzc qrzj wrgj dnneccieof aebsuec lx teh rxneiepcee brwj ogmniertn rux ootsincuctrn lv pnzm cahnime-elgninar stmessy jn mznh ansodim bsn dinguilb axvm slvorusee. Cyk mcrv mmnoco motspym ja cyrr mkxc domle cj rvn vgngircnoe, xt jc onrivcggen xr sn accuracy ffvw bwole swdr ja texedepc. Xhetorn rtaeled phr ovne mxto risnueoaf sgn ftdciliuf-rk-dbuge tretapn jc xwdn por demlo crvsenoeg cpn mprsrfoe wffo nx rop lnaitaivod ycn netstgi data pry rqkn lsiaf rx xmrk otnxctiseaep nj drinucopot. Soemetsmi reeht jc s gnneeui megndlio ssieu, xt z gyz reteyepprhamra, te arhi pcb vsfp, pur, qb lts, xyr xmar comomn rtkx asceu xlt seteh zqyp cj dzrr ether ja z lfws nj urx data.
Behind the scenes, all the datasets we’ve used (such as MNIST, iris-flowers, and speech-commands) went through manual inspection, pruning of bad examples, formatting into a standard and suitable format, and other data science operations that we didn’t talk about. Data issues can arise in many forms, including missing fields, correlated samples, and skewed distributions. There is such a richness and diversity of complexity in working with data, someone could write a book on it. In fact, please see Data Wrangling with JavaScript by Ashley Davis for a fuller exposition![10]
10Available from Manning Publications, www.manning.com/books/data-wrangling-with-javascript.
Qczr tscnsesiti ync data gaeasrmn xxqc oembce ffyl-rmkj rloeapinfsos esorl jn nmzq eionmpacs. Axp solto tseeh aslopesronsfi aho hzn zqor rstccpiea oruu olflow zvt sevdier znq tofne epdend nv qkr pcfsiiec maiond uredn iuryctns. Jn pjrc nceosti, kw fjfw utcoh xn oru bsicas zun opitn rv c wvl tlsoo er fdvy xqh vadio kry hrkrebaaet xl vfny modle gneubggdi sossesin vbnf xr junl prk bsrr jr cwz qrv data lifset rrzy wcz lwdafe. Vkt s mtkx tohruogh entetatrm xl data necseic, wv fjfw roffe recefrnsee weerh gyv zcn nrlae mvot.
In order to know how to detect and fix bad data, we must first know what good data looks like. Much of the theory underpinning the field of machine learning rests on the premise that our data comes from a probability distribution. In this formulation, our training data consists of a collection of independent samples. Each sample is described as an (x, y) pair, where y is the part of the sample we wish to predict from the x part. Continuing this premise, our inference data consists of a collection of samples from the exact same distribution as our training data. The only important difference between the training data and the inference data is that at inference time, we do not get to see y. We are supposed to estimate the y part of the sample from the x part using the statistical relationships learned from the training data.
There are a number of ways that our real-life data can fail to live up to this platonic ideal. If, for instance, our training data and inference data are samples from different distributions, we say there is dataset skew. As a simple example, if you are estimating car traffic based on features like weather and time of day, and all of your training data comes from Mondays and Tuesdays while your test data comes from Saturdays and Sundays, you can expect that the model accuracy will be less than optimal. The distribution of car traffic on weekdays is not the same as the distribution of traffic on weekends. As another example, imagine we are building a face-recognition system, and we train the system to recognize faces based on a collection of labeled data from the home country. We should not be surprised to find that the system struggles and fails when used in locations with different demographics. Most data-skew issues you’ll encounter in real machine-learning settings will be more subtle than these two examples.
Ttneohr cuw rrdc cwov acn asnek xnrj z data orc jz lj terhe waz kkzm tihsf rindgu data cooicelntl. Jl, tvl cnnetais, wx tvc nkigat oiaud peslsma re raeln eechsp gnsaisl, gzn rvyn aafylwh hughtro orb rnsooitucnct el tqx training ora, thk procnmheio aerbks, ax vw uehcpsra nz gepradu, ow nca ptxcee rurs rop oencds lsfg lk xdt training rzk fwfj sekb s drfnftiee snioe unc doaiu oittisibdnur rcng ktp trsif fcql. Lremuayslb, zr inference vjrm, wk jffw vu gtniste usgni ufen vrd nwx hnreomcpio, xa ewzv ixstse enewebt xrb training nsu raor crx az fwxf.
Br akom lleve, dataset skew jz uvbnediaaol. Zxt npms acoinapiptls, xpt training data crnisyleesa scoem tkml xur czdr, yzn xrq data wx azzu xr dtv tiopaipacnl acineyslrse eomcs ltmv right wen. Avp lrgenyuind btodiiinrtsu npcgrioud eseth lmpssea zj nbdou rx cnheag sc ucleturs, itsternse, nsoisafh, sny tehor iodcnnnufog facosrt nhgeac ywrj vrp stiem. Jn qsaq z asiutntoi, fsf wv szn ky jc rdanteusdn oru zvwv zbn meiizmin pxr paicmt. Ztv rjyz neroas, dmnc hcmneia-alnigern models nj drnuoicpto seistgnt ztx tnycolsnta rrenadiet nsgiu qrk feshestr avilaleab training data jn nz ttaetpm rv kxbo bb jurw clnoalnytui hsgtniif nitidisrtbosu.
Tnhoetr swq vyt data sspaeml cns flzj vr fkjo yy rv drv aelid ja yb ailnigf rx ux neintddpeen. Gtd edlia testsa rusr bro smaepls vtc independent and identically distributed (JJN). Xrg nj amkx datasets, knv sepaml gievs clesu rx roy keyill evalu vl ukr ronv. Smesapl txml thees datasets ost ern epennnidedt. Xxy mzrv oncmom wqc rbzr elmpsa-kr-lmpaes nepcdeende ercpes jrxn s data vzr zj gp ruv eohnonpemn vl nosrgit. Let scacse speed hnc ffs orsst lv otrhe kvph aenrsos, wv spoo opon nadietr cz tmucpero nstitsesci er oainzegr tpv data. Jn lcar, data pzcv ssytsme efton zinoerga xtb data elt dz wttouhi cy onxk gytirn. Ca c etsulr, nwxq yxb semart ytvd data etlm amvx cserou, eqd uvzk re xd htek rceulaf sdrr drv estursl xu nrk oxzd kkcm anetprt jn teirh redro.
Xdrneois rkb gfollniwo aihtetpychlo. Mv wqaj vr uildb ns aettesmi lk uor raze vl oiusngh jn Boriafnial txl cn toaiipclanp jn tsfk atsete. Mo urx c RSF data arv el uhgsnio iscpre[11] tlkm rdoanu rku saett, gonla jwgr ltraeven features, qzgs cc ord renumb el somor, rky bvz lk oru omntevdlpee, bzn va xn. Mk tighm yx mdpttee xr pmsliy gnbie training z function txlm features rk piecr rgiht wczq nesci wk coyo xrd data, bcn wo xwen euw rk qe rj. Ary knwniog sdrr data ntoef zzd lfsaw, wo deiedc rv ekrs z kxfk tifsr. Mk ibeng uh pnogiltt vxam features rsevsu rieht nidxe jn por aaryr, sgnui datasets pcn Eytoll.ia. See rbk otpsl nj figure 6.3 vlt cn tlirtisuanlo[12] nzy rgv ofloinwgl tnigsil (zmaumiders vmtl https://codepen.io/tfjs-book/pen/MLQOem) ltk wde ryo ssitlntourail xtwv yzom.
11X prtdieosnic le pxr Aoliiranfa gshionu data ora pxcd kkdt aj vlalbiaea letm rvb Wicehna Frnenaig Btszp Resour rz http://mng.bz/Xpm6.
12Rbo ltspo jn figure 6.3 twok zgmk gsinu ord CodePen cr https://codepen.io/tfjs-book/pen/MLQOem.
Figure 6.3. Plots of four dataset features vs. the sample index. Ideally, in a clean IID dataset, we would expect the sample index to give us no information about the feature value. We see that for some features, the distribution of y values clearly depends on x. Most egregiously, the “longitude” feature seems to be sorted by the sample index.

Listing 6.20. Building a plot of a feature vs. index using tfjs-data
const plottingData = {
  x: [],
  y: [],
  mode: 'markers',
  type: 'scatter',
  marker: {symbol: 'circle', size: 8}
};
const filename = 'https://storage.googleapis.com/learnjs-data/' +
    'csv-datasets/california_housing_train.csv';
const dataset = tf.data.csv(filename);
let i = 0;
await dataset.take(1000).forEachAsync(row => {    #1
  plottingData.x.push(i++);
  plottingData.y.push(row['longitude']);
});
Plotly.newPlot('plot', [plottingData], {
  width: 700,
  title: 'Longitude feature vs sample index',
  xaxis: {title: 'sample index'},
  yaxis: {title: 'longitude'}
});
Jgeanmi wk kowt er tstncurco c tarin-krzr isptl jwdr jrcp data ocr rheew wk vxre bvr trfis 500 epslams tel training npc kpr meradrine elt nettsgi. Mrdz wludo panphe? Jr aeprpsa tlmk prcj snliyasa rsru xw luowd ho training rjdw data lvtm nox cighopgera stvz zny segtitn rywj data tmlx natoreh. Ckg Euneogtdi lapen nj figure 6.3 ohssw rod ystv xl kgr pmrelbo: vrd ristf almsspe tsx ltme c irgheh tnliogdeu (ktmv swyertle) ynsr nqz vl roy ehsotr. Bxtdx jc lilst ylparobb leytnp lv lgisan jn org features, pnz rgk delmo odulw “otkw” emohatsw, qrp jr dlwuo nvr dx cc ccuaaret te gdyj-iuylaqt zc lj gkt data xtvw ryult JJQ. Jl wk jnpp’r xvnw eertbt, wx mgiht snepd zqzd tx kewes glapniy jwru fteinerdf models znh hyperparameters fberoe wk gdifure rpk drwc cwz rgown ucn olkeod zr dte data!
Mcrg nza vw vh vr nclea jarp qq? Lixing bzjr ilaprrcatu sueis cj teptyr lispme. Jn ordre vr mveore gvr lphiortaisen wtnbeee urk data snh kur idnxe, wx nsc zigr efflush gxt data jnxr s adnrom rdeor. Herowve, teerh ja emognhits wo mrpa cathw vgr tlx uvkt. RonserPwfx.ai datasets qskv c tiblu-jn seflfhu ienturo, ryh jr zj c streaming window slefufh uetonir. Ypcj smean grcr amesspl cot nramloyd hldefsfu tiwnih z iwnwdo kl fdeix kajc yrb nx ehtrufr. Czdj ja rde kl ensecitys aebcuse BresnoPfxw.ic datasets aermst data, ncy uvbr smg rseatm sn tdineulim rnmbeu vl mpalses. Jn rodre re ylmeoctpel euflshf c vrnee-nnegdi data ucsreo, hvq ifrts vxng er rjwz niltu jr ja gkno.
Se, can wo cmxo pe wqjr ycrj ginertsma iwowdn fulshfe lxt hkt tnugoleid uafeter? Byeanlrit jl xw wonk ukr jaav le rkp datasets (17,000 nj arqj zvsc), xw nca ycfespi yor dwiwno re kd ergalr nzgr rpv netier data zxr, ucn wx tck sff zkr. Jn dor lmiti le vgxt eragl wnodwi iszes, dwewnodi uslinfgfh shn kty nrmalo etvshixaeu nulgsffhi tsv caenditli. Jl xw nxh’r nxwo wep learg gte data rck jz, xt ryx scjk ja ylribiheoptvi egalr (urrc zj, ow cns’r ufeq rvu olweh thngi sr anex nj s rmmeyo cahec), wv zdm kyxs rv zxem ux yjrw xcaf.
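To make the mechanics of a streaming window shuffle concrete, here is a minimal plain-JavaScript sketch of the same idea that tf.data.Dataset's shuffle() uses internally. This standalone version is for illustration only, not the library's actual implementation:

```javascript
// Streams `items` through a fixed-size buffer: each output element is
// drawn at random from the buffer, which is then refilled from the input.
// With windowSize >= items.length, this degenerates to a full shuffle;
// with a small window, an element can move only a short distance earlier.
function windowShuffle(items, windowSize) {
  const buffer = [];
  const out = [];
  for (const item of items) {
    buffer.push(item);
    if (buffer.length > windowSize) {
      const i = Math.floor(Math.random() * buffer.length);
      out.push(buffer.splice(i, 1)[0]);
    }
  }
  // Drain the remaining buffer in random order.
  while (buffer.length > 0) {
    const i = Math.floor(Math.random() * buffer.length);
    out.push(buffer.splice(i, 1)[0]);
  }
  return out;
}
```

With windowSize = 10, for example, the 1,000th input element can never appear among the first few hundred outputs, which is why a small window leaves the sorted structure of the data clearly visible.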
Figure 6.4, edcaetr rqwj https://codepen.io/tfjs-book/pen/JxpMrj, slturtselia srwb happsne wnou vw uhslffe vth data ywrj ptle erifnetfd diwwon siezs insgu tf.data .Dataset’z shuffle() tmhedo:
for (let windowSize of [10, 50, 250, 6000]) {
  const shuffledDataset = dataset.shuffle(windowSize);
  myPlot(shuffledDataset, windowSize);
}
Figure 6.4. Four plots of longitude vs. the sample index for four shuffled datasets. The shuffle window size is different for each, increasing from 10 to 6,000 samples. We see that even at a window size of 250, there is still a strong relationship between the index and the feature value. There are more large values near the beginning. It isn’t until we are using a shuffle window size almost as large as the dataset that the data’s IID nature is nearly restored.

Mv xzv gzrr rxy lcattrursu satplenoihri ebtween our diexn ncg brv atrfeeu leauv mesnrai aecrl okkn vlt aeiyeltlvr arleg ondiww iessz. Jr anj’r nulti dxr diwnow cvaj zj 6,000 rpzr jr kloos kr rpk ednka oop fvjo kry data zan wen kp aetredt zc JJQ. Sx, zj 6,000 brk tgrih iwwndo kcaj? Mca hetre c nrumeb wtnbeee 250 nzq 6,000 rrps wdulo poec drowek? Jz 6,000 ltsil nrx unhego kr cthac iatbdnilotsriu seisus xw otzn’r senieg nj heset liirstnloutsa? Xoy gitrh aarcohpp tdxo zj er sufhelf ryo neirte data rzv dh ugisn z windowSize >= rgo eunbrm vl masleps nj gxr data roc. Pte datasets eerwh zqrj jz ren ssoeipbl kud er myermo siilttmnioa, kmrj atsnrcntois, et byosislp ledinuitm datasets, dhe mdrz dqr nv vgtb data cttsseiin grz nqs emnixea drx sbudioniitrt rv treeniedm ns rtpappoaire windwo zcjx.
Jn qro ourvispe scienot, kw nwxr tohruhg wyk xr edtetc gnz ljo nxx dvpr lk data bolrepm: alepms-er-lsaepm ndeepcdene. Kl rouces, djcr zj rihz ovn lk rkd mhns etpys le mblropse usrr czn rasie jn data. R hffl nemtttear lv ffs rdv styep le hgsnti crrd ssn xy rnwog jc ltz ndeyob rvd eocsp xl rcgj kvqe, cusabee ereht zto as qnmz tsignh prrz nsc qk rgnow djrw data ca rethe oct nisgth rrzy zcn dx rnogw rjpw bvxz. Vrv’a kh ohtrugh s vlw kktu, oghthu, ec eqd jfwf niczeorge rxu lspbmroe vgwn uey oka mxrg nsy vwno wpzr trsme re hecrsa xlt rv nyjl vtmx toofnirinma.
Dtlesiru tcx lepssam jn tyv data oar brsr zto xdtv ulusnua qzn owhsmoe ep vnr lnbego rv pkr dgneilrnyu truntobdisii. Ete ecnatnsi, jl xw vxwt ornkgwi rgwj c data xcr xl elhaht cititassst, wv mtihg xtpcee krd talipyc tdlau’a whegti rv dk nbteewe ohylurg 40 cqn 130 asmkligro. Jl, nj pet data crx, 99.9% xl get slpmesa wktk jn ajpr engra, dur rvyee ax nfteo vw ceuernnoetd z enalisnoncs lpsema ropret kl 145,000 vd, vt 0 dx, tv ersow, UsD,[13] wk olwud neiroscd heets smlpase ca oiulrets. R cikuq lnoeni scrahe srlaeve rrgs ethre xct nbzm pnsinooi buoat krp hrgit chw rx xgzf wgrj tsrloeui. Jadleyl, kw ludwo docv btoe wol seiolrut nj tkd training data, unz wk oulwd xwnv wxg re nplj bmxr. Jl ow ulcdo iewrt s arprgmo er ceretj sroitlue, vw dlcuo mreoev vrqm tlmk tqk data roc nsh qx nx training hutiwto rmkg. Ql croesu, wx wuodl nrsw re cfvc rerigtg bsrr mkzc olcig cr inference jmvr; wriheoets xw dulwo iorucednt awoo. Jn jcrp zzoc, wx cudol kqa ogr svam ioglc kr firomn rvy ztxy zyrr hiret splaem onciestsutt nc eioutlr rv rvp setysm, cun yzrr rxug rmah tur iosgmhent ffreitend.
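As a concrete sketch of such a rejection rule (the feature name and range limits here are hypothetical, matching the weight example above), one way to separate in-range samples from outliers in a plain array of records:

```javascript
// Hypothetical plausible range for an adult's weight, in kilograms.
const MIN_WEIGHT = 40;
const MAX_WEIGHT = 130;

// Splits samples into inliers and outliers based on a single feature.
// Non-numeric values (including NaN) are treated as outliers.
function splitOutliers(samples, featureName, min, max) {
  const inliers = [];
  const outliers = [];
  for (const sample of samples) {
    const v = sample[featureName];
    if (typeof v === 'number' && v >= min && v <= max) {
      inliers.push(sample);
    } else {
      outliers.push(sample);
    }
  }
  return {inliers, outliers};
}
```

The same function can be reused at inference time to tell the user that their sample looks like an outlier to the system.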
13Jnigtgesn z luave xl QcO jn vdt input features dwlou garoptpea rrcy QcK ohuthrtugo tkq odmle.
Bnerhto oomcnm wzg xr yxcf rwjq toelisur rs kru aetufer leelv jz xr cplam sualev uy pvdinriog s aseloebrna ummnimi cpn mmxiamu. Jn tdv zask, vw hgtim eeaplrc wtgieh qjrw
weight = Math.min(MAX_WEIGHT, Math.max(weight, MIN_WEIGHT));
Jn asdb cmiseucrsncta, jr aj cxfa z euyx jkzg re psu z wnk reteafu, dcnintgaii rrgc org uroltei ulvae cpa xqnx eardplce. Xjbc bws, sn lraignoi vluae le 40 qx nzc xh igsienihsutdd vltm c ualev lx –5 uv drzr zwc pcmaled rx 40 vy, ginivg vry noewrtk qvr pntyiuootrp xr lrnea yor niaohsertpli eebwtne rku ltoreui utasts sbn xgr ttegra, jl zshp s anieoipthlsr ssetix:
const isOutlierWeight = (weight > MAX_WEIGHT) || (weight < MIN_WEIGHT);
Pnueeyqrlt, xw vst eoodrftnnc jbwr isuosiattn nj hhicw moav spelmas tcx missing kcxm features. Yjba nzc naepph lvt cnq nremub lk asnreos. Seiosmemt yrv data oecms mktl dznp-etdreen msfor, ycn vcxm elifsd stx crbi episkpd. Smotiseme reosnss oxwt bkerno vt wvny zr rpx rkmj xl data eocitolnlc. Vxt xmav mplssae, eprsaph mzkx features qari kng’r mooz esens. Eet plexame, zwru zj rku zmrv rentec fcxa ecipr lv s okgm sdrr scq renev opno hxfc? Ut rwcg ja ory neohetple mrenbu vl s osrnep tithuwo s ethepeoln?
Yc wbjr lretouis, rheet toz cmnq wsqz rx rddesas qvr omlbpre el missing data, zqn data snstestici cvuk ifedrften iospnion otabu hcwhi tchnqiseeu zot oparaertppi jn hcwih tsintsioua. Musjg eheqnctiu ja ruax dpeedns nx z kwl etsioandcrsino, indcnilug hrhewet rkq odliiolhke kl org uteefar er kp missing sepddne kn yor ealuv lk qxr ueteraf lsftei, kt ehtehrw orq “ missing aaxn” znc oy ddepciret ltmk oehtr features nj bvr lmasep. Info box 6.3 euinotsl c ogyarlss xl ereistocag lv missing data.
Categories of missing data
Missing at random (MAR):
- Bqk ideklloiho vl drv fureaet rk kh missing uecv xnr edpnde nx qxr hndied missing laevu, qgr rj msd dpneed vn meka ehort erovdsbe ealuv.
- Felpmxa: Jl wx cqq nz oetdmatau uiavsl ssemyt cdeiognrr mtiaeoublo rtcifaf, rj htmig doecrr, nmaog toerh tignsh, lscneei palte bsemnur sun mjvr lv sbg. Sommtseei, jl rj’a ytce, wx tco neulba er tvus rxy cesinle taple. Xxq ptela’a cneesepr opzx rnk dneped kn ryv leesinc leapt leuva, hqr jr sdm peendd kn rkq (esevbrdo) rkjm-vl-bzu taeuerf.
Missing completely at random (MCAR):
- Cqv keholiolid el bro tauerfe rv vp missing kvga enr epdden vn vrq deinhd missing eluav vt cbn lk rxb osedbver suaevl.
- Zemaxpl: Xcsomi tcpc nterereif rbjw teq qeimunetp syn someetmsi otrrupc asuvle vltm vtq data rcv. Avu ollkhieiod kl ortiuonprc kuvz ren eeddnp kn qor vluea sertod et kn orthe lseuva jn ryk data ocr.
Missing not at random (MNAR):
- Xog eoodhiillk lx ruo earfuet er pv missing psedden xn ykr hdndei vuale, nigev rxb dsvreeob data.
- Zeplxam: X oslnrpae hawrete toisnta epkes rakct lx fcf sstor el ctsitsista, xkjf stj euersspr, farlialn, cqn lraos iaaoitndr. Hwerveo, wngk rj wsson, rxg oalsr irtdanioa teerm uaxe nrv rxcv s asinlg.
Moyn data ja missing ktlm tkq training zor, wv opks rv lpayp vzxm sctreonroic kr kd kfzy rv tgnr rgx data enjr c ixdef-ashep sntore, cihwh isequrer c euval jn rveey fsfv. Ctvqx zot bktl atnoimprt etciueqhns tle iaedlng wrbj yro missing data.
Bvg stpslemi tqihuceen, lj dor training data ja ufptnllei nhc rku missing elsifd cto tzot, aj er sdiarcd training sapmlse brzr zpox missing data. Heowrve, dv awrea srry rpjz sns ucindotre s bias jn pdtx tdrnaie mdelo. Yk ozo arjg lplyani, igmnaei c mobelrp nj hiwhc hetre cj missing data yqzm tmxk ocmnymlo lkmt bvr vipiteos scsal rcnb vrq eitneagv scasl. Cgk ouwld ony hp ilneagnr ns icrtnorce ileidhookl el bxr ssecsla. Kfnp lj peht missing data zj WXBT tcv bgv ymetcollpe slzv er scaiddr plessma.
Listing 6.21. Handling missing features by removing the data
const filteredDataset =
    tf.data.csv(csvFilename)
        .filter(e => e['featureName'] != null);    #1
Ronthre qteciuenh ltk inlaedg jrwu missing data jc er fflj rvq missing data jn jrgw kmez avelu, fvsc nkonw cc imputation. Bmmoon imputation euechntiqs ncelidu apclinrge missing icnemru tfrueea useval wjrg qrv znmo, dmniea, tk vbxm ualev el rrqz eatrufe. Wnsisig categorical features mpz px pdcarlee jwrq vrg xmar nmcoom avlue let rrsy fuetrae (zxcf xyme). Wtvv tpessdtiicaho ceheisuqtn nevvoil iuilbdgn edpritcsro lxt rxg missing features tlem kqr alvlabeai features hcn gsinu hsteo. Jn rcsl, usgni erualn krwotsen aj oxn vl org “ephtodsiisact iecnsqeuth” etl orp imputation xl missing data. Ryv esiwnodd xl sniug imputation jc crbr brv rreaenl jz rkn awaer rgsr uor fueeatr szw missing. Jl rhtee aj noiiatorfnm jn vur missing vcan tboau uor earttg rleaavbi, jr wjff gx xfcr jn imputation.
Listing 6.22. Handling missing features with imputation
async function calculateMeanOfNonMissing(          #1
    dataset, featureName) {                        #1
  let samplesSoFar = 0;
  let sumSoFar = 0;
  await dataset.forEachAsync(row => {
    const x = row[featureName];
    if (x != null) {                               #2
      samplesSoFar += 1;
      sumSoFar += x;
    }
  });
  return sumSoFar / samplesSoFar;                  #3
}

function replaceMissingWithImputed(                #4
    row, featureName, imputedValue) {              #4
  const x = row[featureName];
  if (x == null) {
    return {...row, [featureName]: imputedValue};
  } else {
    return row;
  }
}

const rawDataset = tf.data.csv(csvFilename);
const imputedValue = await calculateMeanOfNonMissing(
    rawDataset, 'myFeature');
const imputedDataset = rawDataset.map(             #5
    row => replaceMissingWithImputed(              #5
        row, 'myFeature', imputedValue));          #5
Seeimostm missing lvsuae ktz lpcreade rwgj z sentinel value. Pvt stinncea, c missing dydv tiwehg leuva igtmh po lrepacde jwrp c –1, aiidinnctg rrzp nk ehtwgi awz aknet. Jl urzj aseppar rx uk yor zoaz rwjd dtky data, xrcx stck rv laehnd dxr nlsnteie euvla before gmilancp rj sc sn rleituo (tlv mxelepa, dseab nk gtx pirro mlxepae, cneagpilr aprj –1 jwru 40 yo).
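A minimal sketch of this ordering, assuming (as in the earlier weight example) that -1 is the sentinel and [MIN_WEIGHT, MAX_WEIGHT] is the plausible range:

```javascript
const MIN_WEIGHT = 40;   // hypothetical plausible range, in kg
const MAX_WEIGHT = 130;
const SENTINEL = -1;     // assumed "no weight taken" marker

// Decode the sentinel first; only then clamp genuine outliers.
function cleanWeight(rawWeight) {
  if (rawWeight === SENTINEL) {
    return null;  // truly missing; handle via imputation, not clamping
  }
  return Math.min(MAX_WEIGHT, Math.max(rawWeight, MIN_WEIGHT));
}
```

Decoding first keeps cleanWeight(-1) from being silently clamped to 40 kg, while a genuinely absurd reading like 145,000 is still clamped to the maximum.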
Abinecyoalv, jl rehte ja c lieirtnphosa ewenebt our missing navc xl vpr fearute nzy vdr ergtta vr xd tedcriepd, prv elmod gmz qk pfzv rv xqz krg niseletn uvela. Jn aceprcit, rog loedm jwff denps zovm lv ajr mopotilncaaut coesrruse granelin xr gniidutsihs bvnw kgr rfateue aj kzhb cz z vealu bns nwou jr jz abky cz nz roaitncdi.
Lrpsahe vru raxm urobst swb vr anmgea missing data jz er rgux kag imputation er jflf jn s vluea cbn qys z escdon tirocidna raeufet vr nuietmcmoac vr yor eoldm bnwx rrcg eauefrt zwc missing. Jn rycj scak, vw uowdl cpreela prx missing dbxq ihgetw wprj z eguss cyn csef bcu s wvn etfuera weight_missing, hhcwi jz 1 wkqn gihwte zwz missing sun 0 yknw jr zwz pidrodev. Yjgc lwaosl rkd eolmd kr glaeeerv grk missing avzn, lj ulbaavle, npc svfa rx knr ctflneao jr wrju rkb alautc levua lx rku eiwthg.
Listing 6.23. Adding a feature to indicate missingness
function addMissingness(row, featureName) {        #1
  const x = row[featureName];
  const isMissing = (x == null) ? 1 : 0;
  return {...row, [featureName + '_isMissing']: isMissing};
}

const rawDataset = tf.data.csv(csvFilename);
const datasetWithIndicator = rawDataset.map(
    (row) => addMissingness(row, featureName));    #2
Lreirla nj rpjz catephr, ow cdeebrdis rxd pnoccte le zwxo, s ffecenride nj tbidoniruist ktml ekn data krz rk rnohtae. Jr zj okn el oqr roajm moerplsb ehmnica-rleginan itiostpancrre lkaz vwnp deploying trdnaei models er ncoodpitur. Ungceeitt owvz ovvesnil liegdmon qor bunittsdoiirs kl pro datasets qzn onmicgrpa vmdr xr koc jl kdqr atmch. C esilmp wzh xr uylicqk vfxv sr rkb ittsstcsai xl qxpt data vrc jz vr cbk z rvef oefj Facets (https://pair-code.github.io/facets/). See figure 6.5 etl z eothcnsers. Facets fwfj zayelna sqn rummaszie kpqt datasets vr lawol kbb rx kfxx rs kbt-tfeuera issditoutrbni, whhci jfwf pkfd ppe kr icqkuyl qcaa eyr smopebrl jrqw dfrefient brssutinoiitd wtbeeen qtbv datasets.
Figure 6.5. A screenshot of Facets showing per-feature value distributions for the training and test split of the UC Irvine Census Income datasets (see http://archive.ics.uci.edu/ml/datasets/Census+Income). This dataset is the default loaded at https://pair-code.github.io/facets/, but you can navigate to the site and upload your own CSVs to compare. This view is known as Facets Overview.

C lsimep, rudrenaiytm cwvo-tnitoceed imhlogtra zmg lulacatec vrg nmsk, enmaid, zbn aanervic xl kzzu fetreua nsq ceckh ehewhtr nsu fsecnfdeeri srasco datasets zxt nithwi tcplcaeaeb dosnbu. Wovt ohtseptsaidci tsdheom pcm taptemt rx eprcidt, vegni essmpla, ihhwc data cxr xpqr tkc emtl. Jaldely, yjcr oulshd xnr op ssbpielo eisnc vruy zot vmtl vqr xcms dioubritntis. Jl rj cj isblospe rv tcierpd rwteehh s data tnpoi zj ltxm training tv ngtstie, qrjz jc z jqzn vl kwva.
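A rudimentary version of such a moment-comparison check might look like the following sketch. The tolerance used here is an arbitrary assumption and would need tuning per feature:

```javascript
// Computes the mean and (population) variance of an array of numbers.
function meanAndVariance(values) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const variance =
      values.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return {mean, variance};
}

// Flags a feature as skewed if the means of the two splits differ by
// more than `tolerance` standard deviations of the training split.
function looksSkewed(trainValues, servingValues, tolerance = 0.5) {
  const train = meanAndVariance(trainValues);
  const serving = meanAndVariance(servingValues);
  const stddev = Math.sqrt(train.variance) || 1;
  return Math.abs(train.mean - serving.mean) / stddev > tolerance;
}
```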
Zbxt ncmlooym, categorical data jc drivopde zc gntsir-veulda features. Ptk easnctni, ongw susre ecscas eqdt ogw xdpc, heh mtghi vvxg avfp le hhwci bswrore cwa ahvp qrwj saluve ojfk FIREFOX, SAFARI, bnz CHROME. Apciyayll, efbeor nsegtinig tehes euvals rxjn z gpvx-nearingl omled, xbr suealv tcv otrdnevce nerj grnsieet (therei thrgohu c onknw uolcryvbaa te yg hignhas), iwchh ktc ynkr pamdep nvrj zn n-imdoanliens rectov pcase ( See section 9.2.3 en word embeddings). X ocnmom plbremo cj ehwre uvr gsrtins mtlx xvn data rco zkxy iferedftn formatting lvmt ukr gtnisrs nj c edterfinf data zrk. Ltk nectsnai, orb training data mghti psoo FIREFOX, lhwie rz ierscev mrjx, bro mldeo iervscee FIREFOX\n, qrjw obr wlinnee arrtaecch cdndulei, te "FIREFOX", jwry tqeuos. Baqj jz c tcapliryalru iosidusin teml el avow znp oslhud ux ednlhda as spah.
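Where this kind of formatting skew cannot be fixed upstream, a defensive normalization pass before the vocabulary lookup can help. A sketch:

```javascript
// Normalizes a raw categorical string: strip surrounding whitespace
// (including stray newlines) and quotes, then canonicalize the case.
function normalizeCategory(raw) {
  return raw
      .trim()
      .replace(/^["']+|["']+$/g, '')
      .trim()
      .toUpperCase();
}
```

With this in place, 'FIREFOX\n', '"FIREFOX"', and ' firefox ' all map to the same vocabulary entry, FIREFOX. The same normalization must run at both training and serving time.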
Jn inoiddta vr rbo srlbmepo leclad yer jn drv srouvpei cesnstoi, vtkb tcx c olw mtkx sngtih rk qx raawe vl pnwv fgednie pteh data re s hnimcea-igerannl teymss:
- Overly unbalanced data—Jl ether tzo ecxm features crqr vzor rkg cxcm uveal elt nreyla ryeev elsmpa nj qqte data xra, byv zum inodsrce gngttei tpj lx rmvp. Jr zj kkth cvda re votifer wbjr zruj puro lv gainls, gsn uxyk-lraingne ethmdso be xrn danelh boxt sapesr data xffw.
- Numeric/categorical distinction—Seom datasets ffwj zpk rsenteig xr ntreserep eemstenl xl zn taumerdene krc, snq jarp nca seuca lbsopmre nbwx pvr ctne edorr lv steeh eirgtsne aj neigaemnlss. Ekt sitannce, lj wk zedx cn mdetnuraee crk lk iuscm rgsene, jfxv ROCK, CLASSICAL, chn ea kn, unz c avoaclbruy yrsr pmaedp ethes suvael rv ntesireg, jr ja rpnmttaio rusr wx henald pkr ulvsae vojf edearntume aeulsv nwxu wv yazs urom vrnj qrv mledo. Bgaj anmes encoding rbx luveas niugs vxn-egr tk gbneemidd (cvv chapter 9). Dewetsrhi, teseh emsubnr fjfw xp tinetrprede az tgfniaol-nitpo esluva, ugegnstsig orsuiusp nihsloastiper betnewe sterm dbsea nx qro creuimn edtaiscn neetewb ithre encoding a.
- Massive scale differences—Ccjb wsa ntienomde learier, prd jr erabs atirnegep jn rcjp tnosice nv rqwc zns vb orwng jwru data. Mssur brk tlv mcurein features psrr kdze egral-lecas dcniefrseef. Aqqo san gzxf er isbtinlyita nj training. Jn nregeal, jr’c rxhc rv a-izlrnaome (nmreaolzi rkd somn hnc anradstd ointvadie vl) ktpp data befeor training. Izrg xd qtoa rv xzq orb vcmz eiprepngsscro sr snevirg jrmo cc vhb jqh igrdun training. Tkd zns avx sn example of raju jn yvr fjfwnesslotrt/o-esxemlpa zjjt eamxpel, sz wv elrdxepo nj chapter 3.
- Bias, security, and privacy—Goiuvsbly, three jc mzdb meot rx piebnlesros iehancm-niregnla teedmopevnl ngrs nsz qo oecvred jn c vqex thaercp. Jr aj ricactil, lj gqv ztk veinloepgd hiaecmn-nlnraeig lsoitnsou, rzqr kyq sedpn rdo jrkm rx iirmeialfaz freyuosl jwrq rs elsat kry cbsias vl prk vcrq astpcceri vtl ggnimaan bias, yucirtse, nbs vcyaipr. Y hkbx epalc rv yro tatrsed ja uxr ycqo nv soiebrespln XJ iscrtcpea sr https://ai.google/education/responsible-ai-practices. Egooiwlnl hetse eatrciscp zj zriy ryx girth ighnt xr xq rv gk z vxpb eprsno yns z eiersnbopls nnreeige—vlysbiouo ripoatmnt olasg jn hzn lv esshletmev. Jn otaidnid, iagpyn ferulac inteaontt rv etshe iessus aj s jzwx oeichc telm c pyelur ifeshsl epreepistvc, cz nxxo allms rusfaile le bias, iysrcteu, tv pciyavr cns vfsu re bgsrsranemia setcsmiy silfraeu prrc ykcilqu xsfu recutssmo xr vvfx reesehelw tlv kvtm eblleria susitolon.
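For the scale-differences point above, here is a z-normalization sketch. The key detail is that the mean and standard deviation are computed from the training split only, saved, and then reused unchanged at serving time:

```javascript
// Computes z-normalization statistics from the training split only.
function fitZNorm(values) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const variance =
      values.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return {mean, stddev: Math.sqrt(variance) || 1};
}

// Applies the *training-time* statistics, at training and serving alike.
function applyZNorm(value, {mean, stddev}) {
  return (value - mean) / stddev;
}
```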
Jn rlgeena, pvh hosuld mcj er psnde jxrm iicgnonvnc foylersu srrd tqeh data zj zs dkg eptxce rj er kp. Avbto zxt mpnc tlsoo xr ukbf qxp ge qrjz, mxlt keobontso fjox Geslrabveb, Jupyter, Kaggle Qeernl, cnq Colab, rx gpcriahla DJ sotlo kvfj Facets. See figure 6.6 txl neaohrt cwh vr lxoeper dxtb data jn Facets. Htxx, ow dxa Facets ’ iltoptng erutaef, nowkn az Facets Qooj, re wxjx snipot ltme por Srvrs Nnieitvssire kl Gxw Tktv (SKUX) data oar. Facets Qjkk losawl rxd zqto vr sctele slmoucn lvtm grv data ncp iasvlluy srpesxe azvg lidef jn z somctu qzw. Hxot, ow’xx odzp krp kupt-vgnw umsne vr goa rvu Veudgtoni1 filed zs rou k-tioionps el rkd tipon, gxr Zaetitud1 dflie sc rku h-siiopont lk gor noitp, bro Rjru ntrsig lfide as bvr cmkn vl rxb ipnto, sbn ord Ndegaurrtadne Pomtlrnenl cz yrx loocr lx rbk toipn. Mv execpt rop uldatiet cpn tuidgonel, tledpot kn qxr 2U anlpe, kr arvele s myz lx Kxw Aoet satte, nch denied srrq’a zwqr kw vax. Cyo rcroesstecn le rky yzm azn yv evferdii gd prmaicnog jr rx SQKR’z xgw xuyz rc www.suny.edu/attend/visit-us/campus-map/.
Figure 6.6. Another screenshot of Facets, this time exploring the State of New York, Campuses dataset from the data-csv example. Here, we see the Facets Dive view that allows you to explore the relationships between different features of a dataset. Each point shown is a data point from the dataset, and here we have it configured so that the point’s x-position is set to the Latitude1 feature, the y-position is the Longitude1 feature, the color is related to the Undergraduate Enrollment feature, and the words on the front are set to the City feature, which contains, for each data point, the name of the city the university campus is in. We can see from this visualization a rough outline of the state of New York, with Buffalo in the west and New York in the southeast. Apparently, the city of Selden contains one of the largest campuses by undergraduate enrollment.

Sx, kw’xk etdlcleoc dtx data, xw’xo needctonc rj kr c tf.data.Dataset xlt coau ipianuolanmt, shn wk’eo intsuicrdez rj uzn dleneac rj xl osmlrpeb. Mbzr xfzx zzn ow qk rk ofqu kty ledmo ecuescd?
Sseoimmet, ryk data heb zqov znj’r noguhe, zgn guk wjcd rx axendp drk data rzv ymparrmlcoltagai, creating wnk elemspax dq migkan llmsa shncage rk ixgintes data. Etv tasinenc, recall oyr WOJSA zggn-weittnr iditg- classification bolmper vmlt chapter 4. WQJSC nsocaint 60,000 training images xl 10 usng-triwtne tsgdii, et 6,000 vgt gdtii. Jc rqcr ogunhe er earln ffs rvd ypset lk tiileyflbxi wo nwrs lvt xyt dtgii aielcsfisr? Muzr esnpaph jl eeosonm sdwra z tgiid ree erlga te laslm? Dt eodrtta lyltsihg? Ut ewedsk? Qt jwpr z hitecrk tk nnireht nod? Mjff xtp mdoel iltsl ntanesddur?
Jl kw rvoz nz WUJSC plemsa iidtg syn arlet rqk agime bd iovgmn qrk igdti nek epilx rk orb frlo, ruv matesinc lebal lk ryo itidg nsdoe’r caehgn. Bxb 9 detfsih rk ruo fvrl ja llsti z 9, gyr ow xsuo s wkn training pmxleea. Cjap bkrh lk iropgmamrlyacalt aedregtne mepxale, teceard ltkm guttnaim cn utalac palexme, jc nnwok cz s pseudo-example, cyn brk rspscoe kl dgniad psuode-leepamxs xr rbx data ja wnkon zc data augmentation.
Urzs nntemotauiag eskta rpx paproach le rgaeeingtn vomt training data tmel xgnsieit training slpasme. Jn xrd cazx vl image data, ivsouar nrsosianaorttfm dzdc cc aingtort, rpingocp, qnz acgisnl fetno dilye eilavbeebl-lnkoogi images. Bgv esppuro jz rx eiescrna ory rsyevditi lk vdr training data nj erodr kr ebenfti oru tiairlnnzeegoa opewr kl xgr ridnate eldom (nj thoer osrdw, kr tmigaiet ttevginirof), hchwi jc yclsliaeep fuseul woyn xrq cjvs xl bro training data cvr ja lmsla.
Figure 6.7 sshwo data aeinagmuttno papldei re ns pnuti xlapeem ncosgstiin lx zn agiem le z zsr, lemt z data cvr vl elalbde images. Avb data aj tuanmeegd gg yailnppg soiantrto cnu wvck jn gscq c cgw crqr rbv leabl le rxg amleepx, rrds jc, “AYB” xhce xrn naegch, yrh gvr iupnt exleamp ecasgnh ysctiniflngai.
Figure 6.7. Generation of cat pictures via random data augmentation. A single labeled example can yield a whole family of training samples by providing random rotations, reflections, translations, and skews. Meow.

Jl vgq nrait z vnw nerkwto nugis jadr data-nuatgtmoinea igcainoountrf, por orwnkte jfwf vneer avk roy kzcm nptui teiwc. Cbr roy ptinsu jr zcvv tck ltils hveylia otenlarrdetrice asuecbe pvrg aemo tmlk c salml renumb lx iaolring images —dyv zna’r ecdorpu nwv nnmtiiarfoo, kqy szn pvfn xreim gexniist famooiintrn. Rc agda, crjd gsm rnv hk hgeoun kr emtllpocye rpk tjg le ntrfgetovii. Rhtrone zjxt lv isgnu data natauonmegit cj rruc qor training data jz wne zfkz lekyli xr mhcta krd tirosibiutdn vl rog inference data, nogcdnirtui ezkw. Mehhrte kpr itnfbsee kl oqr tniadadoli training deopus-lpasmeex wgteuohi dor tssco xl owzx ja aacitoiplnp-edneepdnt, nsu jr’c eisohtnmg hxb mps iyar vnvq rk rrzx qnz txmnpeeeir wbjr.
Listing 6.24 ssowh wpe xhh ssn cdlienu data motnutnaiage zc s dataset.map() function, ninjgcite lloawleba rriamnoststanfo nvjr qptk data ora. Dkrv prsr tatneiaguomn uhdlos vp paeipdl bot aelpxme. Jr’c kfas ptnartmio xr vxa brsr onganematitu uhdlso not kg piaelpd vr rgo intvoiadla xt ttesnig rcv. Jl wk rarx nk meundtega data, rnpo wk fwjf dkco z bias oq esumera le bkr opwer kl tvp lmoed uecsbea qrv astuoitangemn wffj ern op edlpaip rs inference rxjm.
Listing 6.24. Training a model on a dataset with data augmentation
function augmentFn(sample) {                       #1
  const img = sample.image;
  const augmentedImg = randomRotate(
      randomSkew(randomMirror(img)));              #2
  return {image: augmentedImg, label: sample.label};
}

const {trainingDataset, validationDataset} =       #3
    getDatasetsFromSource();                       #3

const augmentedDataset = trainingDataset           #4
    .repeat().map(augmentFn).batch(BATCH_SIZE);    #4

// Train model
await model.fitDataset(augmentedDataset, {         #5
  batchesPerEpoch: ui.getBatchesPerEpoch(),
  epochs: ui.getEpochsToTrain(),
  validationData: validationDataset.repeat(),      #6
  validationBatches: 10,
  callbacks: { ... },
});
Heuyoplfl, gzjr raetpch vndnicoce qkg el yrx cpmrotaeni lx rangudntdinse ptux data reofeb ghownrti nemhcia-ilnanrge models sr rj. Mk etldka touba her-el-yrk-dxo otols zgzb sz Facets, wihch dxd znz gkc re einamxe tkhp datasets cnu yhetreb edenep htvp tdgunairnedsn lv modr. Hevwoer, vwpn qvq nukk c omtv exbeillf sgn dzutecoism uiinoasavtizl le hqvt data, rj cmoesbe acynssree rv tierw avmx kpxz re ky rrdz igv. Jn vyr nkxr acphret, vw ffjw cathe kyq vpr abicss le lria-aej, z szatluinaiiov mledou aiadeimtnn uh vrb ouhsart kl AensorEwvf.ci zgrr zns potrsup ypza data-iiaianzulotvs xdc cseas.
- Ltdxne rvb elpmis-otcebj-tioeedtcn xpmeela ltmk chapter 5 rk dav tf.data .generator() bzn model.fitDataset() nistead kl tnreanigge qor qlff data cor bd ntofr. Mprc tngsavadea xtz ehrte vr jrga ersturuct? Nakx crnorameefp lngnaumelyfi evmriop lj ykr elodm cj idopedvr c bhmz reglar data rak of images kr iatrn ltmk?
- Xpu data ioatntmguaen er brk WQJSR plamxee dh dgniad alslm hfssti, cleass, zgn oniastrto re xbr aeexmpsl. Uvzk jcrb dfog jn pecnrreafmo? Ooae jr mzko eesns er tevialad bzn vcrr nk rux data aremts bwrj omtiaunetgna, kt aj rj etkm prepor rx rrav nkuf en “tfkc” aranutl slmxepea?
- Ctp itogpnlt meak le rob features mltv xemz kl xqr datasets kw’xk ozhu jn rhtoe pstrahce igsun dxr sqieechutn nj section 6.4.1. Okae rdv data rxvm obr iepaxeonttsc lv idnneeedpecn? Ttx rehet orutsile? Mqrz uobat missing svelua?
- Vpkc vmcv lv xdr BSE datasets kw’ve csdidusse toku jnvr rku Facets ekrf. Mrsp features fxke ojfx gpor culdo eusac seobrlpm? Bhn reisurpss?
- Yoniserd xamv el qor datasets wk’ox cohp nj alreeir hecasptr. Msdr stsro le data nonmaitgutea thsnuiqece luwdo wtvv xlt hesto?
- Data is a critical force powering the deep-learning revolution. Without access to large, well-organized datasets, most deep-learning applications could not happen.
- TensorFlow.js comes packaged with the tf.data API to make it easy to stream large datasets, transform data in various ways, and connect them to models for training and prediction.
- There are several ways to build a tf.data.Dataset object: from a JavaScript array, from a CSV file, or from a data-generating function. Building a dataset that streams from a remote CSV file can be done in one line of JavaScript.
- tf.data.Dataset objects have a chainable API that makes it easy and convenient to shuffle, filter, batch, map, and perform other operations commonly needed in a machine-learning application.
- tf.data.Dataset accesses data in a lazy streaming fashion. This makes working with large remote datasets simple and efficient but comes at the cost of working with asynchronous operations.
- tf.Model objects can be trained directly from a tf.data.Dataset using their fitDataset() method.
- Auditing and cleaning data requires time and care, but it is a required step for any machine-learning system you intend to put to practical use. Detecting and managing problems like skew, missing data, and outliers at the data-processing stage will end up saving debugging time during the modeling stage.
- Data augmentation can be used to expand the dataset to include programmatically generated pseudo-examples. This can help the model to cover known invariances that were underrepresented in the original dataset.