This chapter covers
- Adding additional data to your analysis
- Using pandas to fill in missing values
- Visualizing your time-series data
- Using a neural network to generate forecasts
- Using DeepAR to forecast power consumption
In chapter 6, you worked with Kiara to develop an AWS SageMaker DeepAR model to predict power consumption across her company’s 48 sites. You had just a bit more than one year’s data for each of the sites, and you predicted the power consumption for November 2018 with an average percentage error of less than 6%. Amazing! Let’s expand on this scenario by adding additional data to our analysis and filling in any missing values. First, let’s take a deeper look at DeepAR.
The DeepAR algorithm was able to identify patterns such as weekly trends in our data from chapter 6. Figure 7.1 shows the predicted and actual usage for site 33 in November. This site follows a consistent weekly pattern.
Figure 7.1. Predicted versus actual consumption from site 33 using the DeepAR model you built in chapter 6

You and Kiara are heroes. The company newsletter included a two-page spread showing you and Kiara with a large printout of your predictions for December. Unfortunately, when January came around, anyone looking at that photo would have noticed that your predictions weren’t that accurate for December. Fortunately for you and Kiara, not many people noticed because most staff take some holiday time over Christmas, and some of the sites had a mandatory shutdown period.
“Wait a minute!” you and Kiara said at the same time as you were discussing why your December predictions were less accurate. “With staff taking time off and mandatory shutdowns, it’s no wonder December was way off.”
When you have rare but still regularly occurring events like a Christmas shutdown in your time-series data, your predictions will still be accurate, provided you have enough historical data for the machine learning model to pick up the trend. You and Kiara would need several years of power consumption data for your model to pick up a Christmas shutdown trend. But you don’t have this option because the smart meters were only installed in November 2017. So what do you do?
Fortunately for you (and Kiara), SageMaker DeepAR is a neural network that is particularly good at incorporating several different time-series datasets into its forecasting. These additional datasets can be used to account for events in your time-series forecasting that your time-series data can’t directly infer.
To demonstrate how this works, figure 7.2 shows time-series data covering a typical month. The x-axis shows the days of the month. The y-axis is the amount of power consumed on each day. The shaded area is the predicted power consumption with an 80% confidence interval. An 80% confidence interval means that 4 out of every 5 days will fall within this range. The black line shows the actual power consumption for each day. In figure 7.2, you can see that the actual power consumption was within the confidence interval for every day of the month.
Figure 7.3 shows a month with a shutdown from the 10th to the 12th day of the month. You can see that the actual power consumption dropped on these days, but the predicted power consumption did not anticipate this.
There are three possible reasons why the power consumption during this shutdown was not correctly predicted. First, the shutdown is a regularly occurring event, but there is not enough historical data for the DeepAR algorithm to pick up the recurring event. Second, the shutdown is not a recurring event (and so can’t be picked up in the historical data) but is an event that can be identified through other datasets. An example of this is a planned shutdown where Kiara’s company is closing a site for a few days in December. Although the historical data set won’t show the event, the impact of the event on power consumption can be predicted if the model incorporates planned staff schedules as one of its time series. We’ll discuss this more in the next section. Finally, the shutdown is not planned, and there is no data set that could be incorporated to show the shutdown. An example of this is a work stoppage due to an employee strike. Unless your model can predict labor activism, there is not much your machine learning model can do to predict power consumption during these periods.
To help the DeepAR model predict trends, you need to provide it with additional data that shows those trends. As an example, you know that during the shutdown periods, only a handful of staff are rostered. If you could feed this data into the DeepAR algorithm, then it could use this information to predict power consumption during shutdown periods.[1]
1You can read more about DeepAR on the AWS site: https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html.
Figure 7.4 shows the number of staff rostered in a month during a shutdown. You can see that for most days, there are between 10 and 15 staff members at work, but on the 10th, 11th, and 12th, there are only 4 to 6 staff members.
If you could incorporate this time series into the DeepAR model, you would better predict upcoming power consumption. Figure 7.5 shows the prediction when you use both historical consumption and rostering data in your DeepAR model.
In this chapter, you’ll learn how to incorporate additional datasets into your DeepAR model to improve the accuracy of the model in the face of known upcoming events that are either not periodic, or are periodic but you don’t have enough historical data for the model to incorporate them into its predictions.
In chapter 6, you helped Kiara build a DeepAR model that predicted power consumption across each of the 48 sites owned by her company. The model worked well when predicting power consumption in November, but performed less well when predicting December’s consumption because some of the sites were on reduced operating hours or shut down altogether.
Additionally, you noticed that there were seasonal fluctuations in power usage that you attributed to changes in temperature, and you noticed that different types of sites had different usage patterns. Some types of sites were closed every weekend, whereas others operated consistently regardless of the day of the week. After discussing this with Kiara, you realized that some of the sites were retail sites, whereas others were industrial or transport-related areas.
In this chapter, the notebook you’ll build will incorporate this data. Specifically, you’ll add the following datasets to the power consumption metering data you used in chapter 6:
- Site categories—Indicates a retail, industrial, or transport site
- Site holidays—Indicates whether a site has a planned shutdown
- Site maximum temperatures—Lists the maximum temperature forecast for each site each day
Then you’ll train the model using these three datasets.
Different types of datasets
The three datasets used in this chapter can be classified into two types of data:
- Categorical—Information about the site that doesn’t change. The site categories data set, for example, contains categorical data. (A site is a retail site and will likely always be a retail site.)
- Dynamic—Data that changes over time. Holidays and forecasted maximum temperatures are examples of dynamic data.
When predicting power consumption for the month of December, you’ll use a schedule of planned holidays for December and the forecasted temperature for that month.
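This distinction maps directly onto the record format DeepAR consumes (you’ll build these records for real in listing 7.19). As a rough sketch, with all values invented for illustration, one site’s record carries a fixed categorical code plus one value per day for each dynamic feature:

# Hypothetical three-day record for one site (values made up):
record = {
    "cat": [1],                          # categorical: fixed for the site
    "start": "2017-11-01 00:00:00",      # first day of the time series
    "target": [1129.6, 1083.2, 1212.4],  # power consumed each day
    "dynamic_feat": [
        [0, 0, 1],                       # dynamic: holiday flag per day
        [28.1, 27.4, 30.2],              # dynamic: max temperature per day
    ],
}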
As in previous chapters, you need to do the following to set up another notebook in SageMaker and fine-tune your predictions:
- From S3, download the notebook we prepared for this chapter.
- Set up the folder to run the notebook on AWS SageMaker.
- Upload the notebook to AWS SageMaker.
- Download the datasets from our S3 bucket.
- Create a folder in your S3 bucket to store the datasets.
- Upload the datasets to your AWS S3 bucket.
Given that you’ve followed these steps in each of the previous chapters, we’ll move quickly through them in this chapter.
We prepared the notebook you’ll use in this chapter. You can download it from this location:
Go to AWS SageMaker at https://console.aws.amazon.com/sagemaker/home, and select Notebook Instances from the left-hand menu. If your instance is stopped, you’ll need to start it. Once it is started, click Open Jupyter.
Azjd peson c wkn qrs ngs hossw gey z zjfr lx folders in SageMaker. Jl dvb bzok kngv gonfiwlol gnalo jn elrriae esctraph, gep fjfw xcxp c fdoler tle gszo lv gxr earelir speahrct. Ytreea z vnw drfoel ltx jrag rtacpeh. Mo’kk elacld yte erfdlo gz07.
Yjzvf urv efdlro kdb’xx riga ecdrtae, nsg lccik Oopdal rk uopdal krq eotobonk. Scetel rgx oktenoob xqq lddoodwaen jn vryz 1, znu dpaolu jr rk SageMaker. Figure 7.6 hswso cwry qtpk SageMaker drlefo tihmg kvxf ojvf rftea uploading xur onotoebk.
We stored the datasets for this chapter in one of our S3 buckets. You can download each of the datasets by clicking the following links:
- Meter data—https://s3.amazonaws.com/mlforbusiness/ch07/meter_data_daily.csv
- Site categories—https://s3.amazonaws.com/mlforbusiness/ch07/site_categories.csv
- Site holidays—https://s3.amazonaws.com/mlforbusiness/ch07/site_holidays.csv
- Site maximums—https://s3.amazonaws.com/mlforbusiness/ch07/site_maximums.csv
The power consumption data you’ll use in this chapter is provided by BidEnergy (http://www.bidenergy.com), a company that specializes in power-usage forecasting and in minimizing power expenditure. The algorithms used by BidEnergy are more sophisticated than what you’ll see in this chapter, but you’ll still get a feel for how machine learning in general, and neural networks in particular, can be applied to forecasting problems.
In AWS S3, go to the bucket you created to hold your data in earlier chapters, and create another folder. You can see a list of your buckets in the S3 console.
The bucket we are using to hold our data is called mlforbusiness. Your bucket will be called something else (a name of your own choosing). Once you click into your bucket, create a folder to store your data, naming it something like ch07.
After creating the folder on S3, upload the datasets you downloaded in step 4. Figure 7.7 shows what your S3 folder might look like.
With the data uploaded to S3 and the notebook uploaded to SageMaker, you can now start to build the model. As in previous chapters, you’ll go through the following steps:
- Set up the notebook.
- Import the datasets.
- Get the data into the right shape.
- Create training and test datasets.
- Configure the model and build the server.
- Make predictions and plot results.
Listing 7.1 shows your notebook setup. You will need to change the value in line 1 to the name of the S3 bucket you created on S3, then change line 2 to the subfolder of that bucket where you saved the data. Line 3 sets the location of the training and test data created in this notebook, and line 4 sets the location where the model is stored.
Listing 7.1. Setting up the notebook
data_bucket = 'mlforbusiness'   #1
subfolder = 'ch07'              #2
s3_data_path = \
    f"s3://{data_bucket}/{subfolder}/data"      #3
s3_output_path = \
    f"s3://{data_bucket}/{subfolder}/output"    #4
The next listing imports the modules required by the notebook. These are the same as the imports used in chapter 6, so we won’t review them here.
Listing 7.2. Importing Python modules and libraries
%matplotlib inline

from dateutil.parser import parse
import json
import random
import datetime
import os

import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import s3fs
import sagemaker

role = sagemaker.get_execution_role()
s3 = s3fs.S3FileSystem(anon=False)
s3_data_path = f"s3://{data_bucket}/{subfolder}/data"
s3_output_path = f"s3://{data_bucket}/{subfolder}/output"
Unlike other chapters, in this notebook you’ll import four datasets: meter data, site categories, holidays, and maximum temperatures. The following listing shows how to import the meter data.
Listing 7.3. Importing the meter data
daily_df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/meter_data_daily.csv',
    index_col=0,
    parse_dates=[0])
daily_df.index.name = None
daily_df.head()
The meter data you use in this chapter has a few more months of observations. In chapter 6, the data ranged from October 2017 to October 2018. This data set contains meter data from November 2017 to February 2019.
Listing 7.4. Displaying information about the meter data
print(daily_df.shape)
print(f'timeseries starts at {daily_df.index[0]} \
and ends at {daily_df.index[-1]}')
Listing 7.5 shows how to import the site categories data. There are three types of sites:
- Retail
- Industrial
- Transport
Listing 7.5. Displaying information about the site categories
category_df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/site_categories.csv',
    index_col=0
    ).reset_index(drop=True)
print(category_df.shape)
print(category_df.Category.unique())
category_df.head()
In listing 7.6, you import the holidays. Working days and weekends are marked with a 0; holidays are marked with a 1. There is no need to mark all the weekends as holidays because DeepAR can pick up that pattern from the site meter data. Although you don’t have enough site data for DeepAR to identify annual patterns, DeepAR can work out the pattern if it has access to a data set that shows holidays at each of the sites.
Listing 7.6. Displaying information about holidays for each site
holiday_df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/site_holidays.csv',
    index_col=0,
    parse_dates=[0])
print(holiday_df.shape)
print(f'timeseries starts at {holiday_df.index[0]} \
and ends at {holiday_df.index[-1]}')
holiday_df.loc['2018-12-22':'2018-12-27']
Listing 7.7 shows the maximum temperature reached each day for each of the sites. The sites are located in Australia, so energy usage increases as temperatures rise in the summer due to air conditioning; whereas, in more temperate climates, energy usage increases more as temperatures drop below zero degrees centigrade in the winter due to heating.
Listing 7.7. Displaying information about maximum temperatures for each site
max_df = pd.read_csv(
    f's3://{data_bucket}/{subfolder}/site_maximums.csv',
    index_col=0,
    parse_dates=[0])
print(max_df.shape)
print(f'timeseries starts at {max_df.index[0]} \
and ends at {max_df.index[-1]}')
With that, you are finished loading data into your notebook. To recap, for each site for each day from November 1, 2017, to February 28, 2019, you loaded data from CSV files for
- Energy consumption
- Site category (Retail, Industrial, or Transport)
- Holiday information (1 represents a holiday, and 0 represents a working day or normal weekend)
- Maximum temperatures reached at the site
You will now get the data into the right shape to train the DeepAR model.
With your data loaded into DataFrames, you can now get each of the datasets ready for training the DeepAR model. The shape of each of the datasets is the same: each site is represented by a column, and each day is represented by a row.
In this section, you’ll ensure that there are no problematic missing values in any of the columns or rows. DeepAR is very good at handling missing values in training data but cannot handle missing values in the data it uses for predictions. To ensure that you don’t have annoying errors when running predictions, you fill in any missing values in your prediction range. You’ll use November 1, 2017, to January 31, 2019, to train the model, and you’ll use December 1, 2018, to February 28, 2019, to test it. This means that for your prediction range, there cannot be any missing data from December 1, 2018, to February 28, 2019. The following listing replaces any zero values with None and then checks for missing energy consumption data.
Listing 7.8. Checking for missing energy consumption data
daily_df = daily_df.replace([0], [None])
daily_df[daily_df.isnull().any(axis=1)].index
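If the replace-then-check pattern is new to you, the following mini-example (the DataFrame and its values are invented for illustration) shows zeros becoming missing values that isnull() can then flag:

# Hypothetical mini-example: zeros become missing values,
# which isnull().any(axis=1) then flags row by row.
demo = pd.DataFrame({'site_1': [3.2, 0, 4.1],
                     'site_2': [5.0, 5.1, 0]})
demo = demo.replace([0], [None])
print(demo[demo.isnull().any(axis=1)].index)  # rows 1 and 2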
You can see from the output that there are several days in November 2017 with missing data because that was the month the smart meters were installed, but there are no days with missing data after November 2017. This means you don’t need to do anything further with this data set because there’s no missing prediction data.
The next listing checks for missing category data and holiday data. Again, there is no missing data, so you can move on to missing maximum temperatures.
Listing 7.9. Checking for missing category data and holiday data
print(f'{len(category_df[category_df.isnull().any(axis=1)])} \
sites with missing categories.')
print(f'{len(holiday_df[holiday_df.isnull().any(axis=1)])} \
days with missing holidays.')
The following listing checks for missing maximum temperature data. There are several days without maximum temperature values. This is a problem, but one that can be easily solved.
Listing 7.10. Checking for missing maximum temperature data
print(f'{len(max_df[max_df.isnull().any(axis=1)])} \
days with missing maximum temperatures.')
The next listing uses the interpolate function to fill in missing data for a time series. In the absence of other information, the best way to infer missing values for a temperature time series like this is straight-line interpolation based on time.
Listing 7.11. Fixing missing maximum temperature data
max_df = max_df.interpolate(method='time')   #1
print(f'{len(max_df[max_df.isnull().any(axis=1)])} \
days with missing maximum temperatures. Problem solved!')
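To see what method='time' does, consider a small sketch with an invented three-day series; the missing middle value is filled in proportion to its position in time between its neighbors:

# Hypothetical example: a gap between 20.0 (Jan 1) and 26.0 (Jan 4)
# is filled in proportion to elapsed time.
idx = pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-04'])
temps = pd.Series([20.0, None, 26.0], index=idx)
print(temps.interpolate(method='time'))
# 2019-01-02 becomes 22.0: one-third of the way from 20.0 to 26.0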
To ensure you are looking at data similar to the data you used in chapter 6, take a look at the data visually. In chapter 6, you learned about using Matplotlib to display multiple plots. As a refresher, listing 7.12 shows the code for displaying multiple plots. Line 1 sets the shape of the plots as 6 rows by 2 columns. Line 2 creates a series that can be looped over. Line 3 sets which 12 sites will be displayed. And lines 4 through 7 set the content of each plot.
Listing 7.12. Displaying multiple plots of the data
print('Number of timeseries:', daily_df.shape[1])
fig, axs = plt.subplots(
    6, 2, figsize=(20, 20), sharex=True)           #1
axx = axs.ravel()                                  #2
indices = [0,1,2,3,26,27,33,39,42,43,46,47]        #3
for i in indices:
    plot_num = indices.index(i)                    #4
    daily_df[daily_df.columns[i]].loc[
        "2017-11-01":"2019-02-28"
        ].plot(ax=axx[plot_num])                   #5
    axx[plot_num].set_xlabel("date")               #6
    axx[plot_num].set_ylabel("kW consumption")     #7
Figure 7.8 shows the output of listing 7.12. In the notebook, you’ll see an additional eight charts because the shape of the plot is 6 rows and 2 columns of plots.
With that complete, you can start preparing the training and test datasets.
In the previous section, you loaded each of the datasets into pandas DataFrames and fixed any missing values. In this section, you’ll turn the DataFrames into lists to create the training and test data.
Listing 7.13 converts the category data into a list of numbers. Each of the numbers 0 to 2 represents one of the three categories: Retail, Industrial, or Transport.
Listing 7.13. Converting category data to a list of numbers
cats = list(category_df.Category.astype('category').cat.codes)
print(cats)
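One detail to be aware of: cat.codes numbers the categories in alphabetical order, so Industrial maps to 0, Retail to 1, and Transport to 2. A tiny example with invented values:

# Hypothetical example: codes follow alphabetical category order.
demo = pd.Series(['Retail', 'Industrial', 'Transport', 'Retail'])
print(list(demo.astype('category').cat.codes))  # [1, 0, 2, 1]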
The next listing turns the power consumption data into a list of lists. Each site is a list, and there are 48 of these lists.
Listing 7.14. Converting power consumption data to a list of lists
usage_per_site = [daily_df[col] for col in daily_df.columns]
print(f'timeseries covers {len(usage_per_site[0])} days.')
print(f'timeseries starts at {usage_per_site[0].index[0]}')
print(f'timeseries ends at {usage_per_site[0].index[-1]}')
usage_per_site[0][:10]   #1
The next listing repeats this for holidays.
Listing 7.15. Converting holidays to a list of lists
hols_per_site = [holiday_df[col] for col in holiday_df.columns]
print(f'timeseries covers {len(hols_per_site[0])} days.')
print(f'timeseries starts at {hols_per_site[0].index[0]}')
print(f'timeseries ends at {hols_per_site[0].index[-1]}')
hols_per_site[0][:10]
And the next listing repeats this for maximum temperatures.
Listing 7.16. Converting maximum temperatures to a list of lists
max_per_site = [max_df[col] for col in max_df.columns]
print(f'timeseries covers {len(max_per_site[0])} days.')
print(f'timeseries starts at {max_per_site[0].index[0]}')
print(f'timeseries ends at {max_per_site[0].index[-1]}')
max_per_site[0][:10]
With the data formatted as lists, you can split it into training and test data and then write the files to S3. Listing 7.17 sets the start date for both testing and training as November 1, 2017. It then sets the end date for training as the end of January 2019, and the end date for testing as 28 days later (the end of February 2019).
Listing 7.17. Setting the start and end dates for testing and training data
freq = 'D'
prediction_length = 28
start_date = pd.Timestamp("2017-11-01", freq=freq)
end_training = pd.Timestamp("2019-01-31", freq=freq)
end_testing = end_training + prediction_length
print(f'End training: {end_training}, End testing: {end_testing}')
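A caution if you are running a newer version of pandas than the book used: the freq argument to Timestamp, and the integer arithmetic that depends on it (end_training + prediction_length here, and expressions like start_date+30 in later listings), have since been removed from pandas. Under that assumption, an equivalent sketch for the same dates is:

# Equivalent date setup for recent pandas versions, where Timestamp
# no longer accepts freq and integer arithmetic on Timestamps is gone.
# The dates themselves are unchanged from listing 7.17.
start_date = pd.Timestamp("2017-11-01")
end_training = pd.Timestamp("2019-01-31")
end_testing = end_training + pd.Timedelta(days=prediction_length)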
Just as you did in chapter 6, you now create a simple function, shown in the next listing, that writes each of the datasets to S3. In listing 7.19, you’ll apply this function to the test data and training data.
Listing 7.18. Creating a function that writes data to S3
def write_dicts_to_s3(path, data):
    with s3.open(path, 'wb') as f:
        for d in data:
            f.write(json.dumps(d).encode("utf-8"))
            f.write("\n".encode('utf-8'))
The next listing creates the training and test datasets. DeepAR requires categorical data to be separated from dynamic features. Notice how this is done in the next listing.
Listing 7.19. Creating training and test datasets
training_data = [
    {
        "cat": [cat],                                    #1
        "start": str(start_date),
        "target": ts[start_date:end_training].tolist(),
        "dynamic_feat": [
            hols[start_date:end_training].tolist(),      #2
            maxes[start_date:end_training].tolist(),     #3
        ]  # Note: List of lists
    }
    for cat, ts, hols, maxes in zip(
        cats, usage_per_site, hols_per_site, max_per_site)
]

test_data = [
    {
        "cat": [cat],
        "start": str(start_date),
        "target": ts[start_date:end_testing].tolist(),
        "dynamic_feat": [
            hols[start_date:end_testing].tolist(),
            maxes[start_date:end_testing].tolist(),
        ]  # Note: List of lists
    }
    for cat, ts, hols, maxes in zip(
        cats, usage_per_site, hols_per_site, max_per_site)
]

write_dicts_to_s3(f'{s3_data_path}/train/train.json', training_data)
write_dicts_to_s3(f'{s3_data_path}/test/test.json', test_data)
In this chapter, you set up the notebook in a slightly different way than you have in previous chapters. This chapter is all about how to use additional datasets such as site category, holidays, and maximum temperatures to enhance the accuracy of time-series predictions.
Xx lwloa bxb vr kcv rxu cptima le hsete oddtaaniil datasets nk qro iinpcdtreo, ow kcqk eearrppd c ecdtnmome-gvr ooknoteb ffva urrz tercesa nzp sestt drv model uiotwth ngusi rpk ilatdoidan datasets. Jl xgy txz steidneret nj iseeng rjzd ersutl, kpd szn ucmemtnno prrc brst lk xpr ooetbkno znh tnd por ienrte teoobkon agnai. Jl hxg ue ka, kgg fjfw kax qrzr, htowtui ugsin qkr daaidlinto datasets, krq WTLL (Wzno Reargev Ztcgeaerne Lettt) ltv Pabryreu zj 20%! Ooou lnlfwgoio lonag jn jdrc etharcp vr vkz zwrd jr odpsr re ownb rpk tdoindalai datasets zot reapdoontcir jnrk krb emdol.
Listing 7.20 sets the location on S3 where you will store the model and determines how SageMaker will configure the server that builds the model. At this point in the process, you would normally set a random seed to ensure that each run through the DeepAR algorithm generates a consistent result. At the time of this writing, there is an inconsistency in SageMaker’s DeepAR model—the functionality is not available. This doesn’t impact the accuracy of the results, only their consistency.
Listing 7.20. Setting up the SageMaker session and server to create the model
s3_output_path = f's3://{data_bucket}/{subfolder}/output'
sess = sagemaker.Session()
image_name = sagemaker.amazon.amazon_estimator.get_image_uri(
    sess.boto_region_name,
    "forecasting-deepar",
    "latest")
data_channels = {
    "train": f"{s3_data_path}/train/",
    "test": f"{s3_data_path}/test/"
}
Listing 7.21 is used to calculate the MAPE of the prediction. For each day you are predicting, you subtract the predicted consumption from the actual consumption and divide the result by the actual amount (and, if the number is negative, make it positive). You then take the average of all of these amounts.
For example, if on three consecutive days you predicted consumption of 1,000 kilowatts of power, and the actual consumption was 800, 900, and 1,150 kilowatts, the MAPE would be (200 / 800) + (100 / 900) + (150 / 1150) divided by 3. This equals 0.16, or 16%.
Listing 7.21. Calculating MAPE
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
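As a quick sanity check, you can feed the worked example from the text into the function; this call is ours, not part of the chapter’s notebook:

# Actuals of 800, 900, and 1,150 kW against a flat prediction
# of 1,000 kW should give roughly 16%.
print(mape([800, 900, 1150], [1000, 1000, 1000]))  # ~16.4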
Listing 7.22 is the standard SageMaker function for creating a DeepAR model. You do not need to modify this function; you just need to run it as is while in the notebook cell.
Listing 7.22. The DeepAR predictor function used in chapter 6
class DeepARPredictor(sagemaker.predictor.RealTimePredictor):
    def __init__(self, *args, **kwargs):
        super().__init__(
            *args,
            content_type=sagemaker.content_types.CONTENT_TYPE_JSON,
            **kwargs)

    def predict(
            self,
            ts,
            cat=None,
            dynamic_feat=None,
            num_samples=100,
            return_samples=False,
            quantiles=["0.1", "0.5", "0.9"]):
        prediction_time = ts.index[-1] + 1
        quantiles = [str(q) for q in quantiles]
        req = self.__encode_request(
            ts, cat, dynamic_feat, num_samples, return_samples, quantiles)
        res = super(DeepARPredictor, self).predict(req)
        return self.__decode_response(
            res, ts.index.freq, prediction_time, return_samples)

    def __encode_request(
            self, ts, cat, dynamic_feat,
            num_samples, return_samples, quantiles):
        instance = series_to_dict(
            ts,
            cat if cat is not None else None,
            dynamic_feat if dynamic_feat else None)
        configuration = {
            "num_samples": num_samples,
            "output_types": [
                "quantiles", "samples"] if return_samples else ["quantiles"],
            "quantiles": quantiles
        }
        http_request_data = {
            "instances": [instance],
            "configuration": configuration
        }
        return json.dumps(http_request_data).encode('utf-8')

    def __decode_response(
            self, response, freq, prediction_time, return_samples):
        predictions = json.loads(
            response.decode('utf-8'))['predictions'][0]
        prediction_length = len(next(iter(
            predictions['quantiles'].values())))
        prediction_index = pd.DatetimeIndex(
            start=prediction_time, freq=freq, periods=prediction_length)
        if return_samples:
            dict_of_samples = {
                'sample_' + str(i): s
                for i, s in enumerate(predictions['samples'])
            }
        else:
            dict_of_samples = {}
        return pd.DataFrame(
            data={**predictions['quantiles'], **dict_of_samples},
            index=prediction_index)

    def set_frequency(self, freq):
        self.freq = freq


def encode_target(ts):
    return [x if np.isfinite(x) else "NaN" for x in ts]


def series_to_dict(ts, cat=None, dynamic_feat=None):
    # Given a pandas.Series, returns a dict encoding the time series.
    obj = {"start": str(ts.index[0]), "target": encode_target(ts)}
    if cat is not None:
        obj["cat"] = cat
    if dynamic_feat is not None:
        obj["dynamic_feat"] = dynamic_feat
    return obj
Just as in chapter 6, you now need to set up the estimator and then set the parameters for the estimator. SageMaker exposes several parameters for you. The only two that you need to change are the first two parameters shown in lines 1 and 2 of listing 7.23: context_length and prediction_length.
The context length is the minimum period of time that will be used to make a prediction. By setting this value to 90, you are saying that you want DeepAR to use 90 days of data as a minimum to make its predictions. In business settings, this is typically a good value because it allows for the capture of quarterly trends. The prediction length is the period of time you are predicting. In this notebook, you are predicting February’s data, so you use a prediction_length of 28 days. Also note num_dynamic_feat in line 8: it is set to 2 because you are providing two dynamic features, holidays and maximum temperatures.
Listing 7.23. Setting up the estimator
%%time
estimator = sagemaker.estimator.Estimator(
    sagemaker_session=sess,
    image_name=image_name,
    role=role,
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',  # $0.476 per hour as of Jan 2019
    base_job_name='ch7-energy-usage-dynamic',
    output_path=s3_output_path
)

estimator.set_hyperparameters(
    context_length="90",                         #1
    prediction_length=str(prediction_length),    #2
    time_freq=freq,                              #3
    epochs="400",                                #4
    early_stopping_patience="40",                #5
    mini_batch_size="64",                        #6
    learning_rate="5E-4",                        #7
    num_dynamic_feat=2,                          #8
)

estimator.fit(inputs=data_channels, wait=True)
Listing 7.24 creates the endpoint you’ll use to test the predictions. In the next chapter, you’ll learn how to expose that endpoint to the internet, but for this chapter, just like the preceding chapters, you’ll hit the endpoint using code in the notebook.
Listing 7.24. Setting up the endpoint
endpoint_name = 'energy-usage-dynamic'
try:
    sess.delete_endpoint(
        sagemaker.predictor.RealTimePredictor(
            endpoint=endpoint_name).endpoint)
    print(
        'Warning: Existing endpoint deleted to make way for new endpoint.')
    from time import sleep
    sleep(30)
except:
    pass
Now it’s time to build the model. The following listing creates the model and assigns it to the variable predictor.
Listing 7.25. Building and deploying the model
%%time
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    predictor_cls=DeepARPredictor,
    endpoint_name=endpoint_name)
Once the model is built, you can run the predictions against each of the days in February. First, however, you’ll test the predictor as shown in the next listing.
Listing 7.26. Checking the predictions from the model
predictor.predict(
    cat=[cats[0]],
    ts=usage_per_site[0][start_date+30:end_training],
    dynamic_feat=[
        hols_per_site[0][start_date+30:end_training+28].tolist(),
        max_per_site[0][start_date+30:end_training+28].tolist(),
    ],
    quantiles=[0.1, 0.5, 0.9]
).head()
Now that you know the predictor is working as expected, you’re ready to run it across each of the days in February 2019. But before you do that, to allow you to calculate the MAPE, you’ll create a list called usages to store the actual power consumption for each site for each day in February 2019. When you run the predictions across each day in February, you store the results in a list called predictions.
Listing 7.27. Getting predictions for all sites during February 2019
usages = [
    ts[end_training+1:end_training+28].sum() for ts in usage_per_site]

predictions = []
for s in range(len(usage_per_site)):
    # Call the endpoint to get the 28-day prediction.
    predictions.append(
        predictor.predict(
            cat=[cats[s]],
            ts=usage_per_site[s][start_date+30:end_training],
            dynamic_feat=[
                hols_per_site[s][start_date+30:end_training+28].tolist(),
                max_per_site[s][start_date+30:end_training+28].tolist(),
            ]
        )['0.5'].sum()
    )

for p, u in zip(predictions, usages):
    print(f'Predicted {p} kwh but usage was {u} kwh.')
Once you have the usages list and the predictions list, you can calculate the MAPE by running the mape function you created in listing 7.21.
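The listing numbering jumps from 7.27 to 7.29 here; the MAPE calculation itself is a one-liner. A minimal sketch of that call (our code, with rounding chosen for readability) might look like this:

# Compare predicted and actual February totals across all 48 sites.
print(f'MAPE: {round(mape(usages, predictions), 1)}%')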
Listing 7.29 is the same plot function you saw in chapter 6. The function takes the usage list and creates predictions in the same way you did in listing 7.27. The difference in the plot function here is that it also calculates the lower and upper predictions at an 80% confidence level (for confidence=80, the lower quantile is 0.5 - 80 * 0.005 = 0.1, and the upper quantile is 0.5 + 80 * 0.005 = 0.9). It then plots the actual usage as a line and shades the area within the 80% confidence threshold.
Listing 7.29. Displaying plots of sites
def plot(
        predictor,
        site_id,
        end_training=end_training,
        plot_weeks=12,
        confidence=80):
    low_quantile = 0.5 - confidence * 0.005
    up_quantile = confidence * 0.005 + 0.5
    target_ts = usage_per_site[site_id][start_date+30:]
    dynamic_feats = [
        hols_per_site[site_id][start_date+30:].tolist(),
        max_per_site[site_id][start_date+30:].tolist(),
    ]
    plot_history = plot_weeks * 7

    fig = plt.figure(figsize=(20, 3))
    ax = plt.subplot(1, 1, 1)

    prediction = predictor.predict(
        cat=[cats[site_id]],
        ts=target_ts[:end_training],
        dynamic_feat=dynamic_feats,
        quantiles=[low_quantile, 0.5, up_quantile])

    target_section = target_ts[
        end_training-plot_history:end_training+prediction_length]
    target_section.plot(color="black", label='target')

    ax.fill_between(
        prediction[str(low_quantile)].index,
        prediction[str(low_quantile)].values,
        prediction[str(up_quantile)].values,
        color="b",
        alpha=0.3,
        label=f'{confidence}% confidence interval')

    ax.set_ylim(target_section.min() * 0.5, target_section.max() * 1.5)
The following listing runs the plot function you created in listing 7.29.
Listing 7.30. Plotting several sites and the February predictions
indices = [2,26,33,39,42,47,3]
for i in indices:
    plot_num = indices.index(i)
    plot(
        predictor,
        site_id=i,
        plot_weeks=6,
        confidence=80)
Figure 7.9 shows the predicted results for several sites. As you can see, the daily prediction for each time series falls within the shaded area.
One of the advantages of displaying the data in this manner is that it is easy to pick out sites where you haven’t predicted accurately. For example, if you look at site 3, the last site in the plot list in figure 7.10, you can see that there was a period in February with almost no power usage, when you predicted it would have a fairly high usage. This provides you with an opportunity to improve your model by including additional datasets.
When you see a prediction that is clearly inaccurate, you can investigate what happened during that time and determine if there is some data source that you could incorporate into your predictions. If, for example, this site had a planned maintenance shutdown in early February that was not already included in your holiday data, and you can get your hands on a schedule of planned maintenance shutdowns, then you can easily incorporate that data into your model in the same way that you incorporated the holiday data.
As always, when you are no longer using the notebook, remember to shut down the notebook and delete the endpoint. We don’t want you to get charged for SageMaker services that you’re not using.
To delete the endpoint, uncomment the code in listing 7.31, then run the code in the cell.
Listing 7.31. Deleting the endpoint
# Remove the endpoints.
# Comment out these cells if you want the endpoint to persist after Run All.
# sess.delete_endpoint('energy-usage-baseline')
# sess.delete_endpoint('energy-usage-dynamic')
To shut down the notebook, go back to the browser tab where you have SageMaker open. Click the Notebook Instances menu item to view all of your notebook instances. Select the radio button next to the notebook instance name, as shown in figure 7.11, then select Stop from the Actions menu. It takes a couple of minutes to shut down.
If you didn’t delete the endpoint using the notebook (or if you just want to make sure it is deleted), you can do this from the SageMaker console. To delete the endpoint, click the radio button to the left of the endpoint name, then click the Actions menu item, and click Delete in the menu that appears.
When you have successfully deleted the endpoint, you will no longer incur AWS charges for it. You can confirm that all of your endpoints have been deleted when you see the text “There are currently no resources” displayed on the Endpoints page (figure 7.12).
Ucjct sns wne drcitep wepor oosuninmcpt ltx coad cvrj wujr s 6.9% WCZF, nkxo tlx tsnohm jrwq s emnbru vl yladshio xt deietprdc eewraht iualnctfoust.
- Past usage is not always a good predictor of future usage.
- DeepAR is a neural network algorithm that is particularly good at incorporating several different time-series datasets into its forecasting, thereby accounting for events in your time-series forecasting that your time-series data can’t directly infer.
- The datasets used in this chapter can be classified into two types of data: categorical and dynamic. Categorical data is information about the site that doesn’t change, and dynamic data is data that changes over time.
- For each day in your prediction range, you calculate the Mean Absolute Percentage Error (MAPE) for the time-series data by defining the function mape.
- Once the model is built, you can run the predictions and display the results in multiple time-series charts to easily visualize the predictions.