3 Balancing immediate and long-term goals


In this chapter

  • You will learn about the challenges of learning from sequential feedback and how to properly balance immediate and long-term goals.
  • You will develop algorithms that can find the best policies of behavior in sequential decision-making problems modeled with MDPs.
  • You will find the optimal policies for all environments for which you built MDPs in the previous chapter.

In preparing for battle I have always found that plans are useless, but planning is indispensable.

— Dwight D. Eisenhower, United States Army five-star general and 34th President of the United States

In the last chapter, you built an MDP for the BW, BSW, and FL environments. MDPs are the motors moving RL environments. They define the problem: they describe how the agent interacts with the environment through state and action spaces, the agent's goal through the reward function, how the environment reacts to the agent's actions through the transition function, and how time should impact behavior through the discount factor.

In this chapter, you'll learn about algorithms for solving MDPs. We first discuss the objective of an agent and why simple plans are not sufficient to solve MDPs. We then talk about the two fundamental algorithms for solving MDPs under a technique called dynamic programming (DP): value iteration (VI) and policy iteration (PI).

You'll soon notice that these methods "cheat" in a way: they require full access to the MDP, and they depend on knowing the dynamics of the environment, which is something we can't always obtain. However, the fundamentals you'll learn are still useful for learning about more advanced algorithms. In the end, PI and VI are the foundations from which virtually every other RL (and DRL) algorithm originates.

You'll also notice that when an agent has full access to an MDP, there's no uncertainty, because you can look at the dynamics and rewards and calculate expectations directly. Calculating expectations directly means that there's no need for exploration; that is, there's no need to balance exploration and exploitation. There's no need for interaction, so there's no need for trial-and-error learning. All of this is because the feedback we're using for learning in this chapter isn't evaluative but supervised instead.

Remember, in DRL, agents learn from feedback that's simultaneously sequential (as opposed to one-shot), evaluative (as opposed to supervised), and sampled (as opposed to exhaustive). What I'm doing in this chapter is eliminating the complexity that comes with learning from evaluative and sampled feedback, and studying sequential feedback in isolation. In this chapter, we learn from feedback that's sequential, supervised, and exhaustive.

The objective of a decision-making agent

At first, it seems the agent's goal is to find a sequence of actions that will maximize the return: the sum of rewards (discounted or undiscounted, depending on the value of gamma) during an episode or the entire life of the agent, depending on the task.

Let me introduce a new environment to explain these concepts more concretely.

A Concrete Example

The Slippery Walk Five (SWF) environment

  

The Slippery Walk Five (SWF) is a one-row grid-world environment (a walk) that's stochastic, similar to the Frozen Lake, and it has only five non-terminal states (seven total if we count the two terminal states).

The Slippery Walk Five environment

The agent starts in S; H is a hole; G is the goal and provides a +1 reward.

Show Me The Math

The return G
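The equation itself appears as an image in the book; as a sketch, using the standard notation for rewards R and discount factor γ introduced in the previous chapter, the return can be written as

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

which also satisfies the recursive form G_t = R_{t+1} + \gamma G_{t+1}.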

  

You can think of returns as backward looking ("how much you got" from a past time step), but another way to look at them is as a "reward-to-go", which is basically forward looking. For example, imagine an episode in the SWF environment went this way: state 3/reward 0, then 4/0, 5/0, 4/0, 5/0, 6/1. What's the return of this trajectory/episode?

Well, if we use discounting, the math would work out this way.

Discounted return in the slippery walk five environment

If we don't use discounting, the return would be 1 for this trajectory and for all trajectories that end in the right-most cell, state 6, and 0 for all trajectories that terminate in the left-most cell, state 0.
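To make the arithmetic concrete, here's a minimal sketch (mine, not the book's code) that computes the return of the episode above; the reward sequence 0, 0, 0, 0, 1 is read off the state/reward pairs listed earlier.

# rewards collected after each step of the sample episode 3, 4, 5, 4, 5, 6
rewards = [0, 0, 0, 0, 1]

def compute_return(rewards, gamma=1.0):
    # G = sum over k of gamma^k * R_{k+1}
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(compute_return(rewards, gamma=1.0))    # undiscounted return: 1.0
print(compute_return(rewards, gamma=0.99))   # discounted return: 0.99^4, roughly 0.961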

In the SWF environment, it's evident that going right is the best thing to do. It may seem, therefore, that all the agent must find is something called a plan, that is, a sequence of actions from the START state to the GOAL. But this doesn't always work.

A solid plan in the SWF environment

In the FL environment, a plan would look like the following.

A solid plan in the FL environment

But this isn't enough! The problem with plans is that they don't account for stochasticity in environments, and both the SWF and FL are stochastic; actions taken won't always work the way we intend. What would happen if, due to the environment's stochasticity, our agent lands on a cell not covered by our plan?

A possible hole in our plan

The same thing happens in the FL environment.

Plans aren’t enough in stochastic environments

What the agent needs to come up with is called a policy. Policies are universal plans; policies cover all possible states. We need to plan for every possible state. Policies can be stochastic or deterministic: the policy can return action-probability distributions or single actions for a given state (or observation). For now, we're working with deterministic policies, which are lookup tables that map states to actions.

Optimal policy in the SWF environment

In the SWF environment, the optimal policy is always going right, for every single state. Great, but there are still many unanswered questions. For instance, how much reward should I expect from this policy? Because, even though we know how to act optimally, the environment might send our agent backward to the hole even if we always select to go toward the goal. This is why returns aren't enough. The agent is really looking to maximize the expected return; that means the return taking into account the environment's stochasticity.

Also, we need a method to automatically find optimal policies, because in the FL example, for instance, it isn't obvious at all what the optimal policy looks like!

There are a few components internal to the agent that can help it find optimal behavior. There are policies; there can be multiple policies for a given environment, and in fact, in certain environments, there may be multiple optimal policies. Also, there are value functions to help us keep track of return estimates. There's a single optimal value function for a given MDP, but there may be multiple value functions in general.

Let's look at all the components internal to a reinforcement learning agent that allow it to learn and find optimal policies, with examples to make all of this more concrete.

Policies: Per-state action prescriptions

Given the stochasticity in the Frozen Lake environment (and most reinforcement learning problems), the agent needs to find a policy, denoted as π. A policy is a function that prescribes actions to take for a given nonterminal state. (Remember, policies can be stochastic: they can return either a single action or a probability distribution over actions. We'll expand on stochastic policies in later chapters.)

Here’s a sample policy.

A randomly generated policy
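In code, a deterministic policy like this can be represented as a plain state-to-action lookup. The sketch below is mine, not the book's figure: the action encoding (0=left, 1=down, 2=right, 3=up) follows Gym's FrozenLake convention, and the specific actions are made up for illustration.

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
# hypothetical deterministic policy for the 4x4 FL environment:
# a dictionary mapping each of the 16 states to one of the 4 actions
random_policy = {
     0: RIGHT,  1: LEFT,   2: DOWN,   3: UP,
     4: LEFT,   5: LEFT,   6: RIGHT,  7: LEFT,
     8: UP,     9: DOWN,  10: UP,    11: LEFT,
    12: LEFT,  13: RIGHT, 14: DOWN,  15: LEFT,
}
# wrapping it in a callable gives the pi(s) interface used by the code later in this chapter
pi = lambda s: random_policy[s]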

One immediate question that arises when looking at a policy is this: how good is this policy? If we find a way to put a number to policies, we could also ask the question, how much better is this policy compared to another policy?

How can we compare policies?

State-value function: What to expect from here?

Something that'd help us compare policies is to put numbers to states for a given policy. That is, if we're given a policy and the MDP, we should be able to calculate the expected return starting from every single state (we care mostly about the START state). How can we calculate how valuable being in a state is? For instance, if our agent is in state 14 (to the left of the GOAL), how is that better than being in state 13 (to the left of 14)? And precisely how much better is it? More importantly, under which policy would we have better results, the Go-get-it or the Careful policy?

Let's give it a quick try with the Go-get-it policy. What is the value of being in state 14 under the Go-get-it policy?

What’s the value of being in state 14 when running the Go-get-it policy?

Okay, so it isn't that straightforward to calculate the value of state 14 when following the Go-get-it policy, because of the dependence on the values of other states (10 and 14, in this case), which we don't have either. It's like the chicken-or-the-egg problem. Let's keep going.

We defined the policy π. Remember, we're in stochastic environments, so we must account for all the possible ways the environment can react to our policy! That's what an expectation gives us.

We now define the value of a state s when following a policy π: the value of a state s under policy π is the expectation of returns if the agent follows policy π starting from state s. Calculate this for every state, and you get the state-value function, or V-function, or value function. It represents the expected return when following policy π from state s.

Show Me The Math

The state-value function v
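The equation is shown as an image in the book; a standard formulation consistent with the prose definition above (written with the MDP's transition function p(s', r | s, a)) is

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right], \quad \text{for all } s \in \mathcal{S}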

  

These equations are fascinating. A bit of a mess given the recursive dependencies, but still interesting. Notice how the value of a state depends recursively on the values of possibly many other states, whose values may also depend on others, including the original state!

The recursive relationship between states and successive states will come back in the next section when we look at algorithms that can iteratively solve these equations and obtain the state-value function of any policy in the FL environment (or any other environment, really).

For now, let's continue exploring other components commonly found in RL agents. We'll learn how to calculate these values later in this chapter. Note that the state-value function is often referred to as the value function, or even the V-function, or more simply vπ(s). It may be confusing, but you'll get used to it.

Action-value function: What should I expect from here if I do this?

Another critical question that we often need to ask isn't merely about the value of a state but the value of taking action a in a state s. Differentiating answers to this kind of question will help us decide between actions.

For instance, notice that the Go-get-it policy goes right when in state 14, but the Careful policy goes down. But which action is better? More specifically, which action is better under each policy? That is, what is the value of going down, instead of right, and then following the Go-get-it policy, and what is the value of going right, instead of down, and then following the Careful policy?

By comparing different actions under the same policy, we can select better actions, and therefore improve our policies. The action-value function, also known as Q-function or Qπ(s,a), captures precisely this: the expected return if the agent follows policy π after taking action a in state s.

In fact, when we care about improving policies, which is often referred to as the control problem, we need action-value functions. Think about it: if you don't have an MDP, how can you decide what action to take merely by knowing the values of all states? V-functions don't capture the dynamics of the environment. The Q-function, on the other hand, does somewhat capture the dynamics of the environment and allows you to improve policies without the need for MDPs. We expand on this fact in later chapters.

Show Me The Math

The action-value function Q
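Again, the equation is an image in the book; a standard form that matches the definition above is

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
            = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right], \quad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}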

  

Action-advantage function: How much better if I do that?

Another type of value function is derived from the previous two. The action-advantage function, also known as advantage function, A-function, or aπ(s, a), is the difference between the action-value function of action a in state s and the state-value function of state s under policy π.

Show Me The Math

The action-advantage function A
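The equation is an image in the book; the relationship described in the paragraph that follows is simply

a_\pi(s, a) = q_\pi(s, a) - v_\pi(s)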

  

The advantage function describes how much better it is to take action a instead of following policy π: the advantage of choosing action a over the default action.

Look at the different value functions for a (dumb) policy in the SWF environment. Remember, these values depend on the policy. In other words, the Qπ(s, a) estimates assume you'll follow policy π (always left in the following example) right after taking action a in state s.

State-value, action-value, and action-advantage functions

Optimality

Policies, state-value functions, action-value functions, and action-advantage functions are the components we use to describe, evaluate, and improve behaviors. We call it optimality when these components are the best they can be.

An optimal policy is a policy that, for every state, can obtain expected returns greater than or equal to any other policy. An optimal state-value function is a state-value function with the maximum value across all policies for all states. Likewise, an optimal action-value function is an action-value function with the maximum value across all policies for all state-action pairs. The optimal action-advantage function follows a similar pattern, but notice that an optimal advantage function would be equal to or less than zero for all state-action pairs, since no action could have any advantage over the optimal state-value function.

Also, notice that although there could be more than one optimal policy for a given MDP, there can only be one optimal state-value function, optimal action-value function, and optimal action-advantage function.

You may also notice that if you had the optimal V-function, you could use the MDP to do a one-step search for the optimal Q-function and then use this to build the optimal policy. On the other hand, if you had the optimal Q-function, you wouldn't need the MDP at all. You could use the optimal Q-function to find the optimal V-function by merely taking the maximum over the actions. And you could obtain the optimal policy using the optimal Q-function by taking the argmax over the actions.

Show Me The Math

The Bellman optimality equations
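The equations appear as an image in the book; the standard Bellman optimality equations consistent with this section are

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_*(s') \right]
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \max_{a'} q_*(s', a') \right]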

  

Planning optimal sequences of actions

We have state-value functions to keep track of the values of states, action-value functions to keep track of the values of state-action pairs, and action-advantage functions to show the "advantage" of taking specific actions. We have equations for all of these to evaluate current policies, that is, to go from policies to value functions, and to calculate and find optimal value functions and, therefore, optimal policies.

Now that we've discussed the reinforcement learning problem formulation, and we've defined the objective we're after, we can start exploring methods for achieving this objective. Iteratively computing the equations presented in the previous section is one of the most common ways to solve a reinforcement learning problem and obtain optimal policies when the dynamics of the environment, the MDPs, are known. Let's look at the methods.

Policy evaluation: Rating policies

We talked about comparing policies in the previous section. We established that policy π is better than or equal to policy π' if the expected return is better than or equal to that of π' for all states. Before we can use this definition, however, we must devise an algorithm for evaluating an arbitrary policy. Such an algorithm is known as iterative policy evaluation, or just policy evaluation.

The policy-evaluation algorithm consists of calculating the V-function for a given policy by sweeping through the state space and iteratively improving estimates. We refer to the type of algorithm that takes in a policy and outputs a value function as an algorithm that solves the prediction problem, which is calculating the values of a predetermined policy.

Show Me The Math

The policy-evaluation equation
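The equation is an image in the book; the standard iterative policy-evaluation update it refers to is

v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]

where k indexes the sweep over the state space, not interactions with the environment.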

  

Using this equation, we can iteratively approximate the true V-function of an arbitrary policy. The iterative policy-evaluation algorithm is guaranteed to converge to the value function of the policy if given enough iterations, more concretely, as we approach infinity. In practice, however, we use a small threshold to check for changes in the value function we're approximating. Once the changes in the value function are less than this threshold, we stop.

Let's see how this algorithm works in the SWF environment, for the always-left policy.

Initial calculations of policy evaluation

You then calculate the values for all states 0–6, and when done, move to the next iteration. Notice that to calculate v2π(s) you'd have to use the estimates obtained in the previous iteration, v1π(s). This technique of calculating an estimate from an estimate is referred to as bootstrapping, and it's a widely used technique in RL (including DRL).

Also, it's important to notice that the k's here are iterations across estimates, but they're not interactions with the environment. These aren't episodes where the agent is out and about selecting actions and observing the environment. These aren't time steps either. Instead, these are the iterations of the iterative policy-evaluation algorithm. Do a couple more of these estimates. The following table shows the results you should get.

k      Vπ(0)   Vπ(1)    Vπ(2)    Vπ(3)    Vπ(4)    Vπ(5)    Vπ(6)
0      0       0        0        0        0        0        0
1      0       0        0        0        0        0.1667   0
2      0       0        0        0        0.0278   0.2222   0
3      0       0        0        0.0046   0.0463   0.2546   0
4      0       0        0.0008   0.0093   0.0602   0.2747   0
5      0       0.0001   0.0018   0.0135   0.0705   0.2883   0
6      0       0.0003   0.0029   0.0171   0.0783   0.2980   0
7      0       0.0006   0.0040   0.0202   0.0843   0.3052   0
8      0       0.0009   0.0050   0.0228   0.0891   0.3106   0
9      0       0.0011   0.0059   0.0249   0.0929   0.3147   0
10     0       0.0014   0.0067   0.0267   0.0959   0.3180   0
...    ...     ...      ...      ...      ...      ...      ...
104    0       0.0027   0.0110   0.0357   0.1099   0.3324   0

What are some of the things the resulting state-value function tells us?

Well, to begin with, we can say we get a return of 0.0357 in expectation when starting an episode in this environment and following the always-left policy. Pretty low.

We can also say that even when we find ourselves in state 1 (the leftmost non-terminal state), we still have a chance, albeit less than one percent, to end up in the GOAL cell (state 6). To be exact, we have a 0.27% chance of ending up in the GOAL state when we're in state 1. And we select left all the time! Pretty interesting.

Interestingly also, due to the stochasticity of this environment, we have a 3.57% chance of reaching the GOAL cell from the start state (remember this environment has 50% action success, 33.33% no effect, and 16.66% backward). Again, this is under an always-left policy. Still, the left action could send us right, then right and right again, or left, right, right, right, right, and so on.

Think about how the probabilities of trajectories combine. Also, pay attention to the iterations and how the values propagate backward from the reward (the transition from state 5 to state 6) one step at a time. This backward propagation of the values is a common characteristic among RL algorithms and comes up again several times.

I Speak Python

The policy-evaluation algorithm

import numpy as np

def policy_evaluation(pi, P, gamma=1.0, theta=1e-10):
    # start with a value function of all zeros
    prev_V = np.zeros(len(P))
    while True:
        V = np.zeros(len(P))
        # sweep through every state, backing up the value of the action pi selects
        for s in range(len(P)):
            for prob, next_state, reward, done in P[s][pi(s)]:
                # expected reward plus discounted value of the landing state;
                # terminal transitions bootstrap to zero via the (not done) flag
                V[s] += prob * (reward + gamma * prev_V[next_state] * (not done))
        # stop once a full sweep changes the estimates by less than theta
        if np.max(np.abs(prev_V - V)) < theta:
            break
        prev_V = V.copy()
    return V
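As a sanity check, the following sketch (mine, not the book's) builds the SWF MDP from the dynamics stated earlier (50% the action works, 33.33% the agent stays put, 16.67% it moves backward, with states 0 and 6 terminal and a +1 reward only for landing on state 6) and evaluates the always-left policy. The exact dictionary construction is my reconstruction from the text.

def swf_mdp():
    # Slippery Walk Five: 7 states (0 and 6 terminal), 2 actions (0=left, 1=right)
    # P[state][action] -> list of (prob, next_state, reward, done) tuples
    P = {}
    for s in range(7):
        P[s] = {}
        for a, move in ((0, -1), (1, 1)):
            if s in (0, 6):                      # terminal states self-loop with zero reward
                P[s][a] = [(1.0, s, 0.0, True)]
                continue
            outcomes = []
            for prob, step in ((0.50, move), (1/3, 0), (1/6, -move)):
                ns = min(max(s + step, 0), 6)
                outcomes.append((prob, ns, float(ns == 6), ns in (0, 6)))
            P[s][a] = outcomes
    return P

P_swf = swf_mdp()
always_left = lambda s: 0
print(policy_evaluation(always_left, P_swf, gamma=1.0))
# roughly [0., 0.0027, 0.011, 0.0357, 0.1099, 0.3324, 0.], matching the table above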

Let's now run policy evaluation on the randomly generated policy presented earlier for the FL environment.

Recall the randomly generated policy

The following shows the progress policy evaluation makes on accurately estimating the state-value function of the randomly generated policy after only eight iterations.

Policy evaluation on the randomly generated policy for the FL environment
State-value function of the randomly generated policy

This final state-value function is the state-value function of this policy. Note that even though this is still an estimate, because we're in a discrete state and action space, we can assume it to be the actual value function when using a gamma of 0.99.
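To reproduce these numbers yourself, the FL MDP can be pulled from OpenAI Gym, which exposes the transition dictionary in the same P format used by the code above. This is a sketch under the assumption of the classic FrozenLake-v0 environment (newer Gym versions use FrozenLake-v1 and env.unwrapped.P) and the pi(s) callable for the randomly generated policy shown earlier.

import gym

env = gym.make('FrozenLake-v0')
P = env.env.P               # MDP dynamics: P[s][a] -> [(prob, next_state, reward, done), ...]
V = policy_evaluation(pi, P, gamma=0.99)
print(V.reshape(4, 4))      # printed as a 4x4 grid to match the figures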

In case you're wondering about the state-value functions of the two policies presented earlier, here are the results.

Results of policy evaluation

It seems the Go-get-it policy doesn't pay well in the FL environment! Fascinating results, right? But a question arises: are there any better policies for this environment?

Policy improvement: Using ratings to get better

The motivation is clear now. You have a way of evaluating any policy. This already gives you some freedom: you can evaluate many policies and rank them by the state-value function of the START state. After all, that number tells you the expected cumulative reward the policy in question will obtain if you run many episodes. Cool, right?

No! Makes no sense. Why would you randomly generate a bunch of policies and evaluate them all? First, that's a total waste of computing resources, but more importantly, it gives you no guarantee that you're finding better and better policies. There has to be a better way.

The key to unlocking this problem is the action-value function, the Q-function. Using the V-function and the MDP, you get an estimate of the Q-function. The Q-function will give you a glimpse of the values of all actions for all states, and these values, in turn, can hint at how to improve policies. Take a look at the Q-function of the Careful policy and the ways we can improve this policy:

How can the Q-function help us improve policies?

Notice how, if we act greedily with respect to the Q-function of the policy, we obtain a new policy: Careful+. Is this policy any better? Well, policy evaluation can tell us! Let's find out!

State-value function of the Careful policy

The new policy is better than the original policy. This is great! We used the state-value function of the original policy and the MDP to calculate its action-value function. Then, acting greedily with respect to the action-value function gives us an improved policy. This is what the policy-improvement algorithm does: it calculates an action-value function using the state-value function and the MDP, and it returns a greedy policy with respect to the action-value function of the original policy. Let that sink in; it's pretty important.

Show Me The Math

The policy-improvement equation
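The equation is an image in the book; the greedy policy-improvement step it describes can be written as

\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]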

  

This is how the policy-improvement algorithm looks in Python.

I Speak Python

The policy-improvement algorithm

def policy_improvement(V, P, gamma=1.0):
    # one-step look-ahead: build the Q-function from V and the MDP
    Q = np.zeros((len(P), len(P[0])), dtype=np.float64)
    for s in range(len(P)):
        for a in range(len(P[s])):
            for prob, next_state, reward, done in P[s][a]:
                Q[s][a] += prob * (reward + gamma * V[next_state] * (not done))
    # return the policy that acts greedily with respect to Q
    new_pi = lambda s: {s:a for s, a in enumerate(np.argmax(Q, axis=1))}[s]
    return new_pi

The natural next questions are these: is there a better policy than this one? Can we do better yet again? Maybe! But there's only one way to find out. Let's give it a try!

Can we improve over the Careful+ policy?

I ran policy evaluation on the Careful+ policy, and then policy improvement. The Q-functions of Careful and Careful+ are different, but the greedy policies over the Q-functions are identical. In other words, there's no improvement this time.

No improvement occurs because the Careful+ policy is an optimal policy of the FL environment (with gamma 0.99). We only needed one improvement over the Careful policy because this policy was good to begin with.

Now, even if we start with an adversarial policy designed to perform poorly, alternating between policy evaluation and improvement would still end up with an optimal policy. Want proof? Let's do it! Let's make up an adversarial policy for the FL environment and see what happens.

Adversarial policy for the FL environment

Policy iteration: Improving upon improved behaviors

The plan with this adversarial policy is to alternate between policy evaluation and policy improvement until the policy coming out of the policy-improvement phase no longer yields a different policy. And in fact, if instead of starting with an adversarial policy we start with a randomly generated policy, this is exactly what an algorithm called policy iteration does.

I Speak Python

The policy-iteration algorithm

def policy_iteration(P, gamma=1.0, theta=1e-10):
    # start from a randomly generated deterministic policy
    random_actions = np.random.choice(tuple(P[0].keys()), len(P))
    pi = lambda s: {s:a for s, a in enumerate(random_actions)}[s]
    while True:
        old_pi = {s:pi(s) for s in range(len(P))}
        # evaluate the current policy, then improve it greedily
        V = policy_evaluation(pi, P, gamma, theta)
        pi = policy_improvement(V, P, gamma)
        # stop when the improved policy no longer changes
        if old_pi == {s:pi(s) for s in range(len(P))}:
            break
    return V, pi
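As a quick usage sketch (mine, not the book's), running policy iteration on the FL dictionary P obtained from Gym earlier should land on the Careful+ policy described above, up to ties between equally valued actions.

V_best, pi_best = policy_iteration(P, gamma=0.99)
print(V_best.reshape(4, 4))                  # the optimal state-value function
print([pi_best(s) for s in range(len(P))])   # one optimal action per state (0=left, 1=down, 2=right, 3=up)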

Great! But let's first try it starting with the adversarial policy and see what happens.

Improving upon the adversarial policy 1/2
Improving upon the adversarial policy 2/2

As mentioned, alternating policy evaluation and policy improvement yields an optimal policy and state-value function regardless of the policy you start with. Now, there are a few points I'd like to make about this sentence.

Notice how I use "an optimal policy," but also use "the optimal state-value function." This is not a coincidence or a poor choice of words; this is, in fact, a property that I'd like to highlight again. An MDP can have more than one optimal policy, but it can only have a single optimal state-value function. It's not too hard to wrap your head around that.

State-value functions are collections of numbers. Numbers can have infinitesimal accuracy, because they're numbers. There will be only one optimal state-value function (the collection with the highest numbers for all states). However, a state-value function may have actions that are equally valued for a given state; this includes the optimal state-value function. In this case, there could be multiple optimal policies, each optimal policy selecting a different, but equally valued, action. Take a look: the FL environment is a great example of this.

The FL environment has multiple optimal policies

By the way, it's not shown here, but all the actions in a terminal state have the same value, zero, and therefore have an issue similar to the one I'm highlighting in state 6.

As a final note, I want to highlight that policy iteration is guaranteed to converge to the exact optimal policy: the mathematical proof shows it will not get stuck in local optima. However, as a practical consideration, there's one thing to be careful about. If the action-value function has a tie (for example, right/left in state 6), we must make sure not to break ties randomly. Otherwise, policy improvement could keep returning different policies, even without any real improvement. With that out of the way, let's look at another essential algorithm for finding optimal state-value functions and optimal policies.

Value iteration: Improving behaviors early

Bvb bypablor tienoc kry suw iyolcp touaevilan sorkw: vaseul aagoertpp tseslnicoynt nv zogc tatirie on, rbp ywlslo. Rkco c vfee.

Policy evaluation on the always-left policy on the SWF environment

The image shows a single state-space sweep of policy evaluation followed by an estimation of the Q-function. We do this by using the truncated estimate of the V-function and the MDP, on each iteration. By doing so, we can more easily see that even after the first iteration, a greedy policy over the early Q-function estimates would be an improvement. Look at the Q-values for state 5 in the first iteration; changing the action to point toward the GOAL state is obviously already better.

In other words, even if we cut the policy-evaluation phase short and improve the policy right away, we still make progress toward the optimal policy. This insight is the basis of an algorithm called value iteration (VI).

VI can be thought of as "greedily greedifying policies," because we calculate the greedy policy as soon as we can, greedily. VI doesn't wait until we have an accurate estimate of the policy's value before it improves the policy; instead, VI truncates the policy-evaluation phase after a single state sweep. Take a look at what I mean by "greedily greedifying policies."

Greedily greedifying the always-left policy of the SWF environment

If we start with a randomly generated policy, instead of this adversarial always-left policy for the SWF environment, VI would still converge to the optimal state-value function. VI is a straightforward algorithm that can be expressed in a single equation.

Show Me The Math

The value-iteration equation
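The equation appears as an image in the book; the standard value-iteration update it refers to is

v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]

which is the policy-evaluation update with the sum over the policy's actions replaced by a max over actions.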

  

Notice that in practice, in VI, we don't have to deal with policies at all. VI doesn't have a separate evaluation phase that runs to convergence. While the goal of VI is the same as the goal of PI, to find the optimal policy for a given MDP, VI happens to do this through the value functions; thus the name value iteration.

Again, we only have to keep track of a V-function and a Q-function (depending on the implementation). Remember that to get the greedy policy over a Q-function, we take the arguments of the maxima (argmax) over the actions of that Q-function. Instead of improving the policy by taking the argmax to get a better policy and then evaluating this improved policy to obtain a value function again, we directly calculate the maximum (max, instead of argmax) value across the actions to be used for the next sweep over the states.

Only at the end of the VI algorithm, after the Q-function converges to the optimal values, do we extract the optimal policy by taking the argmax over the actions of the Q-function, as before. You'll see it more clearly in the following code snippet.

One important thing to highlight is that whereas VI and PI are two different algorithms, in a more general view, they are two instances of generalized policy iteration (GPI). GPI is a general idea in RL in which policies are improved using their value function estimates, and value function estimates are improved toward the actual value function of the current policy. Whether you wait for the perfect estimates or not is just a detail.

I Speak Python

The value-iteration algorithm

def value_iteration(P, gamma=1.0, theta=1e-10):
    # start with a value function of all zeros
    V = np.zeros(len(P), dtype=np.float64)
    while True:
        # one-step look-ahead: build Q from the current V and the MDP
        Q = np.zeros((len(P), len(P[0])), dtype=np.float64)
        for s in range(len(P)):
            for a in range(len(P[s])):
                for prob, next_state, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * V[next_state] * (not done))
        # stop once the max-backup barely changes the estimates
        if np.max(np.abs(V - np.max(Q, axis=1))) < theta:
            break
        V = np.max(Q, axis=1)
    # extract the optimal (greedy) policy from the converged Q-function
    pi = lambda s: {s:a for s, a in enumerate(np.argmax(Q, axis=1))}[s]
    return V, pi
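Calling it looks just like policy iteration; the sketch below (mine, not the book's) reuses the FL dictionary P from the Gym example and the V_best result from the policy-iteration sketch, and should confirm that both methods reach the same optimal state-value function.

V_vi, pi_vi = value_iteration(P, gamma=0.99)
print(np.allclose(V_vi, V_best, atol=1e-6))   # expected True: VI and PI converge to the same optimal V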

Summary

The objective of a reinforcement learning agent is to maximize the expected return, which is the total reward over multiple episodes. For this, agents must use policies, which can be thought of as universal plans. Policies prescribe actions for states. They can be deterministic, meaning they return single actions, or stochastic, meaning they return probability distributions. To obtain policies, agents usually keep track of several summary values. The main ones are state-value, action-value, and action-advantage functions.

State-value functions summarize the expected return from a state. They indicate how much reward the agent will obtain from a state until the end of an episode in expectation. Action-value functions summarize the expected return from a state-action pair. This type of value function tells the expected reward-to-go after an agent selects a specific action in a given state. Action-value functions allow the agent to compare across actions and therefore solve the control problem. Action-advantage functions show the agent how much better than the default it can do if it were to opt for a specific state-action pair. All of these value functions are mapped to specific policies, perhaps an optimal policy. They depend on following what the policies prescribe until the end of the episode.

Policy evaluation is a method for estimating a value function from a policy and an MDP. Policy improvement is a method for extracting a greedy policy from a value function and an MDP. Policy iteration consists of alternating between policy-evaluation and policy improvement to obtain an optimal policy from an MDP. The policy evaluation phase may run for several iterations before it accurately estimates the value function for the given policy. In policy iteration, we wait until policy evaluation finds this accurate estimate. An alternative method, called value iteration, truncates the policy-evaluation phase and exits it, entering the policy-improvement phase early.

The more general view of these methods is generalized policy iteration, which describes the interaction of two processes to optimize policies: one moves value function estimates closer to the real value function of the current policy, while the other improves the current policy using its value function estimates, getting progressively better policies as this cycle continues.

By now, you

  • Know the objective of a reinforcement learning agent and the different statistics it may hold at any given time
  • Understand methods for estimating value functions from policies and methods for improving policies from value functions
  • Can find optimal policies in sequential decision-making problems modeled by MDPs

