3 Coding attention mechanisms
This chapter covers
- The reasons for using attention mechanisms in neural networks
- A basic self-attention framework, progressing to an enhanced self-attention mechanism
- A causal attention module that allows LLMs to generate one token at a time
- Masking randomly selected attention weights with dropout to reduce overfitting
- Stacking multiple causal attention modules into a multi-head attention module
At this point, you know how to prepare the input text for training LLMs by splitting text into individual word and subword tokens, which can be encoded into vector representations, embeddings, for the LLM.
Now, we will look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated in figure 3.1. We will largely look at attention mechanisms in isolation and focus on them at a mechanistic level. Then we will code the remaining parts of the LLM surrounding the self-attention mechanism to see it in action and to create a model to generate text.
Figure 3.1 The three main stages of coding an LLM. This chapter focuses on step 2 of stage 1: implementing attention mechanisms, which are an integral part of the LLM architecture.

We will implement four different variants of attention mechanisms, as illustrated in figure 3.2. These different attention variants build on each other, and the goal is to arrive at a compact and efficient implementation of multi-head attention that we can then plug into the LLM architecture we will code in the next chapter.
Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel.

3.1 The problem with modeling long sequences
Before we dive into the self-attention mechanism at the heart of LLMs, let's consider the problem with pre-LLM architectures that do not include attention mechanisms. Suppose we want to develop a language translation model that translates text from one language into another. As shown in figure 3.3, we can't simply translate a text word by word due to the grammatical structures of the source and target languages.
Figure 3.3 When translating text from one language to another, such as German to English, it’s not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammatical alignment.

To address this problem, it is common to use a deep neural network with two submodules, an encoder and a decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text.
Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder–decoder architecture for language translation. An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well suited for sequential data like text. If you are unfamiliar with RNNs, don't worry; you don't need to know their detailed workings to follow this discussion. Our focus here is on the general concept of the encoder–decoder setup.
In an encoder–decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction.
Figure 3.4 Before the advent of transformer models, encoder–decoder RNNs were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token.

While we don't need to know the inner workings of these encoder–decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector, a concept we discussed in chapter 2.
The big limitation of encoder–decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances.
Fortunately, it is not essential to understand RNNs to build an LLM. Just remember that encoder–decoder RNNs had a shortcoming that motivated the design of attention mechanisms.
3.2 Capturing data dependencies with attention mechanisms
Although RNNs work fine for translating short sentences, they don't work well for longer texts, as they don't have direct access to previous words in the input. One major shortcoming of this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder (figure 3.4).
Hence, researchers developed the Bahdanau attention mechanism for RNNs in 2014 (named after the first author of the respective paper; for more information, see appendix B), which modifies the encoder–decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step, as illustrated in figure 3.5.
Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the attention weights, which we will compute later. Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism, which is an RNN method outside this book’s scope.

Interestingly, only three years later, researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture (discussed in chapter 1), including a self-attention mechanism inspired by the Bahdanau attention mechanism.
Self-attention is a mechanism that allows each position in the input sequence to consider the relevancy of, or "attend to," all other positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series.
This chapter focuses on coding and understanding this self-attention mechanism used in GPT-like models, as illustrated in figure 3.6. In the next chapter, we will code the remaining parts of the LLM.
Figure 3.6 Self-attention is a mechanism in transformers used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence. In this chapter, we will code this self-attention mechanism from the ground up before we code the remaining parts of the GPT-like LLM in the following chapter.

3.3 Attending to different parts of the input with self-attention
We'll now cover the inner workings of the self-attention mechanism and learn how to code it from the ground up. Self-attention serves as the cornerstone of every LLM based on the transformer architecture. This topic may require a lot of focus and attention (no pun intended), but once you grasp its fundamentals, you will have conquered one of the toughest aspects of this book and of LLM implementation in general.
Since self-attention can appear complex, especially if you are encountering it for the first time, we will begin by examining a simplified version of it. Then we will implement the self-attention mechanism with trainable weights used in LLMs.
3.3.1 A simple self-attention mechanism without trainable weights
Let's begin by implementing a simplified variant of self-attention, free from any trainable weights, as summarized in figure 3.7. The goal is to illustrate a few key concepts in self-attention before adding trainable weights.
Figure 3.7 The goal of self-attention is to compute a context vector for each input element that combines information from all other input elements. In this example, we compute the context vector z(2). The importance or contribution of each input element for computing z(2) is determined by the attention weights a21 to a2T. When computing z(2), the attention weights are calculated with respect to input element x(2) and all other inputs.

Figure 3.7 shows an input sequence, denoted as x, consisting of T elements represented as x(1) to x(T). This sequence typically represents text, such as a sentence, that has already been transformed into token embeddings.
For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." Figure 3.7 shows these input vectors as three-dimensional embeddings.
In self-attention, our goal is to calculate context vectors z(i) for each element x(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.
To illustrate this concept, let's focus on the embedding vector of the second input element, x(2) (which corresponds to the token "journey"), and the corresponding context vector, z(2), shown at the bottom of figure 3.7. This enhanced context vector, z(2), is an embedding that contains information about x(2) and all other input elements, x(1) to x(T).
Context vectors play a crucial role in self-attention. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence (figure 3.7). This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token. But first, let's implement a simplified self-attention mechanism to compute these weights and the resulting context vector one step at a time.
Consider the following input sentence, which has already been embedded into three-dimensional vectors (see chapter 2). I've chosen a small embedding dimension to ensure it fits on the page without line breaks:
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)
The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores, as illustrated in figure 3.8. Due to spatial constraints, the figure displays the values of the preceding inputs tensor in a truncated version; for example, 0.87 is truncated to 0.8. In this truncated version, the embeddings of the words "journey" and "starts" may appear similar by random chance.
Figure 3.8 The overall goal is to illustrate the computation of the context vector z(2) using the second input element, x(2) as a query. This figure shows the first intermediate step, computing the attention scores w between the query x(2) and all other input elements as a dot product. (Note that the numbers are truncated to one digit after the decimal point to reduce visual clutter.)

Figure 3.8 illustrates how we calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x(2), with every other input token:
query = inputs[1]  #1
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
The computed attention scores are
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
In the next step, as shown in figure 3.9, we normalize each of the attention scores we computed previously. The main goal behind the normalization is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())
Figure 3.9 After computing the attention scores w21 to w2T with respect to the input query x(2), the next step is to obtain the attention weights a21 to a2T by normalizing the attention scores.

As the output shows, the attention weights now sum to 1:
Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)
In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The following is a basic implementation of the softmax function for normalizing the attention scores:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1:
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance.
Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
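To see the instability issue concretely, here is a small, self-contained check (my own illustration, not part of the chapter's listings): with large score values, the naive version overflows, while torch.softmax stays stable.

large_scores = torch.tensor([1000.0, -1000.0])   # hypothetical extreme scores
print(softmax_naive(large_scores))         # tensor([nan, 0.]) -- torch.exp(1000.) overflows to inf
print(torch.softmax(large_scores, dim=0))  # tensor([1., 0.]) -- numerically stable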
Now that we have computed the normalized attention weights, we are ready for the final step, as shown in figure 3.10: calculating the context vector z(2) by multiplying the embedded input tokens, x(i), with the corresponding attention weights and then summing the resulting vectors. Thus, context vector z(2) is the weighted sum of all input vectors, obtained by multiplying each input vector by its corresponding attention weight:
query = inputs[1]  #1
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)
Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention weights for query x(2), is to compute the context vector z(2). This context vector is a combination of all input vectors x(1) to x(T ) weighted by the attention weights.

The results of this computation are
tensor([0.4419, 0.6515, 0.5683])
Next, we will generalize this procedure for computing context vectors to calculate all context vectors simultaneously.
3.3.2 Computing attention weights for all input tokens
So far, we have computed the attention weights and the context vector for input 2, as shown in the highlighted row in figure 3.11. Now let's extend this computation to calculate attention weights and context vectors for all inputs.
Figure 3.11 The highlighted row shows the attention weights for the second input element as a query. Now we will generalize the computation to obtain all other attention weights. (Please note that the numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter. The values in each row should add up to 1.0 or 100%.)

We follow the same three steps as before (see figure 3.12), except that we make a few modifications in the code to compute all context vectors instead of only the second one, z(2):
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
Figure 3.12 In step 1, we add an additional for loop to compute the dot products for all pairs of inputs.

The resulting attention scores are as follows:
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
Each element in the tensor represents an attention score between a pair of inputs, as we saw in figure 3.11. Note that the values in that figure are normalized, which is why they differ from the unnormalized attention scores in the preceding tensor. We will take care of the normalization later.
When computing the preceding attention score tensor, we used for loops in Python. However, for loops are generally slow, and we can achieve the same results using matrix multiplication:
attn_scores = inputs @ inputs.T
print(attn_scores)
We can visually confirm that the results are the same as before:
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
In step 2 of figure 3.12, we normalize each row so that the values in each row sum to 1:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)
Aujz urtesrn ryk floowlign enottatin tiehwg ntrseo rbrz semahtc xdr auevsl nowhs jn frgeui 3.10:
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
In the context of using PyTorch, the dim parameter in functions like torch.softmax specifies the dimension of the input tensor along which the function will be computed. By setting dim=-1, we are instructing the softmax function to apply the normalization along the last dimension of the attn_scores tensor. If attn_scores is a two-dimensional tensor (for example, with a shape of [rows, columns]), it will normalize across the columns so that the values in each row (summing over the column dimension) sum up to 1.
We can verify that the rows indeed all sum to 1:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))
The result is
Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
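As a quick aside (my own illustration, not part of the chapter's listings), the following minimal example shows how the dim argument changes which axis sums to 1 for a two-dimensional tensor:

scores_demo = torch.tensor([[1.0, 2.0, 3.0],
                            [1.0, 1.0, 1.0]])   # hypothetical values
print(torch.softmax(scores_demo, dim=-1).sum(dim=-1))  # each row sums to 1
print(torch.softmax(scores_demo, dim=0).sum(dim=0))    # each column sums to 1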
In the third and final step of figure 3.12, we use these attention weights to compute all context vectors via matrix multiplication:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
In the resulting output tensor, each row contains a three-dimensional context vector:
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
We can double-check that the code is correct by comparing the second row with the context vector z(2) that we computed in section 3.3.1:
print("Previous 2nd context vector:", context_vec_2)
Based on the result, we can see that the previously calculated context_vec_2 matches the second row in the previous tensor exactly:
Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
This concludes the code walkthrough of a simple self-attention mechanism. Next, we will add trainable weights, enabling the LLM to learn from data and improve its performance on specific tasks.
3.4 Implementing self-attention with trainable weights
Our next step will be to implement the self-attention mechanism used in the original transformer architecture, the GPT models, and most other popular LLMs. This self-attention mechanism is also called scaled dot-product attention. Figure 3.13 shows how this self-attention mechanism fits into the broader context of implementing an LLM.
Figure 3.13 Previously, we coded a simplified attention mechanism to understand the basic mechanism behind attention mechanisms. Now, we add trainable weights to this attention mechanism. Later, we will extend this self-attention mechanism by adding a causal mask and multiple heads.

As illustrated in figure 3.13, the self-attention mechanism with trainable weights builds on the previous concepts: we want to compute context vectors as weighted sums over the input vectors specific to a certain input element. As you will see, there are only slight differences compared to the basic self-attention mechanism we coded earlier.
The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors. (We will train the LLM in chapter 5.)
We will tackle this self-attention mechanism in two subsections. First, we will code it step by step as before. Second, we will organize the code into a compact Python class that can be imported into the LLM architecture.
3.4.1 Computing the attention weights step by step
We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices Wq, Wk, and Wv. These three matrices are used to project the embedded input tokens, x(i), into query, key, and value vectors, respectively, as illustrated in figure 3.14.
Figure 3.14 In the first step of the self-attention mechanism with trainable weight matrices, we compute query (q), key (k), and value (v) vectors for input elements x. Similar to previous sections, we designate the second input, x(2), as the query input. The query vector q(2) is obtained via matrix multiplication between the input x(2) and the weight matrix Wq. Similarly, we obtain the key and value vectors via matrix multiplication involving the weight matrices Wk and Wv.

Earlier, we defined the second input element x(2) as the query when we computed the simplified attention weights to compute the context vector z(2). Then we generalized this to compute all context vectors z(1) ... z(T) for the six-word input sentence "Your journey starts with one step."
Similarly, we start here by computing only one context vector, z(2), for illustration purposes. We will then modify this code to calculate all context vectors.
Let’s begin by defining a few variables:
x_2 = inputs[1]  #1
d_in = inputs.shape[1]  #2
d_out = 2  #3
Note that in GPT-like models, the input and output dimensions are usually the same, but to better follow the computation, we'll use different input (d_in=3) and output (d_out=2) dimensions here.
Next, we initialize the three weight matrices Wq, Wk, and Wv shown in figure 3.14:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
We set requires_grad=False to reduce clutter in the outputs, but if we were to use the weight matrices for model training, we would set requires_grad=True to update these matrices during model training.
Next, we compute the query, key, and value vectors:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
The output for the query results in a two-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out, to 2:
tensor([0.4306, 1.4551])
Even though our temporary goal is only to compute the one context vector, z(2), we still require the key and value vectors for all input elements, as they are involved in computing the attention weights with respect to the query q(2) (see figure 3.14).
We can obtain all keys and values via matrix multiplication:
keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
As we can tell from the outputs, we successfully projected the six input tokens from a three-dimensional onto a two-dimensional embedding space:
keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
The second step is to compute the attention scores, as shown in figure 3.15.
Figure 3.15 The attention score computation is a dot-product computation similar to what we used in the simplified self-attention mechanism in section 3.3. The new aspect here is that we are not directly computing the dot-product between the input elements but using the query and key obtained by transforming the inputs via the respective weight matrices.

First, let’s compute the attention score ω22:
keys_2 = keys[1]  #1
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
The result for the unnormalized attention score is
tensor(1.8524)
Again, we can generalize this computation to all attention scores via matrix multiplication:
attn_scores_2 = query_2 @ keys.T  #1
print(attn_scores_2)
As a quick check, we can see that the second element in the output matches the attn_score_22 we computed previously:
tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
Now, we want to go from the attention scores to the attention weights, as illustrated in figure 3.16. We compute the attention weights by scaling the attention scores and using the softmax function. However, now we scale the attention scores by dividing them by the square root of the embedding dimension of the keys (taking the square root is mathematically the same as exponentiating by 0.5):
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
Figure 3.16 After computing the attention scores ω, the next step is to normalize these scores using the softmax function to obtain the attention weights 𝛼.

The resulting attention weights are
tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
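The reason for dividing by the square root of the embedding dimension is to keep the dot products from growing so large that the softmax becomes nearly one-hot, which would lead to very small gradients during training. The following quick sketch (my own illustration, with made-up score values) shows how scaling the scores up sharpens the resulting distribution:

example_scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])  # hypothetical scores
print(torch.softmax(example_scores, dim=-1))      # relatively even weights
print(torch.softmax(example_scores * 8, dim=-1))  # larger scores: much more peaked weights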
Now, the final step is to compute the context vectors, as illustrated in figure 3.17.
Figure 3.17 In the final step of the self-attention computation, we compute the context vector by combining all value vectors via the attention weights.

Similar to when we computed the context vector as a weighted sum over the input vectors (see section 3.3), we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector. Also as before, we can use matrix multiplication to obtain the output in one step:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
The contents of the resulting vector are as follows:
tensor([0.3061, 0.8210])
So far, we've only computed a single context vector, z(2). Next, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
3.4.2 Implementing a compact self-attention Python class
At this point, we have gone through a lot of steps to compute the self-attention outputs. We did so mainly for illustration purposes so we could go through them one step at a time. In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class, as shown in the following listing.
Listing 3.1 A compact self-attention class
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module, which is a fundamental building block of PyTorch models that provides the necessary functionalities for model layer creation and management.
The __init__ method initializes trainable weight matrices (W_query, W_key, and W_value) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out.
During the forward pass, using the forward method, we compute the attention scores (attn_scores) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores.
We can use this class as follows:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))
Since inputs contains six embedding vectors, this results in a matrix storing the six context vectors:
tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
As a quick check, notice that the second row ([0.3061, 0.8210]) matches the contents of context_vec_2 in the previous section. Figure 3.18 summarizes the self-attention mechanism we just implemented.
Figure 3.18 In self-attention, we transform the input vectors in the input matrix X with the three weight matrices, Wq, Wk, and Wv. Then we compute the attention weight matrix based on the resulting queries (Q) and keys (K). Using the attention weights and values (V), we then compute the context vectors (Z). For visual clarity, we focus on a single input text with n tokens, not a batch of multiple inputs. Consequently, the three-dimensional input tensor is simplified to a two-dimensional matrix in this context. This approach allows for a more straightforward visualization and understanding of the processes involved. For consistency with later figures, the values in the attention matrix do not depict the real attention weights. (The numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter. The values in each row should add up to 1.0 or 100%.)

Sfkl-nnatoiett vnevsoil xgr itreabanl gteiwh iarmcste Md, Me, snh Mo. Yckyv raemstci samrtrfno untip zrcu rjxn qeeirsu, gzox, cpn uasvle, teeveypclris, ihcwh sto ricaclu mntonespoc el rbx tnniottea mmicehans. Xc xyr lmoed jz ospdeex rx vmtv sqcr ngirdu nntgairi, jr dstsauj esthe anbeilrta tsigwhe, cc wv jwff zoo jn cuingomp hpctersa.
We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.
Listing 3.2 A self-attention class using PyTorch’s Linear layers
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
The output is
tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)
Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial values for the weight matrices, since nn.Linear uses a more sophisticated weight initialization scheme.
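One way to convince ourselves that the two classes implement the same computation is to copy the weights from sa_v2 into sa_v1 and compare the outputs (a quick sanity check of my own, not part of the chapter's listings). Keep in mind that nn.Linear stores its weight matrix in transposed form, so it needs to be transposed before assignment:

# Transfer the nn.Linear weights into the plain nn.Parameter matrices of sa_v1
sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)
sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)
sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)
print(torch.allclose(sa_v1(inputs), sa_v2(inputs)))  # True: identical context vectors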
Next, we will make enhancements to the self-attention mechanism, focusing specifically on incorporating causal and multi-head elements. The causal aspect involves modifying the attention mechanism to prevent the model from accessing future information in the sequence, which is crucial for tasks like language modeling, where each word prediction should only depend on previous words.
The multi-head component involves splitting the attention mechanism into multiple "heads." Each head learns different aspects of the data, allowing the model to simultaneously attend to information from different representation subspaces at different positions. This improves the model's performance on complex tasks.
3.5 Hiding future words with causal attention
For many LLM tasks, you will want the self-attention mechanism to consider only the tokens that appear prior to the current position when predicting the next token in a sequence. Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token when computing attention scores. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once.
Now, we will modify the standard self-attention mechanism to create a causal attention mechanism, which is essential for developing an LLM in the subsequent chapters. To achieve this in GPT-like LLMs, for each token processed, we mask out the future tokens, which come after the current token in the input text, as illustrated in figure 3.19. We mask out the attention weights above the diagonal, and we normalize the non-masked attention weights such that the attention weights sum to 1 in each row. Later, we will implement this masking and normalization procedure in code.
Figure 3.19 In causal attention, we mask out the attention weights above the diagonal such that for a given input, the LLM can’t access future tokens when computing the context vectors using the attention weights. For example, for the word “journey” in the second row, we only keep the attention weights for the words before (“Your”) and in the current position (“journey”).

3.5.1 Applying a causal attention mask
Our next step is to implement the causal attention mask in code. To implement the steps for applying a causal attention mask to obtain the masked attention weights, as summarized in figure 3.20, let's work with the attention scores and weights from the previous section to code the causal attention mechanism.
Figure 3.20 One way to obtain the masked attention weight matrix in causal attention is to apply the softmax function to the attention scores, zeroing out the elements above the diagonal and normalizing the resulting matrix.

In the first step, we compute the attention weights using the softmax function as we have done previously:
queries = sa_v2.W_query(inputs)  #1
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)
This results in the following attention weights:
tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
We can implement the second step using PyTorch's tril function to create a mask where the values above the diagonal are zero:
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
The resulting mask is
tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])
Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:
masked_simple = attn_weights * mask_simple
print(masked_simple)
As we can see, the elements above the diagonal are successfully zeroed out:
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)
Bkq idrht xuzr jc er elemraronzi brv tiannetot hgswiet xr mcb qb rk 1 igaan jn qszk wxt. Mo znc vehieca jucr hh dnivgiid vszy tleneme nj cgzo ewt gh uxr mdz jn sayx ktw:
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
The result is an attention weight matrix where the attention weights above the diagonal are zeroed out, and the rows sum to 1:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)
While we could wrap up our implementation of causal attention at this point, we can still improve it. Let's take a mathematical property of the softmax function and implement the computation of the masked attention weights more efficiently in fewer steps, as shown in figure 3.21.
Figure 3.21 A more efficient way to obtain the masked attention weight matrix in causal attention is to mask the attention scores with negative infinity values before applying the softmax function.

The softmax function converts its inputs into a probability distribution. When negative infinity values (-∞) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because e^(-∞) approaches 0.)
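A minimal check of this property (my own illustration, with made-up values): positions set to negative infinity receive zero probability after softmax, while the remaining entries still sum to 1.

row = torch.tensor([0.3, 0.1, float("-inf"), float("-inf")])
print(torch.softmax(row, dim=-1))  # tensor([0.5498, 0.4502, 0.0000, 0.0000])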
We can implement this more efficient masking "trick" by creating a mask with 1s above the diagonal and then replacing these 1s with negative infinity (-inf) values:
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
This results in the following mask:
tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)
Now all we need to do is apply the softmax function to these masked results, and we are done:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)
As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values, as in section 3.4. However, we will first cover another minor tweak to the causal attention mechanism that is useful for reducing overfitting when training LLMs.
3.5.2 Masking additional attention weights with dropout
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively "dropping" them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasize that dropout is only used during training and is disabled afterward.
In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied at two specific times: after calculating the attention weights or after applying the attention weights to the value vectors. Here we will apply the dropout mask after computing the attention weights, as illustrated in figure 3.22, because it's the more common variant in practice.
Figure 3.22 Using the causal attention mask (upper left), we apply an additional dropout mask (upper right) to zero out additional attention weights to reduce overfitting during training.

In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights. (When we train the GPT model in later chapters, we will use a lower dropout rate, such as 0.1 or 0.2.) We apply PyTorch's dropout implementation first to a 6 × 6 tensor consisting of 1s for simplicity:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  #1
example = torch.ones(6, 6)  #2
print(dropout(example))
As we can see, approximately half of the values are zeroed out:
tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])
When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
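Both the 1/(1 - p) rescaling and the training-only behavior can be verified with a short, standalone check (my own sketch, using a fresh dropout_demo instance so it does not disturb the dropout module used above; the exact zeroed positions depend on the random seed):

torch.manual_seed(123)
dropout_demo = torch.nn.Dropout(0.5)           # hypothetical helper instance
scaled = dropout_demo(torch.ones(6, 6))
print(scaled.mean())     # close to 1.0 in expectation: surviving entries are scaled by 1/0.5 = 2

dropout_demo.eval()      # evaluation mode turns dropout into a no-op
print(torch.equal(dropout_demo(torch.ones(6, 6)), torch.ones(6, 6)))  # True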
Now let’s apply dropout to the attention weight matrix itself:
torch.manual_seed(123)
print(dropout(attn_weights))
The resulting attention weight matrix now has additional elements zeroed out and the remaining ones rescaled:
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)
Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595.
Having gained an understanding of causal attention and dropout masking, we can now develop a concise Python class. This class is designed to facilitate the efficient application of these two techniques.
3.5.3 Implementing a compact causal attention class
We will now incorporate the causal attention and dropout modifications into the SelfAttention Python class we developed in section 3.4. This class will then serve as a template for developing multi-head attention, which is the final attention class we will implement.
But before we begin, let's ensure that the code can handle batches consisting of more than one input so that the CausalAttention class supports the batch outputs produced by the data loader we implemented in chapter 2.
For simplicity, to simulate such batch inputs, we duplicate the input text example:
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)  #1
This results in a three-dimensional tensor consisting of two input texts with six tokens each, where each token is a three-dimensional embedding vector:
torch.Size([2, 6, 3])
The following CausalAttention class is similar to the SelfAttention class we implemented earlier, except that we added the dropout and causal mask components.
Listing 3.3 A compact causal attention class
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)  #1
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )  #2

    def forward(self, x):
        b, num_tokens, d_in = x.shape  #3
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(  #4
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec
While all added code lines should be familiar at this point, we now added a self.register_buffer() call in the __init__ method. The use of register_buffer in PyTorch is not strictly necessary for all use cases but offers several advantages here. For instance, when we use the CausalAttention class in our LLM, buffers are automatically moved to the appropriate device (CPU or GPU) along with our model, which will be relevant when training our LLM. This means we don't need to manually ensure these tensors are on the same device as our model parameters, avoiding device mismatch errors.
We can use the CausalAttention class as follows, similar to the SelfAttention classes previously:
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
The resulting context vector is a three-dimensional tensor where each token is now represented by a two-dimensional embedding:
context_vecs.shape: torch.Size([2, 6, 2])
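A brief sketch of the register_buffer benefit mentioned above (my own illustration, assuming a CUDA device is available; the same idea applies to any device move): the registered mask follows the module when it is moved, so no separate transfer is needed.

if torch.cuda.is_available():
    ca.to("cuda")            # moves the parameters *and* the registered mask buffer together
    print(ca.mask.device)    # cuda:0, same device as ca.W_query.weight
    ca.to("cpu")             # move back so the rest of the chapter stays on the CPU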
Figure 3.23 summarizes what we have accomplished so far. We have focused on the concept and implementation of causal attention in neural networks. Next, we will expand on this concept and implement a multi-head attention module that implements several causal attention mechanisms in parallel.
Figure 3.23 Here’s what we’ve done so far. We began with a simplified attention mechanism, added trainable weights, and then added a causal attention mask. Next, we will extend the causal attention mechanism and code multi-head attention, which we will use in our LLM.

3.6 Extending single-head attention to multi-head attention
Our final step will be to extend the previously implemented causal attention class over multiple heads. This is also called multi-head attention.
The term "multi-head" refers to dividing the attention mechanism into multiple "heads," each operating independently. In this context, a single causal attention module can be considered single-head attention, where there is only one set of attention weights processing the input sequentially.
We will tackle this expansion from causal attention to multi-head attention. First, we will intuitively build a multi-head attention module by stacking multiple CausalAttention modules. Then we will implement the same multi-head attention module in a more complicated but more computationally efficient way.
3.6.1 Stacking multiple single-head attention layers
In practical terms, implementing multi-head attention involves creating multiple instances of the self-attention mechanism (see figure 3.18), each with its own weights, and then combining their outputs. Using multiple instances of the self-attention mechanism can be computationally intensive, but it's crucial for the kind of complex pattern recognition that models like transformer-based LLMs are known for.
Figure 3.24 illustrates the structure of a multi-head attention module, which consists of multiple single-head attention modules, as previously depicted in figure 3.18, stacked on top of each other.
Figure 3.24 The multi-head attention module includes two single-head attention modules stacked on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head attention module with two heads, we now have two value weight matrices: Wv1 and Wv2. The same applies to the other weight matrices, WQ and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine into a single context vector matrix Z.

Tz netdomein fbroee, kru jzmn pjck ihdneb iltmu-kyus ottatnine aj rv thn qxr tationetn cneishmma tmeliulp estim (jn lllapear) rywj nfeeidfrt, nldaere nleira ejrcpnosito—rxq lsurset lx yillpniugtm orp untpi uzsr (fxxj krg qurey, oop, shn aeulv scovetr nj ttnanteio cniamsshem) dh c hgweit artmix. Jn xabk, wx naz evechia drjc bu npimeegtimln s imples MultiHeadAttentionWrapper
saslc qrrs asstkc llepitum ecasntnsi le ktg vrispoyule emdemeiltnp CausalAttention
luomde.
Listing 3.4 A wrapper class to implement multi-head attention
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(
                 d_in, d_out, context_length, dropout, qkv_bias
             )
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
For example, if we use this MultiHeadAttentionWrapper class with two attention heads (via num_heads=2) and a CausalAttention output dimension of d_out=2, we get a four-dimensional context vector (d_out*num_heads=4), as depicted in figure 3.25.
Figure 3.25 Using the MultiHeadAttentionWrapper, we specified the number of attention heads (num_heads). If we set num_heads=2, as in this example, we obtain a tensor with two sets of context vector matrices. In each context vector matrix, the rows represent the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension specified via d_out=4. We concatenate these context vector matrices along the column dimension. Since we have two attention heads and an embedding dimension of 2, the final embedding dimension is 2 × 2 = 4.

To illustrate this further with a concrete example, we can use the MultiHeadAttentionWrapper class similar to the CausalAttention class before:
torch.manual_seed(123)
context_length = batch.shape[1]  # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
This results in the following tensor representing the context vectors:
tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])
The first dimension of the resulting context_vecs tensor is 2 since we have two input texts (the input texts are duplicated, which is why the context vectors are exactly the same for both). The second dimension refers to the 6 tokens in each input. The third dimension refers to the four-dimensional embedding of each token.
Up to this point, we have implemented a MultiHeadAttentionWrapper that combines multiple single-head attention modules. However, these are processed sequentially via [head(x) for head in self.heads] in the forward method. We can improve this implementation by processing the heads in parallel. One way to achieve this is by computing the outputs for all attention heads simultaneously via matrix multiplication.
3.6.2 Implementing multi-head attention with weight splits
So far, we have created a MultiHeadAttentionWrapper to implement multi-head attention by stacking multiple single-head attention modules. This was done by instantiating and combining several CausalAttention objects.
Instead of maintaining two separate classes, MultiHeadAttentionWrapper and CausalAttention, we can combine these concepts into a single MultiHeadAttention class. Also, in addition to merging the MultiHeadAttentionWrapper with the CausalAttention code, we will make some other modifications to implement multi-head attention more efficiently.
In the MultiHeadAttentionWrapper, multiple heads are implemented by creating a list of CausalAttention objects (self.heads), each representing a separate attention head. The CausalAttention class independently performs the attention mechanism, and the results from each head are concatenated. In contrast, the following MultiHeadAttention class integrates the multi-head functionality within a single class. It splits the input into multiple heads by reshaping the projected query, key, and value tensors and then combines the results from these heads after computing attention.
Let's take a look at the MultiHeadAttention class before we discuss it further.
Listing 3.5 An efficient multi-head attention class
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  #1
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  #2
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)  #3
        queries = self.W_query(x)  #3
        values = self.W_value(x)  #3

        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)  #4
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(
            b, num_tokens, self.num_heads, self.head_dim
        )

        keys = keys.transpose(1, 2)  #5
        queries = queries.transpose(1, 2)  #5
        values = values.transpose(1, 2)  #5

        attn_scores = queries @ keys.transpose(2, 3)  #6
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]  #7

        attn_scores.masked_fill_(mask_bool, -torch.inf)  #8

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)  #9
        #10
        context_vec = context_vec.contiguous().view(
            b, num_tokens, self.d_out
        )
        context_vec = self.out_proj(context_vec)  #11
        return context_vec
Even though the reshaping (.view) and transposing (.transpose) of tensors inside the MultiHeadAttention class looks very complicated mathematically, the MultiHeadAttention class implements the same concept as the MultiHeadAttentionWrapper earlier.
On a big-picture level, in the previous MultiHeadAttentionWrapper, we stacked multiple single-head attention layers that we combined into a multi-head attention layer. The MultiHeadAttention class takes an integrated approach. It starts with a multi-head layer and then internally splits this layer into individual attention heads, as illustrated in figure 3.26.
Figure 3.26 In the MultiHeadAttentionWrapper class with two attention heads, we initialized two weight matrices, Wq1 and Wq2, and computed two query matrices, Q1 and Q2 (top). In the MultiheadAttention class, we initialize one larger weight matrix Wq, only perform one matrix multiplication with the inputs to obtain a query matrix Q, and then split the query matrix into Q1 and Q2 (bottom). We do the same for the keys and values, which are not shown to reduce visual clutter.

The splitting of the query, key, and value tensors is achieved through tensor reshaping and transposing operations using PyTorch's .view and .transpose methods. The input is first transformed (via linear layers for queries, keys, and values) and then reshaped to represent multiple heads.
The key operation is to split the d_out dimension into num_heads and head_dim, where head_dim = d_out / num_heads. This splitting is then achieved using the .view method: a tensor of dimensions (b, num_tokens, d_out) is reshaped to dimension (b, num_tokens, num_heads, head_dim).
The tensors are then transposed to bring the num_heads dimension before the num_tokens dimension, resulting in a shape of (b, num_heads, num_tokens, head_dim). This transposition is crucial for correctly aligning the queries, keys, and values across the different heads and performing batched matrix multiplications efficiently.
To illustrate this batched matrix multiplication, suppose we have the following tensor:
Cx laltsreuti ujrz chbteda rmiatx iotlaiimultncp, usseopp wx kbvs oqr nwfgoioll etsorn:
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],  #1
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])
Now we perform a batched matrix multiplication between the tensor itself and a view of the tensor where we transposed the last two dimensions, num_tokens and head_dim:
print(a @ a.transpose(2, 3))
The result is
tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])
In this case, the matrix multiplication implementation in PyTorch handles the four-dimensional input tensor so that the matrix multiplication is carried out between the two last dimensions (num_tokens, head_dim) and then repeated for the individual heads.
For instance, the preceding becomes a more compact way to compute the matrix multiplication for each head separately:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)
The results are exactly the same as those we obtained when using the batched matrix multiplication print(a @ a.transpose(2, 3)):
First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])
Continuing with MultiHeadAttention, after computing the attention weights and context vectors, the context vectors from all heads are transposed back to the shape (b, num_tokens, num_heads, head_dim). These vectors are then reshaped (flattened) into the shape (b, num_tokens, d_out), effectively combining the outputs from all heads.
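To make this recombination concrete, here is a small standalone shape trace (my own illustration with arbitrary sizes, not tied to the chapter's inputs):

b, num_heads, num_tokens, head_dim = 2, 4, 6, 3   # hypothetical sizes
per_head = torch.randn(b, num_heads, num_tokens, head_dim)
combined = per_head.transpose(1, 2).contiguous().view(
    b, num_tokens, num_heads * head_dim
)
print(combined.shape)  # torch.Size([2, 6, 12]), i.e., (b, num_tokens, d_out)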
Additionally, we added an output projection layer (self.out_proj) to MultiHeadAttention after combining the heads, which is not present in the CausalAttention class. This output projection layer is not strictly necessary (see appendix B for more details), but it is commonly used in many LLM architectures, which is why I added it here for completeness.
Even though the MultiHeadAttention class looks more complicated than the MultiHeadAttentionWrapper due to the additional reshaping and transposition of tensors, it is more efficient. The reason is that we only need one matrix multiplication to compute the keys, for instance, keys = self.W_key(x) (the same is true for the queries and values). In the MultiHeadAttentionWrapper, we needed to repeat this matrix multiplication, which is computationally one of the most expensive steps, for each attention head.
The MultiHeadAttention class can be used similar to the SelfAttention and CausalAttention classes we implemented earlier:
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
We have now implemented the MultiHeadAttention class that we will use when we implement and train the LLM. Note that while the code is fully functional, I used relatively small embedding sizes and numbers of attention heads to keep the outputs readable.
For comparison, the smallest GPT-2 model (117 million parameters) has 12 attention heads and a context vector embedding size of 768. The largest GPT-2 model (1.5 billion parameters) has 25 attention heads and a context vector embedding size of 1,600. The embedding sizes of the token inputs and context embeddings are the same in GPT models (d_in = d_out).
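For reference, instantiating our MultiHeadAttention class with GPT-2-small-like settings would look as follows (a rough sketch for a sense of scale only; the variable names below are my own, and we won't work with modules of this size until later chapters):

gpt2_context_length = 1024              # assumed context length for this sketch
gpt2_d_in = gpt2_d_out = 768            # 12 heads, 768-dimensional embeddings
mha_gpt2_small = MultiHeadAttention(
    gpt2_d_in, gpt2_d_out, gpt2_context_length, 0.0, num_heads=12
)
# Number of trainable parameters in this single attention block
print(sum(p.numel() for p in mha_gpt2_small.parameters()))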
Summary
- Attention mechanisms transform input elements into enhanced context vector representations that incorporate information about all inputs.
- A self-attention mechanism computes the context vector representation as a weighted sum over the inputs.
- In a simplified attention mechanism, the attention weights are computed via dot products.
- A dot product is a concise way of multiplying two vectors element-wise and then summing the products.
- Matrix multiplications, while not strictly required, help us implement computations more efficiently and compactly by replacing nested for loops.
- In self-attention mechanisms used in LLMs, also called scaled dot-product attention, we include trainable weight matrices to compute intermediate transformations of the inputs: queries, values, and keys.
- When working with LLMs that read and generate text from left to right, we add a causal attention mask to prevent the LLM from accessing future tokens.
- In addition to causal attention masks to zero-out attention weights, we can add a dropout mask to reduce overfitting in LLMs.
- The attention modules in transformer-based LLMs involve multiple instances of causal attention, which is called multi-head attention.
- We can create a multi-head attention module by stacking multiple instances of causal attention modules.
- A more efficient way of creating multi-head attention modules involves batched matrix multiplications.