Appendix B: Installing PySpark


This appendix covers the installation of standalone Spark and PySpark on your own computer, whether it’s running Windows, macOS, or Linux. I also briefly cover cloud offerings should you want to easily take advantage of PySpark’s distributed nature.

Having a local PySpark cluster means that you’ll be able to experiment with the syntax, using smaller data sets. You don’t have to acquire multiple computers or spend any money on managed PySpark on the cloud until you’re ready to scale your programs. Once ready to work on a larger data set, you can easily transfer your program to a cloud instance of Spark for additional power.


B.1 Installing PySpark on your local machine

This section covers installing Spark and Python on your own computer. Spark is a complex piece of software, and while the installation process is simple, most guides out there overcomplicate it. We'll take a much simpler approach by installing the bare minimum to start, and building from there. Our goals are as follows:

  • Install Java (Spark is written in Scala, which runs on the Java Virtual Machine, or JVM).
  • Install Spark.
  • Install Python 3 and IPython.
  • Launch a PySpark shell using IPython.
  • (Optional) Install Jupyter and use it with PySpark.

B.2 Windows

When working on Windows, you have the option to either install Spark directly on Windows, or to use WSL (Windows Subsystem for Linux). If you want to use WSL, follow the instructions at aka.ms/wslinstall and then follow the instructions for GNU/Linux. If you want to install on plain Windows, follow the rest of this section.

B.2.1 Install Java

The easiest way to install Java on Windows is to go on adoptopenjdk.net and follow the download and installation instructions for downloading Java 8.

[Warning]  Warning

Because of compatibility with third-party libraries, I recommend staying on Java 8. Spark 3.0+ works using Java 11+ as well, but some third-party libraries trail behind.

B.2.2 Install 7-zip

Spark is available as a GZIP archive (.tgz) file on their website. By default, Windows doesn't provide a native way to extract those files. The most popular option is 7-zip[18]. Simply go on the website, download the program, and follow the installation instructions.

B.2.3 Download and install Apache Spark

Go on the Apache website and download the latest Spark release. Accept the default options; Figure B.1 displays the ones I see when I navigate to the download page. Make sure to download the signatures and checksums if you want to validate the download (step 4 on the page).

Figure B.1. The options to download Spark (here showing 3.1.1)

Once you have downloaded the file, unzip the file using 7-zip. I recommend putting the directory under C:\Users\[YOUR_USER_NAME]\spark.

Next, we need to download a winutils.exe to prevent some cryptic Hadoop errors. Go on the github.com/cdarlint/winutils repository and download the winutils.exe file in the hadoop-X.Y.Z/bin directory, where X.Y matches the Hadoop version that was used for the selected Spark version in Figure B.1 (for instance, if you install Spark 3.1.1 on your computer, X.Y is 3.2). Keep the README.md of the repository handy. Place the winutils.exe in the bin directory of your Spark installation (C:\Users\[YOUR_USER_NAME]\spark\bin).
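As a sanity check, the Spark-to-Hadoop pairing can be written down in a few lines of Python. The helper name and the version pairs below are my assumptions based on the prebuilt Spark packages at the time of writing; the README.md of the winutils repository remains the authority.

```python
def winutils_hadoop_series(spark_version):
    """Return the Hadoop series (X.Y) whose winutils.exe to grab for a
    given Spark version.

    The pairs here are illustrative assumptions; always confirm against
    the README.md of the github.com/cdarlint/winutils repository.
    """
    pairs = {"3.1": "3.2", "3.0": "3.2", "2.4": "2.7"}
    major_minor = ".".join(spark_version.split(".")[:2])
    return pairs[major_minor]

# Spark 3.1.1 was prebuilt against Hadoop 3.2:
series = winutils_hadoop_series("3.1.1")
```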

Next, we set two environment variables to give our shell knowledge about where to find Spark. Think of environment variables as OS-level variables any program can use: for instance, PATH indicates to your OS where to find executables to run. Here, we set SPARK_HOME (the main directory where the Spark executables are located), and we append the value of SPARK_HOME to the PATH environment variable. To do so, open the start menu and search for "Edit the system environment variables". Click on the "Environment variables" button (see Figure B.2) and then add them there. You will need to set SPARK_HOME to the directory of your Spark installation (C:\Users\[YOUR-USER-NAME]\spark). Finally, add the %SPARK_HOME%\bin directory to your PATH environment variable.

[Note]  Note

For the PATH variable, you most certainly will already have some values in there (akin to a list). To avoid removing other useful values that might be used by other programs, double-click on it and append %SPARK_HOME%\bin.

Figure B.2. Setting environment variables for Hadoop on Windows
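Once the variables are set, you can verify them from any Python shell. This is a small sketch (the function name is mine) that checks a mapping shaped like os.environ; pass os.environ to check your real shell.

```python
import os

def spark_env_ok(env, pathsep=os.pathsep):
    """Check that SPARK_HOME is set and that its bin directory is on PATH."""
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        return False
    # Normalize case so the check behaves like Windows path comparison.
    bin_dir = os.path.normcase(os.path.join(spark_home, "bin"))
    entries = [os.path.normcase(p) for p in env.get("PATH", "").split(pathsep)]
    return bin_dir in entries

# Hypothetical environment, built with this OS's path conventions:
home = os.path.join("Users", "me", "spark")
env = {
    "SPARK_HOME": home,
    "PATH": os.pathsep.join(["/usr/bin", os.path.join(home, "bin")]),
}
```

Running `spark_env_ok(os.environ)` after a fresh shell restart tells you whether the configuration stuck.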

B.2.4 Configure Spark to work seamlessly with Python

If you are using Spark 3.0+ with Java 11+, you need to input some additional configuration for it to work seamlessly with Python. To do so, we need to create a spark-defaults.conf file under the $SPARK_HOME/conf directory. When reaching this directory, there should be a spark-defaults.conf.template file already there, along with some other files. Make a copy of spark-defaults.conf.template, naming it spark-defaults.conf. Inside this file, include the following.

spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"

This will prevent the pesky java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error that happens when you try to pass data between Spark and Python (chapter 8 onwards).
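If you prefer to script the copy-and-append step, a minimal sketch with the standard library looks like this (the function name is mine; the two option lines are the ones above):

```python
import shutil
from pathlib import Path

EXTRA = '"-Dio.netty.tryReflectionSetAccessible=true"'
LINES = [
    f"spark.driver.extraJavaOptions={EXTRA}",
    f"spark.executor.extraJavaOptions={EXTRA}",
]

def write_spark_defaults(conf_dir):
    """Copy spark-defaults.conf.template to spark-defaults.conf (when the
    template exists and the conf doesn't) and append the two Java options.

    `conf_dir` is your $SPARK_HOME/conf directory.
    """
    conf_dir = Path(conf_dir)
    template = conf_dir / "spark-defaults.conf.template"
    conf = conf_dir / "spark-defaults.conf"
    if template.exists() and not conf.exists():
        shutil.copy(template, conf)
    with open(conf, "a") as f:
        for line in LINES:
            f.write(line + "\n")
    return conf
```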

B.2.5 Install Python

The easiest way to get Python 3 is to use the Anaconda Distribution. Go to www.anaconda.com/distribution and follow the installation instructions, making sure you're getting the 64-bit Graphical installer for Python 3.X for your OS.

Once Anaconda is installed, we can activate the Python 3 environment by selecting the "Anaconda Powershell Prompt" in the start menu. If you want to create a dedicated virtual environment for PySpark, use the following command.

$ conda create -n pyspark python=3.8 pandas ipython pyspark=3.1.1
[Warning]  Warning

Python 3.8 is supported only using Spark 3.0+. If you use Spark 2.4.X or before, be sure to specify Python 3.7 in your environment creation.

Then, to select your newly created environment, just input conda activate pyspark in the Anaconda Prompt.

B.2.6 Launching an IPython REPL and starting PySpark

If you have configured the SPARK_HOME and PATH variables, your Python REPL will have access to a local instance of pyspark. Follow the next code block to launch IPython.

[Tip]  Tip

If you aren't comfortable with the Command Line and Powershell, I've personally learned to use it using Learn Windows PowerShell in a Month of Lunches by Don Jones and Jeffery D. Hicks (Manning, 2016).

conda activate pyspark #1
ipython

Then, within the REPL, you can import pyspark and start it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
[Note]  Note

Spark provides a pyspark.cmd helper command through the bin directory of your Spark installation. I prefer accessing PySpark through a regular Python REPL when working locally, as I find it easier to install libraries and know exactly which Python you're using. It also interfaces well with your favorite editor.

B.2.7 (Optional) Install and run Jupyter to use Jupyter notebook

Since we have configured PySpark to be imported from a regular Python process, we don't have any further configuration to do to use it with a notebook. In your Anaconda Powershell window, install Jupyter using the following command.

conda install -c conda-forge notebook

You can now run a Jupyter notebook server using the following command. Use cd to move to the directory where your source code is before doing so.

cd [WORKING DIRECTORY]
jupyter notebook

Start a Python kernel, and get started the same way you would using IPython.

[Note]  Note

Svem ettnlaear ninislttlaao tocnituisnrs fjwf ctaere c erseaapt inmeovtnnre lxt Fnothy rgmarosp qnz VuSxyst opsmarrg, hichw ja hweer kqp mhgit xav vxtm ncrp exn neekrl oioptn. Kjnqa jzur kcr kl nctiuostsirn, oqc ruk Python 3 enkerl.


B.3 macOS

With macOS, by far the easiest option is to use the Homebrew apache-spark package. It takes care of all dependencies (I still recommend using Anaconda for managing Python environments, for simplicity).

B.3.1 Install Homebrew

Homebrew is a package manager for macOS. It provides a simple command line interface to install many popular software packages and keep them up to date. While you can follow the manual "download and install" steps you'll find in the Windows section with little change, Homebrew will simplify our installation process to a few commands.

To install Homebrew, go to brew.sh and follow the installation instructions. You'll be able to interact with Homebrew through the brew command.

Apple M1: Rosetta or no Rosetta

If you are using a Mac with the new Apple M1 chip, you have the option to run using Rosetta (an emulator for x64 instructions). The instructions in this section will work.

If you want to use a JVM specialized for the Apple M1, I use the Azul Zulu VM that you can download using Homebrew through the following link: github.com/mdogan/homebrew-zulu. All the code in the book works (faster than on an equivalent Intel Mac, dare I say), with the exception of the Spark BigQuery Connector, which fails on an ARM platform (see the following link: github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/377).

B.3.2 Install Java and Spark

[Warning]  Warning

Because of compatibility with third-party libraries, I recommend staying on Java 8 for the time being.

Input the following command in a terminal.

$ brew tap adoptopenjdk/openjdk
$ brew install --cask adoptopenjdk8
$ brew install apache-spark

You can specify the version you want; I recommend getting the latest by passing no parameters.

If Homebrew did not set $SPARK_HOME when installing Spark on your machine (test by restarting your terminal and typing echo $SPARK_HOME), you will need to add the following to your ~/.zshrc:

export SPARK_HOME="/usr/local/Cellar/apache-spark/X.Y.Z/libexec"

Make sure you are inputting the right version number in lieu of X.Y.Z!
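To avoid mistyping the version, you can also build the export line from whatever Homebrew actually installed. This is a sketch (the function name is mine) that assumes the default Cellar layout and a single installed apache-spark version; the naive string sort would misorder versions like 3.9 vs 3.10.

```python
from pathlib import Path

def zshrc_export_line(cellar="/usr/local/Cellar/apache-spark"):
    """Build the export line for ~/.zshrc from the version directory
    Homebrew created under its Cellar."""
    versions = sorted(p.name for p in Path(cellar).iterdir() if p.is_dir())
    latest = versions[-1]  # e.g. "3.1.1"
    return f'export SPARK_HOME="{cellar}/{latest}/libexec"'
```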

[Warning]  Warning

Homebrew will update Spark the moment a new version is available. When you install a new package, watch for a "rogue" upgrade of apache-spark and change the SPARK_HOME version number as needed. During the writing of this book, it happened to me a few times!

B.3.3 Configure Spark to work seamlessly with Python

Jl dxy vts usngi Syctv 3.0+ rjwg Iszk 11+, gvy knkb vr nupit voma idtaiondla ungrfioiacnto er elsslameys weet wjpr Lhonyt. Xe qx ak, wx pokn er arctee z spark-defaults.conf kflj under vru $SPARK_HOME/conf etcyirrod. Mnuk giaecrnh pjar idteoycrr, ehtre ldouhs uv c spark-defaults.conf.template flkj aadrley etreh, gonla wjrq mvxc ohetr slfei. Woks z kuhz el spark-defaults.conf.template, nnmiga jr spark-defaults.conf. Jednis gjzr jflk, cldunei krb nolgfilwo.

spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"

This will prevent the pesky java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error that happens when you try to pass data between Spark and Python (chapter 8 onwards).

B.3.4 Install Anaconda/Python

The easiest way to get Python 3 is to use the Anaconda Distribution. Go to www.anaconda.com/distribution and follow the installation instructions, making sure you're getting the 64-bit Graphical installer for Python 3.X for your OS.

$ conda create -n pyspark python=3.8 pandas ipython pyspark=3.1.1

If it's your first time using Anaconda, follow the instructions to register your shell.

[Warning]  Warning

Python 3.8 is supported only using Spark 3.0+. If you use Spark 2.4.X or before, be sure to specify Python 3.7 in your environment creation.

Then, to select your newly created environment, just input conda activate pyspark in the terminal.

B.3.5 Launching an IPython REPL and starting PySpark

Homebrew should have set the SPARK_HOME and PATH environment variables, so your Python shell (also called REPL, or read-eval-print loop) will have access to a local instance of pyspark. You just have to type the following.

conda activate pyspark #1
ipython

Then, within the REPL, you can import pyspark and get rolling.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

B.3.6 (Optional) Install and run Jupyter to use Jupyter notebook

Since we have configured PySpark to be discovered from a regular Python process, we don't have any further configuration to do to use it with a notebook. In your terminal window, install Jupyter using the following command.

conda install -c conda-forge notebook


You can now run a Jupyter notebook server using the following command. Use cd to move to the directory where your source code is before doing so.

cd [WORKING DIRECTORY]
jupyter notebook

Start a Python kernel, and get started the same way you would using IPython.

[Note]  Note

Some alternate installation instructions will create a separate environment for Python programs and PySpark programs, which is where you might see more than one kernel option. Using this set of instructions, use the Python 3 kernel.


B.4 GNU/Linux and WSL

B.4.1 Install Java

[Warning]  Warning

Because of compatibility with third-party libraries, I recommend staying on Java 8. Spark 3.0+ works using Java 11+ as well, but some libraries might trail behind.

Most GNU/Linux distributions provide a package manager. OpenJDK version 8 is available through the software repository.

sudo apt-get install openjdk-8-jre

B.4.2 Installing Spark

Go on the Apache website and download the latest Spark release. You shouldn't have to change the default options, but Figure B.1 displays the ones I see when I navigate to the download page. Make sure to download the signatures and checksums if you want to validate the download (step 4 on the page).

If you want to know more about using the command line on Linux (and macOS) proficiently, a good free reference is The Linux Command Line by William Shotts[19]. It is also available as a paper or e-book (No Starch Press, 2019).

[Tip]  Tip

On WSL (and sometimes Linux), you don't have a graphical user interface readily available. The easiest way to download Spark is to go on the website, follow the link, copy the link of the nearest mirror, and pass it along to the wget command.

wget [YOUR_PASTED_DOWNLOAD_URL]

Once you have downloaded the file, unzip the file. If you are using the command line, the following command will do the trick. Make sure you're replacing the spark-[…].gz by the name of the file you just downloaded.

tar xvzf spark-[...].gz

This will unzip the content of the archive into a directory. You can now rename and move the directory to your liking. I usually put it under /home/[MY-USER-NAME]/bin/spark-3.1.1/ (and rename if the name is not identical), and the instructions will use that directory.
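If you prefer to stay in Python, the same extraction can be sketched with the standard library's tarfile module (the function name is mine):

```python
import tarfile
from pathlib import Path

def extract_spark(archive, dest):
    """Extract a Spark .tgz archive into dest (the rough equivalent of
    `tar xzf`), returning the top-level directory it creates."""
    with tarfile.open(archive, "r:gz") as tgz:
        # The archive's first member is the spark-X.Y.Z top directory.
        top = tgz.getnames()[0].split("/")[0]
        tgz.extractall(dest)
    return Path(dest) / top
```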

Set the following environment variables.

echo 'export SPARK_HOME="$HOME/bin/spark-3.1.1"' >> ~/.bashrc

B.4.3 Configure Spark to work seamlessly with Python

If you are using Spark 3.0+ with Java 11+, you need to input some additional configuration for it to work seamlessly with Python. To do so, we need to create a spark-defaults.conf file under the $SPARK_HOME/conf directory. When reaching this directory, there should be a spark-defaults.conf.template file already there, along with some other files. Make a copy of spark-defaults.conf.template, naming it spark-defaults.conf. Inside this file, include the following.

spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"

This will prevent the pesky java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available error that happens when you try to pass data between Spark and Python (chapter 8 onwards).

B.4.4 Install Python 3, IPython, and the PySpark package

Python 3 is already provided; you just have to install IPython. Input the following command in a terminal.

sudo apt-get install ipython3
[Tip]  Tip

You can also use Anaconda on GNU/Linux! Follow the instructions in the macOS section.

Following this, install PySpark using pip. This will allow you to import PySpark in a Python REPL.

pip3 install pyspark==X.Y.Z
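A useful rule of thumb (my suggestion, not a hard requirement of the package) is to pin the pip pyspark package to the same major.minor version as the Spark installation it will talk to:

```python
def versions_match(pyspark_version, spark_version):
    """Rule of thumb: the pip pyspark package should share major.minor
    with the installed Spark (e.g. pyspark 3.1.1 with Spark 3.1.x)."""
    def major_minor(version):
        return version.split(".")[:2]
    return major_minor(pyspark_version) == major_minor(spark_version)
```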

B.4.5 Launch PySpark with IPython

Launch an IPython shell.

ipython3

Then, within the REPL, you can import pyspark and get started.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

B.4.6 (Optional) Install and run Jupyter to use Jupyter notebook

Since we have configured PySpark to be discovered from a regular Python process, we don't have any further configuration to do to use it with a notebook. In your terminal, input the following to install Jupyter.

pip3 install notebook

You can now run a Jupyter notebook server using the following command. Use cd to move to the directory where your source code is before doing so.

cd [WORKING DIRECTORY]
jupyter notebook

Start a Python kernel, and get started the same way you would using IPython.

[Note]  Note

Some alternate installation instructions will create a separate environment for Python programs and PySpark programs, which is where you might see more than one kernel option. Using this set of instructions, use the Python 3 kernel.


B.5 PySpark in the cloud

We finish this appendix with a very quick review of the main options for using PySpark in the cloud. There are many options, too many to review, but I decided to limit myself to the three main cloud providers (AWS, Azure, GCP). For completeness, I also added a section on Databricks, since they are the main force behind Spark and provide a great cloud option for managed Spark that spans all three major clouds.

Cloud offering is very much a moving target. During the writing of this book, every provider adjusted their API, sometimes in a significant fashion. Because of this, I preferred to provide the direct links to the relevant articles and knowledge base I used to get Spark running with each provider. With most of them, the documentation evolves quickly but the concepts remain the same. At the core, they all provide Spark access: the differences are in the UI provided for creating, managing, and profiling clusters. I really recommend, once you pick your preferred option, that you read through the documentation to understand some of the idiosyncrasies of a given provider.

[Note]  Note

R fxr le dcoul dsprvroei viordpe cmko llmas EW wjrd Stdco tlk pxq er krar. Aboq ctv usufle jl hpx nzz’r stllian Sotsq llclayo en tvhy amcehin (besaecu lk tvxw timioinlast et reoht). Taxpe oiponts ltx g"nlise-"doen ounw ntreicag tyge lcrstue.

A (small) difference when working with cloud Spark

When working with a Spark cluster, especially on the cloud, I strongly recommend that you install the libraries you wish to use (pandas, scikit-learn, etc.) at cluster creation time. Managing dependencies on a running cluster is annoying at best, and most often you are better off destroying the whole thing and creating a new one.

Each cloud provider will give you instructions on how to create "startup actions" to install libraries. If this is something you end up doing repeatedly, check into opportunities for automation, such as Ansible, Puppet, Terraform, etc. When working on personal projects, I usually just create a simple shell script: most cloud providers provide a CLI interface to interact with their API in a programmatic way.
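A startup action can be as small as a generated shell script that pip-installs your libraries. This sketch (the function name and the pip3 invocation are illustrative; each provider documents its own init-action entry point and Python location) builds such a script:

```python
def startup_script(packages):
    """Generate a minimal init script that installs libraries at cluster
    creation time."""
    lines = ["#!/bin/bash", "set -euo pipefail"]
    lines.extend(f"pip3 install {package}" for package in packages)
    return "\n".join(lines) + "\n"

script = startup_script(["pandas", "scikit-learn"])
```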


B.6 AWS

Amazon offers two products with Spark: EMR (Elastic Map-Reduce) and Glue. While both are pretty different and cater to different needs, I find that Spark is usually more up-to-date on EMR, and the pricing is also better if you are running sporadic use in the context of getting familiar.

EMR provides a complete Hadoop environment with a trove of open-source tools, including Spark. The documentation is available through aws.amazon.com/emr/resources/.

Glue is advertised as a serverless ETL service which includes Spark as part of the tools. Glue extends Spark with some AWS-specific notions such as DynamicFrame and GlueContext which are pretty powerful, but not usable outside of Glue itself. The documentation is available through aws.amazon.com/glue/resources/.


B.7 Azure

Azure provides a managed Spark service through the HDInsight umbrella. The documentation for the product is available through docs.microsoft.com/en-us/azure/hdinsight/. Microsoft really segments the different products usually offered on a Hadoop cluster, so make sure you follow the Spark instructions. With Azure, I usually prefer using the UI: the instructions on docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql-use-portal are very easy to follow, and for exploring large-scale data processing, Azure will give you an hourly pricing as you build your cluster.

Azure also offers single-node Spark through its Data Science Virtual Machine for Linux (documentation available through docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro). This is a lower-cost option to use if you don't want to bother with setting up an environment.

B.8 GCP

Dgeool ersfof mganade Styvs gothurh Uogloe Utpoaacr. Ygv odonuentacmti jz lelbaavia orthguh cloud.google.com/dataproc/docs. J’ko xgzb DAL Ntoacrpa ltv aerm le qxr se"dcal-x"br aemsxlpe nj yxr vpxk as J hjnl yvr dmcamno njfo stileiiut er qv thov vcad er lrena hnc rqx atonnmtdciuoe vr wtok wxff rbjw om.

The easiest way to get up and running when learning Spark with Google Dataproc is to use the option for "Single node clusters" Google offers. The documentation for single-node clusters is a little hard to find: it is available at cloud.google.com/dataproc/docs/concepts/configuring-clusters/single-node-clusters.


B.9 Databricks

[Tip]  Tip

If you just want to get started on Databricks with a minimum amount of cash, check out the Databricks community edition at community.cloud.databricks.com/login.html. This provides a small cluster for you to get started with no installation up front. This section covers using a full blown (paid) Databricks instance for when you need more power.

Databricks is a company founded in 2013 by the creators of Apache Spark. Since then, they have grown a complete ecosystem around Spark, which spans data warehousing (Delta Lake), a solution for MLOps (MLFlow), even a secure data interchange functionality (Delta Sharing).

Databricks anchors its Spark distribution around the Databricks Runtime, which is a cohesive set of libraries (Python, Java, Scala, R) tied to a specific Spark version. Their runtimes are available in a few flavors:

  • Databricks runtime is the standard option which features a complete ecosystem for running Spark on Databricks (docs.databricks.com/runtime/dbr.html)
  • Databricks runtime for machine learning provides a curated set of popular ML libraries (such as TensorFlow, PyTorch, Keras, and XGBoost) on top of the standard option. This runtime ensures you have a cohesive set of ML libraries that play well with one another. (docs.databricks.com/runtime/mlruntime.html)
  • Photon is a new, faster but feature-incomplete reimplementation of the Spark query engine in C++. It is already becoming an enticing option because of its increased performance. (docs.databricks.com/runtime/photon.html)

Databricks prices their services based on DBU (Databricks Unit), which is analog to a "standard compute node for one hour". The more powerful the cluster (either by having more nodes, or making them more powerful), the more DBUs you consume, and the more expensive it gets. You also need to factor in the price of the underlying cloud resources (VM, storage, network, etc.). This can make the pricing quite opaque; I usually use a pricing estimator (both Databricks' and the cloud provider's) to get a sense of the hourly cost.
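The arithmetic behind such an estimate is simple enough to sketch. The rates below are illustrative placeholders, not real prices; only the 0.87 DBU per node figure comes from the cluster in figure B.7.

```python
def hourly_cost(nodes, dbu_per_node, dbu_price, vm_price):
    """Rough hourly cluster cost: DBU charges are billed on top of the
    underlying VM charges. Check Databricks' and your cloud provider's
    price pages for real rates."""
    return nodes * (dbu_per_node * dbu_price + vm_price)

# Example: three nodes at 0.87 DBU each (as in figure B.7), with a
# hypothetical $0.22/DBU rate and $0.35/hour VMs:
estimate = hourly_cost(3, 0.87, 0.22, 0.35)  # dollars per hour
```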

[Note]  Note

Check out each cloud provider's page for a price per DBU. They are not consistent across cloud providers.

For the rest of this appendix, I'll walk through the main steps to set up, use, and destroy a workspace in Databricks. I use Google Cloud Platform, but the general steps apply to Azure and AWS as well. You won't find here a complete guide on administering Databricks: this should, on the other hand, provide you with a working environment to run and scale the book's examples.

[Warning]  Warning

Using Databricks costs money the moment you create a workspace. Using a powerful cluster will cost a lot of money. Be sure to shut down your clusters and your workspace once you're done!

To start with Databricks, we have to enable the service and create a workspace. For this, search for "Databricks" in the search bar and activate the trial. Carefully read the terms and conditions, as well as the permissions required to use the service. Once you are done, click on the "Manage on Provider" button and sign in with your GCP account. You will reach a screen like figure B.3, containing an empty list and a "Create workspace" button.

Figure B.3. The landing page for the Databricks Workspace (here using GCP).

In order to start using Databricks, we need to create a workspace, which serves as an umbrella for clusters, notebooks, pipelines, and so on. Organizations typically use workspaces as logical separations (by team, project, environment, etc.). In our case, we just need one: in figure B.4, we see the simple form to create a new workspace. If you don't know your "Google cloud project ID", head to the landing page for the GCP console and check the top left box: mine is focus-archway-214221.

Figure B.4. Creating a workspace that will hold our data, notebooks, and clusters

Once the workspace is created, Databricks will provide a page with a URL to reach the workbench: check the right section of figure B.5 for a URL ending in gcp.databricks.com. On the top right of this page, pay attention to the dropdown Configure menu. We will use it to destroy the workspace once done.

Figure B.5. Our new workspace created and ready for action. Click the unique URL on the right to access the workbench.

The workbench is really where we start working with Databricks: if you work in a corporate environment, you most probably have your workspace(s) configured for you. You start using Databricks through the screen displayed in figure B.6. For this simple example, we limit ourselves to the Spark-centric functionalities of Databricks: notebooks/code, clusters, and data. As discussed at the top of the section, Databricks contains a complete ecosystem for ML experiments, data management, data sharing, data exploration/business intelligence, and version control/library management. As you get familiar with the general workflow, check the documentation for those additional components.

Figure B.6. The landing page of our workspace workbench. From this landing page, we can create, access, and manage clusters, as well as run jobs and notebooks.

Time to start a cluster. Click on "New Cluster" (or the "Clusters" menu on the sidebar) and fill the instructions for your cluster configuration. The menu, displayed in figure B.7, is pretty self-explanatory. If you are working with small data sets, I recommend the "single node" Cluster Mode option, which will emulate the setup on your local machine (driver + worker on the same machine). If you want to experiment with a larger cluster, set the min/max workers to appropriate values. Databricks will start with the min value, scaling automatically up to max as needed.

[Note]  Note

By default, GCP has pretty strict usage quotas. When I started using Databricks, I had to request two additional quota increases so that I could launch a cluster. I asked for SSD_TOTAL_GB to be set to 10000 (10,000 GB of SSD usable) and CPUS for the relevant region (us-east4, check figure B.4) to 100 (100 CPUs addressable on my account). If you run into issues where the cluster gets destroyed upon creation, check the logs; chances are that you've busted your quota.

For most use-cases, the default configuration (n1-highmem-4, with 26GB of RAM and 4 cores) is plenty. If necessary, for instance when performing a lot of joins, you can beef up the machines to something more powerful. For GCP, I found that high-memory machines provide the sweetest spot performance-wise. Remember that DBU costs are on top of the VM costs GCP will charge you.

Figure B.7. Creating a small cluster with 1 to 2 worker nodes, each containing 26GB of RAM and 4 cores. Each node costs 0.87 DBU.

While the cluster is creating (it will take a few minutes), let's upload the data for running our program. I picked the Gutenberg books, but any data follows the same process. Click "Create Table" on the workbench landing page and choose "Upload File", dragging and dropping the files you want to upload. Pay attention to the "DBFS Target Directory" (here /FileStore/tables/gutenberg_books), which we need to reference when reading the data in PySpark.

Figure B.8. Upload data (here, the Gutenberg books from chapters 2 and 3) in DBFS (Databricks File System)

Once the cluster is operational and the data is in DBFS, we can create a notebook to start coding. Click "Create Notebook" on the workbench landing page, selecting the name of the cluster which your notebook will be attached to (like in figure B.9).

Figure B.9. Creating a notebook on SmallCluster to run our analysis

Upon creation, you'll see a window like figure B.10: Databricks notebooks look like Jupyter notebooks, with a different styling and a few additional features. Each cell can either contain Python or SQL code, as well as Markdown text that will be rendered. When executing a cell, Databricks will provide a progress bar during the execution, and will give information about how much time each cell took.

Figure B.10. Our notebook, operational and ready to rumble! Databricks notebooks look like Jupyter notebooks, with a few Spark-specific additions.

It’s worth mentioning two things when working in Databricks.

  1. If you upload the data in DBFS, you can access it through dbfs:/[DATA-LOCATION]. In figure B.10, I take the value directly from the location set in figure B.8 when uploading data. Unlike when referring to a URL (such as www.manning.com), here, we have only one forward slash.
  2. Databricks provides a handy display() function that replaces the show() method. display() shows by default 1,000 rows in a table format that you can scroll. At the bottom of the third cell in figure B.10, you can also see buttons to create a chart or download the data in multiple formats. You can also use display() to show visualizations using popular libraries: see docs.databricks.com/notebooks/visualizations/index.html#visualizations-by-language for more information.
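The single-slash convention for DBFS paths trips people up often enough that it is worth encoding in a tiny helper. This is a sketch; the function name and the sample file name are mine.

```python
def dbfs_uri(target_dir, filename):
    """Build a dbfs:/ URI from the DBFS target directory and a file name.
    Unlike a web URL (https://...), the scheme takes a single forward
    slash."""
    return "dbfs:/" + target_dir.strip("/") + "/" + filename

# Hypothetical file name under the upload directory of figure B.8:
uri = dbfs_uri("/FileStore/tables/gutenberg_books", "11-0.txt")
```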
[Tip]  Tip

If you want even more control over what to display, you can display HTML code using the displayHTML() function. See docs.databricks.com/notebooks/visualizations/html-d3-and-svg.html for more information.

Once you are done with your analysis, you should turn off your cluster by pressing the "stop" button in the cluster page of the workbench. If you are done for a longer period of time (more than a few hours) and are using a personal subscription, I recommend destroying the workspace, as Databricks spins off a few VMs to manage the workspace. You can also go in GCP's "storage" tab and delete the buckets Databricks created for hosting data and cluster metadata if you want to bring your cloud spend to zero dollars.

Databricks provides an attractive (and dare I say, the most popular) way to interact with PySpark in the cloud. Each vendor has a different approach to managed Spark in the cloud, from close-to-the-metal (GCP Dataproc, AWS EMR) to serverless (AWS Glue). From a user perspective, the differences are mostly in how much you are expected to configure your environment (and how expensive it is). Just like Databricks, some bundle additional tools to simplify code and data management or provide optimized code to speed up key operations. Fortunately, Spark is the common denominator between those environments: what you learn in this book should apply regardless of your favorite Spark flavor.
