Textual information is one of the most important forms of data in almost every application. Textual data as well as numeric data can be saved as text files, and reading them requires us to process strings. On a shopping website, for example, we use text to provide product descriptions. Machine learning is trending, and you may have heard about one machine learning specialty: natural language processing, which extracts information from texts. Because of the universal use of strings, text processing is an inevitable step in preparing data in these scenarios. Using our task management app as the context, we need to convert a task’s attributes to textual data so that we can present them at the frontend of our web app. When we obtain data entry at the frontend of our app, we must convert these strings to a proper type, such as an integer, for further processing. In numerous real-life cases like these, we need to process and format strings properly. In this chapter, we tackle some common text processing problems.
In Python, you can format text strings in a variety of ways. One emerging approach is to use an f-string, which allows you to embed expressions inside a string literal. Although you can use other string formatting approaches, an f-string offers a more readable solution; thus, you should use f-strings as the preferred approach when you prepare strings as output.
When you use strings as an output, you often need to deal with nonstring data, such as integers and floats. Suppose that our task management application has the requirement of creating a string output from existing variables:
# existing variables
name = "Homework"
urgency = 5

# desired output: Name: Homework; Urgency Level: 5
In this section, you’ll learn how to use f-strings to interpolate nonstring data and present strings in the desired format. As you’ll discover, f-strings are a more readable solution for formatting strings from existing strings and other types of variables.
The str class handles textual data through its instances, which we refer to as string variables. Besides string variables, textual information often involves data types such as integers and floats. Theoretically, we can convert nonstring data to strings and concatenate them to create the desired textual output, as shown in the next listing.
Listing 2.1 Creating string output using string concatenation
task = "Name: " + name + "; Urgency Level: " + str(urgency)
print(task)
# output: Name: Homework; Urgency Level: 5
There are two potential problems with the code creating the task variable. First, it looks cumbersome and doesn’t read smoothly, as we’re dealing with multiple strings, each of which is enclosed in quotation marks. Second, we must convert urgency from int to str before it can be joined with other strings, further complicating the string concatenation operation.
Formatting strings often involves combining string literals and variables of different types, such as integers and strings. When we integrate variables into an f-string, we can interpolate these variables to convert them to the desired strings automatically. In this section, you’ll see a variety of interpolations involving common data types using f-strings. Let’s see first how we use f-strings to create the output shown in listing 2.1:
task_f = f"Name: {name}; Urgency Level: {urgency}"
assert task == task_f == "Name: Homework; Urgency Level: 5"
In this example, we create the task_f variable by using the f-string approach. The most significant thing is that we use curly braces to enclose variables for interpolation. As f-strings integrate string interpolation, they’re also referred to as interpolated string literals.
We’ve seen that an f-string interpolates string and integer variables. How about other types, such as list and tuple? These types are supported by f-strings, as shown in this code snippet:
# interpolating a list
tasks = ["homework", "laundry"]
assert f"Tasks: {tasks}" == "Tasks: ['homework', 'laundry']"

# interpolating a tuple
task_hwk = ("Homework", "Complete physics work")
assert f"Task: {task_hwk}" == "Task: ('Homework', 'Complete physics work')"

# interpolating a dict
task = {"name": "Laundry", "urgency": 3}
assert f"Task: {task}" == "Task: {'name': 'Laundry', 'urgency': 3}"
We’ve seen how f-strings interpolate variables. As a more general usage, f-strings can also interpolate expressions, which eliminates the need to create intermediate variables. You may access an item in a dict object to create string output, for example, or use the result of calling a function. In these common scenarios, you can plug these expressions into f-strings, as shown in the following code snippet:
# accessing an item in a list
tasks = ["homework", "laundry", "grocery shopping"]
assert f"First Task: {tasks[0]}" == 'First Task: homework'

# calling a method
task_name = "grocery shopping"
assert f"Task Name: {task_name.title()}" == 'Task Name: Grocery Shopping'

# performing a calculation
number = 5
assert f"Square: {number*number}" == 'Square: 25'
These expressions are enclosed within curly braces, allowing f-strings to evaluate them directly to produce the desired string output: {tasks[0]} -> “homework”; {task_name.title()} -> “Grocery Shopping”; {number*number} -> 25.
As a key programming concept, we often encounter the term expression. Some beginners may confuse this term with a related concept, statement. An expression usually is one line of code (it can expand to multiple lines, such as a triple-quoted string) that evaluates to a value or an object, such as a string or a custom class instance. Applying this definition, we can easily figure out that variables are a kind of expression.
By contrast, statements don’t create any value or object, and a statement’s purpose is to complete an action. We use assert, for example, to create an assertion statement, which ensures that something is valid before proceeding. We aren’t trying to produce a True or False Boolean value; we’re checking or asserting a condition. Figure 2.1 illustrates the differences between expressions and statements.
Figure 2.1 Differences between expressions and statements. Expressions represent something and are evaluated to a value or an object, whereas statements execute specific actions and can’t be evaluated to a value.

Although f-strings interpolate expressions natively, we should use this skill with caution because any complicated expressions in an f-string compromise the readability of your code. The following example represents a misuse of an f-string that has a complex expression:
summary_text = f"Your Average Score: {sum([95, 98, 97, 96, 97, 93]) / len([95, 98, 97, 96, 97, 93])}."
A rule of thumb for checking your code’s readability is to determine how much time a reader needs to digest your code. In the preceding code, it may take tens of seconds for a reader to know what you want to achieve. As a direct contrast, consider the following refactored version:
scores = [95, 98, 97, 96, 97, 93]
total_score = sum(scores)
subject_count = len(scores)
average_score = total_score / subject_count
summary_text = f"Your Average Score: {average_score}."
This version has several things to note. First, we use a list object to store the scores to remove the duplication of the data. Second, we use separate steps, with each step representing a simpler calculation. Third, the key thing for improved readability is that each step uses a sensible name to indicate the calculation result. Without any comment, your code is comfortable to read; everything is clear by itself.
The proper formatting of textual data, such as alignment, is key to conveying the desired information. As they are designed to handle string formatting, f-strings allow us to set a format specifier (beginning with a colon) to apply additional formatting configurations to the expression in the curly braces (figure 2.2). In this section, you’ll learn how to apply the specifiers to format f-strings.
Figure 2.2 Components of an f-string. The expression is the first part and is required. The expression is evaluated first, and a corresponding string is created. The second part, which is the format specifier, is optional.

As an optional component, the format specifier defines how the interpolated string of the expression should be formatted. An f-string can accept different kinds of format specifiers. Let’s explore some of the most useful ones next, starting with text alignment.
One way to improve communication efficiency is to use a structured organization, which is also true for presenting textual data. As shown in figure 2.3, scenario B provides clearer information than scenario A due to its more organized structure, with the columns aligned.
Figure 2.3 Improved clarity when the texts are presented in an organized structure (scenario B) compared with the default left alignment (scenario A)

Text alignment in f-strings involves three characters: <, >, and ^, which align the text left, right, and center, respectively. If you’re confused about which is which, remember to focus on the arrow’s tip; if it’s on the left side, for example, the text is left-aligned.
To specify text alignment as the format specifier, we use the syntax f"{expr:x<n}", in which expr means the interpolated expression, x means the padding character (when omitted, it defaults to spaces) for alignment, < means left alignment, and n is an integer that sets the width that the string expands to. Applying this syntax, the code in the next listing shows how to create properly aligned records with improved clarity.
Listing 2.2 Applying format specifiers in f-strings
task_ids = [1, 2, 3]
task_names = ['Do homework', 'Laundry', 'Pay bills']
task_urgencies = [5, 3, 4]

for i in range(3):
    # center each field within a width of 12 characters
    print(f'{task_ids[i]:^12}{task_names[i]:^12}{task_urgencies[i]:^12}')

# Output the following lines:
#      1      Do homework      5
#      2        Laundry        3
#      3       Pay bills       4
One thing that should catch your attention is that you apply the same format specifier for all the expressions, which represents repetition. When you see repetitions in your code, you’re likely violating the DRY (Don’t Repeat Yourself) principle, which is a signal for refactoring.
In listing 2.2, if we have a new text alignment requirement, we must update the code in three locations, which is inconvenient and error-prone. Thus, the objective of refactoring is to have a mechanism to use a variable for the format specifier. Listing 2.3 shows a possible solution that extracts the repetitive part: the format specifier. Taking the refactoring a step further, we define a function to accept the format specifier as a parameter, allowing us to try different format specifiers. To improve readability, we create separate variables for the task’s information.
Listing 2.3 Refactored function to take any format specifier
def create_formatted_records(fmt):
    for i in range(3):
        task_id = task_ids[i]
        name = task_names[i]
        urgency = task_urgencies[i]
        print(f'{task_id:{fmt}}{name:{fmt}}{urgency:{fmt}}')
One important thing to note in listing 2.3 is that the format specifier fmt is enclosed within curly braces, embedded within the outside curly braces. Python knows how to replace {fmt} with the proper format specifier. Let’s try this function with different format specifiers:
>>> create_formatted_records('^15')
       1         Do homework         5
       2           Laundry           3
       3          Pay bills          4
>>> create_formatted_records('^18')
        1            Do homework            5
        2              Laundry              3
        3             Pay bills             4
As you can see, the refactored code allows us to set any format specifier, and this flexibility highlights the benefit of refactoring. When we use format specifiers for text alignment, text forms distinct columns, creating visual boundaries to separate different pieces of information.
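The nested-specifier idea can be taken one small step further. The following is a minimal sketch, not the listing's own code: the function name format_records and its parameters are hypothetical, and it returns the formatted lines instead of printing them so that the result is easier to reuse and test. It also pairs the fields with zip rather than indexing with range(3).

```python
def format_records(ids, names, urgencies, fmt="^12"):
    """Return one formatted line per task, applying fmt to every field.

    Hypothetical helper for illustration; fmt is any format specifier
    that is valid for both integers and strings, such as '^12' or '<12'.
    """
    lines = []
    for task_id, name, urgency in zip(ids, names, urgencies):
        # {fmt} is resolved first, then used as the specifier of the outer field
        lines.append(f"{task_id:{fmt}}{name:{fmt}}{urgency:{fmt}}")
    return lines


rows = format_records([1, 2], ["Do homework", "Laundry"], [5, 3], fmt="<12")
for row in rows:
    print(row)
```

Returning a list keeps the formatting logic separate from the decision about where the text goes (console, file, or web page).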
We have been using spaces as padding for the alignment; we can use other characters as padding too. Our choice of characters depends on whether they make the information stand out. Table 2.1 shows some examples of using different paddings and alignments.
Table 2.1 F-string format specifiers for text alignment (surviving entry: f"{task:*>10}")
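To make the padding and alignment options concrete, here is a small sketch. The variable task and the chosen padding characters and widths are illustrative assumptions, not the table's original rows.

```python
task = "homework"  # 8 characters

left_aligned = f"{task:*<12}"    # text on the left, asterisks fill the right
right_aligned = f"{task:*>12}"   # text on the right, asterisks fill the left
centered = f"{task:-^12}"        # text in the middle, hyphens on both sides
no_padding = f"{task:*>4}"       # width smaller than the text: nothing is cut

print(left_aligned)   # homework****
print(right_aligned)  # ****homework
print(centered)       # --homework--
print(no_padding)     # homework
```

Note that the width is a minimum: a string longer than the width is shown in full rather than truncated.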
Numbers are integral sources of information that we often include in textual material. There are multiple forms of numeric values, such as large integers, floating-point numbers, and percentages. In this section, you’ll learn how f-strings can represent numeric values with proper formatting specifiers to improve their readability.
There is an infinite number of prime numbers. By doing a quick Google search, we can find that the smallest prime number greater than 1 billion is 1000000007. To show this large integer, it’s a good idea to use separators between digits, and a common approach is to use commas every three digits. To apply separators to integers in an f-string, the format specifier is xd, where x is the separator (Python accepts a comma or an underscore) and d is the specific format specifier for integers:
large_prime_number = 1000000007
print(f"Use commas: {large_prime_number:,d}")
# output: Use commas: 1,000,000,007
Floating-point numbers, or decimal numbers in general, can be found in almost any scientific or engineering report. As you probably expect, f-strings have format specifiers that allow us to format decimals in a readable manner. Consider the following examples:
decimal_number = 1.23456
print(f"Two digits: {decimal_number:.2f}")
# output: Two digits: 1.23
print(f"Four digits: {decimal_number:.4f}")
# output: Four digits: 1.2346
As with d for integers, we use f as a format specifier for decimal values. Although the f format specifier can be used alone, it’s more often used to specify how many digits we want to keep after the decimal symbol: .2 to keep two digits, .4 to keep four digits, and so on.
In a similar fashion to using f for decimals, we can use e as the format specifier for scientific notation. Consider the following examples of this feature:
sci_number = 0.00000000412733
print(f"Sci notation: {sci_number:e}")
# output: Sci notation: 4.127330e-09
print(f"Sci notation: {sci_number:.2e}")
# output: Sci notation: 4.13e-09
Another common form of numeric values is percentages, and the format specifier for percentages is the percent sign (%). As we do with the e and f specifiers, we can use the % specifier alone or in conjunction with the precision specification, such as .2 for two-digit precision:
pct_number = 0.179323
print(f"Percentage: {pct_number:%}")
# output: Percentage: 17.932300%
print(f"Percentage two digits: {pct_number:.2%}")
# output: Percentage two digits: 17.93%
In addition to these format specifiers, f-strings support other specifiers. Table 2.2 shows common specifiers that you can apply to f-strings when you deal with numbers.
Table 2.2 Common format specifiers for formatting numbers with f-strings (surviving entry: percentage with two-digit precision)
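The numeric specifiers can also be combined. The sketch below is illustrative only; the variable names and sample values are assumptions chosen to show a thousands separator together with a precision, an underscore separator, zero padding, and a one-digit percentage.

```python
large_number = 1000000007
amount = 1234567.891
ratio = 0.179323

with_commas = f"{large_number:,d}"       # comma as the thousands separator
with_underscores = f"{large_number:_d}"  # underscore as the separator
money = f"{amount:,.2f}"                 # separator plus two-digit precision
zero_padded = f"{3.14159:08.2f}"         # total width 8, padded with zeros
percent = f"{ratio:.1%}"                 # multiply by 100, keep one digit

print(with_commas)       # 1,000,000,007
print(with_underscores)  # 1_000_000_007
print(money)             # 1,234,567.89
print(zero_padded)       # 00003.14
print(percent)           # 17.9%
```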
Although directly interpolating expressions by f-strings makes code cleaner, avoid using complicated expressions in f-strings, which may confuse your readers. Instead, create intermediate variables with sensible names when the expressions are complicated.
Python still supports the conventional C-style and format-based approaches, but there is no real need for you to learn them (you may see them in legacy code, though). Whenever you need to create string output, use f-strings. Don’t forget about aligning your text and formatting numeric values to improve the text output’s clarity.
James works in a wholesale company’s IT department and is preparing a template of price tags. Suppose that the product’s data is saved as a dict object: {"name": "Vacuum", "price": 130.675}. How can James write an f-string if the desired output is Vacuum: {130.68}? Note that the price requires two-digit precision and that the output includes curly braces, which are coincidentally the characters for string interpolation in f-strings.
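One detail that helps with this challenge: inside an f-string, a doubled brace ({{ or }}) produces a literal brace. The following is one possible sketch, not necessarily the intended solution; the variable names product and price_tag are assumptions.

```python
product = {"name": "Vacuum", "price": 130.675}

# {{ and }} emit literal braces; the innermost {...} is the interpolated field
price_tag = f"{product['name']}: {{{product['price']:.2f}}}"
print(price_tag)

# the same pattern with a simpler value
assert f"{{{1.5:.2f}}}" == "{1.50}"
```

Because the dictionary keys are written with single quotes, they don't clash with the double quotes that delimit the f-string.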
Although strings are textual data on their surface, the actual data represented by strings can be integers, dictionaries, and other data types. The built-in input function, for example, is the most basic way to collect users’ input in a Python console:
>>> age = input("Please enter your age: ")
Please enter your age: 35
>>> type(age)    # the input is stored as a string
<class 'str'>
As shown in the preceding code snippet, the user’s input is taken as a string. Suppose that we wanted to check whether the user’s age is over 18. We think we can run the following code:
>>> age > 18
# ERROR: TypeError: '>' not supported between instances of 'str' and 'int'
Unfortunately, the comparison didn’t work because age is a string, and you can’t compare a string with an integer. This example highlights the necessity of converting a string to an integer. More broadly, many other scenarios require that we convert strings to lists, dictionaries, and other applicable data types. Such conversion is essential for subsequent data processing. In this section, you’ll learn how to check the data types represented by the strings and the proper ways to convert strings to the desired data types.
In Python, strings can be anything you can type with your keyboard. One common need is to check whether strings include only alphanumeric characters. In this section, you’ll learn a variety of ways to check the nature of a string’s characters.
Suppose that the task management app requires users to set a username, which must be alphanumeric. We can implement this functionality by using the isalnum method, which examines whether a string contains only a-z, A-Z, and 0-9. Some examples follow:
bad_username0 = "123!@#"
assert bad_username0.isalnum() == False

bad_username1 = "abc..."
assert bad_username1.isalnum() == False

good_username = "1a2b3c"
assert good_username.isalnum() == True
Suppose that when a user creates a task, we require the name to contain letters only. For this feature, we can use the isalpha method, which returns True or False. As you’ve probably noticed, all these is- methods return Boolean values:
assert "Homework".isalpha() == True
assert "Homework123".isalpha() == False
In a similar fashion, you can use the isnumeric method to check whether all characters in the string are numeric characters:
assert "123".isnumeric() == True
assert "a123".isnumeric() == False
Here, I want to discuss a couple of gotchas about checking whether a string represents a numeric value when we use the isnumeric method:
- Strings that represent floats won’t pass the isnumeric check. It would be reasonable to expect that strings with valid numeric values would return True on this method call. Unfortunately, that’s not the case:
assert "3.5".isnumeric() == False
- Strings that represent negative integers won’t pass the isnumeric check. It probably goes against many people’s intuition, too, as in this example:
assert "-2".isnumeric() == False
- Empty strings are evaluated as False with isnumeric. Evaluating empty strings as non-numeric is probably a desired behavior. We should understand this behavior when we deal with conversions from strings to numbers.
To avoid these gotchas, remember that a string produces a True value by means of the isnumeric method only if all the characters in a nonempty string are numeric characters. Please note that numeric characters don’t include the decimal symbol or the negative sign. For this reason, the isnumeric method evaluates floats and negative numbers as False.
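A related point, shown here as a short sketch rather than as part of the original discussion: isnumeric also returns True for some Unicode characters that the int constructor can't convert, so the stricter isdecimal method is a closer match when the goal is to cast afterward. The sample values are illustrative.

```python
half = "\u00bd"      # the vulgar fraction one half
squared = "\u00b2"   # superscript two

checks = {
    "123": ("123".isnumeric(), "123".isdigit(), "123".isdecimal()),
    "half": (half.isnumeric(), half.isdigit(), half.isdecimal()),
    "squared": (squared.isnumeric(), squared.isdigit(), squared.isdecimal()),
}
for label, results in checks.items():
    print(label, results)

# a string can pass isnumeric and still fail to cast
try:
    int(squared)
    cast_ok = True
except ValueError:
    cast_ok = False
print("int() accepted superscript two:", cast_ok)
```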
Besides the discussed is- methods for checking the numeric nature of strings, as a refresher, Python strings have other is- methods that perform other checking tasks, such as islower and isupper. Although I don’t cover these other is- methods in this book, you should be familiar with them.
In the preceding section, you learned to examine whether a string represents a positive integer. But there seems to be no easy way to tell whether a string represents a numeric value, particularly when it’s a floating-point or negative number. Converting strings to numbers is important because we can’t do any numeric calculations with strings, such as comparing age with 18. Thus, in many cases, we must derive the represented numeric values of strings for subsequent processing. In this section, you’ll learn to convert strings to numbers, a process termed casting.
The two common data types for numeric values are float and int. The syntax for creating these instances from strings is float("string") and int("string"). Python evaluates the string objects to cast them to a proper float or int object, if possible.
If you expect a float with a string, you can send it to the built-in float constructor. In the following examples, all the casted numbers are of the float type, even if the string represents an integer:
>>> float("3.25")
3.25
>>> float("-2")    # an integer-like string still yields a float
-2.0
>>> int("-5")
-5
>>> int("123")
123
Note that when these strings have desired numeric values, these casting operations succeed. When they don’t, however, these castings result in errors, which cause your entire program to fail, as shown in the following code snippet:
>>> float("3.5a")
# ERROR: ValueError: could not convert string to float: '3.5a'
>>> int("one")
# ERROR: ValueError: invalid literal for int() with base 10: 'one'
To prevent your program from being terminated due to this error, it is important to use the try...except... statement to handle the exception. Although I’m not expanding the discussion here, the next listing shows such usage. I’ll discuss this feature in chapter 12 (section 12.3).
Listing 2.4 Casting numbers from strings
def cast_number(number_str):
    try:
        casted_number = float(number_str)
    except ValueError:
        # the string doesn't represent a valid number
        print(f"Couldn't cast {repr(number_str)} to a number")
    else:
        print(f"Casting {repr(number_str)} to {casted_number}")

# Use the above function in a console
>>> cast_number("1.5")
Casting '1.5' to 1.5
>>> cast_number("2.3a")
Couldn't cast '2.3a' to a number
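In practice, we usually want the converted value back rather than a printed message. The following is a minimal sketch under that assumption; the function name to_number and its default parameter are hypothetical and not part of listing 2.4. It tries int first so that integer-like strings keep the int type and then falls back to float.

```python
def to_number(number_str, default=None):
    """Return number_str as an int or a float, or default if neither works.

    Hypothetical helper: expects a string argument; passing None or another
    nonstring object may raise a TypeError, which isn't handled here.
    """
    for caster in (int, float):
        try:
            return caster(number_str)
        except ValueError:
            continue  # try the next constructor
    return default


print(to_number("42"))      # 42
print(to_number("-2.5"))    # -2.5
print(to_number("2.3a"))    # None
print(to_number("abc", 0))  # 0
```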
Besides numeric values, our application often has textual data that represents other data types, such as lists and tuples. For example, in a web application, data are commonly entered as text, such as “[1, 2, 3]”, which resembles a list object. Because the data type is str, you can’t apply any list methods to this textual data; that is, you can call list methods only on list objects. In this case, data conversion is required. In this section, you explore how to derive the underlying data, other than numbers, from strings.
In the previous section, you learned to use float and int constructors to cast strings to derive numeric values. The approach of using the constructor with a string object won’t always work, however. Consider the three common data types (list, tuple, and dict), which are represented by strings in the following code snippet:
numbers_list_str = "[1, 2]"
numbers_tuple_str = "(1, 2)"
numbers_dict_str = "{1:'one', 2: 'two'}"
When we attempt to send the strings directly to their respective constructors, unexpected outcomes happen:
>>> list(numbers_list_str)     # each character becomes an item
['[', '1', ',', ' ', '2', ']']
>>> tuple(numbers_tuple_str)   # each character becomes an item
('(', '1', ',', ' ', '2', ')')
>>> dict(numbers_dict_str)
# ERROR: ValueError: dictionary update sequence element #0 has length 1; 2 is required
Although the list and tuple constructors do create a list and a tuple object by treating strings as iterables, the created objects wouldn’t be the data that you would expect to extract from these strings. Specifically, strings are iterables that consist of characters. When you include a string in a list constructor, its characters become items of the created list object. The same operation happens to a tuple constructor.
To solve this unpredicted behavior, use the built-in eval function, which takes a string as though you typed it in the console and returns the evaluated result:
assert eval(numbers_list_str) == [1, 2]
assert eval(numbers_tuple_str) == (1, 2)
assert eval(numbers_dict_str) == {1: 'one', 2: 'two'}
By evaluating these strings, we can retrieve the data that these strings represent. This transformation is useful because we often use texts as the data interchange format. The benefit of using eval is that the evaluation result of the supplied text is guaranteed to be what you expect from running the same text as code in a console.
If your application is concerned with the validity of the data source, I recommend that you parse the strings yourself. If you need to get a list object of integers from a string, for example, you can remove the square brackets and split the strings to recreate the applicable list object. A trivial example follows for your reference. Please note that the code snippet involves a few techniques, such as string splitting and list comprehension, that I cover later (sections 2.3 and 5.2):
list_str = "[1, 2, 3, 4]"
stripped_str = list_str.strip("[]")
number_list = [int(x) for x in stripped_str.split(",")]
print(number_list)
# output: [1, 2, 3, 4]
When we use the float or int constructor to derive the actual numeric values that strings represent, consider using try...except... because successful casting is never guaranteed, and when casting fails, it crashes the program if the exception isn’t handled. When you use eval to obtain the underlying data, you should be cautious, as it can introduce danger to a program if you use untrusted sources. Thus, when data security is a concern, you should consider parsing the data yourself or using a more secure tool, such as the ast module. If you work on your own data, such as a script for processing data, you can just use eval to obtain the underlying data.
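As a sketch of the more cautious route mentioned above, the standard library's ast.literal_eval evaluates only Python literals (numbers, strings, lists, tuples, dicts, sets, Booleans, and None) and rejects anything else, such as function calls. The wrapper name parse_literal is hypothetical, and returning None on failure is an assumption made for brevity; it can't distinguish a failed parse from the literal text "None".

```python
import ast


def parse_literal(text):
    """Return the Python literal that text represents, or None if it isn't one."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # not a literal: malformed text or an expression such as a function call
        return None


print(parse_literal("[1, 2]"))               # [1, 2]
print(parse_literal("(1, 2)"))               # (1, 2)
print(parse_literal("{1:'one', 2: 'two'}"))  # {1: 'one', 2: 'two'}
print(parse_literal("__import__('os')"))     # None, because calls are rejected
```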
At the beginning of this section, you learned that you can use the input function to collect a user’s input. Mary is an elementary school teacher who wants to write a simple toy program for her students. Suppose that she wants to ask the students about today’s temperature in Celsius degrees, using a Python console. How can she write the program so that it meets the following requirements? x represents the value that the user enters:
- When the temperature is < 10 degrees, output You entered x degrees. It's cold!
- When the temperature is between 10 and 25 degrees, output You entered x degrees. It's cool!
- When the temperature is > 25 degrees, output You entered x degrees. It's hot!
- The x value should have one decimal precision. If the user enters 15.75, for example, it should be displayed as 15.8.
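The requirements above can be sketched as a small function. This is one possible approach, not the book's official solution: the name describe_temperature is hypothetical, the message for invalid input is an assumption, and the function takes the entered text as a parameter so that it can be tried without a console. In a console session, it could be called with the result of input.

```python
def describe_temperature(reading):
    """Build the response for a temperature entered as text (Celsius)."""
    try:
        degrees = float(reading)
    except ValueError:
        # assumption: report text that can't be cast instead of crashing
        return f"Couldn't read {reading!r} as a temperature."

    if degrees < 10:
        feeling = "cold"
    elif degrees <= 25:
        feeling = "cool"
    else:
        feeling = "hot"
    # .1f keeps one digit after the decimal symbol
    return f"You entered {degrees:.1f} degrees. It's {feeling}!"


print(describe_temperature("15.75"))
print(describe_temperature("5"))
print(describe_temperature("30"))
```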
Strings are not always in the format that you want them to be. In some cases, individual strings represent discrete pieces of related information, and we need to join them to form a single string. Suppose that a user enters multiple strings, with each representing a fruit that they like. We may join the strings to create a single string to display the user’s likes, as shown here:
# initial input
fruit0 = "apple"
fruit1 = "banana"
fruit2 = "orange"

# desired output
liked_fruits = "apple, banana, orange"
At other times, we need to split strings to create multiple strings. Suppose that a user enters all the countries that they’ve been to as a single string. We want to have a list of these countries, as shown here:
# initial input
visited_countries = "United States, China, France, Canada"

# desired output
countries = ["United States", "China", "France", "Canada"]
These two scenarios are plausible examples of basic string processing jobs that you might encounter in a real-life project. In this section, we explore key functionalities for joining and splitting strings, using realistic examples.
When you join multiple strings, you can use the explicit concatenation operator: the + symbol, which you saw in listing 2.1. When you have multiple string literals, you can join them if they’re separated by whitespaces, such as spaces, tabs, and newline characters. In this section, you’ll see how strings separated by whitespaces can be joined.
Suppose that we have multiple configurations to set a display style for our application. We separate each configuration as a string literal, and these individual configuration settings are joined automatically:
style_settings = "font-size=large, " "font=Arial, " "color=black, " "align=center"
print(style_settings)
# output: font-size=large, font=Arial, color=black, align=center
Automatic concatenation can occur only among string literals, however, and you can’t use this technique with string variables or a mixture of string literals and variables. F-strings also support automatic concatenation. This feature is useful when you construct a long f-string by breaking distinct string literals into separate lines of code for clarity:
settings = {"font_size": "large", "font": "Arial", "color": "black", "align": "center"}
# adjacent f-string literals are joined automatically
styles = f"font-size={settings['font_size']}, " \
         f"font={settings['font']}, " \
         f"color={settings['color']}, " \
         f"align={settings['align']}"
Joining strings separated by spaces can be a little confusing because the boundaries (spaces) between string literals don’t make it easy for us to eyeball the individual strings. Moreover, it can occur only between string literals, which is an additional restriction. As a general scenario, joining strings with any delimiters is ideal. In this section, you’ll learn to join strings with any applicable delimiter.
Still, consider the style setting example. We can use the join method to concatenate these separate strings:
style_settings = ["font-size=large", "font=Arial", "color=black", "align=center"]
merged_style = ", ".join(style_settings)
print(merged_style)
# output: font-size=large, font=Arial, color=black, align=center
The join method takes a list of strings as its argument. The items of the list are joined sequentially with the delimiter string that we use to call the method. Although we use a list object here, more broadly speaking, it can be any iterable, such as tuple or set.
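One practical consequence of join taking an iterable of strings is that nonstring items must be converted first; otherwise, Python raises a TypeError. A short sketch follows, with illustrative variable names.

```python
urgencies = [5, 3, 4]

# join accepts any iterable of strings, including a generator expression
joined = ", ".join(str(urgency) for urgency in urgencies)
print(joined)  # 5, 3, 4

# passing the integers directly fails
try:
    ", ".join(urgencies)
    join_failed = False
except TypeError:
    join_failed = True
print("joining integers raised TypeError:", join_failed)

# a tuple works the same way as a list
assert " | ".join(("a", "b", "c")) == "a | b | c"
```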
Compared with the direct concatenation, join is more readable, as contributing strings are separate items; thus, it’s easy for us to know what is to be joined. More importantly, join has an extra advantage: we can manipulate the items dynamically in the list object.
Suppose that we want to have a string to list the tasks that we want to complete for the week in our task management application. To begin, we have the following tasks. We can join these strings to generate a string as a note to display on our desktop:
tasks = ["Homework", "Grocery", "Laundry", "Museum Trip", "Buy Furniture"]
note = ", ".join(tasks)
print("Remaining Tasks:", note)
# output: Remaining Tasks: Homework, Grocery, Laundry, Museum Trip, Buy Furniture

# after completing two tasks, remove them and regenerate the note
tasks.remove("Buy Furniture")
tasks.remove("Homework")
print("Remaining Tasks:", ", ".join(tasks))
# output: Remaining Tasks: Grocery, Laundry, Museum Trip
This example shows a use case with a list of strings that is subject to dynamic changes. When we have additional tasks, we can add the tasks to the list object and regenerate the desired string with the join method to create an updated string.
We often use text files to save and transfer data. We can save tabulated data to a text file, for example, with each line representing a record. When we read the text file, each row is a single string containing multiple substrings, and each substring represents a value for the record. To process the data, we need to extract these values by splitting strings to obtain separate substrings. This section covers topics related to string splitting.
Suppose that we have a text file named "task_data.txt" that stores some tasks. Each row represents a task’s information, including task ID number, name, and urgency level, as shown in the following code snippet. Because you’re going to learn how to read data from a file in chapter 11, assume that you’ve read the text data and saved it as a multiline string, using triple quotes:
task_data = """1001,Homework,5
1002,Laundry,3
1003,Grocery,4"""
To process this string, we can use the split method, which can locate the specified delimiters and separate the string accordingly. The next listing shows a possible solution.
Listing 2.5 Processing text data by splitting strings
processed_tasks = []
for data_line in task_data.split("\n"):
    # split each line on commas to get the individual values
    processed_task = data_line.split(",")
    processed_tasks.append(processed_task)

print(processed_tasks)
# output the following line:
[['1001', 'Homework', '5'], ['1002', 'Laundry', '3'], ['1003', 'Grocery', '4']]
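Splitting yields strings only, so a common next step is to cast the numeric fields and give each value a name. The sketch below is an illustrative extension, not part of listing 2.5: the dictionary keys are assumptions, and it uses the splitlines method in place of split("\n").

```python
task_data = """1001,Homework,5
1002,Laundry,3
1003,Grocery,4"""

tasks = []
for line in task_data.splitlines():
    # unpack the three comma-separated values of each record
    task_id, name, urgency = line.split(",")
    tasks.append({"id": int(task_id), "name": name, "urgency": int(urgency)})

print(tasks[0])  # {'id': 1001, 'name': 'Homework', 'urgency': 5}
```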
One limitation of the split method is that it allows us to specify only one separator, which can be a problem when strings are separated with different separators. Suppose that we have a text file that mixes the use of commas and underscores as separators. For simplicity, only one separator exists between words. For demonstration purposes, consider a single line of data: messy_data = "process,messy_data_mixed,separators".
The problem is likely to occur in real life when we deal with uncleaned raw data. When we encounter this problem, we must think about a programmatic way to solve it, because chances are that the text file has tons of records. Apparently, using the split method alone on these records won’t work, as we can set only one kind of separator. Thus, we must consider alternative solutions:
separated_words0 = []
for word in messy_data.split(","):
    if word.find("_") < 0:    # no underscore in this piece
        separated_words0.append(word)
    else:
        separated_words0.extend(word.split("_"))    # split further on underscores
- Consolidate the separators. Because we know that there are only two possible separators, we can convert one separator to the other, which allows us to call the split method just once to complete the needed operation:
consolidated = messy_data.replace(",", "_")    # convert every comma to an underscore
separated_words1 = consolidated.split("_")
The preceding two solutions are straightforward. If you know the basic operations with strings and lists, they are perfect solutions if performance isn’t a concern, because they require multiple passes to examine the separators, particularly when you must deal with multiple separators. In that case, the operations are more expensive in terms of computation.
Is there any more performant solution? The answer is yes. Regular expressions are designed to handle this more complicated pattern matching and searching, as I discuss in sections 2.4 and 2.5.
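The consolidation idea extends to more than two separators at the cost of one extra pass per separator. The following is a minimal sketch rather than a recommended implementation; the helper name split_on_any is hypothetical and is introduced only for illustration.

```python
def split_on_any(text, separators):
    """Split text on any of the given separator strings.

    Hypothetical helper: converts every separator to the first one,
    then calls str.split a single time.
    """
    target = separators[0]
    for sep in separators[1:]:
        # Each replace call is one more pass over the string.
        text = text.replace(sep, target)
    return text.split(target)


messy_data = "process,messy_data_mixed,separators"
words = split_on_any(messy_data, [",", "_"])
print(words)
```

Every additional separator adds another replace call, which is the multiple-pass cost described above.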
CONCEPT
Choosing string concatenation, f-strings, or join should be evaluated on a case-by-case basis. The key is making your code readable. When you have a small number of strings to join, you can use concatenation operators to join them. When you have more strings, you should consider using f-strings first to bring related strings together. The join method is particularly useful for joining individual strings when these strings are saved in an iterable.
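As a quick side-by-side of the three options, the sketch below reuses the name and urgency variables from earlier in the chapter; the shortened tasks list is an assumption made for brevity.

```python
name = "Homework"
urgency = 5
tasks = ["Homework", "Grocery", "Laundry"]

# Concatenation: fine for a few pieces, but non-strings need str().
by_concat = "Name: " + name + "; Urgency Level: " + str(urgency)

# f-string: the same output, with the conversion handled for us.
by_fstring = f"Name: {name}; Urgency Level: {urgency}"

# join: the natural choice when the strings already live in an iterable.
by_join = ", ".join(tasks)

print(by_concat)
print(by_fstring)
print(by_join)
```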
Besides split, strings have another method: rsplit, which has a similar functionality to split. The only difference appears when you set the maxsplit parameter, which caps the number of splits: split starts cutting from the left end of the string, whereas rsplit starts from the right. Section 2.3.5 explores split and rsplit further.
The split and rsplit methods have the following calling signature. Both methods take an argument to specify the separator and another to specify the maximum number of splits to perform. Can you write a few strings to split to make them behave the same way and differently?
str.split(separator, maxsplit)
str.rsplit(separator, maxsplit)
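One possible way to explore the two signatures is shown below. The record string is made up for illustration; the point is only that the two methods agree when maxsplit is omitted and differ when it is set.

```python
record = "1001,Homework,5,due,today"

# Without maxsplit, both methods return the same list.
assert record.split(",") == record.rsplit(",")

# With maxsplit=1, split cuts at the first comma counted from the left...
left = record.split(",", 1)
# ...whereas rsplit cuts at the first comma counted from the right.
right = record.rsplit(",", 1)

print(left)   # ['1001', 'Homework,5,due,today']
print(right)  # ['1001,Homework,5,due', 'today']
```

Note that maxsplit=1 produces at most two items, because one cut yields two pieces.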
Hint
Python’s str class has useful methods, such as find and rfind, for searching substrings. Many scenarios go beyond what these basic methods can address, however, particularly when it comes to complex pattern matching. In these cases, we should consider using regular expressions. In the previous section, I mentioned that you can use regular expressions to split a string containing multiple kinds of separators, a use case that isn’t easy to address with pure str-based methods. Here’s a look at the solution using regular expressions:
import re

regex = re.compile(r"[,_]")    # character set: a comma or an underscore
separated_words2 = regex.split(messy_data)
From the performance perspective, we traverse the string only one time to complete the split. When there are more separators, regular expressions perform much better than the other two solutions (section 2.3.3), which require multiple traverses of the string. Because of its flexibility and performance, the regular-expressions approach is the irreplaceable technique for conducting advanced string processing. In this section, I use string searching as the teaching topic to explain the mechanisms of regular expressions.
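As a sanity check, the following sketch runs all three approaches on the same messy_data line and confirms that they agree. It assumes, as the text does, that exactly one separator sits between words; it makes no measurement of speed.

```python
import re

messy_data = "process,messy_data_mixed,separators"

# Approach 1: split on commas, then split each piece on underscores.
two_step = []
for chunk in messy_data.split(","):
    two_step.extend(chunk.split("_"))

# Approach 2: consolidate the separators, then split once.
consolidated = messy_data.replace(",", "_").split("_")

# Approach 3: one regular expression that matches either separator.
with_regex = re.split(r"[,_]", messy_data)

print(with_regex)
```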
TRIVIA
Regular expressions are considered to be independent entities, and all common programming languages support regular expressions despite some variations in terms of the syntax. Regular expressions are similar, however, and you can think of different programming languages as having their own dialects for them.
To learn regular expressions, you’ll start with getting the big picture: the pertinent module and its core syntax. This section provides a 10,000-foot overview of regular expressions in Python.
Python’s standard library includes the re module, which provides features related to regular expressions. There are two ways to use this module. The first approach pertains to the object-oriented programming (OOP) aspect of Python. Applying the OOP paradigm to regular expressions (figure 2.4), we carry out our operations with a focus on Pattern objects. In this approach, we first create a Pattern object by compiling the desired string pattern. Next, we use this Pattern object to search for the occurrences that match the pattern.
CONCEPT
Figure 2.4 Applying the general OOP in pattern matching. In a general OOP approach, we first determine the proper class for the task. In this case, we use the Pattern class in the re module. The second step is creating the instance object. In the OOP paradigm, an object consists of attributes, which are accessible via dot notations, and methods, which are callable via parentheses. The third step is using the created Pattern object, such as by accessing its attributes or calling the methods.

The following code snippet shows how to apply the OOP paradigm to use regular expressions for pattern searching:
import re

regex = re.compile("do")    # create the Pattern object
regex.pattern    # access an attribute
regex.search("do homework")    # call a method
regex.findall("don't do that")    # call a method
The other style adopts a functional approach. Instead of creating a Pattern object, we call the functions directly in the module. In the function call, we specify the pattern as well as the string against which the pattern is tested:
import re

re.search("pattern", "the string to be searched")
re.findall("pattern", "the string to be searched")
Behind the scenes, when we call re.search, Python creates the Pattern object for us and calls the search method on the pattern. Thus, using the module to call these functions is a convenient way to use regular expressions. You should be aware of a difference, however: when you use the compile function to create a Pattern object, the compiled pattern is cached in such a way that it’s more efficient to use the pattern multiple times because there is no need to compile the pattern a second time.
CONCEPT
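By contrast, the functional approach takes the pattern string on every call and has to look it up, compiling it if necessary, so it doesn’t get the full benefit of reusing an already compiled Pattern object. (The re module does keep an internal cache of recently used patterns, which keeps this overhead small.) Thus, if you use the pattern once, you don’t need to worry about the difference between these two approaches.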
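The two styles return the same results; they differ only in where the pattern is prepared. A minimal sketch, using a made-up list of lines:

```python
import re

lines = ["do homework", "laundry", "don't do that"]

# Object-oriented style: compile once, then reuse the Pattern object.
do_pattern = re.compile(r"do")
compiled_hits = [len(do_pattern.findall(line)) for line in lines]

# Functional style: pass the pattern string on every call.
functional_hits = [len(re.findall(r"do", line)) for line in lines]

print(compiled_hits)
print(functional_hits)
```

When the same pattern is applied inside a loop like this one, compiling it once up front also gives the pattern a descriptive name, which helps readability.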
The key manifestation of the power of regular expressions is the conciseness of a pattern to match a wide range of possibilities. To create a pattern, we often need to use raw strings, such as a string literal with the prefix r, as in r"pattern". In this section, you’ll see why it’s necessary to use raw strings to build a regular-expression pattern.
In regular expressions, we use \d to match any digit and \w to denote a Unicode word character. These are examples of special characters in regular expressions, and we use backslashes as the prefix to indicate that these characters have special meanings beyond what they appear to be. Notably, Python strings also use backslashes to denote special characters, such as \t for tab, \n for newline, and \\ for backslash.
When these coincidences are combined, we end up using weird-looking patterns. Suppose that we want to search for \task in strings. Notably, \t is a literal here; it really means a backslash and a letter t, but not the tab character. We must use \\task so Python can search for \task. Making things even more complicated, when we create such a pattern, both backslashes must be escaped, which leads to four backslashes (\\\\task) to search for \task in strings. Sounds confusing? Examine the following code:
task_pattern = re.compile("\\\\task")
texts = ["\task", "\\task", "\\\task", "\\\\task"]
for text in texts:
    print(f"Match {text!r}: {task_pattern.match(text)}")

# output the following lines:
Match '\task': None
Match '\\task': <re.Match object; span=(0, 5), match='\\task'>
Match '\\\task': None
Match '\\\\task': None
As match searches a string at the beginning, our pattern can match only "\\task". This behavior is expected; the two consecutive backslashes are interpreted as a literal backslash, which makes the string effectively "\task", matching the pattern that we want to search for.
Apparently, using so many backslashes is confusing. To address this problem, we should use raw-string notation, in which Python doesn’t process any backslashes. As in f-string notation, we use r instead of f as the prefix to convert a regular string literal to a raw string. Applying raw strings to the pattern, we get the following solution:
task_pattern_r = re.compile(r"\\task")
texts = ["\task", "\\task", "\\\task", "\\\\task"]
for text in texts:
    print(f"Match {text!r}: {task_pattern_r.match(text)}")

# output the following lines:
Match '\task': None
Match '\\task': <re.Match object; span=(0, 5), match='\\task'>
Match '\\\task': None
Match '\\\\task': None
As you can tell, the raw string defines a cleaner pattern than the regular string literal, with which we had to use four consecutive backslashes. As you can imagine, when you build a more complex pattern, you need more backslashes to denote special characters. Without raw strings, your patterns will look like puzzles. Thus, it’s always a good practice to use raw strings to create regular-expression patterns.
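To see that the two notations produce the very same pattern string, compare them directly. This is a small sketch; the variable names regular and raw are chosen only for illustration.

```python
import re

regular = "\\\\task"   # four backslashes typed, two stored in the string
raw = r"\\task"        # two backslashes typed, two stored in the string

print(regular == raw)  # True
print(len(raw))        # 6

# Either spelling compiles to a pattern that matches a literal \task.
pattern = re.compile(raw)
matched = pattern.match("\\task") is not None
print(matched)
```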
Readability
The syntax of regular expressions confuses most programmers. As mentioned at the beginning of section 2.4, regular expressions constitute a separate language with its own unique syntax. The good news is that Python adopts regular expressions’ syntax in general. In this section, I go over the essential components of a pattern.
When you work with strings, you may want to know whether a string begins or ends with a particular pattern. These use cases are concerned with the boundaries of the strings, and we refer to them as boundary anchors, including the beginning and the end of a string, as illustrated in the following code:
^hi          starts with hi
task$        ends with task
^hi task$    starts and ends with "hi task", and thus exact matching
The ^ symbol signifies that the pattern is concerned about the start of the string, whereas the $ symbol signifies that the pattern is concerned about the end of the string. The following code snippet shows some examples of these anchors:
re.search(r"^hi", "hi Python")
# output: <re.Match object; span=(0, 2), match='hi'>

re.search(r"task$", "do the task")
# output: <re.Match object; span=(7, 11), match='task'>

re.search(r"^hi task$", "hi task")
# output: <re.Match object; span=(0, 7), match='hi task'>

re.search(r"^hi task$", "hi Python task")
# output: None (omitted output in an interactive console)
You may know that there are startswith and endswith methods in the str class, which work in simple cases. But when you have a more complex need, such as searching for a string that starts with one or more instances of h followed by i, it’s impossible to use startswith because you must account for hi, hhi, hhhi, and more. In such a scenario, regular expressions become very handy.
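The scenario just described can be sketched with the ^ anchor plus the + quantifier, which the next section covers. The candidate strings here are made up for illustration.

```python
import re

candidates = ["hi there", "hhi there", "hhhi there", "oh hi", "h there"]

# ^ anchors at the start, h+ means one or more h, followed by a single i.
starts_with_hs_then_i = re.compile(r"^h+i")

results = []
for text in candidates:
    found = starts_with_hs_then_i.search(text) is not None
    results.append(found)
    print(f"{text!r}: regex={found}, startswith={text.startswith('hi')}")
```

A single pattern covers hi, hhi, hhhi, and so on, whereas startswith("hi") recognizes only the first of them.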
Maintainability
In the previous section, I brought up the question of searching for a variable number of characters, which requires creating a pattern that accounts for the quantity. Regular expressions address this problem by supporting the quantifiers category. This category includes several special characters:
hi?        h followed by zero or one i
hi*        h followed by zero or more i
hi+        h followed by one or more i
hi{3}      h followed by iii
hi{1,3}    h followed by i, ii, or iii
hi{2,}     h followed by 2 or more i
As you can see, there are four general quantifiers: ? for 0 or 1, * for 0 or more, + for 1 or more, and {} for a range. One important thing to note: searching a string with the patterns using ?, *, and + is greedy, which means that the pattern matches the longest sequence whenever possible. To modify this default behavior, we can append the suffix ? to these quantifiers:
test_string = "h hi hii hiii hiiii"
test_patterns = [r"hi?", r"hi*", r"hi+", r"hi{3}", r"hi{2,3}", r"hi{2,}",
                 r"hi??", r"hi*?", r"hi+?", r"hi{2,}?"]
for pattern in test_patterns:
    print(f"{pattern: <9}--> {re.findall(pattern, test_string)}")

# output the following lines:
hi?      --> ['h', 'hi', 'hi', 'hi', 'hi']
hi*      --> ['h', 'hi', 'hii', 'hiii', 'hiiii']
hi+      --> ['hi', 'hii', 'hiii', 'hiiii']
hi{3}    --> ['hiii', 'hiii']
hi{2,3}  --> ['hii', 'hiii', 'hiii']
hi{2,}   --> ['hii', 'hiii', 'hiiii']
hi??     --> ['h', 'h', 'h', 'h', 'h']
hi*?     --> ['h', 'h', 'h', 'h', 'h']
hi+?     --> ['hi', 'hi', 'hi', 'hi']
hi{2,}?  --> ['hii', 'hii', 'hii']
These search results should be consistent with what you can expect. Among these results, the last several patterns involve the use of the ? suffix, which makes the pattern match the shortest possible sequence that satisfies the pattern instead of the longest one.
The flexibility of regular expressions arises from the simplicity of using a few characters to denote multiple possibilities of characters. When I introduced raw strings in section 2.4.2, I mentioned that you can use \d to denote any digit. You can specify many other character sets with regular expressions. Here, I focus on the most common ones:
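The difference between greedy and non-greedy matching becomes more visible when the pattern has a closing delimiter. The tag-like string below is an assumption chosen for illustration, not data from the task app.

```python
import re

text = "<b>Homework</b>"

# Greedy: .+ grabs as much as possible, so the match runs to the last >.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first > that completes the pattern.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>Homework</b>']
print(lazy)    # ['<b>', '</b>']
```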
\d    any decimal digit
\D    any character that is not a decimal digit
\s    any whitespace, including space, \t, \n, \r, \f, \v
\S    any character that isn't a whitespace
\w    any word character, meaning alphanumeric characters plus underscores
\W    any character that is not a word character
.     any character except a newline
[]    a set of defined characters
- You can include individual characters. [abcxyz] will match any of these six characters, and [0z] will match "0" and "z".
- You can include a range of characters. [a-z] will match any character between "a" and "z", and [A-Z] will match any character between "A" and "Z".
- You can even combine different ranges of characters. [a-dw-z] will match any character between "a" and "d" or between "w" and "z".
The best way to remember what each character set does is to study specific examples, as shown in the following code snippet:
test_text = "#1$2m_ M\t"
patterns = [r"\d", r"\D", r"\s", r"\S", r"\w", r"\W", r".", r"[lmn]"]
for pattern in patterns:
    print(f"{pattern: <9}---> {re.findall(pattern, test_text)}")

# output the following lines:
\d       ---> ['1', '2']
\D       ---> ['#', '$', 'm', '_', ' ', 'M', '\t']
\s       ---> [' ', '\t']
\S       ---> ['#', '1', '$', '2', 'm', '_', 'M']
\w       ---> ['1', '2', 'm', '_', 'M']
\W       ---> ['#', '$', ' ', '\t']
.        ---> ['#', '1', '$', '2', 'm', '_', ' ', 'M', '\t']
[lmn]    ---> ['m']
The identified matches form several pairs of complements. \d locates all digits, for example, and \D locates all the nondigits. Recognizing that these character classes have the opposite matches helps you remember them. The key to mastering regular expressions is practice!
Like other programming languages, regular expressions have logical operations in terms of defining the patterns. These operations are the most common ones:
a|b      a or b
(abc)    abc as a group
[^a]     any character other than a
Use a pair of parentheses to denote an exact group of characters that must be present, and use the caret sign to create a character set by negating a specific one. If you want to find any character that is not s, for example, you can use [^s]. Here are some examples for your reference:
re.findall(r"a|b", "a c d d b ab")
# output: ['a', 'b', 'a', 'b']

re.findall(r"a|b", "c d d b")
# output: ['b']

re.findall(r"(abc)", "ab bc abc ac")
# output: ['abc']

re.findall(r"(abc)", "ab bc ac")
# output: []

re.findall(r"[^a]", "abcde")
# output: ['b', 'c', 'd', 'e']
When you’ve learned to build a proper pattern, one obvious task is finding all the matches, as you did with the findall method (section 2.4.3). The findall method may be the most useful when the involved texts are short and we can easily figure out where the matches are. In actual projects, we’ll likely deal with a large chunk of text, so showing us what the matches are doesn’t help. Instead, we want to know where and what the matches are. This task is what Match objects are all about. This section shows how to process the matches.
The match and search methods are often used for pattern searching. The major difference between match and search is where they look for matches. The match method is interested in whether a match exists at the beginning of the string; the search method scans the string until it finds a match (if one exists). Despite this difference, both methods return a Match object when the pattern finds a match. For the sake of learning Match objects, focus on an example that calls the search method:
match = re.search(r"(\w\d)+", "xyza2b1c3dd")
print(match)
# output: <re.Match object; span=(3, 9), match='a2b1c3'>
The key information about a Match object is its matched string and the span. We can retrieve them with their respective methods: group, span, start, and end, as shown in the next listing.
Listing 2.6 Methods of a Match object
print("matched:", match.group())
# output: matched: a2b1c3

print("span:", match.span())
# output: span: (3, 9)

print(f"start: {match.start()} & end: {match.end()}")
# output: start: 3 & end: 9
When we use regular expressions, we perform specific operations only if a match is identified. To make our life easy, a Match object always evaluates to True when used in a conditional statement. Here’s a general-use style:
match = re.match("pattern", "string to match")
if match:
    print("do something with the matched")
else:
    print("found no matches")
Readability
One thing that may puzzle you is why these pieces of information are retrieved by calling methods instead of attributes: match.span() vs. match.span. If you’re wondering why, congratulations; you’re developing a good sense of the OOP principle. I agree with you that from the OOP perspective, your intuition that the data should be attributes is correct. But the feature is implemented with method invocations because pattern searching can result in multiple groups. If you pay close attention to listing 2.6, you’ll notice that you use the group method to retrieve the matched string. Are you wondering when a match can have multiple groups? Find out through an example:
match = re.match(r"(\w+), (\w+)", "Homework, urgent; today")
print(match)
# output: <re.Match object; span=(0, 16), match='Homework, urgent'>

match.groups()
# output: ('Homework', 'urgent')

match.group(0)
# output: 'Homework, urgent'

match.group(1)
# output: 'Homework'

match.group(2)
# output: 'urgent'
This pattern involves two groups (enclosed within parentheses), each of which searches for one or more word characters, separated by a comma and a space. As mentioned previously, the matching is greedy because the longest possible sequence is 'Homework, urgent'. The identified match creates separate groups that correspond to the pattern’s groups.
By default, group 0 is the entire match. The subsequent groups are matched based on the pattern’s groups. Because of the multiple groups that a pattern can match, it’s better to use methods to retrieve each group’s information instead of an attribute, which can’t accept arguments. The same grouping also applies to span:
match.span(0)
# output: (0, 16)

match.span(1)
# output: (0, 8)

match.span(2)
# output: (10, 16)
To use regular expressions effectively in our projects, we must know what functionalities are available for us to use. Table 2.3 summarizes the key methods; each method is accompanied by an example for illustration purposes.
Table 2.3 Common regular expression methods
search: Returns a Match if a match is found anywhere in the string.
match: Returns a Match only if a match is found at the string’s beginning.
findall: Returns a list of strings that match the pattern. When the pattern has multiple groups, the item is a tuple.
finditer: Returns an iterator that yields the Match objects.
sub: Creates a string by replacing the matched with the replacement.
- Both search and match identify a single Match object. The biggest difference is that match is anchored to the beginning of the string, whereas search scans the string, and a match in the middle is also valid.
- When you try to locate all matches, the findall method returns all the matches without providing any information about where they are. Thus, more commonly, you want to use finditer. That method returns an iterator that yields each Match object, which has more descriptive information about the match (such as location).
- The split method splits the string by all the matched patterns. Optionally, you can specify the maximum number of splits that you want.
- The sub method’s name means substitute, and you use this method to replace any identified pattern with the specified replacement. In an advanced use case, you can specify a function instead of a string literal, which takes a Match object as its argument to produce the desired replacement.
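The methods in the list above can be tried out on a short string. The following is a minimal sketch; the text value and the helper name tag_id are made up for illustration.

```python
import re

text = "101 Homework, 102 Laundry, 103 Museum"
id_pattern = re.compile(r"\d{3}")

# finditer yields Match objects, so each hit carries its location.
locations = [(m.group(), m.span()) for m in id_pattern.finditer(text)]
print(locations)

# split cuts the string wherever the pattern matches.
titles = re.split(r",\s*", text)
print(titles)


# sub with a function: the function receives each Match object
# and returns the replacement text for that match.
def tag_id(match):
    return f"#{match.group()}"


tagged = id_pattern.sub(tag_id, text)
print(tagged)
```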
The key steps in using regular expressions are (1) creating a pattern, (2) finding matches, and (3) processing matches. These steps should be built on a clear understanding of the exact needs of your text processing job. Think of the pattern at a higher level. Do you need boundary anchors, quantifiers, or character sets? Then drill down to the syntax for these categories. Be prepared for your pattern not to work as you expect. You must test your pattern by evaluating the matches with a subset of your text. There are almost always some edge cases that will surprise you. Ensure that the pattern accounts for rare cases before you deploy anything to production.
Suppose that a graduate student has a project that requires extracting data from text, and that the text data is "abc_,abc__,abc,,__abc_,_abc", where abc stands for the needed data values. That is, the data values are separated by one or more separators. How can the student use regular expressions to extract the data values?
Hint
Regular expressions are not the easiest topic to grasp because we’re creating a general pattern that can match a variety of possibilities. In most cases, the pattern looks rather abstract and thus is confusing to many beginners. Therefore, don’t feel frustrated if the concept is not making sense to you now; it takes time to master regular expressions. When you grasp them, you’ll find them powerful for processing textual data.
Using our task management app as an example, suppose that we have the text shown in the following listing to begin with. The text, which is the data recovered from a database crash, contains multiple valid records of the tasks, but unfortunately, random text appears throughout the data.
Listing 2.7 Text data to be processed
text_data = """101, Homework; Complete physics and math
some random nonsense
102, Laundry; Wash all the clothes today
54, random; record
103, Museum; All about Egypt
1234, random; record
Another random record"""
Our job is to extract all the valid records from the text data, leaving the invalid records. Suppose that there are several thousand lines of text, making it unrealistic to go through the data manually. We need to use a general pattern-searching approach to conquer this job, which is exactly what regular expressions are designed to do. In this section, I go over the key steps in solving this problem.
The string shown in listing 2.7 highlights a common task when we deal with texts: cleaning up the data. Often, the needed data is mixed with unneeded data. Thus, we want to implement a programmatic solution, taking advantage of regular expressions, to keep only the needed data. In this section, you’ll learn the first step: creating the pattern.
After making a careful inspection of the raw data, you notice that the valid records have three contributing groups: the task ID number in the form of three digits, the title of the task, and the description of the task. The first two groups are separated by a comma, and the last two groups are separated by a semicolon. Based on these pieces of information, you might build the following pattern, with each of the components analyzed in detail:
r"(\d{3}), (\w+); (.+)"

(\d{3}):  a group of 3 digits
, :       string literals, a comma and a space
(\w+):    a group of one or more word characters
; :       string literals, a semicolon and a space
(.+):     a group of one or more characters
Applying this pattern to the text data, you can have a quick look at the outcome. At this stage, don’t worry about processing the matches, because you want to make sure that the pattern works as expected. You arrive at the following code only after testing and modifying the pattern multiple times until it reaches the desired form:
regex = re.compile(r"(\d{3}), (\w+); (.+)")
for line in text_data.split("\n"):    # process the text line by line
    match = regex.match(line)    # try the pattern at the start of the line
    if match:
        print(f"{'Matched:':<12}{match.group()}")    # show the entire match
    else:
        print(f"{'No Match:':<12}{line}")

# output the following lines:
Matched:    101, Homework; Complete physics and math
No Match:   some random nonsense
Matched:    102, Laundry; Wash all the clothes today
No Match:   54, random; record
Matched:    103, Museum; All about Egypt
No Match:   1234, random; record
No Match:   Another random record
As mentioned in section 2.4.4, an important feature of the Match object is that it evaluates to True, allowing us to work on the Match object only if it is created by the match method. From the printout, you see that you obtain valid records from the matched objects. By contrast, the records in the unmatched cases are indeed invalid.
Because the pattern works as expected, it’s time to extract the data and prepare it for further processing. To be specific, you want to save each record (ID, title, and description) as a tuple object, and the tuple objects form a list object.
Notably, when you built your pattern, you included three separate groups that accounted for each of the task’s data fields. These groups allow you to access the individual matches for each group. The next listing shows how groups work.
Listing 2.8 Extracting data from individual groups
regex = re.compile(r"(\d{3}), (\w+); (.+)")
tasks = []
for line in text_data.split("\n"):
    match = regex.match(line)
    if match:
        # build a tuple from the three groups: ID, title, description
        task = (match.group(1), match.group(2), match.group(3))
        tasks.append(task)

print(tasks)
# output the following line
[('101', 'Homework', 'Complete physics and math'), ('102', 'Laundry', 'Wash all the clothes today'), ('103', 'Museum', 'All about Egypt')]
As shown in listing 2.8, we use the group method and access the identified three groups in a sequential manner: group 1 for the ID, group 2 for the title, and group 3 for the description. As a related note, when we omit the number parameter in the group method, we’ll retrieve the entire match across the groups (see section 2.4.4).
In our example, we have three groups in the pattern. When our records get more complicated, we may have to deal with more groups. Using the integers to track these groups sequentially can be error-prone; it’s not difficult to miscount by one, which can lead to unexpected behaviors.
In general, texts provide more semantic information than numbers do. If the integers that refer to the groups can be confusing, do we have the option of using texts for group referencing? Fortunately, Python supports this feature, which is called named groups. In essence, this feature allows you to give a name to the group in such a way that you can use the name to refer to the group for later processing.
To name a group, you use the syntax (?P<group_name>pattern), in which you name the pattern group as group_name. The name should be a valid Python identifier because you must be able to retrieve it by calling the name. Now you can use the named groups technique to update the code in listing 2.8, as the next listing shows.
Listing 2.9 Using named groups to extract data
regex = re.compile(r"(?P<task_id>\d{3}), (?P<task_title>\w+); (?P<task_desc>.+)")
tasks = []
for line in text_data.split("\n"):
    match = regex.match(line)
    if match:
        task = (match.group('task_id'), match.group('task_title'), match.group('task_desc'))
        tasks.append(task)
In the code snippet, we named the three groups task_id, task_title, and task_desc, which clearly indicate the data for each group. Later, instead of passing an integer to the group method, we can pass the group name directly. Compared with the implementation in listing 2.8, using named groups in listing 2.9 improves code readability; more important, it decreases the likelihood of referencing a wrong group, particularly if a pattern contains many more groups.
Maintainability
Although we use the group method to retrieve the individual items from the identified groups, named groups give us another option for retrieving the identified data: the groupdict method. For the first identified match, we might have the following data:
>>> match.groupdict()
{'task_id': '101', 'task_title': 'Homework', 'task_desc': 'Complete physics and math'}
If you prefer using this dict object for data processing, it’s also a good choice in terms of code readability.
The first step in using regular expressions is knowing what business needs we want to achieve and creating a pattern accordingly. You shouldn’t feel obsessed with making the pattern correct on the first try. You must test your pattern with the text, and it’ll take multiple rounds of back-and-forth effort to find the correct pattern (figure 2.5).
When you work with more groups identified through a pattern, I recommend that you use named groups, because by naming these groups, you’re clearly telling the readers what data a group holds. Later, it’ll be easier to refer to the groups because of their sensible names.
When we processed the text data to extract the records, we split the text into separate rows. Assuming that each row indeed has one valid record or no record, could you find a pattern that processes all the text without splitting the data into multiple rows?
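If the dict form suits your processing, one possible arrangement is to collect a list of dicts instead of a list of tuples. This is a sketch under assumptions: sample_text is a shortened stand-in for the text in listing 2.7, and task_dicts is a name chosen for illustration.

```python
import re

# A shortened stand-in for the text in listing 2.7.
sample_text = """101, Homework; Complete physics and math
some random nonsense
102, Laundry; Wash all the clothes today"""

regex = re.compile(
    r"(?P<task_id>\d{3}), (?P<task_title>\w+); (?P<task_desc>.+)"
)

task_dicts = []
for line in sample_text.split("\n"):
    match = regex.match(line)
    if match:
        # groupdict maps each group name to its matched substring.
        task_dicts.append(match.groupdict())

print(task_dicts)
```

Each record can then be read by field name, as in task_dicts[0]["task_title"], without tracking group positions.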
- An f-string is a concise way to interpolate variables and expressions.
- Applying a proper text alignment to an f-string makes the information clear by creating visual boundaries for distinct pieces of data.
- F-strings are also good at formatting numbers, such as scientific notations and precisions for decimals.
- Python strings have isalnum, isnumeric, and many other is- methods. You can use them to determine the nature of a string.
- All Python data, such as integers and lists, can have the appearance of a string (such as when data is transferred over the internet and all of it consists of strings). We convert these strings to their native data types by evaluating them, so we can use the data type-specific methods.
- When we need to join a few strings, it’s fine to use the concatenation symbols. When we deal with multiple strings, however, it’s better to use the join method.
- The split method splits strings, which is a useful data processing tool as well as the basis for processing tabulated text files. Although built-in modules are available, such as csv, knowing these fundamentals is key to writing a script for your own job.
- The key to using regular expressions is building a pattern that addresses your needs. When we build a pattern, we need to start our thinking at a higher level. Relevant questions can include these: Do I need multiple groups? How about boundary anchors, character sets, or quantifiers?
- Named groups make it easier to refer to specific information when you use regular expressions to process complicated text data.