<!doctype html><html lang=en><head><meta content="IE=edge" http-equiv=X-UA-Compatible><meta content="text/html; charset=utf-8" http-equiv=content-type><meta content="width=device-width,initial-scale=1.0,maximum-scale=1" name=viewport><title>Python Regex Surprises</title><link href=https://learnbyexample.github.io/atom.xml rel=alternate title=RSS type=application/atom+xml><script src=https://cdnjs.cloudflare.com/ajax/libs/slideout/1.0.1/slideout.min.js></script><link href=https://learnbyexample.github.io/site.css rel=stylesheet><meta content="Python Regex Surprises" property=og:title><meta content=website property=og:type><meta content="Quiz to stretch your understanding of Python regular expressions" property=og:description><meta content=https://learnbyexample.github.io/python-regex-surprises/ property=og:url><meta content=https://learnbyexample.github.io/images/python_regex_surprises.png property=og:image><meta content=1280 property=og:image:width><meta content=640 property=og:image:height><meta content=summary_large_image property=twitter:card><meta content=@learn_byexample property=twitter:site><link href=https://learnbyexample.github.io/favicon.svg rel=icon><link rel="shortcut icon" href=https://learnbyexample.github.io/favicon.png><body><div class=container><div class=mobile-navbar id=mobile-navbar><div class=mobile-header-logo><a class=logo href=/>learnbyexample</a></div><div class="mobile-navbar-icon icon-out"><span></span><span></span><span></span></div></div><nav class="mobile-menu slideout-menu slideout-menu-left" id=mobile-menu><ul class=mobile-menu-list><li class=mobile-menu-item><a href=https://learnbyexample.github.io/books> Books </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/mini> Mini </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/tips> Tips </a><li class=mobile-menu-item><a href=https://learnbyexample.github.io/tags> Tags </a><li class=mobile-menu-item><a 
href=https://learnbyexample.github.io/about> About </a></ul></nav><header id=header><div class=logo><a href=https://learnbyexample.github.io>learnbyexample</a></div><nav class=menu><ul><li><a href=https://learnbyexample.github.io/books> Books </a><li><a href=https://learnbyexample.github.io/mini> Mini </a><li><a href=https://learnbyexample.github.io/tips> Tips </a><li><a href=https://learnbyexample.github.io/tags> Tags </a><li><a href=https://learnbyexample.github.io/about> About </a></ul></nav></header><main><div class=content id=mobile-panel><div class=post-toc id=post-toc><h2 class=post-toc-title>Contents</h2><div class="post-toc-content always-active"><nav id=TableOfContents><ul><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#vs-z>$ vs \Z</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#slicing-vs-start-and-end-arguments>Slicing vs start and end arguments</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#do-and-match-after-the-last-newline>Do ^ and $ match after the last newline?</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#word-boundary-vs-lookarounds>Word boundary vs lookarounds</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#undefined-escape-sequences>Undefined escape sequences</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#using-octal-and-hexadecimal-escapes-in-the-replacement-section>Using octal and hexadecimal escapes in the replacement section</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#using-escape-sequences-for-metacharacters>Using escape sequences for metacharacters</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#empty-matches>Empty matches</a><li><a class=toc-link 
href=https://learnbyexample.github.io/python-regex-surprises/#can-quantifiers-be-grouped-out>Can quantifiers be grouped out?</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#portion-captured-by-a-quantified-group>Portion captured by a quantified group</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#character-combinations>Character combinations</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#greedy-vs-possessive>Greedy vs Possessive</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#optional-flags-argument>Optional flags argument</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#re-vs-regex-module-flags>re vs regex module flags</a><li><a class=toc-link href=https://learnbyexample.github.io/python-regex-surprises/#understanding-python-re-gex-book>Understanding Python re(gex)? book</a></ul></nav></div></div><article class=post><header class=post__header><h1 class=post__title><a href=https://learnbyexample.github.io/python-regex-surprises/>Python Regex Surprises</a></h1><div class=post__meta><span class=post__time>2023-01-21</span></div></header><div class=post-content><p>In this post, you'll find a few regular expression examples that might surprise you. Some are Python specific and some are applicable to other regex flavors as well. To make it more interesting, these are framed as questions for you to ponder upon. 
Answers are hidden by default.<p align=center><img alt="Python Regex Surprises" src=/images/python_regex_surprises.png><p align=center><i>Poster created using <a href=https://www.canva.com/>Canva</a></i><p><img alt=info src=/images/info.svg> If you are not familiar with regular expressions, check out my <a href=https://github.com/learnbyexample/py_regular_expressions>Understanding Python re(gex)?</a> ebook.</p><span id=continue-reading></span><br><h2 id=vs-z>$ vs \Z<a aria-label="Anchor link for: vs-z" class=zola-anchor href=#vs-z>🔗</a></h2><p>Are the <code>$</code> and <code>\Z</code> anchors equivalent?<details><summary><i style=color:gray>Click to view answer</i></summary> <p><code>$</code> matches at the end of the string and also just before a <code>\n</code> that is the last character of the string. <code>\Z</code> matches only at the end of the string.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span style=color:#72ab00;>import </span><span>re
</span><span style=color:#72ab00;>>>> </span><span>greeting </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'hi there</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>have a nice day</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>day</span><span style=color:#72ab00;>$</span><span style=color:#d07711;>'</span><span>, greeting))
</span><span style=color:#b3933a;>True
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>day</span><span style=color:#aeb52b;>\n</span><span style=color:#72ab00;>$</span><span style=color:#d07711;>'</span><span>, greeting))
</span><span style=color:#b3933a;>True
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>day</span><span style=color:#72ab00;>\Z</span><span style=color:#d07711;>'</span><span>, greeting))
</span><span style=color:#b3933a;>False
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>day</span><span style=color:#aeb52b;>\n</span><span style=color:#72ab00;>\Z</span><span style=color:#d07711;>'</span><span>, greeting))
</span><span style=color:#b3933a;>True
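</span><span>
</span><span style=color:#7f8989;># a further check, not from the original post: with the re.MULTILINE flag,
</span><span style=color:#7f8989;># $ also matches just before every newline, while \Z still matches only at the very end
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>there</span><span style=color:#72ab00;>$</span><span style=color:#d07711;>'</span><span>, greeting, flags</span><span style=color:#72ab00;>=</span><span>re.M))
</span><span style=color:#b3933a;>True
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>there</span><span style=color:#72ab00;>\Z</span><span style=color:#d07711;>'</span><span>, greeting, flags</span><span style=color:#72ab00;>=</span><span>re.M))
</span><span style=color:#b3933a;>False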
</span></code></pre></details><br><h2 id=slicing-vs-start-and-end-arguments>Slicing vs start and end arguments<a aria-label="Anchor link for: slicing-vs-start-and-end-arguments" class=zola-anchor href=#slicing-vs-start-and-end-arguments>🔗</a></h2><p>Did you know that you can specify <em>start</em> and <em>end</em> index arguments for the methods of a compiled pattern?<blockquote><p><code>Pattern.search(string[, pos[, endpos]])</code></blockquote><p>Now, here's a conundrum:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>word_pat </span><span style=color:#72ab00;>= </span><span>re.</span><span style=color:#5597d6;>compile</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\A</span><span style=color:#7c8f4c;>at</span><span style=color:#d07711;>'</span><span>)
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(word_pat.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#d07711;>'cater'</span><span>[</span><span style=color:#b3933a;>1</span><span>:]))
</span><span style=color:#b3933a;>True
</span><span>
</span><span style=color:#7f8989;># what will be the output?
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(word_pat.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#d07711;>'cater'</span><span>, </span><span style=color:#b3933a;>1</span><span>))
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> With a start index greater than <code>0</code>, <code>\A</code> can never match, so the output here is <code>False</code>. As far as the <code>search()</code> method is concerned, only the search space has been narrowed; the anchor positions haven't changed. Slicing, in contrast, creates an entirely new string object with new anchor positions.</details><br><h2 id=do-and-match-after-the-last-newline>Do ^ and $ match after the last newline?<a aria-label="Anchor link for: do-and-match-after-the-last-newline" class=zola-anchor href=#do-and-match-after-the-last-newline>🔗</a></h2><p>When you use the <code>re.MULTILINE</code> flag, the <code>^</code> and <code>$</code> anchors will match at the start and end of every input line. The question is, will they also match after a newline character at the end of the input?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>Yes, they will both match after the last newline character.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span style=color:#b39f04;>print</span><span>(re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>(?m)^</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'apple '</span><span>, </span><span style=color:#d07711;>'1</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>2</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>'</span><span>))
</span><span>apple </span><span style=color:#b3933a;>1
</span><span>apple </span><span style=color:#b3933a;>2
</span><span>apple
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#b39f04;>print</span><span>(re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>(?m)$</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>' banana'</span><span>, </span><span style=color:#d07711;>'1</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>2</span><span style=color:#aeb52b;>\n</span><span style=color:#d07711;>'</span><span>))
</span><span style=color:#b3933a;>1 </span><span>banana
</span><span style=color:#b3933a;>2 </span><span>banana
</span><span> banana
</span></code></pre></details><br><h2 id=word-boundary-vs-lookarounds>Word boundary vs lookarounds<a aria-label="Anchor link for: word-boundary-vs-lookarounds" class=zola-anchor href=#word-boundary-vs-lookarounds>🔗</a></h2><p><code>\b..\b</code> is the same as <code>(?<!\w)..(?!\w)</code>: True or False?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>False! <code>\b</code> matches both the start and end of word locations. In the example below, <code>\b..\b</code> doesn't necessarily mean that the first <code>\b</code> will match only the start of word location and the second <code>\b</code> will match only the end of word location. They can be any combination! For example, <code>I</code> followed by space in the input string here is using the start of word location for both the conditions. Similarly, space followed by <code>2</code> is using the end of word location for both the conditions.</p> <p>In contrast, the negative lookaround version ensures that the two matched characters have no word character on either side. Also, such assertions will always be satisfied at the start of string and the end of string respectively. But <code>\b</code> depends on the presence of word characters. For example, <code>!</code> at the end of the input string here satisfies the lookaround assertion but not the word boundary.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>ip </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'I have 12, he has 2!'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\b</span><span style=color:#aeb52b;>..</span><span style=color:#72ab00;>\b</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'{</span><span style=text-decoration:underline;font-style:italic;color:#d2a8a1;>\g</span><span style=color:#d07711;><0>}'</span><span>, ip)
</span><span style=color:#d07711;>'{I }have </span><span style=color:#aeb52b;>{12}</span><span style=color:#d07711;>{, }</span><span style=color:#aeb52b;>{he}</span><span style=color:#d07711;> has{ 2}!'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>?<!\w</span><span style=color:#7c8f4c;>)</span><span style=color:#aeb52b;>..</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>?!\w</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'{</span><span style=text-decoration:underline;font-style:italic;color:#d2a8a1;>\g</span><span style=color:#d07711;><0>}'</span><span>, ip)
</span><span style=color:#d07711;>'I have </span><span style=color:#aeb52b;>{12}</span><span style=color:#d07711;>, </span><span style=color:#aeb52b;>{he}</span><span style=color:#d07711;> has {2!}'
</span></code></pre></details><br><h2 id=undefined-escape-sequences>Undefined escape sequences<a aria-label="Anchor link for: undefined-escape-sequences" class=zola-anchor href=#undefined-escape-sequences>🔗</a></h2><p>If you use undefined escape sequences like <code>\e</code>, will you get an error or will it match the unescaped character (<code>e</code> for this example)?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>Python raises an exception for escape sequences that are not defined. Apart from sequences defined for character sets (for example <code>\d</code>, <code>\w</code>, <code>\s</code>, etc.), these are allowed: <code>\a \b \f \n \N \r \t \u \U \v \x \\</code> where <code>\b</code> means backspace only in character classes. Also, <code>\u</code> and <code>\U</code> are valid only in Unicode patterns.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\t</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'cat</span><span style=color:#aeb52b;>\t</span><span style=color:#d07711;>dog'</span><span>))
</span><span style=color:#b3933a;>True
</span><span>
</span><span style=color:#72ab00;>>>> </span><span style=color:#a2a001;>bool</span><span>(re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\c</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'cat</span><span style=color:#aeb52b;>\t</span><span style=color:#d07711;>dog'</span><span>))
</span><span>re.error: bad escape \</span><span style=background-color:#562d56bf;color:#f8f8f8;>c at position 0</span><span>
</span></code></pre></details><br><h2 id=using-octal-and-hexadecimal-escapes-in-the-replacement-section>Using octal and hexadecimal escapes in the replacement section<a aria-label="Anchor link for: using-octal-and-hexadecimal-escapes-in-the-replacement-section" class=zola-anchor href=#using-octal-and-hexadecimal-escapes-in-the-replacement-section>🔗</a></h2><p>In string literals, you can use octal, hexadecimal and Unicode escapes to represent a character. For example, <code>'\174'</code> is the same as using <code>'|'</code>. Do you know which of these escapes you can use inside raw strings in the replacement section of the <code>sub()</code> function?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>Only octal escapes are allowed inside raw strings in the replacement section. If you are otherwise not using the <code>\</code> character, then using normal strings in the replacement section is preferred as it will also allow hexadecimal and Unicode escapes.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>,</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\x</span><span style=color:#7c8f4c;>7c</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2'</span><span>)
</span><span>re.error: bad escape \</span><span style=background-color:#562d56bf;color:#f8f8f8;>x at position 0</span><span>
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>,</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\17</span><span style=color:#7c8f4c;>4</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2'</span><span>)
</span><span style=color:#d07711;>'1|2'
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>,</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\x7c</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2'</span><span>)
</span><span style=color:#d07711;>'1|2'
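</span><span>
</span><span style=color:#7f8989;># one more case, added as an illustration: a Unicode escape in a normal (non-raw) string
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>,</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>\u007c</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2'</span><span>)
</span><span style=color:#d07711;>'1|2'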
</span></code></pre> <p>I feel it would have been better if octal escapes were also not allowed. That would have allowed us to use <code>\0</code> instead of <code>\g<0></code> for backreferencing the entire matched portion in the replacement section.</p></details><br><h2 id=using-escape-sequences-for-metacharacters>Using escape sequences for metacharacters<a aria-label="Anchor link for: using-escape-sequences-for-metacharacters" class=zola-anchor href=#using-escape-sequences-for-metacharacters>🔗</a></h2><p>In the search section, if you use an escape (for example, <code>\x7c</code> to represent the <code>|</code> character), will it behave as the alternation metacharacter or match it literally?<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>2</span><span style=color:#72ab00;>|</span><span style=color:#7c8f4c;>3</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'5'</span><span>, </span><span style=color:#d07711;>'12|30'</span><span>)
</span><span style=color:#d07711;>'15|50'
</span><span>
</span><span style=color:#7f8989;># what will be the output?
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>2</span><span style=color:#aeb52b;>\x</span><span style=color:#7c8f4c;>7c3</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'5'</span><span>, </span><span style=color:#d07711;>'12|30'</span><span>)
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> <p>The output will be <code>'150'</code> since the character produced by the escape is matched literally.</p></details><br><h2 id=empty-matches>Empty matches<a aria-label="Anchor link for: empty-matches" class=zola-anchor href=#empty-matches>🔗</a></h2><p>You are likely to have come across this before:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#7f8989;># what will be the output?
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>*</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>{</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>}</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>',cat,tiger'</span><span>)
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> <p>See also <a href=https://www.regular-expressions.info/zerolength.html>Zero-Length Matches</a>.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#7f8989;># there is an extra empty string match at the end of matches
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>*</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>{</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>}</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>',cat,tiger'</span><span>)
</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>{}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{cat}{}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{tiger}{}</span><span style=color:#d07711;>'
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>*+</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>{</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>}</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>',cat,tiger'</span><span>)
</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>{}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{cat}{}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{tiger}{}</span><span style=color:#d07711;>'
</span><span>
</span><span style=color:#7f8989;># use lookarounds as a workaround
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>?<![</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#7c8f4c;>)</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>*</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>{</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>}</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>',cat,tiger'</span><span>)
</span><span style=color:#d07711;>'</span><span style=color:#aeb52b;>{}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{cat}</span><span style=color:#d07711;>,</span><span style=color:#aeb52b;>{tiger}</span><span style=color:#d07711;>'
</span></code></pre></details><br><h2 id=can-quantifiers-be-grouped-out>Can quantifiers be grouped out?<a aria-label="Anchor link for: can-quantifiers-be-grouped-out" class=zola-anchor href=#can-quantifiers-be-grouped-out>🔗</a></h2><p>Similar to <code>a(b+c)d = abd+acd</code> in maths, you get <code>a(b|c)d = abd|acd</code> in regular expressions. <code>(a*|b*)</code> is the same as <code>(a|b)*</code>: True or False?<details><summary><i style=color:gray>Click to view answer</i></summary> <p align=center><img alt="Regexp grouping with quantifiers gotcha" src=/images/mini/regexp_gotcha_1.png></p> <p align=center>Railroad diagram created using <a href=https://www.debuggex.com/>debuggex.com</a></p> <p>False. <code>(a*|b*)</code> will match only sequences like <code>a</code>, <code>aaa</code>, <code>bb</code>, <code>bbbbbbbb</code>, whereas <code>(a|b)*</code> can match mixed sequences like <code>ababbba</code> too.</p></details><br><h2 id=portion-captured-by-a-quantified-group>Portion captured by a quantified group<a aria-label="Anchor link for: portion-captured-by-a-quantified-group" class=zola-anchor href=#portion-captured-by-a-quantified-group>🔗</a></h2><p>This should be another familiar regex gotcha:<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#7f8989;># what will be the output?
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\A</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>,)</span><span style=color:#72ab00;>{3}</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\1</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>\2</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2,3,4,5,6,7'</span><span>)
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> <p>Referring to the text matched by a capture group with a quantifier will give only the last match, not the entire match. You'll need an outer capture group to get the entire matched portion.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\A</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>,)</span><span style=color:#72ab00;>{3}</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\1</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>\2</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2,3,4,5,6,7'</span><span>)
</span><span style=color:#d07711;>'3,(4),5,6,7'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\A</span><span style=color:#7c8f4c;>((?:</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>,)</span><span style=color:#72ab00;>{3}</span><span style=color:#7c8f4c;>)(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\1</span><span style=color:#7c8f4c;>(</span><span style=color:#72ab00;>\2</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2,3,4,5,6,7'</span><span>)
</span><span style=color:#d07711;>'1,2,3,(4),5,6,7'
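</span><span>
</span><span style=color:#7f8989;># a quick check, added as a sketch and not part of the original quiz:
</span><span style=color:#7f8989;># groups() shows that group 1 retains only its last repetition
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>search</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\A</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>,)</span><span style=color:#72ab00;>{3}</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>[</span><span style=color:#72ab00;>^</span><span style=color:#aeb52b;>,]</span><span style=color:#72ab00;>+</span><span style=color:#7c8f4c;>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'1,2,3,4,5,6,7'</span><span>).</span><span style=color:#5597d6;>groups</span><span>()
</span><span>(</span><span style=color:#d07711;>'3,'</span><span>, </span><span style=color:#d07711;>'4'</span><span>)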
</span></code></pre></details><br><h2 id=character-combinations>Character combinations<a aria-label="Anchor link for: character-combinations" class=zola-anchor href=#character-combinations>🔗</a></h2><p><code>\b[a-z](on|no)[a-z]\b</code> is the same as <code>\b[a-z][on]{2}[a-z]\b</code>. True or False?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>False. <code>[on]{2}</code> will also match <code>oo</code> and <code>nn</code>.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>words </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'known mood know pony inns'
</span><span>
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\b</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>a-z</span><span style=color:#aeb52b;>]</span><span style=color:#7c8f4c;>(?:on</span><span style=color:#72ab00;>|</span><span style=color:#7c8f4c;>no)</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>a-z</span><span style=color:#aeb52b;>]</span><span style=color:#72ab00;>\b</span><span style=color:#d07711;>'</span><span>, words)
</span><span>[</span><span style=color:#d07711;>'know'</span><span>, </span><span style=color:#d07711;>'pony'</span><span>]
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>\b</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>a-z</span><span style=color:#aeb52b;>][on]</span><span style=color:#72ab00;>{2}</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>a-z</span><span style=color:#aeb52b;>]</span><span style=color:#72ab00;>\b</span><span style=color:#d07711;>'</span><span>, words)
</span><span>[</span><span style=color:#d07711;>'mood'</span><span>, </span><span style=color:#d07711;>'know'</span><span>, </span><span style=color:#d07711;>'pony'</span><span>, </span><span style=color:#d07711;>'inns'</span><span>]
</span></code></pre></details><br><h2 id=greedy-vs-possessive>Greedy vs Possessive<a aria-label="Anchor link for: greedy-vs-possessive" class=zola-anchor href=#greedy-vs-possessive>🔗</a></h2><p>Suppose you want to match integers greater than or equal to <code>100</code>, where the numbers can optionally have leading zeros. Will the code below work? If not, what would you use instead?<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>numbers </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'42 314 001 12 00984'
</span><span>
</span><span style=color:#7f8989;># will this work?
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> <p>No, <code>001</code> is matched as well. You can either modify the pattern so that <code>0*</code> doesn't interfere, or use possessive quantifiers (available in the <code>re</code> module from Python 3.11 onwards) to prevent backtracking.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>numbers </span><span style=color:#72ab00;>= </span><span style=color:#d07711;>'42 314 001 12 00984'
</span><span>
</span><span style=color:#7f8989;># this solution fails because 0* and \d{3,} can both match leading zeros
</span><span style=color:#7f8989;># and greedy quantifiers will give up characters to help the overall RE succeed
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'001'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span><span>
</span><span style=color:#7f8989;># 0*+ is possessive, will never give back leading zeros
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*+</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
</span><span>
</span><span style=color:#7f8989;># workaround if possessive quantifiers aren't supported (before Python 3.11)
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>0</span><span style=color:#72ab00;>*</span><span style=color:#aeb52b;>[</span><span style=color:#b3933a;>1-9</span><span style=color:#aeb52b;>]\d</span><span style=color:#72ab00;>{2,}</span><span style=color:#d07711;>'</span><span>, numbers)
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]
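</span><span>
</span><span style=color:#7f8989;># another workaround, sketched here as an addition to the original answer:
</span><span style=color:#7f8989;># the engine doesn't backtrack into a lookahead once it has matched, so capturing
</span><span style=color:#7f8989;># 0* inside it and consuming it with \1 emulates the possessive 0*+ on older versions
</span><span style=color:#72ab00;>>>> </span><span>[m[</span><span style=color:#b3933a;>0</span><span>] </span><span style=color:#72ab00;>for </span><span>m </span><span style=color:#72ab00;>in </span><span>re.</span><span style=color:#5597d6;>finditer</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(?=(0</span><span style=color:#72ab00;>*</span><span style=color:#7c8f4c;>))</span><span style=color:#72ab00;>\1</span><span style=color:#aeb52b;>\d</span><span style=color:#72ab00;>{3,}</span><span style=color:#d07711;>'</span><span>, numbers)]
</span><span>[</span><span style=color:#d07711;>'314'</span><span>, </span><span style=color:#d07711;>'00984'</span><span>]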
</span></code></pre> <p><img alt=info src=/images/info.svg> See my blog post on <a href=https://learnbyexample.github.io/python-regex-possessive-quantifier/>possessive quantifiers and atomic grouping</a> for more examples, details about catastrophic backtracking and so on.</p></details><br><h2 id=optional-flags-argument>Optional flags argument<a aria-label="Anchor link for: optional-flags-argument" class=zola-anchor href=#optional-flags-argument>🔗</a></h2><p>Will the <code>sub()</code> call in the code sample below match case-insensitively or not?<pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>, re.I)
</span><span>[</span><span style=color:#d07711;>'KEY'</span><span>, </span><span style=color:#d07711;>'key'</span><span>, </span><span style=color:#d07711;>'Key'</span><span>, </span><span style=color:#d07711;>'Key'</span><span>]
</span><span>
</span><span style=color:#7f8989;># what will be the output?
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>, re.I)
</span></code></pre><details><summary><i style=color:gray>Click to view answer</i></summary> <p>No. You should always pass <code>flags</code> as a keyword argument. Passing it positionally is a common mistake because <code>re.findall()</code> and <code>re.sub()</code> place it differently: it is the third parameter of <code>findall()</code> but the fifth parameter of <code>sub()</code>, where <code>count</code> comes before it.</p> <blockquote><p><code>re.findall(pattern, string, flags=0)</code><p><code>re.sub(pattern, repl, string, count=0, flags=0)</code></blockquote> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> +</span><span>re.I
</span><span style=color:#b3933a;>2
</span><span>
</span><span style=color:#7f8989;># works because flags is the only optional argument for findall
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>, re.I)
</span><span>[</span><span style=color:#d07711;>'KEY'</span><span>, </span><span style=color:#d07711;>'key'</span><span>, </span><span style=color:#d07711;>'Key'</span><span>, </span><span style=color:#d07711;>'Key'</span><span>]
</span><span>
</span><span style=color:#7f8989;># wrong usage, but no error because re.I has a value of 2
</span><span style=color:#7f8989;># so this is the same as specifying count=2
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>, re.I)
</span><span style=color:#d07711;>'KEY port(key) oKey Keyed'
</span><span>
</span><span style=color:#7f8989;># correct use of keyword argument
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>, </span><span style=color:#5597d6;>flags</span><span style=color:#72ab00;>=</span><span>re.I)
</span><span style=color:#d07711;>'(KEY) port(key) o(Key) (Key)ed'
</span><span>
</span><span style=color:#7f8989;># alternatively, you can use inline flags to avoid this problem altogether
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>(?i)</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>)
</span><span style=color:#d07711;>'(KEY) port(key) o(Key) (Key)ed'
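</span><span>
</span><span style=color:#7f8989;># one more option, added here as a sketch (pat is just an illustrative name):
</span><span style=color:#7f8989;># compile the pattern with flags, since the compiled sub() method takes no flags argument
</span><span style=color:#72ab00;>>>> </span><span>pat </span><span style=color:#72ab00;>= </span><span>re.</span><span style=color:#5597d6;>compile</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>key</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#5597d6;>flags</span><span style=color:#72ab00;>=</span><span>re.I)
</span><span style=color:#72ab00;>>>> </span><span>pat.</span><span style=color:#5597d6;>sub</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#7c8f4c;>(</span><span style=color:#aeb52b;>\g</span><span style=color:#7c8f4c;><0>)</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'KEY portkey oKey Keyed'</span><span>)
</span><span style=color:#d07711;>'(KEY) port(key) o(Key) (Key)ed'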
</span></code></pre></details><br><h2 id=re-vs-regex-module-flags>re vs regex module flags<a aria-label="Anchor link for: re-vs-regex-module-flags" class=zola-anchor href=#re-vs-regex-module-flags>🔗</a></h2><p>The third-party <code>regex</code> module is handy for advanced features like subexpression calls, skipping matches and so on. Can you use <code>re</code> module flag constants with the <code>regex</code> module?<details><summary><i style=color:gray>Click to view answer</i></summary> <p>No. When using the flags argument with the <code>regex</code> module, the constants must also come from the <code>regex</code> module, because the same flag can have a different numeric value in the two modules.</p> <pre class=language-python data-lang=python style=background-color:#f5f5f5;color:#1f1f1f;><code class=language-python data-lang=python><span style=color:#72ab00;>>>> +</span><span>re.A
</span><span style=color:#b3933a;>256
</span><span>
</span><span style=color:#72ab00;>>>> +</span><span>regex.A
</span><span style=color:#b3933a;>128
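</span><span>
</span><span style=color:#7f8989;># illustration added as a sketch: an inline flag is part of the pattern itself,
</span><span style=color:#7f8989;># so no module-specific constant is involved; (?a) restricts \w to ASCII
</span><span style=color:#72ab00;>>>> </span><span>re.</span><span style=color:#5597d6;>findall</span><span>(</span><span style=color:#668f14;>r</span><span style=color:#d07711;>'</span><span style=color:#72ab00;>(?a)</span><span style=color:#aeb52b;>\w</span><span style=color:#72ab00;>+</span><span style=color:#d07711;>'</span><span>, </span><span style=color:#d07711;>'naïve café'</span><span>)
</span><span>[</span><span style=color:#d07711;>'na'</span><span>, </span><span style=color:#d07711;>'ve'</span><span>, </span><span style=color:#d07711;>'caf'</span><span>]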
</span></code></pre> <p>Again, you can use inline flags to avoid such issues.</p></details><br><h2 id=understanding-python-re-gex-book>Understanding Python re(gex)? book<a aria-label="Anchor link for: understanding-python-re-gex-book" class=zola-anchor href=#understanding-python-re-gex-book>🔗</a></h2><p>Visit my GitHub repo <a href=https://github.com/learnbyexample/py_regular_expressions>Understanding Python re(gex)?</a> for details about the book I wrote on Python regular expressions. The ebook uses plenty of examples to explain the concepts from the very beginning and introduces more advanced concepts step by step. The book also covers the <a href=https://pypi.org/project/regex/>third-party module regex</a>.</div><div class=post-footer><div class=post-tags><a href=https://learnbyexample.github.io/tags/python/>#python</a><a href=https://learnbyexample.github.io/tags/regular-expressions/>#regular-expressions</a><a href=https://learnbyexample.github.io/tags/gotcha/>#gotcha</a><a href=https://learnbyexample.github.io/tags/quiz/>#quiz</a></div><hr color=#e6e6e6><div class=post-nav><p><a class=previous href=https://learnbyexample.github.io/python-regex-playground/>← Python Regular Expressions Playground</a><br><p><a class=next href=https://learnbyexample.github.io/2022-year-in-perspective/>2022: year in perspective →</a><br></div><hr color=#e6e6e6><p>📰 Use <a href=https://learnbyexample.github.io/atom.xml>this link</a> for the Atom feed. <br> ✅
Follow me on <a href=https://twitter.com/learn_byexample>Twitter</a>, <a href=https://github.com/learnbyexample>GitHub</a> and <a href=https://www.youtube.com/c/learnbyexample42>YouTube</a> for interesting tech nuggets. <br> 📧 Subscribe to <a href=https://learnbyexample.gumroad.com/l/learnbyexample-weekly>learnbyexample weekly</a> for programming resources, tips, tools, free ebooks and more (free newsletter, delivered every Friday).<hr color=#e6e6e6></div></article></div></main></div><script src=https://learnbyexample.github.io/even.js></script>