Learn Python Series (#24) - Handling Regular Expressions Part 2
Repository
https://github.com/python/cpython
What Will I Learn?
- In this part 2 of the regex subseries within the Learn Python Series you will learn how to start constructing more "flexible" regex patterns;
- you will learn about Character Sets ("Character Classes") in order to match patterns containing multiple characters;
- about Ranges: single ranges, multiple ranges, and alternative ranges;
- about how to exclude (a range of) characters from a Character Set;
- about Regex Control Characters (also called "Meta Characters"),
- how to escape them via \;
- and about various forms of Pattern Repetition.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Intermediate
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
Proof of Work Done
Supplemental source code, including tutorial itself (iPython):
https://github.com/realScipio/learn-python-series/blob/master/regex-02.ipynb
Learn Python Series (#24) - Handling Regular Expressions Part 2
In Learn Python Series (#23) - Handling Regular Expressions Part 1, which was a kind of intro to the world of regular expressions, you learned (via fixed string patterns to begin with) how to use common functions and methods provided by the re module.
We talked about using match(), search(), findall(), finditer(), split(), sub() and compile(); all utilizing regular expressions for pattern matching. However, in Part 1, we haven't talked about how the regular expression "language" itself works.
So let's do that right now!
import re
Using Character Sets (or: Character Classes) via [ ]
Within regular expressions an easy-to-use yet powerful mechanism is the Character Set. Place square brackets [ ] around individually listed characters, any one of which may match at that position.
E.g. the regex sc[aio]pio, in which [aio] is the Character Set, matches scapio, scipio and scopio. In this example [aio] counts as 1 character, which can be either an a, i or o, after sc was found and right before pio.
Consider also the following example, in which the regex pyth[ae]n matches both pythan and pythen, and re.sub() then replaces each typo with the correct spelling.
pattern = "pyth[ae]n"
replace = "Python"
string = """Wow what a great post Dear about pythen!
Your post gives us a new lesson! I like it Dear! Beautiful!
We can learn a lot of things about pythan!"""
result = re.sub(pattern, replace, string)
print(result)
Wow what a great post Dear about Python!
Your post gives us a new lesson! I like it Dear! Beautiful!
We can learn a lot of things about Python!
You're not limited to using only one Character Set inside a regex; you can use multiple, as you please!
For example:
pattern = "[Pp][iy]th[aeiou]n"
string = "Learning Python is just like learning pithen, Pythan or pythun!"
match = re.findall(pattern, string)
print(match)
['Python', 'pithen', 'Pythan', 'pythun']
Using Ranges within Character Sets via -
Single ranges
You can define a range of characters within a Character Set by putting - in between characters. For example [a-z] will match any one lower case letter in (abcdefghijklmnopqrstuvwxyz); just using [a-e] will match one lower case letter in abcde. To match any one digit in 0123456789 use [0-9], and if you would like to match all triple-digit numbers from 000 to 299 then use [0-2][0-9][0-9].
pattern = "[0-9][0-9][0-9]"
string = "Let's add the numbers 123 and 45 also add 678 to see if we can match the triple digits!"
match = re.findall(pattern, string)
print(match)
['123', '678']
Nota bene: as you can see from the output of the last code example, since we were looking for 3-digit numbers only, the number 45 in the input string was not listed in the returned list of matches.
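The [0-2][0-9][0-9] pattern mentioned earlier can be tried the same way. This is a minimal sketch with a made-up input string. Note that the pattern looks for three consecutive digits anywhere in the input, so it also matches the first three digits of a longer number:

```python
import re

# Hypothetical input string, chosen to show which numbers fall within 000-299
pattern = "[0-2][0-9][0-9]"
string = "Room 007, bus 299, flight 300 and order 1234"
match = re.findall(pattern, string)
print(match)
# ['007', '299', '123']
```

The 300 is skipped because its first digit is not in [0-2], while 123 is picked out of 1234 because nothing in the pattern says where a number has to end.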
Multiple ranges within one Character Set
It is also possible to use multiple ranges within one character set. For example, if you want to match the single digits 1, 2, 3 and 7, 8, 9 (but not 0, 4, 5 or 6), you can use [1-37-9] (which looks a bit strange, because the 3 and 7 are placed next to each other).
pattern = "[1-37-9]"
string = "Let's look which numbers will match within: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9"
match = re.findall(pattern, string)
print(match)
['1', '2', '3', '7', '8', '9']
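Ranges of letters and digits can be mixed in one set in the same way. As a small illustration (the input string is made up), the set below combines three ranges to match a single hexadecimal digit:

```python
import re

# One set holding three ranges: digits, lower case a-f and upper case A-F
pattern = "[0-9a-fA-F]"
string = "7, g, B, z, e, 0"
match = re.findall(pattern, string)
print(match)
# ['7', 'B', 'e', '0']
```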
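A caveat on | (OR, alternative) inside a Character Set
You may come across a | placed in between two ranges within a character set, written that way for readability. Be careful: inside [ ], the | is not an OR operator but a literal pipe character, so [1-3|7-9] matches 1 to 3, 7 to 9 and also | itself. The following code example produces the same result as the last one only because the input string contains no | character: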
pattern = "[1-3|7-9]"
string = "Let's look which numbers will match within: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9"
match = re.findall(pattern, string)
print(match)
['1', '2', '3', '7', '8', '9']
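One thing worth checking about | inside a character set: the re module treats it as a literal pipe character there, not as "or". The sketch below uses a hypothetical input string that contains pipes to make the difference visible, and then shows real alternation, which is written outside the brackets:

```python
import re

# Hypothetical input that contains pipe characters
string = "1 | 5 | 9"

# Inside [ ] the | is just one more character that may match
in_set = re.findall("[1-3|7-9]", string)
print(in_set)
# ['1', '|', '|', '9']

# Outside [ ] the | means "either the left or the right pattern"
alternation = re.findall("[1-3]|[7-9]", string)
print(alternation)
# ['1', '9']
```

For plain digits, [1-37-9] stays the shortest safe form.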
Using ^ inside a Character Set for character exclusion
The ^ ("caret") character, when placed as the first character inside the set, makes the set match any one character that is not listed after the ^.
pattern = "[^aeiou]"
string = "Scipio"
match = re.findall(pattern, string)
print(match)
['S', 'c', 'p']
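Exclusion also works with ranges, and the position of the ^ matters. A short sketch with a made-up input string, assuming nothing beyond the re module:

```python
import re

string = "a^b 42"

# ^ first in the set: negation, here "any character that is not a digit"
not_digits = re.findall("[^0-9]", string)
print(not_digits)
# ['a', '^', 'b', ' ']

# ^ not first in the set: just a literal caret next to the digits
with_caret = re.findall("[0-9^]", string)
print(with_caret)
# ['^', '4', '2']
```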
Regex Control Characters (or: Meta Characters) & Escaping via \
What we've just learned regarding Character Sets via [ ] is an example of using Control Characters in regular expressions. In regexes, all characters match themselves, except for Control Characters, which have a special meaning.
These Control Characters are: + ? . * ^ $ ( ) [ ] { } | \
If you however want to match a Control Character itself, as something you're literally looking for within an input string, then escape it by adding a backslash \ directly in front of it.
For example, if you want to find the position index number of the dollar sign $ in the following input string, then do as follows:
pattern = r"\$"  # raw string, so Python hands the backslash to re unchanged
string = "The $ sign is used to represent a Dollar"
match = re.search(pattern, string)
print(match, match.start())
<_sre.SRE_Match object; span=(4, 5), match='$'> 4
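When the literal text contains several Control Characters, adding the backslashes by hand gets error-prone. The re module offers re.escape() for this. The sketch below uses a made-up input string and shows the difference between the escaped and the unescaped pattern:

```python
import re

# Hypothetical input; the text we look for contains $ and . which are Control Characters
string = "Total: $9.99 (approx.)"
literal = "$9.99"

# Unescaped, $ means "end of string", so nothing can match
unescaped = re.search(literal, string)
print(unescaped)
# None

# re.escape() puts the backslashes in for us
escaped = re.search(re.escape(literal), string)
print(escaped.start(), escaped.group())
# 7 $9.99
```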
We will continue now with explaining some of the mentioned Control Characters, including examples of course.
Pattern Repetition
In regular expressions you can express repetition in a pattern in various ways, all done by placing Meta Characters directly after the (sub-)pattern:
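The ? Control Character: 0 or 1 times
Place the ? character directly after a (sub-)pattern that should be matched 0 or 1 times. You could also say that the (sub-)pattern is then "optional": it may be there, or not. In the example below the ? applies only to the b directly in front of it.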
pattern = "ab?"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['ab', 'a', 'ab', 'a']
Explanation: the findall() function returned 4 matches (evaluating the pattern / input string left to right of course).
- the first match ab covers the first 2 characters: the first a was already a match, but since there's also a b after it, the substring ab was returned.
- the second match is an a because it's followed by a c (not the optional b).
- then once again an ab was returned, regardless of how many b's (3 in this case) were found after the a.
- and finally another a is found at the end, as the last character.
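A typical use of ? is to accept two spellings of a word. This is a small sketch with a made-up input string:

```python
import re

# The u is optional, so both spellings match; two u's in a row do not
pattern = "colou?r"
string = "color, colour and colouur"
match = re.findall(pattern, string)
print(match)
# ['color', 'colour']
```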
The * Control Character: 0 or more times
Use the * character when you want to match the (sub-)pattern 0 or more times. "Zero or more" does not mean "I don't care how many times" but "if it's not there match it anyway, but if it is, then keep matching in case you find more".
pattern = "ab*"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['ab', 'a', 'abbb', 'a']
Explanation: the findall() function again returned 4 matches, but not the same ones!
- the first match is the same: an a was found immediately followed by one b, so: ab.
- then another a was found, followed by a c (which is not in the pattern), hence the a is returned.
- then once again an a was found, but this time all three b's after it get matched as well, not just the first one!
- finally the last a is returned.
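The same behaviour shows up when the repeated character sits in the middle of a pattern. In this sketch (input string made up for illustration) the o may appear any number of times, including not at all:

```python
import re

# o* accepts zero, one, two, three ... o's between the two g's
pattern = "go*gle"
string = "ggle gogle google gooogle"
match = re.findall(pattern, string)
print(match)
# ['ggle', 'gogle', 'google', 'gooogle']
```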
The + Control Character: 1 or more times
The + character requires the (sub-)pattern to be present at least once: one or more times, but not zero.
pattern = "ab+"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['ab', 'abbb']
Explanation: the findall() function now only returned 2 matches:
- the first match, at the beginning, is the same: ab.
- but the following substring ac doesn't have the b so it's unmatched. Only then (at index 4) an a is found that's followed by a b, three in this case.
- And the last character a is also not matched, because it's not followed by a b.
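Combined with a Character Set, + is a convenient way to match numbers of any length. The sketch below reuses the idea of the earlier triple-digit example, with a shortened input string:

```python
import re

# One or more digits in a row, so 45 is found as well this time
pattern = "[0-9]+"
string = "Let's add the numbers 123 and 45 also add 678"
match = re.findall(pattern, string)
print(match)
# ['123', '45', '678']
```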
Using { } curly braces for {n}, {n,m} or {n,} repetition
If you want to specify how many times the pattern repetition should occur do so via { }.
- Use {n} if you want the repetition to occur n times, e.g. 3
pattern = "ab{3}"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['abbb']
Explanation: only 1 match was returned, an a followed by exactly 3 times a b.
- Use {n,m} if you want the repetition to occur between n and m times, e.g. {1,3} means a repetition of 1, 2 or 3 times.
pattern = "ab{1,3}"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['ab', 'abbb']
Explanation: the occurrence of the a in ac was disregarded now, and so is the final a character.
- Use {n,} if you want the repetition to occur n or more times, e.g. {2,} means 2 or more times.
pattern = "ab{2,}"
string = "abacabbba"
match = re.findall(pattern, string)
print(match)
['abbb']
Explanation: only abbb was returned, as being the only instance that has an a followed by at least 2 b's.
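Curly braces can follow a Character Set just like they follow a single character. As a rough sketch (the date format and the input string are assumptions chosen for illustration), this pattern looks for four digits, a dash, two digits, a dash and two digits:

```python
import re

# {4} and {2} fix the exact number of digits in each part
pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
string = "Released on 2018-05-14, updated 2018-6-2 and 18-05-2018"
match = re.findall(pattern, string)
print(match)
# ['2018-05-14']
```

The pattern only checks the shape of the text, not whether the digits form a valid calendar date.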
What did we learn, hopefully?
This Part 2 episode of my regex subseries expanded on what was explained in Part 1 (handling re module functions such as match(), search(), findall(), finditer(), split(), sub() and compile()), where only fixed, literal strings were used, by forming more "flexible" patterns to match input strings with.
The regex language can be pretty complex, which is why I'm deliberately "taking it slow". We talked about Character Sets surrounded by [ ], with which you can match patterns containing multiple characters. And we talked about Ranges, single ranges, multiple ranges, and alternative ranges. Also we learned how to exclude characters from a Character Set.
We then learned about Regex Control Characters (also called "Meta Characters") and how to escape them via \. And finally, we talked about various forms of Pattern Repetition.
See you in the next episode!