Learn Python Series (#24) - Handling Regular Expressions Part 2

Repository

https://github.com/python/cpython

What Will I Learn?

  • In this Part 2 of the regex subseries within the Learn Python Series, you will learn how to start constructing more "flexible" regex patterns;
  • about Character Sets ("Character Classes"), which let one position in a pattern match any of several characters;
  • about Ranges: single ranges, multiple ranges, and alternative ranges;
  • about how to exclude (a range of) characters from a Character Set;
  • about Regex Control Characters (also called "Meta Characters") and how to escape them via \;
  • and about various forms of Pattern Repetition.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Python Series):

Proof of Work Done

https://github.com/realScipio

Supplemental source code, including tutorial itself (iPython):

https://github.com/realScipio/learn-python-series/blob/master/regex-02.ipynb

Learn Python Series (#24) - Handling Regular Expressions Part 2

In Learn Python Series (#23) - Handling Regular Expressions Part 1, which was an introduction to the world of regular expressions, you learned (using fixed string patterns to begin with) how to use common functions and methods provided by the re module.

We talked about using match(), search(), findall(), finditer(), split(), sub() and compile(), all of which use regular expressions for pattern matching. However, in Part 1 we didn't talk about how the regular expression "language" itself works.

So let's do that right now!

import re

Using Character Sets (or: Character Classes) via [ ]

Within regular expressions, an easy-to-use yet powerful mechanism is the Character Set. Place square brackets [ ] around a list of individual characters, any one of which should match at that position.

E.g. the regex sc[aio]pio, in which [aio] is the Character Set, should match scapio, scipio and scopio. In this example [aio] counts as 1 character, which could either be an a, i or o, after sc was found and right before pio.

Consider also the following example, in which the regex pyth[ae]n matches both pythan and pythen, so that we can replace both typos with the correct spelling.

pattern = "pyth[ae]n"
replace = "Python"
string = """Wow what a great post Dear about pythen!
Your post gives us a new lesson! I like it Dear! Beautiful!
We can learn a lot of things about pythan!"""

result = re.sub(pattern, replace, string)
print(result)
Wow what a great post Dear about Python!
Your post gives us a new lesson! I like it Dear! Beautiful!
We can learn a lot of things about Python!

You're not limited to using only one Character Set inside a regex; you can use multiple, as you please!

For example:

pattern = "[Pp][iy]th[aeiou]n"
string = "Learning Python is just like learning pithen, Pythan or pythun!"

match = re.findall(pattern, string)
print(match)
['Python', 'pithen', 'Pythan', 'pythun']
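To check the claim that each Character Set stands for exactly one character, a small sketch using re.fullmatch() may help; it only succeeds when the whole string fits the pattern. The candidate words below are made up for illustration.

```python
import re

pattern = "sc[aio]pio"

# Hypothetical test words: the last two should NOT match, because
# "scaipio" has two letters where the set allows one, and "e" is not in the set.
candidates = ["scapio", "scipio", "scopio", "scaipio", "scepio"]

# fullmatch() returns a match object only if the entire string fits the pattern
results = {word: re.fullmatch(pattern, word) is not None for word in candidates}
print(results)
```

The first three words come back as True and the last two as False, which shows that [aio] consumes one character and nothing more.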

Using Ranges within Character Sets via -

Single ranges

You can define a range of characters within a Character Set by putting - in between two characters. For example, [a-z] will match any one lower case letter in abcdefghijklmnopqrstuvwxyz, while [a-e] will match one lower case letter in abcde. To match any one digit in 0123456789 use [0-9], and if you'd like to match all three-digit numbers from 000 to 299, use [0-2][0-9][0-9].

pattern = "[0-9][0-9][0-9]"
string = "Let's add the numbers 123 and 45 also add 678 to see if we can match the the triple digits!"

match = re.findall(pattern, string)
print(match)
['123', '678']

Nota bene: as you can see from the visual output in the last code example, since we were looking for 3-digit numbers only, the number 45 in the input string was not listed in the returned list of matches.
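As a rough illustration of the 000 to 299 pattern mentioned above, and of one caveat worth knowing, here is a short sketch. The input strings are invented for this example. Note that three consecutive Character Sets do not care what surrounds them, so a longer run of digits is simply cut into 3-digit pieces.

```python
import re

# Only 000-299: the first digit is restricted to 0, 1 or 2
in_range = re.findall("[0-2][0-9][0-9]", "007 150 299 300")
print(in_range)

# Caveat: a longer run of digits is consumed in chunks of three,
# and the leftover single digit is dropped
chunks = re.findall("[0-9][0-9][0-9]", "1234567")
print(chunks)
```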

Multiple ranges within one Character Set

It is also possible to use multiple ranges within one character set. For example, if you want to match the single-digit numbers 1, 2, 3 and 7, 8, 9 (but not 0, 4, 5 or 6), you can use [1-37-9] (which looks a bit strange, because the 3 and 7 are placed right next to each other).

pattern = "[1-37-9]"
string = "Let's look which numbers will match within: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9"

match = re.findall(pattern, string)
print(match)
['1', '2', '3', '7', '8', '9']
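Ranges of letters and digits can be mixed in the same set as well. As a small sketch (the input string is made up), the set below combines three ranges to describe a single hexadecimal digit:

```python
import re

# Three ranges in one set: 0-9, a-f and A-F together form one hexadecimal digit
pattern = "[0-9a-fA-F]"
string = "7G-c9-Zz-E"

hex_digits = re.findall(pattern, string)
print(hex_digits)
```

G, Z, z and the hyphens fall outside all three ranges, so only 7, c, 9 and E are returned.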

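Using | (OR, alternation) between Character Sets

Be careful with | here: inside a Character Set it is not an OR operator but a literal | character, so [1-3|7-9] matches 1 to 3, 7 to 9 and also the pipe symbol itself. It only appears to behave like [1-37-9] as long as the input string contains no |. If you find [1-37-9] hard to read, put the alternation outside the brackets, between two separate Character Sets. The following code example produces the same result as the last one:

pattern = "[1-3]|[7-9]"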
string = "Let's look which numbers will match within: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9"

match = re.findall(pattern, string)
print(match)
['1', '2', '3', '7', '8', '9']
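One detail to keep in mind is that a | written inside the square brackets is treated as an ordinary character belonging to the set, not as an OR. A minimal sketch, using an invented input string that actually contains pipe symbols, makes the difference visible:

```python
import re

string = "2|5|8"

# Inside the brackets, | is just one more character in the set
inside = re.findall("[1-3|7-9]", string)
print(inside)

# Outside the brackets, | separates two alternative Character Sets
outside = re.findall("[1-3]|[7-9]", string)
print(outside)
```

The first call also returns the two pipe symbols, while the second returns only the digits 2 and 8.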

Using ^ inside a Character Set for character exclusion

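The ^ ("caret") character, when placed as the first character inside a Character Set, negates the set: it then matches any one character that is not listed after the ^.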

pattern = "[^aeiou]"
string = "Scipio"

match = re.findall(pattern, string)
print(match)
['S', 'c', 'p']
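Exclusion works with ranges too, and the position of the ^ matters. The following sketch (with made-up input strings) shows a negated range next to a set where ^ is not the first character and is therefore matched literally:

```python
import re

# ^ as the FIRST character negates the whole set, here the range a-c
negated = re.findall("[^a-c]", "abcdef")
print(negated)

# ^ anywhere else in the set is an ordinary character
literal = re.findall("[a^]", "a^b")
print(literal)
```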

Regex Control Characters (or: Meta Characters) & Escaping via \

What we've just learned regarding Character Sets via [ ] is an example of using Control Characters in regular expressions. In regexes, most characters simply match themselves; the exception is the Control Characters, which have a special meaning.

These Control Characters are: + ? . * ^ $ ( ) [ ] { } | \

If, however, you want to match a Control Character literally, because it is something you're actually looking for within an input string, then escape it by adding a backslash \ directly in front of it.

For example, if you want to find the position index number of the dollar sign $ in the following input string, then do as follows:

pattern = r"\$"  # raw string: Python hands the backslash to the regex engine unchanged
string = "The $ sign is used to represent a Dollar"

match = re.search(pattern, string)
print(match, match.start())
<_sre.SRE_Match object; span=(4, 5), match='$'> 4
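To see why escaping matters, compare an unescaped . (which matches any character) with an escaped one. The sketch below also uses re.escape(), a helper in the re module that adds the backslashes for you; the sample text is invented for illustration.

```python
import re

text = "Was 9x99, now $9.99"

# Unescaped, the . matches any character, so 9x99 is found as well
loose = re.findall("9.99", text)

# Escaped by hand: only a literal dot is accepted
strict = re.findall(r"9\.99", text)

# re.escape() escapes the Control Characters in a literal string for you
price = re.search(re.escape("$9.99"), text)

print(loose, strict, price.span())
```

Using re.escape() is handy when the text to search for comes from elsewhere (user input, for example) and may contain Control Characters you did not anticipate.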

We will continue now with explaining some of the mentioned Control Characters, including examples of course.

Pattern Repetition

In regular expressions you can express repetition in a pattern in various ways, all done by placing Meta Characters directly after the (sub-)pattern:

The ? Control Character: 0 or 1 times

Place the ? character after a (sub-)pattern that should match 0 or 1 times. You could also say that the (sub-)pattern is then "optional": it may be there, or not.

pattern = "ab?"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['ab', 'a', 'ab', 'a']

Explanation: the findall() function returned 4 matches (evaluating the pattern / input string left to right of course).

  1. the first match ab means of course the first 2 characters: the first a was already a match, but since there's also a b after it, the substring ab was returned.

  2. the second match is an a because it's followed by a c (not the optional b).

  3. then once again ab was returned, regardless of how many b's (3 in this case) follow the a: the ? consumes at most one of them.

  4. and finally another a is found as the last character of the string.
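A more everyday use of the optional ? is accepting two spellings of the same word. As a small sketch with an invented input string:

```python
import re

# The u is optional, so both the US and the UK spelling are accepted
pattern = "colou?r"
string = "color, colour, colouur"

optional = re.findall(pattern, string)
print(optional)
```

The misspelled colouur is not matched, because ? allows at most one u.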

The * Control Character: 0 or more times

Use the * character when you want to match the pattern 0 or more times. "Zero or more" does not mean "I don't care how many times" but "if it's not there match it anyway, but if it is, then keep matching in case you find more".

pattern = "ab*"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['ab', 'a', 'abbb', 'a']

Explanation: the findall() function again returned 4 matches, but not the same ones!

  1. the first match is the same: an a was found immediately followed by one b, so: ab.

  2. then another a was found, followed by a c (which is not in the pattern), hence the a is returned.

  3. then once again an a was found, but this time all three b's after it get matched as well, not just the first one!

  4. finally the last a is returned
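The three behaviours described above (take as many as possible, accept none at all, and even match nothing) can be checked separately. This is only a sketch with made-up inputs:

```python
import re

# * takes every b it can get before stopping at the c
greedy = re.match("ab*", "abbbc").group()
print(greedy)

# Zero b's are fine too: a lone "a" matches the whole pattern
bare = re.fullmatch("ab*", "a") is not None
print(bare)

# A pattern consisting only of b* succeeds on an empty string
# when the input does not start with a b
empty = re.match("b*", "abb").group()
print(repr(empty))
```

The last case is worth remembering: a pattern whose every part is allowed to occur zero times will always match, even if it matches nothing.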

The + Control Character: 1 or more times

The + character requires the (sub-)pattern to be present at least once: it matches one or more times, but never zero.

pattern = "ab+"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['ab', 'abbb']

Explanation: the findall() function now only returned 2 matches:

  1. the first match, at the beginning, is the same: ab.

  2. but the following substring ac doesn't have the b so it's unmatched. Only then (at index 4) an a is found that's followed by a b, three in this case. And the last character a is also not matched, because it's not followed by a b.
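Combining + with a Character Set is a common way to grab whole numbers of any length. A minimal sketch, with an invented sentence as input:

```python
import re

# One or more digits in a row: numbers of any length
numbers = re.findall("[0-9]+", "Order 66 shipped 1200 items in 3 boxes")
print(numbers)

# Unlike *, the + does not accept zero occurrences
plus_ok = re.fullmatch("ab+", "a") is not None
print(plus_ok)
```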

Using { } curly braces for {n}, {n,m} or {n,} repetition

If you want to specify how many times the pattern repetition should occur do so via { }.

  • Use {n} if you want the repetition to occur n times, e.g. 3
pattern = "ab{3}"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['abbb']

Explanation: only 1 match was returned, an a followed by exactly 3 times a b.

  • Use {n,m} if you want the repetition to occur between n and m times, e.g. {1,3} means a repetition of 1, 2 or 3 times.
pattern = "ab{1,3}"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['ab', 'abbb']

Explanation: the occurrence of the a in ac was disregarded now, and so was the final a character.

  • Use {n,} if you want the repetition to occur n or more times, e.g. {2,} means 2 or more times.
pattern = "ab{2,}"
string = "abacabbba"

match = re.findall(pattern, string)
print(match)
['abbb']

Explanation: only abbb was returned, as being the only instance that has an a followed by at least 2 b's.
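Curly braces work after a Character Set just as they do after a single character, which keeps patterns short. As a sketch (the date format and the input string are assumptions made for this example), the pattern below looks for dates written as four digits, two digits and two digits, separated by hyphens:

```python
import re

# {4} and {2} repeat the preceding Character Set an exact number of times;
# the hyphens are outside any brackets, so they match themselves
pattern = "[0-9]{4}-[0-9]{2}-[0-9]{2}"
string = "Released 2018-05-21, updated 2018-6-3"

dates = re.findall(pattern, string)
print(dates)
```

The second date is skipped because its month and day have only one digit each, while {2} demands exactly two.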

What did we learn, hopefully?

This Part 2 episode of my regex subseries expanded on what was explained in Part 1 (handling re module functions such as match(), search(), findall(), finditer(), split(), sub() and compile(), where only fixed, literal strings were used) by moving on to more "flexible" patterns to match input strings with.

The regex language can be pretty complex, which is why I'm deliberately "taking it slow". We talked about Character Sets surrounded by [ ], with which one position in a pattern can match any of several characters. And we talked about Ranges: single ranges, multiple ranges, and alternative ranges. We also learned how to exclude characters from a Character Set.

We then learned about Regex Control Characters (also called "Meta Characters") and how to escape them via \. And finally, we talked about various forms of Pattern Repetition.

See you in the next episode!

Thank you for your time!

Dear sir@scipio,
Thank you so much for posting so beautiful.
SIR was very good for your posting. Because you get many benefits with the right information, so thank you very much for posting so beautiful and helping others by providing such information in the future.

Very good tutorial and good example how to do a tutorial for utopian. The subject Regular Expressions in Phython is quite interesting. Thank you for your contribution.


what's your present chapter to teach us? every people benefit from you.
