Learn Python Series (#25) - Handling Regular Expressions Part 3

Repository

https://github.com/python/cpython

What Will I Learn?

  • In this part 3 of the regex subseries within the Learn Python Series you will learn how to use more Control Characters,
  • about the meaning and use of several common "Special" characters via Escape Codes,
  • about raw strings for readability with respect to Escape Codes,
  • and about Grouping mechanisms within regular expressions.
  • We will also start to combine several expression techniques to form more complex, yet more powerful, regexes.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Intermediate

Proof of Work Done

https://github.com/realScipio

Supplemental source code, including tutorial itself (iPython):

https://github.com/realScipio/learn-python-series/blob/master/regex-03.ipynb

Welcome to part 3 of the Regular Expression subseries of my Learn Python Series! In Part 1 we learned how to use a number of functions of the re module, using fixed strings for starters. In Part 2 we began exploring the regex language itself by constructing (relatively simple) expressions, discussing concepts such as Character Classes, ranges, Control Characters and repetition.

In this episode - Part 3 - we will further expand our regex knowledge.
Let's begin!

import re

More about Control Characters

The ^ (caret) character, for the beginning of a string or line

As we've seen, when the ^ character is used as the first character inside a character set [ ], it serves the purpose of excluding the characters placed inside the character set, or in other words, to match anything but those characters.

However, when ^ is used outside of a character set, it means something different: match the pattern if it occurs at the beginning of the input string (or, when the re.MULTILINE flag is passed, at the beginning of each line).
You could also state "^ means the start of the string, and the match we're looking for occurs right after it".

For example:

pattern = "dog"
string = "This is my dog"

match = re.search(pattern, string)
print(match)
<_sre.SRE_Match object; span=(11, 14), match='dog'>

In the code example above, the pattern dog is found inside the string.

Yet if we begin the pattern with the ^ character, and match it against the same input string, we get:

pattern = "^dog"
string = "This is my dog"

match = re.search(pattern, string)
print(match)
None

Explanation: the substring dog is not placed at the start of the input string, and therefore the pattern matches nothing, and the search() function returns None.
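The heading mentions "string or line", so here is a minimal sketch of the "line" part. By default ^ only anchors at the start of the whole string; with the re.MULTILINE flag it also anchors right after each newline. The two-line input string is made up for illustration.

```python
import re

# Illustrative two-line input (not from the original tutorial).
text = "This is my cat\ndog is on the second line"

# Default behaviour: ^ only matches at the very start of the string.
default_match = re.search(r"^dog", text)

# With re.MULTILINE, ^ also matches directly after each newline.
multiline_match = re.search(r"^dog", text, flags=re.MULTILINE)

print(default_match)           # None
print(multiline_match.span())  # (15, 18)
```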

The $ character, for the end of a string or line

The $ character does the opposite of ^: it matches a pattern located at the end of a string (or just before a newline at the very end of the string).
You could also state "$ means the end of the string, and the match we're looking for occurs right before it".

pattern = "dog$"
string = "This is my dog"

match = re.search(pattern, string)
print(match)
<_sre.SRE_Match object; span=(11, 14), match='dog'>

If we would, for example, add a . after the word dog in the input string, the pattern no longer matches, because the input string now ends with a . instead of the word dog.

pattern = "dog$"
string = "This is my dog."

match = re.search(pattern, string)
print(match)
None
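One subtlety that is easy to miss: $ also matches just before a newline that ends the string, which is common for lines read from a file. The sketch below, using an illustrative input, contrasts $ with \Z, which only matches at the very end of the string.

```python
import re

# Illustrative input: the same sentence, but ending in a newline.
text = "This is my dog\n"

# $ matches at the end of the string, and also just before a trailing newline.
dollar_match = re.search(r"dog$", text)

# \Z only matches at the very end of the string, so the newline blocks it.
strict_match = re.search(r"dog\Z", text)

print(dollar_match.span())  # (11, 14)
print(strict_match)         # None
```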

The . character, matching anything except a newline

Using the . character (a period sign) means: match any character except for a newline. A single . matches exactly one character.

pattern = "c.ke"
string = "Would you like to drink a coke next to eating your cake?"

matches = re.findall(pattern, string)
print(matches)
['coke', 'cake']
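To see the "except a newline" part in action, the following sketch uses a made-up input in which a newline sits where the . would have to match. Passing the re.DOTALL flag makes . match newlines as well.

```python
import re

# Illustrative input: "c", a newline, then "ke", followed by a normal "coke".
text = "c\nke and coke"

# By default, . refuses to match the newline.
default_matches = re.findall(r"c.ke", text)

# With re.DOTALL, . matches any character, including a newline.
dotall_matches = re.findall(r"c.ke", text, flags=re.DOTALL)

print(default_matches)  # ['coke']
print(dotall_matches)   # ['c\nke', 'coke']
```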

"Special" Character Classes using Escape Codes

There are some "special" characters that take on a different meaning when escaped with a backslash \ placed directly before them. I will briefly discuss them here.

Nota bene: If you want to match a backslash, you must escape it with another backslash. That can quickly lead to hard-to-read regex patterns. In order to avoid that, you can use raw strings: just prefix your regex pattern with the letter r.
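A small sketch of why raw strings help, using an illustrative Windows-style path as input. In a normal string literal, matching one literal backslash takes four backslashes in the source; in a raw string, two are enough.

```python
import re

# Illustrative path containing two literal backslashes.
path = r"C:\temp\notes.txt"

# Normal string: Python turns "\\\\" into two backslashes,
# which the regex engine then reads as one escaped backslash.
plain_pattern = "\\\\"

# Raw string: what you type is what the regex engine receives.
raw_pattern = r"\\"

print(plain_pattern == raw_pattern)         # True
print(len(re.findall(raw_pattern, path)))   # 2
```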

Using \d for matching a digit

The escape code \d matches a digit. For ASCII text it means the same as [0-9]; with Python 3 str patterns it also matches decimal digits from other scripts, unless the re.ASCII flag is used.

In the following example we're searching for the pattern \d+ which means "match all runs of one or more (+) digit characters (\d) that occur right after each other".

pattern = r"\d+"
string = "Let's see if we can match the numbers 1, 31785 and 45 from this input string."

matches = re.findall(pattern, string)
print(matches)
['1', '31785', '45']
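The difference between \d and [0-9] only shows up with non-ASCII digits. The sketch below uses an illustrative input containing the Arabic-Indic digits for 4 and 2 (written as \u escapes), and shows how the re.ASCII flag restricts \d to 0-9.

```python
import re

# Illustrative input: ASCII "42" plus the Arabic-Indic digits U+0664 and U+0662.
text = "42 and \u0664\u0662"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \d behaves exactly like [0-9].
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```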

Using \D for matching a non-digit

The escape code \D does the opposite of \d: it matches anything but digits ("non-digits").

pattern = r"\D+"
string = "Let's see if we can exclude the numbers 1, 31785 and 45 from this input string."

matches = re.findall(pattern, string)
print(matches)
["Let's see if we can exclude the numbers ", ', ', ' and ', ' from this input string.']

The regex pattern \D+ shown above matches the same thing as [^\d]+ does.
The latter means "there's a character set ([]) in which we're excluding (^) all digits (\d), occurring one or more times (+)."

pattern = r"[^\d]+"
string = "Let's see if we can exclude the numbers 1, 31785 and 45 from this input string."

matches = re.findall(pattern, string)
print(matches)
["Let's see if we can exclude the numbers ", ', ', ' and ', ' from this input string.']

Using \w for matching alphanumerics (and underscores _ )

In short, an alphanumeric is either an alphabetical or a numeric character; \w matches those, plus the underscore _. So if we for example inspect a "password-like" input string and try to match all alphanumerics directly following each other via the regular expression \w+, we get the following result:

pattern = r"\w+"
string = "ghj#$5&378$anjUjfe278"

matches = re.findall(pattern, string)
print(matches)
['ghj', '5', '378', 'anjUjfe278']
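The example above contains no underscores, so here is a short sketch with a made-up input that does. It shows that \w+ keeps underscores inside a match, while a hyphen splits it.

```python
import re

# Illustrative input with underscores and a hyphen.
text = "user_name = first-name + last_name2"

# Underscores count as "word" characters; hyphens, spaces, = and + do not.
words = re.findall(r"\w+", text)

print(words)  # ['user_name', 'first', 'name', 'last_name2']
```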

Using \W for matching non-alphanumerics (no underscores _ )

The escape code \W does the opposite of \w: it matches everything but alphanumerics and underscores.

pattern = r"\W+"
string = "ghj#$5&378$anjUjfe278"

matches = re.findall(pattern, string)
print(matches)
['#$', '&', '$']

Here too, the pattern \W+ returns the same result as [^\w]+ (which means: "match one or more (+) characters that are not alphanumerics or underscores ([^\w])").

pattern = r"[^\w]+"
string = "ghj#$5&378$anjUjfe278"

matches = re.findall(pattern, string)
print(matches)
['#$', '&', '$']

Using \s for matching whitespaces (= spaces, tabs, newlines, returns)

A "whitespace" is something "blank", which could be a space or a tab, a newline or return.

A possible use case, out of many, is matching "one or more" whitespaces in an input string and having all of those replaced by one space, like so:

pattern = r"\s+"
replacement = " "
string = """This     is a line that is    quite    

messy 

and   
    we     want    to  
  fix   its   formatting.
"""
new_string = re.sub(pattern, replacement, string)
print(new_string)
This is a line that is quite messy and we want to fix its formatting. 
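Note that the result above still ends with a single space, because the trailing newline is replaced as well. A minimal sketch, assuming you also want leading and trailing whitespace gone, chains str.strip() after re.sub(); the input string is illustrative.

```python
import re

# Illustrative messy input with leading and trailing whitespace.
messy = "  This   is \n\n messy \t text.  \n"

# Collapse every run of whitespace into one space, then trim both ends.
clean = re.sub(r"\s+", " ", messy).strip()

print(clean)  # This is messy text.
```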

Using \S to match all non-whitespaces

\S matches the opposite of \s. A possible use case is returning all individual words in an input string and, for example, counting them. Like so:

pattern = r"\S+"
string = """This     is another   mess
  but 
now we     want    to  
find

all individual    words
"""
matches = re.findall(pattern, string)
print(matches)
print('Words found:', len(matches))
['This', 'is', 'another', 'mess', 'but', 'now', 'we', 'want', 'to', 'find', 'all', 'individual', 'words']
Words found: 13
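For this particular task, Python's built-in str.split() without arguments also splits on runs of whitespace, so it should return the same list as \S+. A quick check with an illustrative input:

```python
import re

# Illustrative input with irregular spacing and line breaks.
text = "This     is another   mess\n  but \nnow we   count"

regex_words = re.findall(r"\S+", text)
split_words = text.split()

print(regex_words == split_words)  # True
print(len(regex_words))            # 8
```

The regex version becomes the better choice once the definition of a "word" needs to be narrower than "anything that is not whitespace".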

Using \b for matching an "empty string" at the start or end of a word (= word boundary)

The \b character can be used to match an "empty string" (or a position in between two characters) at either the beginning or ending of a word, which is therefore also named a "word boundary".

Regard the following example (using a "raw string" as mentioned earlier) in which we're trying to match all i characters in the input string that are the first character of a word: the word boundary \b must sit directly before the i.

pattern = r"\bi"
string = "This is a test string"

for match in re.finditer(pattern, string):
    print(match.start())
5

And as a result, only the i in the word is was returned, at index 5.

Another code example for using \b is the following snippet, with which we're saying "match all non-whitespaces between the boundaries of a word" (ergo: the words themselves) and have them returned as a list:

pattern = r"\b\S+\b"
string = "This is a test string"

matches = re.findall(pattern, string)
print(matches)
['This', 'is', 'a', 'test', 'string']
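A common practical use of \b is matching a whole word only. The sketch below reuses the input string from above and compares the plain pattern is with the bounded pattern \bis\b.

```python
import re

text = "This is a test string"

# Without boundaries, "is" is also found inside "This".
loose = [m.start() for m in re.finditer(r"is", text)]

# With \b on both sides, only the standalone word "is" matches.
strict = [m.start() for m in re.finditer(r"\bis\b", text)]

print(loose)   # [2, 5]
print(strict)  # [5]
```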

Using \B for empty strings NOT at the start or end of a word.

After having explained the usage of \b, the opposite \B is pretty self-explanatory.

pattern = r"\Bi"
string = "This is a test string"

for match in re.finditer(pattern, string):
    print(match.start())
2
18

The above regex now matches all i characters that are NOT at the beginning of a word: the i in is is not matched this time, but the i in This and the i in string are.
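The side on which \b or \B is placed matters. The following sketch uses a made-up input string and lists the index of every i for four variations, so the effect of each placement can be compared.

```python
import re

# Illustrative input with i at the start, end and middle of words.
text = "hi is in this taxi"

def positions(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(positions(r"\bi"))    # i at the start of a word: [3, 6]
print(positions(r"i\b"))    # i at the end of a word: [1, 17]
print(positions(r"\Bi"))    # i NOT at the start of a word: [1, 11, 17]
print(positions(r"\Bi\B"))  # i strictly inside a word: [11]
```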

Grouping regexes with ( )

Using ()

If you enclose parts of a pattern within parentheses (), you can isolate parts of the match(es), and have those matching parts returned separately. Using one or more groups doesn't actually change what your regular expression is matching though.

In the following example (I thought it might be okay now to use a "slightly more complicated-looking regex") I've constructed a regular expression including two groups both enclosed in ().

The first group is: ([a-zA-Z0-9_.+-]+)
And the second one: ([a-zA-Z0-9_.+-]+) ... don't try to spot the differences, they're the same! ;-)

Of course this means: "Let's look for a character set ([]) containing lower case letters (a-z), upper case letters (A-Z), digits (0-9), or one of the characters _ . + -, occurring at least once (+), and do that both before and after one @ character".

The first group contains the user name, the second the domain name.

pattern = r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.+-]+)"
string = "Let's see if the email address 'somebody@yahoo.com' can be matched!"

match = re.search(pattern, string)
print(match)
<_sre.SRE_Match object; span=(32, 50), match='somebody@yahoo.com'>

PS: despite the fact the above expression at first sight looks "complex", it's far from reliable for fool-proof email address detection. For example the substring a@b, which is not a valid email address, still gets matched.
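Groups also change what findall() returns: when a pattern contains more than one group, each result is a tuple holding the groups instead of the full match. A minimal sketch with the same pattern and two made-up addresses:

```python
import re

pattern = r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.+-]+)"

# Made-up addresses, for illustration only.
text = "Contact alice@example.com or bob.smith@example.org for details"

# With two groups in the pattern, findall() yields (user, domain) tuples.
pairs = re.findall(pattern, text)

for user, domain in pairs:
    print(user, "at", domain)
```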

Using the group() function

A Match object also has a group() function which returns the entire match as a string when called without arguments (or with 0).
You can also pass in a group number as its argument: the groups enclosed in () are numbered starting at 1, because group 0 is the entire match.

For example:

email_address = match.group()
email_user = match.group(1)
email_domain = match.group(2)

print(email_address)
print(email_user)
print(email_domain)
somebody@yahoo.com
somebody
yahoo.com
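Besides group(), a Match object offers a few related shortcuts. The sketch below is self-contained (it repeats the pattern and an equivalent input string) and shows group(0), groups(), and group() called with several group numbers at once.

```python
import re

pattern = r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.+-]+)"
text = "Let's see if the email address 'somebody@yahoo.com' can be matched!"

match = re.search(pattern, text)

# group(0) is the same as group(): the entire match.
whole = match.group(0)

# groups() returns all captured groups as one tuple.
all_groups = match.groups()

# Passing several numbers returns the selected groups as a tuple.
user, domain = match.group(1, 2)

print(whole)       # somebody@yahoo.com
print(all_groups)  # ('somebody', 'yahoo.com')
print(user, domain)
```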

What did we learn, hopefully?

This Part 3 episode of my regex subseries discussed how to use even more Control Characters, "Special" characters via Escape Codes, about raw strings for readability with respect to escape codes, and we talked about Grouping mechanisms.

Nota bene: As you might have noticed, slowly but surely I'm beginning to combine multiple regex components into forming more complex-looking expressions, that are far more powerful and interesting to use as well. In case you feel some of the expressions already became too complex to understand, or if I might have not explained them well enough, please feel free to ask me questions in the comment section! Never hesitate to do so: asking questions is smart!

See you in the next episode!

Thank you for your time!
