Regular Expressions Module¶

Special Sequences¶

We’ve already learned about character ranges for matching digits and alphanumeric characters. These are so common that there is a special shorthand we can use to represent these.

We can use \d to match a digit character:

>>> re.search(r'\d', greeting)
>>> re.search(r'\d', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>

So here’s another way to match a string containing one or more digit characters only:

>>> re.search(r'^\d+$', "$100")
>>> re.search(r'^\d+$', "100")
<_sre.SRE_Match object; span=(0, 3), match='100'>

The \d we just used is an example of a special sequence.

Special sequences look like escape characters in Python strings. They consist of a backslash and another character that denotes what the sequence represents.

Let’s look at a couple of others.

A capital D sequence matches non-digits:

>>> re.search(r'\D', '100')
>>> re.search(r'\D', '$100')
<_sre.SRE_Match object; span=(0, 1), match='$'>

So \d and \D are essentially shorthands for the digit ranges we’ve already seen so far:

>>> re.search(r'\D', '100')
>>> re.search(r'[^0-9]', '100')
>>> re.search(r'\d', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search(r'[0-9]', '100')
<_sre.SRE_Match object; span=(0, 1), match='1'>
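One caveat, sketched below with illustrative sample strings: for Python 3 string patterns, \d is slightly broader than [0-9] because it also matches decimal digits from other scripts. The re.ASCII flag, which this section has not introduced, narrows \d back to [0-9].

```python
import re

ascii_digit = '7'
arabic_indic_digit = '\u0663'  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# Both patterns agree on ASCII digits.
both_match_ascii = (
    bool(re.search(r'\d', ascii_digit))
    and bool(re.search(r'[0-9]', ascii_digit))
)

# By default only \d matches the non-ASCII digit.
d_matches_unicode = bool(re.search(r'\d', arabic_indic_digit))
range_matches_unicode = bool(re.search(r'[0-9]', arabic_indic_digit))

# With re.ASCII, \d behaves like [0-9].
d_ascii_matches_unicode = bool(re.search(r'\d', arabic_indic_digit, re.ASCII))

print(both_match_ascii)         # True
print(d_matches_unicode)        # True
print(range_matches_unicode)    # False
print(d_ascii_matches_unicode)  # False
```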

Another common special sequence is \w:

>>> re.search(r'^\w*$', "$hell0")
>>> re.search(r'^\w*$', "hello")
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w*$', "hello_there")
<_sre.SRE_Match object; span=(0, 11), match='hello_there'>
>>> re.search(r'^\w*$', "hello there")
>>> re.search(r'^\w*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40")
<_sre.SRE_Match object; span=(0, 7), match='hello40'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40_")
<_sre.SRE_Match object; span=(0, 8), match='hello40_'>
>>> re.search(r'^[A-Za-z0-9_]*$', "hello40 ")

The \w sequence matches “word” characters: letters, digits, and underscores. For ASCII text it is equivalent to the character class [A-Za-z0-9_].
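As a quick check of that equivalence, the sketch below compares \w against the explicit character class for every ASCII character. It uses re.fullmatch, a standard library function not covered in this section, which only succeeds when the whole string matches.

```python
import re

# Every character in the ASCII range (code points 0 through 127).
ascii_chars = [chr(code) for code in range(128)]

# Characters accepted by the special sequence and by the explicit class.
matched_by_w = {ch for ch in ascii_chars if re.fullmatch(r'\w', ch)}
matched_by_class = {ch for ch in ascii_chars if re.fullmatch(r'[A-Za-z0-9_]', ch)}

print(matched_by_w == matched_by_class)  # True
print(len(matched_by_w))                 # 63 (26 + 26 letters, 10 digits, 1 underscore)
```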

There’s also \s which matches whitespace characters:

>>> re.search(r'^\s*$', " ")
<_sre.SRE_Match object; span=(0, 1), match=' '>
>>> re.search(r'^\s*$', " \n")
<_sre.SRE_Match object; span=(0, 2), match=' \n'>
>>> re.search(r'^\s*$', " _\n")

We could match two words like this:

>>> re.search(r'^\w+\s+\w+$', 'hithere')
>>> re.search(r'^\w+\s+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\s+\w+$', 'hi there, Trey')
>>> re.search(r'^\w+\s+\w+$', 'hi there Trey')

This \s sequence matches newline characters, tabs, spaces. But it also matches weird things like vertical tabs, form feeds, and carriage returns.
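To see which characters count as whitespace, here is a small sketch that tests a handful of candidates one at a time. The candidates dictionary is just an illustration, and re.fullmatch (from the standard library) requires the entire string to match.

```python
import re

# A few characters to test; the underscore is included as a non-whitespace control.
candidates = {
    'space': ' ',
    'tab': '\t',
    'newline': '\n',
    'carriage return': '\r',
    'vertical tab': '\v',
    'form feed': '\f',
    'underscore': '_',
}

is_whitespace = {
    name: bool(re.fullmatch(r'\s', ch))
    for name, ch in candidates.items()
}

for name, matched in is_whitespace.items():
    print(f'{name}: {matched}')
```

Everything except the underscore should report True.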

Just like the digit special sequence, the word and space sequences have opposites. Capital W matches non-word characters and capital S matches non-whitespace characters:

>>> re.search(r'^\w+\W+\w+$', 'hi there')
<_sre.SRE_Match object; span=(0, 8), match='hi there'>
>>> re.search(r'^\w+\W+\w+$', 'hi*there')
<_sre.SRE_Match object; span=(0, 8), match='hi*there'>
>>> re.search(r'^\S+\s+\S+$', '_ $#@!')
<_sre.SRE_Match object; span=(0, 6), match='_ $#@!'>

Word Boundaries¶

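The \b special sequence denotes a word boundary. It is an anchor, and just like the ^ and $ anchors we’ve already seen, it doesn’t consume a character. It matches the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string. Because an underscore is a word character, there is no boundary between hello and _there below.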

>>> re.search(r'\bhello\b', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'hello_there')
>>> re.search(r'\bhello\b', 'hello there')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'\bhello\b', 'oh hello there')
<_sre.SRE_Match object; span=(3, 8), match='hello'>
>>> re.search(r'\bhello\b', 'ohhello there')
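The sketch below runs one word-boundary pattern against several sample strings (chosen only for illustration) to show that the boundary depends on word versus non-word characters, not on whitespace alone.

```python
import re

samples = ['cat', 'concat', 'cat_food', 'cat.', 'a cat!']

# True when "cat" appears as a whole word in the sample.
has_whole_word = {s: bool(re.search(r'\bcat\b', s)) for s in samples}

for sample, matched in has_whole_word.items():
    print(f'{sample!r}: {matched}')
# 'cat': True
# 'concat': False    ("n" before "cat" is a word character)
# 'cat_food': False  ("_" after "cat" is a word character)
# 'cat.': True       ("." is a non-word character)
# 'a cat!': True
```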

Remember how we always put an r before our regular expression strings to make them raw strings?

If we forget to do that when matching a word boundary, bad things happen:

>>> re.search('\bhello\b', 'hello')
>>> '\b'
'\x08'
>>> '\\b'
'\\b'
>>> r'\b'
'\\b'

The \b escape character represents a backspace. This character can be used to remove characters from the terminal and redraw them.

To represent a word boundary in a regular expression, we put \b in a raw string. Otherwise Python treats it as a backspace, and we would have to write \\b to escape it.
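Here is a short sketch comparing the three spellings side by side. The variable names are only for illustration.

```python
import re

plain = '\bhello\b'        # two backspace characters around "hello"
escaped = '\\bhello\\b'    # backslashes doubled by hand
raw = r'\bhello\b'         # raw string: backslashes kept as written

print(len(plain), len(escaped), len(raw))  # 7 9 9
print(escaped == raw)                      # True

# The plain string looks for literal backspace characters, so it finds nothing.
plain_match = re.search(plain, 'hello')
raw_match = re.search(raw, 'hello')
print(plain_match)            # None
print(raw_match is not None)  # True
```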

Repeats¶

Let’s write a regular expression that validates United States ZIP codes (shortened format only):

>>> re.search(r'^\d\d\d\d\d$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d\d\d\d\d$', '123456')
>>> re.search(r'^\d\d\d\d\d$', '1234')
>>> re.search(r'^\d\d\d\d\d$', '10001')
<_sre.SRE_Match object; span=(0, 5), match='10001'>

This regular expression matches 5 consecutive digits. We have a shortcut for matching 1 or more consecutive characters:

>>> re.search(r'^\d+$', '1234')
<_sre.SRE_Match object; span=(0, 4), match='1234'>

There’s also a shortcut for matching a particular number of consecutive characters:

>>> re.search(r'^\d{5}$', '1234')
>>> re.search(r'^\d{5}$', '12345')
<_sre.SRE_Match object; span=(0, 5), match='12345'>
>>> re.search(r'^\d{5}$', '123456')

We can also match ranges of repetitions with this.

Words 3 to 5 letters long:

>>> re.search(r'^\w{3,5}$', 'hi')
>>> re.search(r'^\w{3,5}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,5}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,5}$', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^\w{3,5}$', 'mellow')

Words 3 or more letters long:

>>> re.search(r'^\w{3,}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{3,}$', 'ball')
<_sre.SRE_Match object; span=(0, 4), match='ball'>
>>> re.search(r'^\w{3,}$', 'hi')

Words 3 or fewer letters long:

>>> re.search(r'^\w{,3}$', 'hi')
<_sre.SRE_Match object; span=(0, 2), match='hi'>
>>> re.search(r'^\w{,3}$', 'cat')
<_sre.SRE_Match object; span=(0, 3), match='cat'>
>>> re.search(r'^\w{,3}$', 'ball')
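The repetition forms above can be compared in one place. This sketch runs each pattern over the same illustrative word list and collects the words that match.

```python
import re

patterns = {
    'exactly 3': r'^\w{3}$',
    '3 to 5': r'^\w{3,5}$',
    '3 or more': r'^\w{3,}$',
    '3 or fewer': r'^\w{,3}$',
}
words = ['hi', 'cat', 'hello', 'mellow']

# For each pattern, keep only the words it matches.
results = {
    label: [word for word in words if re.search(pattern, word)]
    for label, pattern in patterns.items()
}

for label, matched in results.items():
    print(f'{label}: {matched}')
# exactly 3: ['cat']
# 3 to 5: ['cat', 'hello']
# 3 or more: ['cat', 'hello', 'mellow']
# 3 or fewer: ['hi', 'cat']
```

Note that {,3} has no lower bound, so it also accepts an empty string.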

Case Sensitivity¶

There is an optional third argument that we can provide to Python’s re.search function. This third argument is a flags argument.

One of these flags is the IGNORECASE flag. It makes letters in the pattern match both their lowercase and uppercase forms.

Here’s how we use this flag:

>>> re.search('hello', "Hello there")
>>> re.search('hello', "Hello there", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 5), match='Hello'>

This works for character classes too:

>>> re.search('[A-Z]', "hi")
>>> re.search('[A-Z]', "hi", re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 1), match='h'>

The documentation has more information on the flags argument.
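As a small sketch of two equivalent spellings: the flag can be passed positionally or through the flags keyword, and re.I is the short alias the standard library provides for re.IGNORECASE.

```python
import re

# Positional third argument, as in the examples above.
positional = re.search('hello', 'Hello there', re.IGNORECASE)

# Keyword argument with the short alias.
keyword = re.search('hello', 'Hello there', flags=re.I)

# Without the flag the capital H prevents a match.
case_sensitive = re.search('hello', 'Hello there')

print(positional is not None)   # True
print(keyword is not None)      # True
print(case_sensitive)           # None
print(re.I == re.IGNORECASE)    # True
```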

Verbose Flag¶

We’ll look at one more flag right now.

Let’s take a regular expression that validates UUIDs (universally unique identifiers):

def is_valid_uuid(uuid):
    return bool(re.search(r'^[a-f\d]{8}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{4}-[a-f\d]{12}$', uuid, re.IGNORECASE))

That’s not easy to read at all.

We could make that more readable by splitting it over multiple lines:

def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE))

That helps some, but there are a lot of extra quotes around each of the strings and it’s still kind of terse. It would be cool if there were a way to space out this regular expression in a multi-line string.

The VERBOSE flag does exactly this.

We’re already using the IGNORECASE flag. To use two flags we need to use the pipe operator (|) also known as the bitwise OR operator:

def is_valid_uuid(uuid):
    uuid_regex = (
        r'^[a-f\d]{8}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{4}'
        r'-'
        r'[a-f\d]{12}'
        r'$'
    )
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))

Now our regular expression will ignore whitespace, except inside character classes, where a space still counts as a literal character:

def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^
        [a-f\d] {8}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {4}
        -
        [a-f\d] {12}
        $
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))

We can also add as many comments to our regular expression as we want:

def is_valid_uuid(uuid):
    uuid_regex = (r'''
        ^               # beginning of string
        [a-f\d] {8}     # 8 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {4}     # 4 hexadecimal digits
        -               # dash character
        [a-f\d] {12}    # 12 hexadecimal digits
        $               # end of string
    ''')
    return bool(re.search(uuid_regex, uuid, re.IGNORECASE | re.VERBOSE))
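One detail of the VERBOSE flag is worth a sketch: whitespace is ignored between tokens, but whitespace inside a character class stays significant. The tiny patterns below are illustrative only.

```python
import re

# Spaces between tokens are ignored in verbose mode.
spaced_outside = r'^ a{3} $'
outside_result = bool(re.search(spaced_outside, 'aaa', re.VERBOSE))

# A space inside the brackets is still a member of the class.
spaced_inside = r'^[ a]{3}$'
inside_result = bool(re.search(spaced_inside, 'a a', re.VERBOSE))

# Without the space in the class, the same input is rejected.
compact = r'^[a]{3}$'
compact_result = bool(re.search(compact, 'a a', re.VERBOSE))

print(outside_result)  # True
print(inside_result)   # True
print(compact_result)  # False
```

This is why character classes in a verbose pattern are best written without padding.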

Unfortunately regular expressions, unlike Python, do not have a concept of variables or functions so we often can’t make our regular expressions as self-documenting as our Python code.

If you wanted to reuse parts of your regular expression you could try using string formatting, but the curly braces ({ and }) in your regular expression will need to be doubled up in order to escape them when using string formatting:

def is_valid_uuid(uuid):
    hex_re = r'[a-f\d]'
    uuid_re = r'''
        ^               # beginning of string
        {hex} {{8}}     # 8 hexadecimal digits
        -               # dash character
        {hex} {{4}}     # 4 hexadecimal digits
        -               # dash character
        {hex} {{4}}     # 4 hexadecimal digits
        -               # dash character
        {hex} {{4}}     # 4 hexadecimal digits
        -               # dash character
        {hex} {{12}}    # 12 hexadecimal digits
        $               # end of string
    '''.format(hex=hex_re)
    return bool(re.search(uuid_re, uuid, re.IGNORECASE | re.VERBOSE))
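If doubling the braces feels awkward, one alternative is to assemble the pattern with plain string concatenation. This is only a sketch: the names HEX, GROUP_LENGTHS, and UUID_RE are made up for the example, and the standard library uuid module is used just to produce a sample value.

```python
import re
import uuid

HEX = r'[a-f\d]'                 # one hexadecimal digit
GROUP_LENGTHS = (8, 4, 4, 4, 12)  # digits in each dash-separated group

# Build "[a-f\d]{8}-[a-f\d]{4}-..." without any brace escaping.
UUID_RE = '^' + '-'.join(HEX + '{' + str(n) + '}' for n in GROUP_LENGTHS) + '$'


def is_valid_uuid(value):
    """Return True if value looks like a hyphenated UUID."""
    return bool(re.search(UUID_RE, value, re.IGNORECASE))


print(UUID_RE)
print(is_valid_uuid(str(uuid.uuid4())))  # True
print(is_valid_uuid('not-a-uuid'))       # False
```

The trade-off is that the pattern is no longer laid out line by line with comments, so the VERBOSE version may still read better for longer expressions.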

Searching¶

So far we’ve used regular expressions to validate whether something is inside a string or validate whether a string looks a certain way.

What if we want to find out the value that the regular expression actually matched?

The secret lies in the match object that is returned when there is a match.

We can get the matched text with the group method on our match object:

>>> sentence = "I'm flying out of SAN right now."
>>> m = re.search(r'\b[A-Z]{3}\b', sentence)
>>> m.group()
'SAN'

You can use the help function to find other features of match objects:

>>> help(m)
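A few of those features, sketched with the same sentence: start, end, and span report where the match was found. Since re.search returns None when nothing matches, it is safest to check the result before calling methods on it.

```python
import re

sentence = "I'm flying out of SAN right now."
m = re.search(r'\b[A-Z]{3}\b', sentence)

if m is not None:
    print(m.group())  # SAN
    print(m.start())  # 18
    print(m.end())    # 21
    print(m.span())   # (18, 21)

# No three-letter uppercase word here, so there is no match object.
no_match = re.search(r'\b[A-Z]{3}\b', 'no airport codes here')
print(no_match)  # None
```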

Grouping¶

So far our regular expressions have consisted solely of commands that match individual letters or allow repetition of individual letter matches.

What if we want to act on a group?

For example what if we want to match US ZIP codes in their shortened form or their full form?

We’ve already matched shortened ZIP codes:

>>> re.search(r'^\d{5}$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>

A full ZIP code match looks like this:

>>> re.search(r'^\d{5}-\d{4}$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>

So far we haven’t seen a way to make that last part optional.

We could try putting a question mark after the - and the repetition:

>>> re.search(r'^\d{5}-?\d{4}?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '902104873')
<_sre.SRE_Match object; span=(0, 9), match='902104873'>
>>> re.search(r'^\d{5}-?\d{4}?$', '90210-')

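That matches strange things though, such as nine digits with no dash. And the ? after the repetition count does not make the four digits optional: a ? after a quantifier makes it lazy (non-greedy), so \d{4}? still requires exactly four digits.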

To optionally match a number of consecutive character patterns, we can use a group:

>>> re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
<_sre.SRE_Match object; span=(0, 10), match='90210-4873'>
>>> re.search(r'^\d{5}(-\d{4})?$', '90210')
<_sre.SRE_Match object; span=(0, 5), match='90210'>
>>> re.search(r'^\d{5}(-\d{4})?$', '902104873')
>>> re.search(r'^\d{5}(-\d{4})?$', '90210-')

This allows us to match 5 digits optionally followed by a dash and 4 digits. The dash and the 4 digits must appear together or not at all.
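Wrapped in a function, the grouped pattern might look like the sketch below. The name is_valid_zip is hypothetical, chosen to mirror is_valid_uuid from earlier.

```python
import re


def is_valid_zip(zip_code):
    """Return True for a 5-digit ZIP code with an optional -NNNN suffix."""
    return bool(re.search(r'^\d{5}(-\d{4})?$', zip_code))


print(is_valid_zip('90210'))       # True
print(is_valid_zip('90210-4873'))  # True
print(is_valid_zip('902104873'))   # False
print(is_valid_zip('90210-'))      # False
```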

Capture Groups¶

We’ve already talked about using groups to allow for quantifying a group of character patterns.

There’s actually another purpose for groups though.

Groups also allow capturing characters matched by a group.

Remember how we used the group method to access the matched data? We can pass arguments to that method to access captured groups.

For example, in our ZIP code regular expression, we can get the first matching group like this:

>>> m = re.search(r'^\d{5}(-\d{4})?$', '90210-4873')
>>> m.group(1)
'-4873'

>>> m.group()
'90210-4873'

If we want to always access just the first 5 digits, we could put those in a group:

>>> m = re.search(r'(^\d{5})(-\d{4})?$', '90210-4873')
>>> m.group(2)
'-4873'
>>> m.group(1)
'90210'

Note that if we access the 0 group that will give us the entire match, just like when we pass no arguments:

>>> m.group(0)
'90210-4873'
>>> m.group()
'90210-4873'
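One thing to watch for with optional groups, shown in this sketch: when the group did not take part in the match, group returns None for it. The groups method, which returns every captured group as a tuple, shows the same thing.

```python
import re

zip_re = r'^(\d{5})(-\d{4})?$'

full = re.search(zip_re, '90210-4873')
short = re.search(zip_re, '90210')

print(full.groups())   # ('90210', '-4873')
print(short.groups())  # ('90210', None)
print(short.group(2))  # None
```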

Multi-search¶

What if we’re searching with a regular expression but we don’t want to stop at the first match?

We can find multiple matches using finditer:

>>> sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"
>>> airport_matches = re.finditer(r'\b[A-Z]{3}\b', sentence)
>>> for m in airport_matches:
...     print(m.group())
...
SAN
PDX
SFO

We could use a list comprehension to store our matches in a list:

>>> airports = [m.group() for m in re.finditer(r'\b[A-Z]{3}\b', sentence)]
>>> airports
['SAN', 'PDX', 'SFO']

There’s a helper function we can use that does something similar to this though:

>>> airport_codes = re.findall(r'\b[A-Z]{3}\b', sentence)
>>> airport_codes
['SAN', 'PDX', 'SFO']

The findall function does not return match objects. Instead it returns a list of the strings that were matched.
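When the position of each match matters, finditer is the better fit because each match object still knows where it was found. A minimal sketch:

```python
import re

sentence = "I'll be flying from SAN to PDX with a stop in SFO on the way"

# Pair each airport code with the index where it starts.
found = [
    (m.group(), m.start())
    for m in re.finditer(r'\b[A-Z]{3}\b', sentence)
]

for code, start in found:
    # Slicing the sentence at the reported position gives the code back.
    print(code, start, sentence[start:start + 3] == code)

# findall gives only the strings, with no position information.
codes_only = re.findall(r'\b[A-Z]{3}\b', sentence)
print(codes_only)  # ['SAN', 'PDX', 'SFO']
```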

Uncapturing Groups¶

There’s an important caveat to be aware of when using findall. If we have capturing groups in our expression, the full match won’t be returned:

>>> re.findall(r'\d{5}(-\d{4})?', '90210-4873\n12345')
['-4873', '']

If there is a single capturing group, the findall function returns the contents of that group.

If there are multiple capturing groups, the findall function returns a tuple of all the group contents:

>>> re.findall(r'(\d{5})(-\d{4})?', '90210-4873\n12345')
[('90210', '-4873'), ('12345', '')]
>>> re.findall(r'(\d{5}(-\d{4})?)', '90210-4873\n12345')
[('90210-4873', '-4873'), ('12345', '')]

Remember that groups have two purposes:

Grouping patterns
Capturing the strings that were matched by the group

What if we want to use parentheses to make a group, but we don’t want to capture in our group?

We can use an uncapturing group (more commonly called a non-capturing group) for this.

Uncapturing groups have a really weird syntax.

You’ll probably want to refer to the cheat sheet when you realize you need to use them. To make an uncapturing group you put ?: (a question mark and a colon) after the opening parenthesis for the group.

>>> re.findall(r'\d{5}(?:-\d{4})?', '90210-4873\n12345')
['90210-4873', '12345']

The reason this syntax is so weird is that the creators wanted to maintain backwards compatibility and this particular syntax was invalid in regular expression parsers up to that point (? makes no sense after ( normally).

Hopefully you won’t need this feature often.
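For comparison, here is a sketch of the three options side by side. If you would rather keep the capturing group, finditer with group() is another way to get the full matches.

```python
import re

text = '90210-4873\n12345'
capturing = r'\d{5}(-\d{4})?'
non_capturing = r'\d{5}(?:-\d{4})?'

# findall returns the group contents when a capturing group is present.
group_only = re.findall(capturing, text)

# With a non-capturing group, findall returns the full matches.
full_matches = re.findall(non_capturing, text)

# finditer always gives match objects, so group() is the full match either way.
via_finditer = [m.group() for m in re.finditer(capturing, text)]

print(group_only)    # ['-4873', '']
print(full_matches)  # ['90210-4873', '12345']
print(via_finditer)  # ['90210-4873', '12345']
```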

Search Exercises¶

Most of these exercises involve searching in a dictionary.

Hint

You can open and read from the dictionary file like this:

with open('dictionary.txt') as dict_file:
    dictionary = dict_file.read()
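Once the file contents are in a string, one possible approach is to split it into words and test each word against an anchored pattern. The sketch below assumes the file holds one word per line and uses a short stand-in string instead of the real file; the pattern is only an example and is not the answer to any exercise.

```python
import re

# Stand-in for the contents of dictionary.txt (assumed: one word per line).
dictionary = 'apple\nquail\nquick\nzebra\n'

words = dictionary.split()

# Keep the words that start with "q" and contain only word characters.
q_words = [word for word in words if re.search(r'^q\w*$', word)]

print(q_words)  # ['quail', 'quick']
```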

Get File Extension¶

Make a function that accepts a full file path and returns the file extension.

Tip

Modify the get_extension function in the search module.

Example usage:

>>> get_extension('archive.zip')
'zip'
>>> get_extension('image.jpeg')
'jpeg'
>>> get_extension('index.xhtml')
'xhtml'
>>> get_extension('archive.tar.gz')
'gz'

Hexadecimal Words¶

Find every word that consists solely of the letters A, B, C, D, E, and F.

Tip

Modify the hexadecimal function in the search module.

Examples: decaf, bead, cab

Tetravocalic¶

Find all words that include four consecutive vowels.

Tip

Modify the tetravocalic function in the search module.

Hexaconsonantal¶

Find at least one word with 6 consecutive consonants. For this problem treat y as a vowel.

Tip

Modify the hexaconsonantal function in the search module.

Crossword Helper¶

Make a function possible_words that accepts a partial word with underscores representing missing letters and returns a list of all possible matches.

Tip

Modify the possible_words function in the search module.

Use your crossword helper function to solve the following:

water tank: CIS____
pastry: ___TE
temporary: __A_S_E__

Repeat Letter¶

Find every word with 5 repeat letters.

Tip

Modify the five_repeats function in the search module.
