The Basics¶
Regular expressions are a mini programming language used for searching through text.
You can use regular expressions to:
- validate text
- search for things in text
- normalize text
Today we’re going to learn how to use regular expressions to validate text and search for things in text.
Raw Strings¶
Python strings use backslashes as escape characters:
>>> file_name = "C:\projects\nathan"
>>> file_name
'C:\\projects\nathan'
>>> print(file_name)
C:\projects
athan
That \n represents a newline character. If we want to use a literal \ followed by a literal n, we have to escape the backslash with another backslash.
>>> file_name = "C:\\projects\\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
We can turn off character escaping completely by using “raw” strings, which you can make by prefixing your string with an r character:
>>> file_name = r"C:\projects\nathan"
>>> file_name
'C:\\projects\\nathan'
>>> print(file_name)
C:\projects\nathan
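When making regular expressions it’s a very good idea to always use raw strings. Many of the special sequences that regular expressions use begin with a backslash, and some of them are also escape sequences in normal Python strings. In a normal string, Python would process those sequences before the regular expression engine ever sees them.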
Because of this, we will exclusively use raw strings when creating regular expressions.
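One concrete collision is \b, which is a backspace character in a normal Python string but a word boundary in a regular expression. The following is a minimal sketch of the difference; the sample text and variable names are our own, chosen only for illustration.

```python
import re

text = "hello world"

# In a normal string, "\b" is a single backspace character, so this
# pattern looks for a backspace followed by "world" and finds nothing.
plain_result = re.search("\bworld", text)

# In a raw string the backslash reaches the regex engine intact,
# where \b means "word boundary".
raw_result = re.search(r"\bworld", text)

print(len("\b"), len(r"\b"))  # 1 2
print(plain_result)           # None
print(raw_result.span())      # (6, 11)
```

The two patterns look identical on the page apart from the r prefix, which is why sticking to raw strings avoids this kind of surprise.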
Searching¶
The re module in Python’s standard library includes tools for using regular expressions.
Let’s import the re module like this:
>>> import re
Let’s make a string:
>>> greeting = "hello world"
We’re going to use this string for testing our regular expressions.
Let’s ask whether our string includes the letter x. We can use the search function for this:
>>> re.search(r"x", greeting)
Nothing was returned. Specifically, None was returned:
>>> print(re.search(r"x", greeting))
None
This means greeting does not include the letter x.
Let’s try a string that does have the letter x:
>>> re.search(r"x", "exit")
<_sre.SRE_Match object; span=(1, 2), match='x'>
We got a match object back. That means we got a match!
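A match object also records what was matched and where. Here is a small sketch using the group, span, and start methods of match objects from the re module; the variable name match is our own.

```python
import re

match = re.search(r"x", "exit")

# search returns None when nothing matches, so check before using it.
if match is not None:
    print(match.group())  # the matched text: x
    print(match.span())   # (start, end) indexes: (1, 2)
    print(match.start())  # 1
```

The span here is the same one shown in the match object's printed representation.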
This first example wasn’t particularly interesting because we could do this same thing with the in operator on strings:
>>> 'x' in greeting
False
>>> 'x' in 'exit'
True
Character Class¶
Let’s say we want to ask whether our string includes a vowel.
Without regular expressions this could get quite verbose:
>>> 'a' in greeting or 'e' in greeting or 'i' in greeting or 'o' in greeting or 'u' in greeting
True
We could make that shorter with a generator expression, but this is still a little verbose:
>>> any(c in greeting for c in 'aeiou')
True
With our regular expression search function, we can do this:
>>> re.search(r'[aeiou]', greeting)
<_sre.SRE_Match object; span=(1, 2), match='e'>
If we provide a word without vowels (note that we’re not counting y as a vowel here) we’ll get None:
>>> re.search(r'[aeiou]', 'rhythm')
This is called a character class. We can make character classes with square brackets. A character class matches any single character we put inside it.
We could match any digit like this:
>>> re.search(r'[0123456789]', 'rhythm')
>>> re.search(r'[0123456789]', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>
Character classes also support ranges of characters. We can denote a range of characters with a -:
>>> re.search(r'[0-9]', 'rhythm')
>>> re.search(r'[0-9]', '$100')
<_sre.SRE_Match object; span=(1, 2), match='1'>
Ranges allow us to match a number of ASCII-betically consecutive characters (this means it’s like looking at an ASCII table and fetching every character between two others).
Ranges can get pretty advanced because of that, but you’ll usually only see ranges for digits and uppercase or lowercase characters:
>>> re.search(r'[0-9]', greeting)
>>> re.search(r'[a-z]', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
>>> re.search(r'[A-Z]', greeting)
Note that you can put multiple ranges in a character class and you can even mix and match ranges and other characters in character classes.
This matches a letter, digit, or underscore character:
>>> re.search(r'[a-zA-Z0-9_]', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
We can also invert a character class by starting it with a ^ (a caret):
>>> re.search(r'[^0-9]', 'rhythm')
<_sre.SRE_Match object; span=(0, 1), match='r'>
>>> re.search(r'[^0-9]', '$100')
<_sre.SRE_Match object; span=(0, 1), match='$'>
Why did that second one match?
We’re asking whether our string includes a non-digit character. That dollar sign is a non-digit character.
If we remove the dollar sign, we won’t get a match:
>>> re.search(r'[^0-9]', '100')
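To tie ranges, mixed characters, and inverted classes together, here is a short sketch that looks for hexadecimal digits. The helper names first_hex_digit and first_non_hex are hypothetical, made up for this illustration.

```python
import re

def first_hex_digit(text):
    """Return the first hexadecimal digit in text, or None (hypothetical helper)."""
    # Three ranges in one class: digits, a-f, and A-F.
    match = re.search(r'[0-9a-fA-F]', text)
    return match.group() if match else None

def first_non_hex(text):
    """Return the first character that is not a hex digit, or None (hypothetical helper)."""
    # The leading ^ inverts the whole class.
    match = re.search(r'[^0-9a-fA-F]', text)
    return match.group() if match else None

print(first_hex_digit('rhythm'))  # None
print(first_hex_digit('$1F'))     # 1
print(first_non_hex('$1F'))       # $
print(first_non_hex('1F'))        # None
```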
Anchors¶
What if we want to match strings that start with an a character?
So far we haven’t seen a way to match at the start of a string. We only know how to look for characters anywhere in a string:
>>> re.search(r'a', 'hiya')
<_sre.SRE_Match object; span=(3, 4), match='a'>
We can use ^ to match the beginning of a string:
>>> re.search(r'^a', 'hiya')
>>> re.search(r'^a', 'abcd')
<_sre.SRE_Match object; span=(0, 1), match='a'>
Notice that ^ doesn’t actually match a character. This is an anchor character. It matches a location, not a character.
The other popular anchor character is $ which matches the end of the string:
>>> re.search(r'a$', 'hiya')
<_sre.SRE_Match object; span=(3, 4), match='a'>
>>> re.search(r'a$', 'abcd')
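Using both anchors in one pattern requires the whole string to match. The sketch below uses sample strings of our own; it also shows one subtlety of $ in Python's re module, which is that it matches just before a newline at the very end of the string.

```python
import re

word = 'hiya'

starts_with_h = bool(re.search(r'^h', word))        # True
ends_with_a = bool(re.search(r'a$', word))          # True

# With both anchors, nothing may come before or after the pattern.
whole_string = bool(re.search(r'^hiya$', word))     # True
not_whole = bool(re.search(r'^hiya$', 'oh hiya!'))  # False

# $ also matches just before a newline at the very end of the string.
before_newline = bool(re.search(r'a$', 'hiya\n'))   # True

print(starts_with_h, ends_with_a, whole_string, not_whole, before_newline)
```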
Metacharacters¶
Most characters in a regular expression just match themselves. Metacharacters are characters that have a special meaning.
So far we’ve seen that square brackets ([ and ]), caret (^), and dollar sign ($) have special meaning. These are all metacharacters.
If you want to represent a metacharacter literally, you can use a backslash to escape the character:
>>> re.search(r"\[hello\]", "h")
>>> re.search(r"\[hello\]", "[hello]")
<_sre.SRE_Match object; span=(0, 7), match='[hello]'>
If we want to match a single dollar sign, we’ll want to escape it like this:
>>> re.search(r"$", "100")
<_sre.SRE_Match object; span=(3, 3), match=''>
>>> re.search(r"\$", "100")
>>> re.search(r"\$", "$100")
<_sre.SRE_Match object; span=(0, 1), match='$'>
You can find a list of regular expression metacharacters in the documentation.
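When the text to match comes from somewhere else, escaping each metacharacter by hand gets tedious. As a sketch, the standard library's re.escape function can add the backslashes for us; the sample strings below are our own.

```python
import re

price = "$100"  # sample text containing a metacharacter

# re.escape returns a pattern that matches the given text literally.
escaped = re.escape(price)
print(escaped)  # \$100

match = re.search(escaped, "total: $100")
print(match.span())  # (7, 11)
```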
One of the most common metacharacters is . (a period). This matches any single character (except for a newline character by default).
>>> re.search(r'.', greeting)
<_sre.SRE_Match object; span=(0, 1), match='h'>
>>> re.search(r'.', 'a')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search(r'.', '')
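The newline exception can be seen in a small sketch. The sample string is our own, and re.DOTALL is a flag from the standard re module that makes . match newline characters as well.

```python
import re

text = "wa\nzo"  # the a and the z are separated by a newline

# By default . refuses to match the newline, so there is no match.
default_match = re.search(r'a.z', text)

# With re.DOTALL, . matches any character at all, including a newline.
dotall_match = re.search(r'a.z', text, flags=re.DOTALL)

print(default_match)        # None
print(dotall_match.span())  # (1, 4)
```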
We can use this to match any three-character sequence that starts with an a and ends with a z:
>>> re.search(r'a.z', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'a.z', 'wa zo')
<_sre.SRE_Match object; span=(1, 4), match='a z'>
>>> re.search(r'a.z', 'wazo')
Quantifiers¶
What if we want to match any string that starts with an a and ends with a z?
We haven’t learned a way to do this so far. The problem is we need to match strings that are any number of characters long:
>>> re.search(r'^az$', 'abz')
>>> re.search(r'^a.z$', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'^a..z$', 'abz')
>>> re.search(r'^a...z$', 'abz')
We can use * for this:
>>> re.search(r'^a.*z$', 'abz')
<_sre.SRE_Match object; span=(0, 3), match='abz'>
>>> re.search(r'^a.*z$', 'az')
<_sre.SRE_Match object; span=(0, 2), match='az'>
>>> re.search(r'^a.*z$', 'a and z')
<_sre.SRE_Match object; span=(0, 7), match='a and z'>
>>> re.search(r'^a.*z$', 'a and c')
This * character makes the pattern element before it (the . character in this case) match 0 or more times.
So this matches strings that consist exclusively of digit characters (it also matches the empty string):
>>> re.search(r'^[0-9]*$', greeting)
>>> re.search(r'^[0-9]*$', '$100')
>>> re.search(r'^[0-9]*$', '100')
<_sre.SRE_Match object; span=(0, 3), match='100'>
>>> re.search(r'^[0-9]*$', '')
<_sre.SRE_Match object; span=(0, 0), match=''>
This kind of metacharacter is often called a quantifier or modifier character. Instead of matching a character, it modifies the match before it.
Here’s another quantifier character:
>>> re.search(r'^[a-z]+$', greeting)
>>> re.search(r'^[a-z]+$', 'hello')
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> re.search(r'^[a-z]+$', '')
The + character modifies the previous match to match 1 or more times. Unlike *, this cannot match zero times.
There’s also ?, which matches zero or 1 times. We can use this for matching something that’s optional. For example we could look for the word color spelled with or without a “u”:
>>> re.search(r'colou?r', 'what a nice color')
<_sre.SRE_Match object; span=(12, 17), match='color'>
>>> re.search(r'colou?r', 'what a nice colour')
<_sre.SRE_Match object; span=(12, 18), match='colour'>
>>> re.search(r'colou?r', 'what a nice shade')
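To compare the three quantifiers side by side, here is a small sketch that tries each one against the same sample strings. The samples and the dictionary layout are our own, chosen only for illustration.

```python
import re

samples = ['', 'a', 'aaa']
patterns = {'*': r'^a*$', '+': r'^a+$', '?': r'^a?$'}

# For each quantifier, record whether each sample string matches.
results = {
    symbol: [bool(re.search(pattern, sample)) for sample in samples]
    for symbol, pattern in patterns.items()
}

for symbol, matched in results.items():
    print(symbol, matched)
# * [True, True, True]
# + [False, True, True]
# ? [True, True, False]
```

Only * and ? accept the empty string, and only * and + accept more than one a.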
Validation Exercises¶
Hint
Match objects are always “truthy” and None is always “falsy”. Truthy means that when you convert something to a boolean, it’ll be True.
You can convert the result of re.search to a boolean to get True or False for a match or non-match like this:
>>> bool(re.search(r'hello', greeting))
True
>>> bool(re.search(r'hi', greeting))
False
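A validation function can wrap this bool conversion around an anchored pattern. The sketch below uses a hypothetical is_lowercase_word function, which is not one of the exercises, to show the general shape.

```python
import re

def is_lowercase_word(text):
    """Return True if text is one or more lowercase ASCII letters.

    Hypothetical example function. Note that $ also matches just before
    a single trailing newline, so 'hello\\n' would pass this check too.
    """
    # ^ and $ anchor the pattern so the whole string must match.
    return bool(re.search(r'^[a-z]+$', text))

print(is_lowercase_word('hello'))        # True
print(is_lowercase_word('hello world'))  # False
print(is_lowercase_word(''))             # False
```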
Has Vowels¶
Create a function has_vowel that accepts a string and returns True if the string contains a vowel (a, e, i, o, or u) and returns False otherwise.
Tip
Modify the has_vowel function in the validation module.
Your function should work like this:
>>> has_vowel("rhythm")
False
>>> has_vowel("exit")
True
Is Integer¶
Create a function is_integer that accepts a string and returns True if the string represents an integer.
By our definition, an integer:
- Consists of 1 or more digits
- May optionally begin with -
- Does not contain any other non-digit characters.
Tip
Modify the is_integer function in the validation module.
Your function should work like this:
>>> is_integer("")
False
>>> is_integer(" 5")
False
>>> is_integer("5000")
True
>>> is_integer("-999")
True
>>> is_integer("+999")
False
>>> is_integer("00")
True
>>> is_integer("0.0")
False
Is Fraction¶
Create a function is_fraction that accepts a string and returns True if the string represents a fraction.
By our definition a fraction consists of:
- An optional - character
- Followed by 1 or more digits
- Followed by a /
- Followed by 1 or more digits, at least one of which is non-zero (the denominator cannot be the number 0).
Tip
Modify the is_fraction function in the validation module.
Your function should work like this:
>>> is_fraction("")
False
>>> is_fraction("5000")
False
>>> is_fraction("-999/1")
True
>>> is_fraction("+999/1")
False
>>> is_fraction("00/1")
True
>>> is_fraction("/5")
False
>>> is_fraction("5/0")
False
>>> is_fraction("5/010")
True
>>> is_fraction("5/105")
True
>>> is_fraction("5 / 1")
False