
Regular Expressions

Regular expressions can be used to search for specific patterns within texts. They typically consist of a sequence of symbols which specify a search action. Once defined, such regular expressions can be matched against actual strings.

Regular expressions can be constructed using literal characters and so-called metacharacters. The simple regular expression ‘flower’, for instance, only contains literal characters. It can be used to search for the six characters that are mentioned. Metacharacters, by contrast, are characters with a special meaning. They represent specific types of characters, such as characters in lower case, digits, or tabs. When you combine literal characters and metacharacters, you can search for patterns rather than for literal strings.

The standard installation of Python includes a useful module called ‘re’, which can be used to search for text fragments on the basis of regular expressions. To work with the module, you first need to import it. The module ‘re’ contains a function called 'search()', which minimally requires two parameters. The first parameter is the pattern to search for, and the second parameter is the string in which you want to search. The function returns a match object if the pattern occurs anywhere in the string which is provided as the second parameter, and the value ‘None’ if it does not. Because a match object counts as true and ‘None’ counts as false, the result can be used directly in an ‘if’ statement.

The listing below offers an example. The regular expression, in this case, is simply a string consisting of literal characters. The program simply tries to establish whether the string that is mentioned as the first parameter of 're.search()' occurs in the sentence which is mentioned as the second parameter.

import re

sentence = 'Mrs. Dalloway said she would buy the flowers herself.'

if re.search( 'flower' , sentence ):
    print('The pattern was found in the sentence!')
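As a minimal sketch of what 're.search()' hands back, the code below stores the result in a variable instead of testing it directly. The variable names and the second search term, 'gloves', are chosen for this illustration only.

```python
import re

sentence = 'Mrs. Dalloway said she would buy the flowers herself.'

# A successful search returns a match object, which counts as true in an 'if' test
found = re.search('flower', sentence)
print(found.group())   # the text that was matched
print(found.start())   # the position at which the match begins

# An unsuccessful search returns None, which counts as false
missing = re.search('gloves', sentence)
print(missing)
```

Calling 'group()' on the result is only safe after checking that the search did not return ‘None’.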

In addition to literal characters, the following metacharacters may be used:

Metacharacter	Description
\w	Any alphanumeric character: all 26 alphabetical characters, both in upper case and in lower case, all digits and the underscore. In Python 3, accented letters such as ‘é’ are matched as well.
\d	Digits.
.	Any character, except the newline.
\s	White space: the space, a tab or a newline character.
[A-Z]	Any upper case character.
[A-Za-z]	Any upper case or lower case character.
[...]	If only a limited number of characters are allowed on a specific position in a string, the characters that are allowed can be supplied in square brackets.

The square brackets can be useful if you need to search for words which can be spelled in different ways. To find the word ‘digitise’, for instance, either in its British or in its American spelling, you may use the regular expression ‘digiti[sz]e’.
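The following sketch tries out a few of these metacharacters on an invented sentence. The sentence and the variable names are assumptions made for this illustration.

```python
import re

text = 'We digitise letters in Leiden; they digitize maps in Boston in 2 batches.'

# The class [sz] accepts either spelling at that one position
spellings = re.findall(r'digiti[sz]e', text)
print(spellings)

# \d matches a single digit, \s a single white space character
print(re.findall(r'\s\d\s', text))

# [A-Z][a-z]: each capital letter together with the lower case letter that follows it
capital_pairs = re.findall(r'[A-Z][a-z]', text)
print(capital_pairs)
```

The function 're.findall()', which is used here to display the matches, returns a list of all the fragments that match the pattern.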

You can also use quantifiers to specify the number of times a character or a pattern should occur.

Quantifier	Description
{n,m}	Pattern must occur at least n times, at most m times.
{n,}	At least n times.
{n}	Exactly n times.
?	Is the same as {0,1}
+	Is the same as {1,}
*	Is the same as {0,}

The code below contains a number of examples of regular expressions containing such metacharacters and quantifiers.

import re

sentence = "Keats's 'Ode on a Grecian Urn' was written in 1819."

if re.search( r'\d{4}' , sentence ):
    print('Found')
## Matches '1819'

if re.search( r'K[aeuio]{2}ts' , sentence ):
    print('Found')
## Matches 'Keats'

hits = re.findall( r'[aeuio]n' , sentence )
for h in hits:
    print(h)
## Four matches: 'on', 'an', 'en' and 'in'
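The shorthand quantifiers ‘?’, ‘+’ and ‘*’ can be tried out in the same way. The code below is a brief sketch; the sample text, including the invented spellings ‘greeey’ and ‘gry’, is an assumption made for this illustration.

```python
import re

text = 'The colour of the color wheel: grey, gray, greeey.'

# '?' makes the preceding character optional: matches 'colour' and 'color'
optional_u = re.findall(r'colou?r', text)
print(optional_u)

# '+' requires at least one 'e': matches 'grey' and 'greeey', but not 'gray'
one_or_more = re.findall(r'gre+y', text)
print(one_or_more)

# '*' also allows zero occurrences of 'e', so the invented form 'gry' matches too
zero_or_more = re.findall(r'gre*y', text + ' gry')
print(zero_or_more)
```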

In the code above, all the regular expressions are preceded by the character ‘r’, which, in this context, indicates that the strings defining the regular expressions make use of the ‘raw string’ notation. In a raw string, Python does not treat the backslash as the start of an escape sequence, so combinations such as ‘\d’ or ‘\b’ are passed on to the ‘re’ module exactly as they were typed. You are advised to use the ‘r’ in front of the string whenever the regular expression contains a backslash, as in ‘\w’ or ‘\d’.
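The difference that the ‘r’ makes can be sketched with the combination ‘\b’. In an ordinary Python string, this is a single backspace character, whereas the ‘re’ module reads the two characters ‘\b’ as a word boundary, the edge of a word. The sample line and the variable names are assumptions made for this illustration.

```python
import re

# Without the 'r', Python itself turns '\b' into one backspace character
print(len('\b'))    # 1 character
print(len(r'\b'))   # 2 characters: a backslash and the letter 'b'

line = 'In Xanadu did Kubla Khan'

# The raw string passes the backslash and the 'b' to the 're' module unchanged,
# where they are read as a word boundary
raw_hits = re.findall(r'\bK\w+', line)
print(raw_hits)

# The ordinary string searches for a backspace character followed by 'K'
plain_hits = re.findall('\bK\\w+', line)
print(plain_hits)
```

In the ordinary string, the backslash before the ‘w’ had to be doubled by hand, which is exactly the kind of bookkeeping that the raw string notation avoids.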

The fragment above also illustrates the findall() function from the ‘re’ module. This function creates a list containing all fragments from the string that match the regular expression. The 're.search()' function, by contrast, only looks for the first match: it returns a match object if the regular expression matches somewhere in the string, and the value ‘None’ if it does not.
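The two functions can be compared side by side. The code below is a minimal sketch which applies the same pattern with both functions; the variable names and the pattern ‘\d{5}’, which has no match in the sentence, are assumptions made for this illustration.

```python
import re

sentence = "Keats's 'Ode on a Grecian Urn' was written in 1819."

# findall() returns a list of strings containing every match
all_hits = re.findall(r'[aeuio]n', sentence)
print(all_hits)

# search() stops at the first match and returns it as a match object
first_hit = re.search(r'[aeuio]n', sentence)
print(first_hit.group())
print(first_hit.span())   # the start and end positions of the match

# If nothing matches, findall() returns an empty list and search() returns None
no_hits = re.findall(r'\d{5}', sentence)
no_match = re.search(r'\d{5}', sentence)
print(no_hits, no_match)
```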

Finally, you can also use so-called anchors in regular expressions. Anchors do not represent actual characters, but only locations within strings.

Symbol	Description
\b	A word boundary.
^	The beginning of a string.
$	The end of a string.

A word boundary is a location in which an alphanumeric character is placed next to a character which is not an alphanumeric character, such as punctuation, a space or a newline character. The beginning and the end of a string also count as word boundaries if the first or the last character is alphanumeric. Illustrations of the use of such anchors can be found below.

import re

line = "In Xanadu did Kubla Khan a stately pleasure-dome decree"

if re.search( r'^In\b' , line ):
    print('Found!')
    ## This regular expression searches for lines
    ## beginning with the preposition ‘In’

if re.search( r'\bd\w*$' , line ):
    print('Found!')
    ## Searches for lines whose final word begins with the character ‘d’

if re.search( r'\ba\b' , line ):
    print('Found!')  
    ### Searches for the single character ‘a’.    
    ### It does not match words which contain an ‘a’, such    
    ### as ‘Xanadu’ or ‘Khan’
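When an anchor such as ‘$’ is combined with other metacharacters, the choice between ‘.’ and ‘\w’ changes what is matched. The code below is a minimal sketch which displays the matched text for two patterns on the same line; the variable names are assumptions made for this illustration.

```python
import re

line = "In Xanadu did Kubla Khan a stately pleasure-dome decree"

# '.*' matches any characters, spaces included, so the match runs from the
# first word starting with 'd' all the way to the end of the line
long_match = re.search(r'\bd.*$', line)
print(long_match.group())

# '\w*' only matches alphanumeric characters, so the match has to be
# a single word which ends where the line ends
final_word = re.search(r'\bd\w*$', line)
print(final_word.group())

# With word boundaries, only the separate word 'a' is found
separate_a = re.findall(r'\ba\b', line)
print(len(separate_a), 'versus', len(re.findall(r'a', line)))
```

Printing the result of 'group()' in this way is a quick check of whether a pattern matches the fragment you had in mind.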

If you add the flag ‘re.IGNORECASE’ as the third parameter of the search() or the findall() function, the search will take place in a case-insensitive manner. For examples of case-insensitive searches using word boundaries, see the code below.

import re

line = "Doubting, dreaming dreams no mortal ever dared to dream before"

hits = re.findall( r'\bd[a-z]*\b' , line , re.IGNORECASE )
for h in hits:
    print(h)

# Matches all words starting with 'd', including 'Doubting' which starts 
# with upper case 'D'
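To see what the flag changes, the same pattern can be applied with and without it. This is a brief sketch; the variable names are assumptions made for this illustration.

```python
import re

line = "Doubting, dreaming dreams no mortal ever dared to dream before"

case_sensitive = re.findall(r'\bd\w*', line)
case_insensitive = re.findall(r'\bd\w*', line, re.IGNORECASE)

print(case_sensitive)     # 'Doubting' is missing
print(case_insensitive)   # 'Doubting' is included
```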

As was discussed, the function findall() can be used to retrieve the substrings that match the regular expression. Alternatively, you can also work with parentheses in the regular expression. The characters in the string that match the part of the pattern within the parentheses are captured as a group, and they can be retrieved using the method group() of the match object that is returned by search(). The groups are numbered. The text that matches the full regular expression is assigned number 0, and the fragment that matches the pattern in the first set of parentheses is given number 1.

This approach can be followed, for instance, to extract the direct speech from a longer sentence.

import re

sentence = "\"Oh, good gracious me!\" said Lucy, suddenly collapsing and again seeing the whole of life in a new perspective."

hits = re.search( r'["](.+)["]' , sentence )

if hits:
    print( hits.group(1) )
## prints Oh, good gracious me!
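A regular expression can contain more than one set of parentheses, in which case the groups are numbered from left to right. The code below is a minimal sketch; the sample sentence and the pattern are assumptions made for this illustration.

```python
import re

sentence = 'The poem "Ode on a Grecian Urn" was written in 1819.'

# Two sets of parentheses: the title between double quotes, and the year
match = re.search(r'"(.+)" was written in (\d{4})', sentence)

if match:
    print(match.group(0))   # the text matched by the full expression
    print(match.group(1))   # first set of parentheses: the title
    print(match.group(2))   # second set of parentheses: the year
```

Note that ‘.+’ matches as many characters as possible. In a sentence with several quoted passages, the first group would run from the first quotation mark to the last one.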

As was discussed above, characters such as the dot (‘.’), the asterisk (‘*’) or the question mark (‘?’) have a special meaning in regular expressions. Normally, they function as quantifiers or as metacharacters. In some cases, however, you may want to search for these literal characters themselves. If you need to extract the top level domain name from the URL of a website, for example, you need to specify that it is the part that follows the final dot. If you want to refer to characters in their literal meaning, these special characters need to be preceded by the backslash. This notation is known as “escaping” the character. The listing below contains an illustration.

import re

url = "www.universiteitleiden.nl"

match = re.search( r'\.(\w+)$' , url )

if match:
    print( 'The top level domain name of this URL is ' + 
               match.group(1) )
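The effect of the backslash can be checked by comparing an escaped and an unescaped dot. The code below is a brief sketch with an invented sample text. It also uses the function 're.escape()' from the ‘re’ module, which adds the backslashes to a literal string for you.

```python
import re

text = 'Is this a question? Yes. Is it a regular expression? No.'

# Unescaped, '.' matches any character, so every character in the text is a hit
print(len(re.findall(r'.', text)) == len(text))

# Escaped, '\.' and '\?' only match the literal full stop and question mark
dots = re.findall(r'\.', text)
questions = re.findall(r'\w+\?', text)
print(dots)
print(questions)

# re.escape() places a backslash before the special characters in a string
escaped = re.escape('www.universiteitleiden.nl')
print(escaped)
```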

Finding and replacing text

In Python, regular expressions can also be used in ‘find and replace’ operations. Such operations can be performed using the 'sub()' function from the ‘re’ module. The 'sub()' function requires three parameters: a regular expression, a replacement string, and the string containing the text which needs to be replaced. All the matches that are found for the regular expression which is mentioned as the first parameter are replaced with the string which is given as the second parameter. The function returns the result as a new string; the original string is not changed.

import re

sentence = "This,, ..sentence-. .,contains. .strange. !=punctuation"

sentence = re.sub( r'[.,!=-]' , '' , sentence )

print(sentence)
## This code removes the punctuation marks listed between the square brackets
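The replacement string does not have to be empty. The code below is a minimal sketch of two other common uses of 'sub()': tidying up white space and standardising spelling. The sample sentence and the variable names are assumptions made for this illustration.

```python
import re

sentence = 'They   digitise    the  letters and digitize the maps.'

# Replace every run of white space with a single space
tidy = re.sub(r'\s+', ' ', sentence)
print(tidy)

# Replace both spellings with one standard form
standard = re.sub(r'digiti[sz]e', 'digitize', tidy)
print(standard)
```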

Learning to work with regular expressions can involve a steep learning curve. You need to develop a good understanding of all the characters that can be used to compose search patterns, as well as the ability to use these characters and symbols in combination.

If you want to learn more about regular expressions, you can study the very elaborate and accessible tutorials on this topic on The Programming Historian or on the website of Library Carpentry.

On Dataquest.io, you can find a helpful Regular Expressions Cheat Sheet (also available as a PDF document).
