A guide to regular expressions
Moving on to practical matters, the following is an introductory guide to the basic use of
regular expressions in Python. To use regular expressions in Python we must first import
the re module:
import re
or just the required functions:
from re import search, compile, sub
We use the former approach here. To begin, we will consider looking for a particular
substring within a larger string using re.search. This takes the general form:
matchObj = re.search(regExpPattern, textString)
And in practice we would do something like the following, where for illustrative
purposes we use an exact string as a pattern to search for:
text = 'Antidisestablishmentarianism'
matchObjA = re.search('establish', text) # Present – gives MatchObject
matchObjB = re.search('banana', text) # Absent - gives None
As you can see, the search gives back a special MatchObject if successful or None if
the search failed. A MatchObject may then be interrogated to determine where the match
occurred, and what the substring was etc.
If the search pattern doesn’t change often compared to the number of searches then it
may be quicker to compile the pattern from the regular expression string once at the start,
before then applying the pattern repeatedly. In this instance to ‘compile’ a regular
expression means to interpret the pattern specification (which is initially just text) and
create a RegexObject, which has methods (bound functions) to perform searches etc. on
that particular pattern. So adapting an example from above we could do the following,
noting the .search() now comes from regexObj, the compiled regular expression object:
regexObj = re.compile('establish')
matchObj = regexObj.search(text)
print(matchObj)
Various useful functionalities are associated with the match object:
print(matchObj.group()) # The substring that was found/matched
print(matchObj.start()) # index in the string of the start of match
print(matchObj.end()) # index in the string for just after end of match
print(matchObj.span()) # (start, end) tuple; end is the index just after the match
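As a quick check of these methods, the following sketch reuses the earlier example string. It assumes only the standard re module, and the variable names start and end are illustrative. Because the end index is exclusive, slicing the original text with the span recovers exactly the matched substring.

```python
import re

text = 'Antidisestablishmentarianism'
matchObj = re.search('establish', text)

start, end = matchObj.span()     # same values as .start() and .end()
print(start, end)                # 7 16
print(text[start:end])           # 'establish', because end is exclusive
```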
In general we need to check that a search was successful, i.e. that it did not give None,
before proceeding to interrogate any match object:
regexObj = re.compile('Green')
texts = ['Green tomatoes', 'Red brick house']
for text in texts:
    matchObj = regexObj.search(text)
    if matchObj is None:
        print('Pattern does not match')
    else:
        print(matchObj.span())
Considering the above search pattern ‘Green’, we may wish to be less specific and also
accept lower-case ‘green’. The regular expression string for this could be ‘[Gg]reen’, so
that we accept either upper- or lower-case letters at the start of the word.
regexObj = re.compile('[Gg]reen')
print(regexObj.search('Green door'))
print(regexObj.search('Fried green tomatoes'))
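To make the effect of the bracketed group concrete, here is a small sketch that collects the matched substring for a few sample strings. The third sample ('GREEN light') is an added illustration, not from the text above; it shows that only the first letter has been made flexible.

```python
import re

regexObj = re.compile('[Gg]reen')
samples = ['Green door', 'Fried green tomatoes', 'GREEN light']

# Collect the matched substring, or None where the pattern is absent
results = []
for sample in samples:
    matchObj = regexObj.search(sample)
    results.append(matchObj.group() if matchObj else None)

print(results)   # ['Green', 'green', None]
```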
As you may expect the above regular expression was made more general by creating a
group of accepted characters using square brackets []. Hence, square brackets have a
special meaning when they are in a regular expression. There are other characters with
special meanings, and the complete list is:
. ^ $ * + ? { } [ ] \ | ( )
See the table below for what these are used for. If we want to have these characters used
in a literal way then we have to put a ‘\’ in front so that it escapes any special
interpretation (in the jargon the character is said to have been ‘escaped’).
regexObj = re.compile('\[abc\]') # Match the bracket characters
regexObj.search('Text with exactly [abc] inside')
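When the literal text comes from elsewhere (user input, say), escaping by hand can be error-prone. One option, not used in the examples above, is the standard library function re.escape, which adds the backslashes for you. A minimal sketch:

```python
import re

literal = '[abc]'
pattern = re.escape(literal)        # backslashes added before [ and ]
print(pattern)                      # \[abc\]

matchObj = re.search(pattern, 'Text with exactly [abc] inside')
print(matchObj.group())             # [abc]
```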
While we can explicitly define groups of characters by stating all possibilities, for a
range of consecutive characters we can use a shorter notation. Thus instead of doing the
following to match any grade from A through to E:
text = 'passed the exam with grade D'
regexObj = re.compile('grade [ABCDE]')
matchObj = regexObj.search(text)
we could use a range specified with the group ‘[A-E]’.
regexObj = re.compile('grade [A-E]')
matchObj = regexObj.search(text)
In a similar way you can define range groups like ‘[A-Z]’, ‘[a-z]’ or ‘[a-z0-9]’, the last
of which would match lower-case letters or digits. For some of the more general, regularly
used character groups there are some even simpler codes. Hence instead of the complete
digit character group ‘[0-9]’ the code ‘\d’ can be used instead:
regexObj = re.compile('grade \d')
regexObj.search('Wizard grade 1')
regexObj.search('Wizard grade 2')
As described in the table below the commonly used group codes are:
\s \d \w \S \D \W
Respectively these represent whitespace, digit, alphanumeric/underscore and their
opposites: non-whitespace, non-digit, non-alphanumeric/underscore. Accordingly, in one
of the above examples we could use ‘\w’ instead, and this will match both letters and
numbers:
regexObj = re.compile('grade \w') # 'Wordy' character [0-9a-zA-Z_]
regexObj.search('grade D')
regexObj.search('grade 1')
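The six codes can be compared side by side by applying each to the same short string. The sample string below is made up for illustration. Note also that in Python 3, with ordinary text strings, ‘\w’ and ‘\d’ are Unicode-aware by default, so ‘\w’ matches more than [0-9a-zA-Z_] unless the re.ASCII flag is given.

```python
import re

sample = 'A1 b_2!'

# Show which characters of the sample each code picks out
for code in [r'\d', r'\D', r'\w', r'\W', r'\s', r'\S']:
    print(code, re.findall(code, sample))

digits = re.findall(r'\d', sample)   # ['1', '2']
wordy = re.findall(r'\w', sample)    # ['A', '1', 'b', '_', '2']
others = re.findall(r'\W', sample)   # [' ', '!']
```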
Sometimes a regular expression cannot be expressed in terms of simple character
groups and codes. For example, we may wish to accept alternative words. In such cases
we can simply specify complete substring alternatives using ‘|’, which acts as an OR
operator.
regexObj = re.compile('\s(trousers|pants)\.')
matchObj = regexObj.search('I got mud on my trousers.')
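As a rough illustration, the sketch below applies the same pattern to several sentences and reports which alternative was found. The second and third sentences are invented for the example, and .group(1), which returns the text inside the round brackets, is covered later in this guide.

```python
import re

regexObj = re.compile(r'\s(trousers|pants)\.')
sentences = ['I got mud on my trousers.', 'He tore his pants.', 'Nice shirt.']

found = []
for sentence in sentences:
    matchObj = regexObj.search(sentence)
    found.append(matchObj.group(1) if matchObj else None)

print(found)   # ['trousers', 'pants', None]
```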
So far we have only considered groups and codes for matching single characters.
Naturally we often want to find more than one character from a given group. This is
achieved using the ‘+’ symbol, which means to match one or more (as many as are
available). For example, to match multiple digits:
regexObj = re.compile('\d+')
matchObj = regexObj.search('In the year of 1949.')
print(matchObj.group()) # Print the part that matches
In this case we would match the multiple digits of the substring ‘1949’. Note that this
matching is ‘greedy’, so gets all of the sequence of digits. Multiple character groups and
codes can be used in the same regular expression. So adapting the above example we
could also match one or more non-digit characters, specified with ‘\D+’ before any
number of digits ‘\d+’.
regexObj = re.compile('\D+\d+') # One or more non-digit, one or more digit
matchObj = regexObj.search('I arrived in 1949 from Cuba.')
print(matchObj.group())
The result here is ‘I arrived in 1949’, so you can see that the ‘\D+’ matched all the
initial characters and the ‘\d+’ matched the year in the combined pattern. As an alternative
we could use the code ‘.+’, which means one or more of any character, but it should be
noted that this will also match to digits. Hence, if we do the following:
regexObj = re.compile('.+\d+') # One or more of anything, one or more digit
matchObj = regexObj.search('I arrived in 1949 from Cuba.')
print(matchObj.group())
The final result is the same as before but the ‘.+’ code actually matches ‘I arrived in
194’; it is ‘greedy’ and matches as many characters as possible before the next code ‘\d+’,
which here only matches the single character ‘9’. As another example, you may wish to
match digits only if they are preceded by a space (or other whitespace character like ‘\t’,
‘\r’ or ‘\n’). In this case you could do:
regexObj = re.compile('\s\d+')
regexObj.search('Year 2013') # Success
regexObj.search('Year2178') # Gives None: no whitespace before digit
Note that here we could also use ‘\s+’, but searching for multiple spaces doesn’t make
any difference if all we require is one. If the pattern really must have a space, and nothing
else, then re.compile(‘ \d+’) can be used.
As far as the output is concerned, the examples so far don’t distinguish the different
parts of the regular expression. However, we may wish to segregate the different parts of
the matched string so that we can extract it separately using the .group() or .groups()
method of the match object. Hence, considering the extraction of numbers from the above
example we can use round brackets () to define a group so that although the match must
include whitespace we can access the digits separately:
regexObj = re.compile('\s(\d+)')
matchObj = regexObj.search('In the year of 1949.')
print(matchObj.group(1)) # '1949' - digits only
print(matchObj.groups()) # ('1949',) – all match groups as a tuple
Note that using .group(0) gives the complete match substring. Naturally, if there are
more groups these take subsequent numbers, as in the following example, where ‘\D’
means non-digit:
regexObj = re.compile('(\d+)\D+(\d+)')
matchObj = regexObj.search('The 14th day of January 1865.')
print(matchObj.group(0)) # '14th day of January 1865'
print(matchObj.group(1)) # '14'
print(matchObj.group(2)) # '1865'
print(matchObj.groups()) # ('14', '1865')
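Groups also give a way to verify the greedy behaviour described earlier for ‘.+\d+’ and ‘\D+\d+’. In this sketch each part of the pattern is simply wrapped in round brackets (an addition for inspection only) so that we can see what it consumed:

```python
import re

text = 'I arrived in 1949 from Cuba.'

# '.+' is greedy and leaves only the final digit for '\d+'
greedyAny = re.search(r'(.+)(\d+)', text)
print(greedyAny.groups())    # ('I arrived in 194', '9')

# '\D+' cannot consume digits, so '\d+' gets the whole year
nonDigit = re.search(r'(\D+)(\d+)', text)
print(nonDigit.groups())     # ('I arrived in ', '1949')
```

The overall match, .group(0), is the same in both cases; only the division between the parts differs.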
There are some tricky details when using special character codes. Consider the
following, for example:
text = 'C:\data'
regexObj = re.compile('\data')
matchObj = regexObj.search(text)
print(matchObj.group()) # Fails
The problem arises because ‘\d’ is a special code for digit characters and is not
interpreted as a literal backslash character followed by a ‘d’. Now, given what we
mentioned above about escaping characters by prepending them with ‘\’, you might expect the
following to treat the backslash literally:
regexObj = re.compile('\\data')
print(regexObj.search(text).group()) # Still fails!
To get this to work as we intended we need the following completely horrid pattern:
regexObj = re.compile('\\\\data') # Works – Yuck!
print(regexObj.search(text).group())
The problem we have encountered here occurs because in reality there are actually two
rounds of string interpretation and in both rounds ‘\’ is an escape code. The first
interpretation is the normal Python string handling, and here ‘\\’ means a literal
backslash (remembering that a backslash is also used for whitespace codes like ‘\n’, ‘\t’ etc.).
The second interpretation is the interpretation as a regular expression, which has its own
set of escape codes. Thus the first round of string interpretation in compile(‘\\data’)
replaces the double backslash code with a single literal backslash, so that by the time the
string gets to the regular expression interpretation the double backslash is removed and
we’re back at ‘\d’. At this point it should be noted that we didn’t have this problem with
the ‘\[’ or ‘\]’ in previous examples because these only act as escape codes in regular
expressions, not in regular Python strings.
As you can see, we can add yet more backslashes so that the removal of double slashes
in the string interpretation leaves enough for the regular expression. However, there is a
much more palatable way of doing things by disabling the first round of escape character
interpretation using what are called raw strings, written with the r'' or r"" syntax, so
that backslashes are treated literally. Hence we can do:
regexObj = re.compile(r'\\data') # Raw string
print(regexObj.search(text).group()) # Success!
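To confirm that the raw string and the four-backslash version really are the same pattern, the two can be compared directly. In this sketch the text is written with an explicitly escaped backslash ('C:\\data'), which holds the same characters as the example above; the names plain and raw are illustrative.

```python
import re

text = 'C:\\data'                 # the seven characters C:\data
plain = '\\\\data'                # normal string: four backslashes typed
raw = r'\\data'                   # raw string: two backslashes typed

print(plain == raw)               # True, both hold the six characters \\data
print(len(raw))                   # 6
print(re.search(raw, text).group())   # \data
```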
Another aspect of regular expressions which should be noted is that the matches are
done as close to the start of the queried string as possible. Hence, if there are two
possibilities which both match in theory it is the first that is matched in practice. For
example:
regexObj = re.compile('\d+')
matchObj = regexObj.search('In the years 1949 and 1954.')
print(matchObj.group()) # '1949' - First match
Where there are multiple matches for the pattern we can use .findall() to get multiple
match occurrences, noting that this gives back the matches’ substrings rather than a
MatchObject (we could get such objects using .finditer(), as we show later):
regexObj = re.compile('\d+')
matchStrs = regexObj.findall('In the years 1949, 1954 and 1963.')
for matchStr in matchStrs:
    print(matchStr)
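For comparison, a short sketch of the difference between the two methods: .findall() returns plain strings, while .finditer() yields match objects from which positions can also be read. The variable names are illustrative.

```python
import re

text = 'In the years 1949, 1954 and 1963.'
regexObj = re.compile(r'\d+')

years = regexObj.findall(text)                          # plain strings
spans = [m.span() for m in regexObj.finditer(text)]     # positions from match objects

print(years)    # ['1949', '1954', '1963']
print(spans)    # one (start, end) tuple per match
```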
So far we have only considered regular expressions where a particular character code or
group must occur. However, there are situations when a group of characters may
sometimes be absent. For example, consider the following strings where we wish to
extract the numeric data after equal signs but where there may or may not be multiple
spaces before the digits.
s1 = 'first=123457'
s2 = 'second= 6'
s3 = 'third= 8768'
All of these numeric substrings can be extracted with a single regular expression. Here
‘*’ means zero or more (as applied to the preceding character) so we have flexibility with
regard to the presence of spaces before getting one or more digits:
regexObj = re.compile('= *\d+')
print(regexObj.search(s3).group())
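The same idea can be applied to all of the strings at once. In the sketch below the sample strings are assumed to contain zero, four and two spaces after the equals sign (the exact spacing is an assumption made for illustration), and round brackets, described earlier, are added so that only the digits are returned.

```python
import re

# Assumed sample strings with varying numbers of spaces after '='
samples = ['first=123457', 'second=    6', 'third=  8768']

regexObj = re.compile(r'= *(\d+)')   # brackets capture only the digits
values = [regexObj.search(s).group(1) for s in samples]
print(values)   # ['123457', '6', '8768']
```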
Taking this kind of example further, the extraction of numbers may be further
complicated with the presence or absence of minus signs and decimal points. However, we
only accept the presence of a single minus sign and/or a single decimal point, so we use ‘?’
to mean zero or one (and not more). Considering the following string:
line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'
A regular expression to match all the numbers must account for zero or more spaces ‘ *’,
an optional minus sign ‘-?’, one or more digits ‘\d+’, an optional decimal point ‘\.?’
(remembering the backslash because a plain dot is a code for any character) and then any
optional remaining digits ‘\d*’. The resulting regular expression may seem somewhat
unreadable at first glance, but it is readily broken down into its component parts:
regexObj = re.compile('= *(-?\d+\.?\d*)')
for match in regexObj.finditer(line): # iterates through all match objects
    print(match.group(1))
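A sketch of how the extracted substrings might then be used, here converting them to floating point numbers. It uses the escaped decimal point ‘\.?’ described above, and the second half illustrates why the backslash matters: with a plain dot the optional character could be anything, such as a comma. The test string 'n=5, m=6' is invented for the illustration.

```python
import re

line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'
regexObj = re.compile(r'= *(-?\d+\.?\d*)')

values = [float(match.group(1)) for match in regexObj.finditer(line)]
print(values)   # [123.457, 1.8, -7.869, -10.0]

# With an unescaped dot the optional character can be anything, e.g. a comma
loose = re.compile(r'= *(-?\d+.?\d*)')
looseText = loose.search('n=5, m=6').group(1)
strictText = regexObj.search('n=5, m=6').group(1)
print(repr(looseText), repr(strictText))   # '5,' '5'
```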
Note that by bracketing the part of the character specification that includes the numbers
and any minus sign we can get just the numeric part with .group(1). So far we have
considered codes for zero or one ‘?’, zero or more ‘*’ and one or more ‘+’, but naturally
there are other possibilities, such as allowing between two and four, but no more or less.
In this case we use the curly brace specification in the form ‘{minAllowed,maxAllowed}’
(written without a space after the comma). Hence to allow two, three or four whitespace characters ‘\s’ before digits
we could do:
regexObj = re.compile('=\s{2,4}\d+') # From two to four, inclusive
If only one number is given in braces then there must be exactly that number of
characters for a match:
regexObj = re.compile('=\s{2}\d+') # Exactly two
If the first number is omitted, with a comma still present, then the minimum number
defaults to zero. So the following accepts up to two whitespace characters, but no more:
regexObj = re.compile('=\s{,2}\d+') # Zero to two
If the second number after the comma is omitted the maximum number of occurrences
is unlimited. The following accepts two or more whitespace characters:
regexObj = re.compile('=\s{2,}\d+') # Two or more
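The four brace forms can be compared by trying each against test strings with between zero and five spaces after the equals sign. This is only a sketch: the test strings (of the form 'x=   7') and the dictionary labels are invented for the purpose.

```python
import re

patterns = {
    'two to four': r'=\s{2,4}\d+',
    'exactly two': r'=\s{2}\d+',
    'zero to two': r'=\s{,2}\d+',
    'two or more': r'=\s{2,}\d+',
}

# For each pattern, record how many spaces (0 to 5) after '=' give a match
accepted = {}
for label, pattern in patterns.items():
    accepted[label] = [n for n in range(6)
                       if re.search(pattern, 'x=' + ' ' * n + '7')]

for label, counts in accepted.items():
    print(label, counts)
```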
Moving on from simply matching and extracting substrings, the re module and
RegexObject have a substitution method .sub(). Here, if the pattern matches, the matching
substring is replaced with another substring, yielding a new string. In the following
example any negative integer numbers are replaced with ‘neg!’:
text = 'N: -9 4 -2 7 8 -8'
regexObj = re.compile('-\d+')
newText = regexObj.sub('neg!', text) # Gives 'N: neg! 4 neg! 7 8 neg!'
Alternatively the replacement substring can simply be empty, so that the matches are
removed. Here we remove any negative numbers and preceding whitespace:
text = 'N: -9 4 -2 7 8 -8'
regexObj = re.compile('\s+-\d+')
newText = regexObj.sub('', text) # Gives 'N: 4 7 8'
If we wish to keep the digits we found after the minus sign we can capture them in a
group and then recall them in the replacement text using the ‘\1’ etc. (a second group
would be ‘\2’). Hence the following matches both the minus sign and digits, but puts the
digits back into the new string after a space:
text = 'N: -9 4 -2 7 8 -8'
regexObj = re.compile('\s+-(\d+)')
newText = regexObj.sub(r' \1', text) # Gives 'N: 9 4 2 7 8 8'
Note that here we use the raw string notation r'' so that Python uses the characters
literally and does not attempt to interpret ‘\1’ as an escaped control character.
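To illustrate recalling more than one group, the following sketch swaps the two numbers either side of a hyphen using ‘\2’ and ‘\1’. The input string is made up for the example.

```python
import re

text = 'range 10-20 and 3-7'
regexObj = re.compile(r'(\d+)-(\d+)')

# Put the second group first and the first group second
swapped = regexObj.sub(r'\2-\1', text)
print(swapped)   # 'range 20-10 and 7-3'
```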
Another useful operation involving regular expressions is to split a string according to a
pattern. The method for this (of both the re module and RegexObject) is .split(). This is
like the string method with the same name, in that it gives a list of strings by breaking a
long string at the points where a separator matches, but the matching is done with a
regular expression, not just an exact substring of characters. The following is an example
of splitting with a regular expression, but take special note that the pattern uses ‘.+?’. If
you try the example without the question mark you will see that the ‘.+’, meaning one or
more of any character, is too greedy and will match all of the rest of the string, up to the
last angle bracket ‘>’. Rather what we want is for the pattern to match conservatively and
only go up to the next ‘>’, hence we use ‘+?’ which means a minimalistic search for one or
more:
text = '<br>Paris<br>London<br>Berlin<br>New York' # tag names are illustrative
regexObj = re.compile('<.+?>')
print(regexObj.split(text)) # ['', 'Paris', 'London', 'Berlin', 'New York']
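The difference between the greedy and minimal forms can be seen by running both on the same input. In this sketch the input string is hypothetical (the tag name is an assumption; any tags of the form <...> behave the same way):

```python
import re

# Hypothetical input; any tags of the form <...> behave the same way
text = '<br>Paris<br>London<br>Berlin<br>New York'

lazy = re.split(r'<.+?>', text)    # each match stops at the next '>'
greedy = re.split(r'<.+>', text)   # one match runs to the last '>'

print(lazy)     # ['', 'Paris', 'London', 'Berlin', 'New York']
print(greedy)   # ['', 'New York']
```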
There are many more subtleties and options with regular expressions in Python. Many
of these are detailed in the following tables, but we recommend reading the main on-line
Python documentation for the complete picture.