Python Programming for Biology: Bioinformatics and Beyond

A guide to regular expressions

[Tim J. Stevens, Wayne Boucher] Python Programming

Moving on to practical matters, the following is an introductory guide to the basic use of

regular expressions in Python. To use regular expressions in Python we must first import

the re module:

import re

or import just the functions we need:

from re import search, compile, sub

We use the former approach here. To begin, we will consider looking for a particular

substring within a larger string using re.search. This takes the general form:

matchObj = re.search(regExpPattern, textString)

And in practice we would do something like the following, where for illustrative

purposes we use an exact string as a pattern to search for:

text = 'Antidisestablishmentarianism'

matchObjA = re.search('establish', text) # Present – gives MatchObject

matchObjB = re.search('banana', text) # Absent - gives None

As you can see, the search gives back a special MatchObject if successful or None if

the search failed. A MatchObject may then be interrogated to determine where the match

occurred, and what the substring was etc.

If the search pattern doesn’t change often compared to the number of searches then it

may be quicker to compile the pattern from the regular expression string once at the start,

before then applying the pattern repeatedly. In this instance to ‘compile’ a regular

expression means to interpret the pattern specification (which is initially just text) and

create a RegexObject, which has methods (bound functions) to perform searches etc. on

that particular pattern. So adapting an example from above we could do the following,

noting the .search() now comes from regexObj, the compiled regular expression object:

regexObj = re.compile('establish')

matchObj = regexObj.search(text)

print(matchObj)

Various useful functionalities are associated with the match object:

print(matchObj.group()) # The substring that was found/matched

print(matchObj.start()) # index in the string of the start of match

print(matchObj.end()) # index in the string for just after end of match

print(matchObj.span()) # (start, end), where end is the index just after the match
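As a quick check of how these values relate, the following minimal sketch reuses the 'establish' example from above. It relies on the fact that the end index is exclusive, so slicing the original text with the span gives back the matched substring.

```python
import re

text = 'Antidisestablishmentarianism'
matchObj = re.search('establish', text)

# span() is the pair (start(), end()); end is the index just past the match
start, end = matchObj.span()
print(start, end)                          # 7 16

# Because end is exclusive, an ordinary slice recovers the matched substring
print(text[start:end])                     # establish
print(text[start:end] == matchObj.group()) # True
```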

In general we need to check that a search was successful, i.e. that it did not give None,

before proceeding to interrogate any match object:

regexObj = re.compile('Green')

texts = ['Green tomatoes', 'Red brick house']

for text in texts:
    matchObj = regexObj.search(text)
    if matchObj is None:
        print('Pattern does not match')
    else:
        print(matchObj.span())

Considering the above search pattern ‘Green’, we may wish to be less specific and also

accept lower-case ‘green’. The regular expression string for this could be ‘[Gg]reen’, so

that we accept either upper- or lower-case letters at the start of the word.

regexObj = re.compile('[Gg]reen')

print(regexObj.search('Green door'))

print(regexObj.search('Fried green tomatoes'))

As you may expect the above regular expression was made more general by creating a

group of accepted characters using square brackets []. Hence, square brackets have a

special meaning when they are in a regular expression. There are other characters with

special meanings, and the complete list is:

. ^ $ * + ? { } [ ] \ | ( )

See the table below for what these are used for. If we want to have these characters used

in a literal way then we have to put a ‘\’ in front so that it escapes any special

interpretation (in the jargon the character is said to have been ‘escaped’).

regexObj = re.compile('\[abc\]') # Match the bracket characters

regexObj.search('Text with exactly [abc] inside')
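When the text to be matched literally comes from elsewhere, adding the backslashes by hand is error-prone. As a sketch of one alternative, the re module provides re.escape(), which escapes the special characters for us. The variable name literal below is only illustrative.

```python
import re

text = 'Text with exactly [abc] inside'
literal = '[abc]'   # illustrative: a string we want to match exactly as written

# Unescaped, the square brackets define a character group: a, b or c
print(re.search(literal, text).group())   # a

# re.escape() adds the backslashes needed for a literal match
regexObj = re.compile(re.escape(literal))
print(regexObj.search(text).group())      # [abc]
```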

While we can explicitly define groups of characters by stating all possibilities, for a

range of consecutive characters we can use a shorter notation. Thus instead of doing the

following to match any grade from A through to E:

text = 'passed the exam with grade D'

regexObj = re.compile('grade [ABCDE]')

matchObj = regexObj.search(text)

we could use a range specified with the group ‘[A-E]’.

regexObj = re.compile('grade [A-E]')

matchObj = regexObj.search(text)

In a similar way you can define range groups like ‘[A-Z]’, ‘[a-z]’ or ‘[a-z0-9]’, the last

of which would match lower-case letters or digits. For some of the more general, regularly

used character groups there are some even simpler codes. Hence instead of the complete

digit character group ‘[0-9]’ the code ‘\d’ can be used instead:

regexObj = re.compile('grade \d')

regexObj.search('Wizard grade 1')

regexObj.search('Wizard grade 2')

As described in the table below the commonly used group codes are:

\s \d \w \S \D \W

Respectively these represent whitespace, digit, alphanumeric/underscore and their

opposites: non-whitespace, non-digit, non-alphanumeric/underscore. Accordingly, in one

of the above examples we could use ‘\w’ instead, and this will match both letters and

numbers:

regexObj = re.compile('grade \w') # 'Wordy' character [0-9a-zA-Z_]

regexObj.search('grade D')

regexObj.search('grade 1')
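To see what each code picks out, we can list every single-character match in a small string. This is only a sketch: the test string is made up, re.findall() returns all matching substrings, and the patterns are written with an r prefix (a raw string) so that Python passes the backslashes through unchanged.

```python
import re

text = 'grade_A 42\tok!'   # made-up string with a space, a tab and punctuation

print(re.findall(r'\d', text))   # ['4', '2']
print(re.findall(r'\s', text))   # [' ', '\t']
print(re.findall(r'\W', text))   # [' ', '\t', '!'] - underscore counts as 'wordy'
```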

Sometimes a regular expression cannot be expressed in terms of simple character

groups and codes. For example, we may wish to accept alternative words. In such cases

we can simply specify complete substring alternatives using ‘|’, which acts as an OR

operator.

regexObj = re.compile('\s(trousers|pants)\.')

matchObj = regexObj.search('I got mud on my trousers.')
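The round brackets around the alternatives also record which one was found, and this can be read back with .group(1). A minimal sketch follows, in which the second and third sentences are invented for illustration.

```python
import re

regexObj = re.compile(r'\s(trousers|pants)\.')

sentences = ['I got mud on my trousers.',
             'I got mud on my pants.',   # invented example
             'I got mud on my shoes.']   # invented example, should not match

for sentence in sentences:
    matchObj = regexObj.search(sentence)
    if matchObj is None:
        print('No match')
    else:
        print(matchObj.group(1))   # the alternative that was found
```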

So far we have only considered groups and codes for matching single characters.

Naturally we often want to find more than one character from a given group. This is

achieved using the ‘+’ symbol, which means to match one or more (as many as are

available). For example, to match multiple digits:

regexObj = re.compile('\d+')

matchObj = regexObj.search('In the year of 1949.')

print(matchObj.group()) # Print the part that matches

In this case we would match the multiple digits of the substring ‘1949’. Note that this

matching is ‘greedy’, so gets all of the sequence of digits. Multiple character groups and

codes can be used in the same regular expression. So adapting the above example we

could also match one or more non-digit characters, specified with ‘\D+’ before any

number of digits ‘\d+’.

regexObj = re.compile('\D+\d+') # One or more non-digit, one or more digit

matchObj = regexObj.search('I arrived in 1949 from Cuba.')

print(matchObj.group())

The result here is ‘I arrived in 1949’, so you can see that the ‘\D+’ matched all the

initial characters and the ‘\d+’ matched the year in the combined pattern. As an alternative

we could use the code ‘.+’, which means one or more of any character, but it should be

noted that this will also match to digits. Hence, if we do the following:

regexObj = re.compile('.+\d+') # One or more of anything, one or more digit

matchObj = regexObj.search('I arrived in 1949 from Cuba.')

print(matchObj.group())

The final result is the same as before but the ‘.+’ code actually matches ‘I arrived in

194’; it is ‘greedy’ and matches as many characters as possible before the next code ‘\d+’,

which here only matches the single character ‘9’. As another example, you may wish to

match digits only if they are preceded by a space (or other whitespace character like ‘\t’,

‘\r’ or ‘\n’). In this case you could do:

regexObj = re.compile('\s\d+')

regexObj.search('Year 2013') # Success

regexObj.search('Year2178') # Gives None: no whitespace before digit

Note that here we could also use ‘\s+’, but searching for multiple spaces doesn’t make

any difference if all we require is one. If the pattern really must have a space, and nothing

else, then re.compile(‘ \d+’) can be used.

As far as the output is concerned, the examples so far don’t distinguish the different

parts of the regular expression. However, we may wish to segregate the different parts of

the matched string so that we can extract it separately using the .group() or .groups()

method of the match object. Hence, considering the extraction of numbers from the above

example we can use round brackets () to define a group so that although the match must

include whitespace we can access the digits separately:

regexObj = re.compile('\s(\d+)')

matchObj = regexObj.search('In the year of 1949.')

print(matchObj.group(1)) # '1949' - digits only

print(matchObj.groups()) # ('1949',) – all match groups as a tuple

Note that using .group(0) gives the complete match substring. Naturally, if there are

more groups these take subsequent numbers, as in the following example, where ‘\D’

means non-digit:

regexObj = re.compile('(\d+)\D+(\d+)')

matchObj = regexObj.search('The 14th day of January 1865.')

print(matchObj.group(0)) # '14th day of January 1865'

print(matchObj.group(1)) # '14'

print(matchObj.group(2)) # '1865'

print(matchObj.groups()) # ('14', '1865')
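Groups also give a way to check the earlier remarks about greedy matching. As a sketch, bracketing each part of the '.+\d+' and '\D+\d+' patterns shows how the matched text is divided between them.

```python
import re

text = 'I arrived in 1949 from Cuba.'

# Greedy '.+' takes as much as it can, leaving one digit for '\d+'
greedy = re.search(r'(.+)(\d+)', text)
print(greedy.groups())    # ('I arrived in 194', '9')

# '\D+' cannot consume digits, so '\d+' gets the whole year
strict = re.search(r'(\D+)(\d+)', text)
print(strict.groups())    # ('I arrived in ', '1949')

# The complete match is the same in both cases
print(greedy.group(0) == strict.group(0))   # True
```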

There are some tricky details when using special character codes. Consider the

following, for example:

text = 'C:\data'

regexObj = re.compile('\data')

matchObj = regexObj.search(text)

print(matchObj.group()) # Fails

The problem arises because ‘\d’ is a special code for digit characters and is not

interpreted as a literal backslash character followed by a ‘d’. Now, given what we

mentioned above by escaping characters by prepending them with ‘\’ you might expect the

following to work to treat the slash literally.

regexObj = re.compile('\\data')

print(regexObj.search(text).group()) # Still fails!

To get this to work as we intended we need the following completely horrid pattern:

regexObj = re.compile('\\\\data') # Works – Yuck!

print(regexObj.search(text).group())

The problem we have encountered here occurs because in reality there are actually two

rounds of string interpretation and in both rounds ‘\’ is an escape code. The first

interpretation is the normal Python string handling, and here ‘\\’ means a literal

backslash (remembering that a backslash is used for whitespace codes like ‘\n’, ‘\t’ etc.).

The second interpretation is the interpretation as a regular expression, which has its own

set of escape codes. Thus the first round of string interpretation in compile(‘\\data’)

replaces the double backslash code with a single literal backslash, so that by the time the

string gets to the regular expression interpretation the double backslash is removed and

we’re back at ‘\d’. At this point it should be noted that we didn’t have this problem with

the ‘\[’ or ‘\]’ in previous examples because these only act as escape codes in regular

expressions, not in regular Python strings.

As you can see, we can add yet more backslashes so that the removal of double slashes

in the string interpretation leaves enough for the regular expression. However, there is a

much more palatable way of doing things by disabling the first round of escape character

interpretation using what are called raw strings so all characters are treated literally, which

uses the r'...' or r"..." syntax. Hence we can do:

regexObj = re.compile(r'\\data') # Raw string

print(regexObj.search(text).group()) # Success!
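The two rounds of interpretation can be made visible by looking at the strings that Python actually builds. In this sketch the test string is written as 'C:\\data' so that it unambiguously contains a single backslash.

```python
import re

text = 'C:\\data'    # C, colon, one backslash, then 'data'
print(len(text))     # 7

# An ordinary string collapses each '\\' to one backslash; a raw string does not
print(len('\\data'))              # 5
print(len(r'\\data'))             # 6
print('\\\\data' == r'\\data')    # True - the same six characters

# The regular expression then reads '\\' as one literal backslash
matchObj = re.search(r'\\data', text)
print(matchObj.group())           # \data
```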

Another aspect of regular expressions which should be noted is that the matches are

done as close to the start of the queried string as possible. Hence, if there are two

possibilities which both match in theory it is the first that is matched in practice. For

example:

regexObj = re.compile('\d+')

matchObj = regexObj.search('In the years 1949 and 1954.')

print(matchObj.group()) # '1949' - First match

Where there are multiple matches for the pattern we can use .findall() to get multiple

match occurrences, noting that this gives back the matches’ substrings rather than a

MatchObject (we could get such objects using .finditer(), as we show later):

regexObj = re.compile('\d+')

matchStrs = regexObj.findall('In the years 1949, 1954 and 1963.')

for matchStr in matchStrs:
    print(matchStr)
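If the positions of the matches are needed as well as their text, a sketch using .finditer() on the same string looks like the following. Each item it yields is a match object, so .span() is available.

```python
import re

regexObj = re.compile(r'\d+')
text = 'In the years 1949, 1954 and 1963.'

# finditer() yields one match object per non-overlapping match, left to right
for matchObj in regexObj.finditer(text):
    start, end = matchObj.span()
    print(matchObj.group(), start, end)
```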

So far we have only considered regular expressions where a particular character code or

group must occur. However, there are situations when a group of characters may

sometimes be absent. For example, consider the following strings where we wish to

extract the numeric data after equal signs but where there may or may not be multiple

spaces before the digits.

s1 = 'first=123457'

s2 = 'second= 6'

s3 = 'third= 8768'

All of these numeric substrings can be extracted with a single regular expression. Here

‘*’ means zero or more (as applied to the preceding character) so we have flexibility with

regard to the presence of spaces before getting one or more digits:

regexObj = re.compile('= *\d+')

print(regexObj.search(s3).group())

Taking this kind of example further, the extraction of numbers may be further

complicated with the presence or absence of minus signs and decimal points. However, we

only accept the presence of a single minus sign and/or a single decimal point, so we use ‘?’

to mean zero or one (and not more). Considering the following string:

line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'

A regular expression to match all the numbers must account for zero or more spaces ‘ *’,

an optional minus sign ‘-?’, one or more digits ‘\d+’, an optional decimal point ‘\.?’

(remembering the backslash because a plain dot is a code for any character) and then any

optional remaining digits ‘\d*’. The resulting regular expression may seem somewhat

unreadable at first glance, but it is readily broken down into its component parts:

regexObj = re.compile('= *(-?\d+\.?\d*)') # Note the escaped dot '\.'

for match in regexObj.finditer(line): # iterates through all match objects
    print(match.group(1))
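Since the captured substrings are still text, a natural next step is to convert them to numbers. The sketch below does this with float(), and also illustrates why the decimal point needs its backslash. The 'n=12,3' string and the name loose are invented for the comparison.

```python
import re

line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'
regexObj = re.compile(r'= *(-?\d+\.?\d*)')

# Convert each captured numeric substring into a Python float
values = [float(match.group(1)) for match in regexObj.finditer(line)]
print(values)    # [123.457, 1.8, -7.869, -10.0]

# Without the backslash the dot accepts any character between the digits
loose = re.compile(r'= *(-?\d+.?\d*)')
print(loose.search('n=12,3').group(1))      # 12,3
print(regexObj.search('n=12,3').group(1))   # 12
```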

Note that by bracketing the part of the character specification that includes the numbers

and any minus sign we can get just the numeric part with .group(1). So far we have

considered codes for zero or one ‘?’, zero or more ‘*’ and one or more ‘+’, but naturally

there are other possibilities, such as allowing between two and four, but no more or less.

In this case we use the curly brace specification in the form ‘{minAllowed,

maxAllowed}’. Hence to allow two, three or four whitespace characters ‘\s’ before digits

we could do:

regexObj = re.compile('=\s{2,4}\d+') # From two to four, inclusive

If only one number is given in braces then there must be exactly that number of

characters for a match:

regexObj = re.compile('=\s{2}\d+') # Exactly two

If the first number is omitted, with a comma still present, then the minimum number

defaults to zero. So the following accepts up to two whitespace characters, but no more:

regexObj = re.compile('=\s{,2}\d+') # Zero to two

If the second number after the comma is omitted the maximum number of occurrences

is unlimited. The following accepts two or more whitespace characters:

regexObj = re.compile('=\s{2,}\d+') # Two or more
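To compare the four brace forms side by side, we can try each pattern on strings with an increasing number of spaces after the equals sign. This is a sketch only: the helper matchingSpecs() and the 'x=...7' test strings are invented for the purpose.

```python
import re

# The four patterns from the text, keyed by their brace specification
patterns = {
    '{2,4}': r'=\s{2,4}\d+',
    '{2}': r'=\s{2}\d+',
    '{,2}': r'=\s{,2}\d+',
    '{2,}': r'=\s{2,}\d+',
}

def matchingSpecs(nSpaces):
    """Return the brace specifications that accept nSpaces spaces after '='."""
    text = 'x=' + ' ' * nSpaces + '7'   # invented test string
    return [spec for spec, pattern in patterns.items()
            if re.search(pattern, text)]

for nSpaces in range(6):
    print(nSpaces, matchingSpecs(nSpaces))
```

With no spaces or one space only ‘{,2}’ matches, with exactly two spaces all four match, and with five spaces only ‘{2,}’ does.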

Moving on from simply matching and extracting substrings, the re module and

RegexObject have a substitution method .sub(). Here, if the pattern matches, the matching

substring is replaced with another substring, yielding a new string. In the following

example any negative integer numbers are replaced with ‘neg!’:

text = 'N: -9 4 -2 7 8 -8'

regexObj = re.compile('-\d+')

newText = regexObj.sub('neg!', text) # Gives 'N: neg! 4 neg! 7 8 neg!'

Alternatively the replacement substring can simply be empty, so that the matches are

removed. Here we remove any negative numbers and preceding whitespace:

text = 'N: -9 4 -2 7 8 -8'

regexObj = re.compile('\s+-\d+')

newText = regexObj.sub('', text) # Gives 'N: 4 7 8'

If we wish to keep the digits we found after the minus sign we can capture them in a

group and then recall them in the replacement text using the ‘\1’ etc. (a second group

would be ‘\2’). Hence the following matches both the minus sign and digits, but puts the

digits back into the new string after a space:

text = 'N: -9 4 -2 7 8 -8'

regexObj = re.compile('\s+-(\d+)')

newText = regexObj.sub(r' \1', text) # Gives 'N: 9 4 2 7 8 8'

Note that here we use the raw string notation r'...' so that Python uses the characters

literally and does not attempt to interpret ‘\1’ as an escaped control character.
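A short sketch shows what goes wrong without the raw string: in an ordinary Python string ‘\1’ is an octal escape for the character with code 1, so the group reference never reaches the re module. Doubling the backslash is the non-raw alternative.

```python
import re

# Without the r prefix Python converts '\1' into a single control character
print('\1' == '\x01')   # True
print(len(r'\1'))       # 2 - a backslash and the digit 1

text = 'N: -9 4 -2 7 8 -8'
regexObj = re.compile(r'\s+-(\d+)')

print(regexObj.sub(r' \1', text))       # N: 9 4 2 7 8 8
print(regexObj.sub(' \\1', text))       # same result, backslash doubled
print(repr(regexObj.sub(' \1', text)))  # control characters, not digits
```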

Another useful operation involving regular expressions is to split a string according to a

pattern. The method for this (of both the re module and RegexObject) is .split(). This is

like the string method with the same name, in that it gives a list of strings by breaking a

long string at the points where a separator matches, but the matching is done with a

regular expression, not just an exact substring of characters. The following is an example

of splitting with a regular expression, but take special note that the pattern uses ‘.+?’. If

you try the example without the question mark you will see that the ‘.+’, meaning one or

more of any character, is too greedy and will match all of the rest of the string, up to the

last angle bracket ‘>’. Rather what we want is for the pattern to match conservatively and

only go up to the next ‘>’, hence we use ‘+?’ which means a minimalistic search for one or

more characters.

text = '<a>Paris<b>London<c>Berlin<d>New York' # Tag names are illustrative

regexObj = re.compile('<.+?>')

print(regexObj.split(text)) # ['', 'Paris', 'London', 'Berlin', 'New York']
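The effect of the question mark is easiest to see by running both versions of the pattern. In this sketch the bracketed tags in the test string are hypothetical placeholders; any short tags between the city names would behave the same way.

```python
import re

text = '<a>Paris<b>London<c>Berlin<d>New York'   # hypothetical tags

# Minimal '.+?' stops at the first '>' so every tag is a separate separator
print(re.split(r'<.+?>', text))   # ['', 'Paris', 'London', 'Berlin', 'New York']

# Greedy '.+' runs on to the last '>' and swallows the first three cities
print(re.split(r'<.+>', text))    # ['', 'New York']
```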

There are many more subtleties and options with regular expressions in Python. Many

of these are detailed in the following tables, but we recommend reading the main on-line

Python documentation for the complete picture.
