Python Programming for Biology: Bioinformatics and Beyond


A guide to regular expressions



[Tim J. Stevens, Wayne Boucher] Python Programming


Moving on to practical matters, the following is an introductory guide to the basic use of

regular expressions in Python. To use regular expressions in Python we must first import

the re module:

import re

or just the required functions from it:

from re import search, compile, sub

We  use  the  former  approach  here.  To  begin,  we  will  consider  looking  for  a  particular

substring within a larger string using re.search. This takes the general form:

matchObj = re.search(regExpPattern, textString)

And  in  practice  we  would  do  something  like  the  following,  where  for  illustrative

purposes we use an exact string as a pattern to search for:

text = 'Antidisestablishmentarianism'

matchObjA = re.search('establish', text) # Present – gives MatchObject




matchObjB = re.search('banana', text) # Absent - gives None

As  you  can  see,  the  search  gives  back  a  special  MatchObject if successful or None  if

the search failed. A MatchObject may then be interrogated to determine where the match

occurred, and what the substring was etc.

If the search pattern doesn’t change often compared to the number of searches then it

may be quicker to compile the pattern from the regular expression string once at the start,

before  then  applying  the  pattern  repeatedly.  In  this  instance  to  ‘compile’  a  regular

expression  means  to  interpret  the  pattern  specification  (which  is  initially  just  text)  and

create  a  RegexObject,  which  has  methods  (bound  functions)  to  perform  searches  etc.  on

that  particular  pattern.  So  adapting  an  example  from  above  we  could  do  the  following,

noting the .search() now comes from regexObj, the compiled regular expression object:

regexObj = re.compile('establish')

matchObj = regexObj.search(text)

print(matchObj)

Various useful functionalities are associated with the match object:

print(matchObj.group()) # The substring that was found/matched

print(matchObj.start()) # index in the string of the start of match

print(matchObj.end()) # index in the string for just after end of match

print(matchObj.span()) # (start, end) tuple, where end is the index just after the match
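To make the meaning of these indices concrete, the following minimal sketch reuses the word from the earlier example and checks that slicing the original string with the start and end values recovers the matched substring. The variable names are only illustrative.

```python
import re

text = 'Antidisestablishmentarianism'
matchObj = re.search('establish', text)

start, end = matchObj.span()   # same values as .start() and .end()
print(start, end)              # end is the index just past the match
print(text[start:end])         # slicing recovers the matched substring
print(text[start:end] == matchObj.group())
```

Because the end index points just past the match, it can be used directly as the upper limit of a Python slice.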

In general we need to check that a search was successful, i.e. that it did not give None,

before proceeding to interrogate any match object:

regexObj = re.compile('Green')

texts = ['Green tomatoes', 'Red brick house']

for text in texts:
    matchObj = regexObj.search(text)
    if matchObj is None:
        print('Pattern does not match')
    else:
        print(matchObj.span())

Considering the above search pattern ‘Green’, we may wish to be less specific and also

accept  lower-case  ‘green’.  The  regular  expression  string  for  this  could  be  ‘[Gg]reen’,  so

that we accept either upper- or lower-case letters at the start of the word.


regexObj = re.compile('[Gg]reen')



print(regexObj.search('Green door'))

print(regexObj.search('Fried green tomatoes'))

As you may expect the above regular expression was made more general by creating a

group  of  accepted  characters  using  square  brackets  [].  Hence,  square  brackets  have  a

special  meaning  when  they  are  in  a  regular  expression.  There  are  other  characters  with

special meanings, and the complete list is:

. ^ $ * + ? { } [ ] \ | ( )



See the table below for what these are used for. If we want to have these characters used

in  a  literal  way  then  we  have  to  put  a  ‘\’  in  front  so  that  it  escapes  any  special

interpretation (in the jargon the character is said to have been ‘escaped’).

regexObj = re.compile('\[abc\]') # Match the bracket characters

regexObj.search('Text with exactly [abc] inside')
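When the literal text to be matched comes from elsewhere, adding each backslash by hand is error-prone. As a small sketch, the standard library function re.escape() can insert the backslashes for us. The example text is the same as above and the variable names are only illustrative.

```python
import re

literal = '[abc]'                   # text we want to match literally
pattern = re.escape(literal)        # backslashes are added before '[' and ']'
regexObj = re.compile(pattern)

matchObj = regexObj.search('Text with exactly [abc] inside')
print(matchObj.group())             # the literal bracketed text

# Without escaping, '[abc]' is a character group matching one of a, b or c
print(re.search(literal, 'Text with exactly [abc] inside').group())
```

The second search illustrates what happens if the escaping is forgotten: the pattern still matches, but only a single character.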

While  we  can  explicitly  define  groups  of  characters  by  stating  all  possibilities,  for  a

range of consecutive characters we can use a shorter notation. Thus instead of doing the

following to match any grade from A through to E:

text = 'passed the exam with grade D'

regexObj = re.compile('grade [ABCDE]')

matchObj = regexObj.search(text)

we could use a range specified with the group ‘[A-E]’.

regexObj = re.compile('grade [A-E]')

matchObj = regexObj.search(text)

In a similar way you can define range groups like ‘[A-Z]’, ‘[a-z]’ or ‘[a-z0-9]’, the last

of which would match lower-case letters or digits. For some of the more general, regularly

used character groups there are some even simpler codes. Hence instead of the complete

digit character group ‘[0-9]’ the code ‘\d’ can be used instead:

regexObj = re.compile('grade \d')

regexObj.search('Wizard grade 1')

regexObj.search('Wizard grade 2')

As described in the table below the commonly used group codes are:

\s \d \w \S \D \W

Respectively  these  represent  whitespace,  digit,  alphanumeric/underscore  and  their

opposites;  non-whitespace,  non-digit,  non-alphanumeric/underscore.  Accordingly,  in  one

of  the  above  examples  we  could  use  ‘\w’  instead,  and  this  will  match  both  letters  and

numbers:

regexObj = re.compile('grade \w') # 'Wordy' character [0-9a-zA-Z_]

regexObj.search('grade D')

regexObj.search('grade 1')
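As a rough illustration of how the six codes divide up a string, the sketch below applies each one to a short, made-up sample using re.findall(), which gives a list of every matching substring. The r prefix marks a raw string, which the text discusses further on.

```python
import re

sample = 'a1_ !'   # a letter, a digit, an underscore, a space and punctuation

for code in (r'\s', r'\d', r'\w', r'\S', r'\D', r'\W'):
    # Each code matches one character at a time, so we get single characters
    print(code, re.findall(code, sample))
```

Each lower-case code and its upper-case opposite together account for every character in the sample.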

Sometimes  a  regular  expression  cannot  be  expressed  in  terms  of  simple  character

groups  and  codes.  For  example,  we  may  wish  to  accept  alternative  words.  In  such  cases

we  can  simply  specify  complete  substring  alternatives  using  ‘|’,  which  acts  as  an  OR

operator.

regexObj = re.compile('\s(trousers|pants)\.')

matchObj = regexObj.search('I got mud on my trousers.')
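The following sketch tries the same pattern on a few sentences; the second and third sentences are invented here purely for illustration. Note that the round brackets limit the alternatives to the two words, so the whitespace and full stop are required in either case.

```python
import re

regexObj = re.compile(r'\s(trousers|pants)\.')

sentences = ['I got mud on my trousers.',
             'He ironed his pants.',       # hypothetical extra sentence
             'I got mud on my shirt.']     # hypothetical non-matching sentence

for sentence in sentences:
    matchObj = regexObj.search(sentence)
    if matchObj is None:
        print('No match:', sentence)
    else:
        print(matchObj.group(1))           # the alternative that was found
```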

So  far  we  have  only  considered  groups  and  codes  for  matching  single  characters.

Naturally  we  often  want  to  find  more  than  one  character  from  a  given  group.  This  is

achieved  using  the  ‘+’  symbol,  which  means  to  match  one  or  more  (as  many  as  are



available). For example, to match multiple digits:

regexObj = re.compile('\d+')

matchObj = regexObj.search('In the year of 1949.')

print(matchObj.group()) # Print the part that matches

In this case we would match the multiple digits of the substring ‘1949’. Note that this

matching is ‘greedy’, so gets all of the sequence of digits. Multiple character groups and

codes  can  be  used  in  the  same  regular  expression.  So  adapting  the  above  example  we

could  also  match  one  or  more  non-digit  characters,  specified  with  ‘\D+’  before  any

number of digits ‘\d+’.

regexObj = re.compile('\D+\d+') # One or more non-digit, one or more digit

matchObj = regexObj.search('I arrived in 1949 from Cuba.')

print(matchObj.group())

The  result  here  is  ‘I  arrived  in  1949’,  so  you  can  see  that  the  ‘\D+’  matched  all  the

initial characters and the ‘\d+’ matched the year in the combined pattern. As an alternative

we could use the code ‘.+’,  which  means  one  or  more  of  any  character,  but  it  should  be

noted that this will also match to digits. Hence, if we do the following:

regexObj = re.compile('.+\d+') # One or more of anything, one or more digit

matchObj = regexObj.search('I arrived in 1949 from Cuba.')

print(matchObj.group())

The  final  result  is  the  same  as  before  but  the  ‘.+’  code  actually  matches  ‘I  arrived  in

194’; it is ‘greedy’ and matches as many characters as possible before the next code ‘\d+’,

which  here  only  matches  the  single  character  ‘9’.  As  another  example,  you  may  wish  to

match digits only if they are preceded by a space (or other whitespace character like ‘\t’,

‘\r’ or ‘\n’). In this case you could do:

regexObj = re.compile('\s\d+')

regexObj.search('Year 2013') # Success

regexObj.search('Year2178') # Gives None: no whitespace before digit

Note that here we could also use ‘\s+’, but searching for multiple spaces doesn’t make

any difference if all we require is one. If the pattern really must have a space, and nothing

else, then re.compile(' \d+') can be used.
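The difference between the two patterns only shows up when the whitespace is something other than a plain space. A minimal sketch, with a tab-separated string added here as an assumed test case:

```python
import re

anyWhitespace = re.compile(r'\s\d+')   # any whitespace character, then digits
spaceOnly = re.compile(r' \d+')        # a literal space, then digits

for text in ('Year 2013', 'Year\t2013', 'Year2178'):
    # Report whether each pattern finds a match in this string
    print(repr(text),
          anyWhitespace.search(text) is not None,
          spaceOnly.search(text) is not None)
```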

As  far  as  the  output  is  concerned,  the  examples  so  far  don’t  distinguish  the  different

parts of the regular expression. However, we may wish to segregate the different parts of

the  matched  string  so  that  we  can  extract  it  separately  using  the  .group()  or  .groups()

method of the match object. Hence, considering the extraction of numbers from the above

example we can use round brackets () to define a group so that although the match must

include whitespace we can access the digits separately:

regexObj = re.compile('\s(\d+)')

matchObj = regexObj.search('In the year of 1949.')

print(matchObj.group(1)) # '1949' - digits only

print(matchObj.groups()) # ('1949',) – all match groups as a tuple

Note  that  using  .group(0)  gives  the  complete  match  substring.  Naturally,  if  there  are



more  groups  these  take  subsequent  numbers,  as  in  the  following  example,  where  ‘\D’

means non-digit:

regexObj = re.compile('(\d+)\D+(\d+)')

matchObj = regexObj.search('The 14th day of January 1865.')

print(matchObj.group(0)) # '14th day of January 1865'

print(matchObj.group(1)) # '14'

print(matchObj.group(2)) # '1865'

print(matchObj.groups()) # ('14', '1865')
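The groups are always given back as strings, even when they contain only digits. As a small sketch of a typical next step, the values can be converted to integers once the match has been checked; the names day and year are our own choice here.

```python
import re

regexObj = re.compile(r'(\d+)\D+(\d+)')
matchObj = regexObj.search('The 14th day of January 1865.')

if matchObj is not None:
    # Convert each captured substring to an integer
    day, year = (int(s) for s in matchObj.groups())
    print(day, year)
    print(year - day)   # the values can now be used in arithmetic
```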

There  are  some  tricky  details  when  using  special  character  codes.  Consider  the

following, for example:

text = 'C:\data'

regexObj = re.compile('\data')

matchObj = regexObj.search(text)

print(matchObj.group()) # Fails: the search gave None, which has no .group()

The  problem  arises  because  ‘\d’  is  a  special  code  for  digit  characters  and  is  not

interpreted  as  a  literal  backslash  character  followed  by  a  ‘d’.  Now,  given  what  we

mentioned above by escaping characters by prepending them with ‘\’ you might expect the

following to work to treat the slash literally.

regexObj = re.compile('\\data')

print(regexObj.search(text).group()) # Still fails!

To get this to work as we intended we need the following completely horrid pattern:

regexObj = re.compile('\\\\data') # Works – Yuck!

print(regexObj.search(text).group())

The problem we have encountered here occurs because in reality there are actually two

rounds  of  string  interpretation  and  in  both  rounds  ‘\’  is  an  escape  code.  The  first

interpretation is the normal Python string handling, and here ‘\\’ also means a literal

backslash (remembering that a backslash is used for whitespace codes like ‘\n’, ‘\t’  etc.).

The second interpretation is the interpretation as a regular expression, which has its own

set of escape codes. Thus the first round of string interpretation in compile('\\data')

replaces the double backslash code with a single literal backslash, so that by the time the

string  gets  to  the  regular  expression  interpretation  the  double  backslash  is  removed  and

we’re back at ‘\d’. At this point it should be noted that we didn’t have this problem with

the  ‘\[’  or  ‘\]’  in  previous  examples  because  these  only  act  as  escape  codes  in  regular

expressions, not in regular Python strings.

As you can see, we can add yet more backslashes so that the removal of double slashes

in  the  string  interpretation  leaves  enough  for  the  regular  expression.  However,  there  is  a

much more palatable way of doing things by disabling the first round of escape character

interpretation using what are called raw strings so all characters are treated literally, which

uses the r'' or r"" syntax. Hence we can do:

regexObj = re.compile(r'\\data') # Raw string

print(regexObj.search(text).group()) # Success!
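The two rounds of interpretation can be seen by looking at the pattern strings themselves before they reach the re module. In this sketch the test string is written with an explicit double backslash, which is an equivalent way of writing the same text with one real backslash.

```python
import re

text = 'C:\\data'               # one real backslash between ':' and 'data'

print(len('\\data'))            # 5 characters: one backslash plus 'data'
print(len(r'\\data'))           # 6 characters: two backslashes plus 'data'
print('\\\\data' == r'\\data')  # the same pattern string, written two ways

regexObj = re.compile(r'\\data')
print(regexObj.search(text).group())
```

The raw string hands both backslashes to the regular expression interpreter, which then reads them as one escaped, literal backslash.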



Another  aspect  of  regular  expressions  which  should  be  noted  is  that  the  matches  are

done  as  close  to  the  start  of  the  queried  string  as  possible.  Hence,  if  there  are  two

possibilities  which  both  match  in  theory  it  is  the  first  that  is  matched  in  practice.  For

example:


regexObj = re.compile('\d+')

matchObj = regexObj.search('In the years 1949 and 1954.')

print(matchObj.group()) # '1949' - First match

Where  there  are  multiple  matches  for  the  pattern  we  can  use  .findall()  to  get  multiple

match  occurrences,  noting  that  this  gives  back  the  matches’  substrings  rather  than  a

MatchObject (we could get such objects using .finditer(), as we show later):

regexObj = re.compile('\d+')

matchStrs = regexObj.findall('In the years 1949, 1954 and 1963.')

for matchStr in matchStrs:
    print(matchStr)
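To contrast the two approaches, the following sketch collects the plain strings from .findall() alongside the positions available from the match objects of .finditer(), using the same text as above.

```python
import re

text = 'In the years 1949, 1954 and 1963.'
regexObj = re.compile(r'\d+')

matchStrs = regexObj.findall(text)                    # plain substrings
spans = [m.span() for m in regexObj.finditer(text)]   # positions from match objects

for matchStr, (start, end) in zip(matchStrs, spans):
    print(matchStr, start, end)
```

If only the matched text is needed then .findall() is the simpler choice, while .finditer() is useful when the positions matter.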

So far we have only considered regular expressions where a particular character code or

group  must  occur.  However,  there  are  situations  when  a  group  of  characters  may

sometimes  be  absent.  For  example,  consider  the  following  strings  where  we  wish  to

extract  the  numeric  data  after  equal  signs  but  where  there  may  or  may  not  be  multiple

spaces before the digits.

s1 = 'first=123457'

s2 = 'second= 6'

s3 = 'third= 8768'

All of these numeric substrings can be extracted with a single regular expression. Here

‘*’ means zero or more (as applied to the preceding character) so we have flexibility with

regard to the presence of spaces before getting one or more digits:

regexObj = re.compile('= *\d+')

print(regexObj.search(s3).group())
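As a quick check that one pattern copes with all three cases, the sketch below loops over similar strings. The exact number of spaces in each string is our own assumption, and round brackets are added so that only the digits are captured.

```python
import re

# Hypothetical strings with zero, one and several spaces after the equals sign
strings = ['first=123457', 'second= 6', 'third=    8768']

regexObj = re.compile(r'= *(\d+)')   # the brackets capture only the digits

for s in strings:
    matchObj = regexObj.search(s)
    if matchObj is not None:
        print(matchObj.group(1))
```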

Taking  this  kind  of  example  further,  the  extraction  of  numbers  may  be  further

complicated with the presence or absence of minus signs and decimal points. However, we

only accept the presence of a single minus sign and/or a single decimal point, so we use ‘?’

to mean zero or one (and not more). Considering the following string:

line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'

A regular expression to match all the numbers must account for zero or more spaces ‘ *’,

an  optional  minus  sign  ‘-?’,  one  or  more  digits  ‘\d+’,  an  optional  decimal  point  ‘\.?’

(remembering the backslash because a plain dot is a code for any character) and then any

optional  remaining  digits  ‘\d*’.  The  resulting  regular  expression  may  seem  somewhat

unreadable at first glance, but it is readily broken down into its component parts:

regexObj = re.compile('= *(-?\d+\.?\d*)')

for match in regexObj.finditer(line): # iterates through all match objects
    print(match.group(1))
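Once the numeric substrings are isolated they can be converted into real numbers. This is a minimal sketch using .findall(), which, for a pattern with a single bracketed group, gives back the text of that group for each match.

```python
import re

line = 'p1=123.457, p2= 1.80, delta1= -7.869, delta2=-10'
regexObj = re.compile(r'= *(-?\d+\.?\d*)')

# With one bracketed group, findall() returns that group's text for each match
values = [float(numStr) for numStr in regexObj.findall(line)]
print(values)
```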

Note that by bracketing the part of the character specification that includes the numbers



and  any  minus  sign  we  can  get  just  the  numeric  part  with  .group(1).  So  far  we  have

considered codes for zero or one ‘?’, zero or more ‘*’ and one or more ‘+’, but naturally

there are other possibilities, such as allowing between two and four, but no more or less.

In  this  case  we  use  the  curly  brace  specification  in  the  form  ‘{minAllowed,

maxAllowed}’. Hence to allow two, three or four whitespace characters ‘\s’ before digits

we could do:

regexObj = re.compile('=\s{2,4}\d+') # From two to four, inclusive

If  only  one  number  is  given  in  braces  then  there  must  be  exactly  that  number  of

characters for a match:

regexObj = re.compile('=\s{2}\d+') # Exactly two

If  the  first  number  is  omitted,  with  a  comma  still  present,  then  the  minimum  number

defaults to zero. So the following accepts up to two whitespace characters, but no more:

regexObj = re.compile('=\s{,2}\d+') # Zero to two

If the second number after the comma is omitted the maximum number of occurrences

is unlimited. The following accepts two or more whitespace characters:

regexObj = re.compile('=\s{2,}\d+') # Two or more
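The four forms of the curly brace specification can be compared side by side. In this sketch each pattern is tried on made-up strings with between zero and five spaces after the equals sign, and we record which numbers of spaces are accepted.

```python
import re

patterns = {
    'two to four': re.compile(r'=\s{2,4}\d+'),
    'exactly two': re.compile(r'=\s{2}\d+'),
    'zero to two': re.compile(r'=\s{,2}\d+'),
    'two or more': re.compile(r'=\s{2,}\d+'),
}

for name, regexObj in patterns.items():
    # Record which numbers of spaces (0 to 5) give a successful match
    accepted = [n for n in range(6)
                if regexObj.search('x=' + ' ' * n + '7') is not None]
    print(name, accepted)
```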

Moving  on  from  simply  matching  and  extracting  substrings,  the  re  module  and

RegexObject have a substitution method .sub(). Here, if the pattern matches, the matching

substring  is  replaced  with  another  substring,  yielding  a  new  string.  In  the  following

example any negative integer numbers are replaced with ‘neg!’:


text = 'N: -9 4 -2 7 8 -8'



regexObj = re.compile('-\d+')

newText = regexObj.sub('neg!', text) # Gives 'N: neg! 4 neg! 7 8 neg!'

Alternatively  the  replacement  substring  can  simply  be  empty,  so  that  the  matches  are

removed. Here we remove any negative numbers and preceding whitespace:

text = 'N: -9 4 -2 7 8 -8'

regexObj = re.compile('\s+-\d+')

newText = regexObj.sub('', text) # Gives 'N: 4 7 8'

If we wish to keep the digits we found after the minus sign we can capture them in a

group  and  then  recall  them  in  the  replacement  text  using  the  ‘\1’  etc.  (a  second  group

would be ‘\2’). Hence the following matches both the minus sign and digits, but puts the

digits back into the new string after a space:

text = 'N: -9 4 -2 7 8 -8'

regexObj = re.compile('\s+-(\d+)')

newText = regexObj.sub(r' \1', text) # Gives 'N: 9 4 2 7 8 8'

Note that here we use the raw string notation r'' so that Python uses the characters

literally and does not attempt to interpret ‘\1’ as an escaped control character.
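The effect of forgetting the raw string can be seen in the sketch below, which runs the same substitution both ways. We use repr() when printing so that the otherwise invisible control character shows up as an escape code.

```python
import re

text = 'N: -9 4 -2 7 8 -8'
regexObj = re.compile(r'\s+-(\d+)')

print('\1' == '\x01')                     # a single control character
print(repr(regexObj.sub(' \1', text)))    # control characters inserted, digits lost
print(repr(regexObj.sub(r' \1', text)))   # digits restored from the group
```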

Another useful operation involving regular expressions is to split a string according to a



pattern. The method for this (of both the re module and RegexObject) is .split(). This is

like the string method with the same name, in that it gives a list of strings by breaking a

long  string  at  the  points  where  a  separator  matches,  but  the  matching  is  done  with  a

regular expression, not just an exact substring of characters. The following is an example

of  splitting  with  a  regular  expression,  but  take  special  note  that  the  pattern  uses  ‘.+?’.  If

you try the example without the question mark you will see that the ‘.+’, meaning one or

more of any character, is too greedy and will match all of the rest of the string, up to the

last angle bracket ‘>’. Rather what we want is for the pattern to match conservatively and

only go up to the next ‘>’, hence we use ‘+?’ which means a minimalistic search for one or

more:


text = '<br>Paris<br>London<br>Berlin<br>New York'

regexObj = re.compile('<.+?>')

print(regexObj.split(text)) # ['', 'Paris', 'London', 'Berlin', 'New York']
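The difference between the greedy and minimal forms can be checked directly. In the following sketch the '<br>' separators are an assumption made for illustration; any short tags in angle brackets would behave the same way.

```python
import re

# The '<br>' tags are assumed here purely for illustration
text = '<br>Paris<br>London<br>Berlin<br>New York'

greedy = re.compile('<.+>')
minimal = re.compile('<.+?>')

print(greedy.split(text))    # one long match from the first '<' to the last '>'
print(minimal.split(text))   # one short match per tag
```

With the greedy pattern everything between the first and last angle brackets is treated as a single separator, so the city names in the middle are lost.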

There are many more subtleties and options with regular expressions in Python. Many

of these are detailed in the following tables, but we recommend reading the main on-line

Python documentation for the complete picture.



