Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

String manipulation

Text  items  in  Python  are  called  strings,  referring  to  the  fact  that  they  are  strings  of

characters. String functionality is an important part of the Python toolbox. For example, a

file on disk (covered in Chapter 6) is read as a string or a list of strings; a file can be

viewed  as  a  collection  of  characters.  Here,  even  if  part  of  the  loaded  file  represents  a

number,  it  is  initially  represented  as  a  string  of  characters,  not  a  proper  Python  numeric

object. In Python, strings are not modifiable. This might seem like a limitation, but in fact

it  rarely  is  because  it  is  easy  enough  to  create  a  new,  modified  string  from  an  existing

string.  And  since  strings  are  not  modifiable  it  means  that  they  can  be  placed  in  sets  and

used as keys in dictionaries, both of which are exceedingly useful.
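
To make the last point concrete, here is a minimal sketch of strings used as set members and as dictionary keys. The names codonSet and aminoAcidDict, and the data in them, are invented purely for this illustration.

```python
# Strings cannot be modified, so Python accepts them as members
# of a set and as keys of a dictionary.
codonSet = {'ATG', 'TAA', 'ATG'}      # the repeated 'ATG' is stored once
aminoAcidDict = {'ATG': 'Met', 'TGG': 'Trp'}

numCodons = len(codonSet)             # 2
hasStart = 'ATG' in codonSet          # True
aminoAcid = aminoAcidDict['TGG']      # 'Trp'
```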

In  this  section  we  will  illustrate  some  basic  manipulations  on  strings  using  the

following example string:

text = 'hello world' # same as double quoted "hello world"

In some ways a string can be thought of as a list of characters, although in Python a list

of  characters  would  be  a  different  entity  (see  below  for  a  discussion  of  lists).  Note  that

when  we  refer  to  something  in  a  string  as  being  a  character,  we  don’t  just  mean  the

regular  symbols  for  letters,  numbers  and  punctuation;  we  also  include  spaces  and

formatting  codes  (tab  stop,  new  line  etc.).  You  can  access  the  character  at  a  specific

position, or index, using square brackets:

text[1] # 'e'

text[5] # ' ' – a space

Note that the index for accessing the characters of a string starts counting from 0, not 1.

Thus the first character of a string is index number 0. At first this can seem odd to

non-programmers, but it is by far the most sensible convention, and is used in most modern

computer languages.

Bear in mind that we cannot change the characters of a string. For example, we get an

error if we try to change the first position to an ‘H’:

text[0] = 'H' # Fails!

TypeError: 'str' object does not support item assignment

You can count backwards from the end of the string, where index -1 is the last character

of the string:

text[-3] # 'r'

If a string has n characters, then the minimum value of the index is -n and the

maximum value is n-1. If the index falls outside this range an error is generated; Python

raises an IndexError, an exception object which reports what the error was (see the next

chapter for a description of these).
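
The valid index range can be checked with a short sketch like the one below. It uses len() to get the number of characters and a try/except block to catch the error; the variable names are our own choices, made only for illustration.

```python
text = 'hello world'
n = len(text)            # 11 characters

first = text[-n]         # 'h' - the most negative valid index
last = text[n - 1]       # 'd' - the largest valid index

try:
    text[n]              # one position past the end
except IndexError as err:
    message = str(err)   # 'string index out of range'
```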

Python also has a very convenient slicing notation, to access a substring from within a

string.  The  notation  [start:stop]  refers  to  the  characters  from  position  start  up  to  but  not

including position stop. As with single indices, these positions can be negative. The fact

that it is ‘up to but not including’ might seem odd, but as with the indices counting from 0,

this turns out to be a sensible convention. In particular, if the start and stop numbers in the

slice notation are both non-negative, lie within the string, and stop is not less than start, then

the number of characters in the resulting slice is just the difference between the two values

(stop-start), or put another way [start:start+n] gives n characters.

As a further convenience, if you leave out the start entirely, giving just [:stop], then the

slice starts at the very beginning; the start point is taken to be 0. If you leave out the stop,

giving [start:], then the slice continues to the very end, as if stop were taken to be the

length of the string. Thus, for example, [:n] refers to the first n characters of the string.

text[1:3] # 'el'

text[1:] # 'ello world'

text[1:-1] # 'ello worl'

text[:-1] # 'hello worl'
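
The rule that a slice contains stop-start characters can be seen in a small sketch such as the following. The values of start and n are arbitrary choices for the example.

```python
text = 'hello world'
start = 3
n = 4

piece = text[start:start+n]     # 'lo w' - exactly n characters
firstFive = text[:5]            # 'hello'
lastFive = text[-5:]            # 'world'

# Splitting at any position and rejoining gives back the original string,
# because the stop of one slice is the start of the next
rejoined = text[:5] + text[5:]  # 'hello world'
```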

This leads to the proper way to (effectively) change the first character of the example

string. We can use a slice to access the characters we wish to keep and redefine text:

text = 'H' + text[1:] # 'Hello world'

You can check if a substring is contained in a string:

'wor' in text # True

'war' in text # False

or is not contained in (is absent from) a string:

'wor' not in text # False

'war' not in text # True



There are two functions that let you determine the position of (the first occurrence of) a

substring inside a string:

text.index('wor') # 6

text.find('wor') # 6

Note that the value returned is the index of the first character of the substring in the

string. The difference between these functions is how they deal with the situation when the

substring is not contained in the string. The index() function generates an error (a ValueError),

whereas the find() function returns -1:

text.find('war') # -1

It is a matter of taste which version you use. Nonetheless, it might have been better for

find() to return None if the substring isn’t present. You can search from the (right-hand)

end of the string instead of the beginning:

text.index('l') # 2

text.rindex('l') # 9

text.find('l') # 2

text.rfind('l') # 9
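
Returning to the difference between index() and find() when the substring is missing, the sketch below shows one way each result might be handled. It uses if and try/except constructs, and the variable names are hypothetical. It also shows why the -1 from find() should be checked before use: -1 is itself a valid index.

```python
text = 'Hello world'

position = text.find('war')      # -1 because 'war' is absent
if position == -1:
    message = 'not found'
else:
    message = 'found at %d' % position

# Pitfall: -1 is also a valid index, so using it unchecked
# silently gives the last character rather than an error
wrongChar = text[position]       # 'd'

try:
    text.index('war')
    raisedError = False
except ValueError:
    raisedError = True           # index() refuses to return a position
```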

When you read a file, you often end up with whitespace characters (newlines, carriage

returns,  tabs  and  spaces)  that  you  want  to  get  rid  of,  or  deal  with.  There  are  various

functions for this. Here we will consider a string with two leading spaces and two trailing

spaces:


line = '  hello world  '

You can strip off the whitespace from both ends:

line.strip() # 'hello world'

Note  that  since  strings  are  not  modifiable,  this  gives  back  a  new  string;  it  does  not

modify the original string. You can also strip whitespace from just the beginning (left) or

end (right) of the string:

line.lstrip() # 'hello world  '

line.rstrip() # '  hello world'

There  is  no  inbuilt  function  to  remove  all  whitespace  from  everywhere  in  the  string,

including any in the middle. This is possible using the regular expression module, which

we discuss in detail in Appendix 5.
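
As a brief sketch of the regular expression approach, assuming the standard re module and its sub() function, every run of whitespace can be replaced with an empty string. The example string and variable names are our own. A second route, which combines the split() and join() functions, is included for comparison.

```python
import re

line = '  hello \t world \n'

# \s matches any whitespace character; each run is replaced by nothing
noWhitespace = re.sub(r'\s+', '', line)   # 'helloworld'

# An alternative without regular expressions: split on whitespace, rejoin
alsoNoWhitespace = ''.join(line.split())  # 'helloworld'
```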

You can split up your string into separate substrings according to the presence of

whitespace. This creates a list of strings, where a ‘list’ is simply a container for the strings

(here represented by square brackets). Lists are Python objects in their own right and are

discussed further in the next section.

line.split() # ['hello', 'world'] – a list of two strings

Note that this automatically strips off the whitespace at the beginning and end before

doing any splitting. You can also split on an arbitrary substring, noting that (quite

sensibly) this does not strip off the whitespace at the beginning or end:

line.split('wor') # ['  hello ', 'ld  ']

Given that you can split a string into parts, it is quite natural that you can also do the

opposite and join a number of strings together into one long string. For example, given a

variable  that  represents  a  list  of  strings,  which  we  write  inside  square  brackets  and

separate with commas:

myList = ['Homer', 'Marge', 'Maude', 'Ned']

you may want to create one long, combined string:

longText = 'Homer, Marge, Maude, Ned'

This is done using the join() function, where you connect the items from the list with

some  other  connecting  string  (e.g.  with  commas  and  spaces).  However,  although  you

might  expect  the  joining  function  to  come  from  the  list,  it  actually  belongs  to  the

connecting string. Thus, you do not do:

longText = myList.join(connectorString) # Not used

Instead the correct Python way is:

longText = connectorString.join(myList)

The syntax can take a bit of time to become familiar, because the string that is linking

things together might be defined on the same line where the joining occurs. Consider

the following:

cities = ['London', 'Paris', 'Berlin']

connector = '->'

connector.join(cities) # 'London->Paris->Berlin'

The last lines could be written as one, without an intermediate variable name:

'->'.join(cities) # 'London->Paris->Berlin'

Thus,  the  connecting  string  is  the  thing  that  comes  before  the  dot.  A  further  point,

which  can  catch  you  out,  is  that  all  the  items  that  are  to  be  joined  together  have  to  be

strings; no other type will do. Also, the joining string is only added in-between the items

of the list, not at the beginning or end.
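
The requirement that every item be a string can be illustrated with a list of numbers. This is only a sketch: the list counts is invented, and the conversion uses str() inside a list comprehension, which is one of several ways to do it.

```python
counts = [3, 1, 4]

try:
    ', '.join(counts)               # items are integers, not strings
    joinFailed = False
except TypeError:
    joinFailed = True

# Convert each item with str() first, then join
countText = ', '.join([str(c) for c in counts])   # '3, 1, 4'
```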

The  join()  function  also  allows  you  to  concatenate  items  together  without  adding  any

extra characters, using an empty string. For example, suppose you have a list of one-letter

codes for a DNA sequence (or protein or RNA) and want to create a string of all the letters

joined together. Then you could do:

sequence = ['G', 'C', 'A', 'T']

seq = ''.join(sequence) # 'GCAT'

You  can  also  do  string  concatenation  using  the  ‘+’  operator,  so  an  alternative  to  the

above would be:




seq = sequence[0] + sequence[1] + sequence[2] + sequence[3]

# seq is 'GCAT'

This is generally not a good approach if the list is long, because it is much less efficient

than  using  the  join()  method.  And  in  any  case  you  would  usually  not  write  out  the  list

elements  in  full;  you  would  use  a  loop  to  go  through  each  item  in  turn  (see  the  next

chapter).  On  the  other  hand,  for  concatenating  only  a  few  strings  together  it  is  perfectly

acceptable  to  do  it  this  way.  As  another  example,  suppose  you  have  some  numbers  and

want  to  create  a  string  with  this  information  in  it.  Then  you  could  do  the  following,

converting the numbers to strings using str():

x = 12


y = 5

text = "I have " + str(x) + " apples and " + str(y) + " oranges."

# the text is "I have 12 apples and 5 oranges."

Even here, though, Python offers an alternative, which is to use a formatted string.  So

we could write the above instead as:

text = "I have %d apples and %d oranges." % (x,y)

Here %d is a formatting code and marks the places in the text where integer values are inserted.

The values to insert are contained in the round-bracketed ‘tuple’ collection at the end

(see  below  for  discussion  of  tuples),  after  the  bare  %  sign.  Naturally,  there  should  be  as

many formatting codes in the initial string as there are items to insert. If we were inserting

other types of data then we would use different codes, for example, %s to insert a string

and %f for a floating point value:

name = 'Barry'

weight = 82.173

text = "The weight of %s is %f kg" % (name, weight)

# Gives "The weight of Barry is 82.173000 kg"

We  can  optionally  specify  the  number  of  decimal  places  to  use  for  the  floating  point

value by adjusting its formatting code. For example, %.1f can be used so that the weight is

written out with one digit after the decimal place, rounding as appropriate:

text = "The weight of %s is %.1f kg" % (name, weight)

# Gives "The weight of Barry is 82.2 kg"

If you also wanted at least five total characters for the weight, padding with spaces, you

would  write  %5.1f.  It  is  notable  that  you  can  actually  use  %s  for  every  type  of  data,

because values will be automatically converted into a representative string, but if you want

to  fine-tune  the  appearance  of  floating  point  numbers  then  it  is  best  to  use  the  %f

construct.
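
The effect of these floating point codes can be compared side by side. The sketch below reuses the weight value from above; the other variable names are arbitrary.

```python
weight = 82.173

narrow = "%.1f" % weight     # '82.2'
padded = "%5.1f" % weight    # ' 82.2' - one leading space makes five characters
wide = "%8.3f" % weight      # '  82.173'
asString = "%s" % weight     # '82.173' - automatic conversion, no control over digits
```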

There are analogous options for the %d construct used with integers. So %5d means

that at least five places are used to display the integer, and %05d means that you zero-pad

the five places at the left, if necessary. For example, you could create a string with the

time of day via:

hours = 12

minutes = 5

seconds = 43

t = "%02d:%02d:%02d" % (hours, minutes, seconds)

# t is "12:05:43"
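
For comparison, here is a small sketch of space padding against zero padding for integers. The values and the file name pattern are made up for the example.

```python
count = 42

padded = "%5d" % count            # '   42'
zeroPadded = "%05d" % count       # '00042'
tooWide = "%5d" % 1234567         # '1234567' - the width is a minimum, not a limit
fileName = "frame_%03d.txt" % 7   # 'frame_007.txt'
```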

Python has a notable tweak with string formatting: if the collection of values that is to

be substituted only has one item, and that item is not itself a tuple, then you can just use

the item directly, rather than using brackets (which represent a tuple, see below). So

"My name is %s" % name

is equivalent to

"My name is %s" % (name,)

See Appendix 4 for a more complete table of formatting codes and a thorough

description of the new-style formatting system specified with the str.format() method.
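
As a brief taste of the new-style system, the two earlier examples might be written with format() roughly as follows. This is a sketch only: curly brackets mark the insertion points, and the codes after the colon play the same role as those after the % sign.

```python
name = 'Barry'
weight = 82.173

text = "The weight of {} is {:.1f} kg".format(name, weight)
# 'The weight of Barry is 82.2 kg'

t = "{:02d}:{:02d}:{:02d}".format(12, 5, 43)
# '12:05:43'
```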



