Python Programming for Biology: Bioinformatics and Beyond

Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

String manipulation

Text items in Python are called strings, referring to the fact that they are strings of

characters. String functionality is an important part of the Python toolbox. For example, a

file on disk (covered in Chapter 6) is read as a string or a list of strings; a file can be

viewed as a collection of characters. Here, even if part of the loaded file represents a

number, it is initially represented as a string of characters, not a proper Python numeric

object. In Python, strings are not modifiable (they are immutable). This might seem like a

limitation, but in fact it rarely is, because it is easy to create a new, modified string from

an existing string. Moreover, because strings are not modifiable they can be placed in sets and

used as keys in dictionaries, both of which are exceedingly useful.
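As a minimal sketch of this point, the following uses strings as dictionary keys and as members of a set. The variable names and the three-codon table are illustrative assumptions rather than part of the original text, and the for loop it relies on is only covered in the next chapter.

```python
# Strings can be dictionary keys and set members because they cannot change.
# The table below is a tiny illustrative subset of the genetic code.
codon_to_amino = {'ATG': 'Met', 'TGG': 'Trp', 'TAA': 'Stop'}

seen_codons = set()
for codon in ['ATG', 'TGG', 'ATG', 'TAA']:
    seen_codons.add(codon)  # the repeated 'ATG' is stored only once

# Look up each string key in the dictionary
amino_acids = [codon_to_amino[c] for c in ['ATG', 'TGG', 'TAA']]

print(sorted(seen_codons))  # ['ATG', 'TAA', 'TGG']
print(amino_acids)          # ['Met', 'Trp', 'Stop']
```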

In this section we will illustrate some basic manipulations on strings using the

following example string:

text = 'hello world' # same as double quoted "hello world"

In some ways a string can be thought of as a list of characters, although in Python a list

of characters would be a different entity (see below for a discussion of lists). Note that

when we refer to something in a string as being a character, we don’t just mean the

regular symbols for letters, numbers and punctuation; we also include spaces and

formatting codes (tab stop, new line etc.). You can access the character at a specific

position, or index, using square brackets:

text[1] # 'e'

text[5] # ' ' – a space

Note that the index for accessing the characters of a string starts counting from 0, not 1.

Thus the first character of a string is index number 0. At first this can seem odd to non-

programmers, but it is by far the most sensible convention, and is used in most modern

computer languages.

Bear in mind that we cannot change the characters of a string. For example, we get an

error if we try to change the first position to an ‘H’:

text[0] = 'H' # Fails!

TypeError: 'str' object does not support item assignment
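To see that the failed assignment leaves the string untouched, one possible sketch is to catch the error and inspect the string afterwards. The try/except construct used here belongs to the next chapter, and the variable name message is our own choice.

```python
text = 'hello world'

try:
    text[0] = 'H'  # strings do not support item assignment
except TypeError as err:
    message = str(err)  # keep the error text so we can look at it

print(message)  # 'str' object does not support item assignment
print(text)     # hello world (the original string is unchanged)
```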

You can count backwards from the end of the string, where index -1 is the last character

of the string:

text[-3] # 'r'

If a string has n characters, then the minimum value of the index is -n and the

maximum value is n-1. If the index falls outside this range an error is generated; Python

makes an Exception object which reports what the error was (see the next chapter for a

description of these).
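The limits of the valid index range can be checked with a short sketch like the one below. It assumes the built-in len() function to obtain n, and again borrows try/except from the next chapter; the variable names are illustrative.

```python
text = 'hello world'
n = len(text)  # 11 characters

first = text[-n]    # smallest valid index, same character as text[0]
last = text[n - 1]  # largest valid index, same character as text[-1]

try:
    text[n]  # one past the largest valid index
except IndexError as err:
    out_of_range = str(err)

print(first, last)   # h d
print(out_of_range)  # string index out of range
```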

Python also has a very convenient slicing notation, to access a substring from within a

string. The notation [start:stop] refers to the characters from position start up to but not

including position stop. As with single indices, these positions can be negative. The fact

that it is ‘up to but not including’ might seem odd, but as with the indices counting from 0,

this turns out to be a sensible convention. In particular, if start and stop numbers in the

slice notation are both non-negative then the number of characters in the resulting slice is

just the difference between the two values (stop-start), or put another way [start:start+n]

gives n characters.

As a further convenience, if you leave out the start entirely giving just [:stop], then the

slice starts at the very beginning; the start point is taken to be 0. If you leave out the stop,

so have [start:], then the slice continues to the very end; as if stop were taken to be the

length of the string. Thus, for example, [:n] refers to the first n characters of the string.

text[1:3] # 'el'

text[1:] # 'ello world'

text[1:-1] # 'ello worl'

text[:-1] # 'hello worl'
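The length rule described above can be illustrated with a few more slices of the same string. This is only a sketch; the names start, n, chunk, head, tail and over are our own. The last line shows one behaviour not mentioned above: unlike a single index, a slice that runs past the end is simply cut short rather than raising an error.

```python
text = 'hello world'
start, n = 2, 5

chunk = text[start:start + n]  # n characters beginning at index 2
head = text[:n]                # the first n characters
tail = text[-n:]               # the last n characters
over = text[6:100]             # stop beyond the end is truncated, no error

print(repr(chunk))  # 'llo w'
print(repr(head))   # 'hello'
print(repr(tail))   # 'world'
print(repr(over))   # 'world'
```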

This leads to the proper way to (effectively) change the first character of the example

string. We can use a slice to access the characters we wish to keep and redefine text:

text = 'H' + text[1:] # 'Hello world'
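The same idea generalises to any position. The helper below is a hypothetical function of our own (it is not part of Python or of the original text) that builds a new string from the two slices either side of the position being replaced.

```python
def replace_char(s, index, new_char):
    """Return a copy of s with the character at a non-negative index replaced.

    Hypothetical helper for illustration; the original string is not changed.
    """
    if not 0 <= index < len(s):
        raise IndexError('index out of range')
    return s[:index] + new_char + s[index + 1:]


text = 'hello world'
capitalised = replace_char(text, 0, 'H')
mutated = replace_char('GCAT', 2, 'T')

print(capitalised)  # Hello world
print(mutated)      # GCTT
print(text)         # hello world
```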

You can check if a substring is contained in a string:

'wor' in text # True

'war' in text # False

or is not contained in (is absent from) a string:

'wor' not in text # False

'war' not in text # True
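A typical use of these tests is to sort strings into those that do and do not contain a given substring. The sketch below uses made-up sequences and a made-up motif purely for illustration, together with a loop of the kind described in the next chapter.

```python
# Illustrative, made-up sequences and motif
sequences = ['GATTACA', 'CCGGTT', 'TTAGGC']
motif = 'TTA'

with_motif = []
without_motif = []
for seq in sequences:
    if motif in seq:
        with_motif.append(seq)
    else:
        without_motif.append(seq)

print(with_motif)     # ['GATTACA', 'TTAGGC']
print(without_motif)  # ['CCGGTT']
```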

There are two functions that let you determine the position of (the first occurrence of) a

substring inside a string:

text.index('wor') # 6

text.find('wor') # 6

Note that the value returned is the index of the first character of the substring in the

string. The difference between these functions is how they deal with the situation when the

substring is not contained in the string. For the index() function an error is generated, but

instead the find() function returns −1:

text.find('war') # -1
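The practical consequence of this difference can be sketched as follows. With find() the -1 result must be tested for explicitly, because -1 is itself a valid index (the last character); with index() the failure has to be caught as a ValueError. The variable names are our own.

```python
text = 'hello world'

# find(): test the return value before using it
position = text.find('war')
if position == -1:
    find_result = 'not found'
else:
    find_result = position

# Pitfall: using the -1 directly silently selects the last character
pitfall = text[text.find('war')]

# index(): a missing substring raises ValueError instead
try:
    index_result = text.index('war')
except ValueError:
    index_result = 'not found'

print(find_result)   # not found
print(pitfall)       # d
print(index_result)  # not found
```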

It is a matter of taste which version you use. Nonetheless, it might have been better for

find() to return None if the substring isn’t present. You can search from the (right-hand)

end of the string instead of the beginning:

text.index('l') # 2

text.rindex('l') # 9

text.find('l') # 2

text.rfind('l') # 9
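If every occurrence is needed, rather than just the first or last, one possible approach is to call find() repeatedly. This sketch relies on the optional second argument of find(), which gives the position to start searching from and is not discussed above; find_all is a hypothetical helper name.

```python
def find_all(s, sub):
    """Return the start index of every occurrence of sub, overlaps included."""
    positions = []
    start = s.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous hit
        start = s.find(sub, start + 1)
    return positions


text = 'hello world'
print(find_all(text, 'l'))     # [2, 3, 9]
print(find_all(text, 'o'))     # [4, 7]
print(find_all(text, 'war'))   # []
print(find_all('AAAA', 'AA'))  # [0, 1, 2]
```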

When you read a file, you often end up with whitespace characters (newlines, carriage

returns, tabs and spaces) that you want to get rid of, or deal with. There are various

functions for this. Here we will consider a string with two leading spaces and two trailing

spaces:

line = '  hello world  '

You can strip off the whitespace from both ends:

line.strip() # 'hello world'

Note that since strings are not modifiable, this gives back a new string; it does not

modify the original string. You can also strip whitespace from just the beginning (left) or

end (right) of the string:

line.lstrip() # 'hello world  '

line.rstrip() # '  hello world'
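As a sketch of how this is typically used on lines read from a file, the following cleans a small list of made-up lines and discards any that contain only whitespace. The list raw_lines stands in for the result of reading a file and is an assumption for illustration.

```python
# Made-up lines standing in for the contents of a file
raw_lines = ['  GATTACA \n', '\tCCGGTT\r\n', '\n']

cleaned = []
for raw in raw_lines:
    stripped = raw.strip()  # removes spaces, tabs, newlines, carriage returns
    if stripped:            # an empty string counts as false
        cleaned.append(stripped)

print(cleaned)    # ['GATTACA', 'CCGGTT']
print(raw_lines)  # the original strings are untouched
```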

There is no inbuilt function to remove all whitespace from everywhere in the string,

including any in the middle. This is possible using the regular expression module, which

we discuss in detail in Appendix 5.
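A minimal sketch of the regular expression approach is given below, using re.sub() with the pattern \s+ (one or more whitespace characters). For comparison it also shows an alternative that needs no regular expressions, combining the split() and join() methods; both variable names are our own.

```python
import re

line = '  hello   world \n'

# Replace every run of whitespace with nothing
no_space = re.sub(r'\s+', '', line)

# Alternative: split on whitespace, then join the pieces with an empty string
no_space_alt = ''.join(line.split())

print(no_space)      # helloworld
print(no_space_alt)  # helloworld
```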

You can split up your string into separate substrings according to the presence of

whitespace. This creates a list of strings, where a ‘list’ is simply a container for the strings

(here represented by square brackets). Lists are Python objects in their own right and are

discussed further in the next section.

line.split() # ['hello', 'world'] – a list of two strings

Note that this automatically strips off the whitespace at the beginning and end before

doing any splitting. You can also split on an arbitrary substring, noting that (quite

sensibly) this does not strip off the whitespace at the beginning or end:

line.split('wor') # ['  hello ', 'ld  ']
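The difference between the two forms of split() matters when parsing data files. The sketch below splits a made-up tab-separated record into fields, and shows that splitting on a single space, unlike splitting with no argument, keeps an empty string for every extra space. The record contents and variable names are illustrative assumptions, and rstrip('\n') uses the optional argument that limits which characters are stripped.

```python
# A made-up tab-separated record, as might be read from a file
record = 'geneA\t17\t1500\n'

fields = record.rstrip('\n').split('\t')
name, chromosome, start = fields[0], fields[1], int(fields[2])

line = '  hello world  '
by_whitespace = line.split()   # runs of whitespace, ends stripped
by_space = line.split(' ')     # every single space is a separator

print(fields)         # ['geneA', '17', '1500']
print(start + 1)      # 1501
print(by_whitespace)  # ['hello', 'world']
print(by_space)       # ['', '', 'hello', 'world', '', '']
```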

Given that you can split a string into parts, it is quite natural that you can also do the

opposite and join a number of strings together into one long string. For example, given a

variable that represents a list of strings, which we write inside square brackets and

separate with commas:

myList = ['Homer', 'Marge', 'Maude', 'Ned']

you may want to create one long, combined string:

longText = 'Homer, Marge, Maude, Ned'

This is done using the join() function, where you connect the items from the list with

some other connecting string (e.g. with commas and spaces). However, although you

might expect the joining function to come from the list, it actually belongs to the

connecting string. Thus, you do not do:

longText = myList.join(connectorString) # Fails: lists have no join() method

Instead the correct Python way is:

longText = connectorString.join(myList)

The syntax can take a bit of time to become familiar, because the string that is linking

things together might be defined on the same line where the joining occurs. Consider

the following:

cities = ['London', 'Paris', 'Berlin']

connector = '->'

connector.join(cities) # 'London->Paris->Berlin'

The last lines could be written as one, without an intermediate variable name:

'->'.join(cities) # 'London->Paris->Berlin'

Thus, the connecting string is the thing that comes before the dot. A further point,

which can catch you out, is that all the items that are to be joined together have to be

strings; no other type will do. Also, the joining string is only added in between the items

of the list, not at the beginning or end.
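The requirement that every item be a string can be sketched as follows: joining a list of numbers fails with a TypeError, so each item is first converted with str(). The variable names are our own, and the try/except construct is covered in the next chapter.

```python
values = [3, 1, 4]

try:
    ', '.join(values)  # the items are integers, not strings
    join_failed = False
except TypeError:
    join_failed = True

# Convert each item to a string before joining
joined = ', '.join(str(v) for v in values)

print(join_failed)  # True
print(joined)       # 3, 1, 4
```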

The join() function also allows you to concatenate items together without adding any

extra characters, using an empty string. For example, suppose you have a list of one-letter

codes for a DNA sequence (or protein or RNA) and want to create a string of all the letters

joined together. Then you could do:

sequence = ['G', 'C', 'A', 'T']

seq = ''.join(sequence) # 'GCAT'

You can also do string concatenation using the ‘+’ operator, so an alternative to the

above would be:

seq = sequence[0] + sequence[1] + sequence[2] + sequence[3]

# seq is 'GCAT'

This is generally not a good approach if the list is long, because it is much less efficient

than using the join() method. And in any case you would usually not write out the list

elements in full; you would use a loop to go through each item in turn (see the next

chapter). On the other hand, for concatenating only a few strings together it is perfectly

acceptable to do it this way. As another example, suppose you have some numbers and

want to create a string with this information in it. Then you could do the following,

converting the numbers to strings using str():

x = 12

y = 5

text = "I have " + str(x) + " apples and " + str(y) + " oranges."

# the text is "I have 12 apples and 5 oranges."

Even here, though, Python offers an alternative, which is to use a formatted string. So

we could write the above instead as:

text = "I have %d apples and %d oranges." % (x,y)

Here %d is a formatting code and represents the places in the text to insert the digits.

The values for the digits are contained in the round-bracketed ‘tuple’ collection at the end

(see below for discussion of tuples), after the bare % sign. Naturally, there should be as

many formatting codes in the initial string as there are items to insert. If we were inserting

other types of data then we would use different codes, for example, %s to insert a string

and %f for a floating point value:

name = 'Barry'

weight = 82.173

text = "The weight of %s is %f kg" % (name, weight)

# Gives "The weight of Barry is 82.173000 kg"

We can optionally specify the number of decimal places to use for the floating point

value by adjusting its formatting code. For example, %.1f can be used so that the weight is

written out with one digit after the decimal point, rounding as appropriate:

text = "The weight of %s is %.1f kg" % (name, weight)

# Gives "The weight of Barry is 82.2 kg"

If you also wanted at least five total characters for the weight, padding with spaces, you

would write %5.1f. It is notable that you can actually use %s for every type of data,

because values will be automatically converted into a representative string, but if you want

to fine-tune the appearance of floating point numbers then it is best to use the %f

construct.
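The effect of these variations on the floating point code can be compared side by side. This is a small sketch using the same weight value as above; the variable names on the left are our own.

```python
weight = 82.173

default = '%f' % weight    # six decimal places by default
one_dp = '%.1f' % weight   # one digit after the decimal point
padded = '%5.1f' % weight  # at least five characters, padded with a space
as_text = '%s' % weight    # automatic conversion to a string

print(repr(default))  # '82.173000'
print(repr(one_dp))   # '82.2'
print(repr(padded))   # ' 82.2'
print(repr(as_text))  # '82.173'
```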

There are analogous options for the %d construct used with integers. So %5d means

that at least five places are used to display the integer, and %05d means that you zero-pad

the five places at the left, if necessary. For example, you could create a string with the

time of day via:

hours = 12

minutes = 5

seconds = 43

t = "%02d:%02d:%02d" % (hours, minutes, seconds)

# t is "12:05:43"
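The integer width options can be sketched in the same way. Note that the width is a minimum: a number with more digits than the width is written out in full rather than truncated. The values and variable names are illustrative.

```python
count = 42

spaced = '%5d' % count   # padded on the left with spaces
zeroed = '%05d' % count  # padded on the left with zeros
wide = '%2d' % 12345     # more digits than the width: nothing is cut off

print(repr(spaced))  # '   42'
print(repr(zeroed))  # '00042'
print(repr(wide))    # '12345'
```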

Python has a notable tweak with string formatting: if the collection of values that is to

be substituted only has one item then you can just use the item directly, rather than using

brackets (which represent a tuple, see below), provided that the item is not itself a tuple. So

"My name is %s" % name

is equivalent to

"My name is %s" % (name,)

See Appendix 4 for a more complete table of formatting codes and a thorough

description of the new-style formatting system specified with the str.format() method.
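For orientation only, here is a brief sketch of how two of the examples from this section might look in the new-style system. It assumes the standard format specifiers, where {:.1f} and {:02d} play the roles of %.1f and %02d; the appendix remains the place for the full description.

```python
name = 'Barry'
weight = 82.173
hours, minutes, seconds = 12, 5, 43

# Old-style formatting, as used in this section
old_text = "The weight of %s is %.1f kg" % (name, weight)
old_time = "%02d:%02d:%02d" % (hours, minutes, seconds)

# New-style equivalents using the format() method of the string
new_text = "The weight of {} is {:.1f} kg".format(name, weight)
new_time = "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)

print(new_text)  # The weight of Barry is 82.2 kg
print(new_time)  # 12:05:43
```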
