String manipulation
Text items in Python are called strings, referring to the fact that they are strings of
characters. String functionality is an important part of the Python toolbox. For example, a
file on disk (covered in Chapter 6) is read as a string or a list of strings; a file can be
viewed as a collection of characters. Here, even if part of the loaded file represents a
number, it is initially represented as a string of characters, not a proper Python numeric
object. In Python, strings are not modifiable (they are immutable). This might seem like a
limitation, but in practice it rarely is, because it is easy to create a new, modified string
from an existing one. And because strings are not modifiable, they can be placed in sets and
used as keys in dictionaries, both of which are exceedingly useful.
In this section we will illustrate some basic manipulations on strings using the
following example string:
text = 'hello world' # same as double quoted "hello world"
In some ways a string can be thought of as a list of characters, although in Python a list
of characters would be a different entity (see below for a discussion of lists). Note that
when we refer to something in a string as being a character, we don’t just mean the
regular symbols for letters, numbers and punctuation; we also include spaces and
formatting codes (tab stop, new line etc.). You can access the character at a specific
position, or index, using square brackets:
text[1] # 'e'
text[5] # ' ' – a space
Note that the index for accessing the characters of a string starts counting from 0, not 1.
Thus the first character of a string has index 0. At first this can seem odd to
non-programmers, but it is by far the most sensible convention, and is used in most modern
computer languages.
Bear in mind that we cannot change the characters of a string. For example, we get an
error if we try to change the first position to an ‘H’:
text[0] = 'H' # Fails!
TypeError: 'str' object does not support item assignment
You can count backwards from the end of the string, where index -1 is the last character
of the string:
text[-3] # 'r'
If a string has n characters, then the minimum value of the index is -n and the
maximum value is n-1. If the index falls outside this range an error is generated; Python
raises an IndexError exception, which reports what went wrong (see the next chapter for a
description of exceptions).
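As a minimal sketch of these index limits, using the same example string, the following checks both ends of the valid range and then steps just outside it. The variable names (n, first, last, message) are ours, chosen only for illustration, and the try/except construct is explained in the next chapter.

```python
text = 'hello world'
n = len(text)              # 11 characters

first = text[-n]           # smallest valid index: same as text[0]
last = text[n - 1]         # largest valid index: same as text[-1]

try:
    text[n]                # one past the largest valid index
except IndexError as err:
    message = str(err)     # the text of the error report
```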
Python also has a very convenient slicing notation, to access a substring from within a
string. The notation [start:stop] refers to the characters from position start up to but not
including position stop. As with single indices, these positions can be negative. The fact
that it is ‘up to but not including’ might seem odd, but as with the indices counting from 0,
this turns out to be a sensible convention. In particular, if the start and stop numbers in
the slice notation are both non-negative and lie within the string, with stop no smaller than
start, then the number of characters in the resulting slice is just the difference between
the two values (stop-start), or put another way [start:start+n] gives n characters.
As a further convenience, if you leave out the start entirely giving just [:stop], then the
slice starts at the very beginning; the start point is taken to be 0. If you leave out the stop,
giving [start:], then the slice continues to the very end, as if stop were taken to be the
length of the string. Thus, for example, [:n] refers to the first n characters of the string.
text[1:3] # 'el'
text[1:] # 'ello world'
text[1:-1] # 'ello worl'
text[:-1] # 'hello worl'
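The length rule for slices can be checked directly. This is only an illustrative sketch on the example string; the names start, n, piece, head and tail are ours. It also shows that, unlike a single index, a slice that runs past the end of the string does not generate an error but simply stops at the end.

```python
text = 'hello world'
start, n = 2, 4

piece = text[start:start + n]    # 'llo ' : exactly n characters
head = text[:n]                  # 'hell' : the first n characters
tail = text[n:]                  # 'o world' : everything after them
rejoined = head + tail           # the two halves rebuild the original

overshoot = text[8:100]          # 'rld' : stop is clipped to the string length
```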
This leads to the proper way to (effectively) change the first character of the example
string. We can use a slice to access the characters we wish to keep and redefine text:
text = 'H' + text[1:] # 'Hello world'
You can check if a substring is contained in a string:
'wor' in text # True
'war' in text # False
or is not contained in (is absent from) a string:
'wor' not in text # False
'war' not in text # True
There are two functions that let you determine the position of (the first occurrence of) a
substring inside a string:
text.index('wor') # 6
text.find('wor') # 6
Note that the value returned is the index of the first character of the substring in the
string. The difference between these functions is how they deal with the situation when the
substring is not contained in the string. The index() function generates an error (a
ValueError), whereas the find() function returns -1:
text.find('war') # -1
It is a matter of taste which version you use. Nonetheless, it might have been better for
find() to return None if the substring isn’t present. You can also search from the
(right-hand) end of the string instead of the beginning, using rindex() and rfind():
text.index('l') # 2
text.rindex('l') # 9
text.find('l') # 2
text.rfind('l') # 9
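To make the difference between index() and find() concrete, here is a small sketch. The helper find_or_none() is hypothetical (it is not part of Python); it simply wraps find() to give the None result suggested above. Note also that a successful find() can return 0, which counts as false in a test, so the result should be compared with -1 rather than tested directly.

```python
def find_or_none(text, sub):
    """Hypothetical helper (not part of Python): index of sub in text, or None."""
    position = text.find(sub)
    if position == -1:
        return None
    return position

text = 'hello world'

index_failed = False
try:
    text.index('war')                  # substring absent, so index() raises
except ValueError:
    index_failed = True

found = find_or_none(text, 'wor')      # 6
missing = find_or_none(text, 'war')    # None
at_start = text.find('hello')          # 0: found, but 0 counts as false
```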
When you read a file, you often end up with whitespace characters (newlines, carriage
returns, tabs and spaces) that you want to get rid of, or deal with. There are various
functions for this. Here we will consider a string with two leading spaces and two trailing
spaces:
line = '  hello world  '
You can strip off the whitespace from both ends:
line.strip() # 'hello world'
Note that since strings are not modifiable, this gives back a new string; it does not
modify the original string. You can also strip whitespace from just the beginning (left) or
end (right) of the string:
line.lstrip() # 'hello world  '
line.rstrip() # '  hello world'
There is no inbuilt function to remove all whitespace from everywhere in the string,
including any in the middle. This is possible using the regular expression module, which
we discuss in detail in Appendix 5.
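As a brief sketch of what removing all whitespace might look like, the first approach below uses re.sub() from the regular expression module, where the pattern \s+ matches any run of whitespace. The second is an alternative of our own that needs no extra module: it combines split() and join(), both described later in this section. The example line and variable names are ours.

```python
import re

line = '  hello \t world \n'

# Replace every run of whitespace characters with nothing
no_space_re = re.sub(r'\s+', '', line)     # 'helloworld'

# Alternative: split on whitespace, then join the pieces with an empty string
no_space_join = ''.join(line.split())      # 'helloworld'
```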
You can split up your string into separate substrings according to the presence of
whitespace. This creates a list of strings, where a ‘list’ is simply a container for the strings
(here represented by square brackets). Lists are Python objects in their own right and are
discussed further in the next section.
line.split() # ['hello', 'world'] – a list of two strings
Note that this automatically strips off the whitespace at the beginning and end before
doing any splitting. You can also split on an arbitrary substring, noting that (quite
sensibly) this does not strip off the whitespace at the beginning or end:
line.split('wor') # ['  hello ', 'ld  ']
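Splitting is often the first step in turning a line read from a file into usable values, since any numbers in the line start out as strings. The sketch below assumes a made-up line of whitespace-separated values; the names record, fields, count, weight, row and parts are ours. It also shows that splitting on an explicit separator, unlike splitting on whitespace, keeps empty items.

```python
record = '  12\t5.5  82.173 \n'

fields = record.split()      # ['12', '5.5', '82.173'] : still strings
count = int(fields[0])       # 12 as a proper integer
weight = float(fields[2])    # 82.173 as a floating point number

row = 'a,,b'
parts = row.split(',')       # ['a', '', 'b'] : the empty item is kept
```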
Given that you can split a string into parts, it is quite natural that you can also do the
opposite and join a number of strings together into one long string. For example, given a
variable that represents a list of strings, which we write inside square brackets and
separate with commas:
myList = ['Homer', 'Marge', 'Maude', 'Ned']
you may want to create one long, combined string:
longText = 'Homer, Marge, Maude, Ned'
This is done using the join() function, where you connect the items from the list with
some other connecting string (e.g. with commas and spaces). However, although you
might expect the joining function to come from the list, it actually belongs to the
connecting string. Thus, you do not do:
longText = myList.join(connectorString) # Fails: lists have no join() function
Instead the correct Python way is:
longText = connectorString.join(myList)
The syntax can take a bit of time to become familiar, because the string that is linking
things together might be defined on the same line where the joining occurs. Consider
the following:
cities = ['London', 'Paris', 'Berlin']
connector = '->'
connector.join(cities) # 'London->Paris->Berlin'
The last two lines could be written as one, without an intermediate variable name:
'->'.join(cities) # 'London->Paris->Berlin'
Thus, the connecting string is the thing that comes before the dot. A further point,
which can catch you out, is that all the items that are to be joined together have to be
strings; no other type will do. Also, the joining string is only added between the items
of the list, not at the beginning or end.
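The requirement that every item be a string is easy to trip over with a list of numbers. A minimal sketch, with names of our own choosing: the direct join fails with a TypeError, so each number is first converted with str(). The square-bracket construct that does the conversion is a list comprehension, one of several ways to do this.

```python
numbers = [1, 2, 3]

join_failed = False
try:
    ', '.join(numbers)                 # the items are integers, not strings
except TypeError:
    join_failed = True

# Convert each item to a string first, then join
joined = ', '.join([str(number) for number in numbers])   # '1, 2, 3'
```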
The join() function also allows you to concatenate items together without adding any
extra characters, using an empty string. For example, suppose you have a list of one-letter
codes for a DNA sequence (or protein or RNA) and want to create a string of all the letters
joined together. Then you could do:
sequence = ['G', 'C', 'A', 'T']
seq = ''.join(sequence) # 'GCAT'
You can also do string concatenation using the ‘+’ operator, so an alternative to the
above would be:
seq = sequence[0] + sequence[1] + sequence[2] + sequence[3]
# seq is 'GCAT'
This is generally not a good approach if the list is long, because it is much less efficient
than using the join() method. And in any case you would usually not write out the list
elements in full; you would use a loop to go through each item in turn (see the next
chapter). On the other hand, for concatenating only a few strings together it is perfectly
acceptable to do it this way. As another example, suppose you have some numbers and
want to create a string with this information in it. Then you could do the following,
converting the numbers to strings using str():
x = 12
y = 5
text = "I have " + str(x) + " apples and " + str(y) + " oranges."
# the text is "I have 12 apples and 5 oranges."
Even here, though, Python offers an alternative, which is to use a formatted string. So
we could write the above instead as:
text = "I have %d apples and %d oranges." % (x,y)
Here %d is a formatting code that marks each place in the text where an integer is to be
inserted. The values to insert are contained in the round-bracketed ‘tuple’ collection at
the end (see below for discussion of tuples), after the bare % sign. Naturally, there should be as
many formatting codes in the initial string as there are items to insert. If we were inserting
other types of data then we would use different codes, for example, %s to insert a string
and %f for a floating point value:
name = 'Barry'
weight = 82.173
text = "The weight of %s is %f kg" % (name, weight)
# Gives "The weight of Barry is 82.173000 kg"
We can optionally specify the number of decimal places to use for the floating point
value by adjusting its formatting code. For example, %.1f can be used so that the weight is
written out with one digit after the decimal point, rounding as appropriate:
text = "The weight of %s is %.1f kg" % (name, weight)
# Gives "The weight of Barry is 82.2 kg"
If you also wanted at least five total characters for the weight, padding with spaces, you
would write %5.1f. It is notable that you can actually use %s for every type of data,
because values will be automatically converted into a representative string, but if you want
to fine-tune the appearance of floating point numbers then it is best to use the %f
construct.
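The effect of the width and precision settings can be seen by formatting the same value in several ways. This is only an illustrative sketch reusing the weight from above; the result names are ours.

```python
weight = 82.173

padded = "%5.1f" % weight    # ' 82.2' : five characters, one leading space
wide = "%8.3f" % weight      # '  82.173' : eight characters in total
as_string = "%s" % weight    # '82.173' : automatic conversion, no fine-tuning
```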
There are analogous options for the %d construct used with integers. So %5d means
that at least five places are used to display the integer, and %05d means that you zero-pad
the five places at the left, if necessary. For example, you could create a string with the
time of day via:
hours = 12
minutes = 5
seconds = 43
t = "%02d:%02d:%02d" % (hours, minutes, seconds)
# t is "12:05:43"
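For comparison, a short sketch of the integer options side by side, using values of our own. The width is a minimum, so a number that needs more places than specified is written out in full rather than truncated.

```python
spaces = "%5d" % 42        # '   42' : padded with spaces on the left
zeros = "%05d" % 42        # '00042' : padded with zeros on the left
too_long = "%02d" % 123    # '123' : the width is only a minimum
```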
Python has a notable tweak with string formatting: if the collection of values that is to
be substituted only has one item then you can just use the item directly, rather than using
round brackets (which represent a tuple, see below). So
"My name is %s" % name
is equivalent to
"My name is %s" % (name,)
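One situation where the explicit one-item tuple matters is when the single value to insert is itself a tuple. The sketch below uses a made-up variable, point: used directly, its two items are treated as two separate values for one formatting code, which fails, whereas wrapping it in a one-item tuple inserts it whole.

```python
point = (3, 4)

direct_failed = False
try:
    "Position: %s" % point           # two values offered for one %s
except TypeError:
    direct_failed = True

safe = "Position: %s" % (point,)     # 'Position: (3, 4)'
```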
See Appendix 4 for a more complete table of formatting codes and a thorough
description of the new-style formatting system specified with the string.format() method.
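As a brief taste of the new-style system, and not a substitute for the appendix, the two earlier examples might be written with format() as follows. This sketch assumes a Python version that accepts empty {} placeholders (2.7, or 3.1 and later); the placeholders are filled in order, and the part after the colon plays the role of the old formatting code.

```python
name = 'Barry'
weight = 82.173

text = "The weight of {} is {:.1f} kg".format(name, weight)
# text is "The weight of Barry is 82.2 kg"

t = "{:02d}:{:02d}:{:02d}".format(12, 5, 43)
# t is "12:05:43"
```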