Whatever the origins of a biological sequence, before writing programs to work with
biological sequence information one must first have the sequences represented in some
data structure; ideally this should suit the purpose of any subsequent analyses. There are
various ways in which people store sequences, ranging from the simplistic to the
exceedingly complex, and each will have its own advantages and disadvantages.
The commonest and simplest method is to store sequences as text; i.e. as strings of
letters, where each letter represents a different kind of residue. Thus for DNA and RNA
we will be working with alphabets of four letters (A, C, G and T for DNA; A, C, G and U
for RNA), representing nucleotides, and for proteins an alphabet of 20 letters,
representing amino acids. For the standard set of residues that
make up the majority of biological polymers, this representation is sufficient as we have
more than enough letters on a standard keyboard. However, a simple one-letter
representation is not good enough if we need to describe unusual amino acids (both
naturally modified and artificially created). In such circumstances people usually resort to
three-letter code strings for amino acids: for example one can distinguish between proline
‘PRO’ and hydroxyproline ‘HYP’.
One can go further still and define a biological
sequence as a series of purpose-made object data structures, rather than a series of text
codes. While using lists of complex objects will be cumbersome and unnecessary for
many tasks, they are certainly a good choice if you need to work with the underlying
atoms within a residue, as is the case in structural biochemistry.
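As a rough illustration of the object-based approach, the sketch below represents a sequence as a list of residue objects, each holding its own atoms. The class names Atom and Residue, their attributes and the get_atom method are hypothetical, invented here for demonstration, and the coordinates are placeholder zeros rather than real structural data.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    """Hypothetical atom record: a name, an element and 3D coordinates."""
    name: str
    element: str
    coords: tuple = (0.0, 0.0, 0.0)  # placeholder, not real structural data


@dataclass
class Residue:
    """Hypothetical residue record: a three-letter code plus its atoms."""
    code: str
    atoms: list = field(default_factory=list)

    def get_atom(self, name):
        """Return the first atom with the given name, or None if absent."""
        for atom in self.atoms:
            if atom.name == name:
                return atom
        return None


# A sequence is then simply a list of Residue objects (only a few atoms shown).
sequence = [
    Residue('PRO', [Atom('N', 'N'), Atom('CA', 'C')]),
    Residue('HYP', [Atom('N', 'N'), Atom('CA', 'C'), Atom('OD1', 'O')]),
]

# The plain code sequence can still be recovered when needed.
codes = [residue.code for residue in sequence]

# Atom-level detail is available for each residue.
hydroxyl = sequence[1].get_atom('OD1')
```

This is heavier than a string, but it lets each residue carry atom-level detail that a text code cannot.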
In Python, a sequence of one-letter residue codes will usually be represented as a string
data type and three-letter codes as a list, although other arrangements are of course
possible. Also, if we are being cautious with our sequences then we may like to check that
our data structures only contain valid codes. A biological sequence can also be included in
a larger data structure if it needs to be annotated with further information. Although
Python dictionaries can be used for this purpose when you need something quick, we
sometimes advocate defining a custom object that can link your sequences to other data.
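The arrangements described above might look something like the following minimal sketch. The variable names, the is_valid helper and the dictionary keys are illustrative assumptions rather than a fixed convention, and the example sequences are made up for demonstration.

```python
# One-letter codes fit naturally in a string.
dna_seq = 'AGCTTGCA'

# Three-letter codes are more conveniently held in a list.
protein_3_letter = ['GLY', 'PRO', 'HYP', 'ALA']

# Sets of permitted codes (HYP added to the standard 20 as an example).
DNA_CODES = set('GCAT')
PROTEIN_CODES_3 = {
    'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE',
    'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL',
    'HYP',
}


def is_valid(sequence, alphabet):
    """Return True if every code in the sequence belongs to the alphabet.

    Works for strings (iterated letter by letter) and for lists of codes.
    """
    return all(code in alphabet for code in sequence)


# A dictionary is a quick way to attach annotation to a sequence.
record = {
    'name': 'example_fragment',  # hypothetical identifier
    'kind': 'DNA',
    'sequence': dna_seq,
}

dna_ok = is_valid(record['sequence'], DNA_CODES)
protein_ok = is_valid(protein_3_letter, PROTEIN_CODES_3)
rna_in_dna = is_valid('AGCU', DNA_CODES)  # 'U' is not a DNA letter
```

Because a string and a list can both be looped over item by item, the same check serves either representation.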
For testing and demonstration purposes, like the examples in this book, sequence data
can be entered directly into the code of your programs. For real-world applications,
of course, we would want our programs to work on arbitrary
sequences that we read in from a file or database. These could be sequences that have been
output from another program, something you have obtained by searching a large sequence
database or even an entire genome sequence that you have downloaded. Interacting with
files and databases directly is dealt with in Chapters 6 and 20, and for the moment we will
simply demonstrate with short sequences.
Once you have your sequences in some kind of data structure, it is time to start analysis.
While we cannot hope to anticipate all that you might need to do, we can at least give
some idea of what is possible. At the same time we aim to show how some of the things
that are commonly done with sequences can be readily achieved with Python. The
following examples are simple scripts that all deduce some property of an input DNA,
RNA or protein sequence that gives some real-world information or prediction about the
sequence. Note that in all of the examples we will forgo checking that the sequences we
use are valid: that they are the right kind of object and that they contain only the known
types of residue code or letter. In an important real-world application you would clearly
make such checks before you try to run any analyses, and the Biopython modules that we
demonstrate at the end can help you do this.