Python Programming for Biology: Bioinformatics and Beyond

Using biological sequences in computing

[Tim J. Stevens, Wayne Boucher] Python Programming

Whatever the origins of a biological sequence, before writing programs to work with biological sequence information one must first have the sequences represented in some data structure; ideally this should suit the purpose of any subsequent analyses. There are various ways in which people store sequences, ranging from the simplistic to the exceedingly complex, and each will have its own advantages and disadvantages.

The commonest and simplest method is to store sequences as text; i.e. as strings of letters, where each letter represents a different kind of residue. Thus for DNA and RNA we will be working with alphabets of four letters, representing nucleotides, and for proteins an alphabet of 20, representing amino acids. For the standard set of residues that make up the majority of biological polymers, this representation is sufficient as we have more than enough letters on a standard keyboard. However, a simple one-letter representation is not good enough if we need to describe unusual amino acids (both naturally modified and artificially created). In such circumstances people usually resort to three-letter code strings for amino acids: for example one can distinguish between proline ‘PRO’ and hydroxyproline ‘HYP’.
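As a minimal sketch of these two text representations in Python, the snippet below stores one-letter codes in a string and three-letter codes in a list. The sequences and variable names are invented for illustration and do not come from the original text.

```python
# One-letter codes: a plain string is enough for the standard residues.
dna_seq = 'ATGGTGCATCTGACT'      # made-up DNA fragment
protein_seq = 'MVHLTPEEK'        # made-up protein fragment

# Three-letter codes: a list of strings, so that a modified residue such as
# hydroxyproline ('HYP') can sit alongside standard ones such as proline ('PRO').
collagen_like = ['GLY', 'PRO', 'HYP', 'GLY', 'PRO', 'HYP']

# Both representations support len(), indexing and iteration.
first_base = dna_seq[0]
num_hyp = collagen_like.count('HYP')
print(len(dna_seq), first_base, num_hyp)
```

A one-letter string could not tell the 'PRO' and 'HYP' entries apart without inventing an extra, non-standard letter, which is the limitation the three-letter list avoids.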

One can go further still and define a biological sequence as a series of purpose-made object data structures, rather than a series of text codes. While using lists of complex objects will be cumbersome and unnecessary for many tasks, they are certainly a good choice if you need to work with the underlying atoms within a residue, as is the case in structural biochemistry.
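One way such an object-based representation might look is sketched below. The `Atom` and `Residue` classes, their attributes and the coordinates are all hypothetical, chosen only to show how a sequence becomes a list of objects that carry their own atoms.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str       # atom label within the residue, e.g. 'CA' for the alpha carbon
    element: str    # chemical element symbol
    coords: tuple   # (x, y, z) position; the values used below are made up


@dataclass
class Residue:
    code: str                                   # three-letter residue code
    atoms: list = field(default_factory=list)   # Atom objects in this residue

    def get_atom(self, name):
        """Return the first atom with the given name, or None if absent."""
        for atom in self.atoms:
            if atom.name == name:
                return atom
        return None


# A two-residue "sequence" as a list of objects (only a few atoms shown).
pro = Residue('PRO', [Atom('N', 'N', (0.0, 0.0, 0.0)),
                      Atom('CA', 'C', (1.5, 0.0, 0.0))])
hyp = Residue('HYP', [Atom('N', 'N', (3.8, 0.0, 0.0)),
                      Atom('CA', 'C', (5.3, 0.0, 0.0)),
                      Atom('OD1', 'O', (6.0, 2.1, 0.5))])
chain = [pro, hyp]

# The text-code view of the sequence can still be recovered when needed.
codes = [residue.code for residue in chain]
print(codes)
```

The extra structure only pays off when the atoms matter; for most sequence analyses the plain text codes recovered in `codes` are all that is required.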

In Python, a sequence of one-letter residue codes will usually be represented as a string data type and three-letter codes as a list, although other arrangements are of course possible. Also, if we are being cautious with our sequences then we may like to check that our data structures only contain valid codes. A biological sequence can also be included in a larger data structure if it needs to be annotated with further information. Although Python dictionaries can be used for this purpose when you need something quick, we sometimes advocate defining a custom object that can link your sequences to other data.
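The sketch below illustrates these three ideas together: checking for valid codes, a quick dictionary annotation, and a small custom object. The function `is_valid_sequence`, the class `AnnotatedSequence` and all of the example data are hypothetical names and values assumed for illustration.

```python
DNA_CODES = set('ACGT')
PROTEIN_3_CODES = {'GLY', 'PRO', 'HYP'}   # deliberately tiny, illustrative set


def is_valid_sequence(seq, valid_codes):
    """Return True if every element of seq is one of valid_codes.

    Works for a string of one-letter codes or a list of three-letter codes.
    Note that an empty sequence passes this check.
    """
    return all(code in valid_codes for code in seq)


# Quick annotation with a dictionary.
record = {'name': 'example_gene',
          'organism': 'made-up organism',
          'sequence': 'ATGGCCTGA'}


# A custom object that links a sequence to other data and validates on creation.
class AnnotatedSequence:

    def __init__(self, name, sequence, valid_codes, notes=None):
        if not is_valid_sequence(sequence, valid_codes):
            raise ValueError('Sequence %s contains invalid codes' % name)
        self.name = name
        self.sequence = sequence
        self.notes = notes or {}


entry = AnnotatedSequence(record['name'], record['sequence'], DNA_CODES,
                          notes={'organism': record['organism']})
print(entry.name, entry.sequence, entry.notes)
```

The dictionary is quicker to write, while the class gives one place to enforce the validity check every time a sequence is created.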

For testing and demonstration purposes, like the examples in this book, sequence data can be entered directly into the code of your programs. Of course for real-world applications of programs we would want to have our programs work on arbitrary sequences that we read in from a file or database. These could be sequences that have been output from another program, something you have obtained by searching a large sequence database or even an entire genome sequence that you have downloaded. Interacting with files and databases directly is dealt with in Chapters 6 and […], and for the moment we will simply demonstrate with short sequences.
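To give a flavour of the difference between the two approaches, here is a rough sketch that contrasts a sequence typed directly into the code with one read from file-like input. It assumes the widely used FASTA text format, which the passage above does not name; the `read_fasta` function is a hypothetical helper, and `io.StringIO` stands in for a real file so that the example is self-contained.

```python
import io

# Sequence entered directly in the code, as in the demonstrations here.
hard_coded_seq = 'ATGGCCTGA'


def read_fasta(file_obj):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header = None
    seq_lines = []
    for line in file_obj:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):          # a new record begins
            if header is not None:
                records.append((header, ''.join(seq_lines)))
            header = line[1:]
            seq_lines = []
        elif header is not None:          # sequence line for the current record
            seq_lines.append(line)
    if header is not None:                # store the final record
        records.append((header, ''.join(seq_lines)))
    return records


# Made-up FASTA text standing in for the contents of a file.
example_text = """>seq1 made-up example
ATGGCC
TGA
>seq2 another made-up example
GGGTTT
"""

records = read_fasta(io.StringIO(example_text))
for header, sequence in records:
    print(header, sequence)
```

With a real file the same function could be used as `with open(path) as file_obj: records = read_fasta(file_obj)`, where `path` is whatever file name you have.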

Once you have your sequences in some kind of data structure, it is time to start analysis. While we cannot hope to anticipate all that you might need to do, we can at least give some idea of what is possible. At the same time we aim to show how some of the things that are commonly done with sequences can be readily achieved with Python. The following examples are simple scripts that all deduce some property of an input DNA, RNA or protein sequence that gives some real-world information or prediction about the sequence. Note that in all of the examples we will forgo checking that the sequences we use are valid: that they are the right kind of object and that they contain only the known types of residue code or letter. In an important real-world application you would clearly make such checks before you try to run any analyses and the BioPython modules that we demonstrate at the end can help you do this.
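As an indication of the kind of simple script meant here, the sketch below deduces one property of a DNA sequence: the fraction of G and C bases. The `gc_fraction` function and the test sequence are hypothetical and, in keeping with the note above, no validity checking is done.

```python
def gc_fraction(dna_seq):
    """Return the fraction of bases in dna_seq that are G or C."""
    seq = dna_seq.upper()
    if not seq:
        return 0.0                # avoid dividing by zero for an empty string
    return (seq.count('G') + seq.count('C')) / len(seq)


example_seq = 'ATGGCCTGA'         # made-up sequence: 5 of its 9 bases are G or C
print(gc_fraction(example_seq))
```

Because the input is not checked, an invalid letter is silently counted as neither G nor C, which is exactly the kind of problem a validation step would catch.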
