Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Substitutability

Next we will move on from measuring a simple sequence identity to the more subtle

measure of sequence similarity; this is to say that sequence pairs in an alignment can have

a score even when they are not the same. The notion of similarity in this case is somewhat

subjective and ultimately depends on the kind of biology you are working with.

Nevertheless the general idea when scoring how similar two residues are, when aligned as

a pair, is to consider how substitutable one residue type is for another; in other words, how

likely they are to have been swapped or exchanged for one another. Residues that

commonly swap for one another are deemed to be similar and give high scores, while

those that rarely swap are dissimilar and give low scores. High similarity in this instance

doesn’t necessarily mean that two residues are always chemically similar, although they

often are. Strictly speaking the substitutability of one residue type for another depends on

the exact context of the residue (where it is in a chromosome or protein etc.) but we can

ignore this complication for now and consider just an average value for swap-ability.

The substitutability of one residue for another is stored as a two-dimensional array,

commonly called a substitution matrix or similarity matrix. The idea is that each score

value in the matrix represents the substitutability of two residue types, e.g. ‘A’ to ‘G’ in

DNA or ‘V’ to ‘L’ in proteins. The two residue types can be thought of as indicating the

row and column of an element in a matrix, although in our Python examples we will

encode matrices as dictionaries of dictionaries. Using dictionaries we can look up the

score for two residue types by using the residue letters directly as keys, without having to

work out the numbers for the matrix row and column. With a substitution matrix

dictionary the first key (residue letter) identifies a sub-dictionary from inside the main

dictionary and the second key gets the final value from inside the sub-dictionary.
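
To make the contrast between the two layouts concrete, here is a minimal sketch. The names residues, array_matrix and dict_matrix are our own illustrative choices and the scores are just placeholders; the point is only that the array form needs a letter-to-index conversion while the dictionary form does not.

```python
# Residue order that fixes the row and column numbers of the array form.
residues = 'GCAT'

# Array form: placeholder scores of 1 on the diagonal and 0 elsewhere.
array_matrix = [[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]]

# Looking up a score first requires converting each letter to an index.
row = residues.index('A')
col = residues.index('T')
array_score = array_matrix[row][col]

# Dictionary form: the same scores keyed directly by residue letter.
dict_matrix = {}
for i, res_a in enumerate(residues):
    dict_matrix[res_a] = {}
    for j, res_b in enumerate(residues):
        dict_matrix[res_a][res_b] = array_matrix[i][j]

sub_dict = dict_matrix['A']    # first key selects the sub-dictionary
dict_score = sub_dict['T']     # second key selects the final value
```

Both look-ups return the same value, but the dictionary version never needs to know the order in which the residues were listed.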

Below is an example of a very simple substitution matrix that would give the same

scores as if you were measuring sequence identity, i.e. a score of one where residues are

identical and zero elsewhere.

DNA_1 = {'G': {'G':1, 'C':0, 'A':0, 'T':0},
         'C': {'G':0, 'C':1, 'A':0, 'T':0},
         'A': {'G':0, 'C':0, 'A':1, 'T':0},
         'T': {'G':0, 'C':0, 'A':0, 'T':1}}

Remembering that two keys are needed to extract a value (one for the main dictionary

and one for the sub-dictionaries) we would get 1 for identical residue look-ups like

DNA_1['G']['G'] and 0 for non-identical keys like DNA_1['G']['A'].
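
As a rough illustration of how such look-ups might be combined, the sketch below sums the matrix score over every aligned pair of residues. The function name score_aligned_pair is hypothetical, and we assume two gap-free sequences of equal length; it is not necessarily how scoring is implemented elsewhere in this book.

```python
# Identity matrix from the text: 1 for identical residues, 0 otherwise.
DNA_1 = {'G': {'G': 1, 'C': 0, 'A': 0, 'T': 0},
         'C': {'G': 0, 'C': 1, 'A': 0, 'T': 0},
         'A': {'G': 0, 'C': 0, 'A': 1, 'T': 0},
         'T': {'G': 0, 'C': 0, 'A': 0, 'T': 1}}


def score_aligned_pair(seq_a, seq_b, matrix):
    """Sum the matrix scores over each aligned residue pair.

    Hypothetical helper: assumes equal-length, gap-free sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError('sequences must be the same length')

    total = 0
    for res_a, res_b in zip(seq_a, seq_b):
        # First key selects the sub-dictionary, second key selects the score.
        total += matrix[res_a][res_b]

    return total


# Five of the seven positions are identical in this made-up pair.
identity_score = score_aligned_pair('GATTACA', 'GACTATA', DNA_1)
```

With the identity matrix the total is simply the number of matching positions, so swapping in a different matrix changes the scoring scheme without touching the function.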

Changing track slightly, rather than scoring DNA sequences for matches we could also

score for complementarity (i.e. using Crick and Watson’s pairing rules), with 1 for A:T or

G:C matches and -1 for mismatches. Expressed as a Python dictionary this would be:

REV_COMP = {'G': {'G':-1, 'C': 1, 'A':-1, 'T':-1},
            'C': {'G': 1, 'C':-1, 'A':-1, 'T':-1},
            'A': {'G':-1, 'C':-1, 'A':-1, 'T': 1},
            'T': {'G':-1, 'C':-1, 'A': 1, 'T':-1}}
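
One way this matrix might be used is sketched below. The function complementarity_score is a hypothetical helper, and we assume that both strands are written in the same 5' to 3' direction, so the second strand is reversed before the residues are paired up.

```python
REV_COMP = {'G': {'G': -1, 'C': 1, 'A': -1, 'T': -1},
            'C': {'G': 1, 'C': -1, 'A': -1, 'T': -1},
            'A': {'G': -1, 'C': -1, 'A': -1, 'T': 1},
            'T': {'G': -1, 'C': -1, 'A': 1, 'T': -1}}


def complementarity_score(strand_a, strand_b):
    """Score how well strand_b pairs with strand_a (hypothetical helper).

    Assumes equal-length strands, both written 5' to 3', so strand_b is
    reversed to line each residue up with its antiparallel partner.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError('strands must be the same length')

    return sum(REV_COMP[res_a][res_b]
               for res_a, res_b in zip(strand_a, reversed(strand_b)))


# 'TGTAATC' is the reverse complement of 'GATTACA', so every pair scores 1.
perfect_score = complementarity_score('GATTACA', 'TGTAATC')
```

A perfectly complementary pair of strands scores the full sequence length, and each mismatched position pulls the total down by two relative to that maximum.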

Moving on to a more sophisticated matrix, as illustrated above, you will note that

substitution scores can have negative values (mismatch) and that a score of zero is often

used to indicate indifference. In the DNA_2 matrix below note that identical residue keys

give a score of 1 but non-identical -3. In other words the mismatches are strongly

penalised; in an alignment three identical residues are required to balance one mismatch.

Also note that the example uses the residue code ‘N’, which in this instance for DNA

means any unidentified residue. It is indifferent in an alignment, given that we

cannot tell if it is good or bad, and so scores zero with everything.

DNA_2 = {'G': {'G': 1, 'C':-3, 'A':-3, 'T':-3, 'N':0},
         'C': {'G':-3, 'C': 1, 'A':-3, 'T':-3, 'N':0},
         'A': {'G':-3, 'C':-3, 'A': 1, 'T':-3, 'N':0},
         'T': {'G':-3, 'C':-3, 'A':-3, 'T': 1, 'N':0},
         'N': {'G': 0, 'C': 0, 'A': 0, 'T': 0, 'N':0}}
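
Because this matrix follows a simple rule, it could equally be generated rather than typed out. The sketch below does this with a hypothetical helper, make_dna_matrix, whose name and parameters are our own, and then checks the statement that three matches balance one mismatch.

```python
def make_dna_matrix(match=1, mismatch=-3, unknown='N'):
    """Build a DNA substitution matrix as a dictionary of dictionaries.

    Hypothetical helper: any pair involving the unknown code scores zero.
    """
    codes = 'GCAT' + unknown
    matrix = {}

    for res_a in codes:
        matrix[res_a] = {}
        for res_b in codes:
            if unknown in (res_a, res_b):
                score = 0          # indifferent: cannot tell good from bad
            elif res_a == res_b:
                score = match
            else:
                score = mismatch
            matrix[res_a][res_b] = score

    return matrix


DNA_2 = make_dna_matrix()

# Three identical pairs (+1 each) and one mismatch (-3) cancel out.
balanced_score = sum(DNA_2[a][b] for a, b in zip('GGGA', 'GGGT'))
```

Changing the match and mismatch arguments gives a quick way to experiment with how strongly mismatches are penalised.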

The next example is part of a substitution matrix for protein sequences. It is a fairly

famous one called BLOSUM62 (often the default in many programs). You will of course

note that the matrix is much larger than for DNA because we have 20 regular amino acids,

plus ‘X’ for unknown type. We have only shown the first four sub-dictionaries here, but

the full matrix can be found in the on-line material (available via

http://www.cambridge.org/pythonforbiology). There are usually many variants of a given

substitution matrix type. Here we specifically use the ‘62’ version of the BLOSUM series

because it is a good general-purpose one. You would commonly consider using different

matrix versions to tune your alignment for more closely related or distantly related

sequences for which substitution preferences are known to differ.

BLOSUM62 = {'A':{'A': 4,'R':-1,'N':-2,'D':-2,'C': 0,'Q':-1,
                 'E':-1,'G': 0,'H':-2,'I':-1,'L':-1,'K':-1,
                 'M':-1,'F':-2,'P':-1,'S': 1,'T': 0,'W':-3,
                 'Y':-2,'V': 0,'X':0},
            'R':{'A':-1,'R': 5,'N': 0,'D':-2,'C':-3,'Q': 1,
                 'E': 0,'G':-2,'H': 0,'I':-3,'L':-2,'K': 2,
                 'M':-1,'F':-3,'P':-2,'S':-1,'T':-1,'W':-3,
                 'Y':-2,'V':-3,'X':0},
            'N':{'A':-2,'R': 0,'N': 6,'D': 1,'C':-3,'Q': 0,
                 'E': 0,'G': 0,'H': 1,'I':-3,'L':-3,'K': 0,
                 'M':-2,'F':-3,'P':-2,'S': 1,'T': 0,'W':-4,
                 'Y':-2,'V':-3,'X':0},
            'D':{'A':-2,'R':-2,'N': 1,'D': 6,'C':-3,'Q': 0,
                 'E': 2,'G':-1,'H':-1,'I':-3,'L':-4,'K':-1,
                 'M':-3,'F':-3,'P':-1,'S': 0,'T':-1,'W':-4,
                 'Y':-3,'V':-3,'X':0}}

## SNIP: THE FULL MATRIX CARRIES ON FOR 17 MORE SUB-DICTIONARIES ##

As with the DNA matrices we use two keys to get the substitution score and have

positive, zero and negative values. Note that the matrix, like the DNA matrix examples, is

symmetric, i.e. BLOSUM62['A']['R'] equals BLOSUM62['R']['A']. Unlike the DNA

examples the diagonal of the matrix is not uniform, which is to say that the score for

residue types being the same in an alignment differs. For example, BLOSUM62['A']['A'],

meaning an exact alanine match, gives a score of 4, but an exact asparagine match

BLOSUM62['N']['N'] gives a higher score of 6. Thus ‘A’ is less well conserved (more

swappable for something else) than ‘N’.
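
These two properties can be checked directly in Python. The sketch below uses only the four sub-dictionaries shown above, so the hypothetical helper is_symmetric skips any pair whose reverse look-up would need a sub-dictionary that is missing from the excerpt.

```python
# Excerpt only: the four BLOSUM62 sub-dictionaries shown in the text.
BLOSUM62 = {'A': {'A': 4, 'R': -1, 'N': -2, 'D': -2, 'C': 0, 'Q': -1,
                  'E': -1, 'G': 0, 'H': -2, 'I': -1, 'L': -1, 'K': -1,
                  'M': -1, 'F': -2, 'P': -1, 'S': 1, 'T': 0, 'W': -3,
                  'Y': -2, 'V': 0, 'X': 0},
            'R': {'A': -1, 'R': 5, 'N': 0, 'D': -2, 'C': -3, 'Q': 1,
                  'E': 0, 'G': -2, 'H': 0, 'I': -3, 'L': -2, 'K': 2,
                  'M': -1, 'F': -3, 'P': -2, 'S': -1, 'T': -1, 'W': -3,
                  'Y': -2, 'V': -3, 'X': 0},
            'N': {'A': -2, 'R': 0, 'N': 6, 'D': 1, 'C': -3, 'Q': 0,
                  'E': 0, 'G': 0, 'H': 1, 'I': -3, 'L': -3, 'K': 0,
                  'M': -2, 'F': -3, 'P': -2, 'S': 1, 'T': 0, 'W': -4,
                  'Y': -2, 'V': -3, 'X': 0},
            'D': {'A': -2, 'R': -2, 'N': 1, 'D': 6, 'C': -3, 'Q': 0,
                  'E': 2, 'G': -1, 'H': -1, 'I': -3, 'L': -4, 'K': -1,
                  'M': -3, 'F': -3, 'P': -1, 'S': 0, 'T': -1, 'W': -4,
                  'Y': -3, 'V': -3, 'X': 0}}


def is_symmetric(matrix):
    """Return True if matrix[a][b] equals matrix[b][a] wherever both exist."""
    for res_a, sub_dict in matrix.items():
        for res_b, score in sub_dict.items():
            # Skip pairs whose reverse sub-dictionary is not in the excerpt.
            if res_b in matrix and matrix[res_b][res_a] != score:
                return False
    return True


symmetric = is_symmetric(BLOSUM62)

# Self-match score for each residue type that has its own sub-dictionary.
diagonal = {res: BLOSUM62[res][res] for res in BLOSUM62}
```

The diagonal dictionary collects the exact-match scores, which makes it easy to see that they differ between residue types.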

We will not go into fine detail about how substitution matrices are calculated until

Chapter 14. In essence the idea is that you first generate good, well-curated alignments of

multiple sequences using as much information as humanly possible from structure and

function etc. and you then count how many times one residue type is substituted for

another within the alignment. Then in various ways these counts are converted into whole-

number scores, relative to some baseline value. If you are really interested, we recommend

the early papers on the PAM and BLOSUM matrices. These two protein matrices are

calculated in slightly different ways, but together they give a good idea of the underlying

principles.
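
To give a flavour of the counting idea, here is a deliberately simplified sketch. It is not the PAM or BLOSUM procedure: it assumes a tiny, gap-free toy alignment, takes the baseline to be the pairing expected by chance from the residue frequencies, and ignores refinements such as sequence weighting and unobserved pairs. The function name log_odds_scores and its scale parameter are our own.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(alignment, scale=2):
    """Turn substitution counts from an alignment into whole-number scores.

    Toy illustration only: assumes equal-length, gap-free sequences and
    gives scores only for residue pairs that were actually observed.
    """
    # Count every pair of residues that share an alignment column,
    # in both orders so that the resulting matrix is symmetric.
    pair_counts = Counter()
    for column in zip(*alignment):
        for res_a, res_b in combinations(column, 2):
            pair_counts[(res_a, res_b)] += 1
            pair_counts[(res_b, res_a)] += 1

    total = sum(pair_counts.values())
    pair_freq = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each residue type, taken from the pair frequencies.
    residue_freq = Counter()
    for (res_a, _), freq in pair_freq.items():
        residue_freq[res_a] += freq

    # Score = scaled, rounded log of observed over chance-expected pairing.
    scores = {}
    for (res_a, res_b), observed in pair_freq.items():
        expected = residue_freq[res_a] * residue_freq[res_b]
        score = round(scale * log2(observed / expected))
        scores.setdefault(res_a, {})[res_b] = score

    return scores


# Made-up alignment of three two-residue sequences.
toy_scores = log_odds_scores(['AG', 'AG', 'AA'])
```

Pairs seen more often than chance would predict come out positive and pairs seen less often come out negative, which is the same general pattern as in the matrices above.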
