Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Substitutability

Next we will move on from measuring simple sequence identity to the more subtle measure of sequence similarity; that is to say, residue pairs in an alignment can have a score even when they are not the same. The notion of similarity in this case is somewhat subjective and ultimately depends on the kind of biology you are working with. Nevertheless, the general idea when scoring how similar two residues are, when aligned as a pair, is to consider how substitutable one residue type is for another; in other words, how likely they are to have been swapped or exchanged for one another. Residues that commonly swap for one another are deemed to be similar and give high scores, while those that rarely swap are dissimilar and give low scores. High similarity in this instance doesn’t necessarily mean that two residues are chemically similar, although they often are. Strictly speaking, the substitutability of one residue type for another depends on the exact context of the residue (where it is in a chromosome or protein, etc.), but we can ignore this complication for now and consider just an average value for swap-ability.

The substitutability of one residue for another is stored as a two-dimensional array, commonly called a substitution matrix or similarity matrix. The idea is that each score value in the matrix represents the substitutability of two residue types, e.g. ‘A’ to ‘G’ in DNA or ‘V’ to ‘L’ in proteins. The two residue types can be thought of as indicating the row and column of an element in a matrix, although in our Python examples we will encode matrices as dictionaries of dictionaries. Using dictionaries, we can look up the score for two residue types by using the residue letters directly as keys, without having to work out the numbers for the matrix row and column. With a substitution matrix dictionary, the first key (residue letter) identifies a sub-dictionary from inside the main dictionary, and the second key gets the final value from inside the sub-dictionary.

Below is an example of a very simple substitution matrix that would give the same scores as if you were measuring sequence identity, i.e. a score of one where residues are identical and zero elsewhere.

DNA_1 = {'G': { 'G':1, 'C':0, 'A':0, 'T':0 },
         'C': { 'G':0, 'C':1, 'A':0, 'T':0 },
         'A': { 'G':0, 'C':0, 'A':1, 'T':0 },
         'T': { 'G':0, 'C':0, 'A':0, 'T':1 }}

Remembering that two keys are needed to extract a value (one for the main dictionary and one for the sub-dictionary), we would get 1 for identical residue look-ups like DNA_1['G']['G'] and 0 for non-identical keys like DNA_1['G']['A'].
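The two-key look-up can be sketched as a small scoring loop. This is only a minimal illustration: the function name score_aligned_pair and the test sequences are hypothetical, and the sketch assumes two equal-length sequences that are already aligned without gaps.

```python
# Identity matrix from the text: 1 for identical residues, 0 otherwise.
DNA_1 = {'G': {'G': 1, 'C': 0, 'A': 0, 'T': 0},
         'C': {'G': 0, 'C': 1, 'A': 0, 'T': 0},
         'A': {'G': 0, 'C': 0, 'A': 1, 'T': 0},
         'T': {'G': 0, 'C': 0, 'A': 0, 'T': 1}}


def score_aligned_pair(seq_a, seq_b, sim_matrix):
    """Sum the substitution scores over the columns of an ungapped alignment.

    Hypothetical helper: assumes equal-length sequences with no gap symbols.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError('Aligned sequences must have equal length')

    total = 0
    for res_a, res_b in zip(seq_a, seq_b):
        # The first key selects the sub-dictionary, the second selects the score.
        total += sim_matrix[res_a][res_b]

    return total


# Illustrative sequences only; they differ at two of their seven positions.
identity_score = score_aligned_pair('GATTACA', 'GACTATA', DNA_1)
print(identity_score)
```

With the DNA_1 matrix the total is simply the number of identical positions, which is why this matrix reproduces a plain identity count.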

Changing track slightly, rather than scoring DNA sequences for matches we could also score for complementarity (i.e. using Crick and Watson’s base-pairing rules), with 1 for A:T or G:C pairs and -1 for all other combinations. Expressed as a Python dictionary this would be:

REV_COMP = {'G': { 'G':-1, 'C': 1, 'A':-1, 'T':-1 },
            'C': { 'G': 1, 'C':-1, 'A':-1, 'T':-1 },
            'A': { 'G':-1, 'C':-1, 'A':-1, 'T': 1 },
            'T': { 'G':-1, 'C':-1, 'A': 1, 'T':-1 }}
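As a rough sketch of how such a matrix might be used, the code below scores one strand against another read in the opposite direction, as happens when two strands pair. The function name complementarity_score is hypothetical, and reversing the second sequence is an assumption about how the strands are oriented; neither comes from the text.

```python
# Complementarity matrix from the text: 1 for A:T or G:C pairs, -1 otherwise.
REV_COMP = {'G': {'G': -1, 'C': 1, 'A': -1, 'T': -1},
            'C': {'G': 1, 'C': -1, 'A': -1, 'T': -1},
            'A': {'G': -1, 'C': -1, 'A': -1, 'T': 1},
            'T': {'G': -1, 'C': -1, 'A': 1, 'T': -1}}


def complementarity_score(seq_a, seq_b, comp_matrix=REV_COMP):
    """Score how well seq_b, read backwards, base-pairs with seq_a.

    Hypothetical helper: assumes both strands are written 5' to 3' and
    have equal length, so seq_b is reversed before pairing.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError('Sequences must have equal length')

    return sum(comp_matrix[res_a][res_b]
               for res_a, res_b in zip(seq_a, reversed(seq_b)))


# 'AAGG' pairs fully with 'CCTT' read backwards, but not with itself.
paired = complementarity_score('AAGG', 'CCTT')
unpaired = complementarity_score('AAGG', 'AAGG')
print(paired, unpaired)
```

A fully complementary pair of strands scores the sequence length, while strands with no complementary positions score minus the length.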

Moving on to a more sophisticated matrix, note that substitution scores can have negative values (for a mismatch), as illustrated above, and that a score of zero is often used to indicate indifference. In the DNA_2 matrix below, identical residue keys give a score of 1 but non-identical ones give -3. In other words, mismatches are strongly penalised; in an alignment, three identical residues are required to balance one mismatch. Also note that the example uses the residue code ‘N’, which in this instance for DNA means any⁹ unidentified residue. Such a residue is indifferent in an alignment, given that we cannot tell whether it is good or bad, and so it scores zero with everything.

DNA_2 = {'G': { 'G': 1, 'C':-3, 'A':-3, 'T':-3, 'N':0 },
         'C': { 'G':-3, 'C': 1, 'A':-3, 'T':-3, 'N':0 },
         'A': { 'G':-3, 'C':-3, 'A': 1, 'T':-3, 'N':0 },
         'T': { 'G':-3, 'C':-3, 'A':-3, 'T': 1, 'N':0 },
         'N': { 'G': 0, 'C': 0, 'A': 0, 'T': 0, 'N':0 }}
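The balance between matches and mismatches, and the neutrality of ‘N’, can be checked with a short sketch. The helper name total_score and the four-letter test sequences are hypothetical, and the sequences are assumed to be aligned without gaps.

```python
# DNA_2 from the text: match 1, mismatch -3, and 'N' scores 0 with everything.
DNA_2 = {'G': {'G': 1, 'C': -3, 'A': -3, 'T': -3, 'N': 0},
         'C': {'G': -3, 'C': 1, 'A': -3, 'T': -3, 'N': 0},
         'A': {'G': -3, 'C': -3, 'A': 1, 'T': -3, 'N': 0},
         'T': {'G': -3, 'C': -3, 'A': -3, 'T': 1, 'N': 0},
         'N': {'G': 0, 'C': 0, 'A': 0, 'T': 0, 'N': 0}}


def total_score(seq_a, seq_b, sim_matrix):
    """Hypothetical helper: sum the scores of an ungapped, equal-length alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError('Aligned sequences must have equal length')

    return sum(sim_matrix[res_a][res_b] for res_a, res_b in zip(seq_a, seq_b))


# Three matches (+1 each) and one mismatch (-3) cancel out exactly.
balanced = total_score('GATC', 'GATT', DNA_2)

# An unidentified 'N' neither adds to nor subtracts from the total.
with_unknown = total_score('GANC', 'GATC', DNA_2)

print(balanced, with_unknown)
```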

The next example is part of a substitution matrix for protein sequences. It is a fairly famous one called BLOSUM62 (the default in many programs). You will of course note that the matrix is much larger than for DNA because we have 20 regular amino acids, plus ‘X’ for unknown type. We have only shown the first four sub-dictionaries here, but the full matrix can be found in the on-line material (available via http://www.cambridge.org/pythonforbiology). There are usually many variants of a given substitution matrix type. Here we specifically use the ‘62’¹⁰ version of the BLOSUM series because it is a good general-purpose one. You would commonly consider using different matrix versions to tune your alignment for more closely related or more distantly related sequences, for which substitution preferences are known to differ.

BLOSUM62 = {'A':{'A': 4,'R':-1,'N':-2,'D':-2,'C': 0,'Q':-1,
                 'E':-1,'G': 0,'H':-2,'I':-1,'L':-1,'K':-1,
                 'M':-1,'F':-2,'P':-1,'S': 1,'T': 0,'W':-3,
                 'Y':-2,'V': 0,'X':0},
            'R':{'A':-1,'R': 5,'N': 0,'D':-2,'C':-3,'Q': 1,
                 'E': 0,'G':-2,'H': 0,'I':-3,'L':-2,'K': 2,
                 'M':-1,'F':-3,'P':-2,'S':-1,'T':-1,'W':-3,
                 'Y':-2,'V':-3,'X':0},
            'N':{'A':-2,'R': 0,'N': 6,'D': 1,'C':-3,'Q': 0,
                 'E': 0,'G': 0,'H': 1,'I':-3,'L':-3,'K': 0,
                 'M':-2,'F':-3,'P':-2,'S': 1,'T': 0,'W':-4,
                 'Y':-2,'V':-3,'X':0},
            'D':{'A':-2,'R':-2,'N': 1,'D': 6,'C':-3,'Q': 0,
                 'E': 2,'G':-1,'H':-1,'I':-3,'L':-4,'K':-1,
                 'M':-3,'F':-3,'P':-1,'S': 0,'T':-1,'W':-4,
                 'Y':-3,'V':-3,'X':0}}

## SNIP: THE FULL MATRIX CARRIES ON FOR 17 MORE SUB-DICTIONARIES ##

As with the DNA matrices, we use two keys to get the substitution score, and the values can be positive, zero or negative. Note that the matrix, like the DNA matrix examples, is symmetric,¹¹ i.e. BLOSUM62['A']['R'] equals BLOSUM62['R']['A']. Unlike the DNA examples, the diagonal of the matrix is not uniform, which is to say that the score for an identical pair in an alignment differs between residue types. For example, BLOSUM62['A']['A'], meaning an exact alanine match, gives a score of 4, but an exact asparagine match, BLOSUM62['N']['N'], gives a higher score of 6. Thus ‘A’ is less well conserved (more swappable for something else) than ‘N’.
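Both properties, the symmetry and the non-uniform diagonal, can be checked programmatically. The sketch below uses only the 4 x 4 corner of the matrix for ‘A’, ‘R’, ‘N’ and ‘D’, with values copied from the listing above. The names BLOSUM62_CORNER and is_symmetric are hypothetical, chosen just for this illustration.

```python
# Corner of BLOSUM62 restricted to A, R, N and D (values from the listing).
BLOSUM62_CORNER = {'A': {'A': 4, 'R': -1, 'N': -2, 'D': -2},
                   'R': {'A': -1, 'R': 5, 'N': 0, 'D': -2},
                   'N': {'A': -2, 'R': 0, 'N': 6, 'D': 1},
                   'D': {'A': -2, 'R': -2, 'N': 1, 'D': 6}}


def is_symmetric(sim_matrix):
    """Return True if every score equals the score with its two keys swapped.

    Hypothetical helper: pairs whose reverse look-up is missing (as in a
    truncated matrix) are skipped rather than treated as failures.
    """
    for res_a, row in sim_matrix.items():
        for res_b, score in row.items():
            reverse_row = sim_matrix.get(res_b, {})
            if res_a in reverse_row and reverse_row[res_a] != score:
                return False

    return True


symmetric = is_symmetric(BLOSUM62_CORNER)

# Scores for exact matches, i.e. the diagonal of the matrix.
diagonal = {res: BLOSUM62_CORNER[res][res] for res in BLOSUM62_CORNER}

print(symmetric, diagonal)
```

The same function could be applied to the full matrix from the on-line material, where every reverse look-up exists.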

We will not go into fine detail about how substitution matrices are calculated until Chapter 14. In essence, the idea is that you first generate good, well-curated alignments of multiple sequences, using as much information as humanly possible from structure, function, etc., and you then count how many times one residue type is substituted for another within the alignment. These counts are then converted, in various ways, into whole-number scores relative to some baseline value. If you are really interested, we recommend the early papers on the PAM¹² and BLOSUM¹³ matrices. These two protein matrices are calculated in slightly different ways, but together they give a good idea of the underlying principles.
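To give a rough feel for the conversion from counts to scores, here is a minimal sketch of one common scheme, a rounded log-odds score of the kind used for BLOSUM-style matrices. It is not necessarily the procedure described in Chapter 14. The pair counts are invented purely for illustration, ‘a’ and ‘b’ are placeholder residue types, and the function name log_odds_matrix is hypothetical. The baseline here is the pair frequency expected by chance from the overall residue frequencies, and every count is assumed to be greater than zero.

```python
from math import log2

# Invented counts of aligned residue pairs (unordered); not real data.
PAIR_COUNTS = {('a', 'a'): 60, ('a', 'b'): 20, ('b', 'b'): 20}


def log_odds_matrix(pair_counts, scale=2.0):
    """Convert unordered pair counts into rounded log-odds scores.

    Hypothetical sketch: assumes all counts are positive. A scale of 2.0
    gives scores in half-bit units.
    """
    total = sum(pair_counts.values())

    # Observed frequency of each unordered pair.
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue type; a mixed pair contributes
    # half of its frequency to each of its two members.
    background = {}
    for (res_a, res_b), freq in observed.items():
        if res_a == res_b:
            background[res_a] = background.get(res_a, 0.0) + freq
        else:
            background[res_a] = background.get(res_a, 0.0) + freq / 2
            background[res_b] = background.get(res_b, 0.0) + freq / 2

    matrix = {res: {} for res in background}
    for (res_a, res_b), freq in observed.items():
        # Frequency expected by chance if residues paired independently.
        expected = background[res_a] * background[res_b]
        if res_a != res_b:
            expected *= 2  # either ordering gives the same unordered pair

        score = round(scale * log2(freq / expected))
        matrix[res_a][res_b] = score
        matrix[res_b][res_a] = score  # fill both halves: symmetric matrix

    return matrix


toy_scores = log_odds_matrix(PAIR_COUNTS)
print(toy_scores)
```

Pairs seen more often than chance predicts come out positive and pairs seen less often come out negative, which matches the sign convention of the matrices in this section.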



