Python Programming for Biology: Bioinformatics and Beyond


Calculating substitution matrices



Download 7,75 Mb.
Pdf ko'rish
bet202/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   198   199   200   201   202   203   204   205   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Calculating substitution matrices

Next we will move on from conservation to a means of measuring the variations, or more

specifically  the  kinds  of  residue  substitution,  present  in  a  multiple-sequence  alignment.

We  will  use  the  analysis  of  substitutions  to  illustrate  how  an  empirically  derived

substitution matrix, like the BLOSUM variant we have been using, can be measured. Of

course it may seem that there is a somewhat circular argument in what we will be doing;

an initial substitution matrix is needed to make the alignments from which we calculate a

new substitution matrix. Clearly the composition of the new matrix will be influenced by

the  choice  of  the  first.  However,  this  is  not  all  folly.  Firstly,  there  are  many  ways  of

making and improving alignments which don’t rely on a substitution rule. For proteins this

would typically involve using alignments of three-dimensional structures. If homologous

proteins of known structure can be aligned spatially then we have an alternative means to

detect equivalent residue pairs. Secondly, even if we use one matrix to generate the input

alignments the choice of sequences in the alignment has a vast influence on the outcome.

So,  for  example,  we  might  use  a  general  matrix  to  perform  alignments  of  a  very  narrow

class of sequences, but the measured substitutions will reflect the choice of input. In this

way  a  substitution  matrix  becomes  more  specialised,  and  thus  better  able  to  detect

substitutions of the specialist class.

The function to calculate substitution naturally accepts a list of alignments to analyse,

and various other arguments, but it is notable that these include a number that is used to

smooth  the  data.  The  idea  of  data  smoothing  might  seem  dubious  at  first  but  it  has  an

important  function  here.  What  the  function  will  do  in  the  end  is  compare  the  observed

number  of  substitutions,  given  the  alignments,  and  the  expected  number  of  substitutions

for  each  possible  pair  of  residue  types.  If  the  number  of  observations  is  very  small  then

such  a  comparison  is  much  less  meaningful.  The  addition  or  removal  of  an  observation

might cause a large proportional change, but this isn’t significant. Consider a substitution

type where the number of observations is 4, but the expected value is 2. The observation is

double what you expect, but it is likely that an extra couple of observations arose by pure

chance.  If  the  number  of  observations  were  40  and  the  expected  20,  then  this  is  much

more  meaningful.  Accordingly,  the  smoothing  procedure  used  to  deal  with  this  is  only

significant  when  the  number  of  observations  is  small  (i.e.  about  the  same  size  as  the

smoothing value), and all it does is make the data move towards the expectation, such that

a  particular  point  is  not  unduly  significant.  Small  numbers  of  observations  are  of  course

much less problematic if the total amount of alignment data that you can analyse is large,

but you don’t always have the luxury of this.

We will compare observed and expected substitutions by calculating the logarithm of a

probability  ratio:  a  log-odds  score.  Hence,  we  import  the  log  function  from  Python’s

standard  mathematical  library.  Logarithms  are  convenient  for  substitution  tables  so  that

addition can be used to get an overall score for alignments etc. If raw probabilities were

used then we would need to multiply values instead, which would often result in floating

point  numbers  that  are  far  too  small  for  a  computer  to  quickly  and  accurately  deal  with.

The  function  name  is  then  defined  with  its  arguments  and  two  variables  are  initialised:

empty dictionaries that will hold the statistical data. The alphabet argument is a list of all



possible  residue  letters,  whether  or  not  they  are  observed  in  the  alignments,  and  the

maxVal argument is used to scale the substitution matrix at the end.

from math import log

def calcSubstitutionMatrix(alignments, alphabet, maxVal, smooth=5):

matrix = {}

counts = {}

The blank dictionary that will contain the substitution matrix is filled with smaller sub-

dictionaries, which in turn are filled with zero values. The residue letters are the keys in

both  dictionaries  so  two  letters  are  required  to  extract  each  value:  the  substitutability  of

one residue for another. The dictionary containing the residue counts is also filled, with an

initial value of zero for each letter.

for letterA in alphabet:

subDict = {}

for letterB in alphabet:

subDict[letterB] = 0

matrix[letterA] = subDict

counts[letterA] = 0

Initially the function will measure the observed number of residues of each kind and the

number  of  substitutions  of  one  residue  type  to  another.  However,  in  the  end  we  need  to

calculate  simple  probabilities  as  the  fraction  of  total  possible  events  we  observe.

Accordingly, variables for two totals are initialised to zero, one for the total number of pair

substitutions and one for the total number of residue letters across all sequences. Note that

these  numbers  have  floating  point  representation  (0.0  not  0)  because  we  will  eventually

use them for division and wish to avoid integer rounding in Python 2.

totalRes = 0.0

totalSub = 0.0

The main part of the function involves looping through the input alignments to collect

the  substitution  data.  It  would  have  been  possible  to  call  upon  the  profile() function that

we have used previously, but on this occasion we would have had to loop through all of

the sequences and positions in any case, to calculate the totals and residue counts. Thus in

this  instance  it  is  simpler  to  do  everything  within  the  same  set  of  loops.  After  the  initial

loop we record the number of positions in the particular alignment, and then loop through

these  positions,  initialising  a  list  which  will  collect  the  residue  letters  observed  at  this

position,  eventually  doing  yet  another  loop  though  the  residue  letters,  seq[i],  available

from  all  the  sequences  at  this  location  (i).  If  the  letter  represents  a  gap  it  is  skipped,

otherwise it is added to the letters list.

for align in alignments:

numPos = len(align[0])

for i in range(numPos):



letters = []

for seq in align:

letter = seq[i]

if letter == '-':

continue

letters.append(letter)

With the letters collected we go through the list in two loops to get all of the possible

residue  pairs.  The  count  for  each  letter  is  increased  inside  the  first  loop  and  inside  the

second loop we add each pair observation to our matrix. The matrix is thus filled with the

observed  substitutions,  although  since  the  letter  does  not  change  we  really  mean




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   198   199   200   201   202   203   204   205   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish