Python Programming for Biology: Bioinformatics and Beyond



Download 7,75 Mb.
Pdf ko'rish
bet203/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   199   200   201   202   203   204   205   206   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

preservation.  The  overall  residue  count  is  increased  by  numLetters,  the  length  of  the

letters list (so not including gaps), and the substitution total is increased by the square of

numLetters, because we compare all against all to fill the matrix.

for letterA in letters:

counts[letterA] += 1

for letterB in letters:

matrix[letterA][letterB] += 1

numLetters = len(letters)

totalRes += numLetters

totalSub += numLetters * numLetters

Once the residue and substation counts have been collected the next task is to calculate

the  various  probabilities  and  the  log-odds  scores  for  the  final  matrix.  To  calculate  the

expected  substitution  probability  we  first  calculate  the  average  composition  of  each

residue  type  by  considering  the  number  of  counts  over  all  alignments,  residue  positions

and sequences.

averageComp = {}

for letter in alphabet:

averageComp[letter] = counts[letter]/totalRes

A  variable  for  the  maximum  score  is  initialised  and  then  we  loop  through  all  residue

pairs, given the possibilities defined in our alphabet of residue letters:

maxScore = None

for resA in alphabet:

for resB in alphabet:

For  each  pair  of  residue  types  we  calculate  the  expected  substitution  probability  by

multiplying  the  average  compositions  of  each  type.  This  is  equivalent  to  saying  that  the

expectation for a given pair, in the absence of any other information, depends only upon

the  probability  of  observing  each  residue  type  independently.  With  the  expectation

calculated we do a check to make sure that it is not a zero value (False in the if statement);

if it is zero then one of the residue types is not observed at all in the data, thus we have

nothing to do and so skip the loop.




expected = averageComp[resA] * averageComp[resB]

if not expected:

continue

The observed substitution count is simply the pair count of this combination, as stored

in matrix. This is then used to calculate a weighting value that is used to smooth the data

in the event of the number of observations being low. Note that if observed is much larger

than the smooth value the weight is close to zero, if they are equal the weight is one half,

and if observed is comparatively small the weight is about one.

observed = matrix[resA][resB]

weight = 1.0 / (1.0+(observed/smooth))

The  observed  substitution  counts  for  the  pair  are  converted  into  a  probability  by

dividing  by  the  total  number  of  substitution  pairs.  Then  the  smoothing  is  applied  by

redefining the observed probability as a combination of the original observed probability

and  the  expected  probability.  With  a  low  weight  (lots  of  observations)  the  observation

probability is barely altered, but with a large weight (few observations) the probability is

closer to the expectation.

observed /= totalSub

observed = weight*expected + (1-weight)*observed

Lastly in the loops the log-odds score is calculated as the logarithm of the probability

ratio; we check whether this score is the new maximum score value and then put the score

into the matrix for the current residue letters.

logOdds = log(observed/expected)

if (maxScore is None) or (logOdds>maxScore):

maxScore = logOdds

matrix[resA][resB] = logOdds

Once  the  loops  are  done  it  simply  remains  to  scale  the  log-odds  values  in  the

substitution  matrix  and  convert  to  integers.  The  integer  conversion  is  somewhat

traditional, but does mean that calculations with the substitution matrix are quicker. Note

how  the  maximum  score  is  converted  to  a  positive  value  using  abs()because  the  score

logarithm could be negative. The maximum score is used to divide the matrix values to get

a relative figure. However, we also multiply the log-odds score by the maxVal  argument

passed  in  at  the  start,  which  means  we  generate  a  range  of  integer  values  up  this  figure.

Without maxVal using int() here would be pointless; we would only get zero or one.

maxScore = abs(maxScore)

for resA in alphabet:

for resB in alphabet:

matrix[resA][resB] = int(maxVal*matrix[resA][resB]/maxScore)

return matrix




The  function  is  tested  with  a  list  of  alignments,  a  list  of  residue  letters  and  a  scaling,

maximum value for the matrix. Note how we can cheekily get the residue letters from the

keys of an existing substitution matrix.

aminoAcids = BLOSUM62.keys()

print(calcSubstitutionMatrix([align2,], aminoAcids, 10))


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   199   200   201   202   203   204   205   206   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish