Python Programming for Biology: Bioinformatics and Beyond


Generating a consensus sequence



Download 7,75 Mb.
Pdf ko'rish
bet181/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   177   178   179   180   181   182   183   184   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Generating a consensus sequence

The following Python example takes a (multiple) alignment and produces a representative

consensus  sequence.  The  input  alignment  is  assumed  to  be  a  list  of  one-letter  sequences

with  gaps,  with  each  list  assumed  to  be  of  the  same  length.  The  second  input  to  the

function is a threshold value so that if no single residue type exceeds this fraction, at any

of  the  alignment  positions,  then  the  position  is  deemed  to  be  undefined  and  the  residue

code ‘X’ will be used. With the default threshold, 0.25, if we have an alignment containing

ten sequences, at least three residues of the same kind must be present to give something

in the consensus.

Initially, the function is defined along with some initial values, one for the length of the

sequences,  n  (we  happen  to  measure  the  first  sequence  of  the  alignment),  one  for  the

number of sequences in the alignment, nSeq, and an empty consensus string that will be

filled in and then passed back at the end. Note that we convert the number of sequences

into a floating point number because later on we will be using it to do some division, and

we want to ensure that this operation produces a floating point number.

1

def consensus(alignment, threshold=0.25):



n = len(alignment[0])

nSeq = float(len(alignment))

consensus = ''

We  loop  though  an  index  representing  all  of  the  positions  (columns)  within  the

alignment. Each time we define an empty counts dictionary that will hold the number of

each type of residue observed at this position.

for i in range(n):

counts = {}

Another  loop  is  defined  inside  the  first  one  to  iterate  over  each  sequence.  We  get  the

appropriate letter for the current sequence using the position index, and skip the remainder

of the loop if the letter is actually a dash. Finally in the loop we increase the count for this

kind of residue letter by one, using the dictionary trick involving .get(), so that a starting

value of zero is obtained if we have not seen a particular letter before during this cycle.

for seq in alignment:

letter = seq[i]

if letter == '-':

continue



counts[letter] = counts.get(letter, 0) + 1

Now  that  the  counts  for  the  residue  letters  have  been  filled  for  this  position  we  next

calculate  the  fractions  of  the  total  that  each  letter  represents.  This  is  a  simple  matter  of

dividing  the  count  for  a  letter  by  the  number  of  aligned  sequences  and  then  putting  the

fraction  and  the  corresponding  letter  in  a  list.  Note  that  we  use  small,  two-element  sub-

lists to store the fractions, rather than use a dictionary as we did for the counts, so that we

can sort the values.

fractions = []

for letter in counts:

frac = counts[letter]/nSeq

fractions.append([frac, letter])

The list of fractions is sorted and because the numeric value is the first element in the

sub-lists,  the  fractions  will  be  put  in  value  order  of  the  number,  rather  than  alphabetic

order of the letter. However, the letters will still go along for the ride. The largest fraction

(bestFraction) and corresponding letter are the ones we want to use to build the consensus

sequence  with,  thus  we  take  the  last  element  ([-1])  of  the  sorted  fractions  list,

remembering  that  the  values  are  sorted  low  to  high.  Although  here  we  represent  the

proportion of each letter using a floating number between zero and one, it would also be

possible to use integers, i.e. between zero and the number of sequences.

fractions.sort()

bestFraction, bestLetter = fractions[-1]

If  the  winning  fraction  is  below  our  significance  threshold  then  we  extend  the

consensus sequence with an ‘X’, rather than use an infrequent residue. Otherwise the most

popular residue is added to the end of the consensus, and the consensus string is returned

from the function.

if bestFraction < threshold:

consensus += 'X'

else:


consensus += bestLetter

return consensus

Note that we simply concatenated the consensus string to the next letter using ‘+=’, but

we could also have put the letters in a list and used ”.join(). Finally, the function can be

tested with an alignment represented as a list of sequences:

alignment = ['SRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',

'TVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',

'HHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIK',

'HEFSELVIALIIFGVMAGVIGTILFISYGSRRLIK']

print(consensus(alignment)) # HVPSPVVIILIILGVMAGVIGTILLISYGIRRLIK




Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   177   178   179   180   181   182   183   184   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish