Python Programming for Biology: Bioinformatics and Beyond

Generating a consensus sequence

Download 7,75 Mb.

Pdf ko'rish

bet	181/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 177 178 179 180 181 182 183 184 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Generating a consensus sequence

The following Python example takes a (multiple) alignment and produces a representative

consensus sequence. The input alignment is assumed to be a list of one-letter sequences

with gaps, with each list assumed to be of the same length. The second input to the

function is a threshold value so that if no single residue type exceeds this fraction, at any

of the alignment positions, then the position is deemed to be undefined and the residue

code ‘X’ will be used. With the default threshold, 0.25, if we have an alignment containing

ten sequences, at least three residues of the same kind must be present to give something

in the consensus.

Initially, the function is defined along with some initial values, one for the length of the

sequences, n (we happen to measure the first sequence of the alignment), one for the

number of sequences in the alignment, nSeq, and an empty consensus string that will be

filled in and then passed back at the end. Note that we convert the number of sequences

into a floating point number because later on we will be using it to do some division, and

we want to ensure that this operation produces a floating point number.

def consensus(alignment, threshold=0.25):

n = len(alignment[0])

nSeq = float(len(alignment))

consensus = ''

We loop though an index representing all of the positions (columns) within the

alignment. Each time we define an empty counts dictionary that will hold the number of

each type of residue observed at this position.

for i in range(n):

counts = {}

Another loop is defined inside the first one to iterate over each sequence. We get the

appropriate letter for the current sequence using the position index, and skip the remainder

of the loop if the letter is actually a dash. Finally in the loop we increase the count for this

kind of residue letter by one, using the dictionary trick involving .get(), so that a starting

value of zero is obtained if we have not seen a particular letter before during this cycle.

for seq in alignment:

letter = seq[i]

if letter == '-':

continue

counts[letter] = counts.get(letter, 0) + 1

Now that the counts for the residue letters have been filled for this position we next

calculate the fractions of the total that each letter represents. This is a simple matter of

dividing the count for a letter by the number of aligned sequences and then putting the

fraction and the corresponding letter in a list. Note that we use small, two-element sub-

lists to store the fractions, rather than use a dictionary as we did for the counts, so that we

can sort the values.

fractions = []

for letter in counts:

frac = counts[letter]/nSeq

fractions.append([frac, letter])

The list of fractions is sorted and because the numeric value is the first element in the

sub-lists, the fractions will be put in value order of the number, rather than alphabetic

order of the letter. However, the letters will still go along for the ride. The largest fraction

(bestFraction) and corresponding letter are the ones we want to use to build the consensus

sequence with, thus we take the last element ([-1]) of the sorted fractions list,

remembering that the values are sorted low to high. Although here we represent the

proportion of each letter using a floating number between zero and one, it would also be

possible to use integers, i.e. between zero and the number of sequences.

fractions.sort()

bestFraction, bestLetter = fractions[-1]

If the winning fraction is below our significance threshold then we extend the

consensus sequence with an ‘X’, rather than use an infrequent residue. Otherwise the most

popular residue is added to the end of the consensus, and the consensus string is returned

from the function.

if bestFraction < threshold:

consensus += 'X'

else:

consensus += bestLetter

return consensus

Note that we simply concatenated the consensus string to the next letter using ‘+=’, but

we could also have put the letters in a list and used ”.join(). Finally, the function can be

tested with an alignment represented as a list of sequences:

alignment = ['SRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',

'TVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',

'HHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIK',

'HEFSELVIALIIFGVMAGVIGTILFISYGSRRLIK']

print(consensus(alignment)) # HVPSPVVIILIILGVMAGVIGTILLISYGIRRLIK

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 177 178 179 180 181 182 183 184 ... 514