Generating a consensus sequence
The following Python example takes a (multiple) alignment and produces a representative
consensus sequence. The input alignment is assumed to be a list of one-letter sequences
with gaps, with each list assumed to be of the same length. The second input to the
function is a threshold value so that if no single residue type exceeds this fraction, at any
of the alignment positions, then the position is deemed to be undefined and the residue
code ‘X’ will be used. With the default threshold, 0.25, if we have an alignment containing
ten sequences, at least three residues of the same kind must be present to give something
in the consensus.
Initially, the function is defined along with some initial values, one for the length of the
sequences, n (we happen to measure the first sequence of the alignment), one for the
number of sequences in the alignment, nSeq, and an empty consensus string that will be
filled in and then passed back at the end. Note that we convert the number of sequences
into a floating point number because later on we will be using it to do some division, and
we want to ensure that this operation produces a floating point number.
1
def consensus(alignment, threshold=0.25):
n = len(alignment[0])
nSeq = float(len(alignment))
consensus = ''
We loop though an index representing all of the positions (columns) within the
alignment. Each time we define an empty counts dictionary that will hold the number of
each type of residue observed at this position.
for i in range(n):
counts = {}
Another loop is defined inside the first one to iterate over each sequence. We get the
appropriate letter for the current sequence using the position index, and skip the remainder
of the loop if the letter is actually a dash. Finally in the loop we increase the count for this
kind of residue letter by one, using the dictionary trick involving .get(), so that a starting
value of zero is obtained if we have not seen a particular letter before during this cycle.
for seq in alignment:
letter = seq[i]
if letter == '-':
continue
counts[letter] = counts.get(letter, 0) + 1
Now that the counts for the residue letters have been filled for this position we next
calculate the fractions of the total that each letter represents. This is a simple matter of
dividing the count for a letter by the number of aligned sequences and then putting the
fraction and the corresponding letter in a list. Note that we use small, two-element sub-
lists to store the fractions, rather than use a dictionary as we did for the counts, so that we
can sort the values.
fractions = []
for letter in counts:
frac = counts[letter]/nSeq
fractions.append([frac, letter])
The list of fractions is sorted and because the numeric value is the first element in the
sub-lists, the fractions will be put in value order of the number, rather than alphabetic
order of the letter. However, the letters will still go along for the ride. The largest fraction
(bestFraction) and corresponding letter are the ones we want to use to build the consensus
sequence with, thus we take the last element ([-1]) of the sorted fractions list,
remembering that the values are sorted low to high. Although here we represent the
proportion of each letter using a floating number between zero and one, it would also be
possible to use integers, i.e. between zero and the number of sequences.
fractions.sort()
bestFraction, bestLetter = fractions[-1]
If the winning fraction is below our significance threshold then we extend the
consensus sequence with an ‘X’, rather than use an infrequent residue. Otherwise the most
popular residue is added to the end of the consensus, and the consensus string is returned
from the function.
if bestFraction < threshold:
consensus += 'X'
else:
consensus += bestLetter
return consensus
Note that we simply concatenated the consensus string to the next letter using ‘+=’, but
we could also have put the letters in a list and used ”.join(). Finally, the function can be
tested with an alignment represented as a list of sequences:
alignment = ['SRPAPVVIILIILCVMAGVIGTILLISYGIRLLIK',
'TVPAPVVIILIILCVMAGIIGTILLISYTIRRLIK',
'HHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIK',
'HEFSELVIALIIFGVMAGVIGTILFISYGSRRLIK']
print(consensus(alignment)) # HVPSPVVIILIILGVMAGVIGTILLISYGIRRLIK
Do'stlaringiz bilan baham: |