Calculating an alignment score
Our first Python examples dealing with sequence alignments are simple subroutines to
give a measure of how good an alignment is. This is important because although there are
many possible ways of arranging letters to get an alignment the one we almost always
want is the best one: the one with the highest score. The easiest way to calculate a score
for two aligned sequences is to calculate how many residue pairs are identical; this is to
say we can measure the sequence identity.
Sequence identity
The Python function to measure sequence identity is fairly simple. It accepts two
sequences and compares them under the assumption that these sequences are aligned, i.e.
that the first position in one is equivalent to the first position in the other. The procedure is
then to start with a score of zero and each time we have a position in both sequences
where the letter is the same we add one to the score. At the end we give back the score
divided by how many places we compared, so that we have an average. We also multiply
this by 100 so that we get a percentage figure, which is the conventional representation.
Note that in the function we define the number of places for comparison (numPlaces) as
the length of the shortest sequence, i.e. we take the minimum of the two sequence lengths
to guarantee that we won’t overshoot the smallest of the pair.
def calcSeqIdentity(seqA, seqB):
numPlaces = min(len(seqA), len(seqB))
score = 0.0
for i in range(numPlaces):
if seqA[i] == seqB[i]:
score += 1.0
return 100.0 * score/numPlaces
We can now test this function by defining some short protein sequences, and then
applying the function to pairs of sequences
seq1 = 'ALIGNMENTS'
seq2 = 'ALIGDVENTS'
seq3 = 'ALIGDPVENTS'
seq4 = 'ALIGN-MENTS'
print(calcSeqIdentity(seq1, seq2)) # 80.0%
print(calcSeqIdentity(seq1, seq3)) # 40.0%
print(calcSeqIdentity(seq4, seq3)) # 72.7%
Note that, as expected when seq1 is compared to seq2 the result returned is 80%.
However, for seq1 and seq3, which as you will note only differ by one extra ‘P’ residue,
the value falls to 40%. This is because although the first few residues align well, once the
extra residue is reached the sequences are no longer in step. The situation is resolved by
inserting a dash (representing a gap) in the shorter sequence, so that ‘P’ is paired with ‘-’
and the ends of the sequences align nicely. Accordingly seq4 and seq3 give a score of
72.7% (eight out of eleven). As you can see inserting a gap in the right place can be
crucial. We will illustrate later how gaps can be inserted automatically to give an optimal
alignment without human intervention.
Do'stlaringiz bilan baham: |