Python Programming for Biology: Bioinformatics and Beyond

Calculating an alignment score

Download 7,75 Mb.

Pdf ko'rish

bet	167/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 163 164 165 166 167 168 169 170 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Sequence identity

Calculating an alignment score

Our first Python examples dealing with sequence alignments are simple subroutines to

give a measure of how good an alignment is. This is important because although there are

many possible ways of arranging letters to get an alignment the one we almost always

want is the best one: the one with the highest score. The easiest way to calculate a score

for two aligned sequences is to calculate how many residue pairs are identical; this is to

say we can measure the sequence identity.

Sequence identity

The Python function to measure sequence identity is fairly simple. It accepts two

sequences and compares them under the assumption that these sequences are aligned, i.e.

that the first position in one is equivalent to the first position in the other. The procedure is

then to start with a score of zero and each time we have a position in both sequences

where the letter is the same we add one to the score. At the end we give back the score

divided by how many places we compared, so that we have an average. We also multiply

this by 100 so that we get a percentage figure, which is the conventional representation.

Note that in the function we define the number of places for comparison (numPlaces) as

the length of the shortest sequence, i.e. we take the minimum of the two sequence lengths

to guarantee that we won’t overshoot the smallest of the pair.

def calcSeqIdentity(seqA, seqB):

numPlaces = min(len(seqA), len(seqB))

score = 0.0

for i in range(numPlaces):

if seqA[i] == seqB[i]:

score += 1.0

return 100.0 * score/numPlaces

We can now test this function by defining some short protein sequences, and then

applying the function to pairs of sequences

seq1 = 'ALIGNMENTS'

seq2 = 'ALIGDVENTS'

seq3 = 'ALIGDPVENTS'

seq4 = 'ALIGN-MENTS'

print(calcSeqIdentity(seq1, seq2)) # 80.0%

print(calcSeqIdentity(seq1, seq3)) # 40.0%

print(calcSeqIdentity(seq4, seq3)) # 72.7%

Note that, as expected when seq1 is compared to seq2 the result returned is 80%.

However, for seq1 and seq3, which as you will note only differ by one extra ‘P’ residue,

the value falls to 40%. This is because although the first few residues align well, once the

extra residue is reached the sequences are no longer in step. The situation is resolved by

inserting a dash (representing a gap) in the shorter sequence, so that ‘P’ is paired with ‘-’

and the ends of the sequences align nicely. Accordingly seq4 and seq3 give a score of

72.7% (eight out of eleven). As you can see inserting a gap in the right place can be

crucial. We will illustrate later how gaps can be inserted automatically to give an optimal

alignment without human intervention.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 163 164 165 166 167 168 169 170 ... 514