The following examples move from considering a whole sequence to looking at the
properties of regions or overlapping sub-sequences from within a larger sequence.
Although we have only room to give a few simple examples, this kind of analysis is very
important in trying to ascertain the biological function of the different parts of a sequence
and in some cases to help guide experimentation.
Finding a sequence motif
The next example script is designed to find a particular smaller sub-sequence within a
larger sequence. This kind of operation is useful because specific small sequences, called
motifs, often have important biological roles. Examples of this in DNA include specific
sequences which indicate the start of genes, specify where the coding regions of genes are
spliced (introns are removed) or allow a protein to bind. There are also examples in
proteins including where specific sequences are used to interact with other proteins, bind
to DNA or direct that the protein should be modified (e.g. by joining sugars to it).
The following is a simple example of how to find a fixed sub-sequence within a larger
sequence:
seq = 'AGCTCGCTCGCTGCGTATAAAATCGCATCGCGCGCAGC'
position1 = seq.find('TATAAA')
position2 = seq.find('GAGGAG')
This will give the values for position1 as 15 and position2 as −1 (because it cannot be
found). The negative number that is given back from .find() when a substring is not found
is a bit of an oddity. The authors feel that giving back the None object would have been
more appropriate.
In many cases, however, it is not just one single well-defined sub-sequence that
corresponds to a motif with a biological function. More usually there are a range of similar
sequences that are all found to fulfil the same role. Accordingly, next we will not be
defining a sequence motif as a single stretch of residues, but rather as a residue profile.
A residue profile states, for each position in the motif, what range of residues are found
and how often one kind of residue is present compared to the others. Obviously, the
residue profile that we use to try to detect a particular motif will have been determined
beforehand, almost always by comparing sequences that we know (e.g. by
experimentation) will function in the desired way.
This particular example attempts to find the region of a DNA sequence called the
‘TATA box’. This is a biologically important region which, as the name suggests, often
contains a particular sequence. The biological role of this sequence is to help define where
the start of a gene is. Note that only some genes use the TATA box system.
Firstly, we define the profile that encodes the preferred sequence letters for each
position in a section of DNA. The profile is stored as a Python dictionary, where the keys
to the dictionary are the nucleotide letters and the values are lists of scores. Each score list
contains a number for each position that indicates how likely it is to find the given
nucleotide at that location.
profile = {
'A':[ 61, 16,352, 3,354,268,360,222,155, 56, 83, 82, 82, 68, 77],
'C':[145, 46, 0, 10, 0, 0, 3, 2, 44,135,147,127,118,107,101],
'G':[152, 18, 2, 2, 5, 0, 10, 44,157,150,128,128,128,139,140],
'T':[ 31,309, 35,374, 30,121, 6,121, 33, 48, 31, 52, 61, 75, 71]}
Next we define a function that takes a DNA sequence and a profile to determine where
the profile best matches. A for loop is used to define the start point i of each sub-sequence
that we wish to test against the profile. Note that we have ensured that the sub-sequence
does not fall off the end of the larger input sequence because we stop the loop at the end of
the sequence minus the profile length, for which we use the variable width.
We calculate the score for each sub-sequence by looking up the scores for the
component letters using the profile dictionary: we use the nucleotide letter to first get the
correct sores list and then the profile position j, to extract the right score. The position we
interrogate from the input sequence is defined as i+j; i.e. we add the position within the
sub-sequence to the absolute start of the sub-sequence within the larger sequence. The best
score that has been found so far is recorded, together with the position where that score
occurred.
def matchDnaProfile(seq, profile):
""" Find the best-matching position and score when comparing a DNA
sequence with a DNA sequence profile """
bestScore = 0
bestPosition = None # Just to start with
width = len(profile['A'])
for i in range(len(seq)-width):
score = 0
for j in range(width):
letter = seq[i+j]
score += profile[letter][j]
if score > bestScore:
bestScore = score
bestPosition = i
return bestScore, bestPosition
Rather than finding the best score and best position, we could alternatively have
reported the scores at all positions, or even a list of positions that all give a score above a
specified threshold. To test our function with our profile and DNA sequence, and report
back which section of the sequence gave the best hit, we do the following:
score, position = matchDnaProfile(dnaSeq, profile)
print(score, position, dnaSeq[position:position+15])
Do'stlaringiz bilan baham: