Python Programming for Biology: Bioinformatics and Beyond

Simple sub-sequence properties

Download 7,75 Mb.

Pdf ko'rish

bet	146/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 142 143 144 145 146 147 148 149 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Finding a sequence motif

Simple sub-sequence properties

The following examples move from considering a whole sequence to looking at the

properties of regions or overlapping sub-sequences from within a larger sequence.

Although we have only room to give a few simple examples, this kind of analysis is very

important in trying to ascertain the biological function of the different parts of a sequence

and in some cases to help guide experimentation.

Finding a sequence motif

The next example script is designed to find a particular smaller sub-sequence within a

larger sequence. This kind of operation is useful because specific small sequences, called

motifs, often have important biological roles. Examples of this in DNA include specific

sequences which indicate the start of genes, specify where the coding regions of genes are

spliced (introns are removed) or allow a protein to bind. There are also examples in

proteins including where specific sequences are used to interact with other proteins, bind

to DNA or direct that the protein should be modified (e.g. by joining sugars to it).

The following is a simple example of how to find a fixed sub-sequence within a larger

sequence:

seq = 'AGCTCGCTCGCTGCGTATAAAATCGCATCGCGCGCAGC'

position1 = seq.find('TATAAA')

position2 = seq.find('GAGGAG')

This will give the values for position1 as 15 and position2 as −1 (because it cannot be

found). The negative number that is given back from .find() when a substring is not found

is a bit of an oddity. The authors feel that giving back the None object would have been

more appropriate.

In many cases, however, it is not just one single well-defined sub-sequence that

corresponds to a motif with a biological function. More usually there are a range of similar

sequences that are all found to fulfil the same role. Accordingly, next we will not be

defining a sequence motif as a single stretch of residues, but rather as a residue profile.

A residue profile states, for each position in the motif, what range of residues are found

and how often one kind of residue is present compared to the others. Obviously, the

residue profile that we use to try to detect a particular motif will have been determined

beforehand, almost always by comparing sequences that we know (e.g. by

experimentation) will function in the desired way.

This particular example attempts to find the region of a DNA sequence called the

‘TATA box’. This is a biologically important region which, as the name suggests, often

contains a particular sequence. The biological role of this sequence is to help define where

the start of a gene is. Note that only some genes use the TATA box system.

Firstly, we define the profile that encodes the preferred sequence letters for each

position in a section of DNA. The profile is stored as a Python dictionary, where the keys

to the dictionary are the nucleotide letters and the values are lists of scores. Each score list

contains a number for each position that indicates how likely it is to find the given

nucleotide at that location.

profile = {

'A':[ 61, 16,352, 3,354,268,360,222,155, 56, 83, 82, 82, 68, 77],

'C':[145, 46, 0, 10, 0, 0, 3, 2, 44,135,147,127,118,107,101],

'G':[152, 18, 2, 2, 5, 0, 10, 44,157,150,128,128,128,139,140],

'T':[ 31,309, 35,374, 30,121, 6,121, 33, 48, 31, 52, 61, 75, 71]}

Next we define a function that takes a DNA sequence and a profile to determine where

the profile best matches. A for loop is used to define the start point i of each sub-sequence

that we wish to test against the profile. Note that we have ensured that the sub-sequence

does not fall off the end of the larger input sequence because we stop the loop at the end of

the sequence minus the profile length, for which we use the variable width.

We calculate the score for each sub-sequence by looking up the scores for the

component letters using the profile dictionary: we use the nucleotide letter to first get the

correct sores list and then the profile position j, to extract the right score. The position we

interrogate from the input sequence is defined as i+j; i.e. we add the position within the

sub-sequence to the absolute start of the sub-sequence within the larger sequence. The best

score that has been found so far is recorded, together with the position where that score

occurred.

def matchDnaProfile(seq, profile):

""" Find the best-matching position and score when comparing a DNA

sequence with a DNA sequence profile """

bestScore = 0

bestPosition = None # Just to start with

width = len(profile['A'])

for i in range(len(seq)-width):

score = 0

for j in range(width):

letter = seq[i+j]

score += profile[letter][j]

if score > bestScore:

bestScore = score

bestPosition = i

return bestScore, bestPosition

Rather than finding the best score and best position, we could alternatively have

reported the scores at all positions, or even a list of positions that all give a score above a

specified threshold. To test our function with our profile and DNA sequence, and report

back which section of the sequence gave the best hit, we do the following:

score, position = matchDnaProfile(dnaSeq, profile)

print(score, position, dnaSeq[position:position+15])

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 142 143 144 145 146 147 148 149 ... 514