Python Programming for Biology: Bioinformatics and Beyond

Figure 21.3. Nucleotide probabilities for two DNA positions


Figure 21.3. Nucleotide probabilities for two DNA positions. For two positions in a DNA sequence there are 16 possible outcomes, given the four different types of nucleotide. If the probabilities of single nucleotides are equal then the probabilities of all nucleotide pairs are also equal (1/16), and naturally sum to one.
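
The pair probabilities described in the caption can be checked with a minimal sketch. It assumes that the two positions are independent, so that each pair probability is the product of the two single-nucleotide probabilities; the names singleProbs and pairProbs are illustrative only.

```python
from itertools import product

# Assumed single-nucleotide probabilities (all equal, as in the figure)
singleProbs = {'G': 0.25, 'C': 0.25, 'A': 0.25, 'T': 0.25}

# Probability of each ordered pair, assuming the two positions are independent
pairProbs = {}
for first, second in product(singleProbs, repeat=2):
    pairProbs[first + second] = singleProbs[first] * singleProbs[second]

print(len(pairProbs))           # 16 possible pairs
print(pairProbs['GC'])          # 0.0625, i.e. 1/16
print(sum(pairProbs.values()))  # 1.0
```

If the single-nucleotide probabilities were unequal, the same loop would give unequal pair probabilities that still sum to one.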

As we have alluded to, in order to obtain a realistic probability estimate for different outcomes we generally count the number of occurrences of each in a large data set. Hence for our mouse-breeding example, even if we didn’t have a good theoretical model we could cross the two strains, count the different coat colours of the progeny and then express the counts as a proportion of the total. We may do such experiments to validate a given model, which in this case might show something of genetic interest if the model does not fit. However, for this kind of hypothesis testing (which is more properly described in Chapter 22) we have to be mindful of how the amount of data affects our confidence.

To take an arbitrary example with a mouse cross, just because eight black mice were born in a litter does not mean that the model of a 3:1 black-white ratio is wrong; litters of eight would be all black about 10% of the time (0.75^8 ≈ 0.10). You would need a much larger sample of data to be confident of the probabilities; the more experimental examples we have, the more closely the experimental ratios will match the long-term probabilities. Likewise for DNA nucleotide probabilities we can count the C:G and A:T base pairs we find in a genome, and will get the most accurate results by choosing as large a sample of sequence data as possible. If we want our probabilities to be general for the whole genome we would not want to look at only a small part, which may not be representative.
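
As a rough check of the litter example, the sketch below computes the chance of an all-black litter of eight under the 3:1 model, assuming each pup's colour is independent of the others. It also uses the standard error of a binomial proportion, sqrt(p(1-p)/n), as one common way to show how the expected scatter of an observed ratio shrinks with sample size; that formula and the variable names are illustrative additions, not taken from the text.

```python
from math import sqrt

probBlack = 0.75   # Probability of a black pup under the 3:1 model
litterSize = 8

# All pups black: independent births, so the probabilities multiply
probAllBlack = probBlack ** litterSize
print(round(probAllBlack, 4))  # 0.1001

# Expected scatter of the observed black fraction for different sample sizes
for numMice in (8, 80, 800, 8000):
    stdErr = sqrt(probBlack * (1.0 - probBlack) / numMice)
    print(numMice, stdErr)
```

Each tenfold increase in the number of mice reduces the expected scatter by a factor of about 3.2 (the square root of ten), which is why small litters say little about the underlying ratio.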

In Python, if we know the number of G:C and the number of A:T pairs for a whole genome then the probability of each nucleotide, i.e. Pr(G), Pr(C), Pr(A) and Pr(T), at a random position can be calculated as the proportion of the total:

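# Nucleotide counts; G equals C and A equals T because they are paired
counts = {'G':2356491, 'C':2356491, 'A':2283184, 'T':2283184}
total = float(sum(counts.values()))
letterProbs = {}
for letter in counts:
    # Probability is the count expressed as a proportion of the total
    letterProbs[letter] = counts[letter] / total

print(letterProbs)
# Result (rounded): {'A':0.24605, 'C':0.25395, 'T':0.24605, 'G':0.25395}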
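
Where the counts are not already known, they could be obtained directly from sequence data. The following is a minimal sketch using a short, made-up sequence (seq is hypothetical toy data, far too small to give representative probabilities) and Python 3 division; it counts the letters of one strand only, so G and C need not come out equal as they do for paired counts.

```python
from collections import Counter

# Hypothetical toy sequence; a real analysis would use a whole genome
seq = 'GATTACAGGCCTAGCATGCA'

# Count each letter and express the counts as proportions of the total
counts = Counter(seq)
total = sum(counts.values())
letterProbs = {letter: counts[letter] / total for letter in 'GCAT'}

print(letterProbs)
# Result: {'G': 0.25, 'C': 0.25, 'A': 0.3, 'T': 0.2}
```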

Even though these probabilities are an improvement on ¼ for all bases, they should still be considered an approximation, depending on the situation at hand. You may have noticed that we have been quite careful to say that this is the probability at a random position. If the DNA position we are considering is not random then the above whole-genome average would be only a first approximation. The G:C content of DNA is actually different for different chromosomes and generally varies depending on whether a position is in a gene or non-gene region. We could end up with endless categorisations and qualifications for probabilities. So while it is possible to define the probabilities for C or G being at (to take an arbitrary and complex example) the last position of the first exon of all carbohydrate metabolism genes, we wouldn’t want to go into so much detail unless there was a special reason. In general a balance is struck between having accurate general probabilities, supported by large amounts of data, and contextualised probabilities, which may be supported by very little data. In a probabilistic analysis we may wish to account for context, to make more accurate predictions, but naturally we must have data for the different situations and know when to use them. There will be some further discussion of such matters in the Markov chains section below.
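
The difference between a general and a contextualised probability can be illustrated with a small sketch. The regions dictionary, its labels and its sequence fragments are all made up for the purpose of the example, and the function gcFraction is a hypothetical helper; real annotations would come from a genome database.

```python
def gcFraction(seq):
    """Return the fraction of G and C letters in a sequence."""
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / len(seq)

# Hypothetical toy data: sequence fragments labelled by context
regions = {'gene': ['GCGCGGCATGCC', 'CCGGATGCGC'],
           'non-gene': ['ATATTTAGCATA', 'TTAAGATATC']}

# General probability: pool all of the data regardless of context
allSeq = ''.join(''.join(frags) for frags in regions.values())
print('overall', len(allSeq), gcFraction(allSeq))

# Contextualised probabilities: each one rests on less data
for context, frags in regions.items():
    joined = ''.join(frags)
    print(context, len(joined), gcFraction(joined))
```

Printing the number of letters alongside each fraction is a reminder that every contextualised estimate is supported by fewer observations than the pooled one.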
