nucleotide. If the probabilities of single nucleotides are equal then the probabilities of all
As we have alluded to, in order to obtain a realistic probability estimate for different
outcomes we generally count the number of occurrences of each in a large data set. Hence
for our mouse-breeding example, even if we didn’t have a good theoretical model we
could cross the two strains, count the different coat colours of the progeny and then
express the counts as a proportion of the total. We may do such experiments to validate a
given model, which in this case might show something of genetic interest, if the model
does not fit. Though, for this kind of hypothesis testing (which is more properly described
[...]), the sample size matters. To take an arbitrary example with a mouse cross, just because
eight black mice were born in a litter does not mean that the model of a 3:1 black-white ratio
is wrong; litters of eight black mice would be expected about 10% of the time under that
model (0.75^8 ≈ 0.10). You would need a much larger sample
of data to be confident of the probabilities; the more experimental examples we have the
closer the experimental ratios will match the long-term probabilities. Likewise for DNA
nucleotide probabilities we can count the C:G and A:T base pairs we find in a genome, and
will get the most accurate results by choosing as large a sample of sequence data as
possible. If we want our probabilities to be general for the whole genome we would not
want to look at only a small part, which may not be representative.
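The effect of sample size can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name blackFraction, the seed and the sample sizes are hypothetical choices for illustration, and the 3:1 model is taken to mean each mouse is black with probability 0.75, independently of its littermates.

```python
import random

def blackFraction(numMice, pBlack=0.75, rng=random):
  """Simulate numMice births and return the observed fraction of black mice.
  Assumes each mouse is independently black with probability pBlack."""
  numBlack = sum(1 for i in range(numMice) if rng.random() < pBlack)
  return numBlack / float(numMice)

# Chance that a litter of eight is entirely black under the 3:1 model
pAllBlack = 0.75 ** 8
print('Pr(8 black of 8): %.3f' % pAllBlack)  # Pr(8 black of 8): 0.100

# Observed fractions for increasingly large samples
rng = random.Random(1)  # fixed seed (an arbitrary choice) so a run is repeatable
for numMice in (8, 80, 800, 80000):
  print(numMice, round(blackFraction(numMice, rng=rng), 3))
```

Small samples can stray a long way from 0.75, while the largest sample should land close to it, which is why a single litter says little about whether the model holds.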
In Python if we know the number of G:C and the number of A:T pairs for a whole
genome then the probability of each, i.e. Pr(G), Pr(C), Pr(A) and Pr(T), at a random
position can be calculated as the proportion of the total:
counts = {'G':2356491, 'C':2356491, 'A':2283184, 'T':2283184}
total = float(sum(counts.values()))
letterProbs = {}
for letter in counts:
  # Probability is the count as a proportion of all letters
  letterProbs[letter] = counts[letter] / total
print(letterProbs)
# Result (rounded): {'G':0.25395, 'C':0.25395, 'A':0.24605, 'T':0.24605}
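If the pair counts are not known in advance they can be tallied from a sequence first. The following is a minimal sketch, assuming we hold the sequence of one strand as a plain string; the function name getLetterProbs and the short test sequence are hypothetical, not real genome data. Because every G on one strand pairs with a C on the other, the two letters of a pair are given the same probability, as in the counts above.

```python
def getLetterProbs(seq):
  """Estimate Pr(G), Pr(C), Pr(A) and Pr(T) at a random position of
  double-stranded DNA, given the sequence of one strand."""
  seq = seq.upper()
  numGC = seq.count('G') + seq.count('C')  # Number of G:C base pairs
  numAT = seq.count('A') + seq.count('T')  # Number of A:T base pairs
  total = 2.0 * (numGC + numAT)  # Each pair contributes two letters
  if total == 0:
    raise ValueError('No G, C, A or T letters found in sequence')
  probGC = numGC / total
  probAT = numAT / total
  return {'G':probGC, 'C':probGC, 'A':probAT, 'T':probAT}

seq = 'ATGCATTTAACGATTAGCAT'  # Hypothetical example sequence
letterProbs = getLetterProbs(seq)
print(letterProbs)
```

Letters other than G, C, A and T (such as N for an unknown base) are simply ignored here, which is one possible design choice among several.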
Even though these probabilities are an improvement on ¼ for all bases, they should still
be considered an approximation, depending on the situation at hand. You
may have noticed that we have been quite careful to say that this is the probability at a
random position. If the DNA position we are considering is not random then the above
whole-genome average would just be the first approximation. The G:C content of DNA is
actually different for different chromosomes and generally varies depending on whether a
position is in a gene or non-gene region. We could end up with endless categorisations and
qualifications for probabilities. So while it is possible to define the probabilities for C or G
being at (to take an arbitrary and complex example) the last position of the first exon of all
carbohydrate metabolism genes, we wouldn’t want to go into so much detail unless there
was a special reason. In general a balance is struck between having accurate general
probabilities, supported by large amounts of data, and contextualised probabilities, which
may be supported by very little data. In a probabilistic analysis we may wish to account
for context, to make more accurate predictions, but naturally we must have data for the
different situations and know when to use them. There will be some further discussion of
such matters in the Markov chains section below.
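The trade-off between context and quantity of data can be sketched by estimating the G:C content separately for each kind of region, and reporting how many positions support each estimate. This is only an illustration under stated assumptions: the function name gcContentByContext, the context labels and the tiny sequences are all hypothetical, and real annotations would come from elsewhere.

```python
def gcContentByContext(regions):
  """Given a list of (label, sequence) pairs, return a dictionary that maps
  each label to a tuple: (fraction of G or C letters, number of positions)."""
  tallies = {}
  for label, seq in regions:
    seq = seq.upper()
    numGC, numTotal = tallies.get(label, (0, 0))
    numGC += seq.count('G') + seq.count('C')
    numTotal += sum(seq.count(letter) for letter in 'GCAT')
    tallies[label] = (numGC, numTotal)

  result = {}
  for label, (numGC, numTotal) in tallies.items():
    if numTotal:  # Skip contexts that have no usable data
      result[label] = (numGC / float(numTotal), numTotal)
  return result

# Hypothetical annotated regions; far too short for a real analysis
regions = [('gene', 'ATGGCGCCGTGA'),
           ('non-gene', 'ATTTATAAATTA'),
           ('gene', 'ATGCCCGGGTAA')]

gcContent = gcContentByContext(regions)
for label in sorted(gcContent):
  fraction, numPositions = gcContent[label]
  print('%s: G:C fraction %.3f from %d positions' % (label, fraction, numPositions))
```

Keeping the number of positions alongside each fraction makes it easier to judge whether a context-specific estimate rests on enough data to be preferred over the whole-genome average.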