Conditional probabilities
Moving to a different kind of example, we will consider probabilities associated with the
occurrence of a disease (D) and how this relates to the experimental observation of a
particular mutant version (M) of a gene, i.e. with a different DNA sequence. Here the
probability that both occur, Pr(D and M), on its own does nothing to suggest whether the
two are related. Naturally to investigate the link between the two we would need to know
probabilities of the events alone (having the disease and having the mutation) and thus
whether the intersection of the two is more or less than we would expect if they were
unrelated. By doing this we implicitly use the concept of hypothesis testing. As far as
medical prediction and diagnosis are concerned it is helpful to consider the complementary
events. In this case these are the event that there is no disease and the event that there is no
mutation. With these we can compare the hypothesis that the disease and mutation are
linked with an appropriate alternative and null hypothesis (see Chapter 22).
By counting occurrences of the different situations we can estimate the various
combinations of conditional probabilities. For example, we can estimate the probability of
having the disease given that the mutation is present, Pr(D given M), and compare it to the
probability of having the disease given no mutation, Pr(D given noM), to see whether the
mutation increases or decreases the chance of the disease. Also, if it is established that
Pr(D given M) is much greater than Pr(D given noM), i.e. that the mutation is strongly
associated with the disease, then knowing the probability of not having the disease given
that the mutation is present, Pr(noD given M), is vital if we hope to use a genetic test to
predict the disease outcome; in other words we need to know whether there would be lots
of false-positive results.
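As a minimal sketch of the counting approach, the Python below estimates the three conditional probabilities just mentioned from a table of counts. The counts themselves, and the names counts and conditional, are hypothetical and chosen purely for illustration; they are not real medical data.

```python
# Hypothetical numbers of people in each category (illustrative only).
# Keys are (disease status, mutation status).
counts = {
    ("D", "M"): 30,        # disease and mutation
    ("D", "noM"): 20,      # disease, no mutation
    ("noD", "M"): 70,      # no disease, mutation
    ("noD", "noM"): 880,   # no disease, no mutation
}


def conditional(counts, outcome, given):
    """Estimate Pr(outcome given mutation status) from joint counts."""
    # Only people with the stated mutation status are considered
    total_given = sum(n for (d, m), n in counts.items() if m == given)
    return counts[(outcome, given)] / total_given


pr_d_given_m = conditional(counts, "D", "M")
pr_d_given_nom = conditional(counts, "D", "noM")
pr_nod_given_m = conditional(counts, "noD", "M")

print("Pr(D given M)   =", pr_d_given_m)
print("Pr(D given noM) =", pr_d_given_nom)
print("Pr(noD given M) =", pr_nod_given_m)
```

With these made-up counts the mutation raises the chance of disease considerably, yet most people who carry the mutation still do not have the disease, which is exactly the false-positive concern described above.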
We can also think of the dependent DNA events in the HindIII restriction enzyme
example in terms of conditional probabilities, for example, what the probability of having
a cut site (AAGCTT) is in a region of DNA given a G:C content greater than 60%. It
should be noted that this is a distinct question from asking what the probability of one
event and another is, though the two are related. For this example the probability that both
events occur considers the outcomes from all the possible DNA sequences, whereas the
probability that one occurs given the other considers only the situations where the
second event has definitely occurred. The probability that both occur is the same as
the probability of one occurring multiplied by the probability of the second occurring
given that the first has already occurred. So for two arbitrary events X and Y we have:
Pr(X and Y) = Pr(X) × Pr(Y given X)
It doesn’t matter which way round we phrase this; swapping X and Y gives an equally valid statement:
Pr(X and Y) = Pr(Y) × Pr(X given Y)
The conditional probabilities here are only defined if Pr(X) and Pr(Y) are not zero. Combining
the two formulations we can say that one is equal to the other, i.e. that:
Pr(X) × Pr(Y given X) = Pr(Y) × Pr(X given Y)
which is often written in the form:
Pr(Y given X) = Pr(Y) × Pr(X given Y) / Pr(X)
This is a very important result which is called Bayes’ theorem. As we discuss in the next
section this formulation is commonly used for hypothesis testing.
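The relationships above can be checked by brute force on a scaled-down version of the restriction site example. The sketch below enumerates every possible DNA sequence of a fixed length and counts how often each event occurs. The sequence length of 8 and the G:C threshold of at least 50% are assumptions made only to keep the enumeration small (they differ from the 60% figure mentioned in the text), and all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

SITE = "AAGCTT"   # HindIII cut site
LENGTH = 8        # Assumed sequence length, small enough to enumerate
MIN_GC = 4        # Assumed threshold: at least 4 of 8 bases are G or C

n_total = n_x = n_y = n_both = 0

for bases in product("GCAT", repeat=LENGTH):
    seq = "".join(bases)
    has_site = SITE in seq                               # Event X
    gc_rich = seq.count("G") + seq.count("C") >= MIN_GC  # Event Y

    n_total += 1
    n_x += has_site
    n_y += gc_rich
    n_both += has_site and gc_rich

# Exact probabilities as fractions, so equalities can be tested directly
pr_x = Fraction(n_x, n_total)
pr_y = Fraction(n_y, n_total)
pr_x_and_y = Fraction(n_both, n_total)  # Considers all sequences
pr_y_given_x = Fraction(n_both, n_x)    # Considers only sequences with the site
pr_x_given_y = Fraction(n_both, n_y)    # Considers only G:C rich sequences

print("Pr(X and Y)  =", pr_x_and_y)
print("Pr(X given Y) =", pr_x_given_y)
print(pr_x_and_y == pr_x * pr_y_given_x)
print(pr_x_and_y == pr_y * pr_x_given_y)
print(pr_y_given_x == pr_y * pr_x_given_y / pr_x)
```

Using Fraction rather than floating point numbers means the two forms of the product rule, and the rearranged form that gives Bayes’ theorem, can be compared for exact equality. Note also that Pr(X and Y) and Pr(X given Y) come out as different values, because the second is calculated only over the G:C rich sequences.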
Returning to our medical example, for prognosis and appropriate treatment we might
want to know the probability of getting the disease given the mutation Pr(D given M).
However, it may not be cost-effective to obtain statistics by genetically testing large
numbers of people for the mutation, just for the chance that they would get a rare disease.
Also, it might be that the disease is difficult to diagnose and doesn’t show immediately.
Conversely it may be easier to determine Pr(M given D) by testing a limited number of
people who definitely do have the disease to discover whether they have the mutation.
Using Bayes’ theorem we can easily get the probability we want from the other:
Pr(D given M) = Pr(M given D) × Pr(D) / Pr(M)
Naturally we must also estimate Pr(D) and Pr(M), the probabilities of disease and
mutation in the absence of any other information, from statistical data. However, Pr(D)
could simply come from medical records and Pr(M) could come from testing any group of
people, whether or not they had the rare disease.
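To illustrate the calculation, here is a minimal Python sketch that applies Bayes’ theorem to the medical example. The three input probabilities are hypothetical values invented for the illustration, and the function name bayes is likewise an assumption rather than anything defined elsewhere.

```python
def bayes(pr_b_given_a, pr_a, pr_b):
    """Return Pr(A given B) = Pr(B given A) x Pr(A) / Pr(B)."""
    if pr_b <= 0.0:
        raise ValueError("Pr(B) must be greater than zero")
    return pr_b_given_a * pr_a / pr_b


# Hypothetical estimates (illustrative only)
pr_m_given_d = 0.6   # From testing people known to have the disease
pr_d = 0.001         # Disease frequency, e.g. from medical records
pr_m = 0.01          # Mutation frequency from testing any group of people

pr_d_given_m = bayes(pr_m_given_d, pr_d, pr_m)
pr_nod_given_m = 1.0 - pr_d_given_m

print("Pr(D given M)   =", pr_d_given_m)
print("Pr(noD given M) =", pr_nod_given_m)
```

With these made-up inputs, the mutation is common among people with the disease, but because the disease is rare the chance of disease for someone carrying the mutation remains small.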