21
Probability
Contents
The basics of probability theory
Sample space
Probability values
Restriction enzyme example
Combining probabilities
Conditional probabilities
Bayesian analysis
Random variables
Binomial distribution
Poisson distribution
Geometric distribution
Markov chains
Markov processes
Hidden Markov models
Using Python for hidden Markov models
The Viterbi algorithm
The forward-backward algorithm
Implementing a protein sequence HMM
The basics of probability theory
The theory of probability was based on the observation of random physical events, most
notably for games of chance. And naturally, calculating accurate probabilities became
especially important for people when money was wagered on the outcome. Probability is a
way of ascribing numerical values to the possible outcomes to help us understand a
random process more fully. This enables us to ask questions like how much more often
one event occurs compared to another, but because of the random nature of what we are
studying we can never say what the outcome will definitely be. Rather we tend to think of
the process in terms of what the long-term proportions of different outcomes are, if the
random experiment were repeated a very large number of times, or perhaps if money is
involved what a wager on a particular outcome is worth.
Turning to biological systems, some things in living organisms occur as a result of
random processes, like the segregation of a parent’s chromosomes among their children or
base-pair changes in DNA (for example as a result of replication errors or ionising radiation),
though under most circumstances we don’t get to see the actual random event. For the
most part we just view the outcomes, sometimes billions of years later in the case of DNA
sequence changes. Of course a DNA sequence isn’t actually random, given that it exists to
contain biologically meaningful information representing genes and gene control elements
etc. which have been selected for their function during evolution, even if the initial
mutations were random. Nonetheless for a sufficiently large and unbiased selection of
DNA we can treat the sequence as if it were random in order to ask various questions. For
example, how often do I find the sub-sequence AAGCTT in a megabase-long region of
DNA?
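As a rough illustration of this question, the sketch below estimates the expected number of AAGCTT sites in a megabase of DNA. It assumes, purely for simplicity, that every base is independent and that G, C, A and T are equally likely, which real genomes do not strictly obey. The function name expected_site_count and its parameters are our own, chosen for this example.

```python
def expected_site_count(site, region_length, num_bases=4):
    """Expected number of occurrences of a sub-sequence in a random
    sequence, assuming independent, equally likely bases."""

    # Number of positions at which the sub-sequence could start
    num_positions = region_length - len(site) + 1

    if num_positions < 1:
        return 0.0

    # Probability that one given start position matches the whole site
    prob_match = (1.0 / num_bases) ** len(site)

    # Expected counts add up over positions, even where sites overlap
    return num_positions * prob_match


count = expected_site_count('AAGCTT', 1000000)
print(count)
```

Under these assumptions the six-base site is expected roughly 244 times per megabase, i.e. about once every 4096 positions.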
Probability theory is often also useful in situations where there is no underlying
randomness in the biology, but rather an uncertainty in our scientific interpretation. Here a
probabilistic treatment of our uncertainty can lead to informative predictions. An example
of this would be classifying whether two genes have the same function as one
another (generally because they have a common ancestor). They either do or do not, and
the underlying assignment of this status is not a random process, but our prediction based
on the available data does have an uncertain component, and so it can be helpful to treat
the situation probabilistically. It is also notable that in biological analyses it may be rare to
actually deal with probabilities directly, but probability theory underpins statistical tests
which are very commonly used, and we describe some of those in Chapter 22.
Here we will lightly go through some of the fundamentals of probability theory. Being
mindful of our expected readership, we will endeavour to avoid going into too much
detailed mathematical notation. We won’t escape the equations entirely but hopefully
these will serve as a primer for further reading.
Sample space
Firstly, we need to define a probabilistic system by knowing what the range of possible
outcomes is. In mathematical jargon this means defining the sample space. The range of
possible outcomes can be fairly straightforward, so for a six-sided die we know that there
are simply six outcomes corresponding to the numbers of spots on different faces. If we
are thinking about the occurrence of a DNA base at a position in a genome then we know
that it must be either G, C, A or T. Often though we are thinking about multiple dice rolls
or several positions in a DNA sequence. In these cases we think of the sample space in
terms of the combinations of possibilities for each roll or position. Hence for rolling two
dice we have six possibilities for the first roll, and then for any given first roll there are a
further six possibilities for the second roll. Overall there will be six times six possibilities
for the total number of possible outcomes. Naturally if there is a further roll there are six
more possibilities for each of the 36 two-roll outcomes. So here the general rule is that the
size of the sample space is 6^N, if there are N rolls, i.e. multiplied by six for each roll. The
same idea can be applied to sequential positions in DNA. Here there are four nucleotide
possibilities at each position and so for a sequence (or sub-sequence) of length N there are
4^N different combinations. Although the nucleotides of a DNA sequence are actually all
present in the same molecule, it may be helpful for understanding to fictitiously imagine
the sequence being generated by the roll of an imaginary four-sided die.
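The counting rule above can be checked by listing a small sample space explicitly. This is a minimal sketch using the standard itertools module; the helper name sample_space is our own and is only practical for small numbers of rolls or positions, given how quickly the size grows.

```python
from itertools import product


def sample_space(outcomes, num_trials):
    """List every combination of outcomes for a fixed number of
    independent trials (dice rolls or sequence positions)."""
    return list(product(outcomes, repeat=num_trials))


die_faces = [1, 2, 3, 4, 5, 6]
bases = 'GCAT'

two_rolls = sample_space(die_faces, 2)
three_rolls = sample_space(die_faces, 3)
print(len(two_rolls), len(three_rolls))  # 36 216

# All possible DNA sub-sequences of length six
hexamers = [''.join(combo) for combo in sample_space(bases, 6)]
print(len(hexamers))  # 4096
```

Each element of two_rolls is a tuple such as (3, 5), and the length of each list matches the number of outcomes raised to the power of the number of trials.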
More generally there can sometimes be the complication that we actually don’t have a
fixed number of dice rolls or a fixed length of DNA. For example, we may be interested in
finding out how many dice rolls we would expect to make, on average, before we roll
three consecutive sixes. The DNA equivalent of this is to ask what the expected length of DNA
(number of positions) is before we find a given small sub-sequence. The latter is quite a
relevant question biologically because the small sub-sequence might be a cut site for a
restriction enzyme,[1] where it can be useful to know the average size of DNA fragments the
enzyme would generate. In these examples the sample space may be unbounded, or at
least very large in the case of a genome. Nonetheless we still have a firm idea of what the
range of possibilities is, even if it is technically infinite. For example, it is technically
possible never to roll three consecutive sixes, even if a die were rolled continuously for the
history of the universe, but the probability of this is so astronomically small (close to
zero) that this, and similar extremes, don’t have any practical effect.
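To make the idea of an unbounded sample space more concrete, the following sketch simulates rolling a fair die until a run of three consecutive sixes appears and averages the number of rolls over many repeats. The function names, the fixed random seed and the number of trials are assumptions made for this example. For comparison it also uses the standard result that the expected wait for a run of n sixes is 6 + 6^2 + ... + 6^n rolls.

```python
import random


def rolls_until_run(rng, run_length=3, target=6, num_faces=6):
    """Roll a fair die until the target face has appeared
    run_length times in a row; return the number of rolls made."""
    rolls = 0
    run = 0

    while run < run_length:
        rolls += 1

        if rng.randint(1, num_faces) == target:
            run += 1
        else:
            run = 0  # Any other face breaks the run

    return rolls


def exact_expected_rolls(run_length=3, num_faces=6):
    """Expected number of rolls for a run of one face on a fair die."""
    return sum(num_faces ** k for k in range(1, run_length + 1))


rng = random.Random(1)  # Fixed seed so the simulation is repeatable
num_trials = 5000

total = sum(rolls_until_run(rng) for i in range(num_trials))
mean_rolls = total / float(num_trials)

print(mean_rolls, exact_expected_rolls())
```

Any single simulated wait can be arbitrarily long, but the average settles close to the exact value of 258 rolls. The same approach could be applied to DNA by replacing the die with the four bases, although matching a general sub-sequence needs a little more bookkeeping than a run of one repeated symbol.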