Figure 22.3. Calculating tail probabilities in the binomial test. The probability mass
function and the regions considered in probabilistic tail tests are illustrated for an example
binomial distribution, where the probability of an event (p) is 0.5 and the number of trials (n)
is 1000. For a one-tailed test the probability considered corresponds to the area under the
curve from the test value, which here is 530, and above. For the two-tailed test the
probability considered corresponds to the area under the curve in two regions above and
below the mean (530 and 470 here) that start at the same separation from the mean and go
outwards. The cumulative distribution is the running total of the probabilities for all the
counts up to a particular value (i.e. starting at zero) and can be used to look up the areas of
the tails from the probability mass function. For the right-hand threshold (530) the
cumulative probability is the sum up to that value, but for the tail area we require the
cumulative probability from that value, hence we subtract the cumulative probability
before the right-hand threshold from 1.0 to get the required value.
The most common kind of region, and hence integral, to choose for statistical tests is
the tails of the probability distribution. Here, by ‘tail’ we mean the area at the extreme
edge of the distribution that is bounded by a specified threshold. Many probability
distributions will have two tails, for extremely small and extremely large values
respectively, and in a statistical test we may choose to consider either or both of these
tails, depending on the situation. Accordingly, you will see both one-tailed and two-tailed
tests used in scientific contexts. In biological research p-values are frequently quoted; these are simply the result of applying one-tailed tests to a null hypothesis. Put more formally,
a p-value is the probability of values at least as extreme as the observation being generated
by the null hypothesis.
(It is not the probability that the null hypothesis is correct or the
probability that a value is random ‘noise’.) While it is possible to define other regions of
the distribution for testing purposes, using the tails of the distributions has the advantage
of having no extra parameters to consider (such as a region width) and their integral is
easy to calculate using a cumulative distribution function, which represents the running total of
probabilities up to a specific value. We will illustrate one-tailed and two-tailed tests using
the discrete (i.e. whole number) binomial and Poisson probability distributions that we
introduced in Chapter 21, as well as for the continuous normal distribution.
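As a quick illustration of this running-total relationship, the following sketch (not part of the original example code) checks that summing the binomial probability mass function reproduces SciPy's cumulative distribution function:

from numpy import arange, cumsum, allclose
from scipy.stats import binom

kValues = arange(0, 1001)            # every possible count for 1000 trials
pmf = binom.pmf(kValues, 1000, 0.5)  # probability of each exact count
runningTotal = cumsum(pmf)           # running total of the probabilities
print(allclose(runningTotal, binom.cdf(kValues, 1000, 0.5)))  # True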
First we consider the binomial test, which is concerned with the number of occurrences
of an event that has a fixed probability of occurring, given a certain number of trials. If we
apply this to the example of a DNA sequence of specified length (a specified number of
trials) where we believe that the probability of G:C base pairs occurring is equal to the
probability of A:T pairs, we can ask whether an observed G:C versus A:T count fits with
our model. If we look at a DNA sequence of length 1000, then following the random
hypothesis we would expect that on average around 500 of the nucleotides would be G or
C, and 500 would be A or T. If we get significantly more or fewer than 500 (of either pair)
then we can conclude that the G:C content is unlikely to be random. The binomial test
formalises these considerations: if the probability of the G:C event is 0.5 then the expected
number of events in 1000 trials is 0.5×1000 = 500. Furthermore, the binomial distribution
allows us to associate a probability to every number of G:C events that may be observed
in 1000 trials and hence also calculate the area under the tails of the distribution.
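For reference, the standard binomial probability mass function that underlies these calculations (the form introduced in Chapter 21) is

$$\Pr(k) = \binom{n}{k}\,p^{k}\,(1-p)^{n-k},$$

with mean $np$, so for $p = 0.5$ and $n = 1000$ the expected count is 500.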
The exact form of the one-tailed test depends on whether the observed number of
events is higher or lower than the expected number of events. If it is higher, then we
calculate the probability that the observed number is that value or larger. If it is lower, we
instead consider the probability that it is that value or smaller. In both cases the two-tailed
test is the same. If in our example the actual number of observed G:C pairs is 530, then the
one-tailed test considers the probability that we would observe at least 530 G:C
nucleotide pairs in the sequence of length 1000. The two-tailed test not only considers the
probability of observing at least 530 G:C (30 more than the mean) but also considers the
probability of observing at most 470 G:C (30 fewer than the mean). Conversely, if the
actual number of observed G:C pairs is 470, then the one-tailed test considers the
probability that we would observe at most 470 G:C nucleotide pairs in the sequence of
length 1000.
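Expressed as sums over the probability mass function, the two tests for the 530 observation are

$$p_{\text{one-tailed}} = \sum_{k=530}^{1000} \Pr(k), \qquad p_{\text{two-tailed}} = \sum_{k=0}^{470} \Pr(k) \;+\; \sum_{k=530}^{1000} \Pr(k).$$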
The two-tailed binomial test is available as a function in SciPy:
from scipy.stats import binom_test
count, nTrials, pEvent = 530, 1000, 0.5
result = binom_test(count, nTrials, pEvent)
print('Binomial two tail', result)
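Note that in more recent SciPy releases binom_test is deprecated and ultimately removed in favour of binomtest, which takes the same leading arguments but returns a result object; a minimal equivalent of the call above would be:

from scipy.stats import binomtest

result = binomtest(530, 1000, 0.5)  # two-sided by default
print('Binomial two tail', result.pvalue)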
However, we can also create our own function using the explicit probability
distribution. We will allow the input counts argument to be a list, tuple or a NumPy array.
With a bit of extra effort we could also allow the argument to be a simple integer. We first
convert counts to be a NumPy array, because that allows for calculations to be done on the
entire array in one go, rather than having to loop over the elements. (For a discussion
about NumPy, see
Chapter 9
.)
We also include an option to specify one-tailed tests. The one-tailed test is complicated
because, as noted above, the calculation differs depending on whether the counts are greater than the mean or less than or equal to it. Here we test whether elements are greater than
the mean (counts > mean), which yields an array of True and False values. Then the
NumPy nonzero() function is used to obtain the indices of the true elements: those that
were greater than the mean. Conveniently, we can access the corresponding elements of a
NumPy array using these indices. This is faster than looping over the individual elements
of the array, at the cost of requiring more memory for the indices. The values that are less than or equal to the mean form the opposite selection: we also get these from the same test, but it is noteworthy that a NumPy array cannot be negated with the ordinary Python not Boolean operator. Rather, the '~' operator negates each element of the array (swaps True with False).
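A minimal demonstration of this index selection and negation, using made-up values rather than the main example's data:

from numpy import array

vals = array([470, 510, 530])
isAbove = vals > 500       # array([False, True, True])
print(isAbove.nonzero())   # (array([1, 2]),) - indices of the True elements
print(~isAbove)            # array([ True, False, False])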
from scipy.stats import binom
from numpy import array, zeros
def binomialTailTest(counts, nTrials, pEvent, oneSided=True):

    counts = array(counts)
    mean = nTrials * pEvent

    if oneSided:
        # Select counts above the mean and counts at or below it separately
        result = zeros(counts.shape)
        isAboveMean = counts > mean
        aboveIdx = isAboveMean.nonzero()
        belowIdx = (~isAboveMean).nonzero()
        # Upper tail: P(X >= count); lower tail: P(X <= count)
        result[aboveIdx] = binom.sf(counts[aboveIdx]-1, nTrials, pEvent)
        result[belowIdx] = binom.cdf(counts[belowIdx], nTrials, pEvent)

    else:
        # Two symmetric thresholds, one either side of the mean
        diffs = abs(counts-mean)
        result = binom.cdf(mean-diffs, nTrials, pEvent)
        result += binom.sf(mean+diffs-1, nTrials, pEvent)

    return result
For counts less than or equal to the mean, we use the cumulative distribution function,
.cdf(), which exactly provides the probability that the value is less than or equal to the
count. For counts higher than the mean we instead use .sf(), the ‘survival’ function, as an
alternative to calculating 1.0 minus the cumulative distribution function, which would be
required to calculate the integral of the probabilities greater than or equal to the count. The
subtraction of 1 from the counts in the .sf() call is the main oddity here; this is because the binom.sf() function calculates the probability that the number of successes is strictly greater than its argument, rather than greater than or equal to it.
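To make this off-by-one behaviour concrete, the following quick check (a sketch, not from the original text) shows that binom.sf(k-1, ...) matches the explicit 1.0 minus cumulative calculation of the probability of at least k events:

from scipy.stats import binom

k, nTrials, pEvent = 530, 1000, 0.5
print(binom.sf(k-1, nTrials, pEvent))          # P(X >= 530)
print(1.0 - binom.cdf(k-1, nTrials, pEvent))   # same value, the long way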
For the two-tailed test we take the absolute difference between the test values (counts) and the distribution's mean to get separations from the mean. These differences are then used to get the upper
and lower test thresholds. The probability sums for the lower thresholds are obtained via
.cdf(), and the upper thresholds with .sf(), like the one-tailed test.
Continuing with the G:C content example, where we observe 530 G:C events from
1000 positions, to do the one-tailed test to get the probability that we observe at least 530
we simply call the function with the appropriate arguments:
counts = [530]
result = binomialTailTest(counts, 1000, 0.5, oneSided=True)
print('Binomial one tail', result)
and this gives a probability of 0.0310. Given that the binomial distribution was the null
hypothesis this one-tailed probability is the p-value. Similarly, we can do the two-tailed
test to calculate the probability that we observe at least 530 or at most 470 events,
remembering that we set oneSided to be False:
result = binomialTailTest(counts, 1000, 0.5, oneSided=False)
print('Binomial two tail', result)
which gives a result of 0.0620. These results illustrate the subtlety of statistical analysis.
Often it is deemed that something is significant if it has less than a 5% probability of
originating from the null distribution. Given the one-tailed probability here is 3.1% one
might want to conclude from this that it is likely that the DNA sequence is non-random
(doesn’t fit the null hypothesis). However, the two-tailed probability is 6.2%, so one might
equally say that one cannot conclude this. Fundamentally it is the probability itself that matters, including how it is defined (one-tailed or two-tailed): whatever threshold we use, the probability tells us how likely we are to be right or wrong about any specific statement. Significance thresholds like 5% are merely conveniences for standardisation and are not applicable in every situation.
Although we have illustrated the tailed test for the binomial distribution, we can use the
same strategy for other distributions that are provided by SciPy. In the next example, we
illustrate this for the Poisson distribution, noting that the function (like the binomial one) can
accept an array of values to calculate probabilities for:
from scipy.stats import poisson
from numpy import abs, array
def poissonTailTest(counts, eventRate, oneSided=True):

    counts = array(counts)

    if oneSided:
        # Upper tail: probability of at least each count
        result = poisson.sf(counts-1, eventRate)

    else:
        # Double the lower-tail integral to cover both tails
        diffs = abs(counts-eventRate)
        result = 2*poisson.cdf(eventRate-diffs, eventRate)

    return result
counts = [2300, 2400, 2550]
result = poissonTailTest(counts, 2500, oneSided=False)
print('Poisson two tail', result)
# Result is [0.0000531, 0.045492, 0.3222]
The only differences from the binomial example are that the mean value is simply the event rate (the distribution's only parameter) and that the one-tailed calculation is simpler, here considering only the upper tail. Also, because the distribution is close to symmetric about its mean for large event rates, we can double the integral for one tail to get the probability corresponding to both tails.
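For comparison, the one-tailed (upper-tail) probabilities for the same counts can be obtained by leaving oneSided at its default value; the output values are not reproduced here:

result = poissonTailTest(counts, 2500)  # oneSided=True by default
print('Poisson one tail', result)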
Next we move on to an example that uses the normal distribution, otherwise known as
the Gaussian distribution. In this case we are dealing with a continuous probability
distribution (rather than a discrete one), i.e. where the outcome is a value from the
continuum of real numbers. The concepts largely carry over from discrete to continuous
distributions, with the important difference that discrete sums become integrals.
And as