Figure 22.3. Calculating tail probabilities in the binomial test. The probability mass
function and the regions considered in probabilistic tail tests are illustrated for an example
binomial distribution, where the probability of an event (p) is 0.5 and the number of trials (n)
is 1000. For a one-tailed test the probability considered corresponds to the area under the
curve from the test value, which here is 530, and above. For the two-tailed test the
probability considered corresponds to the area under the curve in two regions above and
below the mean (530 and 470 here) that start at the same separation from the mean and go
outwards. The cumulative distribution is the running total of the probabilities for all the
counts up to a particular value (i.e. starting at zero) and can be used to look up the areas of
the tails from the probability mass function. For the right-hand threshold (530) the
cumulative probability is the sum up to that value, but for the tail area we require the
cumulative probability from that value, hence we subtract the cumulative probability
before the right-hand threshold from 1.0 to get the required value.
The most common kind of region, and hence integral, to choose for statistical tests is
the tails of the probability distribution. Here, by ‘tail’ we mean the area at the extreme
edge of the distribution that is bounded by a specified threshold. Many probability
distributions will have two tails, for extremely small and extremely large values
respectively, and in a statistical test we may choose to consider either or both of these
tails, depending on the situation. Accordingly, you will see both one-tailed and two-tailed
tests used in scientific contexts. In biological research p-values are frequently quoted; these are simply the result of applying one-tailed tests to a null hypothesis. Put more formally,
a p-value is the probability of values at least as extreme as the observation being generated
by the null hypothesis.
(It is not the probability that the null hypothesis is correct or the
probability that a value is random ‘noise’.) While it is possible to define other regions of
the distribution for testing purposes, using the tails of the distributions has the advantage
of having no extra parameters to consider (such as a region width) and their integral is
easy to calculate using a cumulative distribution function, which represents the running total of
probabilities up to a specific value. We will illustrate one-tailed and two-tailed tests using
the discrete (i.e. whole number) binomial and Poisson probability distributions that we
introduced in Chapter 21, as well as for the continuous normal distribution.
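As a quick illustration of this running-total relationship, the following sketch (not part of the original example code) checks that summing the binomial probability mass function reproduces SciPy's cumulative distribution function:

from numpy import arange, cumsum, allclose
from scipy.stats import binom

kValues = arange(0, 1001)            # every possible count for 1000 trials
pmf = binom.pmf(kValues, 1000, 0.5)  # probability of each exact count
runningTotal = cumsum(pmf)           # running total of the probabilities
print(allclose(runningTotal, binom.cdf(kValues, 1000, 0.5)))  # True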
First we consider the binomial test, which is concerned with the number of occurrences
of an event that has a fixed probability of occurring, given a certain number of trials. If we
apply this to the example of a DNA sequence of specified length (a specified number of
trials) where we believe that the probability of G:C base pairs occurring is equal to the
probability of A:T pairs, we can ask whether an observed G:C versus A:T count fits with
our model. If we look at a DNA sequence of length 1000, then following the random
hypothesis we would expect that on average around 500 of the nucleotides would be G or
C, and 500 would be A or T. If we get significantly more or fewer than 500 (of either pair)
then we can conclude that the G:C content is unlikely to be random. The binomial test
formalises these considerations: if the probability of the G:C event is 0.5 then the expected
number of events in 1000 trials is 0.5×1000 = 500. Furthermore, the binomial distribution
allows us to associate a probability to every number of G:C events that may be observed
in 1000 trials and hence also calculate the area under the tails of the distribution.
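For reference, the standard binomial probability mass function that underlies these calculations (the form introduced in Chapter 21) is

$$\Pr(k) = \binom{n}{k}\,p^{k}\,(1-p)^{n-k},$$

with mean $np$, so for $p = 0.5$ and $n = 1000$ the expected count is 500.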
The exact form of the one-tailed test depends on whether the observed number of
events is higher or lower than the expected number of events. If it is higher, then we
calculate the probability that the observed number is that value or larger. If it is lower, we
instead consider the probability that it is that value or smaller. In both cases the two-tailed
test is the same. If in our example the actual number of observed G:C pairs is 530, then the
one-tailed test considers the probability that we would observe at least 530 G:C
nucleotide pairs in the sequence of length 1000. The two-tailed test not only considers the
probability of observing at least 530 G:C (30 more than the mean) but also considers the
probability of observing at most 470 G:C (30 fewer than the mean). Conversely, if the
actual number of observed G:C pairs is 470, then the one-tailed test considers the
probability that we would observe at most 470 G:C nucleotide pairs in the sequence of
length 1000.
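Expressed as sums over the probability mass function, the two tests for the 530 observation are

$$p_{\text{one-tailed}} = \sum_{k=530}^{1000} \Pr(k), \qquad p_{\text{two-tailed}} = \sum_{k=0}^{470} \Pr(k) \;+\; \sum_{k=530}^{1000} \Pr(k).$$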
The two-tailed binomial test is available as a function in SciPy:
from scipy.stats import binom_test
count, nTrials, pEvent = 530, 1000, 0.5
result = binom_test(count, nTrials, pEvent)
print('Binomial two tail', result)
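Note that in more recent SciPy releases binom_test is deprecated and ultimately removed in favour of binomtest, which takes the same leading arguments but returns a result object; a minimal equivalent of the call above would be:

from scipy.stats import binomtest

result = binomtest(530, 1000, 0.5)  # two-sided by default
print('Binomial two tail', result.pvalue)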
However, we can also create our own function using the explicit probability
distribution. We will allow the input counts argument to be a list, tuple or a NumPy array.
With a bit of extra effort we could also allow the argument to be a simple integer. We first
convert counts to be a NumPy array, because that allows for calculations to be done on the
entire array in one go, rather than having to loop over the elements. (For a discussion
about NumPy, see
Chapter 9
.)
We also include an option to specify one-tailed tests. The one-tailed test is complicated
because, as noted above, the calculation differs depending on whether the counts are greater than the mean or less than or equal to it. Here we test whether elements are greater than
the mean (counts > mean), which yields an array of True and False values. Then the
NumPy nonzero() function is used to obtain the indices of the true elements: those that
were greater than the mean. Conveniently, we can access the corresponding elements of a
NumPy array using these indices. This is faster than looping over the individual elements
of the array, at the cost of requiring more memory for the indices. The values that are less than or equal to the mean form the opposite selection: we also get these from the same test, but it is noteworthy that a NumPy array cannot be negated with the ordinary Python not Boolean operator. Rather, the '~' operator negates each element of the array (swaps True with False).
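A minimal demonstration of this index selection and negation, using made-up values rather than the main example's data:

from numpy import array

vals = array([470, 510, 530])
isAbove = vals > 500       # array([False, True, True])
print(isAbove.nonzero())   # (array([1, 2]),) - indices of the True elements
print(~isAbove)            # array([ True, False, False])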
from scipy.stats import binom
from numpy import array, zeros
def binomialTailTest(counts, nTrials, pEvent, oneSided=True):

    counts = array(counts)
    mean = nTrials * pEvent

    if oneSided:
        # Select counts above the mean and counts at or below it separately
        result = zeros(counts.shape)
        isAboveMean = counts > mean
        aboveIdx = isAboveMean.nonzero()
        belowIdx = (~isAboveMean).nonzero()
        # Upper tail: P(X >= count); lower tail: P(X <= count)
        result[aboveIdx] = binom.sf(counts[aboveIdx]-1, nTrials, pEvent)
        result[belowIdx] = binom.cdf(counts[belowIdx], nTrials, pEvent)

    else:
        # Two symmetric thresholds, one either side of the mean
        diffs = abs(counts-mean)
        result = binom.cdf(mean-diffs, nTrials, pEvent)
        result += binom.sf(mean+diffs-1, nTrials, pEvent)

    return result
For counts less than or equal to the mean, we use the cumulative distribution function,
.cdf(), which exactly provides the probability that the value is less than or equal to the
count. For counts higher than the mean we instead use .sf(), the ‘survival’ function, as an
alternative to calculating 1.0 minus the cumulative distribution function, which would be
required to calculate the integral of the probabilities greater than or equal to the count. The
subtraction of 1 from the counts in the .sf() call is the main oddity here; this is because the binom.sf() function calculates the probability that the number of successes is strictly greater than its argument, rather than greater than or equal to it.
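To make this off-by-one behaviour concrete, the following quick check (a sketch, not from the original text) shows that binom.sf(k-1, ...) matches the explicit 1.0 minus cumulative calculation of the probability of at least k events:

from scipy.stats import binom

k, nTrials, pEvent = 530, 1000, 0.5
print(binom.sf(k-1, nTrials, pEvent))          # P(X >= 530)
print(1.0 - binom.cdf(k-1, nTrials, pEvent))   # same value, the long way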
For the two-tailed test we take the absolute difference between the test values (counts) and the distribution's mean to get separations from the mean. These differences are then used to get the upper
and lower test thresholds. The probability sums for the lower thresholds are obtained via
.cdf(), and the upper thresholds with .sf(), like the one-tailed test.
Continuing with the G:C content example, where we observe 530 G:C events from
1000 positions, to do the one-tailed test to get the probability that we observe at least 530
we simply call the function with the appropriate arguments:
counts = [530]
result = binomialTailTest(counts, 1000, 0.5, oneSided=True)
print('Binomial one tail', result)
and this gives a probability of 0.0310. Given that the binomial distribution was the null
hypothesis this one-tailed probability is the p-value. Similarly, we can do the two-tailed
test to calculate the probability that we observe at least 530 or at most 470 events,
remembering that we set oneSided to be False:
result = binomialTailTest(counts, 1000, 0.5, oneSided=False)
print('Binomial two tail', result)
which gives a result of 0.0620. These results illustrate the subtlety of statistical analysis.
Often it is deemed that something is significant if it has less than a 5% probability of
originating from the null distribution. Given the one-tailed probability here is 3.1% one
might want to conclude from this that it is likely that the DNA sequence is non-random
(doesn’t fit the null hypothesis). However, the two-tailed probability is 6.2%, so one might
equally say that one cannot conclude this. Fundamentally it is the probability itself that matters, including how it is defined (one-tailed or two-tailed): whatever threshold we use, the probability tells us how likely we are to be right or wrong about any specific statement. Significance thresholds like 5% are merely conveniences for standardisation and are not applicable in every situation.
Although we have illustrated the tailed test for the binomial distribution, we can use the
same strategy for other distributions that are provided by SciPy. In the next example, we
illustrate this for the Poisson distribution, noting that the function (like the binomial one) can
accept an array of values to calculate probabilities for:
from scipy.stats import poisson
from numpy import abs, array
def poissonTailTest(counts, eventRate, oneSided=True):

    counts = array(counts)

    if oneSided:
        # Upper tail: probability of at least each count
        result = poisson.sf(counts-1, eventRate)

    else:
        # Double the lower-tail integral to cover both tails
        diffs = abs(counts-eventRate)
        result = 2*poisson.cdf(eventRate-diffs, eventRate)

    return result
counts = [2300, 2400, 2550]
result = poissonTailTest(counts, 2500, oneSided=False)
print('Poisson two tail', result)
# Result is [0.0000531, 0.045492, 0.3222]
The only differences from the binomial example are that the mean value is simply the event rate (the distribution's only parameter) and that the one-tailed calculation is simpler, here considering only the upper tail. Also, because the distribution is close to symmetric about its mean for large event rates, we can double the integral for one tail to get the probability corresponding to both tails.
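For comparison, the one-tailed (upper-tail) probabilities for the same counts can be obtained by leaving oneSided at its default value; the output values are not reproduced here:

result = poissonTailTest(counts, 2500)  # oneSided=True by default
print('Poisson one tail', result)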
Next we move on to an example that uses the normal distribution, otherwise known as
the Gaussian distribution. In this case we are dealing with a continuous probability
distribution (rather than a discrete one), i.e. where the outcome is a value from the
continuum of real numbers. The concepts largely carry over from discrete to continuous
distributions, with the important difference that discrete sums become integrals.
And as