So far in this chapter we have used tailed tests to calculate the probability of
obtaining a given value from a statistical sample, and then, on the basis of this probability,
decided whether we deem the sample to be significantly different from a null
hypothesis, using a threshold probability, say 5%. However, we can also take the reverse
approach and use a probability threshold upfront to calculate the equivalent test
statistic for this limiting value. In turn this leads to a corresponding interval
in the actual measurements.
Returning to the one-sample T-test example of comparing a sample mean, x̄, with the
mean of a null hypothesis distribution, μ, we can determine a confidence interval for the true mean, related to a
specified probability, given the sample mean and the unbiased sample standard deviation.
Mathematically we want to determine the interval size I such that there is a specified
probability that μ is within I of x̄.
This is a two-tailed test, and the one-sided equivalent would be the probability that
x̄ is larger or smaller than some value. To calculate the interval we say that the
probability of the absolute difference between means is the same as the probability that the
magnitude of the T-statistic is less than the interval divided by the standard error,
which simply comes from rearranging the formula for the T-statistic:

P(|x̄ − μ| < I) = P(|T| < I / (σ̂/√n))

where σ̂ is the unbiased sample standard deviation and σ̂/√n is the standard error.
We need to invert this function to determine I given a probability. To do this practically we
use a function called the quantile function or percent point function. This does the inverse
job to the cumulative distribution function, so we pass in a probability and get out a
threshold value that the random variable will be bounded by (at or below). Fortunately for
Python the percent point function is available for all the common probability distributions
described in the scipy.stats module, so we generally don’t have to worry about its precise
formulation. When we have calculated the inverse for a given probability we then simply
multiply by an appropriate factor, representing the standard error, to obtain the
measurement interval.
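As a quick illustration of this inverse relationship, a minimal sketch using the T-distribution from scipy.stats (the degrees of freedom, 11, are chosen arbitrarily for illustration):

```python
from scipy.stats import t

dist = t(11)  # T-distribution with 11 degrees of freedom

# The cumulative distribution function maps a threshold value to the
# probability of the random variable falling at or below it...
p = dist.cdf(2.0)

# ...and the percent point function maps that probability straight back
print(dist.ppf(p))  # 2.0, up to floating-point rounding
```

The same round trip works for the other distributions in scipy.stats, since cdf() and ppf() are defined as mutual inverses.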
We now provide a Python function to calculate the value of the interval, given the
probability, or confidence level, associated with the interval. The input
can be a list or a NumPy array of samples, and a confidence level (e.g. 0.95 for 95%
confidence). The result is the sampleMean and the interval. For the two-sided test this
means that the actual mean is between sampleMean-interval and sampleMean+interval
with the probability given by the confidence level.
from numpy import mean, std, sqrt
from scipy.stats import t

def tConfInterval(samples, confidence, isOneSided=True):

    n = len(samples)
    sampleMean = mean(samples)
    sampleStdDev = std(samples, ddof=1)  # Unbiased estimate

    if not isOneSided:
        confidence = 0.5 * (1 + confidence)

    interval = t(n-1).ppf(confidence) * sampleStdDev / sqrt(n)

    return sampleMean, interval
Inside the function, if the test is two-tailed we adjust the confidence value so that the
tail probability used is half that for a single tail. For example, for an input 95% confidence
(5% tail probability) we will find the interval corresponding to a one-tailed confidence of
97.5% (2.5% tail probability) because there will be two tail integrals that both contribute.
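In numbers, a minimal sketch of this adjustment (assuming 11 degrees of freedom for illustration):

```python
from scipy.stats import t

confidence = 0.95
twoTailConf = 0.5 * (1 + confidence)  # 0.975, i.e. a 2.5% upper tail
thresh = t(11).ppf(twoTailConf)

# By symmetry the lower 2.5% tail sits at the negated threshold, so the
# two tail integrals together account for the remaining 5%
print('%.4f' % thresh)            # ~2.2010
print('%.4f' % t(11).ppf(0.025))  # ~-2.2010
```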
Next, using scipy.stats.t we pass the appropriate number of degrees of freedom (n-1) into
the T-distribution and use the percent point function for this with ppf(). The value obtained
is actually the interval divided by the standard error, I√n/σ̂, so we scale this by the
standard error, σ̂/√n, to get the required interval. The function
can be tested with our previous example, using a sample of human heights:
from numpy import array

samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

sMean, intvl = tConfInterval(samples, 0.95, isOneSided=False)
print('Sample mean: %.3f, 95%% interval: %.4f' % (sMean, intvl))
Note that the double ‘%%’ in the print() statement is because Python treats a single ‘%’
as the first character in a format string.
Hence, the largest difference from the mean that we would accept for a 95% confidence
limit, assuming the underlying probability distribution, is an interval of 0.0615 metres. If
the mean of our null hypothesis distribution is actually 1.76 metres, then we would accept
the sample mean of 1.734 metres because it is only 0.0257 metres away from that mean, and
thus lies within the interval.
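As a cross-check, the T-distribution in scipy.stats also provides an interval() method that returns the two-sided confidence bounds directly. A minimal sketch, using the same height data, comparing it with the manual ppf() calculation:

```python
from numpy import array, mean, std, sqrt
from scipy.stats import t

samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])
n = len(samples)
sampleMean = mean(samples)
stdErr = std(samples, ddof=1) / sqrt(n)  # standard error of the mean

# Manual two-sided calculation, as done above
interval = t(n-1).ppf(0.5 * (1 + 0.95)) * stdErr

# SciPy's built-in two-sided interval gives the (lower, upper) bounds
lower, upper = t.interval(0.95, n-1, loc=sampleMean, scale=stdErr)

print('%.4f' % interval)             # ~0.0615
print('%.4f %.4f' % (lower, upper))  # sampleMean -/+ interval
```

The bounds returned by t.interval() are simply the sample mean minus and plus the interval calculated manually.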