Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 22.5. Comparing the underlying mean of a probability distribution and the sample mean. A standard normal distribution, with a mean, μ, of 0.0 is superimposed on a set of data with a sample mean, x̄, of 0.36. Given that the sample mean has an associated error depending on the number of samples taken, we can use a T-test to assess whether the separation between the two means is significant, and thus whether the probability distribution is a good model for the data.

In mathematical parlance, for the two-sample T-test the T-statistic follows a T-distribution with n_x + n_y − 2 degrees of freedom, where n_x and n_y are the sizes of the two samples. Similarly, the one-sample T-test has a T-statistic with n − 1 degrees of freedom. In statistics the notion of ‘degrees of freedom’ can be a somewhat tricky concept, but the principle is to know the number of independent data points that can truly vary. To take an arbitrary but simple example, where there are three sample values that have a mean of zero, once two values are known there is no choice about the third, because it must give the known mean. Hence, in general, for a statistical analysis the number of degrees of freedom is the number of independent sample values, minus the number of constraining parameters.
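
The three-value example above can be sketched in a few lines. The two freely chosen values here are hypothetical, picked only for illustration; the point is that the constraint on the mean leaves no choice for the last value.

```python
from numpy import array

# Hypothetical illustration: three values constrained to have a mean of zero
knownMean = 0.0
numValues = 3
freeValues = array([1.5, -0.4])  # the two values we are free to choose

# The known mean fixes the final value: all values must sum to numValues * knownMean
lastValue = numValues * knownMean - freeValues.sum()

# Three sample values minus one constraint (the mean)
degreesOfFreedom = numValues - 1

print(lastValue, degreesOfFreedom)
```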

Once we have an appropriate T-statistic, with an appropriate number of degrees of freedom, in order to perform a statistical test we use the distribution of how the T-statistic itself varies when taking different samplings. This probability distribution is the Student T-distribution, which assumes the samples are independent and have the same normal distribution. We won’t go into details of this distribution, only to say that it is a bell shape, like the normal distribution but with thicker tails. For Python the scipy.stats module has some pre-packaged T-test functions as well as facilities to access the Student T-distribution, which allow us to easily make probability estimates to assess the variation due to sample variance.
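
As a rough illustration of the ‘thicker tails’ remark, the following sketch compares the probability of exceeding an arbitrary value under the normal distribution and under Student T-distributions, using the norm and t objects from scipy.stats. The test value and the degrees of freedom are arbitrary choices made for this example.

```python
from scipy.stats import norm, t

value = 3.0  # an arbitrary point, three units above the centre

# sf() is the survival function: the probability of exceeding the value
normalTail = norm.sf(value)
studentTail5 = t.sf(value, 5)    # T-distribution with 5 degrees of freedom
studentTail50 = t.sf(value, 50)  # T-distribution with 50 degrees of freedom

print(normalTail, studentTail5, studentTail50)
```

The tail probability is largest for few degrees of freedom and approaches the normal value as the number of degrees of freedom grows.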

The complete T-test functions available in scipy.stats are ttest_1samp, ttest_ind and ttest_rel; these accept samples, represented as arrays of values, and perform the appropriate two-tailed test to estimate a probability. Because the T-distribution is symmetric, for a one-tailed test we can simply halve the probability, provided the T-statistic lies on the side predicted by the hypothesis. Illustrating each of these functions in turn, ttest_1samp estimates the probability of obtaining a sample mean at least this far from the true mean of a distribution (e.g. under a null hypothesis), and thus uses the one-sample T-statistic described above. Note that the T-statistic as well as the two-tailed probability are passed back by the function:

from numpy import array
from scipy.stats import ttest_1samp

trueMean = 1.76
samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

# Two-tailed, one-sample T-test of the sample mean against trueMean
tStat, twoTailProb = ttest_1samp(samples, trueMean)
# Result is: -0.918, 0.378
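
To show what the packaged function does, here is a minimal sketch that rebuilds the same result by hand. It assumes the usual form of the one-sample T-statistic (the difference between the sample mean and the true mean, divided by the standard error of the sample mean) and takes the two-tailed probability from the Student T-distribution with n − 1 degrees of freedom. The variable names are chosen for this example.

```python
from numpy import array, sqrt
from scipy.stats import t, ttest_1samp

trueMean = 1.76
samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

n = len(samples)
sampleMean = samples.mean()

# ddof=1 divides by n-1 when estimating the standard deviation
stdError = samples.std(ddof=1) / sqrt(n)

tStat = (sampleMean - trueMean) / stdError
degreesOfFreedom = n - 1

# Two tails: twice the probability of exceeding |tStat|
twoTailProb = 2.0 * t.sf(abs(tStat), degreesOfFreedom)

# Reference values from the pre-packaged function
refStat, refProb = ttest_1samp(samples, trueMean)

print(tStat, twoTailProb)
print(refStat, refProb)
```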

The function ttest_ind performs the two-sample T-test, testing whether two independent samples have the same underlying, true mean, based on their respective sample means, as described above:

from numpy import array
from scipy.stats import ttest_ind

samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

# Two-tailed, two-sample T-test assuming equal variances
tStat, twoTailProb = ttest_ind(samples1, samples2)
# Result is: -2.072, 0.0650
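
The earlier remark about halving the probability for a one-tailed test can be checked on these samples. The sketch below assumes the one-tailed hypothesis is that the first sample has the smaller mean, and takes the lower tail of the Student T-distribution directly, with the n_x + n_y − 2 degrees of freedom of the equal-variance test.

```python
from numpy import array
from scipy.stats import t, ttest_ind

samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

tStat, twoTailProb = ttest_ind(samples1, samples2)

# Degrees of freedom for the equal-variance two-sample test
degreesOfFreedom = len(samples1) + len(samples2) - 2

# One-tailed probability: the chance of a T-statistic at least this negative
oneTailProb = t.cdf(tStat, degreesOfFreedom)

print(oneTailProb, twoTailProb / 2.0)
```

The two printed values agree here because the T-statistic is negative, i.e. on the side the one-tailed hypothesis predicts; had it been positive, halving the two-tailed probability would not give the lower-tail value.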

There is an extra option to this function, equal_var, which can be set to False to relax the requirement that both samples have the same variance, in which case the test is called Welch’s T-test, though the difference for our test case is slight:

tStat, twoTailProb = ttest_ind(samples1, samples2, equal_var=False)
# Result is: -2.072, 0.0654
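
For readers curious why only the probability changes, the following sketch recomputes Welch’s test by hand. It assumes the standard formulation, which is not spelled out in the text above: the T-statistic divides the difference in sample means by the combined standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation. The variable names are chosen for this example.

```python
from numpy import array, sqrt
from scipy.stats import t, ttest_ind

samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

n1, n2 = len(samples1), len(samples2)

# Squared standard error of each sample mean (ddof=1 divides by n-1)
sqErr1 = samples1.var(ddof=1) / n1
sqErr2 = samples2.var(ddof=1) / n2

tStat = (samples1.mean() - samples2.mean()) / sqrt(sqErr1 + sqErr2)

# Welch-Satterthwaite approximation; generally not a whole number
degreesOfFreedom = (sqErr1 + sqErr2) ** 2 / (
    sqErr1 ** 2 / (n1 - 1) + sqErr2 ** 2 / (n2 - 1))

twoTailProb = 2.0 * t.sf(abs(tStat), degreesOfFreedom)

# Reference values from the pre-packaged function
refStat, refProb = ttest_ind(samples1, samples2, equal_var=False)

print(tStat, degreesOfFreedom, twoTailProb)
```

With equal sample sizes the T-statistic is the same as in the equal-variance test; only the effective number of degrees of freedom, and hence the probability, differs.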

Lastly, the ttest_rel function again works with two samples in the same way as above, but assumes that the samples are dependent, i.e. that the values in the pair of samples are related to one another (the two samples must be the same size, and because the test works on the differences within each pair there is no equal_var option). An example of this would be to take some measure from a group of people as the first samples and then to take repeated measurements for the same people at a different time (perhaps after some treatment) or using a different method.
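
A minimal sketch of ttest_rel follows. The data are hypothetical: the earlier sample values are reused and simply treated as if they were paired before and after measurements on the same six subjects. The second call illustrates that the paired test is equivalent to a one-sample T-test on the pairwise differences against a mean of zero.

```python
from numpy import array
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical paired data: one 'before' and one 'after' value per subject
before = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
after = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

# Paired (dependent) two-tailed T-test
tStat, twoTailProb = ttest_rel(before, after)

# Same test expressed as a one-sample test on the differences
diffStat, diffProb = ttest_1samp(before - after, 0.0)

print(tStat, twoTailProb)
print(diffStat, diffProb)
```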
