Figure 22.5. Comparing the underlying mean of a probability distribution and the
sample mean. A standard normal distribution, with a mean, μ, of 0.0 is superimposed on a
set of data with a sample mean, x̄, of 0.36. Given that the sample mean has an associated
error depending on the number of samples taken, we can use a T-test to assess whether the
separation between the two means is significant, and thus whether the probability
distribution is a good model for the data.
In mathematical parlance, for the two-sample T-test the T-statistic turns out to follow a
T-distribution with n_x + n_y − 2 degrees of freedom. Similarly the one-sample T-test has a
T-statistic with n − 1 degrees of freedom. In statistics the notion of ‘degrees of freedom’
can be a somewhat tricky concept, but the principle is to know the number of independent
data points that can truly vary. To take an arbitrary but simple example, where there are
three sample values that have a mean of zero, once two values are known then there is no
choice about the third, because we know it must give the known mean. Hence, in general,
for a statistical analysis the number of degrees of freedom is the number of independent
sample values, minus the number of constraining parameters.
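As a minimal sketch of the three-value example, the following chooses two values freely and then derives the third from the known mean. The particular numbers and variable names are arbitrary choices made for illustration.

```python
# Three values constrained to have a mean of zero: only two are free to vary.
knownMean = 0.0
nValues = 3
freeValues = [1.5, -0.25]  # Arbitrary choices for the first two values

# The third value is forced by the constraint: sum of values = n * mean
thirdValue = nValues * knownMean - sum(freeValues)

values = freeValues + [thirdValue]
degreesOfFreedom = nValues - 1  # One constraint: the known mean

print(values, degreesOfFreedom)
```

Whatever two values are chosen, the third has no freedom, which is why one degree of freedom is lost.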
Once we have an appropriate T-statistic, with an appropriate number of degrees of
freedom, in order to perform a statistical test we use the distribution of how the T-statistic
itself varies when taking different samplings. This probability distribution is the Student
T-distribution, which assumes the samples are independent and have the same normal
distribution. We won’t go into details of this distribution, only to say that it is a bell shape,
like the normal distribution but with thicker tails. For Python the scipy.stats module has
some pre-packaged T-test functions as well as facilities to access the Student T-
distribution, which allow us to easily make probability estimates to assess the variation
due to sample variance.
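As a rough sketch of accessing the Student T-distribution directly, the following uses the t and norm objects from scipy.stats and their sf (survival function) method. The T-statistic value and the number of degrees of freedom here are arbitrary, chosen only for illustration.

```python
from scipy.stats import norm, t

degreesOfFreedom = 10  # Arbitrary illustrative value
tStat = 2.0            # Arbitrary illustrative T-statistic

# Probability of a value at least this extreme, in either tail
twoTailProb = 2.0 * t.sf(abs(tStat), degreesOfFreedom)

# Thicker tails: chance of exceeding 3.0 under each distribution
tailT = t.sf(3.0, degreesOfFreedom)
tailNormal = norm.sf(3.0)

print(twoTailProb, tailT, tailNormal)
```

The tail probability from the T-distribution should come out larger than that from the normal distribution, reflecting the thicker tails.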
The complete T-test functions available in scipy.stats are ttest_1samp, ttest_ind and
ttest_rel and these accept samples, represented as arrays of values, and perform the
appropriate two-tailed test to estimate a probability. Also, because the T-distribution is
symmetric, for a one-tailed test we can simply halve the probability, provided the
T-statistic has the sign expected under the alternative hypothesis. Illustrating each of
these functions in turn, ttest_1samp tests whether a sample mean is consistent with the
true mean of a distribution (e.g. a null hypothesis), and thus uses the one-sample
T-statistic described above. Note that the T-statistic as well as the two-tailed probability
are passed back by the function:
from numpy import array
from scipy.stats import ttest_1samp
trueMean = 1.76
samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
1.878, 1.648, 1.819, 1.794, 1.745, 1.827])
tStat, twoTailProb = ttest_1samp(samples, trueMean)
# Result is: -0.918, 0.378
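To show what the function is doing, here is a sketch that computes the same quantities by hand, assuming the usual one-sample form of the T-statistic (the difference in means divided by the standard error of the sample mean) and n − 1 degrees of freedom. The names stdErr, tCheck and pCheck are our own.

```python
from math import sqrt
from numpy import array
from scipy.stats import t, ttest_1samp

trueMean = 1.76
samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

n = len(samples)
stdErr = samples.std(ddof=1) / sqrt(n)  # Uses the unbiased sample std. dev.
tStat = (samples.mean() - trueMean) / stdErr

# Two-tailed probability from the T-distribution with n-1 degrees of freedom
twoTailProb = 2.0 * t.sf(abs(tStat), n - 1)

# One-tailed value: only valid if the sign of tStat matches the hypothesis
oneTailProb = twoTailProb / 2.0

# Cross-check against the pre-packaged function
tCheck, pCheck = ttest_1samp(samples, trueMean)
print(tStat, twoTailProb, tCheck, pCheck)
```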
The function ttest_ind performs the two-sample T-test, testing whether two independent
samples have the same underlying, true mean, based on their respective sample means,
described as x̄ and ȳ above:
from scipy.stats import ttest_ind
samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])
tStat, twoTailProb = ttest_ind(samples1, samples2)
# Result is: -2.072, 0.0650
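The equal-variance calculation can be sketched by hand as follows. This assumes the standard pooled-variance form of the two-sample T-statistic, with n_x + n_y − 2 degrees of freedom; the names pooledVar and stdErr are our own.

```python
from math import sqrt
from numpy import array
from scipy.stats import t, ttest_ind

samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

nX, nY = len(samples1), len(samples2)
dof = nX + nY - 2

# Pool the two unbiased sample variances, weighted by degrees of freedom
pooledVar = ((nX - 1) * samples1.var(ddof=1) +
             (nY - 1) * samples2.var(ddof=1)) / dof

stdErr = sqrt(pooledVar * (1.0 / nX + 1.0 / nY))
tStat = (samples1.mean() - samples2.mean()) / stdErr
twoTailProb = 2.0 * t.sf(abs(tStat), dof)

# Cross-check against the pre-packaged function
tCheck, pCheck = ttest_ind(samples1, samples2)
print(tStat, twoTailProb, tCheck, pCheck)
```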
There is an extra option to this function, to relax the requirement that both samples have
the same variance, in which case the test is called Welch’s T-test, though the difference
for our test case is slight:
tStat, twoTailProb = ttest_ind(samples1, samples2, equal_var=False)
# Result is: -2.072, 0.0654
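For comparison, a sketch of the unequal-variance calculation is given below. It assumes the usual Welch form, in which each sample contributes its own variance to the standard error and the degrees of freedom come from the Welch-Satterthwaite approximation (generally not a whole number). The names seSq1 and seSq2 are our own.

```python
from math import sqrt
from numpy import array
from scipy.stats import t, ttest_ind

samples1 = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593])
samples2 = array([1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

nX, nY = len(samples1), len(samples2)

# Squared standard error contributed by each sample separately
seSq1 = samples1.var(ddof=1) / nX
seSq2 = samples2.var(ddof=1) / nY

tStat = (samples1.mean() - samples2.mean()) / sqrt(seSq1 + seSq2)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (seSq1 + seSq2) ** 2 / (seSq1 ** 2 / (nX - 1) + seSq2 ** 2 / (nY - 1))
twoTailProb = 2.0 * t.sf(abs(tStat), dof)

# Cross-check against the pre-packaged function
tCheck, pCheck = ttest_ind(samples1, samples2, equal_var=False)
print(tStat, dof, twoTailProb)
```

Because the two samples here are the same size, the T-statistic is identical to the pooled version; only the degrees of freedom, and hence the probability, change slightly.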
Lastly, the ttest_rel function again works with two samples in the same way as above,
but assumes that the samples are dependent, i.e. that the values in the pair of samples are
related to one another. The samples must be the same size, and because the test operates
on the differences between paired values there is no equal_var option. An example of this
would be to take some measure from a group of people as the
first samples and then to take repeated measurements for the same people at a different
time (perhaps after some treatment) or using a different method.
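A minimal sketch of using ttest_rel follows. The before and after values are invented purely for illustration (imagine the same six people measured twice), so the result carries no real meaning. The sketch also checks the paired test against a one-sample test on the pairwise differences, to which it is equivalent.

```python
from numpy import array
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical paired measurements: same six people, two time points
before = array([72.0, 80.5, 65.2, 90.1, 77.3, 84.0])
after = array([70.1, 78.9, 66.0, 86.5, 75.2, 81.7])

tStat, twoTailProb = ttest_rel(before, after)

# Equivalent: test whether the mean pairwise difference is zero
tDiff, pDiff = ttest_1samp(before - after, 0.0)

print(tStat, twoTailProb, tDiff, pDiff)
```

Note that the order of values matters: each position in the first array must correspond to the same individual in the second.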