The statistical tests discussed so far in this chapter have considered whether individual
sample parameters or data values fit with probability distributions. For example, we have
illustrated tailed tests to compare values with a null hypothesis distribution and thus
obtain a p-value. However, there are various ways that we can compare multiple
parameters and even whole distributions with one another, and naturally this can involve a
consideration of the shapes of the distributions. If two distributions have the same mean
but otherwise have different shapes, then such an analysis will convey a clear advantage.
Though, as before, we must naturally account for the error associated with taking random
(and potentially small) samples from an underlying distribution.
A convenient means of comparing observed counts with the counts expected under a null
hypothesis is the chi-square test, whose statistic is the sum, over all categories, of the
squared differences between observed and expected counts, each scaled by the expected
count:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected counts for category $i$.
The test can be applied to categorical selections, and by extension to histograms of
counts, which may be used to approximate any arbitrary distribution. The assumption for
the test is that each pair of expected and observed counts derives from a normal random
variable. The chi-square distribution, which the statistical test is based upon, is the
distribution of such a sum of squares resulting from different random samplings in each of
the variable categories. For the chi-square distribution (as with the T-distribution) we will
need to know the number of degrees of freedom, which in general will be the number of
observed values (e.g. categories) minus the number of restraining parameters (e.g. totals).
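As an illustrative aside (not one of the worked examples), the following sketch draws
repeated sets of independent standard normal values and checks that their sums of squares
follow the chi-square distribution with the corresponding number of degrees of freedom:

import numpy as np
from scipy.stats import chi2

k = 3                                    # number of normal random variables
normals = np.random.standard_normal((100000, k))
sumSq = (normals ** 2).sum(axis=1)       # one sum of squares per repeat

print('Empirical mean:', sumSq.mean())           # close to 3.0
print('Chi-square mean:', chi2.mean(k))          # exactly 3.0
print('Empirical P(>6.25):', (sumSq > 6.25).mean())
print('Chi-square P(>6.25):', chi2.sf(6.25, k))  # upper-tail probability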
Returning to the example of G:C versus A:T content in 1000 DNA base pairs, we can
apply the chi-square statistic to these two base-pair categories, though naturally they are
restrained because they must sum to a given total. However, we treat them as independent
observations for the calculation of the statistic and then consider the appropriate number
of degrees of freedom. Hence if we have 530 G:C pairs and 470 A:T pairs and the
expected count for each is 500, then the statistic is:

$$\chi^2 = \frac{(530-500)^2}{500} + \frac{(470-500)^2}{500} = 1.8 + 1.8 = 3.6$$
After calculating the chi-square statistic, comparing observed and expected counts, the
next stage is to evaluate the statistical significance of the resulting value (3.6 in the above
example). For this we use the chi-square distribution. The number of degrees of freedom
here is the number of random variables (which in the above case is the number of
categories) minus one; we lose a degree of freedom because the total is fixed, so the A:T
count is not random given the G:C count. We can use the cumulative distribution function
of the chi-square distribution for one degree of freedom to generate a p-value (i.e. do a
one-tailed test compared to the null hypothesis) for the observed chi-square statistic.
Fortunately in SciPy this is all handled in one neat chisquare function. This will assume
the number of degrees of freedom is (n−1), though in other situations we could pass in
ddof, representing the difference in the number of degrees of freedom from the default:
from numpy import array
from scipy.stats import chisquare

obs = array([530, 470])  # observed G:C and A:T counts
exp = array([500, 500])  # expected counts under the null hypothesis

chSqStat, pValue = chisquare(obs, exp)
print('DNA Chi-square:', chSqStat, pValue) # 3.6, 0.05778
The result for this is the anticipated chi-square statistic of 3.6 and a test probability of
0.058. It should be noted that the chi-square test is almost always a one-tailed test because
we are normally interested in whether the fit is worse than the expected fit, and not
concerned if the fit is better than expected (i.e. too good).
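To make this one-tailed calculation explicit, the following equivalent sketch (not part of
the original example) computes the statistic directly from the counts and obtains the
p-value from the upper tail of the chi-square distribution, via the survival function
chi2.sf (one minus the cumulative distribution function):

from numpy import array
from scipy.stats import chi2

obs = array([530, 470])
exp = array([500, 500])

chSqStat = ((obs - exp) ** 2 / exp).sum()  # sum of squares: 3.6
pValue = chi2.sf(chSqStat, len(obs) - 1)   # upper-tail area: 0.05778
print('Manual DNA Chi-square:', chSqStat, pValue)

If the parameters of the null hypothesis had instead been estimated from the data itself,
the degrees of freedom would be reduced accordingly, which is what the ddof argument
of chisquare conveys.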
Moving on from the simple DNA example, we can think of samples from a probability
distribution where the resulting values have been binned into a histogram (see Figure 22.6
for an example histogram). This will give a discrete set of categories, one for each range,
and we can treat each category as a separate, independent sampling and compare it to the
expectation from the null hypothesis. In this case the expected count will be the
probability for each range (the area under that region of the probability density function)
multiplied by the total number of observations.
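As a sketch of this procedure, assuming a standard normal null hypothesis purely for
illustration, we can bin random samples into ranges, calculate the expected count for each
range from the cumulative distribution function and then apply chisquare as before:

import numpy as np
from scipy.stats import norm, chisquare

numObs = 1000
samples = np.random.standard_normal(numObs)   # data to be tested

# Interior bin edges; the outer ranges extend to minus and plus infinity
edges = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
binIndices = np.searchsorted(edges, samples)  # range index for each value
obs = np.bincount(binIndices, minlength=len(edges)+1)

# Probability of each range from the CDF; expected count is probability
# multiplied by the total number of observations
cdfVals = np.concatenate([[0.0], norm.cdf(edges), [1.0]])
exp = np.diff(cdfVals) * numObs

chSqStat, pValue = chisquare(obs, exp)
print('Histogram Chi-square:', chSqStat, pValue)

Here the default degrees of freedom (the number of ranges minus one) are appropriate
because the null distribution is fully specified in advance; if its mean and standard
deviation were instead estimated from the binned data we would pass ddof=2.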