Figure 22.6. A histogram of observed and expected counts for a G-test. The
illustrated histogram represents the observed counts for a sample of 100 values that have
been binned into regions of width 0.05. The expected counts, represented by black spots,
are calculated from the probability density function of the normal distribution, which the
histogram approximates. For the G-test the different bins are treated as independent
random variables, and overall we estimate the goodness of fit of the sample histogram to
the normal expectation. It should be noted that because some counts are small (<5) it
would be inappropriate to apply the chi-square test.
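As an aside, a plot of this kind can be produced with a minimal sketch along the following lines, where the sample is simulated and the mean, standard deviation and bin edges are illustrative assumptions rather than values taken from the figure:
from matplotlib import pyplot
from numpy import arange
from scipy.stats import norm
mean, std, nObs, width = 1.76, 0.075, 100, 0.05 # illustrative, assumed values
sample = norm.rvs(mean, std, size=nObs) # simulate 100 height-like values
edges = arange(1.5, 2.01, width) # bin edges of width 0.05
counts, edges, patches = pyplot.hist(sample, bins=edges, color='grey')
centres = edges[:-1] + width/2.0
expected = nObs * width * norm.pdf(centres, mean, std) # PDF scaled to counts
pyplot.plot(centres, expected, 'ko') # expected counts as black spots
pyplot.show()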
We can illustrate this for the heights of people, using a null hypothesis with a normal
distribution, and then test the goodness of fit for different height categories. Here bins
contains the centres of the histogram ranges and obsd contains the corresponding
number of observed counts for each bin. The expected counts expd are calculated for each
of the bins values by applying the .pdf() method of the normal distribution (with
appropriate mean and standard deviation) and then scaling so that the total is the same as
the number of observations.
from scipy.stats import chisquare, norm
from numpy import array
bins = array([1.65, 1.7, 1.75, 1.8, 1.85])
obsd = array([ 14, 15, 33, 22, 8])
mean = 1.76
std = 0.075
nObs = obsd.sum()
expd = norm.pdf(bins, mean, std)
expd *= nObs / expd.sum()
# Expected counts: 9.196, 19.576, 26.720, 23.385, 13.122
chSqStat, pValue = chisquare(obsd, expd)
print('Chi square A', chSqStat, pValue)
The result is a chi-square statistic of 7.14 and a corresponding p-value of 0.129 (12.9%).
Hence, we would not reject the null hypothesis. As an alternative method we could use the
SciPy distribution chi2 and perform the statistical test for a one-tailed value as we
illustrated previously, noting that here we have to explicitly specify the number of degrees
of freedom for the chi-square distribution (which is one fewer than the number of
categories):
from scipy.stats import chi2
degFree = len(obsd)-1
pValue = chi2.sf(chSqStat, degFree) # Result is 0.129
Not only can the chi-square test be applied to discrete categorical counts, as we illustrate
above, but the process can also be applied to the fundamental parameters of a probability
distribution. For example, the null hypothesis may be a specific probability distribution
that can be described by a few fundamental parameters (μ, σ etc.), and we can compare
that with the alternative hypothesis where the parameters are unrestricted. In the case of
the mean and standard deviation there would be three degrees of freedom in the test (two
of which come from the fundamental parameters).
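As a hedged aside (not part of the original recipe): when the expected counts come from a distribution whose parameters were themselves estimated from the observed data, the convention is to remove one further degree of freedom per fitted parameter. SciPy's chisquare() supports this via its ddof argument, which compares the statistic against a chi-square distribution with k-1-ddof degrees of freedom:
# Assuming mean and std were fitted from the same data: df = 5 - 1 - 2 = 2
chSqStat, pValue = chisquare(obsd, expd, ddof=2)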
Pearson’s chi-square test is commonly used, and thus implemented in SciPy, but there
are limitations to its use, especially concerning low counts. A popular heuristic is that an
expected count of at least five is required for each random variable (each category).
Although corrections can be applied to account for this, in general where counts are small
it may be easier to apply a different test. Indeed, the chi-square statistic is really only an
approximation to the likelihood ratio statistic, developed in the days before computers for
ease of calculation. In the modern computer age we have little problem implementing
likelihood ratio tests, such as the G-test that we describe next, which is less sensitive to
small counts. The G-test uses the following statistic:
G = 2 Σi Oi ln(Oi/Ei)
where Oi is the observed count and Ei the expected count for category i.
The statistic is a more accurate version of chi-square, so we still do statistical tests using
the chi-square distribution with the same degrees of freedom as the equivalent chi-square
test. The G-statistic is not only useful for deriving probabilities, but also has a useful
meaning in itself. It is the same thing as twice the Kullback-Leibler divergence, or
relative entropy: a measure of the extra information required to get the observed
distribution from the expected distribution. Unfortunately the G-test is not implemented
directly in a single SciPy function, but we can easily do the required steps, making use of
the chi-square distribution’s survival function (chi2.sf()) to obtain a p-value. Firstly, we
gather some test data, which is similar to before, although we have included a few more
histogram bins with small counts. Then we calculate the expectation from the normal
distribution:
bins = array([1.55, 1.6, 1.65, 1.7, 1.75, 1.8, 1.85, 1.90])
obsd = array([ 3, 4, 14, 15, 33, 22, 8, 1])
mean = 1.76
std = 0.075
nObs = obsd.sum()
expd = norm.pdf(bins, mean, std)
expd *= nObs/expd.sum()
# Expected counts: 0.535, 2.769, 9.194, 19.571, 26.713, 23.379, 13.119, 4.720
The G statistic is easily calculated for the arrays by taking the natural logarithm of the
ratio of observed to expected counts, multiplying by the observed counts and then taking
twice the sum of the resulting array. The resulting G value is then passed to the
chi-square survival function, with the appropriate number of degrees of freedom:
from numpy import log
g = 2.0 * sum(obsd * log(obsd/expd))
degFree = len(obsd)-1
pValue = chi2.sf(g, degFree)
print('G test', g, pValue) # Result is: 17.34, 0.015
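To make the relative-entropy interpretation concrete, the following sketch uses SciPy's entropy() function, which computes the Kullback-Leibler divergence between two normalised distributions; scaling the divergence back up to counts recovers the G value of 17.34 printed above:
from scipy.stats import entropy
# Twice the KL divergence, measured in counts, equals the G statistic
gCheck = 2.0 * nObs * entropy(obsd/nObs, expd/nObs) # 17.34, matching g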
As illustrated, the result is a one-tailed probability of 1.5%. If we compare this with the
chi-square test, even though it is not appropriate to do so with such small counts, we get:
chSqStat, pValue = chisquare(obsd, expd)
print('Chi-square', chSqStat, pValue) # Result: 21.98, 0.00256
With the chi-square test the probability is now estimated to be 0.26%, which is
substantially different from the G-test value.
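As a final aside, more recent SciPy releases do offer the G-test in a single call: power_divergence() computes the same statistic and p-value when its lambda_ argument is set to 'log-likelihood':
from scipy.stats import power_divergence
gStat, pValue = power_divergence(obsd, expd, lambda_='log-likelihood')
print('G test', gStat, pValue) # Same result: 17.34, 0.015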