Python Programming for Biology: Bioinformatics and Beyond


Tim J. Stevens, Wayne Boucher

Figure 22.6. A histogram of observed and expected counts for a G-test. The illustrated histogram represents the observed counts for a sample of 100 values that have been binned into regions of width 0.05. The expected counts, represented by black spots, are calculated from the probability density function of the normal distribution, which the histogram approximates. For the G-test the different bins are treated as independent random variables, and overall we estimate the goodness of fit of the sample histogram to the normal expectation. It should be noted that because some counts are small (<5) it would be inappropriate to do the chi-square test.
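As a sketch of how a histogram like the one in the figure is constructed (the random seed and bin edges here are illustrative assumptions, so the counts will only approximate those shown):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)         # fixed seed, purely illustrative
sample = rng.normal(1.76, 0.075, 100)  # 100 values, as in the figure

# Bin edges of width 0.05, wide enough to cover the whole sample
edges = np.arange(1.4, 2.125, 0.05)
obsd, _ = np.histogram(sample, bins=edges)

# Expected counts: the normal pdf at the bin centres, scaled to the sample size
centres = edges[:-1] + 0.025
expd = norm.pdf(centres, 1.76, 0.075)
expd *= obsd.sum() / expd.sum()
```

The black spots in the figure correspond to expd; the bins in the tails with observed counts below five are the reason the G-test is preferred here.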

We can illustrate this for the heights of people, using a null hypothesis with a normal distribution, and then test the goodness of fit for different height categories. Here bins represents the centres of the histogram ranges and obsd contains the corresponding number of observed counts for each bin. The expected counts expd are calculated for each of the bins values by applying the .pdf() method of the normal distribution (with appropriate mean and standard deviation) and then scaling so that the total is the same as the number of observations.

from scipy.stats import norm, chisquare
from numpy import array

bins = array([1.65, 1.7, 1.75, 1.8, 1.85])
obsd = array([14, 15, 33, 22, 8])

mean = 1.76
std = 0.075

nObs = obsd.sum()
expd = norm.pdf(bins, mean, std)
expd *= nObs / expd.sum()
# Expected counts: 9.196, 19.576, 26.720, 23.385, 13.122

chSqStat, pValue = chisquare(obsd, expd)
print('Chi square A', chSqStat, pValue)



The result is a chi-square statistic of 7.14 and a corresponding p-value of 0.129 (12.9%). Hence, we would not reject the null hypothesis. As an alternative method we could use the SciPy distribution chi2 and do the statistical test for a one-tailed value as we illustrated previously, noting that here we have to explicitly specify the number of degrees of freedom for the chi-square distribution (which is one fewer than the number of categories):

from scipy.stats import chi2

degFree = len(obsd) - 1
pValue = chi2.sf(chSqStat, degFree) # Result is 0.129

Not only can the chi-square test be applied to discrete categorical counts as we illustrate above, but the process can also be applied to the fundamental parameters of a probability distribution. For example, the null hypothesis may be a specific probability distribution which can be described by a few fundamental parameters (μ, σ etc.) and we can compare that with the alternative hypothesis where the parameters are unrestricted. In the case of mean and standard deviation there would be three degrees of freedom in the test (two of which come from the fundamental parameters).
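As an illustrative sketch of this idea, using the height data from above and assuming (for illustration only) that the mean and standard deviation were themselves estimated from the sample; one common convention reduces the degrees of freedom by one for each fitted parameter:

```python
from scipy.stats import norm, chi2
from numpy import array

bins = array([1.65, 1.7, 1.75, 1.8, 1.85])
obsd = array([14, 15, 33, 22, 8])
nObs = obsd.sum()

# Treat the mean and standard deviation as parameters fitted from the data
mean = 1.76
std = 0.075
expd = norm.pdf(bins, mean, std)
expd *= nObs / expd.sum()

# Pearson chi-square statistic, calculated directly
chSqStat = ((obsd - expd)**2 / expd).sum()

# One df lost for the fixed total, plus one for each fitted parameter
degFree = len(obsd) - 1 - 2
pValue = chi2.sf(chSqStat, degFree)
print('Chi square, fitted params', chSqStat, degFree, pValue)
```

With the same statistic of 7.14 but only two degrees of freedom the p-value is smaller than before, showing why the df adjustment matters when parameters are fitted.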

Pearson’s chi-square test is commonly used, and thus implemented in SciPy, but there are limitations to its use, especially concerning low counts. A popular heuristic is that at least five observed counts are required for each random variable (each category). Although corrections can be applied to account for this, in general where counts are small it may be easier to apply a different test. Indeed the chi-square statistic is really only an approximation, developed in the days before computers for ease of calculation, to the likelihood ratio statistic. In the modern computer age we have little problem implementing likelihood ratio tests, such as the G-test that we describe next, which is less sensitive to small counts. The G-test uses the following statistic, where O_i and E_i are the observed and expected counts for each category i:

G = 2 Σ_i O_i ln(O_i / E_i)

The statistic is a more accurate version of chi-square so we still do statistical tests using the chi-square distribution with the same degrees of freedom as the equivalent chi-square test. The G-statistic is not only useful for deriving probabilities, but also has a useful meaning in itself. It is the same thing as twice the Kullback-Leibler divergence,[15] or relative entropy: a measure of the extra information required to get the observed distribution from the expected distribution. Unfortunately the G-test is not implemented directly in a single SciPy function, but we can easily do the required steps, making use of the chi-square distribution’s survival function (chi2.sf()) to obtain a p-value. Firstly, we gather some test data, which is similar to before, although we have included a few more histogram bins with small counts. Then we calculate the expectation from the normal distribution:

bins = array([1.55, 1.6, 1.65, 1.7, 1.75, 1.8, 1.85, 1.90])
obsd = array([3, 4, 14, 15, 33, 22, 8, 1])

mean = 1.76
std = 0.075

nObs = obsd.sum()
expd = norm.pdf(bins, mean, std)
expd *= nObs / expd.sum()
# Expected: 0.535, 2.769, 9.194, 19.571, 26.713, 23.379, 13.119, 4.720

The G-test is easily calculated for the arrays by taking the natural logarithm (base e) of the required ratio, multiplying by the observations and then finding twice the sum of the resulting array. The resulting G-value is then applied to the chi-square survival function, with the appropriate degrees of freedom:

from numpy import log

g = 2.0 * sum(obsd * log(obsd/expd))
degFree = len(obsd) - 1
pValue = chi2.sf(g, degFree)

print('G test', g, pValue) # Result is: 17.34, 0.015

As illustrated, the result is a one-tailed probability of 1.5%. If we compare this with the chi-square test, even though it is not appropriate to do so with such small counts, we get:

chSqStat, pValue = chisquare(obsd, expd)
print('Chi-square', chSqStat, pValue) # Result: 21.98, 0.00256

With the chi-square test the probability is now estimated to be 0.26%, which is substantially different from the G-test value.
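The relationship to the Kullback-Leibler divergence noted above can be checked numerically; scipy.special.rel_entr computes the elementwise terms p*log(p/q). Also, as a side note, newer SciPy versions do offer the same test directly via scipy.stats.power_divergence with lambda_='log-likelihood' (a sketch using the counts from the example above):

```python
from numpy import array, log
from scipy.special import rel_entr          # elementwise p * log(p / q)
from scipy.stats import power_divergence

obsd = array([3, 4, 14, 15, 33, 22, 8, 1])
expd = array([0.535, 2.769, 9.194, 19.571, 26.713, 23.379, 13.119, 4.720])
nObs = obsd.sum()

# The manual G statistic, as above
g = 2.0 * (obsd * log(obsd / expd)).sum()

# Twice the Kullback-Leibler divergence of the normalised distributions,
# scaled by the number of observations, gives the same value
kl = rel_entr(obsd / nObs, expd / expd.sum()).sum()
gFromKl = 2.0 * nObs * kl

# Newer SciPy versions compute the G-test in a single call
gStat, pValue = power_divergence(obsd, expd, lambda_='log-likelihood')
```

All three routes yield the same G value of 17.34, and power_divergence also returns the p-value using the chi-square distribution with k-1 degrees of freedom by default.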

