Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Chi-squared and G-tests

The statistical tests discussed so far in this chapter have considered whether individual sample parameters or data values fit with probability distributions. For example, we have illustrated tailed tests, which compare values with a null hypothesis distribution and thus obtain a p-value. However, there are various ways that we can compare multiple parameters, and even whole distributions, with one another, and this can involve a distribution for hypothesis testing in what is termed a goodness-of-fit test. If two distributions have the same mean but otherwise have different shapes, then such an analysis has a clear advantage over a comparison of means alone. As before, though, we must account for the error associated with taking random (and potentially small) samples from an underlying probability distribution.

The first method to compare multiple variables we will cover is Pearson’s chi-squared test. This test is based on the chi-squared statistic (χ²), which is defined as follows for the observed frequencies of events (o_i) and the expected frequencies of events (e_i), the latter generally being based on the null hypothesis:

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$

The test can be applied to categorical selection, and by extension to histograms of counts, which may be used to approximate any arbitrary distribution. The assumption for the test is that the deviation of each observed count from its expected count behaves like a normal random variable, which is a reasonable approximation when the expected counts are not too small. The chi-square distribution, which the statistical test is based upon, is the distribution of such a sum of squares resulting from different random samplings in each of the variable categories. For the chi-square distribution (as with the T-distribution) we need to know the number of degrees of freedom, which in general is the number of observed values (e.g. categories) minus the number of constraining parameters (e.g. totals).
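As an illustration of where the chi-square distribution comes from, the following sketch sums the squares of independent standard normal values and compares the result with SciPy's chi2 distribution. The random seed, the number of trials, the choice of three degrees of freedom and the threshold of 5.0 are arbitrary assumptions made for the demonstration; none of them come from the text.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)  # arbitrary seed, for repeatability only
n_dof = 3          # illustrative number of degrees of freedom
n_trials = 100000  # illustrative number of repeated samplings

# Each row holds n_dof independent standard normal values
normal_values = rng.standard_normal((n_trials, n_dof))
sum_of_squares = (normal_values ** 2).sum(axis=1)

# Compare the simulated mean with the theoretical chi-square mean
simulated_mean = sum_of_squares.mean()
theoretical_mean = chi2.mean(n_dof)
print('Means:', simulated_mean, theoretical_mean)

# Compare the fraction of sums above a threshold with the chi-square tail area
threshold = 5.0  # arbitrary cut-off for the comparison
simulated_tail = (sum_of_squares > threshold).mean()
theoretical_tail = chi2.sf(threshold, n_dof)
print('Tail areas:', simulated_tail, theoretical_tail)
```

With enough trials the simulated and theoretical values should agree closely, although the exact simulated numbers depend on the seed.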



Returning to the example of G:C versus A:T content in 1000 DNA base pairs, we can apply the chi-square statistic to these two base-pair categories, though they are constrained because they must sum to a given total. Nevertheless, we treat them as independent observations for the calculation of the statistic and then account for the constraint through the number of degrees of freedom. Hence if we have 530 G:C pairs and 470 A:T pairs and the expected count for each is 500, then the statistic is:

$$\chi^2 = \frac{(530 - 500)^2}{500} + \frac{(470 - 500)^2}{500} = 1.8 + 1.8 = 3.6$$
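As a check on this arithmetic, the statistic can be calculated by hand with NumPy before turning to any library test function. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
import numpy as np

observed = np.array([530, 470])  # G:C and A:T counts from the text
expected = np.array([500, 500])  # expected counts from the text

# Contribution of each category: (o_i - e_i)^2 / e_i
contributions = (observed - expected) ** 2 / expected
chi_sq_stat = contributions.sum()

print('Contributions:', contributions)
print('Chi-square statistic:', chi_sq_stat)
```

Each category contributes 900/500 = 1.8, so the two contributions sum to the value of 3.6 used in the rest of this section.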

After calculating the chi-square statistic, comparing observed and expected counts, the next stage is to evaluate the statistical significance of the resulting value (3.6 in the above example). For this we use the chi-square distribution. The number of degrees of freedom here is the number of random variables (which in the above case is the number of categories) minus one; we lose a degree of freedom because the total is fixed, so the A:T count is not random given the G:C count. We can use the cumulative distribution function of the chi-square distribution for one degree of freedom to generate a p-value (i.e. do a one-tailed test compared to the null hypothesis) for the observed chi-squared statistic. Fortunately, in SciPy this is all handled by one neat function, chisquare. This assumes that the number of degrees of freedom is (n−1), though in other situations we could pass in ddof, which is the number of further degrees of freedom to subtract from this default:

from numpy import array
from scipy.stats import chisquare

obs = array([530, 470])  # observed G:C and A:T counts
exp = array([500, 500])  # expected counts under the null hypothesis

chSqStat, pValue = chisquare(obs, exp)

print('DNA Chi-square:', chSqStat, pValue) # 3.6, 0.05778

The result is the anticipated chi-square statistic of 3.6 and a test probability of 0.058. It should be noted that the chi-square test is almost always a one-tailed test because we are normally interested in whether the fit is worse than expected, and are not concerned if the fit is better than expected (i.e. too good).
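To make the one-tailed nature of the test concrete, the p-value can be reproduced from the chi-square distribution itself, as the upper-tail area beyond the statistic. The sketch below assumes SciPy's chi2 distribution object, which is not used in the original example, and compares its result with the chisquare function.

```python
from scipy.stats import chi2, chisquare

chi_sq_stat = 3.6   # statistic from the DNA example
n_categories = 2
deg_free = n_categories - 1

# Upper-tail area of the chi-square distribution beyond the statistic
p_from_cdf = 1.0 - chi2.cdf(chi_sq_stat, deg_free)
p_from_sf = chi2.sf(chi_sq_stat, deg_free)  # survival function, same quantity

# The SciPy one-step test, for comparison
stat, p_from_test = chisquare([530, 470], [500, 500])

print('p-values:', p_from_cdf, p_from_sf, p_from_test)
```

All three values should agree at about 0.058. The survival function is generally the safer choice for very small tail areas, where subtracting the cumulative value from 1.0 loses precision.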

Moving on from the simple DNA example, we can think of samples from a probability distribution where the resulting values have been binned into a histogram (see Figure 22.6 for an example histogram). This gives a discrete set of categories, one for each range, and we can treat each category as a separate, independent sampling and compare it to the expectation from the null hypothesis. In this case the expected count is the area of the probability for each range (a region under the probability density function) multiplied by the total number of observations.
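The binning procedure described above might be sketched as follows. Everything specific here is an assumption for illustration: the data are simulated from a standard normal distribution with an arbitrary seed, the null hypothesis is that same standard normal distribution, and the bin edges are chosen arbitrarily. The two outermost bins are left open-ended so that the expected counts sum to the total number of observations.

```python
import numpy as np
from scipy.stats import norm, chisquare

rng = np.random.default_rng(1)        # arbitrary seed, for repeatability only
samples = rng.normal(0.0, 1.0, 1000)  # simulated data, purely illustrative

# Interior bin edges; values beyond them fall into two open-ended tail bins
inner_edges = np.linspace(-2.0, 2.0, 9)

# Observed counts: digitize gives a bin index from 0 to len(inner_edges)
bin_index = np.digitize(samples, inner_edges)
observed = np.bincount(bin_index, minlength=len(inner_edges) + 1)

# Expected counts: probability area of each range times the number of samples
edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))
bin_probs = np.diff(norm.cdf(edges))  # area under the PDF for each range
expected = bin_probs * len(samples)

chi_sq_stat, p_value = chisquare(observed, expected)
print('Histogram Chi-square:', chi_sq_stat, p_value)
```

Because the mean and standard deviation of the null distribution are specified in advance rather than estimated from the data, the default of one fewer degree of freedom than the number of bins is used here. If parameters were fitted to the samples first, ddof could be passed to account for them.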




