Python Programming for Biology: Bioinformatics and Beyond



Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

Z-scores and Z-test

A  Z-score  (also  called  the  standard  score)  is  the  number  of  standard  deviations  an

observed  value  is  different  from  the  mean.  So  for  our  human  height  example,  values  of

1.685  metres  and  1.910  metres  have  Z-scores  of  −1.0  and  2.0  because  they  are

respectively  1σ  below  and  2σ  above  the  mean.  This  is  formalised  in  the  following

equation, i.e. subtract the mean (μ) from the observed value (x) and divide by the standard deviation (σ):

$$z = \frac{x - \mu}{\sigma}$$

If we apply this to a whole distribution of values then we will centre it (the mean) at zero

and  give  it  a  standard  deviation  of  1.0,  e.g.  to  create  the  standard  normal  distribution,

whose  random  variable  is  often  labelled  Z.  We  can  easily  calculate  a  Z-score  in  Python,

here taking parameters from the human height example:

from numpy import array, abs

mean = 1.76

stdDev = 0.075

values = array([1.8, 1.9, 2.0])

zScores = abs(values - mean)/stdDev

print('Z scores', zScores)

Thus we estimate that 1.8, 1.9 and 2.0 metres respectively correspond to about 0.5, 1.9

and  3.2  standard  deviations  from  the  mean.  Note  that  SciPy  provides  the  stats.zscore()

function, but it operates differently because it estimates its own sample mean and sample

standard deviation from the input values:

from scipy.stats import zscore, norm



samples = norm.rvs(mean, stdDev, size=25) # Values for testing

zScores = zscore(samples, ddof=1) # Unbiased estimators

print('Est. Z scores ', zScores)
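To make the difference concrete, the following is a minimal sketch of the kind of calculation that an estimated Z-score involves: the mean and standard deviation are taken from the samples themselves rather than from known parameters. The height values and the variable names here are illustrative assumptions, not data from the text.

```python
from numpy import array

# Hypothetical height measurements in metres (illustrative values only)
heights = array([1.71, 1.80, 1.68, 1.85, 1.76, 1.74])

# Estimate the parameters from the data itself
sampleMean = heights.mean()
sampleStd = heights.std(ddof=1)  # ddof=1 uses n-1 in the variance estimate

# Z-scores relative to the estimated, not the known, parameters
estZScores = (heights - sampleMean) / sampleStd

print('Est. Z scores', estZScores)
print('Mean', estZScores.mean(), 'Std', estZScores.std(ddof=1))
```

By construction the estimated Z-scores have a mean of zero and a sample standard deviation of one, whatever the true population parameters are, which is why they differ from Z-scores calculated with a known mean and standard deviation.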

A related concept to this is the Z-test, which can be used when we have samples that are

taken from a normal distribution where the true mean and standard deviation are known.

The  Z-test  is  effectively  the  calculation  of  a  Z-score  for  a  sample  mean.  A  common

situation for use of the Z-test is where a large population is known to have a mean, μ₀, and

standard deviation, σ, and where some other population of size n is measured to have a

sample mean, x̄, and the same standard deviation. We want to know whether this is

significantly different and the null hypothesis would be that the two populations have the

same mean. For the Z-test the Z-score is defined as:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

As  discussed  above,  in  the  context  of  the  standard  error  of  the  mean,  the  standard

deviation of the sample mean is a factor of √n smaller than the standard deviation of

the distribution. The analysis also works if the distribution is not normal but the number of

samples, n, is large, by the central limit theorem (assuming the conditions for the theorem

are satisfied). If the standard deviation is not known, then the T-test described in the next

section should be used instead.
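The √n scaling of the standard error can be checked numerically. The sketch below is only an illustration under assumed settings: the seed, the number of trials and the sample size of 25 are arbitrary choices, and the mean and standard deviation are borrowed from the human height example.

```python
from numpy import sqrt
from numpy.random import default_rng

rng = default_rng(7)  # Fixed seed so that the run is repeatable

mean, stdDev, n = 1.76, 0.075, 25
numTrials = 20000

# Each row is one simulated experiment of n samples
samples = rng.normal(mean, stdDev, size=(numTrials, n))
sampleMeans = samples.mean(axis=1)

# Spread of the sample means versus the predicted standard error
observed = sampleMeans.std(ddof=1)
expected = stdDev / sqrt(n)

print('Observed', observed, 'Expected', expected)
```

The two printed values should agree closely, with the remaining difference due to the finite number of simulated experiments.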

Given a standard normal distribution (μ = 0, σ = 1), the probability of observing a Z-score at least as extreme as the measured one, in either direction, is a two-tailed test. If this probability is low then the two populations are

deemed  to  have  a  significantly  different  mean,  and  the  null  hypothesis  is  rejected.  If  z

were  positive  we  could  also  consider  a  one-tailed  test,  which  is  the  probability  of

observing a result at least this positive. For the Z-test there is no direct SciPy function to

perform the whole calculation of tail probabilities. Hence we need to take specific steps to

find the integral of the probability distribution from the Z-score. Fortunately this is partly

solved by having a cumulative distribution available: the summation up to a threshold of

the probability density function. The cumulative distribution of the standard normal (Φ) is

required for the tailed test. This is easily calculated in Python using the error function available in SciPy, which is related to the cumulative distribution of the standard normal:

$$\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

and thus solves the integral we require without too much hassle.
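The relationship between the error function and the cumulative distribution, Φ(z) = ½[1 + erf(z/√2)], can be confirmed with a short comparison against the cumulative distribution function of SciPy's norm object. This is a small sanity check rather than part of the Z-test itself, and the test values of z are arbitrary.

```python
from numpy import array, sqrt, allclose
from scipy.special import erf
from scipy.stats import norm

zValues = array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Cumulative distribution of the standard normal via the error function
phiFromErf = 0.5 * (1.0 + erf(zValues / sqrt(2.0)))

# The same quantity from the standard normal's cumulative distribution
phiFromNorm = norm.cdf(zValues)

print('Phi from erf ', phiFromErf)
print('Phi from norm', phiFromNorm)
print('Agree:', allclose(phiFromErf, phiFromNorm))
```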


The code to calculate the Z-test probability in SciPy involves calculating the Z-scores

for  the  standard  error  of  the  means  and  then  using  the  error  function  erf()  to  derive  the

cumulative probability:

from numpy import sqrt

from scipy.special import erf

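def zTestMean(sMean, nSamples, normMean, stdDev, oneSided=True):

    # Absolute Z-score of the sample mean, in units of the standard error
    zScore = abs(sMean - normMean) / (stdDev / sqrt(nSamples))

    # Two-tailed probability, i.e. twice the tail 1 - Phi(zScore)
    prob = 1-erf(zScore/sqrt(2))

    if oneSided:
        prob *= 0.5  # Keep only a single tail

    return prob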

The calculation of the probability involves a trivial bit of arithmetic, remembering that

we want 1 − Φ, the tail of the cumulative distribution of the standard normal, and noting

that the initial probability calculation, 1 − erf(z/√2), is the two-tailed result (i.e. twice 1 − Φ),

which we halve for the one-tailed result. This can be tested with some example data values

which are roughly normal:

from numpy import array

samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])

mean = 1.76

stdDev = 0.075

result = zTestMean(samples.mean(), len(samples),
                   mean, stdDev, oneSided=True)

print('Z-test', result) # Result is 0.1179

The resulting probability of obtaining a sample mean at least this far below the mean of the normal distribution, if the samples really were generated from it, is 11.8%, so we generally wouldn’t want to reject the notion that the samples were generated from it.
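As a cross-check on the error function approach, the same tail probability can be obtained from the survival function of the standard normal, norm.sf(), which gives 1 − Φ directly. This is offered as an alternative sketch, not as the method used in the text. It repeats the sample values so that it runs on its own, and the names oneTailed and twoTailed are illustrative.

```python
from numpy import array, sqrt
from scipy.stats import norm

samples = array([1.752, 1.818, 1.597, 1.697, 1.644, 1.593,
                 1.878, 1.648, 1.819, 1.794, 1.745, 1.827])
mean = 1.76
stdDev = 0.075

# Absolute Z-score of the sample mean, using the standard error of the mean
zScore = abs(samples.mean() - mean) / (stdDev / sqrt(len(samples)))

# norm.sf(z) is the survival function 1 - Phi(z), i.e. the upper tail
oneTailed = norm.sf(zScore)
twoTailed = 2.0 * oneTailed

print('One-tailed', oneTailed)  # About 0.118, as for zTestMean()
print('Two-tailed', twoTailed)
```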


As another example, suppose we have a large database of DNA sequences and the G:C

content of sequences in the database has mean 0.59 and standard deviation 0.1. The G:C

content  would  not  usually  be  modelled  using  a  normal  distribution,  but  if  we  have  100

sequences  not  in  the  database,  and  measure  the  G:C  content  of  each,  then  we  could  still

reasonably apply the Z-test, thus informing us whether they are likely to be from the same

population of sequences. Suppose that the average G:C content in these 100 sequences is

0.61. The one-tailed test is given by

result = zTestMean(0.61, 100, 0.59, 0.1)

with result 0.023. The two-tailed test gives twice this, so 0.046. In both cases, if 5% is the

significance  level  used,  then  the  null  hypothesis  is  rejected,  and  it  is  concluded  that  the

100  sequences  have  a  significantly  different  G:C  content  than  the  sequences  in  the

database.
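The arithmetic behind this G:C example can be written out step by step. The sketch below keeps the sign of the Z-score, so the one-tailed value is explicitly the probability of a result at least this far above the database mean. It uses norm.sf() for the tails, and the variable names and the alpha threshold variable are illustrative assumptions.

```python
from numpy import sqrt
from scipy.stats import norm

dbMean, dbStdDev = 0.59, 0.1  # Database G:C content parameters
sampleMean, n = 0.61, 100     # The new sequences
alpha = 0.05                  # Significance level

# Signed Z-score: positive as the sample mean is above the database mean
zScore = (sampleMean - dbMean) / (dbStdDev / sqrt(n))

upperTail = norm.sf(zScore)             # P(Z >= zScore), one-tailed
bothTails = 2.0 * norm.sf(abs(zScore))  # P(|Z| >= |zScore|), two-tailed

print('Z-score', zScore)
print('One-tailed %.3f, reject null: %s' % (upperTail, upperTail < alpha))
print('Two-tailed %.3f, reject null: %s' % (bothTails, bothTails < alpha))
```

Here the sample mean lies two standard errors above the database mean, which is why both tail probabilities fall just below the 5% level.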



T-tests

The  Z-test  we  described  relied  on  knowledge  of  a  distribution’s  standard  deviation  (or

having a good estimate from a large population). However, in many situations we do not

know the underlying mean and standard deviations of the probability distributions. This is

often  the  natural  outcome  of  having  small  statistical  samples.  Nonetheless,  we  may  still

want  to  evaluate  whether  statistical  samples  are  significantly  different  from  one  another.

This is where the idea of T-tests comes in.

T-tests  are  based  on  the  notion  of  the  T-statistic,  which  is  similar  to  the  Z-score

discussed  before.  Accordingly,  the  T-statistic  is  the  measure  of  the  number  of  standard



errors  a  measured  parameter  value  is  from  its  true  value.  In  many  cases  the  parameter

we’re interested in is the mean of a normal distribution, in which case the T-statistic could

be the number of standard errors that the sample mean (x̄) lies from the true mean (μ

