Python Programming for Biology: Bioinformatics and Beyond


by Tim J. Stevens and Wayne Boucher

Figure 22.3. Calculating tail probabilities in the binomial test. The probability mass function and the regions considered in probabilistic tail tests are illustrated for an example binomial distribution, where the probability of an event (p) is 0.5 and the number of trials (n) is 1000. For a one-tailed test the probability considered corresponds to the area under the curve from the test value, which here is 530, and above. For the two-tailed test the probability considered corresponds to the area under the curve in two regions, above and below the mean (530 and 470 here), that start at the same separation from the mean and go outwards. The cumulative distribution is the running total of the probabilities for all the counts up to a particular value (i.e. starting at zero) and can be used to look up the areas of the tails from the probability mass function. For the right-hand threshold (530) the cumulative probability is the sum up to that value, but for the tail area we require the cumulative probability from that value, hence we subtract the cumulative probability before the right-hand threshold from 1.0 to get the required value.

The most common kind of region, and hence integral, to choose for statistical tests is the tails of the probability distribution. Here, by ‘tail’ we mean the area at the extreme edge of the distribution that is bounded by a specified threshold. Many probability distributions will have two tails, for extremely small and extremely large values respectively, and in a statistical test we may choose to consider either or both of these tails, depending on the situation. Accordingly, you will see both one-tailed and two-tailed tests used in scientific contexts. In biological research p-values are frequently quoted, which is simply the application of one-tailed tests to a null hypothesis. Put more formally, a p-value is the probability of values at least as extreme as the observation being generated by the null hypothesis.⁹ (It is not the probability that the null hypothesis is correct, or the probability that a value is random ‘noise’.) While it is possible to define other regions of the distribution for testing purposes, using the tails of the distributions has the advantage of having no extra parameters to consider (such as a region width), and their integral is easy to calculate using a cumulative distribution function, which represents the running total of probabilities up to a specific value. We will illustrate one-tailed and two-tailed tests using the discrete (i.e. whole number) binomial and Poisson probability distributions that we introduced in Chapter 21, as well as for the continuous normal distribution.

First we consider the binomial test, which is concerned with the number of occurrences of an event that has a fixed probability of occurring, given a certain number of trials. If we apply this to the example of a DNA sequence of specified length (a specified number of trials) where we believe that the probability of G:C base pairs occurring is equal to the probability of A:T pairs, we can ask whether an observed G:C versus A:T count fits with our model. If we look at a DNA sequence of length 1000, then following the random hypothesis we would expect that on average around 500 of the nucleotides would be G or C, and 500 would be A or T. If we get significantly more or fewer than 500 (of either pair) then we can conclude that the G:C content is likely not to be random. The binomial test formalises these considerations: if the probability of the G:C event is 0.5 then the expected number of events in 1000 trials is 0.5 × 1000 = 500. Furthermore, the binomial distribution allows us to associate a probability with every number of G:C events that may be observed in 1000 trials and hence also calculate the area under the tails of the distribution.
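As a quick illustration of that probability mass function (a sketch of our own using SciPy's binom with the example parameters, not code from the text), the probability peaks at the expected count of 500 and is already much smaller at 530:

```python
from scipy.stats import binom

nTrials, pEvent = 1000, 0.5
mean = nTrials * pEvent  # expected number of G:C events: 500

# Probability of observing exactly k G:C pairs in 1000 positions
pAtMean = binom.pmf(mean, nTrials, pEvent)
pAt530 = binom.pmf(530, nTrials, pEvent)

print(pAtMean, pAt530)
```

Even the most likely single count has a probability of only about 2.5%, which is why tests consider regions of the distribution rather than individual values.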

The exact form of the one-tailed test depends on whether the observed number of events is higher or lower than the expected number of events. If it is higher, then we calculate the probability that the observed number is that value or larger. If it is lower, we instead consider the probability that it is that value or smaller. In both cases the two-tailed test is the same. If in our example the actual number of observed G:C pairs is 530, then the one-tailed test considers the probability that we would observe at least 530 G:C nucleotide pairs in the sequence of length 1000. The two-tailed test not only considers the probability of observing at least 530 G:C (30 more than the mean) but also the probability of observing at most 470 G:C (30 fewer than the mean). Conversely, if the actual number of observed G:C pairs is 470, then the one-tailed test considers the probability that we would observe at most 470 G:C nucleotide pairs in the sequence of length 1000.

The two-tailed binomial test is available as a function in SciPy:

from scipy.stats import binom_test

count, nTrials, pEvent = 530, 1000, 0.5
result = binom_test(count, nTrials, pEvent)
print('Binomial two tail', result)

However, we can also create our own function using the explicit probability distribution. We will allow the input counts argument to be a list, a tuple or a NumPy array. With a bit of extra effort we could also allow the argument to be a simple integer. We first convert counts to a NumPy array, because that allows calculations to be done on the entire array in one go, rather than having to loop over the elements. (For a discussion of NumPy, see Chapter 9.)

We also include an option to specify one-tailed tests. The one-tailed test is complicated because, as noted above, the calculation is different depending on whether the counts are greater than, or less than or equal to, the mean. Here we test whether elements are greater than the mean (counts > mean), which yields an array of True and False values. Then the NumPy nonzero() function is used to obtain the indices of the true elements: those that were greater than the mean. Conveniently, we can access the corresponding elements of a NumPy array using these indices. This is faster than looping over the individual elements of the array, at the cost of requiring more memory for the indices. The values that are less than or equal to the mean form the opposite selection: we also get these from the same test, but it is noteworthy that a NumPy array cannot be negated with the ordinary Python not Boolean operator. Rather, the ‘~’ symbol negates each element of the array (swaps True with False).
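To see this indexing machinery in isolation, here is a small sketch of our own with hypothetical counts (not part of the original listing):

```python
from numpy import array

counts = array([470, 530, 500])
mean = 500.0

isAboveMean = counts > mean          # array of True/False values
aboveIdx = isAboveMean.nonzero()     # indices of elements above the mean
belowIdx = (~isAboveMean).nonzero()  # '~' negates elementwise; 'not' would fail

print(counts[aboveIdx])  # the counts above the mean
print(counts[belowIdx])  # the counts at or below the mean
```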

from scipy.stats import binom
from numpy import array, zeros

def binomialTailTest(counts, nTrials, pEvent, oneSided=True):

    counts = array(counts)
    mean = nTrials * pEvent

    if oneSided:
        result = zeros(counts.shape)
        isAboveMean = counts > mean
        aboveIdx = isAboveMean.nonzero()
        belowIdx = (~isAboveMean).nonzero()
        result[aboveIdx] = binom.sf(counts[aboveIdx]-1, nTrials, pEvent)
        result[belowIdx] = binom.cdf(counts[belowIdx], nTrials, pEvent)

    else:
        diffs = abs(counts-mean)
        result = binom.cdf(mean-diffs, nTrials, pEvent)
        result += binom.sf(mean+diffs-1, nTrials, pEvent)

    return result

For counts less than or equal to the mean, we use the cumulative distribution function, .cdf(), which exactly provides the probability that the value is less than or equal to the count. For counts higher than the mean we instead use .sf(), the ‘survival’ function, as an alternative to calculating 1.0 minus the cumulative distribution function, which would be required to calculate the integral of the probabilities greater than or equal to the count. The subtraction of 1 from the counts in the .sf() call is the main oddity here; this is because the binom.sf() function calculates the probability that the number of successes is greater than its argument, rather than greater than or equal to it.

For the two-tailed test we subtract the test values (counts) from the distribution’s mean value to get separations from the mean. These differences are then used to get the upper and lower test thresholds. The probability sums for the lower thresholds are obtained via .cdf(), and for the upper thresholds with .sf(), as in the one-tailed test.
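The equivalence between the survival function and 1.0 minus the cumulative distribution function can be checked directly with the example numbers (a quick check of our own, not code from the text):

```python
from scipy.stats import binom

k, nTrials, pEvent = 530, 1000, 0.5

upperTail = binom.sf(k-1, nTrials, pEvent)      # P(count >= k)
viaCdf = 1.0 - binom.cdf(k-1, nTrials, pEvent)  # the equivalent, by hand

print(upperTail, viaCdf)
```

Both expressions give the same upper-tail probability; .sf() simply saves the explicit subtraction and can be more accurate for very small tail areas.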

Continuing with the G:C content example, where we observe 530 G:C events from 1000 positions, to do the one-tailed test to get the probability that we observe at least 530 we simply call the function with the appropriate arguments:

counts = [530]
result = binomialTailTest(counts, 1000, 0.5, oneSided=True)
print('Binomial one tail', result)

and this gives a probability of 0.0310. Given that the binomial distribution was the null hypothesis, this one-tailed probability is the p-value. Similarly, we can do the two-tailed test to calculate the probability that we observe at least 530 or at most 470 events, remembering that we set oneSided to be False:

result = binomialTailTest(counts, 1000, 0.5, oneSided=False)
print('Binomial two tail', result)

which gives a result of 0.0620. These results illustrate the subtlety of statistical analysis. Often something is deemed significant if it has less than a 5% probability of originating from the null distribution. Given that the one-tailed probability here is 3.1%, one might want to conclude that the DNA sequence is likely to be non-random (i.e. doesn’t fit the null hypothesis). However, the two-tailed probability is 6.2%, so one might equally say that one cannot conclude this. Fundamentally it is the probability itself that matters, including how it is defined (one-tailed or two-tailed): no matter what threshold we use, the probability tells us how likely we are to be right or wrong about any specific statement. Significance thresholds like 5% are merely conveniences for standardisation and are not applicable in every situation.

Although we have illustrated the tailed test for the binomial distribution, we can use the same strategy for other distributions that are provided by SciPy. In the next example we illustrate this for the Poisson distribution, noting that the function (like the binomial one) can accept an array of values to calculate probabilities for:

from scipy.stats import poisson
from numpy import abs, array

def poissonTailTest(counts, eventRate, oneSided=True):

    counts = array(counts)

    if oneSided:
        result = poisson.sf(counts-1, eventRate)
    else:
        diffs = abs(counts-eventRate)
        result = 2*poisson.cdf(eventRate-diffs, eventRate)

    return result

counts = [2300, 2400, 2550]
result = poissonTailTest(counts, 2500, oneSided=False)
print('Poisson two tail', result)

# Result is approximately [5.31e-05, 0.0455, 0.322]

The only differences from the binomial example are that the mean value is simply the event rate (the distribution’s only parameter) and that, because the distribution is approximately symmetric either side of the mean for large event rates, the one-tail calculation is simpler and we can double the integral for one tail to get the probability corresponding to both tails.
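As a sanity check of the doubling shortcut (our own sketch, not code from the text), we can compare the doubled lower-tail integral with the explicit sum of both tails for one of the example counts; for a large event rate the two agree closely:

```python
from scipy.stats import poisson

# Example values from the text: event rate 2500, an observed count of 2400
rate, count = 2500, 2400
diff = abs(count - rate)

# Symmetry shortcut: double the lower-tail integral
twoTail = 2 * poisson.cdf(rate - diff, rate)

# Explicit sum of the lower and upper tails
explicit = poisson.cdf(rate - diff, rate) + poisson.sf(rate + diff - 1, rate)

print(twoTail, explicit)
```

The small discrepancy between the two reflects the slight right skew of the Poisson distribution, which shrinks as the event rate grows.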

Next we move on to an example that uses the normal distribution, otherwise known as the Gaussian distribution. In this case we are dealing with a continuous probability distribution (rather than a discrete one), i.e. where the outcome is a value from the continuum of real numbers. The concepts largely carry over from discrete to continuous distributions, with the important difference that discrete sums become integrals.¹⁰ And as


