Python Programming for Biology: Bioinformatics and Beyond


Variance, standard deviation and skew



[Tim J. Stevens, Wayne Boucher] Python Programming


The standard deviation and the variance (the square of the standard deviation) are measures of the spread of a set of values. They can be calculated for any given sample of values. However, as with the mean, the parameters calculated from a given sample are only estimates of those of the underlying probability distribution. Here we label the true (or population) standard deviation as σ and the true variance as σ², whereas the sample standard deviation is s and the sample variance is s².



The variance is a measure of how far the values are spread from the mean. Mathematically it is the expectation of the squared differences from the mean. For a sample of values x_i with sample mean x̄ it is calculated as the sum of the squared differences from the mean divided by the number of values (n) minus one:

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

This is an unbiased estimate of the underlying variance, but it is also commonplace to simply divide by the number of values, which for a large sample size makes little difference, though strictly speaking it is biased (written here as s_n² to distinguish it):

$$ s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$




We can calculate the variance in standard Python if we need to:

values = [1,2,2,3,2,1,4,2,3,1,0]
n = float(len(values))
mean = sum(values)/n
diffs = [v-mean for v in values]
variance = sum([d*d for d in diffs])/(n-1) # Unbiased estimate
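
As a cross-check on the hand-written calculation, the standard library's statistics module offers both estimates directly. The sketch below is not from the original text; it assumes the same list of values and relies on variance() dividing by n-1 and pvariance() dividing by n.

```python
from statistics import variance, pvariance

values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]

# Sample variance: sum of squared differences divided by (n - 1)
unbiased = variance(values)

# Population variance: sum of squared differences divided by n
biased = pvariance(values)

print('Unbiased', round(unbiased, 4))  # 1.2909
print('Biased', round(biased, 4))      # 1.1736
```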

As you might expect, there is a handy var() function in NumPy, which is also available as a method of array objects. As with the mean function, we can also specify an axis to get variances across the rows or columns of multi-dimensional arrays. It should be noted that var() takes a ddof argument (the delta degrees of freedom, so that the divisor is n minus ddof), which should be set to 1 for the unbiased estimate; the default value is zero, which gives the biased estimate.

from numpy import array
valArray = array(values)
variance = valArray.var() # Biased estimate
print('Var 1', variance) # Result is 1.1736
variance = valArray.var(ddof=1) # Unbiased estimate
print('Var 2', variance) # Result is 1.2909
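
The axis argument mentioned above can be illustrated with a small two-dimensional array. This is a minimal sketch on made-up data (the array called data is hypothetical, not from the original text): axis=0 gives one variance per column and axis=1 gives one per row.

```python
from numpy import array

# Hypothetical 2 x 3 data: two rows of three measurements each
data = array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

colVars = data.var(axis=0)          # Down each column, biased: [2.25 2.25 2.25]
rowVars = data.var(axis=1, ddof=1)  # Along each row, unbiased: [1. 1.]

print('Column variances', colVars)
print('Row variances', rowVars)
```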

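The biased variance equation can be rearranged as follows, as the mean (the expectation) of the squared values minus the square of the mean:

$$ s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 $$

Formulating the variance in this way can be handy in various situations because it involves fewer computational steps; we don't need to do a subtraction for every data point (which additionally may incur floating point errors). Bear in mind, though, that the final subtraction of two similar-sized numbers can itself lose precision when the mean is large compared with the spread.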
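
To make the rearrangement concrete, the sketch below (an illustration, not from the original text) computes the biased variance of the example values both ways, using only standard Python. The variable names are chosen here for illustration.

```python
values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]
n = len(values)

mean = sum(values) / n
meanOfSquares = sum(v * v for v in values) / n

# Biased variance via the rearranged form: mean of squares minus square of mean
varRearranged = meanOfSquares - mean * mean

# Biased variance via the direct form, for comparison
varDirect = sum((v - mean) ** 2 for v in values) / n

print(round(varRearranged, 4), round(varDirect, 4))  # 1.1736 1.1736
```

The rearranged form needs only running totals of the values and their squares, so it can be accumulated in a single pass over the data.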

Because the standard deviation is the square root of the variance, it gives a measure of spread in the same units of measurement as the data. This is handy when describing statistical samples, so, for example, the height of a population may be described as the mean plus or minus the standard deviation: e.g. 1.777 ± 0.075 metres. The standard deviation is trivial to obtain using a square root operation and the above equations for variance, though there is also a handy std() function, which is likewise built into NumPy arrays. Note again that we set ddof=1 to use the unbiased estimate of the variance (although even in this case, std(ddof=1) does not give a truly unbiased estimate of the standard deviation):

from numpy import std, sqrt
stdDev = sqrt(variance) # Uses the unbiased variance from above - 1.1362
stdDev = std(valArray) # Biased estimate - 1.0833
stdDev = valArray.std(ddof=1) # "Unbiased" estimate - 1.1362
print('Std:', stdDev)
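
The "mean plus or minus the standard deviation" style of description can be produced directly from the array. This is a small sketch assuming the same example values; the two-decimal formatting and the name summary are arbitrary choices made here.

```python
from numpy import array

values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]
valArray = array(values)

mean = valArray.mean()
stdDev = valArray.std(ddof=1)

# Report the sample as mean +/- standard deviation, in the units of the data
summary = '%.2f +/- %.2f' % (mean, stdDev)
print(summary)  # 1.91 +/- 1.14
```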

Related to the standard deviation is a value called the standard error of the mean (SEM). Given that the mean of a sample is only an estimate of the underlying mean, there will naturally be some variation in its calculation. The SEM represents the standard deviation of the sample mean that results from different samplings of the underlying probability distribution. Scientifically it can be important to acknowledge that the sample mean is an estimate, and when supporting theories with a mean value it is often helpful to show the SEM, for example, on a graph, to indicate the confidence in the argument. The standard error of the mean is the standard deviation in x (s_x) divided by the square root of the number of values (n):

$$ \mathrm{SEM} = \frac{s_x}{\sqrt{n}} $$

This may be calculated in Python from the standard deviation and also by using a function from scipy.stats:

stdErrMean = valArray.std(ddof=1)/sqrt(len(valArray))
from scipy.stats import sem
stdErrMean = sem(valArray, ddof=1) # Result is 0.3426
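
The idea that the SEM describes the spread of the sample mean over repeated samplings can be checked with a small simulation. The sketch below is an illustration under assumed settings (a normal distribution with standard deviation 2.0, samples of 25 values, a fixed random seed); none of these numbers come from the original text.

```python
from numpy import sqrt
from numpy.random import default_rng

rng = default_rng(1)  # Fixed seed so the run is repeatable

trueStd = 2.0       # Assumed population standard deviation
n = 25              # Number of values in each sample
numSamples = 20000  # Number of repeated samplings

# Each row is one independent sample of n values
samples = rng.normal(10.0, trueStd, size=(numSamples, n))

# Spread of the sample means across the repeated samplings
stdOfMeans = samples.mean(axis=1).std(ddof=1)

# SEM estimated from the first sample alone
semFirst = samples[0].std(ddof=1) / sqrt(n)

print('Std of sample means', stdOfMeans)  # Close to 2.0/sqrt(25) = 0.4
print('SEM from one sample', semFirst)    # A single, noisier estimate
```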

The skewness of a distribution is a measure of asymmetry or lopsidedness. Though perhaps not as commonly used as the other parameters, estimating the skewness can be a useful test if you believe the underlying probability distribution ought to be symmetric. The skewness is commonly estimated for a sample as the mean cubed difference of the data from the mean divided by the standard deviation cubed:

$$ \mathrm{skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3} $$

We can illustrate this in Python using a random sample drawn from the asymmetric gamma distribution.

from scipy.stats import skew
from numpy import random
samples = random.gamma(3.0, 2.0, 100) # Example data: shape 3.0, scale 2.0, 100 values
skewness = skew(samples)
print( 'Skew', skewness ) # Result depends on random sample
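
The description of skewness as the mean cubed difference divided by the standard deviation cubed can be written out by hand and compared with skew(). This sketch assumes the biased (divide by n) standard deviation, which is what skew() uses with its default bias=True; the names m2 and m3 are introduced here for the second and third central moments.

```python
from numpy import array
from scipy.stats import skew

values = array([1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0])

diffs = values - values.mean()
m2 = (diffs ** 2).mean()  # Biased variance (second central moment)
m3 = (diffs ** 3).mean()  # Mean cubed difference (third central moment)

skewManual = m3 / m2 ** 1.5  # Divide by the standard deviation cubed
skewSciPy = skew(values)     # Default is bias=True

print('Manual', skewManual)
print('SciPy', skewSciPy)
```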

Alternatively, as a very rough measure of skew, the non-parametric skew is easy to calculate as the difference between the mean and the median divided by the standard deviation, with the general idea being that the mean and the median will be the same when the distribution is symmetric.
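
A minimal sketch of the non-parametric skew follows. The function name nonParametricSkew and both data sets are hypothetical, and the use of the biased standard deviation (the NumPy default, ddof=0) is an assumption rather than something the text specifies.

```python
from numpy import array, mean, median, std

def nonParametricSkew(data):
    # (mean - median) / standard deviation; ddof=0 is an assumption here
    data = array(data, float)
    return (mean(data) - median(data)) / std(data)

symmetric = [1, 2, 3, 4, 5]
rightSkewed = [1, 1, 2, 2, 3, 10]

print(nonParametricSkew(symmetric))    # 0.0
print(nonParametricSkew(rightSkewed))  # Positive: the mean exceeds the median
```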




