Python Programming for Biology: Bioinformatics and Beyond

Variance, standard deviation and skew

[Tim J. Stevens, Wayne Boucher] Python Programming

The standard deviation and variance (the square of the standard deviation) are measures of

the spread in the range of values. These can be calculated for any given sample of values.

However, and in a similar manner to the mean, the parameters calculated for a given set of

samples are only an estimate for the underlying probability distribution. Here we label the

true (or population) standard deviation as σ and true variance as σ², whereas the sample

standard deviation is s and sample variance is s².

The variance is a measure of how far the values are spread from the mean.

Mathematically it is the expectation of the squared differences from the mean. For a

sample it is calculated as the sum of the square differences from the mean divided by the

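number of values (n) minus one:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Here x_i are the individual values and x̄ is the sample mean. This is an unbiased estimate

of the underlying variance, but it is also commonplace to

simply divide by the number of values, which for a large sample size makes little

difference, though strictly speaking it is biased (written here as s_n² to distinguish it from s²):

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$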

We can calculate the variance in standard Python if we need to:

values = [1,2,2,3,2,1,4,2,3,1,0]

n = float(len(values))

mean = sum(values)/n

diffs = [v-mean for v in values]

variance = sum([d*d for d in diffs])/(n-1) # Unbiased estimate
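As a cross-check on the hand calculation above, the standard library's statistics module provides both estimates. This is only a small sketch: variance() divides by n-1 and pvariance() divides by n, and the variable names are ours rather than the book's.

```python
import statistics

values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]

# variance() divides by n-1, giving the unbiased sample estimate
unbiased = statistics.variance(values)

# pvariance() divides by n, which is the biased estimate when applied to a sample
biased = statistics.pvariance(values)

print('Unbiased: %.4f' % unbiased)  # 1.2909
print('Biased: %.4f' % biased)      # 1.1736
```

The difference between the two shrinks as the sample grows, since the ratio of the divisors, (n-1)/n, approaches one.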

Although, as you might expect, there is a handy var() function in NumPy, which is also

built into array objects. Similar to the mean function, we can also specify an axis to get

variances across rows and columns of multi-dimensional arrays. It should be noted that

var() takes a ddof ("delta degrees of freedom") argument, which is subtracted from the

number of values in the divisor. It should be set to 1 for the unbiased estimate; the

default value is zero, which gives the biased estimate.

from numpy import array

valArray = array(values)

variance = valArray.var() # Biased estimate

print('Var 1', variance) # Result is 1.1736

variance = valArray.var(ddof=1) # Unbiased estimate

print('Var 2', variance) # Result is 1.2909
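The axis argument mentioned above can be sketched with a small two-dimensional array. The data here is made up purely for illustration; axis=0 collapses the rows to give one variance per column, and axis=1 gives one variance per row.

```python
from numpy import array

# Hypothetical 2 x 3 data set: two rows of three measurements
dataArray = array([[1.0, 2.0, 3.0],
                   [4.0, 6.0, 8.0]])

colVars = dataArray.var(axis=0)          # Biased variance down each column
rowVars = dataArray.var(axis=1, ddof=1)  # Unbiased variance along each row

print('Column variances', colVars)  # Three values, one per column
print('Row variances', rowVars)     # Two values, one per row
```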

The biased variance equation can be rearranged as follows, as the mean (the

expectation) of the squared values minus the square of the mean:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2$$

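Formulating the variance in this way can be handy in various situations because it

involves fewer computational steps; the two sums can be accumulated in a single pass

over the data, without first computing the mean and subtracting it from every data point.

Note, however, that when the mean is large compared with the spread, the final subtraction

of two nearly equal numbers can lose floating point precision.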
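A minimal sketch of the rearranged form in standard Python, using the same example values as before. The running totals sumX and sumXX are names chosen here for illustration.

```python
values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]
n = float(len(values))

sumX = 0.0   # Running total of the values
sumXX = 0.0  # Running total of the squared values

for v in values:
    sumX += v
    sumXX += v * v

mean = sumX / n
biasedVar = sumXX / n - mean * mean  # Mean of squares minus square of mean

print('Var', biasedVar)  # Approximately 1.1736, matching the biased estimate
```

Because only two running totals are kept, this approach also works when the values arrive one at a time and are not stored.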

Given that the standard deviation is the square root of the variance it is useful because it

gives a measure of spread in the same units of measurement as the data. This is handy

when describing statistical samples, so, for example, the height of a population may be

described as the mean plus or minus the standard deviation: e.g. 1.777 ± 0.075 metres. The

standard deviation is trivial to obtain by taking the square root of the variance calculated

above, though there is also a std() function, available both in NumPy and as a method of

NumPy arrays. Again we set ddof=1 to base the result on the unbiased estimate of the

variance (although even in this case std(ddof=1) does not give a truly unbiased estimate

of the standard deviation, because the square root of an unbiased estimate is itself biased):

from numpy import std, sqrt

stdDev = sqrt(variance)

stdDev = std(valArray) # Biased estimate - 1.0833

stdDev = valArray.std(ddof=1) # "Unbiased" estimate - 1.1362

print('Std:', stdDev)

Related to the standard deviation is a value called the standard error of the mean

(SEM). Given that the mean of a sample is only an estimation of the underlying mean

there will naturally be some variation in its calculation. The SEM represents the standard

deviation in the sample mean that results from different samplings of the underlying

probability distribution. Scientifically it can be important to acknowledge that the sample

mean is an estimate, and when supporting theories with a mean value it is often helpful to

show the SEM, for example, on a graph, to indicate the confidence in the argument. The

standard error of the mean is the standard deviation in x (s

) divided by the square root of

the number of values:

This may be calculated in Python from the standard deviation and also by using a function

from scipy.stats:

stdErrMean = valArray.std(ddof=1)/sqrt(len(valArray))

from scipy.stats import sem

stdErrMean = sem(valArray, ddof=1) # Result is 0.3426
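To illustrate what the SEM represents, the following sketch repeatedly samples a known distribution and measures the spread of the resulting sample means. The choice of a normal distribution with σ = 1, the sample size and the number of repeats are all assumptions made for this example.

```python
import math
import random
import statistics

random.seed(1)  # Fixed seed only so that the sketch is repeatable

n = 25             # Number of values in each sample (assumed)
numSamples = 2000  # Number of repeated samplings (assumed)

# Collect the mean of each sample drawn from a normal distribution
# with true mean 0.0 and true standard deviation 1.0
sampleMeans = []
for i in range(numSamples):
    sample = [random.gauss(0.0, 1.0) for j in range(n)]
    sampleMeans.append(statistics.mean(sample))

spreadOfMeans = statistics.stdev(sampleMeans)  # Observed spread in the means
expected = 1.0 / math.sqrt(n)                  # sigma / sqrt(n) = 0.2

print('Observed:', spreadOfMeans, 'Expected:', expected)
```

The observed spread of the sample means should lie close to σ/√n, which is why a single sample's s/√n serves as an estimate of the uncertainty in its mean.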

The skewness of a distribution is a measure of asymmetry or lopsidedness. Though

perhaps not as commonly used as the other parameters, estimating the skewness can be a

useful test if you believe the underlying probability distribution ought to be symmetric.

The skewness is commonly estimated for a sample as the mean cubed difference of the

data from the mean divided by the standard deviation cubed:

$$\mathrm{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3}$$

We can illustrate this in Python using a random sample drawn from the asymmetric

gamma distribution:

from scipy.stats import skew

from numpy import random

samples = random.gamma(3.0, 2.0, 100) # Example data

skewness = skew(samples)

print( 'Skew', skewness ) # Result depends on random sample
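The skewness estimate described above can also be written out in standard Python. This is a sketch under one stated assumption: the standard deviation in the denominator is the biased one (dividing by n), which we believe corresponds to the default behaviour of scipy.stats.skew. The function name sampleSkew is ours.

```python
import math
import random

def sampleSkew(data):
    """Mean cubed difference from the mean over the standard deviation cubed."""
    n = float(len(data))
    mean = sum(data) / n
    diffs = [x - mean for x in data]
    m2 = sum(d * d for d in diffs) / n   # Biased variance (assumed divisor n)
    m3 = sum(d ** 3 for d in diffs) / n  # Mean cubed difference
    return m3 / math.pow(m2, 1.5)

random.seed(7)  # Fixed seed only so that the sketch is repeatable
samples = [random.gammavariate(3.0, 2.0) for i in range(20000)]

print('Symmetric data', sampleSkew([1, 2, 3, 4, 5]))  # Zero: no lopsidedness
print('Gamma sample', sampleSkew(samples))            # Positive: long right tail
```

For a gamma distribution with shape parameter 3.0 the theoretical skewness is 2/√3, roughly 1.15, so a large sample should give a value in that region.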

Alternatively, as a very rough measure of skew, the non-parametric skew is easy to

calculate as the difference between the mean and the median divided by the standard

deviation, with the general idea being that the mean and the median will be the same when

the distribution is symmetric.
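The non-parametric skew is simple enough to sketch with the standard library's statistics module. The function name nonParametricSkew is hypothetical, and using the sample (n-1) standard deviation is an assumption; for a rough measure the choice of divisor matters little.

```python
import random
import statistics

def nonParametricSkew(data):
    """Difference between mean and median, in units of standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    stdDev = statistics.stdev(data)  # Sample standard deviation (assumed n-1 form)
    return (mean - median) / stdDev

random.seed(7)  # Fixed seed only so that the sketch is repeatable
samples = [random.gammavariate(3.0, 2.0) for i in range(5000)]

print('Symmetric data', nonParametricSkew([1, 2, 3, 4, 5]))  # Mean equals median
print('Gamma sample', nonParametricSkew(samples))            # Mean exceeds median
```

A positive result indicates that the mean has been pulled above the median by a long right-hand tail, as is the case for the gamma distribution.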
