The standard deviation and variance (the square of the standard deviation) are measures of
the spread in the range of values. These can be calculated for any given sample of values.
However, and in a similar manner to the mean, the parameters calculated for a given set of
samples are only an estimate for the underlying probability distribution. Here we label the
variance of a variable x as σ_x².
Mathematically it is the expectation of the squared differences from the mean. For a
sample it is calculated as the sum of the squared differences from the mean, divided by the
number of values (or by n − 1 for the unbiased estimate):

\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

We can calculate the variance in standard Python if we need to:
values = [1,2,2,3,2,1,4,2,3,1,0]
n = float(len(values))
mean = sum(values)/n
diffs = [v-mean for v in values]
variance = sum([d*d for d in diffs])/(n-1) # Unbiased estimate
Although, as you might expect, there is a handy var() function in NumPy, which is also
built into array objects. As with the mean function, we can also specify an axis to get
variances across the rows and columns of multi-dimensional arrays. It should be noted that
var() takes a ddof argument, which should be set to 1 for the unbiased estimate; the
default value is zero, which gives the biased estimate.
from numpy import array
valArray = array(values)
variance = valArray.var() # Biased estimate
print('Var 1', variance) # Result is 1.1736
variance = valArray.var(ddof=1) # Unbiased estimate
print('Var 2', variance) # Result is 1.2909
The biased variance equation can be rearranged as follows, as the mean (the
expectation) of the squared values minus the square of the mean:

\sigma_x^2 = \langle x^2 \rangle - \langle x \rangle^2
Formulating the variance in this way can be handy in various situations because it
involves fewer computational steps; we don’t need to do a subtraction for every data point
(which additionally may incur floating point errors).
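As a quick check, here is a minimal sketch, using the values list from above, confirming that the two formulations of the biased variance agree:

```python
values = [1, 2, 2, 3, 2, 1, 4, 2, 3, 1, 0]
n = float(len(values))

mean = sum(values) / n
meanOfSquares = sum(v * v for v in values) / n

# Mean of the squared differences from the mean (biased estimate)
varianceA = sum((v - mean) ** 2 for v in values) / n

# Mean of the squares minus the square of the mean
varianceB = meanOfSquares - mean * mean

print(varianceA, varianceB)  # Both give 1.1736 (to four decimal places)
```

The second form needs only running totals of the values and their squares, which is why it involves fewer computational steps.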
Given that the standard deviation is the square root of the variance, it is useful because it
gives a measure of spread in the same units of measurement as the data. This is handy
when describing statistical samples, so, for example, the height of a population may be
described as the mean plus or minus the standard deviation: e.g. 1.777 ± 0.075 metres. The
standard deviation is trivial to obtain using a square root operation and the above
equations for variance, though there is also a handy std() function inbuilt into
NumPy arrays, noting again that we set ddof=1 to use the unbiased estimate of the
variance (although even in this case, std(ddof=1) does not give a truly unbiased estimate
of the standard deviation):
from numpy import std, sqrt
stdDev = sqrt(variance)
stdDev = std(valArray) # Biased estimate - 1.0833
stdDev = valArray.std(ddof=1) # "Unbiased" estimate - 1.1362
print('Std:', stdDev)
Related to the standard deviation is a value called the standard error of the mean
(SEM). Given that the mean of a sample is only an estimation of the underlying mean
there will naturally be some variation in its calculation. The SEM represents the standard
deviation in the sample mean that results from different samplings of the underlying
probability distribution. Scientifically it can be important to acknowledge that the sample
mean is an estimate, and when supporting theories with a mean value it is often helpful to
show the SEM, for example, on a graph, to indicate the confidence in the argument. The
standard error of the mean is the standard deviation in x (s_x) divided by the square root of
the number of values:

\mathrm{SEM} = \frac{s_x}{\sqrt{n}}
This may be calculated in Python from the standard deviation and also by using a function
from scipy.stats:
stdErrMean = valArray.std(ddof=1)/sqrt(len(valArray))
from scipy.stats import sem
stdErrMean = sem(valArray, ddof=1) # Result is 0.3426
The skewness of a distribution is a measure of asymmetry or lopsidedness. Though
perhaps not as commonly used as the other parameters, estimating the skewness can be a
useful test if you believe the underlying probability distribution ought to be symmetric.
The skewness is commonly estimated for a sample as the mean cubed difference of the
data from the mean divided by the standard deviation cubed:

\mathrm{skew} = \frac{1}{n \, \sigma_x^3} \sum_{i=1}^{n} (x_i - \bar{x})^3
We can illustrate this in Python using a random sample drawn from the asymmetric
gamma distribution.
from scipy.stats import skew
from numpy import random
samples = random.gamma(3.0, 2.0, 100) # Example data
skewness = skew(samples)
print('Skew', skewness) # Result depends on random sample
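For comparison, a minimal sketch of the same calculation done directly with NumPy, which should match the (biased) estimate that scipy.stats.skew gives by default:

```python
from numpy import random, mean, std

samples = random.gamma(3.0, 2.0, 100)  # Example data

m = samples.mean()
s = samples.std()  # ddof=0, matching the biased default of skew()

# Mean cubed difference from the mean, divided by the
# cube of the (biased) standard deviation
manualSkew = mean((samples - m) ** 3) / s ** 3

print('Manual skew', manualSkew)  # Result depends on random sample
```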
Alternatively, as a very rough measure of skew, the non-parametric skew is easy to
calculate as the difference between the mean and the median divided by the standard
deviation, with the general idea being that the mean and the median will be the same when
the distribution is symmetric.
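As a sketch, the non-parametric skew for a gamma-distributed sample like the one above could be estimated as follows; for a right-skewed distribution the mean exceeds the median, giving a positive value:

```python
from numpy import random, median, std

samples = random.gamma(3.0, 2.0, 100)  # Example data

# Non-parametric skew: (mean - median) / standard deviation;
# this is zero when the distribution is symmetric
npSkew = (samples.mean() - median(samples)) / std(samples, ddof=1)

print('Non-parametric skew', npSkew)  # Result depends on random sample
```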