For the last part of this chapter we move from studying distributions of one type of
measurement to the comparison of two different types, each with a different random
variable. We can imagine the random variables to correspond to different dimensions or
axes of the data. One approach here might be to apply statistical tests to a
two-dimensional, joint probability distribution, employing the methods already discussed.
However, we are often interested in the simpler question of whether the two measurements
vary together in a concerted way. In other words, if the value of one measurement
increases we would like to know whether the other measurement also increases, decreases
or stays the same overall. This is what we call correlation. Naturally, this is also
subject to significance testing, because the variation associated with sampling of the
probability distributions impinges on our estimates: for a finite number of samples we
may observe an apparent correlation and need to know the likelihood that it arose by
chance.
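As a preview of such a test, the sketch below estimates how likely an apparent
correlation is to arise by chance; the use of SciPy's scipy.stats.pearsonr here is our
assumption for illustration, not a method prescribed by this section. It returns a
correlation coefficient together with a two-tailed p-value:

from numpy import random
from scipy.stats import pearsonr

xVals = random.normal(0.0, 1.0, 100)
yVals = random.normal(0.0, 1.0, 100)   # Independent of xVals

corr, pValue = pearsonr(xVals, yVals)
# For independent samples we expect corr near zero and a large
# p-value, i.e. no evidence that any apparent correlation is real
print('Correlation: %.3f  p-value: %.3f' % (corr, pValue))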
Covariance is a measure of whether two random variables vary simultaneously as their
values increase or decrease. The covariance is calculated by subtracting the means of the
random variables, so they are effectively
centred on zero, and then finding the average
product of the two coordinates. Hence for two probability distributions, described by
random variables X and Y with sample points x_i and y_i respectively, the covariance
may be written as:

\mathrm{cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu_X\right)\left(y_i - \mu_Y\right)

where \mu_X and \mu_Y are the means of X and Y and N is the number of sample points.
(Note that numpy.cov, used below, applies the sample normalisation N-1 by default,
rather than N.)
The idea is that if there is a correlation then the deviations along both axes will
tend to lie on the same side of their respective means, giving consistently positive
products (or, for anti-correlation, on opposite sides, giving consistently negative
products). If there is no correlation the products will be both positive and negative,
averaging towards zero.
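To make the formula concrete, the following minimal sketch computes the covariance
directly from the definition above; the small example arrays are invented purely for
illustration:

from numpy import array

xVals = array([1.0, 2.0, 3.0, 4.0, 5.0])
yVals = array([1.2, 1.9, 3.1, 4.2, 4.8])   # Roughly tracks xVals
n = len(xVals)

# Centre each variable on zero by subtracting its mean, then
# average the products (with the N-1 sample normalisation)
dx = xVals - xVals.mean()
dy = yVals - yVals.mean()
covXY = (dx * dy).sum() / (n - 1)
print(covXY)   # Positive: the two variables increase together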
In Python the handy numpy.cov() function does this work for us. Here we illustrate with
two test combinations for random xVals: yVals1 is completely random and yVals2 is
derived from xVals by adding an offset, a gradient and small random deviations:
from numpy import random, cov
xVals = random.normal(0.0, 1.0, 100)
yVals1 = random.normal(0.0, 1.0, 100) # Random, independent of xVals
deltas = random.normal(0.0, 0.75, 100)
yVals2 = 0.5 + 2.0 * (xVals - deltas) # Derived from xVals
cov1 = cov(xVals, yVals1)
print('Cov 1:', cov1)
# The exact values below depend on the random numbers
# Cov 1: [[0.848, 0.022]
#         [0.022, 1.048]]
cov2 = cov(xVals, yVals2)
print('Cov 2:', cov2)
# Cov 2: [[0.848, 1.809]
#         [1.809, 5.819]]
The result here is the covariance matrix, rather than just a single value. This is just
a generalisation of the process: if you pass in several arrays it will give back a
matrix of the covariances for all possible pairs. Hence for our two input arrays we get
a 2 × 2 matrix, i.e.

\begin{pmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) \end{pmatrix}

so the diagonal is simply the variances of X and Y and the other, symmetric values are
the covariance we generally want. Here the interesting covariances are 0.022 and 1.809
for yVals1 and yVals2 respectively.
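Continuing from the example above, a single covariance value can be picked out of the
returned matrix by indexing an off-diagonal element; a brief sketch:

covXY = cov(xVals, yVals2)[0, 1]   # Same as [1, 0]; the matrix is symmetric
print(covXY)   # e.g. 1.809 for the run shown above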