Figure 22.7. Pearson’s correlation coefficient values (r) for a variety of different data samples. The coefficient represents the degree of linear covariance in the two quantities and is scaled so that the value lies between −1 (negative correlation) and +1 (positive correlation). Values near zero indicate that the quantities are not linearly correlated, although there may be other patterns or forms of non-linear relationship, which would not be exposed by this test.
The correlation coefficient is readily calculated in Python using the numpy.corrcoef() function and, as with the covariance function, we get back a matrix of values for all pairs of inputs. Testing on the previously used values we get:
from numpy import corrcoef
r1 = corrcoef(xVals, yVals1)[0, 1] # Result is: 0.0231
r2 = corrcoef(xVals, yVals2)[0, 1] # Result is: 0.8145
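To make the matrix structure explicit, the following minimal sketch (using small illustrative arrays, not the data above) shows that corrcoef() returns a symmetric matrix with ones on the diagonal:
from numpy import array, corrcoef
x = array([1.0, 2.0, 3.0, 4.0])  # illustrative values only
y = array([1.2, 1.9, 3.3, 3.8])
rMatrix = corrcoef(x, y)
# Diagonal elements are 1.0 (each array with itself); the off-diagonal
# elements rMatrix[0,1] and rMatrix[1,0] are equal and give the x-y coefficient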
Hence we can see that xVals has almost no correlation with yVals1 but a large positive correlation (0.8145) with yVals2, as we might expect. If we wished, we could naturally also derive the correlation coefficient from the previously calculated covariance matrix, remembering to use the unbiased sample standard deviation (ddof=1):
from numpy import std
covXY = cov2[0,1]                # extract the X-Y element of the covariance matrix
stdDevX = std(xVals, ddof=1)     # unbiased sample standard deviations
stdDevY = std(yVals2, ddof=1)
r2 = covXY / (stdDevX*stdDevY)   # Result is: 0.8145
Although the correlation coefficient is insensitive to different sample means and
variances for the quantities, it should not be forgotten that it is only a test of a linear
relationship. There may be a distinct non-random, non-linear relationship between the
quantities which will not be picked up by the test, although in some instances it is possible
to transform a quantity (e.g. by taking a logarithm) so that the relationship becomes linear.
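As a minimal sketch of such a transformation, using synthetic, illustrative data (not the samples above) in which y grows exponentially with x, the raw correlation coefficient is noticeably below 1.0, whereas after taking logarithms the relationship is exactly linear:
from numpy import arange, exp, log, corrcoef
xData = arange(0.0, 11.0)   # illustrative values only
yData = exp(xData)          # y is a non-linear (exponential) function of x
rRaw = corrcoef(xData, yData)[0, 1]       # noticeably below 1.0
rLog = corrcoef(xData, log(yData))[0, 1]  # effectively 1.0: log(y) is linear in x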
We can subject the correlation coefficient to significance tests if we consider an uncorrelated null hypothesis, i.e. where the underlying correlation coefficient is 0. The basic idea here is that even if distributions are really uncorrelated they can appear to be correlated (the points coincidentally lie close to a line), especially if the size of a sample is small. If the underlying distributions are normal then it can be shown that the null hypothesis can be rejected at the 0.95 confidence level if the test statistic

$t = r\sqrt{\dfrac{n-2}{1-r^2}}$

is larger than the corresponding T-distribution percent point function with confidence level 0.975 (because our test is two-tailed) and n−2 degrees of freedom. Here n is the number of sample points in each of X and Y. We can invert the above function and solve for the critical correlation coefficient r as a function of n:

$r = \dfrac{t}{\sqrt{n-2+t^2}}$
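As a minimal sketch of applying this test for a single sample size (assuming xVals and yVals2 are the arrays used earlier), we can compute the statistic for r2 and compare it with the critical value:
from numpy import sqrt
from scipy.stats import t
n = len(xVals)                             # number of sample points
tStat = r2 * sqrt((n-2) / (1.0 - r2*r2))   # test statistic from the equation above
tCrit = t(n-2).ppf(0.975)                  # two-tailed critical value at 0.95 confidence
rejectNull = tStat > tCrit                 # True if the correlation is significant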
Accordingly we can plot this critical correlation coefficient as a function of the sample size, as illustrated in Figure 22.8. If r is larger than this critical value then the null hypothesis is rejected.
This is readily done in Python using the .ppf() function of the scipy.stats.t distribution
object and applying the above equation:
from numpy import sqrt
from scipy.stats import t
from matplotlib import pyplot
nVals = range(5, 101)
rVals = []
for n in nVals:
    tVal = t(n-2).ppf(0.975)        # two-tailed critical value at 0.95 confidence
    tVal2 = tVal * tVal
    rVal = sqrt(tVal2/(n-2+tVal2))  # critical correlation coefficient
    rVals.append(rVal)
pyplot.plot(nVals, rVals, color='black')
pyplot.show()
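Alternatively, the scipy.stats.pearsonr() function computes both the correlation coefficient and the corresponding two-tailed p-value in one step, so the significance test need not be coded by hand; a minimal sketch, assuming xVals and yVals2 are the arrays used earlier:
from scipy.stats import pearsonr
# The null hypothesis is rejected at the 0.95 confidence level if pValue < 0.05
rValue, pValue = pearsonr(xVals, yVals2)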