For the last part of this chapter we move from studying distributions of one type of
measurement to the comparison of two different types, each with a different random
variable. We can imagine the random variables to correspond to different dimensions or
axes of the data. One approach here might be to apply statistical tests to a
two-dimensional, joint probability distribution, employing the methods already discussed.
However, we are often interested in the simpler question of whether the two measurements
vary together in a concerted way. In other words, if the value of one measurement
increases we would like to know whether the other measurement also increases, decreases
or stays the same overall. This is what we call correlation. Naturally, this is also
subject to significance testing, because the variation associated with sampling of the
probability distributions impinges on our estimates: for a finite number of samples we
may observe an apparent correlation and need to know the likelihood that it arose by
chance.
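As a preview of such a test, the sketch below estimates how likely an apparent
correlation is to arise by chance; the use of SciPy's scipy.stats.pearsonr here is our
assumption for illustration, not a method prescribed by this section. It returns a
correlation coefficient together with a two-tailed p-value:

from numpy import random
from scipy.stats import pearsonr

xVals = random.normal(0.0, 1.0, 100)
yVals = random.normal(0.0, 1.0, 100)   # Independent of xVals

corr, pValue = pearsonr(xVals, yVals)
# For independent samples we expect corr near zero and a large
# p-value, i.e. no evidence that any apparent correlation is real
print('Correlation: %.3f  p-value: %.3f' % (corr, pValue))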
Covariance is a measure of whether two random variables vary simultaneously as their
values increase or decrease. The covariance is calculated by subtracting the means of the
random variables, so they are effectively
centred on zero, and then finding the average
product of the two coordinates. Hence for two probability distributions, described by
random variables X and Y with sample points x_i and y_i respectively, the covariance
may be written as:

\mathrm{cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu_X\right)\left(y_i - \mu_Y\right)

where \mu_X and \mu_Y are the means of X and Y and N is the number of sample points.
(Note that numpy.cov, used below, applies the sample normalisation N-1 by default,
rather than N.)
The idea is that if there is a correlation then the deviations along both axes will
tend to lie on the same side of their respective means, giving consistently positive
products (or, for anti-correlation, on opposite sides, giving consistently negative
products). If there is no correlation the products will be both positive and negative,
averaging towards zero.
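To make the formula concrete, the following minimal sketch computes the covariance
directly from the definition above; the small example arrays are invented purely for
illustration:

from numpy import array

xVals = array([1.0, 2.0, 3.0, 4.0, 5.0])
yVals = array([1.2, 1.9, 3.1, 4.2, 4.8])   # Roughly tracks xVals
n = len(xVals)

# Centre each variable on zero by subtracting its mean, then
# average the products (with the N-1 sample normalisation)
dx = xVals - xVals.mean()
dy = yVals - yVals.mean()
covXY = (dx * dy).sum() / (n - 1)
print(covXY)   # Positive: the two variables increase together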
In Python the handy numpy.cov() function does this work for us. Here we illustrate with
two test combinations for random xVals: yVals1 is completely random and yVals2 is
derived from xVals by adding an offset, a gradient and small random deviations:
from numpy import random, cov
xVals = random.normal(0.0, 1.0, 100)
yVals1 = random.normal(0.0, 1.0, 100) # Random, independent of xVals
deltas = random.normal(0.0, 0.75, 100)
yVals2 = 0.5 + 2.0 * (xVals - deltas) # Derived from xVals
cov1 = cov(xVals, yVals1)
print('Cov 1:', cov1)
# The exact values below depend on the random numbers
# Cov 1: [[0.848, 0.022]
#         [0.022, 1.048]]
cov2 = cov(xVals, yVals2)
print('Cov 2:', cov2)
# Cov 2: [[0.848, 1.809]
#         [1.809, 5.819]]
The result here is the covariance matrix, rather than just a single value. This is just
a generalisation of the process: if you pass in several arrays it will give back a
matrix of the covariances for all possible pairs. Hence for our two input arrays we get
a 2 × 2 matrix, i.e.

\begin{pmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) \end{pmatrix}

so the diagonal is simply the variances of X and Y and the other, symmetric values are
the covariance we generally want. Here the interesting covariances are 0.022 and 1.809
for yVals1 and yVals2 respectively.
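Continuing from the example above, a single covariance value can be picked out of the
returned matrix by indexing an off-diagonal element; a brief sketch:

covXY = cov(xVals, yVals2)[0, 1]   # Same as [1, 0]; the matrix is symmetric
print(covXY)   # e.g. 1.809 for the run shown above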