Principal component analysis
Principal component analysis (PCA) is a relatively simple but widely used technique to
extract the innate trends within data. The principal components of a data set are the vectors (in the same feature space as the data) along which the data items show the greatest variance, i.e. the directions that best separate the data. Mathematically, the principal components are the eigenvectors of the covariance matrix of the data. Sorting these by the size of their eigenvalues identifies the most significant components, which account for most of the variance. Taking fewer
principal components than the number of dimensions present in the input data set allows
for a lower-dimensionality representation of the data (by projecting it onto these
directions) that still contains the important correlations.
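To make this concrete (the notation here is ours, chosen only to match the code below, and does not appear in the original text): if $X_{c}$ is the mean-centred data matrix with one sample per row and $N$ samples in total, then

$$C = \frac{1}{N-1}\, X_{c}^{\mathsf{T}} X_{c}, \qquad C\,\mathbf{v}_{i} = \lambda_{i}\,\mathbf{v}_{i}, \qquad Y = X_{c}\, W_{n},$$

where $C$ is the covariance matrix of the features, $\mathbf{v}_{i}$ and $\lambda_{i}$ are its eigenvectors and eigenvalues, $W_{n}$ has the $n$ eigenvectors with the largest eigenvalues as its columns, and $Y$ is the lower-dimensional projection of the data.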
When we calculate the principal components of a data set we obtain vectors, directions
in the data, which are orthogonal (perpendicular) to one another. Thus each component
vector represents an independent axis and there can be as many components as there are
data dimensions. Some axes are more significant than others for separating the data, and those are the ones we are usually interested in. We can also consider the principal
component vectors as a transformation, which we can apply to our data: to orient it and
stretch it along these orthogonal axes.
Principal component analysis does have limitations, which the programmer should be
aware of, but it is quick and easy, so often worth a try. A classic example where PCA fails
is for ‘checkerboard’ data, i.e. alternating squares of two categories, where there are no simple axes in the data that separate the categories. In such instances more sophisticated, non-linear techniques, such as support vector machines (see Chapter 24), may be used.
In the Python example for PCA we first make the NumPy imports and define the
function which takes data (as an array of arrays) and the number of principal components
we wish to extract.
from numpy import cov, linalg, sqrt, zeros, ones, diag
def principalComponentAnalysis(data, n=2):
First we get the size of the data, in terms of the number of data items (samples) and the
number of dimensions (features).
  samples, features = data.shape
We calculate the average data item (effectively the centre of the data) by finding the mean along the sample axis (axis=0). We then centre the input data on zero (on each feature axis) by subtracting this average. Note that we then transpose the data with .T, turning it sideways, so that we can calculate the covariance between its features.
  meanVec = data.mean(axis=0)
  dataC = (data - meanVec).T
The covariance is estimated using the cov() function and the eigenvalues and
eigenvectors are extracted with the linalg.eig() function. Here the inbuilt NumPy functions
for dealing with arrays and linear algebra really show their value.
  covar = cov(dataC)
  evals, evecs = linalg.eig(covar)
The resulting eigenvalues, which represent the scaling factors along the eigenvectors,
are sorted by size. Here we use the .argsort() function, which gives the indices of the array
in the order in which the values increase. We then use the cunning trick [::-1] for reversing
NumPy arrays.
The eigenvectors are then reordered by using these indices, to put them
in order of decreasing eigenvalue, i.e. most significant component first.
  indices = evals.argsort()[::-1]
  evecs = evecs[:,indices]
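As a small aside, the sorting trick can be illustrated with a made-up three-value array (the values here are chosen purely for demonstration):

from numpy import array

vals = array([0.5, 2.1, 1.3])
print(vals.argsort())        # [0 2 1] : indices that would sort the values in increasing order
print(vals.argsort()[::-1])  # [1 2 0] : reversed, so the index of the largest value comes first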
We can then take the top eigenvectors as the first n principal components, which we call basis because these are the directions onto which we can project our data. The energy is simply the sum of the top n eigenvalues, a measure of how much of the variance the chosen components explain, which is useful for deciding whether more principal components should be considered.
  basis = evecs[:,:n]
  energy = evals[:n].sum()

  # Optionally normalise with respect to the standard deviations to give z-scores
  #sd = sqrt(diag(covar))
  #zscores = dataC.T / sd

  return basis, energy
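As a minimal usage sketch, which is not part of the original text: the data below is randomly generated purely for illustration, and the variable names (testData, centred, projected) are our own.

from numpy import dot, random

# 100 samples of two correlated features, invented for demonstration
testData = random.multivariate_normal([0.0, 0.0], [[2.0, 1.6], [1.6, 2.0]], 100)

basis, energy = principalComponentAnalysis(testData, n=1)

# Project the mean-centred data onto the principal component direction(s)
centred = testData - testData.mean(axis=0)
projected = dot(centred, basis)

print(projected.shape)   # (100, 1): one coordinate per sample along the first component

Using n=2 instead would keep both directions, and the same dot product would simply rotate the two-dimensional data onto its principal axes: the orientation and stretching transformation described above.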