Python Programming for Biology: Bioinformatics and Beyond



Principal component analysis

Principal component analysis (PCA) is a relatively simple but widely used technique to extract the innate trends within data. The principal components of a data set are the vectors (in the same feature space as the data) that give the best separation of the data items in terms of their covariances. Mathematically, PCA gives the eigenvectors of the covariance matrix. Taking the eigenvalues of these in size order, we can find the most significant principal components, which account for most of the variance in the data. Taking fewer principal components than the number of dimensions present in the input data set allows for a lower-dimensionality representation of the data (by projecting it onto these directions) that still contains the important correlations.
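In symbols (a brief sketch using standard notation, not part of the original text): if X is the mean-centred data matrix, with N samples as rows and F features as columns, the covariance matrix, its eigendecomposition and the n-dimensional projection are

C = \frac{1}{N-1} X^{\mathsf{T}} X, \qquad C \mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_F \ge 0

Y = X W_n, \qquad W_n = [\mathbf{v}_1 \; \cdots \; \mathbf{v}_n]

where the columns of W_n are the n eigenvectors with the largest eigenvalues.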

When we calculate the principal components of a data set we obtain vectors, directions in the data, which are orthogonal (perpendicular) to one another. Thus each component vector represents an independent axis, and there can be as many components as there are data dimensions. Because some axes are more significant than others for separating the data, those are the ones we are usually interested in. We can also consider the principal component vectors as a transformation, which we can apply to our data to orient it and stretch it along these orthogonal axes.

Principal component analysis does have limitations, which the programmer should be aware of, but it is quick and easy, so often worth a try. A classic example where PCA fails is for ‘checkerboard’ data, i.e. alternating squares of two categories, where there are no simple axes in the data that separate the categories. In such instances more sophisticated, non-linear techniques, such as support vector machines (see Chapter 24), may be used.

In the Python example for PCA we first make the NumPy imports and define the function, which takes data (as an array of arrays) and the number of principal components we wish to extract.

from numpy import cov, linalg, sqrt, zeros, ones, diag

def principalComponentAnalysis(data, n=2):

First we get the size of the data, in terms of the number of data items (samples) and the number of dimensions (features).

  samples, features = data.shape

We calculate the average data item (effectively the centre of the data) by finding the mean along the primary data axis. We then centre the input data on zero (on the feature axes) by taking away this average. Note that we then transpose the data with .T, turning it sideways, so we can calculate the covariance in its dimensions.

  meanVec = data.mean(axis=0)
  dataC = (data - meanVec).T

The covariance is estimated using the cov() function and the eigenvalues and eigenvectors are extracted with the linalg.eig() function. Here the inbuilt NumPy functions for dealing with arrays and linear algebra really show their value.

  covar = cov(dataC)
  evals, evecs = linalg.eig(covar)

The resulting eigenvalues, which represent the scaling factors along the eigenvectors, are sorted by size. Here we use the .argsort() function, which gives the indices of the array in the order in which the values increase. We then use the cunning trick [::-1] for reversing NumPy arrays. The eigenvectors are then reordered by using these indices, to put them in order of decreasing eigenvalue, i.e. most significant component first.

  indices = evals.argsort()[::-1]
  evecs = evecs[:,indices]
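To make the reordering trick concrete, the following small aside (not from the original text, with made-up values) shows what argsort() and the [::-1] reversal produce:

from numpy import array

vals = array([0.5, 3.2, 1.1])       # hypothetical eigenvalues
print(vals.argsort())               # [0 2 1] - indices that sort the values, smallest first
print(vals.argsort()[::-1])         # [1 2 0] - the same indices reversed, largest first
print(vals[vals.argsort()[::-1]])   # 3.2, 1.1, 0.5 - the values in decreasing order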

We can then take the top eigenvectors as the first n principal components, which we call basis, because these are the directions that we can use to map our data to. The energy is simply a measure of how much covariance our top eigenvalues explain, which is useful for detecting whether more principal components should be considered.

  basis = evecs[:,:n]
  energy = evals[:n].sum()
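As an aside (not part of the original code), the energy is easier to interpret as a fraction of the total variance, because the eigenvalues of the covariance matrix sum to the total variance of the data; a hypothetical one-line variant inside the function would be:

  fracVariance = evals[:n].sum() / evals.sum()  # proportion of the variance explained by the top n components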

  # norm wrt to variance
  #sd = sqrt(diag(covar))
  #zscores = dataC.T / sd

  return basis, energy
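As a brief usage sketch (not in the original text; the random test data and variable names are illustrative), the function could be applied to a small data set and the centred data projected onto the returned basis:

from numpy import dot
from numpy.random import normal

data = normal(0.0, 1.0, (100, 5))       # 100 samples, each with 5 features
basis, energy = principalComponentAnalysis(data, n=2)

meanVec = data.mean(axis=0)
projected = dot(data - meanVec, basis)  # map the centred data onto the two components
print(projected.shape)                  # (100, 2)
print(energy)                           # covariance captured by the top two eigenvalues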

