Python Programming for Biology: Bioinformatics and Beyond


Extracting a principal component



Download 7,75 Mb.
Pdf ko'rish
bet383/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   379   380   381   382   383   384   385   386   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Extracting a principal component

As the calculation of all of the principal components of a data set is often not required, e.g.

we  may  be  only  interested  in  the  first  two  principal  components,  the  next  example

function  can  more  efficiently  extract  the  principal  components  one  at  a  time  (most




significant  first)  without  calculating  a  full  covariance  matrix.  Each  time  a  component  is

extracted  the  data  may  be  transformed,  eliminating  that  component,  so  the  next  most

important component can be extracted. This is an iterative method that converges on the

principal component, hence we pass in a precision value to state how long we iteratively

cycle to improve the convergence.

from numpy import random, dot, array, outer

def extractPrincipalComponent(data, precision=1e-9):

As  before,  we  extract  the  number  of  samples  (data  items)  and  features  from  the  input

data. We then calculate the mean (centre) of the data so that it can be moved to the zero

point.


samples, features = data.shape

meanVec = data.mean(axis=0)

dataC = data - meanVec

The initial guess at the principal component pc1 is constructed as a random array, with

the same length as the number of features. We also initialise pc0, which will be the value

from the previous cycle, so that we can check for convergence. Initially pc0 is arbitrarily

set to be different to pc1.

pc1 = random.random(features)

pc0 = pc1 - 1.0

The main optimisation involves a while loop that checks the two subsequent estimates

for  the  principal  components.  If  their  difference,  summed  up  over  all  dimensions,  is

sufficiently close (within the precision value) the loop will stop.

while abs((pc0-pc1).sum()) > precision:

Inside  the  loop  the  principal  component  vector  is  improved  upon  for  the  next  cycle.

This involves going through all the items in the (centred) data, scaling each according to

the  projection  onto  the  guess  for  the  principal  component  (with  dot  product)  and  adding

them to a total vector, t. When normalised (scaled to length one) this total vector will be

the new estimate for the principal component. Here the dot product gives the coincidence

of  the  data  vectors  with  the  current  principal  component  estimate.  The  summation  will

give  an  average  vector  which  is  weighted  according  to  the  data  items  that  have  most

coincidence. The convergence occurs because the largest correlation in the data will have

a  biased  influence  on  the  weights,  increasing  the  size  of  t  each  cycle  until  there  is

maximum overlap with pc1.  At  this  point  pc1  represents  a  fundamental  axis  in  the  data,

along which most convergence occurs.

t = zeros(features)

for datum in dataC:

t += dot(datum, pc1) * datum

At the end of each cycle the previous principal component estimate is stored as pc0 and

the new estimate is found by scaling the t vector to length one.

pc0 = pc1




pc1 = t / sqrt(dot(t,t))

Once  there  is  little  improvement  in  the  principal  component  the  cycles  stop  and  the

estimate is returned.

return pc1

We will test both the full and quick PCA functions with the same data set. This will be a

random  data  set  of  100  points  in  two  dimensions,  which  is  transformed  by  a  shearing

matrix, in order to produce a visible correlation in the data. Naturally we are not limited to

only two-dimensional data, but this is easier to illustrate here.

testData = random.normal(0.0, 2.0, (100,2))

shear = array([[2,1],[1,0]])

testData = dot(testData, shear)

The PCA is performed by the two functions and the results are compared.

pc1 = extractPrincipalComponent(testData)

print('Quick PC1:', pc1)

basis, energy = principalComponentAnalysis(testData, n=2)

print('Full PCA:', basis, energy)

The  test  data  can  be  plotted  in  the  usual  way  and  we  can  visualise  the  first  principal

component (pc1)  by  drawing  a  line  along  its  direction.  We  make  a  line  from  pc1,  which

we can plot, by scaling the vector (by a factor of ten to make it visible) either side of the

centre.


from matplotlib import pyplot

x,y = zip(*testData)

pyplot.scatter(x, y, s=20, c='#F0F0F0', marker='o')

x,y = zip(-10*pc1, 10*pc1)

pyplot.plot(x, y)

To illustrate that the basis matrix transforms the data in the right way we can apply it to

the test data, so that the principal component axes are aligned with the axes of our graph.

transformed = dot(testData, basis)

x,y = zip(*transformed)

pyplot.scatter(x, y, s=10, c='#000000', marker='^')

pyplot.show()


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   379   380   381   382   383   384   385   386   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish