Python Programming for Biology: Bioinformatics and Beyond



Figure 23.7. Principal component analysis of 2D data and its projection onto a principal axis. As illustrated for a two-dimensional example, the first principal component (PC 1) is the single linear combination of features (i.e. a direction relative to the axes) that explains most of the variance in the data. Subsequent principal components represent orthogonal directions (i.e. at right angles to the other components) that maximise the remaining variance not covered by earlier principal components, although for this two-dimensional example PC 2 is fully determined by PC 1, since only one orthogonal direction remains.

Projecting a data set onto its most important principal component axes allows the dimensionality of the data set to be reduced, while still preserving as much of the linear correlation as possible. This can be useful for visualisation and for reducing the complexity of high-dimensional data sets.
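
To make the projection concrete, the following is a minimal sketch of PCA via eigendecomposition of the covariance matrix, in the same NumPy style as the rest of this chapter. The example data and the choice to keep only the first component are illustrative assumptions, not code from the book.

from numpy import cov, dot, linalg, random

data = random.normal(0.0, 1.0, (100,2))  # illustrative 2D data
data[:,1] += 0.8 * data[:,0]             # induce a linear correlation

centred = data - data.mean(axis=0)          # centre on the mean
evals, evecs = linalg.eigh(cov(centred.T))  # eigendecomposition of covariance

order = evals.argsort()[::-1]  # sort components by decreasing variance
evecs = evecs[:,order]

projected = dot(centred, evecs[:,:1])  # 1D projection onto PC 1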

The LDA optimisation is achieved by finding the matrix (and hence orientation) that maximises the separation (scatter) between the data sets, relative to the separation within each data set. The scatter within is simply the weighted covariances of the data sets taken separately, and the scatter between is the difference between their means. In essence we scale the line between the two means of the data sets by the combined size of the data scatter for each dimension. The resultant matrix can be used to transform the data into a new orientation for easy discrimination.
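
In symbols, using the variable names of the code below, the function computes the within-class scatter matrix SW = nA*covA + nB*covB and the discriminating direction w = SW⁻¹(meanA - meanB); up to an overall scale factor this is the standard two-class Fisher discriminant.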

Firstly a Python function is defined which takes two data sets. These could be the results of a clustering operation or a previously known classification. The NumPy names used below (array, cov, dot, linalg and random) are imported first:

from numpy import array, cov, dot, linalg, random

def twoClassLda(dataA, dataB):

First we find the averages (centres) of the data sets, by summing along the major axes, i.e. adding the vectors together and dividing by the total number of points.

    meanA = dataA.mean(axis=0)
    meanB = dataB.mean(axis=0)

Then the cov() function is used to calculate the covariance matrix for each data set (the size of the correlation between the data dimensions).

    covA = cov(dataA.T)
    covB = cov(dataB.T)

Then the number of points in each data set, less one, is calculated. The subtraction is made because a single data point gives no scatter, so what counts is the number of points beyond the first one.

    nA = len(dataA) - 1.0
    nB = len(dataB) - 1.0

The scatter within each category is simply defined as the sum of the covariance matrices, weighted according to the sizes of the data sets. The scatter between categories is simply the separation between the data sets, i.e. the difference from one data centre to another.

    scatterWithin = nA * covA + nB * covB
    scatterBetween = meanA - meanB

The discrimination matrix between data sets is the line between centres (scatterBetween) divided by the scatter within the data, i.e. multiplied by the inverse of the within-scatter matrix, for each dimension.

    discrim = dot(linalg.inv(scatterWithin), scatterBetween)

The data sets are transformed using the discrimination matrix, reorienting them along the line of best separation. These are passed back at the end for inspection.

    transfA = dot(dataA, discrim.T)
    transfB = dot(dataB, discrim.T)

The best guess for the dividing point that separates the data sets is the average of the two data centres, projected onto the discriminating direction.

    divide = dot(discrim, (meanA + meanB)) / 2.0

    return transfA, transfB, divide
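
As a brief illustration of how such values could be used to classify a new observation, consider the hypothetical helper below; it is not part of the book's code, and it assumes the discrim vector is also kept (the function above returns only the transformed data and the dividing point, so discrim would have to be returned as well, or recomputed).

def classifyPoint(point, discrim, divide):
    # Project the query point onto the discriminating direction;
    # projections above the dividing value fall on the dataA side,
    # since discrim points from the centre of dataB towards dataA
    if dot(point, discrim) > divide:
        return 'A'
    else:
        return 'B'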



Here we test the LDA function with two normally distributed, random data sets. One has a small spread and is moved to the side (by adding array([-10.0, 5.0])) and the other has a wider spread. The two sets should overlap (intermingle).

testData1 = random.normal(0.0, 2.0, (100,2)) + array([-10.0, 5.0])
testData2 = random.normal(0.0, 6.0, (100,2))

The test sets can be visualised with matplotlib in the usual way:

from matplotlib import pyplot

x, y = zip(*testData1)
pyplot.scatter(x, y, s=25, c='#404040', marker='o')

x, y = zip(*testData2)
pyplot.scatter(x, y, s=25, c='#FFFFFF', marker='^')

pyplot.show()

Running the function on these data sets we get two arrays, proj1 and proj2, containing the transformed data, and a dividing value, div. These arrays are one-dimensional and represent the projection of our two data sets onto the discrimination line. We can plot them as points along a line, which here we separate for clarity with y-values at 0.5 and -0.5. The dividing value div can be used to draw a line showing where the LDA has estimated the best boundary between categories (assuming a symmetric, normal distribution).

proj1, proj2, div = twoClassLda(testData1, testData2)
print(div)

x = proj1
y = [0.5] * len(x)
pyplot.scatter(x, y, s=35, c='#404040', marker='o')

x = proj2
y = [-0.5] * len(x)
pyplot.scatter(x, y, s=35, c='#FFFFFF', marker='^')

pyplot.plot((div, div), (1.0, -1.0))
pyplot.show()
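
As a quick check, which is not part of the original example, the dividing value can also be used to count how many projected points fall on the expected side of the estimated boundary; note that testData1 corresponds to dataA, whose projections lie above divide.

nGood1 = sum(1 for v in proj1 if v > div)   # dataA side of the boundary
nGood2 = sum(1 for v in proj2 if v <= div)  # dataB side of the boundary
print('Correctly separated:', nGood1 + nGood2, 'of', len(proj1) + len(proj2))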




