Python Programming for Biology: Bioinformatics and Beyond



Tim J. Stevens, Wayne Boucher

The ‘jump’ method

Next we will consider the situation where you do not know in advance how many clusters the data should be grouped into, but do not wish to use a threshold-based method. We will consider an augmentation on top of the k-means method to try various numbers of clusters and determine the most meaningful. The example function uses what we will term the ‘jump’ method.⁷ It will perform k-means clustering with different values of k and then assess which of the steps between increasing values of k represents the best compromise between the number of clusters and the complexity of the solution.

For a given trial value of k, the result of a trial clustering attempt involves the calculation of what we term the ‘distortion’, which in essence is a measure of the average spread, from the cluster centres, of data points over all the clusters. This spread is adjusted for the covariance⁸ in the data, which indicates how much variation there is in the values for the various axes, such that the relative scale of the different features does not matter.

The jumpMethodCluster function is constructed to take a set of data (as an array of arrays), an optional range of k values (numbers of clusters) to test, and cycles, which determines how many times to repeat the clustering. This cyclic repetition is important because different clustering attempts, involving random numbers, may give variable results.


from numpy import array, cov, dot, linalg, zeros

def jumpMethodCluster(data, kRange=None, cycles=10):

Firstly, we extract the length of the data and the number of dimensions (features). This

is simply the size (.shape in NumPy) of the array. Then we define the range of k values if

none were passed in; by default the range is between two and the number of data items.

Note that the range is specified by an upper limit, which will not be included.

  n, dims = data.shape

  if kRange is None:
    start, limit = (2, n+1)
  else:
    start, limit = kRange

Next we define the power variable, which will be used to adjust the calculated distortion values according to the number of dimensions in the data (see the primary reference for an explanation of this). The distortions are collected in a dictionary, keyed by k values, so this is initially defined as blank. The invCovMat is a matrix representing the covariance in the dimensions of the input data. Here we can make use of NumPy’s cov and linalg.pinv functions to estimate the covariance and then invert the matrix.⁹

  power = dims/2.0
  distortions = {}
  invCovMat = linalg.pinv(cov(data.T))
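As a standalone illustration (the data values below are invented for this sketch), the covariance and its pseudo-inverse can be computed on a small two-feature data set; pinv is used rather than a plain matrix inverse because it copes even when the features are strongly correlated and the covariance matrix is close to singular:

```python
from numpy import array, cov
from numpy.linalg import pinv

# Five invented data items with two strongly correlated features
data = array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, 5.9],
              [4.0, 8.2],
              [5.0, 9.8]])

covMat = cov(data.T)       # cov expects variables in rows, hence the .T
invCovMat = pinv(covMat)   # pseudo-inverse tolerates near-singular matrices

print(covMat.shape)        # (2, 2): one row and column per feature
```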

Then a loop goes through all of the values of k which are to be tested. For each we define an array, initially at zero, of average distortions (scaled distances). We will have one value for each cycle, with the aim being to find the cycle with the best clustering and hence minimum distortion.

  for k in range(start, limit):
    meanDists = zeros(cycles)

The next for loop exists to provide several clustering attempts, to guard against clustering where k-means gets stuck in a stable but sub-optimal situation. For each attempt we define a sum of distortions, sumDist, and use the kMeansSpread function to actually generate the clusters for this cycle and value of k.

    for c in range(cycles):
      sumDist = 0.0
      centers, clusters = kMeansSpread(data, k)

With the clustering attempt made, the results are analysed. We go through each cluster in turn and calculate diffs, the difference between the data items in the cluster and the cluster centre that the k-means algorithm gave.

      for i, cluster in enumerate(clusters):
        size = len(cluster)
        diffs = array(cluster) - centers[i]

The differences are used to calculate the average distortion. At a basic level this requires that the diff arrays are squared using a dot product (dot(diff.T, diff)), but we also apply the inverse covariance matrix to scale the values according to the spread of data in the dimensions (i.e. treat each data feature equivalently). The distortion, which can be thought of as a measure of distance, is then divided by the size of the cluster, to get an average for the items in the cluster, and added to the total.

        for j, diff in enumerate(diffs):
          dist = dot(diff.T, dot(diff, invCovMat))
          sumDist += dist / size
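As a sanity check on the expression dot(diff.T, dot(diff, invCovMat)), the following standalone snippet (with toy values of my own) uses an identity inverse-covariance matrix, for which the scaled distance collapses to the plain squared Euclidean length:

```python
from numpy import allclose, array, dot, eye

diff = array([3.0, 4.0])   # invented difference from a cluster centre
invCovMat = eye(2)         # identity matrix: no scaling between features

# The same expression as used in the clustering loop
dist = dot(diff.T, dot(diff, invCovMat))

# With identity covariance this is just the squared Euclidean length
assert allclose(dist, 3.0**2 + 4.0**2)   # 25.0
print(dist)
```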

The final summation of average distortions in the cycle is then averaged again, over the number of clusters and dimensions, so that we get the average for the whole clustering solution.

      meanDists[c] = sumDist / (dims * k)

Finally, after all trial cycles are done, the minimum value of the averaged distortions is selected: the trial that gave the best clustering. This value is adjusted by being raised to the negative power -power, which scales the result according to how the distortion is expected to vary with the number of data dimensions (see reference above).

    distortions[k] = min(meanDists) ** (-power)

With the distortions calculated, it only remains to go through the values to find the value of k (number of clusters) that corresponds to the biggest improvement. The largest jump in the distortion value will occur when we reach the optimum value of k, the one that gives the most notable improvement in the relative separation between the data items and the cluster centres.

  maxJump = None
  bestK = None

  for k in range(start+1, limit):
    jump = distortions[k] - distortions[k-1]

    if (maxJump is None) or (jump > maxJump):
      maxJump = jump
      bestK = k

  return bestK
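The selection loop can be exercised on its own with an invented distortions dictionary (the numbers below are purely illustrative); the biggest jump occurs going from k=2 to k=3, so that k is chosen:

```python
distortions = {2: 0.8, 3: 2.9, 4: 3.1, 5: 3.2}  # invented values

maxJump = None
bestK = None
for k in range(3, 6):
    jump = distortions[k] - distortions[k-1]   # jumps: 2.1, 0.2, 0.1
    if (maxJump is None) or (jump > maxJump):
        maxJump = jump
        bestK = k

print(bestK)  # 3
```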

To visualise how the jump method works, imagine three real clusters, as illustrated in Figure 23.6. There will be some large spreads from the cluster centres if we try to use only two clusters. Using four clusters will tend to split a real cluster, leading to only a modest reduction in the spread. Scaled properly, according to the effect of dimensionality, the spread measure (distortion) can be used to suggest the value of k.
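To see the whole idea working end to end, the sketch below clusters three invented, well-separated blobs. Note the assumptions: the chapter's kMeansSpread function is replaced by a minimal stand-in (farthest-point seeding plus Lloyd iterations), only a single clustering attempt is made per k for brevity, and the blob positions and random seed are my own choices.

```python
from numpy import array, concatenate, cov, dot, linalg
from numpy.random import normal, seed

def simpleKMeans(data, k, iterations=25):
    # Stand-in for the chapter's kMeansSpread: farthest-point seeding,
    # then standard Lloyd iterations (assign, recompute centres)
    centers = [data[0]]
    while len(centers) < k:
        dists = [min(((p - c)**2).sum() for c in centers) for p in data]
        centers.append(data[max(range(len(data)), key=dists.__getitem__)])
    centers = array(centers)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            j = min(range(k), key=lambda i: ((p - centers[i])**2).sum())
            clusters[j].append(p)
        # Keep the old centre if a cluster happens to empty out
        centers = array([array(c).mean(axis=0) if c else centers[j]
                         for j, c in enumerate(clusters)])
    return centers, clusters

seed(17)  # fixed seed for reproducibility
blobs = [normal(loc, 0.3, (10, 2))
         for loc in ((0.0, 0.0), (10.0, 0.0), (0.0, 10.0))]
data = concatenate(blobs)  # 30 points, 3 tight clusters

n, dims = data.shape
power = dims / 2.0
invCovMat = linalg.pinv(cov(data.T))

distortions = {}
for k in range(2, 5):
    sumDist = 0.0
    centers, clusters = simpleKMeans(data, k)
    for i, cluster in enumerate(clusters):
        size = len(cluster)
        diffs = array(cluster) - centers[i]
        for diff in diffs:
            sumDist += dot(diff.T, dot(diff, invCovMat)) / size
    distortions[k] = (sumDist / (dims * k)) ** (-power)

bestK = max(range(3, 5), key=lambda k: distortions[k] - distortions[k - 1])
print(bestK)  # three well-separated blobs: the biggest jump is at k=3
```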



