Python Programming for Biology: Bioinformatics and Beyond



Improving k-means

The k-means algorithm can be improved by making a better guess at the starting cluster centres. The idea is that by choosing cluster centres that are spaced further apart, the optimisation converges more quickly and is less likely to get stuck in sub-optimal positions, which are often characterised by two centres lying close together and dividing a real cluster at the expense of a globally better solution. One method that uses better starting positions is called k-means++, which works by choosing centres on a probabilistic basis, where the chance of a point becoming an initial centre increases with the square of the distance to the other centres specified so far. However, here we will consider another, slightly more directed approach with a similar aim.
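For illustration, a minimal sketch of the k-means++ seeding idea might look like the following. This is not the book's code: the function name is hypothetical, and it simply draws each new centre with probability proportional to the squared distance to the nearest centre chosen so far.

import numpy

def kMeansPlusPlusSeed(data, k):

  n = len(data)
  centers = [data[numpy.random.randint(n)]]

  while len(centers) < k:
    # Squared distance from each point to its closest existing centre
    distSq = numpy.min([((data - c)**2).sum(axis=1) for c in centers], axis=0)

    # Sample the next centre with probability proportional to distSq
    probs = distSq / distSq.sum()
    index = numpy.random.choice(n, p=probs)
    centers.append(data[index])

  return numpy.vstack(centers)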

For the kMeansSpread function we guess one cluster centre by taking a random data point and then choose the centres of subsequent clusters by selecting points that are furthest from those defined so far. Each cluster centre is chosen by creating an index, which is placed in the set indices and used at the end to select the corresponding data items, and thus create an array of centres. The index for the first cluster is a random integer between zero (the first data item) and n-1 (the last data item), selected using the standard random number Python module: random.randint(0, n-1). The remaining indices are added in a while loop, until a total of k is achieved, by choosing subsequent points which have minimum radial influence from the centres already chosen.

from numpy import zeros, vstack
from random import randint

def kMeansSpread(data, k):

  n = len(data)

  # Pick the first centre as a random data point
  index = randint(0, n-1)
  indices = set([index])

  # Accumulated 'radial influence' of the centres chosen so far
  influence = zeros(n)
  while len(indices) < k:
    # Squared distances from the latest centre to all data points;
    # 1.0 is added to avoid division by zero at coincident points
    diff = data - data[index]
    sumSq = (diff * diff).sum(axis=1) + 1.0
    influence += 1.0 / sumSq

    # The next centre is the point least influenced by existing centres
    index = influence.argmin()

    # Guard against picking the same point twice
    while index in indices:
      index = randint(0, n-1)

    indices.add(index)

  centers = vstack([data[i] for i in indices])

  return kMeans(data, k, centers)
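As a minimal sketch of how the function might be tested (the test data here is invented, and it is assumed that kMeans(), from the previous section, returns the centres and the cluster memberships):

import numpy

# Two fuzzy clusters of two-dimensional test points
testData = numpy.vstack([numpy.random.normal(0.0, 1.0, (50, 2)),
                         numpy.random.normal(5.0, 1.0, (50, 2))])

centers, clusters = kMeansSpread(testData, 2)
print(centers)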

Key to this approach is the influence array. Starting initially with zeros, each time we add an index for a new centre we calculate the difference from that cluster centre (data[index]) to all the data, thus creating diff. Each value in the influence array is then increased by one over sumSq: the sum of the squares of this difference. Note that sumSq has 1.0 added to each element to avoid dividing by zero, i.e. where a data point coincides exactly with the previous centre. Also, we perform an element-wise division when we divide the simple number 1.0 by the whole array sumSq, so we get an array of reciprocals. In effect each centre has a diminishing radial influence and we choose the index where the sum of all these influences is minimised. In some circumstances an index can actually be picked twice, but we guard against this with a second while loop which checks for repeats and chooses a random index until an unused one is found. Once k indices are collected, a list of data items, representing the cluster centres, is created using a list comprehension and stacked into a NumPy array (vstack operates on arrays much like append does on lists). Overall this method is analogous to the concept of repelling charges. In the end the ‘repelled’ centres are a fair starting point for k-means, which is called at the end so its results are passed back directly at the return.
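The element-wise division is easy to verify interactively, e.g.:

from numpy import array

sumSq = array([1.0, 2.0, 5.0])
print(1.0 / sumSq)  # [1.  0.5  0.2] - one reciprocal per element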

The function is used, and can be tested, in the same way as kMeans(). If speed of execution becomes important then, aside from faster implementations (e.g. using a C-language extension, as described in Chapter 27), further improvements can sometimes be made to the algorithmic efficiency. For example, if the dimensionality of the data items is not too large (e.g. three-dimensional), techniques like kD-trees can be used to calculate the closest centre for each data item more efficiently. Also, if the data set to be clustered is large, significant time savings can be made by initially running the algorithm on a smaller, random subset of the data. The cluster centres from the smaller subset can then be used as good starting values in a second round of clustering using the whole data set. A point of note with k-means and similar methods is that they tend to only work well where the clusters are of about the same size and density and are convex (blobs).
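A minimal sketch of the subset-seeding idea follows. The helper function is hypothetical, and it assumes the kMeansSpread() function above returns centres and cluster memberships from the underlying kMeans() call:

from random import sample

def kMeansSubsetSeed(data, k, subsetSize=1000):

  n = len(data)
  if n <= subsetSize:
    return kMeansSpread(data, k)

  # Cluster a random subset first, then use the resulting
  # centres as starting values for the full data set
  rows = sample(range(n), subsetSize)
  centers, clusters = kMeansSpread(data[rows], k)

  return kMeans(data, k, centers)

For the kD-tree approach, SciPy provides a ready-made structure: cKDTree(centers).query(data) from the scipy.spatial module returns, for each data item, the distance to and the index of its nearest centre.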



