K-means clustering

The next Python example for clustering is an algorithm known as k-means clustering. Here the 'k' refers to the number of clusters that we wish to group our list of data items into. The 'means' refers to the concept of describing each cluster by its central average position: a geometric average of its members. There are implementations of the k-means algorithm in the Scientific Python library (scipy.cluster.vq.kmeans and scipy.cluster.vq.kmeans2), but we will create one from scratch here to illustrate the method and to give a basis for modification. There are also variants of this clustering method, like 'k-medoids' and 'k-medians', which we will not go into here for reasons of brevity. Compared to the threshold-based methods already discussed, the k-means method has the advantage that you do not have to choose a threshold. Although you do have to say how many clusters (k) to make, it is possible to try several values for k and take the best one. Compared to DBSCAN, k-means will not work as well with data clusters that have irregular or intermingled shapes, but it will allow separation of regular ('blobby') clusters even when they overlap.
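For comparison, the Scientific Python routines mentioned above can be used directly. The following is a minimal sketch of that route; the whiten step, which SciPy recommends before k-means, rescales each dimension to unit variance:

from numpy import random
from scipy.cluster.vq import kmeans, vq, whiten

data = random.random((1000, 2))          # example input: 1000 points in 2D
scaled = whiten(data)                    # rescale each dimension to unit variance
centers, distortion = kmeans(scaled, 3)  # find three cluster centres
labels, dists = vq(scaled, centers)      # assign each point to its closest centre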

Initially the cluster centres are defined randomly, basing them on the locations of random data items. Each of the input data items will then become a member of a cluster by simply determining which centre is closest. The effect is that the discrimination boundaries between clusters are straight lines. Most likely the initial guess for the cluster centres will be bad, so the algorithm goes through a number of iterative cycles to improve the locations of the centres. Each cycle consists of re-appraising which data items are members of which cluster and then moving the centres of the clusters to the average of their memberships. When a cluster centre moves, the membership of items may change for the next cycle, which in turn leads to a new centre, and so the cycle repeats. The k-means algorithm is simple in that there is no recorded statistic that is being optimised. Instead the iteration continues until a stable situation is reached. For data sets where there are k obvious clumps of items the algorithm will be fairly reproducible. However, where the data items are more evenly spread the final clustering will depend on the initial guess of the centres, i.e. there can be more than one stable solution. If there isn't a stable solution you should consider whether it is meaningful to cluster the data in this way; imagine dividing a regular grid of points into two, where any straight line that makes two groups of equal size is a possible solution.

The Python function that performs k-means clustering takes an array of input data vectors and the number k, to specify how many clusters there will be. We also have the option of passing in some initial guesses for the cluster centres, but by default this is None and we choose random centres. If the centres are unspecified we define the initial guess using the sample function from Python's standard random number module, which will choose k different items from the data, and then convert the resulting list into a NumPy array.


from numpy import array, random
from random import sample

def kMeans(data, k, centers=None):
    if centers is None:
        centers = array( sample(list(data), k) )  # list() not needed in Python 2
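As an aside, NumPy's own random module offers an equivalent way of picking k distinct starting centres. This is just an alternative sketch, not what the function above uses, and it requires data to be a NumPy array:

# Alternative initial-centre choice: k distinct random row indices
indices = random.choice(len(data), k, replace=False)
centers = data[indices]  # fancy indexing needs a NumPy array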

With the initial cluster centres defined we now come to the iterative part. Here we define the amount of change, which is initially large (proportionately speaking). The while loop will continue as long as the change between subsequent cycles is big enough; otherwise we deem the clustering to have reached a stable solution.

    change = 1.0
    while change > 1e-8:

The membership of the clusters is defined by putting each data item into a list, one for each cluster, according to which cluster centre is closest. Hence clusters is defined as a list of lists which collects the data items. Note that the membership of clusters may change each cycle, so we have to repeatedly re-appraise the memberships. The for loop goes through each data item, here called vector, and subtracts the centres; this will give an array of difference vectors from the data item to all of the centres. Note that this subtraction assumes we are working with NumPy arrays, hence subtracting two arrays gives another array. The array of differences is then squared and then summed up for each vector, so that we have a list of squared distances, with one value for each centre; i.e. an array of (x, y, z) items becomes (x², y², z²) and then x²+y²+z². The index of the closest centre is then found with dists.argmin(), which gives the position of the smallest value in the array. The data vector is added to the list in clusters with the closest index.

        clusters = [[] for x in range(k)]

        for vector in data:
            diffs = centers - vector
            dists = (diffs * diffs).sum(axis=1)
            closest = dists.argmin()
            clusters[closest].append(vector)

With the membership of the clusters defined, the next loop recalculates the centres of the clusters based on the data vectors that belong to each. We simply go through the list of clusters and, after converting to a NumPy array, we calculate the average data vector, taking a summation for each data dimension and dividing by the length of the cluster. The difference between the old and new cluster centre is calculated by subtraction, then added as the sum of squares (i.e. length squared) to the total amount of change for this cycle. Finally in the loop the new centre replaces the old one: centers[i] = center.

        change = 0
        for i, cluster in enumerate(clusters):
            cluster = array(cluster)
            center = cluster.sum(axis=0) / len(cluster)
            diff = center - centers[i]
            change += (diff * diff).sum()
            centers[i] = center

    return centers, clusters

At the end of the function, once the change is small enough, we pass back the last values of the cluster centres and the list which contains the data items grouped according to cluster membership.
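One caveat the function does not handle: if a poor initial guess leaves a cluster with no members, len(cluster) is zero and the averaging fails with a division by zero. A minimal defensive variant of the centre-update loop, under the assumption that an empty cluster is simply re-seeded from a random data item, might look like this:

        # Sketch of a defensive update (hypothetical addition, not in the
        # original function): re-seed any empty cluster from a random item
        change = 0
        for i, cluster in enumerate(clusters):
            if not cluster:  # empty cluster: re-seed rather than average
                centers[i] = data[random.randint(len(data))]
                continue
            cluster = array(cluster)
            center = cluster.sum(axis=0) / len(cluster)
            diff = center - centers[i]
            change += (diff * diff).sum()
            centers[i] = center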

To test the function we can use a random spread of 1000 data points in two dimensions, grouping them into three clusters:

testDataA = random.random((1000,2))  # No clumps
centers, clusters = kMeans(testDataA, 3)
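Because this uniform test data has no real clumps, repeated runs can settle on different stable solutions, as discussed above. A quick check, just as an illustrative sketch, is to run the function twice and compare the centres; the exact values will vary from run to run:

centersA1, clustersA1 = kMeans(testDataA, 3)
centersA2, clustersA2 = kMeans(testDataA, 3)
print(centersA1)  # the two sets of centres will often differ,
print(centersA2)  # reflecting different random initial guesses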

Or we can add two distinct clumps of two-dimensional points:

from numpy import vstack

testDataB1 = random.normal(0.0, 2.0, (100,2))
testDataB2 = random.normal(7.0, 2.0, (100,2))
testDataB = vstack([testDataB1, testDataB2])  # Two clumps
centers, clusters = kMeans(testDataB, 2)
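As mentioned earlier, we can also try several values of k and take the best one. A simple score is the total within-cluster sum of squared distances; it always shrinks as k grows, so in practice one looks for the value of k where it stops improving quickly. The following is a minimal sketch using the kMeans function above, where sumSquares is a hypothetical helper name:

def sumSquares(centers, clusters):
    # Total squared distance of all items from their cluster centres
    total = 0.0
    for center, cluster in zip(centers, clusters):
        diffs = array(cluster) - center
        total += (diffs * diffs).sum()
    return total

for k in range(2, 7):
    cen, clus = kMeans(testDataB, k)
    print(k, sumSquares(cen, clus))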

For each test we can display the result in a scatter graph using the matplotlib library. Note that to make the plot we extract the x and y coordinates for the data items separately using zip, because matplotlib.pyplot expects two separate lists (rather than (x,y) points). After plotting the data with separate colours we also make a scatter plot of the centres of the clusters with black circles.

from matplotlib import pyplot

colors = ['#FF0000','#00FF00','#0000FF',
          '#FFFF00','#00FFFF','#FF00FF']

for i, cluster in enumerate(clusters):
    x, y = zip(*cluster)
    color = colors[i % len(colors)]
    pyplot.scatter(x, y, c=color, marker='o')

x, y = zip(*centers)
pyplot.scatter(x, y, s=40, c='black', marker='o')
pyplot.show()

