Python Programming for Biology: Bioinformatics and Beyond



Tim J. Stevens and Wayne Boucher

Clustering methods

The following Python examples demonstrate a range of clustering methods. Overall we have tried to give an overview of the different approaches and have chosen examples that are reasonably easy to explain (and write in Python) but which should also prove useful in many different situations.

Simple threshold clustering

The first Python example in this chapter involves grouping items of data into clusters in a fairly simple manner, based upon how close (similar) the items of data are. This method is somewhat naïve and suffers from a few problems, but it is a good place to start in the subject and we will subsequently show improvements. The algorithm works by initially considering an item of data as a potentially separate cluster and then expands the cluster by adding any other items that are sufficiently close, i.e. where the distance between data points is less than a threshold. The general problem with this kind of approach is that it is sensitive to the arrangement of data points; two genuine clusters can be merged into one by the presence of outlying points that just happen to span the gap between the clusters. The method is sometimes described as being ‘too greedy’.



Firstly, in the Python example we define a simple function that calculates the regular, Euclidean distance between two items of data (feature vectors). Although this is a reasonable general choice for measuring the similarity between data items, there are some situations where something else may be more appropriate. A good example of using a different kind of measure would be for biological sequences, where it is possible to estimate how similar the sequences are, assuming an evolutionary process.

The euclideanDist function will assume the data items are NumPy array objects, so that we can do some quick manipulations without having to go through loops. The distance is calculated by first finding the difference between two vectors: another vector, diff. The square root of the sum of the squares of the values in the diff vector is then the distance we want. We use the dot() function to do the summing and squaring for us. (See Chapter 9 for an explanation of dot().)



from numpy import dot, sqrt

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))
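As a quick sanity check of the distance function, the following minimal sketch (the 3-4-5 triangle is just an illustrative input, not from the text) confirms it behaves as expected:

```python
from numpy import array, dot, sqrt

def euclideanDist(vectorA, vectorB):
    # Difference vector, then square root of its dot product with itself
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

# A 3-4-5 right triangle gives a known distance
print(euclideanDist(array([0.0, 0.0]), array([3.0, 4.0])))  # 5.0
```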

Next is another small helper function that we will use in the clustering. In this case the findNeighbours function is built to look through all pairs of data items, calculate the distance between the items (for example, using the euclideanDist function) and record those items that are sufficiently close to be considered as near neighbours. The three input arguments are naturally the data itself, the function to get the distance value and a threshold for defining neighbours. The results will be stored in a Python dictionary, neighbourDict, so this is defined and initially filled with empty lists, one for each item of data.


def findNeighbours(data, distFunc, threshold):
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []

    for i in range(0, n-1):
        for j in range(i+1, n):
            dist = distFunc(data[i], data[j])
            if dist < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)

    return neighbourDict
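A small illustration of findNeighbours may help here; this sketch uses four made-up points (three near the origin, one far away) and an arbitrary threshold of 1.0:

```python
from numpy import array, dot, sqrt

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

def findNeighbours(data, distFunc, threshold):
    # Map each item index to a list of the indices of its near neighbours
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []
    for i in range(0, n-1):
        for j in range(i+1, n):
            if distFunc(data[i], data[j]) < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)
    return neighbourDict

# Three mutually close points and one distant outlier
points = [array([0.0, 0.0]), array([0.5, 0.0]),
          array([0.0, 0.5]), array([5.0, 5.0])]
print(findNeighbours(points, euclideanDist, 1.0))
# {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}
```

The outlier (index 3) ends up with an empty neighbour list, which is exactly how it will later form a tiny ‘noise’ cluster of its own.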

The remainder of the function involves going through all pairs of data items. Hence we have two loops, one with the index i and one with the index j, to get two items. Note how we have deliberately chosen the ranges for these indices to avoid repetition, i.e. the j indices are always larger than i. The distance is calculated by using the function that was passed in, and then if the distance is less than the threshold we record the data items at indices i and j as neighbours; the indices go in each other’s list of neighbours, held in neighbourDict.

Finally we get to the actual simpleCluster clustering function. This takes a list of data, each item of which is assumed to be a NumPy array, a distance threshold and the function to make the distance measurement. Initially, the dictionary of neighbours is filled using findNeighbours, then the remainder of the function works to find clusters by combining neighbours together into groups.

def simpleCluster(data, threshold, distFunc=euclideanDist):
    neighbourDict = findNeighbours(data, distFunc, threshold)

For the clusters we first define an empty list, which will represent items of data in discrete groups. The pool variable is a set of numbers, the indices of data items that we have not yet processed. The aim will be to go through all the pool and place the indices in the appropriate cluster. We are using Python sets here because detecting something in a set is quicker than looking through a list.

    clusters = []
    pool = set(range(len(data)))
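The claim that membership tests on a set are quicker than on a list can be checked with a rough timing sketch (the absolute numbers are machine-dependent; only the comparison matters):

```python
import timeit

# Worst case for the list: the sought value is at the very end,
# so every element must be scanned; the set uses a hash lookup
setup = 'items = list(range(100000)); itemSet = set(items)'
listTime = timeit.timeit('99999 in items', setup=setup, number=100)
setTime = timeit.timeit('99999 in itemSet', setup=setup, number=100)
print(listTime > setTime)  # the hashed set lookup wins easily
```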

Now the task is to process the data, which involves taking an item from the ‘to-do’ pool (remembering pop() removes something from a set, so the pool shrinks) and looking through its neighbours, which we fetch from the neighbourDict. Because the item was in the pool it will not yet be in a cluster, thus we define a new one as an empty set and add the item index i. The item is now considered processed. It may seem dumb to already consider the item processed, but as will become apparent most items will be added in a different way; this is just a start for the cluster.

    while pool:
        i = pool.pop()
        neighbours = neighbourDict[i]
        cluster = set()
        cluster.add(i)

Once the new cluster is defined we then add all of the neighbours which are known to be close to the same cluster. To do this a new pool, pool2, is defined from the indices j of the neighbours.

        pool2 = set(neighbours)

        while pool2:
            j = pool2.pop()

If the neighbour has not yet been processed (j still in pool) then we process it by removing it from the main pool and adding it to the current cluster. Its own neighbours then go into pool2, the set that we are currently working on. In this way neighbours of neighbours are placed in the same cluster. Note the use of update(), which will expand a Python set to include any new values from another group.

            if j in pool:
                pool.remove(j)
                cluster.add(j)
                neighbours2 = neighbourDict[j]
                pool2.update(neighbours2)

This inner loop ends when no more neighbours are placed in the pool for the current cluster and pool2 is fully processed (empty), whereupon the cluster is added to the list of clusters. Another cluster will be created by the outer while loop as long as the pool still contains something to process: those items that are not a member of any cluster seen so far.

        clusters.append(cluster)

The final output of the clusters will be given as a list of lists; each sub-list will contain all of the data items that go together in the same cluster. So far the clusters are only defined in terms of index numbers, so it is a simple matter to go through each set of clustered indices and add the actual data items they correspond to (data[i]) to the clusterData list, which will be passed back at the end of the function.

    clusterData = []
    for cluster in clusters:
        clusterData.append( [data[i] for i in cluster] )

    return clusterData

For testing we will use some random data clusters with a normal (Gaussian) distribution, readily created using random.normal. Here each data cluster is centred on a different point but has the same spread and size: 100 points in two dimensions. These are then combined into a single array of vectors (300 points in two dimensions) using vstack, which combines the arrays of vectors along their first axis (rows), and shuffled to mix up the vectors from the different clusters.

from numpy import random, vstack

spread = 0.12
sizeDims = (100, 2)
data = [random.normal((0.0, 0.0), spread, sizeDims),
        random.normal((1.0, 1.0), spread, sizeDims),
        random.normal((1.0, 0.0), spread, sizeDims)]

data = vstack(data)
random.shuffle(data)  # Randomise order

clusters = simpleCluster(data, 0.10)
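For reference, the fragments above can be assembled and run end to end. The sketch below restates the same functions and test data, adding only a fixed random seed (an addition, purely for repeatability); the printed total confirms that every one of the 300 points lands in exactly one cluster:

```python
from numpy import dot, sqrt, random, vstack

def euclideanDist(vectorA, vectorB):
    diff = vectorA - vectorB
    return sqrt(dot(diff, diff))

def findNeighbours(data, distFunc, threshold):
    # Record, for each item, the indices of items closer than the threshold
    neighbourDict = {}
    n = len(data)
    for i in range(n):
        neighbourDict[i] = []
    for i in range(0, n-1):
        for j in range(i+1, n):
            if distFunc(data[i], data[j]) < threshold:
                neighbourDict[i].append(j)
                neighbourDict[j].append(i)
    return neighbourDict

def simpleCluster(data, threshold, distFunc=euclideanDist):
    neighbourDict = findNeighbours(data, distFunc, threshold)
    clusters = []
    pool = set(range(len(data)))
    while pool:
        # Start a new cluster from any unprocessed item, then absorb
        # neighbours, and neighbours of neighbours, via pool2
        i = pool.pop()
        cluster = set()
        cluster.add(i)
        pool2 = set(neighbourDict[i])
        while pool2:
            j = pool2.pop()
            if j in pool:
                pool.remove(j)
                cluster.add(j)
                pool2.update(neighbourDict[j])
        clusters.append(cluster)
    return [[data[i] for i in cluster] for cluster in clusters]

random.seed(42)  # fixed seed so the run is repeatable (not in the original)
spread = 0.12
sizeDims = (100, 2)
data = vstack([random.normal((0.0, 0.0), spread, sizeDims),
               random.normal((1.0, 1.0), spread, sizeDims),
               random.normal((1.0, 0.0), spread, sizeDims)])
random.shuffle(data)

clusters = simpleCluster(data, 0.10)
print(sum(len(c) for c in clusters))  # partition of all points: 300
```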

The following can then plot the results with different symbols and colours. Note that we plot all the small ‘noise’ clusters, those with three or fewer items, as black spots. Also, the zip function is handy here because the scatter plotting function requires two separate lists of x values and y values, but our data has (x,y) pairs. Hence *cluster expands the one list into its separate (x,y) pairs, and zip groups the x and y values together (see Chapter 10).
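The zip(*…) transposition trick can be seen in isolation with a tiny hypothetical list of pairs:

```python
pairs = [(1, 10), (2, 20), (3, 30)]

# *pairs expands the list into separate tuple arguments, and zip
# then regroups the first and second elements: a transposition
allX, allY = zip(*pairs)
print(allX)  # (1, 2, 3)
print(allY)  # (10, 20, 30)
```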

from matplotlib import pyplot

colors = ['#F0F0F0', '#A0A0A0', '#505050',
          '#D0D0D0', '#808080', '#202020']
markers = ['d', 'o', 's', '>', '^']

i = 0
for cluster in clusters:
    allX, allY = zip(*cluster)

    if len(cluster) > 3:
        color = colors[i % len(colors)]
        marker = markers[i % len(markers)]
        pyplot.scatter(allX, allY, s=30, c=color, marker=marker)
        i += 1

    else:
        pyplot.scatter(allX, allY, s=5, c='black', marker='o')

pyplot.show()



