Hierarchical clustering
The next example moves on to rearranging the data (swapping rows and columns) so that we can better see similarities or correlations in the elements. This rearrangement will involve hierarchical clustering, and uses the same idea as building a phylogenetic tree (see Chapter 14), so that the most similar rows and columns are placed next to one another. The aim is that any rows or columns that show similar patterns will be placed together in easily visualised sub-groups. The image-generating method described earlier can then be used to make a visual representation (see Figure 16.3).
The function __hierarchicalRowCluster takes an array of data (a two-dimensional matrix of values for each colour channel) and uses this to build a matrix of the distances between each pair of rows, so that the element distanceMatrix[i][j] is the distance between row i and row j. By ‘distance’ here we mean a measure of how dissimilar two rows are, which we calculate as the Euclidean distance between the row vectors (the square root of the sum of the squared differences). In other situations you could consider different ways of estimating the similarity between rows.
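The Euclidean measure is not the only option: for gene expression data a correlation-based dissimilarity is often preferred, because it groups rows with similar profiles regardless of their absolute intensity. The short sketch below is illustrative only and not part of the Microarray class; it assumes channelData is a two-dimensional NumPy array with one row of values per spot.

from numpy import corrcoef

def correlationDistances(channelData):
    # corrcoef treats each row as one variable, giving an n x n matrix of
    # Pearson correlation coefficients between rows
    corr = corrcoef(channelData)
    # identical profiles get distance 0.0; anti-correlated profiles get 2.0
    return 1.0 - corr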
It should be noted that the function name begins with a double underscore ‘__’. In Python a function whose name starts, but does not end, with a double underscore is effectively private, so it cannot be called directly by that name from an instance of the class. A function is kept private like this when you don’t want to expose it as a normal method (a function linked to a class) and only use it internally within the class definition. Hence, here we make the clustering function private and use it in the inner workings of Microarray, but do not allow method calls from an object like microarray1.__hierarchicalRowCluster(data).
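Strictly speaking, Python enforces this privacy by name mangling rather than by hiding the function completely, something that can be seen with a small throw-away class (an illustration only, unrelated to the Microarray code):

class Demo:

    def __secret(self):
        return 'hidden'

d = Demo()
#print(d.__secret())       # AttributeError: 'Demo' object has no attribute '__secret'
print(d._Demo__secret())   # prints 'hidden'; the mangled name is still reachable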
def __hierarchicalRowCluster(self, dataMatrix):

    from SeqVariation import neighbourJoinTree

    n = len(dataMatrix[0])
    distanceMatrix = zeros((n, n), float)
We loop through each colour channel in dataMatrix and, for each row of that channel, subtract the row from the whole channel array, giving a matrix of differences relative to that row. The differences are squared and summed along each row, giving the squared distance from this row to every other row. These squared distances are then added to row i of distanceMatrix, so that contributions from each channel accumulate. Note that, although we described the Euclidean distance above, the square root is not actually taken here; the accumulated squared differences are used directly as the measure of dissimilarity.
    for channelData in dataMatrix:
        for i, row in enumerate(channelData):
            diffs = channelData - row
            sqDiffs = diffs * diffs
            sqDists = sqDiffs.sum(axis=1)
            distanceMatrix[i,:] += sqDists
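To make the broadcasting step explicit, here is a small stand-alone check with invented numbers, showing what the inner loop computes for the first row of a single channel:

from numpy import array

channelData = array([[0.0, 0.0],
                     [3.0, 4.0],
                     [6.0, 8.0]])

row = channelData[0]
diffs = channelData - row              # row is subtracted from every row
sqDists = (diffs * diffs).sum(axis=1)
print(sqDists)                         # [  0.  25. 100.] : squared distances to row 0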
The function neighbourJoinTree, which was described previously in Chapter 14, is reused here to create a hierarchical tree from the distance matrix. Note that we use .tolist() to make a list of lists from the array of distances, because the tree-generating function from earlier in the book was not written to work with NumPy arrays.
    tree, joinOrder = neighbourJoinTree(distanceMatrix.tolist())
The hierarchical tree structure is then interrogated by following its branches, effectively flattening it into a single list of node indices, which represents the order of rows for the shuffled matrix. The list rowOrder is initially defined as a copy of the tree, but is then processed to insert the contents of any sub-lists (or tuples) directly into the main list. A while loop is used here because the rowOrder list grows as the tree branches are flattened. We keep an index i, which represents the position in the list currently being processed, and continue until the end of the list is reached. The inner while loop checks whether the list item at position i is an integer (the end of a tree branch); if it is not, the contents of the sub-list (rowOrder[i]) are inserted into the main list using the slice notation [i:i+1], so that the single element holding the sub-list is replaced by its contents. This inner while loop is needed because the new element at position i may itself be another sub-list.
    rowOrder = list(tree)

    i = 0
    while i < len(rowOrder):
        while not isinstance(rowOrder[i], int):
            rowOrder[i:i+1] = rowOrder[i]
        i += 1

    return rowOrder
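The flattening logic can be checked in isolation with a hypothetical nested tree, independently of neighbourJoinTree:

rowOrder = list(((0, 3), (1, (2, 4))))   # an invented tree of row indices

i = 0
while i < len(rowOrder):
    while not isinstance(rowOrder[i], int):
        rowOrder[i:i+1] = rowOrder[i]    # splice the sub-list contents in place
    i += 1

print(rowOrder)   # [0, 3, 1, 2, 4]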
The row clustering routine is actually used by the hierarchicalCluster function, which is not private and so provides the method microarray.hierarchicalCluster(). This first clusters the rows in self.data, giving the new row order. The data array is then transposed (rows swapped for columns, keeping the channel axis first) and passed through the same clustering, which effectively clusters the columns and gives the new column order.
def hierarchicalCluster(self):

    rows = self.__hierarchicalRowCluster(self.data)

    swapped = transpose(self.data, axes=(0,2,1))
    cols = self.__hierarchicalRowCluster(swapped)
The reordered rows and cols are used as indices into the NumPy arrays to shuffle the data according to the hierarchical clustering, noting that we don’t affect self.data itself. The shuffled data is then used to make an entirely new Microarray object, with a different order for its rows and columns.
    data = self.data[:,rows] # Rearrange
    data = data[:,:,cols]
    # data = array(data.tolist()) # to fix PIL.Image bug

    name = self.name + '-Sorted'
    rowData = [self.rowData[i] for i in rows]
    colData = [self.colData[j] for j in cols]

    sortedArray = Microarray(name, data, rowData, colData)

    return sortedArray
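The index-based shuffling used above can be verified on a small invented array, where the first axis plays the role of the colour channel in self.data:

from numpy import arange

data = arange(12).reshape(1, 3, 4)   # one channel, three rows, four columns
rows = [2, 0, 1]
cols = [3, 1, 0, 2]

data = data[:,rows]       # reorder the rows
data = data[:,:,cols]     # then reorder the columns

print(data[0])
# [[11  9  8 10]
#  [ 3  1  0  2]
#  [ 7  5  4  6]]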
When testing the hierarchical clustering we get back a new Microarray object with a
different order of rows and columns:
sortedArray = rgArray.hierarchicalCluster()
sortedArray.makeImage(20).show()
print(rgArray.rowData)
print(sortedArray.rowData)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] – Original rows
# [4, 13, 2, 11, 6, 15, 5, 14, 7, 16, 1, 10, 3, 12, 8, 17, 0, 9] – Shuffled rows