Figure 23.1. How many seashells in how many groups? Some views of data are better
at distinguishing items and clusters than others.
Clustering
Clustering is the process of partitioning data units into discrete groups. Such an
operation requires that the similarity (or difference) between units is measured and that
the members of each group are then allocated to give the arrangement that maximises the
association of similar items and the separation of dissimilar ones. In practice, most of the
clustering methods presented here cannot give an immediate analytical solution
to this optimisation problem; instead the process is iterative, with several
cycles of improvement until a stable solution is found. As mentioned above, clustering
may operate on data items which have a high dimensionality, represented as feature
vectors. However, if the analysis is too slow or too complicated, the original data may be
transformed (projected) into a set of lower-dimensionality data by methods such as PCA prior
to the clustering operation.
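The iterative cycle described above can be sketched with a minimal k-means style routine. This is only an illustration under stated assumptions, not necessarily one of the methods presented later in the chapter: the function names `assign` and `k_means`, the use of squared Euclidean distance as the measure of difference, and the random choice of starting centres are all choices made here for the example.

```python
import numpy as np

def assign(data, centers):
    """Return, for each data item, the index of its nearest centre."""
    # Squared Euclidean distance from every item to every centre
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def k_means(data, k, n_iter=100, seed=0):
    """Partition the rows of `data` into k groups by repeated improvement."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data items picked at random
    centers = data[rng.choice(len(data), size=k, replace=False)]

    for _ in range(n_iter):
        labels = assign(data, centers)
        # Move each centre to the mean of its members;
        # a centre with no members is left where it is
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # no further improvement: a stable solution
        centers = new_centers

    return assign(data, centers), centers

# Toy data: two tight groups of four two-dimensional feature vectors
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data = np.vstack([square, square + 10.0])

labels, centers = k_means(data, 2)
print(labels)
print(centers)
```

Which group receives label 0 and which receives label 1 depends on the starting centres, so only the grouping itself is meaningful, not the label values.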
Depending on the situation, the process of clustering may work with prior knowledge
about the number of clusters, e.g. what the underlying data categories are. Alternatively,
the number of clusters may be completely unknown. If the number of clusters is not
known then this number must be deduced or optimised. Generally, several different trials
are run, each of which involves a different number of clusters. Within each trial there is a
separate optimisation for how the data items are allocated within that number of clusters.
The best number of clusters is then determined from the best overall arrangement from all
the trials. It would be possible to place each data item in a separate cluster, thus giving
maximum separation, but the objective is to give the best balance between the number of
clusters and the degree of separation, rather than only maximising separation.
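As a rough sketch of running several trials, the example below clusters the same toy data with different numbers of clusters and scores each arrangement. The scoring measure used here, the mean silhouette value, is an assumption chosen for illustration because it penalises both poorly separated and needlessly split clusters; the text does not prescribe a particular measure. The helper names and the deterministic "farthest-first" choice of starting centres are likewise hypothetical.

```python
import numpy as np

def farthest_first_centers(data, k):
    """Deterministic start: item 0, then repeatedly the item farthest from all chosen centres."""
    chosen = [0]
    min_d2 = ((data - data[0]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int(min_d2.argmax())
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((data - data[nxt]) ** 2).sum(axis=1))
    return data[chosen].astype(float)

def k_means(data, k, n_iter=100):
    """One trial: optimise the allocation of items for a fixed number of clusters."""
    centers = farthest_first_centers(data, k)
    for _ in range(n_iter):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def mean_silhouette(data, labels):
    """Average of (b - a) / max(a, b) over all items; higher is better."""
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2))
    cluster_ids = np.unique(labels)
    if len(cluster_ids) < 2:
        return 0.0
    scores = np.zeros(len(data))
    for i in range(len(data)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # an item alone in its cluster scores 0
        # a: mean distance to the other members of the item's own cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

def choose_k(data, k_values):
    """Run one trial per candidate number of clusters and keep the best score."""
    scores = {k: mean_silhouette(data, k_means(data, k)) for k in k_values}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy data: three tight groups of four items each
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data = np.vstack([square + offset for offset in ([0, 0], [10, 0], [0, 10])])

best_k, scores = choose_k(data, range(2, 6))
print(best_k)  # 3 for this toy data
print(scores)
```

Note that placing every item in its own cluster would give every item a score of zero under this measure, which reflects the point that maximum separation alone is not the objective.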
Once clusters are defined the result may then be used as a means of predicting
classification, i.e. estimating in which cluster a previously unseen piece of data lies.
Making a prediction may be as simple as finding which cluster is closest. Alternatively,
more advanced approaches, such as the supervised machine learning methods described in
Chapter 24, can be used where classification is not so easy. These can learn patterns from
training data with known, fixed classifications before predictions are made. One of the
machine learning methods presented later, the self-organising map, is notable because it is
unsupervised (needs no prior classifications) and thus can be viewed as an alternative to
the linear clustering methods presented in this chapter.
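The simplest form of prediction mentioned above, finding which cluster is closest, can be sketched as follows. The function name `predict_cluster`, the example centre coordinates and the use of squared Euclidean distance to cluster centres are assumptions made for this illustration; other definitions of "closest" (for example, distance to the nearest member) are equally possible.

```python
import numpy as np

def predict_cluster(item, centers):
    """Return the index of the cluster centre nearest to a new data item."""
    d2 = ((centers - item) ** 2).sum(axis=1)
    return int(d2.argmin())

# Hypothetical cluster centres from an earlier clustering run
centers = np.array([[0.5, 0.5], [10.5, 0.5], [0.5, 10.5]])

# A previously unseen feature vector
new_item = np.array([9.0, 2.0])

print(predict_cluster(new_item, centers))  # 1
```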