Support vector machines
The final example of a machine learning method in this chapter moves away from neural
networks to a different and more recent approach, described generally as kernel methods;
the specific example given here is the support vector machine (SVM). The
SVM can be thought of as being related to the k-nearest neighbour idea, where we
visualise known points of data in a vector space and aim to make predictions on
previously unseen input by looking where a query lies in this space in relation to the
known, training data. Whereas with the k-nearest neighbour method a prediction is made
by looking at a number of known values in the vicinity of a query, for an SVM the training
data vectors are used to define the location of a boundary that separates the vector space
into two regions. A prediction is made for an unseen query vector by seeing on which side
of the decision boundary it occurs. For two dimensions such a decision boundary would be
a line, and in three dimensions it is a plane, but for most SVMs we are working in more,
often many more, than three dimensions (data features), in which case the boundary is a
hyperplane. The name support vector machine derives from the support vectors: the training
data points that lie closest to the boundary, at the edge of the separating margin, and
which thus define the position of the boundary.
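As a concrete illustration of this idea, the following minimal sketch (assuming the
scikit-learn library and some invented feature vectors, neither of which is prescribed by
the text) trains a support vector machine on labelled data points and then predicts the
category of an unseen query according to which side of the learned boundary it falls on:

from sklearn.svm import SVC

trainData = [[1.0, 2.1], [1.5, 1.8], [5.2, 5.9], [6.1, 5.5]]   # known feature vectors
knownClasses = [0, 0, 1, 1]                                    # their known categories

svm = SVC(kernel='linear')             # a linear boundary (hyperplane) in feature space
svm.fit(trainData, knownClasses)

query = [[2.0, 2.0]]
print(svm.predict(query))              # which side of the boundary the query lies on
print(svm.support_vectors_)            # the training points that define that boundary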
One of the useful properties of support vector machines compared to neural network
techniques comes from the fact that they maximise the margin between data categories. If
you imagine the training data to be two distinct, fixed groups the operation finds the
widest region that separates them. The boundary between these two groups is placed in the
centre of the separating region, and the prediction of previously unseen data is made by
finding out on which side of the boundary the query point lies. With a neural network the
effective boundary between data categories will not necessarily be placed fairly in the
middle. This is a particular issue when the training data is not evenly spread (e.g. one
prediction category is significantly more populous) or when the network has been
over-trained. The decision line may then lie much closer to one category than the other,
with the result that unseen query data falling between the two tends to be misclassified,
biased towards a single category. With a support vector machine it is only the support
vectors at the edge of the maximised margin between categories that define the boundary,
so over-training is not possible and the amount of data on either side has far less effect
on the decision line.
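The effect of margin maximisation can be demonstrated with another small sketch, again
assuming scikit-learn and invented data: because only the support vectors at the edge of
the gap determine the boundary, adding many extra training points that lie well inside one
category leaves the decision line essentially unchanged.

from sklearn.svm import SVC

groupA = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
groupB = [[4.0, 4.0], [5.0, 4.5], [4.5, 5.0]]

svm1 = SVC(kernel='linear', C=1e6)     # a large C approximates a hard margin
svm1.fit(groupA + groupB, [0, 0, 0, 1, 1, 1])
print(svm1.coef_, svm1.intercept_)     # hyperplane placed midway between the groups

# Add extra points deep inside group B; they are far from the margin, so they
# are not support vectors and the boundary stays essentially where it was.
extra = [[6.0 + 0.1 * i, 6.0] for i in range(20)]
svm2 = SVC(kernel='linear', C=1e6)
svm2.fit(groupA + groupB + extra, [0, 0, 0] + [1] * 23)
print(svm2.coef_, svm2.intercept_)     # effectively the same hyperplane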
A support vector machine effectively defines the widest boundary between two
categories of data vector by doing linear optimisation, similar to finding a line of best fit.
However, the input data may not naturally lie in two discrete regions and thus be separable
with a simple plane boundary. Imagine trying to decide which applicants should be
allowed to join the army based on their weight measurements. You wouldn’t expect to be
able to sensibly use a single decision line to separate those who pass and those who fail,
given that you have to distinguish between both those people who are overweight and
those who are underweight. This kind of problem can nonetheless be solved using a single
line, and thus by a support vector machine, because the decision boundary can be placed in
a space of higher dimensionality than the original problem. Referring
again to the army recruits, if instead of having a single weight axis you put the weight
measurements in a semi-circle then you can now separate the pass and fail categories
above and below a single line. It is simply that the data is now in two dimensions, rather
than the original one. The extra dimension has nothing to do with the original data; it has
simply spread it out in a predictable way, and you can still retrieve the original values if
required: in this case from the distance along the curve. Using a known function to spread
data points into more dimensions is often referred to as the kernel trick and gives power
to the SVM to operate on nonlinear problems (e.g. with context dependency) while using
mathematically robust, linear means to make an optimum decision. Although the simple
example mentioned uses two dimensions to achieve separation, a general support vector
machine can use any number of extra dimensions. Although at first this might seem to
arbitrarily complicate the problem, the really clever part of the kernel trick is that within
the algorithm to find the separating boundary the high-dimensionality locations of the data
points do not have to be stored or even calculated. The optimisation can be performed
using only the dot products between pairs of data points in the high-dimensionality space,
and these values can be computed directly from the original feature vectors via the kernel
function. For a more mathematical explanation of
support vector machines and the kernel trick, see the reference given below.
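To make the weight example concrete, the sketch below (with invented weight values and,
once more, an assumed scikit-learn dependency) maps each one-dimensional weight w to the
pair (w, w*w), a different curve from the semi-circle described above but with the same
effect, so that the accepted middle range can be split from both extremes with a single
straight line; a nonlinear kernel achieves an equivalent separation implicitly, without the
extra dimension ever being constructed:

from sklearn.svm import SVC

weights  = [[45.0], [50.0], [65.0], [70.0], [75.0], [95.0], [100.0]]
accepted = [0, 0, 1, 1, 1, 0, 0]       # 1 = accepted, 0 = rejected (under- or overweight)

# Explicit mapping into two dimensions, followed by a simple linear boundary
mapped = [[w[0], w[0] * w[0]] for w in weights]
svmLinear = SVC(kernel='linear').fit(mapped, accepted)
print(svmLinear.predict([[72.0, 72.0 * 72.0]]))   # an intermediate weight, mapped the same way

# The kernel trick: a nonlinear kernel on the original one-dimensional data,
# where the higher-dimensional mapping is never constructed explicitly
svmKernel = SVC(kernel='rbf', gamma=0.01).fit(weights, accepted)
print(svmKernel.predict([[72.0]]))                # expected to fall in the accepted class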