A Python support vector machine
The Python example of a support vector machine starts with the import of the various array
and mathematical NumPy functions that will be needed. As with the neural network
example, the use of array operations aims to focus attention on the high-level operation
being performed, and thus on the reasoning involved in the method. Helpfully, the NumPy
routines are also quicker to execute than looping through large arrays. There are several
NumPy imports, and it should be noted that many of these have the same name as their
equivalents in the standard math module, although as you might expect these versions work
on whole arrays, not just single numbers:
from numpy import exp, power, array, zeros, sqrt
from numpy import nonzero, random, abs, sum, dot
Next, because we will be using some random numbers, we initialise the random number
generator in NumPy. This initialisation is done using a seed number, which in this instance
is the system time in seconds16 because it is easy to access and constantly changing.
Initialising random numbers in this way is good if you do not want exactly the same result
each time. If, however, you do want the same result, say when debugging the code, you can
use a fixed seed number; although the stream of numbers that comes out appears random, it
will be the same sequence each time the program is run.
from time import time
random.seed(int(time()))
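As a brief aside (this check is not part of the original example), the effect of a fixed seed is easy to demonstrate: re-seeding with the same number reproduces exactly the same 'random' sequence, which is what makes debugging runs repeatable.
random.seed(42)                # a fixed, illustrative seed
first = random.random(3)       # three pseudo-random numbers
random.seed(42)                # re-seed with the same number
second = random.random(3)      # exactly the same three numbers again
print((first == second).all()) # True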
Next we define the kernel functions, which are used to replace the explicit calculation
of dot products between feature vectors in a high-dimensional space, as a measure of the
coincidence between two data points. We give only two simple examples of kernel
functions here, although more could be considered. Both functions take in two feature
vectors (data points) and give a single-value measure of their similarity or coincidence.
The Gaussian kernel (normal curve) is perhaps the most general and widely used form.
The calculation for the Gaussian kernel is fairly simple and involves only a single
parameter, sigma, which dictates the width of the function. First the difference between
the two feature vectors is calculated, then the sum of the squared differences (a single
scalar value) is conveniently obtained using the dot product. This number is then scaled
(divided by twice sigma squared) and the exponential of its negative is returned. The kernel
thus gives a measure of coincidence that is always positive; how quickly it diminishes as
the vectors diverge depends on sigma.
def kernelGauss(vectorI, vectorJ, sigma=1.0):
  sigma2 = sigma * sigma
  diff = vectorI - vectorJ
  dotProd = dot(diff,diff)
  return exp( -0.5 * dotProd / sigma2 )
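As a small illustrative check (the vectors here are invented, not taken from the original text), identical vectors give the maximum kernel value of 1.0, while differing vectors give a smaller positive value that grows as sigma is widened:
v1 = array([1.0, 2.0])
v2 = array([2.0, 0.0])
print(kernelGauss(v1, v1))             # 1.0 : identical vectors
print(kernelGauss(v1, v2))             # exp(-2.5), roughly 0.082, with sigma=1.0
print(kernelGauss(v1, v2, sigma=2.0))  # roughly 0.535 : a wider kernel is more forgiving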
The linear kernel is also very simple; the function calculates the difference between each
vector and the mean vector of the input data (which is calculated beforehand and passed in)
and returns the dot product of these differences. Effectively the value generated is a
measure of the coincidence of the input vectors relative to the centre of the data. It should
be noted that the output can be negative, if the input vectors point in opposite directions
relative to the mean.
def kernelLinear(vectorI, vectorJ, mean):
  diffI = vectorI - mean
  diffJ = vectorJ - mean
  return dot(diffI, diffJ)
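Again as a quick, invented illustration (not from the original data), two vectors lying on opposite sides of the mean give a negative value, whereas vectors on the same side give a positive one:
mean = array([2.0, 1.0])           # an assumed data centre
vA = array([1.0, 2.0])
vB = array([3.0, 0.0])             # opposite side of the mean from vA
vC = array([0.5, 2.5])             # same side of the mean as vA
print(kernelLinear(vA, vB, mean))  # -2.0 : opposite directions from the centre
print(kernelLinear(vA, vC, mean))  #  3.0 : similar directions from the centre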
Next we get to the main support vector machine training algorithm, the details of which
follow the method of ‘successive over-relaxation’ as described in the reference given
below.17 In essence this will find the optimal boundary between two sets of input data.
The inputs to the function comprise an array of known classifications containing +1 or −1
for the two classes, an array of training feature vectors (data), which are typically
normalised to lie between 0.0 and 1.0, a kernel function like those described above, a set
of parameters to be used with the kernel function and three numeric values. It should be
noted that all of the input arrays are expected to be of the numpy.array type. The limit
parameter is the upper limit on the amount of 'support' that a data vector can provide
towards the optimisation of the classifying boundary plane. The maxSteps number is
simply the maximum number of cycles that we will allow during the training of the SVM.
The last argument (relax) takes values between 0 and 2.0 and is known as the 'relaxation
parameter'. This governs the rate at which the successive over-relaxation technique
performs the optimisation. Under many circumstances it does not need to be changed
from its default value.
def svmTrain(knowns, data, kernelFunc, kernelParams,
             limit=1.0, maxSteps=500, relax=1.3):
Next we do some initialisation to set up various numbers and arrays in the calculation.
Firstly, we define the size of the problem from the dimensions of the input data array. The
dimensionality of the feature vectors that comprise the training data is n and the number of
training data points is m. Next we initialise the supports array, which is the thing we are
trying to optimise by the successive over-relaxation technique that the training is based on.
The supports array will define which of the training data are support vectors, and hence
where the classification hyperplane lies. Initially the support values are set as zeros using
the handy zeros() function from NumPy, noting that we pass in float (an inbuilt Python
data type) as an argument to state that the array is composed of floating point numbers,
rather than integers. The change value will be used to record how much change is made to
the supports array during each optimisation cycle, so that convergence can be tested. Its
starting value is 1.0, which is arbitrary but large enough to allow the first optimisation
cycle to proceed.
  m, n = data.shape
  supports = zeros(m, float)
  change = 1.0  # arbitrary but big start
The next initialisation is to pre-calculate the kernelArray. This square array holds the
coincidence (similarity) values between all pairs of the m feature vectors that make up the
input training data. The kernel function is applied to each pair of data vectors and the
resulting coincidence values are stored in the array using the pair of indices. Once the
kernel array is filled we can quickly look up values using the appropriate indices, rather
than having to calculate a value each time. It should be noted that, in order to keep this
example relatively simple and to improve calculation speed, we calculate all of the kernel
values at the start, but this requires memory to store the information. If the training data
set were very large, compared to the amount of available memory, then more on-the-fly
calculation and a smaller memory cache could be used. Because of the successive
over-relaxation method used to perform the optimisation, the kernel array then has all of its
elements incremented by 1.0 (note that the inner loop below includes j equal to i, so the
diagonal, where each vector is compared with itself, is also filled; for the Gaussian kernel
these diagonal values are 1.0 before the increment).
  kernelArray = zeros((m,m), float)
  for i in range(m):      # xrange() in Python 2
    for j in range(i+1):  # xrange() in Python 2
      coincidence = kernelFunc(data[i], data[j], *kernelParams)
      kernelArray[i,j] = kernelArray[j,i] = coincidence

  kernelArray += 1
The main part of the function performs the actual optimising loops that employ the
method of successive over-relaxation. We use a while loop to continue the optimisation of
the supports array, until it converges (the change is small) or we hit the maxSteps limit.
The first line inside the loop makes a copy of the supports array so that we can work out
how much change has occurred during the cycle.
  steps = 0
  while (change > 1e-4) and (steps < maxSteps):
    prevSupports = supports.copy()
Next we build sortSup, a list of (value, index) pairs from the supports. Naturally, for the
first optimisation cycle all the values will be zero, but that will quickly change in
subsequent iterations. We get the indices by using the inbuilt enumerate() function inside a
list comprehension, where we collect the index i alongside each value. The sortSup list is
then sorted in order of decreasing value, which will be according to the first element of
each (val,i) tuple; this follows the published successive over-relaxation procedure, but
other implementations18 favour using a random order, the code for which is shown in the
comment in the code example below.
    sortSup = [(val,i) for i, val in enumerate(supports)]
    sortSup.sort(reverse=True)
    #random.shuffle(sortSup) - also possible
Eventually we get to the actual successive over-relaxation optimisation. A precise
explanation of what is being done will not be given here; this is only a programming book
with limited space after all. However, the keener, more mathematically inclined reader can
investigate the SVM section of the book reference given in footnote 18. What we will aim to do is
merely give a hint at why things are done. The next loop is the first of two optimisation
stages. At this point we consider all of the data vectors, or more precisely all of the indices
in range(m). For each index, representing one of the training data points, a summation
named pull is calculated by multiplying the supports values by the required row of the
kernelArray matrix and the values for the known classification. Here the kernel array will
specify how similar the data vector at index i is to all of the other vectors. The knowns
array contains values of +1.0 or −1.0, thus each individual multiplication that goes into the
final summation will pull the value to the positive or negative side, depending on the
classification of the training data vector. Note that we are making use of NumPy arrays
here and that when we do multiplication it is being performed in an element-by-element
manner within the arrays.
    for support, i in sortSup:
      pull = sum( supports * kernelArray[i,:] * knowns )
The pull value, towards the negative or positive, is used to calculate an adjustment for
the support at this position. The product knowns[i] * pull will be positive if both the pull
and the known classification have the same sign, i.e. go in the same direction. Taking 1.0
from this means that when the values have the same sign the adjustment will be closer to
zero compared to when they have a different sign. Dissimilar signs mean that the current
supports values do not separate the training data vectors in the right way. Note that
negative values of adjust will actually increase the supports[i] value, so that mismatching
training vectors get more influence on the placement of the classification boundary. The
actual adjustment to the supports values is scaled relative to the maximum amount of
support for that data point (i.e. the kernel array value at the diagonal; kernelArray[i,i]) and
multiplied by the relax parameter, which governs the rate of change and thus convergence.
Note that this adjustment can overshoot the optimum, so a bigger relax is not always
better. Finally, the adjusted value for the support at position i is bounded by the upper and
lower limits; it must be at least zero and at most limit.
      adjust = knowns[i] * pull - 1.0
      supports[i] -= adjust * relax / kernelArray[i,i]
      supports[i] = max(0.0, min(limit, supports[i]))
At this point we could go through another round optimising all the support values.
However, we will get quicker convergence if we do some more optimising in this cycle for
only a subset of the data points: those with a support value that is above zero. Hence, we
calculate the nonZeroSup subset of points which have at least some support (val > 0) in a
similar manner to what was done for sortSup, by using enumerate to get the value and the
index, so that later we can sort according to the value. Note that if there is none of this
type of support we skip to the next cycle with continue. It is reasonable to suggest that
these points on the edges of the separation margin need further attention and more
optimisation, and in practice this really does help.
    nonZeroSup = [(val,i) for i, val in enumerate(supports) if val > 0]
    if not nonZeroSup:
      continue
    nonZeroSup.sort()
We collect the indices inds for the non-zero support positions, because this is handy for
the next stage, and then estimate how many extra optimisation cycles (niter) to perform on
the supporting values, based on how many remain. This is not a precise calculation, but the
general principle is that the more marginal support vectors there are, the more cycles are
required. We guarantee at least one extra cycle and add a number based on the square root
of the count of non-zero supports; for example, 100 supporting points would give
1 + int(sqrt(100)) = 11 extra cycles.
    inds = [x[1] for x in nonZeroSup]
    niter = 1 + int(sqrt(len(inds)))
Next we repeat exactly the same kind of support-value optimisation as before, but this
time only using the indices of the non-zero, supporting points. These extra sub-cycles will
often be significantly quicker than the previous optimisation that goes through all the
points, because the number of supporting points is usually relatively small. Note how we
use index arrays of the form 'knowns[inds]' (NumPy's so-called fancy indexing) to perform
operations on only the required subset of the NumPy arrays.
    for i in range(niter):  # xrange in Python 2
      for j in inds:
        pull = sum(kernelArray[j,inds] * knowns[inds] * supports[inds])
        adjust = knowns[j] * pull - 1.0
        supports[j] -= adjust * relax / kernelArray[j,j]
        supports[j] = max(0.0, min(limit, supports[j]))
Finally, after all the optimisation is done for this cycle, we measure how much change we
have made to the support values and increment the step count. The change is simply the
square root of the sum of the squared differences between the old and new values.
    diff = supports - prevSupports
    change = sqrt( sum(diff * diff) )
    steps += 1
At the end of the function we pass back the optimised array of support values, the
number of steps that were taken during the optimisation and the array containing the
kernel values (the measures of coincidence between the data vectors), to save calculating
it again when investigating the decision hyperplane.
  return supports, steps, kernelArray
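To give a feel for how the function might be called (this example is not part of the original text; the data and the sigma value are invented for illustration), we can train on two randomly generated clouds of normalised two-dimensional points, one labelled +1 and the other −1, using the Gaussian kernel:
numPoints = 20
catA = random.random((numPoints, 2)) * 0.5         # cluster in the lower-left of the unit square
catB = 0.5 + random.random((numPoints, 2)) * 0.5   # cluster in the upper-right
data = array(list(catA) + list(catB))
knowns = array([1.0] * numPoints + [-1.0] * numPoints)
kernelParams = (0.5,)   # an assumed sigma for kernelGauss, unpacked inside svmTrain
supports, steps, kernelArray = svmTrain(knowns, data, kernelGauss, kernelParams)
print(steps)                      # number of optimisation cycles used
print(len(nonzero(supports)[0]))  # number of data points with non-zero support
The points reported by nonzero() are those that ended up with some support, i.e. the ones that lie on or within the separation margin and so define the classification boundary.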