A Python support vector machine

The Python example of a support vector machine starts with the import of various array and mathematical NumPy functions that will be needed. As with the neural network example, the use of array operations aims to focus attention on the high-level operation being performed, and thus the reasoning involved in the method. Helpfully, the NumPy routines are also quicker to execute than looping through large arrays. There are several NumPy imports and it should be noted that many of these have the same name as the equivalent in the standard math module, although as you might expect these versions work on whole arrays, not just single numbers:

from numpy import exp, power, array, zeros, sqrt
from numpy import nonzero, random, abs, sum, dot

Next, because we will be using some random numbers, we initialise the random number generator in NumPy. This initialisation is done using a seed number, which in this instance is the system time in seconds¹⁶ because it is easy to access and constantly changing. Initialising random numbers in this way is good whenever you do not want to have exactly the same result each time. If you do want the same result, say when debugging the code, you can use a fixed seed number; the stream of numbers that comes out will still appear random, but it will be the same stream each time the program is run.

from time import time
random.seed(int(time()))
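
If a repeatable run is wanted instead, for example while debugging, a fixed seed can be used; the value below is an arbitrary illustration, not part of the original example:

random.seed(271828)  # any fixed integer gives the same pseudo-random stream on every run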

Next we define the kernel functions, which are used to replace the explicit calculation of dot products between feature vectors in high-dimensionality space, as a measure of the coincidence between two data points. We give only two simple examples of kernel functions here, although more could be considered. Both functions take in two feature vectors (data points) and give a single-value measure of their similarity or coincidence. The Gaussian kernel (normal curve) is perhaps the most general and widely used form. The calculation for the Gaussian kernel is fairly simple and involves only a single parameter, sigma, which dictates the width of the function. First the difference between the two feature vectors is calculated, then the sum of the squared differences (a single scalar value) is conveniently obtained using the dot product. This number is finally scaled and the exponent of its negative returned. The kernel gives a measure of coincidence that is always positive; how quickly it diminishes with distance depends on sigma.

def kernelGauss(vectorI, vectorJ, sigma=1.0):
  sigma2 = sigma * sigma
  diff = vectorI - vectorJ
  dotProd = dot(diff,diff)
  return exp( -0.5 * dotProd / sigma2 )
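
As a quick check of the behaviour just described (this snippet is illustrative and not part of the original example), the kernel returns 1.0 for identical vectors and decays towards zero as the vectors move apart:

vecA = array([0.0, 0.0])
vecB = array([1.0, 1.0])
print(kernelGauss(vecA, vecA))  # 1.0 for identical vectors
print(kernelGauss(vecA, vecB))  # exp(-1.0), roughly 0.37, for separated vectors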

The linear kernel is also very simple; the function calculates the difference of each input vector from the mean vector of the data (which is calculated beforehand and passed in) and returns the dot product of these differences. Effectively the value generated is a measure of the coincidence of the input vectors relative to the centre of the data. It should be noted that the output can be negative, if the input vectors point in opposite directions relative to the mean.

def kernelLinear(vectorI, vectorJ, mean):
  diffI = vectorI - mean
  diffJ = vectorJ - mean
  return dot(diffI, diffJ)
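
Again purely as an illustration (the data here are invented, not from the original text), two points lying on opposite sides of the data centre give a negative value:

points = array([[0.0, 0.0], [1.0, 1.0]])
mean = points.mean(axis=0)  # centre of this tiny data set: [0.5, 0.5]
print(kernelLinear(points[0], points[1], mean))  # -0.5: the vectors oppose each other about the mean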

Next we get to the main support vector machine training algorithm, the details of which follow the method of ‘successive over-relaxation’ as described in the reference given below.¹⁷ In essence this will find the optimal boundary between two sets of input data. The inputs to the function comprise an array of known classifications containing +1 or −1 for the two classes, an array of training feature vectors (data), which are typically normalised to lie between 0.0 and 1.0, a kernel function like those described above, a set of parameters that are used with the kernel function, and three numeric values. It should be noted that all of the input arrays are expected to be of the numpy.array type. The limit parameter is the upper limit to the amount of ‘support’ that a data vector can provide towards the optimisation of the classifying boundary plane. The maxSteps number is simply the maximum number of cycles that we will allow during the training of the SVM. The last argument (relax) takes values between 0 and 2.0 and is known as the ‘relaxation parameter’. This governs the rate at which the successive over-relaxation technique performs the optimisation. Under many circumstances it does not need to be changed from its default value.

def svmTrain(knowns, data, kernelFunc, kernelParams,
             limit=1.0, maxSteps=500, relax=1.3):

Next we do some initialisation to set up various numbers and arrays used in the calculation. Firstly, we define the size of the problem from the dimensions of the input data array: the dimensionality of the feature vectors that comprise the training data is n and the number of training data points is m. Next we initialise the supports array, which is the thing we are trying to optimise by the successive over-relaxation technique that the training is based on. The supports array will define which of the training data are support vectors, and hence where the classification hyperplane lies. Initially the support values are set as zeros using the handy zeros() function from NumPy, noting that we pass in float (an inbuilt Python data type) as an argument to state that the array is composed of floating point numbers, rather than integers. The change value will be used to record how much change is made to the supports array during the optimisation, testing for convergence. Its starting value is 1.0, which is arbitrary but large enough to cause the first optimisation cycle to proceed.

  m, n = data.shape
  supports = zeros(m, float)
  change = 1.0 # arbitrary but big start

The next initialisation is to pre-calculate the kernelArray. This square array represents the coincidence (similarity) between all pairs of the m feature vectors that make up the input training data. The kernel function is used for all pairs of data vectors to calculate the coincidence values, which are stored in the array using indices. Once the kernel array is filled we can quickly look up values using the appropriate indices, rather than having to calculate a value each time. It should be noted that, in order to keep this example relatively simple and improve calculation speed, we calculate all of the kernel values at the start, but this requires memory to store the information. If the training data set were very large, compared to the amount of available memory, then more on-the-fly calculation and a smaller memory cache could be used instead. Because of the successive over-relaxation method used to perform the optimisation, the kernel array then has all of its elements incremented by 1.0 (note that this also keeps the diagonal elements, which appear as divisors later in the optimisation, at or above 1.0).

  kernelArray = zeros((m,m), float)
  for i in range(m): # xrange() in Python 2
    for j in range(i+1): # xrange() in Python 2
      coincidence = kernelFunc(data[i], data[j], *kernelParams)
      kernelArray[i,j] = kernelArray[j,i] = coincidence

  kernelArray += 1
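
The on-the-fly alternative mentioned above is not part of the original example, but a minimal sketch, assuming a hypothetical helper named getKernelValue and a plain dictionary as the cache, might look like the following; the rest of the routine would then call this helper instead of indexing kernelArray (a real implementation would also evict old entries to keep the cache small):

kernelCache = {}  # hypothetical cache keyed by index pairs

def getKernelValue(i, j, data, kernelFunc, kernelParams):
  # Compute (or re-use) a single kernel value rather than storing the full m x m array
  key = (min(i, j), max(i, j))
  if key not in kernelCache:
    kernelCache[key] = kernelFunc(data[i], data[j], *kernelParams) + 1.0
  return kernelCache[key]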

The main part of the function performs the actual optimising loops that employ the method of successive over-relaxation. We use a while loop to continue the optimisation of the supports array until it converges (the change is small) or we hit the maxSteps limit. The first line inside the loop makes a copy of the supports array so that we can work out how much change has occurred during the cycle.

  steps = 0
  while (change > 1e-4) and (steps < maxSteps):
    prevSupports = supports.copy()

Next we build sortSup, a list of value and index pairs taken from the supports. Naturally, for the first optimisation cycle all the values will be zero, but that will quickly change in subsequent iterations. We get the indices by using the inbuilt enumerate() function inside a list comprehension, where we collect the index i. The sortSup list is then sorted in order of decreasing value, which will be according to the first element in each (val, i) tuple; this follows the published successive over-relaxation procedure, but other implementations¹⁸ favour using a random order, the code for which is shown in the comment in the code example below.

    sortSup = [(val,i) for i, val in enumerate(supports)]
    sortSup.sort(reverse=True)
    #random.shuffle(sortSup) - also possible

Eventually we get to the actual successive over-relaxation optimisation. A precise explanation of what is being done will not be given here; this is only a programming book with limited space, after all. However, the keener, more mathematical reader can investigate the SVM section of the book reference given in footnote 18. What we aim to do here is merely give a hint at why things are done. The next loop is the first of two optimisation stages. At this point we consider all of the data vectors, or more precisely all of the indices in range(m). For each index, representing one of the training data points, a summation named pull is calculated by multiplying the supports values by the required row of the kernelArray matrix and by the values for the known classification. Here the kernel array specifies how similar the data vector at index i is to all of the other vectors. The knowns array contains values of +1.0 or −1.0, so each individual multiplication that goes into the final summation will pull the value to the positive or negative side, depending on the classification of the corresponding training data vector. Note that we are making use of NumPy arrays here, so the multiplication is performed in an element-by-element manner within the arrays.

    for support, i in sortSup:
      pull = sum( supports * kernelArray[i,:] * knowns )

The pull value, towards the negative or positive, is used to calculate an adjustment for the support at this position. The product knowns[i] * pull will be positive if both the pull and the known classification have the same sign, i.e. go in the same direction. Subtracting 1.0 means that when the values have the same sign the adjustment will be closer to zero than when they have different signs. Dissimilar signs mean that the current supports values do not separate the training data vectors in the right way. Note that negative values of adjust will actually increase the supports[i] value, so that mismatching training vectors get more influence on the placement of the classification boundary. The actual adjustment to the supports value is scaled relative to the maximum amount of support for that data point (i.e. the kernel array value at the diagonal, kernelArray[i,i]) and multiplied by the relax parameter, which governs the rate of change and thus convergence. Note that this adjustment can overshoot the optimum, so a bigger relax is not always better. Finally, the adjusted value for the support at position i is bounded by the upper and lower limits; it must be at least zero and at most limit.

      adjust = knowns[i] * pull - 1.0
      supports[i] -= adjust * relax / kernelArray[i,i]
      supports[i] = max(0.0, min(limit, supports[i]))
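
For readers who like to see the update as a formula, these three lines implement the following (writing α for the support values, y for the known classes, K for the kernel array after the 1.0 increment, ω for relax and C for limit):

\alpha_i \leftarrow \min\Bigl(C,\; \max\Bigl(0,\; \alpha_i - \frac{\omega}{K_{ii}}\bigl(y_i \textstyle\sum_j \alpha_j\, y_j\, K_{ij} - 1\bigr)\Bigr)\Bigr)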

At this point we could go through another round of optimising all the support values. However, we will get quicker convergence if we do some more optimising in this cycle for only a subset of the data points: those with a support value that is above zero. Hence, we calculate the nonZeroSup subset of points which have at least some support (val > 0) in a similar manner to sortSup, by using enumerate() to get the value and the index, so that we can then sort according to the value. Note that if there is no support of this kind we skip to the next cycle with continue. It is reasonable to suggest that these points, on the edges of the separation margin, need further attention and more optimisation, and in practice this really does help.

    nonZeroSup = [(val,i) for i, val in enumerate(supports) if val > 0]

    if not nonZeroSup:
      continue

    nonZeroSup.sort()

We collect the indices inds for the non-zero support positions, because this is handy for the next stage, and then estimate how many extra optimisation cycles (niter) to do on the supporting values, based on the number that remain. This is not a precise calculation; the general principle is that the more marginal support vectors there are, the more cycles are required. The formula guarantees at least one extra cycle and otherwise grows with the square root of the number of non-zero supports.

    inds = [x[1] for x in nonZeroSup]
    niter = 1 + int(sqrt(len(inds)))

Next we repeat exactly the same kind of support value optimisation as before, but this time only using the indices of the non-zero, supporting points. These extra sub-cycles will often be significantly quicker than the previous optimisation that goes through all the points, since the number of supporting points is usually relatively small. Note how we index the NumPy arrays with the list of indices, in the form knowns[inds], to perform operations on only the required subset.


    for i in range(niter): # xrange in Python 2
      for j in inds:
        pull = sum(kernelArray[j,inds] * knowns[inds] * supports[inds])
        adjust = knowns[j] * pull - 1.0
        supports[j] -= adjust * relax / kernelArray[j,j]
        supports[j] = max(0.0, min(limit, supports[j]))

Finally, after all the optimisation is done for this cycle, we see how much change has been made to the support values and increment the step count. The change is simply the square root of the sum of the squared differences between the old and new values.

    diff = supports - prevSupports
    change = sqrt( sum(diff * diff) )
    steps += 1

At the end of the function we pass back the optimised array of support values, the number of steps that were taken during the optimisation and the array containing the kernel values (the measures of coincidence between the data vectors), to save calculating it again when investigating the decision hyperplane.

  return supports, steps, kernelArray
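
The remainder of the chapter goes on to use the trained SVM. Purely to illustrate the call signature described above (the clustered points below are invented for this sketch), the function could be exercised on a small synthetic data set, passing sigma for the Gaussian kernel via the kernelParams tuple:

numPoints = 20
catA = random.uniform(0.0, 0.4, (numPoints, 2))  # one cluster of 2D points, class -1
catB = random.uniform(0.6, 1.0, (numPoints, 2))  # a second cluster, class +1

testData = array(list(catA) + list(catB))
testKnowns = array([-1.0] * numPoints + [1.0] * numPoints)

supports, steps, kernelArray = svmTrain(testKnowns, testData, kernelGauss, (0.5,))
print('Trained in %d steps' % steps)
print('Support vectors: %d of %d points' % (len(nonzero(supports)[0]), 2*numPoints))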



