Feature space
The k-nearest neighbour algorithm described below is a very simple yet surprisingly
powerful method to perform classification. The idea is that you describe the known,
classified data as points, with coordinate locations, in a feature space. Just like real three-
dimensional space, a feature space is defined by separate axes and a location in that space
is defined by coordinates on each of the axes. A simple example of a feature space is
colour, which may be described
1
by red, green and blue axes. Any colour within and
including the extremes of black (zero red, green and blue) and white (maximum red, green
and blue) is described by a certain amount of red, green and blue, in other words positions
on these axes, and thus a colour may be described as a location vector (red, green, blue).
We may also measure the distance between colour vectors, which is effectively to say how
similar the colours are. Note that with this colour example we don’t have the same kind of
unbounded continuum that we have in real three-dimensional space: it is meaningless to
have a negative colour value and we don’t let the values exceed the maximum intensity;
nothing can be whiter than white.
When dealing with feature spaces in general you can have a mixture of both bounded
and unbounded axes, and as many numbers of ‘dimensions’ as is required to describe your
data, although it is often a good idea to normalise the numeric data so that it fits in a
limited range of values (typically −1 to +1). A good example of high-dimensionality data
occurs in the use of biological sequences. For example, if you have input data that consists
of DNA sequences that are five base pairs long, then because we can have one of four
independent nucleotide residue types at each position along the length we need a vector of
length 20 (20 ‘dimensions’) to describe a five-letter sequence; 5 positions times 4 residue
types. In this case with axes for G, C, A and T, the sequence ‘TATGA’ would have the
vector (0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0), where each sequential group of
four numbers corresponds to one position in the DNA sequence. Note that 1 indicates the
presence of a letter on its axis, and 0 the absence. If we moved from describing pure
sequences to alignment profiles (i.e. proportions of the residue type at each position) we
could have a sequence with fractional values instead of the plain 0 or 1. You might be
tempted to encode DNA sequences with a numbered system where, for example, A = 0, T
= 1, C = 2, G = 3, so the above example would be (1,0,1,3,0). However, this would not
work because the different nucleotides are entirely alternative states and not points on a
continuum where one type is meaningfully ‘closer’ to another.
Do'stlaringiz bilan baham: |