[Figure: an initially random 100×100 colour pixel map (left) and the effect of the self-organising map.]
The next machine learning example that we will cover is another kind of artificial neural network: when the network ‘learns’, it takes input data (more feature vectors) and changes its internal weights so that it can reproduce a known answer. The supervisory process of adjusting these weights, so that the output is as close to the right answer as possible for some known data, is usually referred to as training.
Naturally, when training a neural network it is important to have as large and as
representative a set of training data as possible. The predictive power comes from the fact
that the neural network can accept input from data that it has not seen before, that was not
used in the training. Predictions can be made for unseen data because inputs that resemble
those that were used during the initial training will give similar outputs. In this regard it
doesn’t actually matter very much what the input or output data represents; the patterns and connections between them can be learnt nonetheless.
The neural network that we describe below is composed of a series of nodes arranged
into three layers. The prediction of this network will proceed by a feed-forward
mechanism, whereby input (often referred to as ‘signal’) is entered into the first layer of
nodes. This input data is then moved to the middle or hidden layer to which it is
connected, before finally reaching the last output layer of nodes. It is possible to construct
feed-forward networks with more than three layers (i.e. more hidden layers). However,
these can be more difficult to train, and it has been shown that for many situations three layers are sufficient to do everything that more layers can do [10] (although the number of nodes will differ). The number of nodes in the three-layer network depends on the
problem being addressed. The number of input nodes represents the size of the input
vector; the value of each feature goes to a different input node. For example, if the input
was a colour with red, green and blue features, there would be three input nodes. If the
input was a DNA sequence composed of four base letters, there would be four input nodes
for each position of the sequence analysed, thus a sequence of length ten would need 40
inputs. The number of output nodes depends on the problem, but there is some flexibility
to represent the data in different ways. For example, if the network is used to predict an
angle then the output could be a single number or it could be the sine and the cosine of the
angle separately. When being used for categorisation, then there would be as many output
nodes as there are categories. If the neural network were instead being used to approximate a continuous function, then the number of output nodes would depend on how many axes (output dimensions) are required. The number of hidden nodes used will depend on the type and complexity of the problem, but will normally be optimised to give the best predictions. Numbers
between three and ten are common. The smaller the number of nodes the quicker it is to
optimise the network during training, but the fewer the number of patterns that can be
detected in the data. The optimum number of hidden nodes can often be smaller than the
number of inputs but is usually larger than the number of outputs. A convenient way to
think of things is that the number of hidden nodes represents the complexity
(dimensionality) of the problem, which is not necessarily related to the size of the input or
output.
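To make the idea of input nodes concrete, the short sketch below (illustrative only; the function and variable names are our own, not from any particular library) one-hot encodes a DNA sequence into a feature vector, so that a sequence of length ten yields the 40 input values mentioned above, and then picks arbitrary hidden and output layer sizes.

import numpy as np

# Four input values per base, so a length-10 sequence gives 40 inputs
BASE_INDEX = {'G': 0, 'C': 1, 'A': 2, 'T': 3}

def dnaToInputVector(seq):
    vec = np.zeros(4 * len(seq))
    for i, base in enumerate(seq):
        vec[4*i + BASE_INDEX[base]] = 1.0  # one-hot: switch on the slot for this base
    return vec

inputVec = dnaToInputVector('GATTACAGCT')
print(len(inputVec))   # 40 input nodes for a length-10 sequence

# Illustrative layer sizes: many inputs, a few hidden nodes, few outputs
numInputs, numHidden, numOutputs = len(inputVec), 5, 2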
The three layers of nodes in our feed-forward network will be connected together. Each
node will be connected to all of the others in a neighbouring layer. Thus, each input node
is connected to all hidden nodes; each hidden node is connected to all of the input and
output nodes; and each output node is connected to each hidden node. The properties of a neural network
emerge because the strength of the connection between nodes can vary during the learning
process; so some nodes become more or less well connected. If a connection ends up
having a zero weight then its linked nodes are effectively disconnected; thus the network
can represent a large number of possible internal organisations. A node will be connected
to many others to varying degrees, but the actual feed-forward action of the network that
is used to make predictions (generate output) uses what is known as a trigger function to
adjust the response. In essence, a node collects input signals on one side and has to
combine these in some manner to generate output on the other side, which could be an
intermediate or final output signal. The input signals are added together, but the strength of the resulting output that is sent on to any nodes in the next layer is altered. Firstly, the sum of the combined inputs is scaled to lie within certain minimum and maximum bounds for practical purposes. Secondly, this scaled sum is passed through the trigger function, which increases or decreases the effect that certain amounts of input have. Sometimes the trigger
function is a two-state switch where smaller input values produce very little response, but
above a particular threshold the response is very strong; this is perhaps analogous to the
firing of a neuron inside a brain. However, many types of trigger functions are possible,
and the one we employ here is the popular hyperbolic tangent function (tanh; see Figure 24.5). Using the sigmoid-shaped hyperbolic tangent curve means that in mid ranges the
strength of a node’s output is roughly proportional to its input, but at the high and low
input extremes the output is attenuated towards limits. This function also benefits from
having an easily calculated gradient (required for training) and has successfully been used
in many diverse situations.
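As a minimal sketch of this feed-forward calculation (the weight matrices and names here are illustrative assumptions, not a full implementation with bias nodes or training), each layer’s signal is the hyperbolic tangent of a weighted sum of the signals from the previous layer:

import numpy as np

def feedForward(inputVec, weightsInHid, weightsHidOut):
    # Weighted sum into each hidden node, then the tanh trigger function
    hidSignal = np.tanh(np.dot(inputVec, weightsInHid))
    # Weighted sum into each output node, again passed through tanh
    outSignal = np.tanh(np.dot(hidSignal, weightsHidOut))
    return outSignal

# Example: 3 inputs (e.g. an RGB colour), 4 hidden nodes, 2 outputs,
# with small random weights such as a network might have before training
weightsInHid = np.random.normal(0.0, 0.5, (3, 4))
weightsHidOut = np.random.normal(0.0, 0.5, (4, 2))
print(feedForward(np.array([0.2, 0.9, 0.1]), weightsInHid, weightsHidOut))

Because tanh saturates towards ±1, very large or very small summed inputs give outputs close to these limits, which is the attenuation at the extremes described above.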