Clearly, the neural network processes input and produces output, and this works on
many types of input data with varying features. However, a critical point to notice is that
this neural network has no notion of the time at which an event (input) occurs; it sees
only the input itself, not when it happened. How can the neural network shown above
handle trends in events, seasonality in events, and so on? How can it learn from the past
and apply it to the present and future? Recurrent neural networks try to address this by
incrementally building the neural network, feeding signals from the previous timestep
into the current network.
Figure 6-6. A recurrent neural network
You can see that an RNN is a neural network with multiple layers, steps, or stages.
Each stage represents a time T; the RNN at time T+1 takes the output of the RNN at
time T as one of its input signals. Each stage passes its output to the next stage. The
hidden state, which is passed from one stage to the next, is the key to why the RNN
works so well; this hidden state is analogous to a form of memory retention. An RNN
layer (or stage) acts as an encoder as it processes the input sequence and returns its
own internal state. This state serves as the input to the decoder in the next stage, which
is trained to predict the next point of the target sequence, given the previous points of
the target sequence. Specifically, it is trained to turn the target sequences into the same
sequences but offset by one timestep in the future.
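To make the idea of passing a hidden state from one stage to the next concrete, here is a minimal forward-pass sketch in plain NumPy. The weight names (W_xh, W_hh, b_h), the tanh activation, and the toy dimensions are illustrative assumptions, not code from any particular library.

import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    # Run a simple RNN over a sequence, returning the hidden state at every timestep.
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)          # initial hidden state ("memory") starts empty
    hidden_states = []
    for x_t in inputs:                 # one iteration per timestep T
        # The new hidden state depends on the current input AND the previous
        # hidden state -- this is the signal passed from stage T to stage T+1.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Toy usage: a sequence of 5 timesteps, each a 3-dimensional input vector.
rng = np.random.default_rng(0)
sequence = [rng.normal(size=3) for _ in range(5)]
W_xh = rng.normal(size=(4, 3))         # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))         # hidden-to-hidden weights (the recurrence)
b_h = np.zeros(4)
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(len(states), states[-1].shape)   # 5 hidden states, each of size 4

The final hidden state returned here is the internal state an encoder stage would hand to the next (decoder) stage.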
Backpropagation is used when training an RNN, just as in other neural networks, but
in RNNs there is also a time dimension. In backpropagation, we take the derivative
(gradient) of the loss with respect to each of the parameters and then shift the
parameters in the opposite direction of the gradient, with the goal of minimizing the
loss. Because we are moving through time, there is a loss at every timestep; we sum
these losses across all timesteps to get the total loss, which is equivalent to summing
the gradients across time. This procedure is known as backpropagation through
time (BPTT).
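As a small illustration of summing per-timestep losses, the following sketch (with made-up predictions, targets, and a squared-error loss chosen only for illustration) shows that the total loss used for backpropagation through time is simply the sum over timesteps.

import numpy as np

predictions = np.array([0.2, 0.5, 0.9, 1.4])   # network output at timesteps 1..4
targets     = np.array([0.0, 0.4, 1.0, 1.5])   # desired output at each timestep

per_step_loss = (predictions - targets) ** 2   # a loss value at every timestep
total_loss = per_step_loss.sum()               # total loss that is backpropagated

# Because the total loss is a sum, its gradient with respect to any parameter
# is the sum of the per-timestep gradients, which is why summing losses across
# time is the same as summing gradients across time.
print(per_step_loss, total_loss)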
The problem with the above recurrent neural networks, constructed from regular
neural network nodes, is that when we try to model dependencies between sequence
values that are separated by a significant number of other values, the gradient at
timestep T depends on the gradients at T-1, T-2, and so on. As we move back along the
timesteps, the chain of gradients gets longer and longer, and the contribution of the
earliest gradients gets smaller and smaller. This is what is known as the vanishing
gradient problem: the gradients of the earlier layers become so small that the network
cannot learn long-term dependencies. As a result, the RNN becomes biased toward
short-term data points.
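A rough way to see this shrinking contribution is to multiply a gradient by the same recurrent factor once per step back in time; the factor of 0.5 below is an arbitrary stand-in for the product of a recurrent weight and an activation derivative, chosen only to illustrate the effect.

recurrent_factor = 0.5                 # assumed |weight * activation derivative| < 1
gradient_contribution = 1.0
for steps_back in range(1, 11):
    gradient_contribution *= recurrent_factor
    print(f"{steps_back} steps back: contribution ~ {gradient_contribution:.6f}")
# After 10 steps the contribution is about 0.001, so inputs that far back barely
# influence the parameter update -- hence the vanishing gradient problem.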
LSTM networks are one way of solving this problem in RNNs.