Figure 6-11. A sigmoid activation function
The forget gate is the first part of the LSTM stage, and it decides how much information from the prior stage should be remembered or forgotten. This is accomplished by passing the previous hidden state $h_{t-1}$ and the current input $x_t$ through a sigmoid function.
The input gate helps decide how much new information to pass to the current cell state, using both the sigmoid function and a tanh function.
The output gate controls how much information will be retained by the hidden state of this stage and passed on to the next stage. Again, the current cell state passes through the tanh function.
For reference, the compact forms of the equations for the forward pass of an LSTM unit with a forget gate are (source: Wikipedia):
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$$
$$h_t = o_t \circ \sigma_h(c_t)$$

where the initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\circ$ denotes the element-wise (Hadamard) product. The subscript $t$ indexes the time step.
Figure 6-12. A detailed LSTM network
Source: commons.wikimedia.org
Variables
• $x_t \in \mathbb{R}^d$: Input vector to the LSTM unit
• $f_t \in \mathbb{R}^h$: Forget gate's activation vector
• $i_t \in \mathbb{R}^h$: Input/update gate's activation vector
• $o_t \in \mathbb{R}^h$: Output gate's activation vector
• $h_t \in \mathbb{R}^h$: Hidden state vector, also known as the output vector of the LSTM unit
• $c_t \in \mathbb{R}^h$: Cell state vector
• $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$, and $b \in \mathbb{R}^h$: Weight matrices and bias vector parameters, which need to be learned during training
The superscripts $d$ and $h$ refer to the number of input features and the number of hidden units, respectively.
• $\sigma_g$: sigmoid function
• $\sigma_c$: hyperbolic tangent function
• $\sigma_h$: hyperbolic tangent function
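To make the forward-pass equations concrete, here is a minimal NumPy sketch of a single LSTM step. The shapes follow the variable definitions above; the weights are randomly initialized purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM forward step; W, U, and b are dicts keyed by gate name."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # * is the element-wise (Hadamard) product
    h_t = o_t * np.tanh(c_t)            # hidden state / output vector
    return h_t, c_t

d, h = 3, 5  # d input features, h hidden units (illustrative sizes)
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((h, d)) for k in 'fioc'}
U = {k: rng.standard_normal((h, h)) for k in 'fioc'}
b = {k: np.zeros(h) for k in 'fioc'}

# Initial values c_0 = 0 and h_0 = 0, as stated above
h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h), W, U, b)
```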
LSTM for Anomaly Detection
In this section, you will look at LSTM implementations for some use cases, using time series data as examples. You have a few different time series datasets to use to try to detect anomalies with LSTM. All of them have a timestamp and a value that can easily be plotted in Python.
Figure 6-13 shows the basic code to import all necessary packages. Also note the versions of the various necessary packages.
Figure 6-14 shows the code to visualize the results via a chart for the anomalies and a chart for the errors (the difference between the predicted and actual values) during training.
Figure 6-13. Code to import packages
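The exact code and package versions are in Figure 6-13; as a rough sketch only, an import block for this kind of workflow might look like the following (the specific modules and the tensorflow.keras path are assumptions, not a reproduction of the figure):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Note the versions you are running, as the chapter suggests
print(np.__version__, pd.__version__)
```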
You will use different examples of time series data to detect whether a point is normal/expected or abnormal/anomalous. Figure 6-15 shows the data being loaded into a Pandas dataframe. It shows a list of paths to datasets.
Figure 6-14. Code to visualize errors and anomalies
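Figure 6-14's code isn't reproduced here, but a hypothetical helper in the same spirit, plotting the series with flagged anomalies plus the training errors, could look like this (the 'anomaly' column and the errors argument are assumptions):

```python
import matplotlib.pyplot as plt

def visualize(df, errors):
    # Top chart: the series with anomalous points highlighted
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
    ax1.plot(df['datetime'], df['scaled_value'], label='value')
    flagged = df[df['anomaly'] == 1]  # assumes a boolean/int 'anomaly' column
    ax1.scatter(flagged['datetime'], flagged['scaled_value'],
                color='red', label='anomaly')
    ax1.legend()
    # Bottom chart: per-point prediction errors observed while training
    ax2.plot(errors)
    ax2.set_title('errors')
    plt.show()
```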
Figure 6-15. A list of paths to datasets
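As an illustration of what such a list might look like (these file names are hypothetical, not the book's actual paths):

```python
# Hypothetical paths; the real ones appear in Figure 6-15
dataFilePaths = ['data/nyc_taxi.csv',
                 'data/ambient_temperature.csv',
                 'data/cpu_utilization.csv']
```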
You will now work with one of the datasets in more detail. The dataset is nyc_taxi, which consists of timestamps and the demand for taxis. It shows the NYC taxi demand from 2014-07-01 to 2015-01-31, with an observation every half hour. There are a few detectable anomalies in this dataset: Thanksgiving, Christmas, New Year's Day, a snowstorm, etc.
Figure 6-16 shows the code to select the dataset.
Figure 6-16. Code to select the dataset
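Selecting one dataset could then be as simple as indexing into the list (a sketch, assuming the hypothetical dataFilePaths above; the book's actual code may differ):

```python
dataFilePath = dataFilePaths[0]  # the nyc_taxi dataset
```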
You can load the data from the dataFilePath as a CSV file using Pandas. Figure 6-17 shows the code to read the CSV datafile into Pandas.
Figure 6-17. Code to read a csv datafile into Pandas
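A minimal version of that read, assuming the file has a timestamp column and a value column:

```python
import pandas as pd

df = pd.read_csv(dataFilePath)
df['datetime'] = pd.to_datetime(df['timestamp'])  # parse timestamps for plotting
print(df.shape)
```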
Figure 6-18 shows the plot of the time series, with the months on the x-axis and the value on the y-axis, along with the code to generate the graph.
Let’s understand the data more. You can run the describe() command to look at the value column. Figure 6-19 shows the code to describe the value column.
Figure 6-18. Plotting the time series
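A sketch of such a plot with matplotlib, using the df loaded above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 5))
plt.plot(df['datetime'], df['value'])
plt.xlabel('month')
plt.ylabel('value')
plt.show()
```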
Figure 6-19. Describing the value column
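The describe() call itself is a one-liner:

```python
# Summary statistics for the demand column; per the text, the minimum
# is 8 and the maximum is 39,197
print(df['value'].describe())
```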
You can also plot the data using a seaborn kde plot, as shown in Figure 6-20.
The data points have a minimum of 8 and a maximum of 39,197, which is a wide range. You can use scaling to normalize the data. The formula for scaling is (x - Min) / (Max - Min). Figure 6-21 shows the code to scale the data.
Figure 6-20. Using kde to plot the value column
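A sketch of the kde plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(df['value'])  # kernel density estimate of the raw demand
plt.show()
```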
Figure 6-21. Code to scale the data
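One way to implement that formula is with scikit-learn's MinMaxScaler, which computes exactly (x - Min) / (Max - Min); the book's figure may compute it directly instead:

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df['scaled_value'] = scaler.fit_transform(df[['value']]).flatten()

# Re-plot the distribution after scaling (compare with Figure 6-22)
sns.kdeplot(df['scaled_value'])
plt.show()
```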
Now that you have scaled the data, you can plot it again using a seaborn kde plot, as shown in Figure 6-22.
Figure 6-22. Using kde to plot the scaled_value column
You can take a look at the dataframe now that you have scaled the value column. Figure 6-23 shows the dataframe with the timestamp and value columns as well as the scaled_value and datetime columns.
Figure 6-23. The modified dataframe
There are 10,320 data points in the sequence, and your goal is to find anomalies. This means you are trying to find out when data points are abnormal. If you can predict a data point at time T based on the historical data up to T-1, then you have a way of comparing the expected value with the actual value to see whether you are within the expected range of values for time T. If you predicted that ypred taxis would be in demand on January 1, 2015, then you could compare this ypred with the actual value yactual. The difference between ypred and yactual gives the error, and when you collect the errors of all the points in the sequence, you end up with a distribution of errors.
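In code, that error distribution is just the gap between predictions and actual values. A self-contained sketch with illustrative numbers (the threshold rule is one common choice, not the book's prescription):

```python
import numpy as np

# Illustrative arrays; in practice y_actual is the observed series and
# y_pred holds the model's one-step-ahead predictions
y_actual = np.array([0.20, 0.50, 0.90, 0.40])
y_pred = np.array([0.25, 0.45, 0.30, 0.42])

errors = np.abs(y_actual - y_pred)             # per-point prediction error
threshold = errors.mean() + 3 * errors.std()   # e.g., flag beyond 3 std devs
anomalies = errors > threshold                 # boolean anomaly flags
```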
To accomplish this, you will use a sequential model in Keras. The model consists of an LSTM layer and a dense layer. The LSTM layer takes the time series data as input and learns how the values evolve over time. The next layer is the dense (fully connected) layer. The dense layer takes the output from the LSTM layer as input and applies a fully connected transformation. Then, a sigmoid activation is applied to the dense layer's output so that the final value is between 0 and 1.
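A minimal sketch of such a model, assuming input windows of time_steps past observations with one feature each (the layer size and window length are illustrative, not the book's exact choices):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

time_steps = 48  # e.g., one day of half-hourly observations (an assumption)

model = Sequential([
    LSTM(64, input_shape=(time_steps, 1)),  # learns temporal structure
    Dense(1, activation='sigmoid'),         # squashes the output to (0, 1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
```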
You also use the