Detecting positive or negative sentiments in user reviews
In this section, we're going to look at detecting positive and negative sentiments in user
reviews. In other words, we are going to detect whether the user is typing a positive
comment or a negative comment about the product or service. We're going to use
Word2Vec and Doc2Vec specifically, and the gensim Python library, which provides both
techniques.
There are two categories, which are positive and negative, and we have over 3,000 different
reviews to look at. These come from Yelp, IMDb, and Amazon. Let's begin the code by
importing the gensim library, which provides Word2Vec and Doc2Vec, and the logging
module so we can note the status of the messages:
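The original code appears as a screenshot; a minimal sketch of those imports, assuming a recent version of gensim, might look like this:

```python
import logging

import gensim

# Show gensim's progress messages so we can follow the status of loading
# and training.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)
```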
First, we will see how to load a pre-built Word2Vec model, provided by Google, that has
been trained on billions of pages of text and ultimately produces 300-dimensional
vectors for all the different words. Once the model is loaded, we will look at the vector for
cat, which shows that the word is represented by a 300-dimensional vector:
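A sketch of loading that model with gensim's KeyedVectors and inspecting the vector for cat; the file name GoogleNews-vectors-negative300.bin and its local path are assumptions about where the pre-trained model was saved:

```python
from gensim.models import KeyedVectors

# Load Google's pre-trained 300-dimensional Word2Vec vectors (binary format).
# The path below is a placeholder; point it at the downloaded model file.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

cat_vector = model["cat"]   # a 300-dimensional NumPy array
print(cat_vector.shape)     # (300,)
print(cat_vector)
```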
The following screenshot shows the 300-dimensional vector for the word dog:
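Continuing with the model object loaded above, the dog lookup is the same one-liner:

```python
print(model["dog"])       # another 300-dimensional vector
```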
The following screenshot shows the 300-dimensional vector for the word spatula:
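And likewise for spatula:

```python
print(model["spatula"])   # 300-dimensional vector for a much less dog-like word
```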
We obtain a result of 76% when computing the similarity of dog and cat, as follows:
The similarity between cat and spatula is 12%, which is much lower, as it should be:
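Both scores come from KeyedVectors.similarity, which returns the cosine similarity between the two word vectors; the exact numbers depend on the pre-trained model, but they should be close to the values quoted above:

```python
print(model.similarity("dog", "cat"))      # roughly 0.76
print(model.similarity("cat", "spatula"))  # roughly 0.12
```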
Here we train our Word2Vec and Doc2Vec models using the following code:
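The book shows this code as a screenshot; a minimal sketch of the Doc2Vec setup might look like the following, where the hyperparameter values are illustrative assumptions rather than the author's exact settings:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hyperparameters are assumptions for illustration, not the book's exact values.
doc2vec_model = Doc2Vec(vector_size=300,  # match the dimensionality used above
                        min_count=2,      # ignore words that appear only once
                        epochs=50)        # passes over the training corpus
```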
We are using Doc2Vec because we want to determine a vector for each document, not
necessarily for each word in the document: our documents are reviews, and we want to see
whether each review is positive or negative, in other words, whether it is more similar to
positive reviews or to negative reviews. Doc2Vec is provided by gensim, and the
library has a class called TaggedDocument that lets us say, "these are the words
in the document, and this is the document's tag," which is the form the Doc2Vec model
trains on.
Now we create a utility function that will take a sentence or a whole paragraph,
lowercase it, remove all the HTML tags, apostrophes, and punctuation, collapse repeated
spaces, and then ultimately break it apart into words:
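A sketch of such a function using regular expressions; the name extract_words and the exact cleaning rules are assumptions about the original implementation:

```python
import re

def extract_words(text):
    """Lowercase text, strip HTML tags, apostrophes, and punctuation,
    then split it into a list of words."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)     # drop HTML tags
    text = text.replace("'", "")             # drop apostrophes (don't -> dont)
    text = re.sub(r"[^a-z0-9]+", " ", text)  # punctuation and symbols -> spaces
    return text.split()                      # split() also collapses repeated spaces
```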
Now it's time for our training set. We are not going to use the 3,000 Yelp, IMDb, and
Amazon reviews, because that is simply not enough data to train a good Doc2Vec
model. If we had millions of reviews, we could take a good portion of them to train with
and use the rest for testing, but with just 3,000 reviews it's not enough. So, instead, I've
gathered reviews from IMDb and other places, including Rotten Tomatoes. This will be
enough to train a Doc2Vec model, but none of these reviews are actually from the dataset
that we're going to use for our final prediction. They are simply reviews; some are positive
and some are negative, and I don't know which, as I'm not keeping track. What matters is
that we have enough text to learn how words are used in these reviews; nothing records
whether each review is positive or negative.
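Putting these pieces together with the earlier sketches, the gathered reviews can be wrapped in TaggedDocument objects and used to train the Doc2Vec model; the file name unlabeled_reviews.txt is a placeholder for wherever those IMDb and Rotten Tomatoes reviews are stored, one per line:

```python
# Placeholder file: one unlabeled review per line, gathered from IMDb,
# Rotten Tomatoes, and similar sources (not the labeled Yelp/IMDb/Amazon set).
with open("unlabeled_reviews.txt", encoding="utf-8") as f:
    tagged_docs = [TaggedDocument(words=extract_words(line), tags=[i])
                   for i, line in enumerate(f)]

doc2vec_model.build_vocab(tagged_docs)
doc2vec_model.train(tagged_docs,
                    total_examples=doc2vec_model.corpus_count,
                    epochs=doc2vec_model.epochs)
```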