"subscribe" and "channel", while the second pair of sentences has fewer words in common,
such as "to" and "the". Consider representing each phrase as a vector of numbers, in such a
way that phrases sharing words, like the top pair, end up with similar numbers. Only then
will we be able to use a random forest or another technique for classification, in this case
to detect YouTube comment spam. To achieve this, we need to use the bag-of-words model.
Bag of words
The bag-of-words model does exactly what we want: it converts phrases or sentences into
counts of the number of times each word appears. In computer science, a bag is a data
structure that keeps track of objects as an array or list does, except that the order does
not matter, and if an object appears more than once, we just keep track of the count rather
than repeating the object.
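The idea of a bag can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later for the spam detector: it assumes that splitting on whitespace and lowercasing is good enough as tokenization, the two comments are made-up examples rather than phrases from the dataset, and the helper names bag_of_words and to_vector are hypothetical.

```python
from collections import Counter


def bag_of_words(phrase):
    """Return a bag (word -> count) for a phrase; word order is discarded."""
    # Assumption: lowercase + whitespace split is an adequate tokenizer here.
    return Counter(phrase.lower().split())


def to_vector(bag, vocabulary):
    """Turn a bag into a list of counts, one entry per vocabulary word."""
    # A Counter returns 0 for words it has never seen.
    return [bag[word] for word in vocabulary]


# Hypothetical comments, invented for illustration only.
first = bag_of_words("subscribe to my channel please subscribe")
second = bag_of_words("please subscribe to the channel")

# A shared, sorted vocabulary gives both vectors the same column order.
vocabulary = sorted(set(first) | set(second))
first_vector = to_vector(first, vocabulary)
second_vector = to_vector(second, vocabulary)

print(vocabulary)
print(first_vector)
print(second_vector)
```

Because both vectors share one vocabulary, each position refers to the same word in both, which is what lets a classifier such as a random forest compare phrases numerically.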
For example, consider the first phrase from the previous diagram. It has a bag of words that
contains words such as