One-hot encoding.
A one-hot vector is a vector of size $V$, where $V$ is the size of the dictionary, consisting of a single 1 and $V-1$ zeros. For example, $w_1 = (1, 0, 0, \dots, 0)$, $w_2 = (0, 1, 0, \dots, 0)$. In this case, the index with the value 1 in the one-hot vector determines the ordinal number of the word in the dictionary. In this approach, it is very difficult to identify semantic similarities between words.
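As a minimal sketch, one-hot encoding can be implemented in a few lines of Python; the three-word vocabulary here is invented for illustration:

```python
# Illustrative vocabulary (not from the paper); the word's position
# in this list is its ordinal number in the dictionary.
vocab = ["cat", "dog", "apple"]

def one_hot(word, vocab):
    """Return a vector of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("dog", vocab))  # [0, 1, 0]
```

Note that the vector length equals the dictionary size, so for a large corpus these vectors become very long and sparse.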
Term Frequency - Inverse Document Frequency.
The weighted TF-IDF value of a word in a set of words or in a document is a statistical measure intended to reflect the importance of the word in the document. The TF-IDF weight is widely used in text mining and natural language processing. Its value is usually calculated as follows. Suppose we have a set of $N$ documents, and $n_{ij}$ is the frequency of word $i$ in document $j$. If $n_j$ is the number of all words in document $j$, then the term frequency is calculated by the formula $TF_{ij} = n_{ij} / n_j$. If the word $i$ appears in $d_i$ documents of the $N$-document set, then for the word $i$ the IDF (inverse document frequency) is determined by the formula $IDF_i = \log(N / d_i)$. The formula $TF\text{-}IDF_{ij} = TF_{ij} \cdot IDF_i$ is then used for the word $i$ in document $j$.
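The TF and IDF formulas can be sketched directly in Python; the toy three-document corpus below is invented for illustration:

```python
import math

# Toy corpus of N tokenized documents (invented for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
N = len(docs)

def tf(word, doc):
    # TF_ij = n_ij / n_j: count of word i in document j over document length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF_i = log(N / d_i), where d_i = number of documents containing word i
    d = sum(1 for doc in docs if word in doc)
    return math.log(N / d)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("cat", docs[0], docs))  # positive: "cat" is in 2 of 3 documents
print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
```

A word that occurs in every document gets IDF = log(1) = 0, so ubiquitous words are weighted down, which is exactly the "importance" effect described above.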
Methods such as one-hot encoding and TF-IDF have some disadvantages. When working with large text data, the word vectors grow in size and computation time. Since word vectors are formed by counting occurrences of words in the text, they reflect only the statistical features of words and do not represent the semantic similarity between words. To overcome these and other inconveniences, distributed word representation technologies have been developed. This technology is also called word embedding; the most popular word embeddings are Word2Vec, GloVe, fastText, etc.
Word2Vec.
The Word2Vec [Mikolov, 2013: 3114] model is a neural network model that learns the semantics of a word from its textual context and consists of the CBOW and Skip-gram models. Word2Vec represents each word as a numerical vector. The vectors corresponding to the words are chosen so that, mathematically, the cosine of the angle between the vectors (1) indicates the degree of semantic similarity of the words corresponding to these vectors:

$\cos(\theta) = \dfrac{\sum_{k=1}^{d} a_k b_k}{\sqrt{\sum_{k=1}^{d} a_k^2} \, \sqrt{\sum_{k=1}^{d} b_k^2}}$ (1)

where $d$ is the dimension of the word vectors, $a = (a_1, \dots, a_d)$ is the numerical vector corresponding to the first word, and $b = (b_1, \dots, b_d)$ is the numerical vector corresponding to the second word.
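The cosine similarity of formula (1) can be written as a small Python function; the example vectors are arbitrary:

```python
import math

def cosine(a, b):
    # Formula (1): dot product over the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0 (parallel vectors)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # ~0.0 (orthogonal vectors)
```

Values close to 1 indicate semantically similar words, while values near 0 indicate unrelated words under this representation.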
The CBOW model is designed to predict the target word from the context words that surround it. If the sequence of words in the text corpus is $w_1, w_2, \dots, w_T$ and the window size is $c$, then the target word $w_t$ is predicted using the context words $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$. The purpose of the CBOW model is to maximize the average log probability

$\dfrac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c})$ (2)
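As an illustration of how CBOW training examples are formed, the following sketch extracts (context, target) pairs with window size c; the sentence is invented and the function name is hypothetical:

```python
def cbow_pairs(words, c):
    """For each position t, pair the target word w_t with its
    context words w_{t-c}..w_{t-1}, w_{t+1}..w_{t+c} (clipped at corpus edges)."""
    pairs = []
    for t, target in enumerate(words):
        left = words[max(0, t - c):t]
        right = words[t + 1:t + 1 + c]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["we", "learn", "word", "vectors", "today"]
for context, target in cbow_pairs(sentence, 2):
    print(context, "->", target)
```

Each printed pair is one training example for objective (2): the model is trained to predict the target from its surrounding context words.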