is to maximize the probability. The value of $P(w_t \mid \text{context})$ is calculated using the softmax function, as in formula (3):

$$P(w_t \mid \text{context}) = \frac{\exp(v_{w_t}^{\top} h)}{\sum_{i=1}^{V} \exp(v_{w_i}^{\top} h)} \quad (3)$$

where $v_w$ is the numerical vector of the embedding matrix corresponding to the word $w$, $h$ is the combined vector of the context words, and $V$ is the number of words in the dictionary. The dictionary usually contains a very large number of words, so the denominator of (3) forces the values of the embedding matrix to be updated over all $V$ words at every step, which is a computational bottleneck during training. Moreover, different words occur with very different frequencies in the corpus. Therefore, hierarchical (multi-layer) softmax and negative sampling are used as optimization methods.
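For illustration, the following minimal Python (NumPy) sketch computes formula (3) for a toy vocabulary; the matrix names W_in and W_out, the sizes, and the context indices are illustrative assumptions rather than values from this paper.

    import numpy as np

    # Toy setup: V = 5 words in the dictionary, d = 3 dimensions.
    # W_in holds input (context) embeddings, W_out output embeddings;
    # both names and all sizes are illustrative assumptions.
    rng = np.random.default_rng(0)
    V, d = 5, 3
    W_in = rng.normal(size=(V, d))
    W_out = rng.normal(size=(V, d))

    def softmax(z):
        z = z - z.max()              # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # CBOW: the hidden vector h is the average of the context embeddings.
    context_ids = [1, 3]
    h = W_in[context_ids].mean(axis=0)

    # Formula (3): probability of every dictionary word given the context.
    p = softmax(W_out @ h)
    print(p, p.sum())                # the V probabilities sum to 1

The example makes the bottleneck visible: the product W_out @ h touches all V rows of the output matrix, which is exactly what hierarchical softmax and negative sampling avoid.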
Skip-gram is similar in principle to the CBOW model, but the Skip-gram model is designed to predict the context words from the target word. That is, given the target word $w_t$, the Skip-gram model intends to predict the context words $w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}$ that surround it within a window of width $c$.
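A minimal training sketch with the gensim library (assuming gensim 4.x) shows how the two architectures are selected; the toy Uzbek sentences and all parameter values are illustrative only.

    from gensim.models import Word2Vec

    # A tiny toy corpus in Uzbek (Latin script); real training
    # requires a much larger corpus.
    sentences = [
        ["kino", "juda", "yaxshi", "chiqibdi"],
        ["film", "menga", "yoqdi"],
    ]

    # sg=1 selects the Skip-gram model (predict context from target);
    # sg=0 would select CBOW. window is the context width c.
    model = Word2Vec(sentences, vector_size=50, window=2,
                     min_count=1, sg=1, negative=5, epochs=50)

    print(model.wv["kino"][:5])           # embedding vector of a word
    print(model.wv.most_similar("kino"))  # nearest words in vector space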
Word2Vec focuses on the local context but does not use global statistics of the full text. Taking this into account, GloVe (Global Vectors for Word Representation) was presented in [Pennington, 2014: 1535]. GloVe word embeddings also take into account how often words occur and co-occur across the full corpus, i.e., global occurrence counts.
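As a sketch of the global statistic GloVe relies on, the following Python fragment counts word-word co-occurrences over a whole toy corpus within a symmetric window; the function name and the window size are illustrative assumptions.

    from collections import Counter

    # Count global word-word co-occurrences within a symmetric window;
    # this count matrix X is the statistic that GloVe factorizes.
    def cooccurrence(sentences, window=2):
        counts = Counter()
        for sent in sentences:
            for i, w in enumerate(sent):
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[(w, sent[j])] += 1
        return counts

    sentences = [["kino", "juda", "yaxshi"],
                 ["kino", "yaxshi", "emas"]]
    X = cooccurrence(sentences)
    print(X[("kino", "yaxshi")])  # 2: counted across the whole corpus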
Although Word2Vec is simple, efficient, and represents a word semantically in its context, it does not describe out-of-vocabulary (OOV) words in the embedding matrix, i.e., words that do not appear in the existing dictionary or the training corpus. Much research has been done to solve this problem. One of these solutions is fastText, which was proposed by Facebook AI researchers [Bojanowski, 2017: 138].
fastText. fastText uses sub-word information (character n-grams) to capture the relationship between the letters of a word and its internal semantics.
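The following sketch, again assuming gensim 4.x, illustrates this property: a word absent from the training data still receives a vector composed from its character n-grams. The toy corpus and the OOV word "kinolar" are illustrative assumptions.

    from gensim.models import FastText

    sentences = [
        ["kino", "juda", "yaxshi", "chiqibdi"],
        ["film", "menga", "yoqdi"],
    ]

    # min_n/max_n set the lengths of the character n-grams (sub-words).
    model = FastText(sentences, vector_size=50, window=2, min_count=1,
                     min_n=3, max_n=6, epochs=50)

    # "kinolar" never occurs in the training data (OOV), but fastText
    # still composes a vector for it from its character n-grams.
    print("kinolar" in model.wv.key_to_index)  # False
    print(model.wv["kinolar"][:5])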
The Uzbek movie review comments (UMR) corpus for opinion classification is a collection of comments posted on 75 Uzbek films, gathered using the YouTube Data API. The comments, written in the Uzbek Cyrillic and Latin alphabets, consist of 121,441 tokens, and the total number of posts is 17,486. The UMR dataset was labeled by 6 annotators, who marked each post as positive, negative, or irrelevant. For this evaluation, 2,044 positive and 519 negative posts were selected. All posts were converted to the Latin alphabet, and each word in the text was POS-tagged by 2 linguists [Rabbimov, 2020: 4].
For each post in the dataset, statistical features, POS-based features, and emoji-based features are
calculated.
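Since the individual features are not enumerated here, the following Python sketch is only a hypothetical illustration of what the three feature groups might look like; every feature name, the chosen POS tags, and the emoji character range are assumptions, not the features of [Rabbimov, 2020].

    import re

    # Hypothetical sketch of the three feature groups; all details
    # below are assumptions made for illustration.
    EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

    def extract_features(post, pos_tags):
        tokens = post.split()
        return {
            # statistical features
            "n_tokens": len(tokens),
            "n_chars": len(post),
            "avg_token_len": sum(map(len, tokens)) / max(len(tokens), 1),
            # POS-based features
            "n_adjectives": pos_tags.count("ADJ"),
            "n_verbs": pos_tags.count("VERB"),
            # emoji-based features
            "n_emoji": len(EMOJI.findall(post)),
        }

    print(extract_features("Film juda yaxshi \U0001F60A",
                           ["NOUN", "ADV", "ADJ"]))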