Tashkent State University of Uzbek Language and Literature named after Alisher Navoi



One-hot encoding.
A one-hot vector is a vector of size $N$ (the number of words in the dictionary) consisting of a single 1 and $N - 1$ zeros, for example $w_3 = (0, 0, 1, 0, \dots, 0)$. The index with the value 1 in the one-hot vector determines the ordinal number of the word in the dictionary. With this approach it is very difficult to identify semantic similarities between words.
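As a minimal sketch of this encoding (the toy dictionary below is illustrative, not taken from the paper), a one-hot vector can be constructed as follows:

```python
import numpy as np

# Hypothetical toy dictionary mapping each word to its ordinal number.
vocab = {"til": 0, "adabiyot": 1, "korpus": 2, "matn": 3}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a vector of len(vocab) zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

print(one_hot("korpus", vocab))  # [0. 0. 1. 0.]
```

Because any two distinct one-hot vectors are orthogonal, their dot product is always 0, which is exactly why this representation cannot capture semantic similarity between words.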
Term Frequency - Inverse Document Frequency.
The weighted TF-IDF value of a word in a set of documents is a statistical measure intended to show the degree of the word's importance in a document. The TF-IDF weight is widely used in text mining and natural language processing, and its value is usually calculated as follows. Suppose we have a set of $N$ documents, and $n_{ij}$ is the frequency of word $i$ in document $j$. If $n_j$ is the number of all words in document $j$, then the term frequency is calculated by the formula $TF_{ij} = n_{ij} / n_j$. If the word $i$ appears in $d_i$ documents of the $N$-document set, then the IDF (inverse document frequency) of word $i$ is determined by the formula $IDF_i = \log(N / d_i)$. The weight of word $i$ in document $j$ is then given by the formula $TFIDF_{ij} = TF_{ij} \cdot IDF_i$.
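The following sketch computes these quantities directly from the formulas above; the three toy documents are illustrative, not taken from the corpus described in the paper:

```python
import math
from collections import Counter

# Illustrative toy document set (N = 3).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the dogs and the cats were friends".split(),
]

N = len(docs)
counts = [Counter(d) for d in docs]  # n_ij: frequency of word i in document j

def tf(word, j):
    # TF_ij = n_ij / n_j, where n_j is the number of all words in document j
    return counts[j][word] / sum(counts[j].values())

def idf(word):
    # IDF_i = log(N / d_i), where d_i is the number of documents containing word i
    d_i = sum(1 for c in counts if word in c)
    return math.log(N / d_i)

def tfidf(word, j):
    # TFIDF_ij = TF_ij * IDF_i
    return tf(word, j) * idf(word)

print(round(tfidf("cat", 0), 4))  # rare word: relatively high weight
print(round(tfidf("the", 0), 4))  # word in every document: IDF = 0, weight 0
```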
Methods such as one-hot encoding and TF-IDF have some disadvantages. When working with large text data, the word vectors grow in size, and the computation time grows with them. Since the word vectors are formed by counting the occurrences of words in the text, they reflect only the statistical features of words and do not represent the semantic similarity between them. To overcome these and other inconveniences, distributed word representation technologies have been developed. This technology is also called word embedding; the most popular word embeddings are Word2Vec, GloVe, and fastText.
Word2Vec.
The Word2Vec [Mikolov, 2013: 3114] model is a neural network model that learns the semantics of a word from its textual context and consists of the CBOW and Skip-gram models. Word2Vec represents each word as a numerical vector. The vectors corresponding to the words are chosen so that the cosine of the angle between two vectors (1) indicates the degree of semantic similarity of the words corresponding to these vectors:

$\cos(\theta) = \dfrac{\sum_{k=1}^{d} u_k v_k}{\sqrt{\sum_{k=1}^{d} u_k^2} \sqrt{\sum_{k=1}^{d} v_k^2}}$ (1)

where $d$ is the dimension of the word vectors, $u$ is the numerical vector corresponding to the word $w_1$, and $v$ is the numerical vector corresponding to the word $w_2$.
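A direct implementation of formula (1) follows; the two vectors are illustrative stand-ins for learned word embeddings:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Formula (1): the dot product of u and v divided by the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative embeddings for two hypothetical words.
u = np.array([0.21, 0.74, -0.12])
v = np.array([0.25, 0.68, -0.05])
print(cosine_similarity(u, v))  # near 1.0 for semantically similar words
```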
The CBOW model is designed to predict the target word from the context words that surround it. If the sequence of words in the text corpus is $w_1, w_2, \dots, w_T$ and the window size is $c$, then the target word $w_t$ is predicted using the context words $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$. The objective of the CBOW model is to maximize the average log probability

$\dfrac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c})$ (2)
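As a sketch of training such a model, one common implementation is gensim's Word2Vec class (not the authors' own code; the tiny tokenized corpus below is illustrative). CBOW is selected with sg=0, and the window size $c$ corresponds to the window parameter:

```python
from gensim.models import Word2Vec

# Illustrative tokenized sentences; a real corpus would be far larger.
sentences = [
    ["milliy", "korpus", "til", "texnologiyalari", "uchun", "muhim"],
    ["til", "va", "adabiyot", "matnlari", "korpus", "tashkil", "qiladi"],
]

# sg=0 selects the CBOW architecture: each target word w_t is predicted
# from the context words within `window` positions on either side.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

vector = model.wv["korpus"]                # dense numerical vector for a word
similar = model.wv.most_similar("korpus")  # neighbors by cosine similarity
```

With sg=1 the same class trains the Skip-gram model instead, which predicts the context words from the target word.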


