Python Artificial Intelligence Projects for Beginners





Python Artificial Intelligence Projects for Beginners - Get up and running with 8 smart and exciting AI applications by Joshua Eckroth

channel, with one occurrence; plz, with one occurrence; subscribe, with two occurrences; and so on. We would then collect all these counts in a vector, with one vector per phrase, sentence, or document, depending on what you are working with. Again, the order in which the words originally appeared doesn't matter.
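The counting step can be sketched with the standard library alone. The phrase below is a hypothetical example, chosen only so that its counts match the ones mentioned above; it is not taken from the dataset.

```python
from collections import Counter

# Hypothetical example phrase (an assumption), chosen so that the counts
# agree with the text: channel 1, plz 1, subscribe 2.
phrase = "plz subscribe to my channel subscribe"

# Split on whitespace and count each word. The Counter keeps no record
# of where in the phrase each word appeared.
counts = Counter(phrase.split())

# The same words in a different order produce the same counts.
shuffled = Counter("subscribe channel subscribe my to plz".split())

print(counts["channel"], counts["plz"], counts["subscribe"])
print(counts == shuffled)
```

Because only the counts are kept, two phrases that use the same words in a different order end up with identical representations.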
The entries of each vector can also be sorted alphabetically by word, but this needs to be done consistently for all the different phrases. However, we still have the same problem: each phrase has a vector with different columns, because each phrase has different words and a different number of columns, as shown in the following two tables:


Applications for Comment Classification
Chapter 3
If we make a larger vector with all the unique words across both phrases, we get a proper matrix representation, with each row representing a different phrase. Notice the use of 0 to indicate that a phrase doesn't have a word:
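As a minimal sketch of this step, the following builds one shared, alphabetically sorted vocabulary for two phrases and then writes each phrase as a row over that vocabulary. Both phrases are hypothetical examples, not the ones in the book's tables.

```python
from collections import Counter

# Two hypothetical phrases (assumptions for illustration only).
phrases = [
    "plz subscribe to my channel subscribe",
    "check out my channel",
]

# Count the words of each phrase separately.
counts = [Counter(p.split()) for p in phrases]

# One shared vocabulary: every unique word across both phrases,
# sorted so that the column order is the same for every row.
vocabulary = sorted(set().union(*counts))

# One row per phrase, one column per vocabulary word.
# A Counter returns 0 for a word the phrase does not contain.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Both rows now have the same length and the same column meaning, which is what makes the result a proper matrix.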
If we want to have a bag of words with lots of phrases or documents, we would need to collect all the unique words that occur across all the examples and create a huge matrix, N x M, where N is the number of examples and M is the number of unique words. We could easily have thousands of dimensions, compared with the four dimensions of the iris dataset. The bag of words matrix is likely to be sparse, meaning mostly zeros, since most phrases don't have most words.
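To make the N x M shape and the sparsity concrete, here is a small sketch over a hypothetical four-example corpus. The helper name bag_of_words and the example texts are assumptions made for illustration; a real corpus would be far larger and far sparser.

```python
from collections import Counter

def bag_of_words(examples):
    """Return (vocabulary, matrix) for a list of whitespace-separated texts."""
    counts = [Counter(e.split()) for e in examples]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical corpus (an assumption, not data from the book).
examples = [
    "plz subscribe to my channel subscribe",
    "check out my channel",
    "great video",
    "nice song love it",
]

vocabulary, matrix = bag_of_words(examples)
n_examples = len(matrix)      # N: number of rows
n_words = len(vocabulary)     # M: number of unique words

# Fraction of cells that are zero.
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (n_examples * n_words)

# A sparse representation stores only the non-zero cells.
sparse = {
    (i, j): value
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
    if value != 0
}

print(n_examples, n_words, round(sparsity, 2), len(sparse))
```

Even in this tiny corpus most cells are zero, which is why storing only the non-zero entries tends to pay off as the vocabulary grows.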
Before we start building our bag of words model, we need to take care of a few things, such
as the following:
Lowercase every word
Drop punctuation
Drop very common words (stop words)
Remove plurals (for example, bunnies => bunny)
Perform lemmatization (for example, reader => read, reading => read)
Use n-grams, such as bigrams (two-word pairs) or trigrams
Keep only frequent words (for example, must appear in >10 examples)
Keep only the most frequent M words (for example, keep only 1,000)
Record binary counts (1 = present, 0 = absent) rather than true counts
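Several of the steps above can be sketched in plain Python. The stop-word set, the function names, and the example comment below are all assumptions made for illustration. Plural removal and lemmatization are left out because they usually rely on an external library such as NLTK.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"to", "my", "the", "a", "and", "out"}

def tokenize(text):
    """Lowercase, drop punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def bigrams(tokens):
    """Return adjacent two-word pairs from a token list."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def binary_counts(tokens):
    """Record 1 for every word that is present, ignoring how often it occurs."""
    return {w: 1 for w in set(tokens)}

def most_frequent(token_lists, m):
    """Keep only the m most frequent words across all examples."""
    totals = Counter(w for tokens in token_lists for w in tokens)
    return [word for word, _ in totals.most_common(m)]

# Hypothetical comment (an assumption for illustration).
tokens = tokenize("Plz SUBSCRIBE to my channel, subscribe!!")

print(tokens)
print(bigrams(tokens))
print(binary_counts(tokens))
print(most_frequent([tokens], 1))
```

One design choice to be aware of: the bigrams here are formed after stop words are removed, so a pair such as "subscribe channel" joins words that were not adjacent in the original text. Forming bigrams before dropping stop words is an equally reasonable choice.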
There are many other combinations for best practice, and finding the best that suits the