Usmanov Zafar Juraevich
Doctor of Physical and Mathematical Sciences, Professor,
Academician of the National Academy of Sciences of Tajikistan,
Institute of Mathematics named after Academician A.Juraev
National Academy of Sciences of Tajikistan, Republic of Tajikistan
Kosimov Abdunabi Abduraufovich
Candidate of Technical Sciences, Senior Lecturer at the Department of Automated Control Systems,
Tajik Technical University named after Academician M.S.Osimi, Republic of Tajikistan
ABOUT THE AUTOMATIC RECOGNITION OF THE LANGUAGES OF
WORKS BASED ON THE LATIN ALPHABET
Abstract. This article describes an example model collection of 10 texts in five
different languages written in the Latin script. The applicability of the γ-classifier to
automatic recognition of the language of the works is established based on the
frequencies of the 26 common letters of the Latin alphabet.
Keywords: text, language, alphabet, frequency, classifier.
In recent times, writing based on the Latin alphabet has become widespread among
the Romance, Germanic, Slavic, Finno-Ugric, Turkic, Semitic and Iranian groups of
languages, and among the countries of Indochina, the Sunda Archipelago and the Philippines,
Africa (south of the Sahara), America, Australia and Oceania [1]. With the exception of
modern English, for the majority of languages the 26-letter Latin alphabet (a, b, c, d, e, f, g,
h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z) turned out to be insufficient, and therefore,
to reflect the phonetic features of particular language systems, various diacritics, ligatures
and other modifications of letters were added to the basic Latin script. In this work, we
study whether the 26 basic Latin letters alone are sufficient for automatic
recognition of the language in which a given printed work is written.
SCIENCE AND PRACTICE: IMPLEMENTATION TO MODERN SOCIETY
As experimental material for our study, we have chosen a small collection 𝑪 of
10 works (texts), among which:
in English (En):
W. Shakespeare “Romeo and Juliet” (en_1, 25832 words),
M. Twain “A Connecticut Yankee in King Arthur's Court” (en_2, 117257 words);
in German (De):
G. Pease “Schiff ohne Mannschaft” (de_1, 59695 words),
G. Diana “Das flammende Kreuz: Roman” (de_2, 70104 words);
in Spanish (Es):
D. J. Henrich “El ocaso de la magia” (es_1, 73300 words),
V. F. Alberto “Oceano” (es_2, 103596 words);
in Italian (It):
G. Ed “Elminster: la nascita di un mago” (it_1, 127087 words),
S. Robert “Il paradosso del passato” (it_2, 69697 words);
and in French (Fr):
S. Georges “Lavinia” (fr_1, 13151 words),
B. Michel “Les Nymphéas noirs” (fr_2, 108137 words).
For each text, the information given consists of the author's name, the original title
of the work, an abbreviated designation of the work, and its size, measured in words.
A distinctive feature of the collection is that it covers only 5 European languages and
that all of its texts use the Latin script supplemented by language-specific characters:
4 characters (ä, ö, ß, ü) in German (de), one character (ñ) in Spanish (es),
10 characters (à, è, é, ì, í, î, ò, ó, ù, ú) in Italian (it),
and 14 characters (â, à, ç, é, ê, è, ë, î, ï, ô, û, ù, ü, ÿ) in French (fr).
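To make the idea of a 26-letter frequency profile concrete, the following minimal Python sketch counts only the basic letters a to z in a text and converts the counts to relative frequencies. It is an illustration under stated assumptions, not the authors' procedure: the function name letter_frequencies is hypothetical, letter case is folded to lowercase, and characters outside the 26 basic letters (including ä, ñ, é and the other language-specific characters listed above) are simply skipped rather than mapped to a base letter. The γ-classifier itself is not shown here.

```python
from collections import Counter
import string

# The 26 basic Latin letters shared by all five languages.
ALPHABET = string.ascii_lowercase


def letter_frequencies(text: str) -> dict:
    """Return relative frequencies of the 26 basic Latin letters in `text`.

    Assumption for this sketch: any character outside a-z (after
    lowercasing) is ignored, so letters with diacritics are not counted.
    """
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        # No basic Latin letters at all: return an all-zero profile.
        return {letter: 0.0 for letter in ALPHABET}
    return {letter: counts[letter] / total for letter in ALPHABET}


# Hypothetical sample mixing basic letters with language-specific characters.
sample = "Señor, über alles: déjà vu!"
freqs = letter_frequencies(sample)
```

Each text of the collection would then be represented by a vector of 26 numbers that sum to 1 whenever the text contains at least one basic letter. Skipping the extra characters is only one possible design choice; replacing them with their base letters is another, and the two choices give slightly different profiles.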