FILLING THE KNOWLEDGE BASE OF THE FILTERING SYSTEM: ALGORITHM DEVELOPMENT

The database used to detect spam messages must be constantly extended with new knowledge and integrated into the international database of spam messages. How an organization's existing knowledge base is integrated into the international spam database and the database of spam-sending IP addresses depends on the internal regulations of each organization, so the integration may differ from one organization to another. However, for new knowledge to be added constantly to the developed knowledge base, the knowledge base must be formed in accordance with the organization's information system. Because the spam filtering system in an organization is hierarchical, the knowledge base is also organized hierarchically.

The following issue should be taken into account. In addition to spam message databases, a knowledge base containing knowledge about spam messages must be created, so that spam messages that are not yet in the spam databases on the organization's email servers can still be identified. There are many ways to do this, but the most widely used is the simple intersection method. A simple intersection forms the knowledge base from the messages that all users of the mail system have identified as spam. In this case, the error rate of the first category increases, while the error rate of the second category decreases.
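For the simple intersection, the knowledge base can be expressed as follows:

$$KB = \bigcap_{i=1}^{n} KB_i$$

Here:
- $KB$ is the organizational knowledge base;
- $KB_i$ is the knowledge base of the $i$-th user of the mail system;
- $n$ is the number of users of the e-mail system in the organization.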
In a complex intersection, the decision to add messages that a particular user has identified as spam to the database is made according to the authority of that mail system user, which in turn is derived from how many messages the user has marked correctly and how many incorrectly. In many cases, the assessment of each user's authority is abstract. Recipients of the same message judge whether it is spam independently of the system. In this case, the error rate of the first category increases, while the error rate of the second category decreases. That is why this method is not widely used.
This way of forming a knowledge base is done in a simplified form as follows.
In merging, the knowledge base is formed from the messages that at least one user of the email system has identified as spam. In this case, the error rate of the first category decreases and the error rate of the second category increases. This method is also rarely used.
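This way of forming a knowledge base can be expressed as follows:

$$KB = \bigcup_{i=1}^{n} KB_i$$

where $KB$ is the organizational knowledge base, $KB_i$ is the knowledge base of the $i$-th user, and $n$ is the number of users of the e-mail system.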
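The simple intersection and merging schemes described above can be sketched with ordinary set operations. This is only an illustration, assuming each user's knowledge base is a set of message identifiers; the function names and the sample identifiers are hypothetical and do not come from the original system.

```python
def simple_intersection(user_kbs):
    """Keep only the messages that every user marked as spam."""
    if not user_kbs:
        return set()
    return set.intersection(*map(set, user_kbs))


def merging(user_kbs):
    """Keep the messages that at least one user marked as spam."""
    return set().union(*user_kbs)


# Hypothetical per-user knowledge bases (sets of message identifiers).
user_kbs = [
    {"msg-a", "msg-b", "msg-c"},
    {"msg-b", "msg-c"},
    {"msg-c", "msg-d"},
]

intersection_kb = simple_intersection(user_kbs)
merged_kb = merging(user_kbs)
print(sorted(intersection_kb))
print(sorted(merged_kb))
```

The intersection yields the smallest knowledge base and merging the largest, which is the source of the opposite error trade-offs noted for the two methods.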
Spam filters can either be common to all users and installed on a single server, or be personal filters provided to individual users in order to account for a specific user's area of interest and filter incoming messages more accurately. One disadvantage of personal spam filters is that they cannot draw on the experience of other users. That is, when a spam message passes through a personal filter, each user must separately report the error to their own filter so that it can detect and catch spam messages of this category in the future.
An arbitrary organization that supports an anti-spam system is typically divided into three levels:
end user level;
department or subdivision level;
overall organizational level.
At the lower level, each user separates all received messages into useful messages and spam messages. In parallel with this process, the user can also point the system to folders where documents useful to the user are stored, in order to fill the knowledge base with knowledge that has high priority for that user. The system then clearly distinguishes between words specific to spam messages and words used in useful messages. This is done using the keyword-based method of detecting spam messages. The process of forming the knowledge base of the system for combating spam messages is shown in Figure 1.
Figure 1. Hierarchy of knowledge base formation in the multi-level spam message detection system
The keywords are then passed to the department level, where all the lists are combined, since the professional interests of all employees in a single department almost coincide. In the final stage of knowledge base formation, the system keeps the keywords extracted by all departments and removes from the knowledge base any word that at least one department did not extract. This is because different departments of the same organization may have different areas of interest.
In such a scheme, high efficiency of spam filters can be achieved by increasing the size of the training sample (creating a knowledge base at the department level) while keeping the error level low by removing randomly occurring keywords from the training sample (at the organization level).
This way of forming the knowledge base can be expressed as follows:

$$KB_j = \bigcup_{i=1}^{n_j} KB_{ij}, \qquad KB = \bigcap_{j=1}^{m} KB_j$$

Here:
- $m$ is the number of departments in the organization;
- $n_1$ is the number of employees in the first department of the organization;
- $n_2$ is the number of employees in the second department of the organization;
- $n_m$ is the number of employees in the $m$-th department of the organization;
- $KB_{ij}$ is the knowledge base formed by the $i$-th employee of the $j$-th department;
- $KB_j$ is the knowledge base of the $j$-th department;
- $KB$ is the whole organizational knowledge base.
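The hierarchical scheme (keyword lists combined within each department, then only the keywords common to all departments kept at the organization level) can be sketched as follows. This is a minimal illustration, assuming each employee's knowledge base is a set of keywords; the function names and sample keywords are hypothetical.

```python
def department_kb(employee_kbs):
    """Department level: combine the keyword lists of all employees."""
    return set().union(*employee_kbs)


def organization_kb(departments):
    """Organization level: keep keywords present in every department."""
    dept_kbs = [department_kb(employees) for employees in departments]
    if not dept_kbs:
        return set()
    return set.intersection(*dept_kbs)


# Hypothetical data: a list of departments, each a list of employee keyword sets.
departments = [
    [{"invoice", "lottery"}, {"lottery", "prize"}],
    [{"lottery", "prize", "pills"}],
    [{"prize", "lottery"}, {"casino"}],
]

org_kb = organization_kb(departments)
print(sorted(org_kb))
```

Keywords that appear in only some departments ("invoice", "pills", "casino" in this sample) are dropped at the organization level, which corresponds to removing randomly occurring keywords from the training sample.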
In particular, each unsolicited e-mail is intended for one user only:
From (7) and (8) we obtain:
From (10) and (11) we obtain:
From (1) and (2) we obtain:
Here:
- the stream of incoming e-mails addressed to individual users or to all users;
- the mail server;
- the server classifier;
- the department classifier, configured and trained by the users of the department;
- the stream of incoming e-mails filtered by the server classifier and intended for the users of the department;
- first by the server classifier
- the stream of incoming e-mails filtered by the department classifier and intended for a given user of the department;
- a message assigned to a given class by a given user of a department and used to train that department's classifier;
- a message classified by all users of the department and used to train the server classifier.
The following steps are performed to fill the knowledge base formed in this way. First of all, special attention is paid to filling the knowledge bases of individual users, i.e. to the content of messages that are useful to the user for whom the knowledge base is formed. The algorithm for filling an individual user's knowledge base is shown in Figure 2.
At the beginning of the work, the user must specify a folder containing documents (text and tables) that the system can use to populate the database of useful messages. The system then begins to form the knowledge base. Each document is processed separately to save the computing resources of the user's computer.
Once the index of one document has been built, the system moves on to building the index of the next document, and so on until the index of all documents is built. While constructing a document index, the system must calculate the frequency of occurrence of the words extracted from the document being analyzed and, in doing so, isolate the lexical units (individual words). When extracting individual words from the document, the system removes suffixes in order to generalize words that are used in different forms.
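The document-by-document indexing described above can be sketched as follows. This is only an illustration under stated assumptions: the documents are plain-text `.txt` files, words are runs of Latin letters, and the names `index_document` and `build_knowledge_base` are hypothetical. Suffix removal is omitted here.

```python
import re
from collections import Counter
from pathlib import Path


def index_document(path):
    """Build the index (word -> frequency) of a single document."""
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_knowledge_base(folder):
    """Index the documents one at a time and merge the counts."""
    kb = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # Only one document is held in memory at any moment.
        kb.update(index_document(path))
    return kb
```

Calling `build_knowledge_base` on the folder chosen by the user returns the combined word frequencies for all documents in that folder.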
It is possible to create a dictionary that lists every word together with its root and suffixes, but since searching such a dictionary in the database takes a long time, the system instead estimates the length of the suffix from the length of the word. The main goal is to find the stem of the keyword. The longer the word, the more characters are removed from its end.
If words are placed in the knowledge base without removing their suffixes, the priority of the knowledge in the knowledge base decreases. A suffix does change the meaning of a word, but an increase in the number of words that share the same stem leads to errors of the first type.
Therefore, keywords are formed mainly from the stems of words, without suffixes, and these keywords are included in the knowledge base.
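The length-based suffix removal can be sketched as follows. The text states only that longer words lose more characters, so the specific thresholds in `suffix_length` are assumptions chosen for illustration and are not the values used by the original system.

```python
def suffix_length(word_length):
    """Estimate how many trailing characters to drop (hypothetical thresholds)."""
    if word_length <= 4:
        return 0
    if word_length <= 6:
        return 1
    if word_length <= 9:
        return 2
    return 3


def stem(word):
    """Approximate the stem by cutting a length-dependent suffix."""
    word = word.lower()
    return word[: len(word) - suffix_length(len(word))]


print(stem("filter"), stem("filters"), stem("Spam"))
```

With these thresholds "filter" and "filters" reduce to the same stem, while short words such as "spam" are left unchanged. A rule of this kind is crude compared with a dictionary of roots and suffixes, but it avoids the database lookup the text identifies as too slow.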
Figure 2. Block diagram of the algorithm for filling the knowledge base.
In order to calculate the frequency of occurrence of words within a document, and across all documents, all words are sorted in alphabetical order and the repetitions of each word are then counted. The more often a keyword is repeated, the higher its priority. By identifying the existing content (documents) in the organization and the keywords that occur in it, and adding these keywords to the knowledge base, it becomes possible to filter the documents arriving in e-mail messages throughout the organization.
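The sort-then-count procedure can be sketched as follows. This is a minimal illustration, assuming documents are given as strings and words are runs of Latin letters; the function names and sample documents are hypothetical.

```python
import re
from itertools import groupby


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Sort all words alphabetically, then count each run of equal words."""
    words = sorted(w for doc in documents for w in tokenize(doc))
    return {word: sum(1 for _ in run) for word, run in groupby(words)}


docs = ["Quarterly budget report", "Budget review and budget plan"]
freqs = word_frequencies(docs)
print(freqs)
```

Because the words are sorted first, equal words end up adjacent, so a single pass with `groupby` is enough to count the repetitions; words with higher counts would receive higher priority in the knowledge base.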
If the keywords that occur in the organization's existing documents are not combined under the same indexes, errors of the first and second types will increase when documents are filtered. This in turn reduces the efficiency and reliability of the e-mail message filtering system.