How to calculate (term frequency )TF of documents?
Assume your query is best car insurance, your total vocabulary contains car, best, auto, insurance and you have N=1,000,000 documents. So your query is something like below:
And one of your document could be:
Now calculate cosine similarity between TF-IDF of your Query and Document.
How to calculate IDF(inverse documents frequency )?
Definition
TF*IDF is a formula for calculating the weighting of certain terms in a document in relation to the total number of documents dealing with the same subject. The formula can also be applied in the context of web pages. In this case, it denotes the weighting of certain terms on a web page in relation to all other pages that rank for a specific search term.
Using the TF*IDF formula, you can analyze textual content on your website and compare it to other web pages in order to increase the relevance of your content for a particular search term. For this reason, optimizing your content according to TF*IDF is an important task in search engine optimization (SEO).
Calculation
Two formulas are required to calculate the TF*IDF value: TF and IDF.
TF
TF stands for "Term Frequency" and serves to calculate the frequency of a term, i.e. a single word or a certain word combination, in a document or on a web page in relation to all other terms on this page. The corresponding formula is:
Freq(i,j) = Frequency of term i in document j
L(j) = Total number of terms in document j
Basically, this is the keyword density, with the only difference that the values are logarithmized. The logarithmic function serves to "compress" the results, i.e. it prevents particularly high term frequencies from distorting the value.
IDF
IDF is the abbreviation for "Inverse Document Frequency". This value stands for the number of all considered documents in relation to the number of documents that contain the term i. The corresponding formula is:
ND = Number of considered documents
fi = Number of documents containing term i
The lower the number of documents containing term i, the higher the IDF and the more important the term. This can be explained by the fact that rare words and expressions are more informative for classifying the content of a document than terms that are present in almost all documents. Due to the higher significance of rare words (represented by a high IDF value), multiplication by TF results in a higher overall value.
Multiplication of TF and IDF
The multiplication of both individual frequencies yields the relative term weighting of a word in a document in relation to all documents considered. Terms that occur frequently in a document but are rather rare in all other documents have a high TF*IDF value. An example would be the term "SEO" in a text about search engine optimization.
However, if a term occurs frequently in a document, but is also mentioned very often in all other documents, its TF*IDF value is low. This is the case for words such as "and", "the", "with", etc. These terms contribute very little to classifying the content of a document.
Importance for SEO
Using the TF*IDF formula, you can compare the content on your website with the content of the best ranking pages to a keyword. Such a comparison can reveal important optimization potentials for your content and is possible with Seobility’s TF*IDF tool, for example. TF*IDF tools indicate which terms should appear more or less frequently in a text to achieve an optimal ratio. In addition, so-called "proof keywords" can be used to underline the relevance of your texts for a specific search term. These are expressions that are semantically close to the considered search term and proof that your text is about that topic. Documents that exceed the average term weighting, are sometimes considered spam. Reducing the frequency of said terms helps to avoid such misinterpretation.
In addition, TF*IDF tools can serve as inspiration when searching for specific sub-topics that should be addressed in a text about a specific search term.
Do'stlaringiz bilan baham: |