As I understand it, min_term_freq=2 looks at the input text, and a term is used for searching only if it occurs at least two times there.
But what does min_doc_freq mean? The documentation says
The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.
But I am not able to figure out what that means. Does it look at the input document or the rest of the index?
Lucene's scoring formula uses TF-IDF weights to reflect how meaningful a word is to a document in a corpus.
That's why the More Like This component uses this numerical statistic.
The idf represents the inverse of the number of documents in which a given term appears: a term appearing in every document would be considered not pertinent (high document frequency, and thus low idf).
That being said, a word that appears only once in a single document could also be a typo, a lorem ipsum excerpt, or something similar: a term without any meaning that nonetheless gets a significant tf-idf weight. Hence the need to leave some "room" to avoid issues caused by nothing more than "theoretical meaningfulness".
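To make the two extremes above concrete, here is a minimal sketch of an idf computation. It assumes the classic Lucene-style formula 1 + ln((numDocs + 1) / (docFreq + 1)); the function name and the index size of 1000 documents are hypothetical, chosen only for illustration.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style idf: 1 + ln((num_docs + 1) / (doc_freq + 1)).

    Assumption: this mirrors the classic Lucene formula; other
    similarities (e.g. BM25) use a different expression.
    """
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))


if __name__ == "__main__":
    num_docs = 1000  # hypothetical index size
    for doc_freq in (1, 5, 100, 1000):
        # The rarer the term, the larger the idf.
        print(f"docFreq={doc_freq:>4}  idf={idf(doc_freq, num_docs):.3f}")
```

A term present in every document ends up with the smallest idf, while a term seen in a single document (possibly just a typo) gets the largest one, which is the situation min_doc_freq is meant to guard against.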
The min_doc_freq parameter lets you set a threshold: any term whose docFreq is less than this value (among the selected K terms with highest tf-idf) will be ignored from the input document. For example, with min_doc_freq=5, a term must appear in at least 5 documents, otherwise it will be excluded from the MLT query. This can be useful when you want MLT to return documents similar to the given one only if the terms of the query point to a well-addressed topic (addressed in at least 5 documents).
So, does it look at the input document or the rest of the index?
Both: from the input document, it needs the top K terms, and for each one of them it checks the docFreq, which is a TermStatistics value queried against the index.
In the same context, you would use max_doc_freq to ignore highly frequent words such as stop words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
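The interplay between the input document and the index can be sketched in a few lines. This is a simplified illustration, not Elasticsearch or Lucene code: the function select_terms, the sample tokens, and the document frequencies are all hypothetical, and applying the frequency filters before ranking by tf-idf is an assumption of this sketch.

```python
import math
from collections import Counter


def select_terms(input_tokens, index_doc_freq, num_docs,
                 min_term_freq=2, min_doc_freq=5,
                 max_doc_freq=None, max_query_terms=25):
    """Pick the terms of the input document that would feed the query."""
    term_freq = Counter(input_tokens)  # statistics of the INPUT document
    scored = []
    for term, tf in term_freq.items():
        if tf < min_term_freq:
            continue  # too rare inside the input document
        df = index_doc_freq.get(term, 0)  # statistic taken from the INDEX
        if df < min_doc_freq:
            continue  # too rare in the index (typo, one-off word)
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index (stop-word-like)
        idf = 1.0 + math.log((num_docs + 1) / (df + 1))
        scored.append((tf * idf, term))
    scored.sort(reverse=True)  # highest tf-idf first
    return [term for _, term in scored[:max_query_terms]]


if __name__ == "__main__":
    tokens = ["lucene", "lucene", "the", "the", "the",
              "teh", "teh", "scoring", "scoring", "index"]
    doc_freq = {"lucene": 40, "the": 1000, "teh": 1,
                "scoring": 12, "index": 300}
    print(select_terms(tokens, doc_freq, num_docs=1000, max_doc_freq=900))
```

In this made-up example, "teh" is dropped by min_doc_freq (it appears in only 1 document of the index), "the" by max_doc_freq, and "index" by min_term_freq, so only "scoring" and "lucene" remain.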