Understanding maxDocs in Elasticsearch's explain feature

I am running an Elasticsearch 1.5.2 query with the "explain" flag turned on. The output for the inverse document frequency is:

{
    "description": "idf(docFreq=2, maxDocs=56)", 
    "value": 3.9267395
}

I understand the idea behind inverse document frequency. If I have 100 docs and one of them includes the word "rhododendron", then roughly idf = (number of docs) / (number of docs containing "rhododendron") = 100 / 1.

But where does the maxDocs number come from in Elasticsearch? I don't see anything about it in the documentation.

There are 2 best solutions below

maxDocs is computed by Lucene's IndexReader and the API documentation states the following:

public abstract int maxDoc()

Returns one greater than the largest possible document number. This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index.

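In other words, maxDocs is the total number of documents in the index, including deleted ones that have not yet been merged away. Because document numbers start at 0, "one greater than the largest possible document number" is exactly that count.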

We can confirm this by looking at the source code for IndexReader, which shows that numDeletedDocs() = maxDoc() - numDocs(), where

  • numDeletedDocs() returns the total number of deleted documents in the index
  • numDocs() returns the number of visible documents in the index

It is also worth noting that maxDocs can differ depending on which copy of a shard (primary or replica) serves your query, and your score will differ with it. See this thread for a full explanation. To mitigate this problem (called "bouncing results"), you can specify the preference parameter in your queries.
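To see how much maxDocs moves the score, here is a minimal sketch of the idf calculation. It assumes the classic Lucene TF-IDF similarity (the default in Elasticsearch 1.x), whose idf is 1 + ln(maxDocs / (docFreq + 1)). The function name classic_idf and the second maxDocs value are made up for illustration.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf under Lucene's classic TF-IDF similarity: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values taken from the explain output in the question.
print(classic_idf(2, 56))

# Hypothetical: the same term on a shard copy that still holds 4 extra deleted docs.
print(classic_idf(2, 60))
```

The first call reproduces the 3.9267395 from the question up to float rounding. The second shows that a copy with more unmerged deletions yields a slightly higher idf for the same term.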

The default search type is query_then_fetch, in which the term and document frequency calculations are local to each shard of the index. That is why you see maxDocs=56: it is most likely the number of docs in the shard that scored this hit, rather than the total number of docs in the whole index (100 in your example).

Replacing _search with _search?search_type=dfs_query_then_fetch in your query makes Elasticsearch gather term and document frequencies from all shards before scoring, which gives a more accurate calculation. More details can be found in this Elastic blog post.
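As a rough sketch of what the changed request looks like, the helper below only builds the URL and sends nothing. The host, the index name myindex, the helper name search_url and the preference value are placeholders, not something from the question.

```python
from urllib.parse import urlencode


def search_url(base: str, index: str, **params: str) -> str:
    """Build a _search URL with optional query-string parameters (hypothetical helper)."""
    query = urlencode(params)
    return f"{base}/{index}/_search" + (f"?{query}" if query else "")


base = "http://localhost:9200"  # assumed local node

# Default behaviour: query_then_fetch, frequencies are local to each shard.
print(search_url(base, "myindex"))

# Frequencies are collected across all shards before scoring.
print(search_url(base, "myindex", search_type="dfs_query_then_fetch"))

# Pin a user to the same shard copies with an arbitrary preference string.
print(search_url(base, "myindex", preference="session-abc123"))
```

The query body, including "explain": true, stays the same. Only the query string changes.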