I am using the significant terms aggregation, which gives me n significant terms with their doc_count and bg_count using the following query:
{
  "query": {
    "terms": { "user_id": ["x"] }
  },
  "aggregations": {
    "word_cloud": {
      "significant_terms": {
        "field": "transcript.results.alternatives.words.word.keyword",
        "size": 200
      }
    }
  },
  "size": 0
}
If I take a term returned by the significant terms aggregation and run a match phrase query for it, the number of hits differs from the doc_count reported by the aggregation.
Match phrase query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "preprocess_data.results.alternatives.transcript": "<term>"
          }
        },
        {
          "match_phrase": {
            "user_id": "x"
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 22
}
The field preprocess_data.results.alternatives.transcript has the following mapping:
{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
I am unable to explain why the document count from the aggregation differs from the hit count of the match phrase search. Please help.
This happens because doc_count is collected from every shard of your index and then combined, so for the significant terms aggregation the value can be approximate. Quoting the Elasticsearch documentation: