Efficiently Computing Significant Terms in SQL


I was introduced to the ElasticSearch significant terms aggregation a while ago and was positively surprised by how good and relevant this metric turns out to be. For those not familiar with it, the concept is quite simple: for a given query (the foreground set), the terms of a given property are scored by how statistically significant their frequency is compared to the background set.

For example, if we were querying for the most significant crime types in the British Transport Police:

C = 5,064,554 -- total number of crimes
T =    66,799 -- total number of bicycle thefts
S =    47,347 -- total number of crimes in British Transport Police
I =     3,640 -- total number of bicycle thefts in British Transport Police

Ordinarily, bicycle thefts represent only about 1% of crimes (66,799/5,064,554), but for the British Transport Police, who handle crime on railways and stations, about 7% of crimes (3,640/47,347) are bicycle thefts. This is a significant, roughly six-fold increase in frequency.

The significance for "bicycle theft" would be [(I/S) - (T/C)] * [(I/S) / (T/C)] = (0.0769 - 0.0132) * (0.0769 / 0.0132) ≈ 0.371

Where:

  • C is the number of all documents in the collection
  • S is the number of documents matching the query
  • T is the number of documents with the specific term
  • I is the number of documents that intersect both S and T
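
To make the scoring concrete in SQL, here is a minimal sketch of the whole computation, assuming a hypothetical crimes(force, crime_type) table with one row per reported crime; the FILTER clause and ::float casts are PostgreSQL-flavoured. This is the brute-force version that the rest of this question is trying to avoid.

-- Brute-force significance scores for every crime_type, with
-- force = 'British Transport Police' as the foreground set.
WITH totals AS (
    SELECT count(*) AS c FROM crimes                                        -- C
), foreground AS (
    SELECT count(*) AS s FROM crimes
    WHERE force = 'British Transport Police'                                -- S
), term_counts AS (
    SELECT crime_type,
           count(*) AS t,                                                   -- T
           count(*) FILTER (WHERE force = 'British Transport Police') AS i  -- I
    FROM crimes
    GROUP BY crime_type
)
SELECT tc.crime_type,
       (tc.i::float / f.s - tc.t::float / t0.c)
     * ((tc.i::float / f.s) / (tc.t::float / t0.c)) AS significance
FROM term_counts tc, totals t0, foreground f
WHERE tc.i > 0
ORDER BY significance DESC;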

For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements), I'm looking to implement the significant terms aggregation in SQL or directly in code.

I've been looking at ways to optimize this kind of query, specifically by decreasing the memory requirements and increasing the query speed at the expense of some error margin - but so far I haven't cracked it. It seems to me that:

  • The variables C and S are easily cacheable or queryable.
  • The variable T could be derived from a Count-Min Sketch instead of querying the database (a sketch of this follows the list).
  • The variable I, however, seems impossible to derive from a Count-Min Sketch of T.
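
For illustration, such a Count-Min Sketch for T could even live inside the database. The following is a rough, PostgreSQL-flavoured sketch against the same hypothetical crimes table (hashtextextended is a built-in PostgreSQL 11+ hash function, and the 4 x 8192 sketch size is an arbitrary choice); note that it only estimates T and does nothing for I.

-- Build a 4 x 8192 count-min sketch of per-term counts (one-off or periodic job).
CREATE TABLE cms AS
SELECT d.seed,
       abs(hashtextextended(c.crime_type, d.seed)) % 8192 AS bucket,
       count(*) AS cnt
FROM crimes c
CROSS JOIN generate_series(1, 4) AS d(seed)
GROUP BY d.seed, bucket;

-- Estimate T for a single term: the minimum over the 4 hash rows
-- never underestimates the true count.
SELECT min(s.cnt) AS t_estimate
FROM generate_series(1, 4) AS d(seed)
JOIN cms s
  ON s.seed = d.seed
 AND s.bucket = abs(hashtextextended('Bicycle theft', d.seed)) % 8192;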

I was also looking at MinHash, but from its description it seems that it cannot be applied here.

Does anyone know of a clever algorithm or data structure that would help tackle this problem?

There are 2 answers below.

BEST ANSWER

I doubt a SQL implementation will be faster. The values for C and T are maintained ahead of time by Lucene. S is a simple count derived from the query results, and I is looked up using O(1) data structures. The main cost is the many T lookups, one for each of the terms observed in the chosen field. Using min_doc_count typically helps drastically reduce the number of these lookups.

"For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements)"

Have you looked into using doc values to manage elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale

ANSWER

An efficient solution is possible when the foreground set is small enough; in that case you can afford to process every document in the foreground set.

  1. Collect the set {Xk} of all terms occurring in the foreground set for the chosen field, as well as their relative frequencies {fk} in the foreground set (occurrences of Xk divided by S).

  2. For each Xk, calculate its significance as (fk - Fk) * (fk / Fk), where Fk = Tk/C is the relative frequency of Xk in the background set.

  3. Select the terms with the highest significance values (see the SQL sketch below).
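
As a rough illustration, the steps above translate to SQL along these lines, assuming the same hypothetical crimes(force, crime_type) table plus a precomputed background_counts(crime_type, t) table that holds Tk for every term (PostgreSQL-flavoured casts):

-- Step 1: per-term counts in the (small) foreground set.
WITH fg AS (
    SELECT crime_type, count(*) AS i
    FROM crimes
    WHERE force = 'British Transport Police'              -- the foreground query
    GROUP BY crime_type
),
fg_total AS (SELECT sum(i) AS s FROM fg),                 -- S (one term per row here)
bg_total AS (SELECT sum(t) AS c FROM background_counts)   -- C
-- Steps 2 and 3: score each foreground term against the background and rank.
SELECT fg.crime_type,
       (fg.i::float / s - bc.t::float / c)
     * ((fg.i::float / s) / (bc.t::float / c)) AS significance
FROM fg
JOIN background_counts bc USING (crime_type)
CROSS JOIN fg_total
CROSS JOIN bg_total
ORDER BY significance DESC
LIMIT 10;

Only the foreground set is scanned; the background contributes nothing but the precomputed per-term totals, which is what keeps this cheap when the foreground set is small.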

However, due to the simplicity of this approach, I wonder if ElasticSearch already contains that optimization. If it doesn't - then it very soon will!