How to get the total token count of documents in Elasticsearch


I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.

I tried the following query, but it returns a very large number, on the order of 10^20, which cannot be right for my dataset.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "tk_count": {
         "sum": {
            "script": "_index[\"body\"].sumttf()"
         }
      }
   },
   "size": 0
}'

Any idea how to get the correct count of all tokens? (I do not need per-term counts, only the total.)


There are 2 best solutions below

2
On

It seems you want the cardinality of the tokens in the body field, that is, the number of distinct tokens.

In that case you can use the cardinality aggregation, as below.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "tk_count": {
            "cardinality" : {
                "field" : "body"
            }
        }
    },
    "size": 0
}'

For details, see the official documentation for the cardinality aggregation.

0
On

This worked for me, is it what you need?

Rather than computing the token count at query time (with the tk_count aggregation, as suggested in the other answer), my solution stores the token count at index time using the token_count datatype, so that the "name.stored_length" value can be returned in query results.

token_count is used as a "multi-field" and works on one field at a time (i.e. the "name" field or the "body" field). I modified the example slightly to store "name.stored_length".

Notice that my example does not count the cardinality of tokens (i.e. distinct values); it counts total tokens. "John John Doe" has 3 tokens in it, so "name.stored_length" === 3 (even though its count of distinct tokens is only 2). Notice also that I ask for specific "stored_fields": ["name.stored_length"].
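To make the total-versus-distinct difference concrete, here is a small Python sketch. It is not Elasticsearch code: the regex tokenizer is only a rough stand-in for the standard analyzer (an assumption), and the function names are made up for illustration.

```python
import re

def tokenize(text):
    # Rough stand-in for the standard analyzer (assumption): lowercase the
    # text and split on anything that is not a letter or digit.
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def total_tokens(text):
    # What a token_count field stores: every token, repeats included.
    return len(tokenize(text))

def distinct_tokens(text):
    # What a cardinality-style count measures: unique tokens only.
    return len(set(tokenize(text)))

print(total_tokens("John John Doe"))     # 3
print(distinct_tokens("John John Doe"))  # 2
```

The first number is what "name.stored_length" holds for this document; the second is the kind of figure a distinct count would give.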

Finally, you may need to re-index your documents (i.e. send a PUT again), or use any other technique that gets you the values you want. In this case I PUT "John John Doe" even though it had already been indexed in Elasticsearch; the tokens were not counted until I PUT it again after adding token_count to the mapping.

PUT test_token_count
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "stored_length": { 
              "type":     "token_count",
              "analyzer": "standard",
     //------------------v
              "store": true
            }
          }
        }
      }
    }
  }
}

PUT test_token_count/_doc/1
{
    "name": "John John Doe" 
}

Now we can query, or search for results, and configure results to include the name.stored_length field (which is both a multi-field and a stored field!):

GET test_token_count/_search
{
      //------------------v
    "stored_fields" : ["name.stored_length"]
}

The search results should include the total token count as name.stored_length:

{
  ...
  "hits": {
     ...
    "hits": [
      {
        "_index": "test_token_count",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "fields": {
 //------------------v
          "name.stored_length": [
            3
          ]
        }
      }
    ]
  }
}
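To get back to the original question (one total across all matching documents rather than a count per hit), the per-document counts can probably be added up with an ordinary sum aggregation on "name.stored_length", since token_count is a numeric field type. I have not run this against the index from the question. The helper name build_total_tokens_request and the aggregation name total_tokens are made up for this sketch, and the body is only built and printed here, not sent.

```python
import json

def build_total_tokens_request(field="name.stored_length", query=None):
    # Build a _search body that sums a token_count sub-field over all
    # documents matching the query (match_all when none is given).
    return {
        "size": 0,  # only the aggregation result is needed, no hits
        "query": query if query is not None else {"match_all": {}},
        "aggs": {
            "total_tokens": {  # aggregation name chosen for this example
                "sum": {"field": field}
            }
        },
    }

body = build_total_tokens_request()
# Send this JSON as the body of a request to test_token_count/_search.
print(json.dumps(body, indent=2))
```

If it works as expected, the total should come back under aggregations.total_tokens.value in the response.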