How to sort Elasticsearch by documents in an id?

I am using the free tier of Bonsai and am trying to write a script to manage the number of documents in my Elastic index. To maximize the number of documents I can save, I would like to start by removing the docs that contain many nested documents.

Example:

{
  "title": "Spiderman saves child from well",
  "body": "Move over, Lassie! New York has a new hero. But is he also a menace?",
  "authors": [
    {
      "name": "Jonah Jameson",
      "title": "Sr. Editor"
    },
    {
      "name": "Peter Parker",
      "title": "Photos"
    }
  ],
  "comments": [
    {
      "username": "captain_usa",
      "comment": "I understood that reference!"
    },
    {
      "username": "man_of_iron",
      "comment": "Congrats on being slightly more useful than a ladder."
    }
  ],
  "photos": [
    {
      "url": "https://assets.dailybugle.com/12345",
      "caption": "Spiderman delivering Timmy back to his mother"
    }
  ]
}
    

Is there anything in Elastic that would tell me that this document is really 6 documents because of the extensive nesting? Ideally, I would be able to sort Elastic records by this "document count".

Thanks!

There is 1 best solution below

If your authors, comments, and photos are either plain arrays of objects or mapped with Elasticsearch's dedicated nested data type, you can do the following:

GET bonsai/_search
{
  "_source": [""],
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "source": """
            def count = 1; // top level doc count is 1
            for (def entry : params._source.values()) {
              if (entry instanceof ArrayList) {
                count += entry.size();
              }
            }
            return count;
          """
        }
      }
    }
  ]
}

I wasn't sure at first how the above doc comes to a size of 6, so I presumed you counted the top-level doc too (1 top-level doc + 2 authors + 2 comments + 1 photo). Feel free to start counting at 0 in the script.
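To sanity-check the arithmetic outside Elasticsearch, here is a minimal Python sketch of the same counting logic applied to the example document. The function name `count_documents` and its `start` parameter are made up for illustration and are not part of any Elasticsearch API. Like the script above, it only looks at top-level arrays, so it would also count arrays of plain values and would miss objects nested more than one level deep.

```python
def count_documents(source, start=1):
    """Count a document plus every entry in its top-level arrays.

    `start` is 1 to include the top-level doc itself, or 0 to count
    only the array entries (hypothetical helper, mirrors the sort script).
    """
    count = start
    for value in source.values():
        if isinstance(value, list):
            count += len(value)
    return count


# The example article from the question.
article = {
    "title": "Spiderman saves child from well",
    "body": "Move over, Lassie! New York has a new hero. But is he also a menace?",
    "authors": [
        {"name": "Jonah Jameson", "title": "Sr. Editor"},
        {"name": "Peter Parker", "title": "Photos"},
    ],
    "comments": [
        {"username": "captain_usa", "comment": "I understood that reference!"},
        {"username": "man_of_iron", "comment": "Congrats on being slightly more useful than a ladder."},
    ],
    "photos": [
        {"url": "https://assets.dailybugle.com/12345",
         "caption": "Spiderman delivering Timmy back to his mother"},
    ],
}

print(count_documents(article))           # 6: 1 + 2 authors + 2 comments + 1 photo
print(count_documents(article, start=0))  # 5: nested entries only
```
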