MongoDB Select 2 most recent documents from each provider in each category

73 Views Asked by At

Trying to model a homepage for our website where we want to have a list of MOST RECENTLY added list of documents limited to a maximum of 2 documents per provider per each category. in other words, we don't want to over-feature individual provider or individual category.

we have a following doc structure:

events:

{
 created_at: isodate(),
 start_date: isodate(),
 title: "document 1 title health related electronic device",
 categories: [
         {"id": 1, "title": "health"},
         {"id": 2, "title": "electronics"}],
 provider: { "id": 555, "name": "PAMF"}
 }
},
{
 created_at: isodate(),
 start_date: isodate(),
 title: "document 2 title electronics iphone",
 categories: [
         {"id": 5, "title": "phone"},
         {"id": 2, "title": "electronics"},
         {"id": 9, "title": "accessory"},
         {"id": 15, "title": "mobile"}],
 provider: { "id": 555, "name": "Apple Store"}
 }
}

... and so on.

Is there a way to write a query in MONGODB to select 2 of each documents from each provider in each category ?

Just for the comparison sake we currently do it in elasticsearch like this:

{
  "size": 0,
  "aggs": {
    "level-1": {
      "terms": {
        "field": "provider.name",
        "size": 0
      },
      "aggs": {
        "level-2": {
          "terms": {
            "field": "categories.id",
            "size": 0,
            "order": {
              "_term": "asc"
            }
          },
          "aggs": {
            "top-hits-docs": {
              "top_hits": {
                "size": 2,
                "sort": [
                  {
                    "start_date": "desc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

As pointed out, I forgot to include desired output, but mainly because output doesn't really matter. This is not a real-time user facing job, and it will be executed offline to re-sort homepage listings multiple times throughout the day. The enclosing top-level element doesn't matter much as long as the desired dataset is there. So one of the ways to wrap each host inside each category with couple sample events, such as:

category_id.1
   provider1
      doc5
      doc1
   provider3
      doc8
      doc6
category_id.2
   provider2
      doc45
      doc15
   provider3
      doc153
      doc66

could also be in the reverse order, where everything is enclosed inside of provider, ie:

provider1
  category_id.1
      doc599
      doc15
  category_id.2
      doc81
      doc67
provider1
  category_id.1
      doc57
      doc14
   category_id.6
      doc86
      doc64
0

There are 0 best solutions below