Mongo - aggregate docs containing contiguous time ranges into minimum document number

47 Views Asked by At

I have millions of docs containing start and end times of something they're measuring.

[
   {
    "start" : ISODate("2024-01-01T00:00:00"),
    "end" :   ISODate("2024-01-01T01:00:00")
   },
   {
    "start" : ISODate("2024-01-01T01:00:00"),
    "end" :   ISODate("2024-01-01T02:00:00")
   },
   {
    "start" : ISODate("2024-01-01T03:00:00"),
    "end" :   ISODate("2024-01-01T04:00:00")
   },
]

I would like a way of generating a result set containing only the non-contiguous blocks. For this example, that would look like:

[
   {
    "start" : ISODate("2024-01-01T00:00:00"),
    "end" :   ISODate("2024-01-01T02:00:00")
   },
   {
    "start" : ISODate("2024-01-01T03:00:00"),
    "end" :   ISODate("2024-01-01T04:00:00")
   },
]

I'm experienced with using aggregate and $group, but I can't think of a way to acheive what I need.

Any suggestions would be much appreciated.

2

There are 2 best solutions below

0
ray On

You can achieve the expected behaviour through 2 $setWindowFields. For the first $setWindowFields, you can find the $max end field for each document in the document range of ["unbounded", -1] that is sorted by {start: 1}. i.e. For a current document, you will search in the document range that has a start smaller than the current document and get the max end. We call this value prevMax.

Then, by comparing the prevMax with current document's start. You can determine if it should be start of a group. We call this boolean isStart.

After that, by using another $setWindowFields, you can conditionally find the grouping of all documents through isStart and in the document range of [current, unbounded]

Finally, using the grouping field to find $max end inside the grouping.

db.collection.aggregate([
  {
    "$setWindowFields": {
      "sortBy": {
        "start": 1
      },
      "output": {
        "prevMax": {
          "$max": "$end",
          "window": {
            "documents": [
              "unbounded",
              -1
            ]
          }
        }
      }
    }
  },
  {
    $set: {
      isStart: {
        $lt: [
          "$prevMax",
          "$start"
        ]
      }
    }
  },
  {
    "$setWindowFields": {
      "sortBy": {
        "start": 1
      },
      "output": {
        "grouping": {
          "$max": {
            "$cond": {
              "if": {
                "$eq": [
                  true,
                  "$isStart"
                ]
              },
              "then": "$start",
              "else": -1
            }
          },
          "window": {
            "documents": [
              "unbounded",
              "current"
            ]
          }
        }
      }
    }
  },
  {
    "$group": {
      _id: "$grouping",
      end: {
        $max: "$end"
      }
    }
  },
  {
    "$project": {
      _id: 0,
      start: "$_id",
      end: 1
    }
  }
])

Mongo Playground

2
nimrod serok On

Following @ray, if we can assume there are no overlaps in the documents, there is an option to look only at the previous document and the next document for each document, not scanning all previous document for each document (twice). This can also help with matching, which can reduce the number of documents we need to $set:

db.collection.aggregate([
  {$setWindowFields: {
      sortBy: {start 1},
      output: {
        prevMax: {$push: "$end", window: {documents: [-1, 0]}},
        nextMin: {$push: "$start", window: {documents: [0, 1]}}
      }
  }},
  {$match: {$or: [
        {$expr: {$lt: [{$first: "$prevMax"}, "$start"]}},
        {$expr: {$gt: [{$last: "$nextMin"}, "$end"]}},
        {"prevMax.1": {$exists: false}},
        {"nextMin.1": {$exists: false}}
  ]}},
  {$set: {
      isStart: {$or: [
          {$lt: [{$first: "$prevMax"}, "$start"]},
          {$eq: [{$size: "$prevMax"}, 1]}
      ]}
  }},
  {$setWindowFields: {
      sortBy: {start: 1},
      output: {grouping: {
          $max: {$cond: [{$eq: [true, "$isStart"]}, "$start", -1}},
          window: {documents: [-1, 0]}
      }}
  }},
  {$group: {_id: "$grouping", end: {$max: "$end"}}},
  {$project: {_id: 0, start: "$_id", end: 1}}
])

See how it works on the playground example