Optimize query and index for mongo $group, $lookup and $count query

48 Views Asked by At

I have two collections:

  • Customer
    • id
    • state: in either ['active', 'inactive']
  • Tag
    • cid: map to customer.id
    • date: in YYYYMMDD integer
    • name: the tag name
    • count: the number of times the user is tagged with the specific tag name in that specific date

My goal is try to adjust index or change the query pipeline to optimize the query to answer how many active users with specific tag tag1 and tagged for at least 2 times during the date range, my mongo pipeline will be:

[
1   { $match : {
        date: { $gte: ..., $lte: ... },
        name: 'tag1'
    } },
2   { $group: {
        _id: '$cid',
        totalCount: '$count'
    } },
3   { $match: {
        totalCount: { $gte: 2 }
    } },
4   { $lookup: {
        from: 'Customer',
        pipeline: [
            { $match: {
                "$expr": {
                    "$and": [
                        { "$eq": [ "$_id", "$$cid" ] },
                        { "$eq": [ "$state", "active" ] },
                    ]
                }
            } }
        ],
        as: 'Customer'
    } },
5   { "$unwind": "$Customer" },
6   { "$count": "count" }
]

For the first part ( stage 1~3 )

I have tried add index with

{
    date : -1.0,
    name : 1.0,
    cid : 1.0
}

The following is the explain("executionStats") result without $lookup (stage 4 and 5). The $match takes only 802 ms, but the $group part takes the majority of the time (16610 - 5598 = 11012 ms).

[
    {
        $cursor: {
            executionStats: {
                executionSuccess: true,
                nReturned: 2632494,
                executionTimeMillis: 17937,
                totalKeysExamined: 2632494,
                totalDocsExamined: 2632494,
                executionStages: {
                    stage: 'PROJECTION_SIMPLE',
                    nReturned: 2632494,
                    executionTimeMillisEstimate: 802,
                    inputStage: {
                        stage: 'FETCH',
                        nReturned: 2632494,
                        executionTimeMillisEstimate: 636,
                        docsExamined: 2632494,
                        inputStage: {
                            stage: 'IXSCAN',
                            nReturned: 2632494,
                            executionTimeMillisEstimate: 376,
                            indexName: 'date_1_name_1_cid_1',
                            direction: 'forward',
                            indexBounds: {
                                date: ['[20230101.0, 20231231.0]'],
                                name: ['["tag1"]'],
                                cid: ['[MinKey, MaxKey]'],
                            },
                            keysExamined: 2632494,
                            seeks: 1,
                        },
                    },
                },
            },
        },
        nReturned: NumberLong(2632494),
        executionTimeMillisEstimate: NumberLong(5598),
    },
    {
        $group: {
            _id: '$cid',
            totalCount: { $sum: '$count' },
        },
        nReturned: NumberLong(2632494),
        executionTimeMillisEstimate: NumberLong(16610),
    },
    {
        $match: {
            totalCount: { $gte: 2.0 },
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(17537),
    },
    {
        $group: {
            _id: { $const: null },
            count: { $sum: { $const: 1 } }
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(17933),
    },
    {
        $project: { 
            count: true, 
            _id: false
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(17933),
    },
];

I also tried to add count to the last position of the compound index. The result is only a little better. One can find that the stage PROJECTION_SIMPLE has changed to PROJECTION_COVERED, the FETCH stage is gone, and the totalDocsExamined field shows 0. That is great in that the new index covered all the date required before going into the $group stage. However, group still take the majority of the time (15033 - 4239 = 10794 ms)

[
    {
        $cursor: {
            executionStats: {
                executionTimeMillis: 16285,
                totalKeysExamined : 2632494,
                totalDocsExamined : 0,
                executionStages: {
                    stage: 'PROJECTION_COVERED',
                    executionTimeMillisEstimate: 479,
                    inputStage: {
                        stage: 'IXSCAN',
                        executionTimeMillisEstimate: 370,
                        indexName: 'date_1_name_1_cid_1_count_1',
                        indexBounds: {
                            date: ['[20230101.0, 20231231.0]'],
                            name: ['["tag1"]'],
                            cid: ['[MinKey, MaxKey]'],
                            count: ['[MinKey, MaxKey]'],
                        },
                        keysExamined: 2632494,
                        seeks: 1,
                    },
                },
            },
        },
        nReturned: NumberLong(2632494),
        executionTimeMillisEstimate: NumberLong(4239),
    },
    {
        $group: {
            _id: '$cid',
            totalCount: { $sum: '$count' },
        },
        nReturned: NumberLong(2632494),
        executionTimeMillisEstimate: NumberLong(15033),
    },
    {
        $match: {
            totalCount: { $gte: 2.0 },
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(15863),
    },
    {
        $group: {
            _id: { $const: null },
            count: { $sum: { $const: 1 } }
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(16285),
    },
    {
        $project: { 
            count: true, 
            _id: false
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(16285),
    },
]

Is there any other way to improve the time?

For the second part ( all stage )

This part drives my CRAZY. I have added the following index on Customer collection and hope that the lookup can fully utilize index.

{
    _id: 1.0,
    state: 1.0,
}

When $lookup added, it takes additional 766,596 - 14,391 = 752,205 ms to complete (OMG...). In theory, since I have compound index in Customer with all required field for lookup match, the lookup should be completed by IXSCAN. And because it requires no addition field to unwind and count, I think it is possible to complete the query without fetching any documents. Since I found no way to let explain to show what happened on $lookup stage. I do not know where I can improve.

[
    ...
    {
        $match: {
            totalDensity: { $gte: 1.0 },
        },
        nReturned: NumberLong(2632494),
        executionTimeMillisEstimate: NumberLong(14391),
    },
    {
        $lookup: { ... },
        nReturned: NumberLong(2313852),
        executionTimeMillisEstimate: NumberLong(766596),
    },
    {
        $group: {
            _id: { $const: null },
            count: { $sum: { $const: 1 } },
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(766596),
    },
    {
        $project: {
            count: true,
            _id: false,
        },
        nReturned: NumberLong(1),
        executionTimeMillisEstimate: NumberLong(766596),
    }
]
1

There are 1 best solutions below

0
Wernfried Domscheit On

You don't need { "$unwind": "$Customer" }. This one should be faster:

{ $project: { size: { $size: "$Customer" } } },
{ $group: { _id: null, count: { $sum: "$size" } } }