How to remove duplicate values inside a list in mongodb

I have a MongoDB collection. When I run

db.bill.find({})

I get,

{ 
    "_id" : ObjectId("55695ea145e8a960bef8b87a"),
    "name" : "ABC. Net", 
    "code" : "1-98tfv",
    "abbreviation" : "ABC",
    "bill_codes" : [  190215,  44124,  190215,  147708 ],
    "customer_name" : "abc"
}

I need an operation that removes the duplicate values from bill_codes. The final result should be:

{ 
    "_id" : ObjectId("55695ea145e8a960bef8b87a"),
    "name" : "ABC. Net", 
    "code" : "1-98tfv",
    "abbreviation" : "ABC",
    "bill_codes" : [  190215,  44124,  147708 ],
    "customer_name" : "abc"
}

How can I achieve this in MongoDB?

There are 4 best solutions below

4
On BEST ANSWER

Well, you can do this using the aggregation framework as follows:

collection.aggregate([
    { "$project": {
        "name": 1,
        "code": 1,
        "abbreviation": 1,
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] }
    }}
])

The $setUnion operator is a "set" operator, so its result is a "set" in which only the "unique" items are kept. Note that the order of the elements in the resulting array is not guaranteed.
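
To make the set semantics concrete, here is a minimal plain-JavaScript sketch of what a union with an empty array does to the sample bill_codes. The setUnion helper is hypothetical and only mimics the operator for illustration; it is not how the server implements it.

```javascript
// Hypothetical helper that mimics the effect of $setUnion on two arrays.
// A JavaScript Set keeps first-seen order; MongoDB leaves the order unspecified.
function setUnion(a, b) {
    return Array.from(new Set(a.concat(b)));
}

var billCodes = [190215, 44124, 190215, 147708];
var unique = setUnion(billCodes, []);
console.log(unique); // [ 190215, 44124, 147708 ]
```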

If you are still using a MongoDB version older than 2.6 then you would have to do this operation with $unwind and $addToSet instead:

collection.aggregate([
    { "$unwind": "$bill_codes" },
    { "$group": {
        "_id": "$_id",
        "name": { "$first": "$name" },
        "code": { "$first": "$code" },
        "abbreviation": { "$first": "$abbreviation" },
        "bill_codes": { "$addToSet": "$bill_codes" }
    }}
])

It's not as efficient, but these operators have been supported since version 2.2.

Of course, if you actually want to modify your collection documents permanently, then you can expand on this and process the updates for each document accordingly. You can retrieve a "cursor" from .aggregate() and iterate it, basically as in this shell example:

db.collection.aggregate([
    { "$project": {
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] },
        "same": { "$eq": [
            { "$size": "$bill_codes" },
            { "$size": { "$setUnion": [ "$bill_codes", [] ] } }
        ]}
    }},
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})

A bit more involved for earlier versions:

db.collection.aggregate([
    { "$unwind": "$bill_codes" },
    { "$group": {
        "_id": { 
            "_id": "$_id",
            "bill_code": "$bill_codes"
        },
        "origSize": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id._id",
        "bill_codes": { "$push": "$_id.bill_code" },
        "origSize": { "$sum": "$origSize" },
        "newSize": { "$sum": 1 }
    }},
    { "$project": {
        "bill_codes": 1,
        "same": { "$eq": [ "$origSize", "$newSize" ] }
    }},
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})

The added operations compare the length of the "de-duplicated" array with the length of the original array, so that only those documents that actually had "duplicates" removed are returned for processing as updates.
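
The same size comparison can also be sketched in client code. The following is a minimal illustration and not part of the pipelines above: hasDuplicates and buildDedupOps are hypothetical helper names, the sample documents are made up, and the resulting array is shaped for bulkWrite(), which assumes MongoDB 3.2 or later.

```javascript
// Hypothetical helper: true when the array shrinks once duplicates are dropped.
function hasDuplicates(arr) {
    return new Set(arr).size !== arr.length;
}

// Hypothetical helper: build one updateOne operation per document that
// actually contains duplicates, leaving the other documents untouched.
function buildDedupOps(docs) {
    return docs
        .filter(function(doc) { return hasDuplicates(doc.bill_codes); })
        .map(function(doc) {
            return { "updateOne": {
                "filter": { "_id": doc._id },
                "update": { "$set": { "bill_codes": Array.from(new Set(doc.bill_codes)) } }
            }};
        });
}

// Sample documents for illustration only.
var docs = [
    { "_id": 1, "bill_codes": [190215, 44124, 190215, 147708] },
    { "_id": 2, "bill_codes": [1, 2, 3] }
];
var ops = buildDedupOps(docs);
console.log(ops.length); // 1, only the first document needs an update
```

In the shell, the operations could then be sent in one round trip with db.collection.bulkWrite(ops), as long as ops is not empty.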


A "for python" note probably belongs here as well. If you don't care about "identifying" the documents that contain duplicate array entries and are prepared to "blast" the whole collection with updates, then just use the Python set() built-in in the client code to remove the duplicates (note that set() does not keep the original element order):

for doc in collection.find():
    # update_one() replaces update(), which is deprecated since PyMongo 3.0
    collection.update_one(
        { "_id": doc["_id"] },
        { "$set": { "bill_codes": list(set(doc["bill_codes"])) } }
    )

So that's quite simple and it depends on which is the greater evil, the cost of finding the documents with duplicates or updating every document whether it needs it or not.

This at least covers techniques.

0
On

You can use a forEach loop with some JavaScript:

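db.bill.find().forEach(function(entry) {
    var arr = entry.bill_codes;
    // keep an element only at the position of its first occurrence
    var uniqueArray = arr.filter(function(elem, pos) {
        return arr.indexOf(elem) == pos;
    });
    entry.bill_codes = uniqueArray;
    // write the whole document back with the de-duplicated array
    db.bill.save(entry);
})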
0
On

MongoDB 3.4+ has the $addFields aggregation stage, which lets you avoid explicitly listing all the other fields in $project:

db.bill.aggregate([
    {"$addFields": {
        "bill_codes": {"$setUnion": ["$bill_codes", []]}
    }}
])

Just for reference, here is another (lengthier) way that uses $replaceRoot together with $mergeObjects (available since MongoDB 3.6) and also doesn't require listing all possible fields:

db.bill.aggregate([
    {'$unwind': {
        'path': '$bill_codes',
        // output the document even if its bill_codes array is empty
        'preserveNullAndEmptyArrays': true
    }},
    {'$group': {
        '_id': '$_id',
        'bill_codes': {'$addToSet': '$bill_codes'},
        // arbitrary name that doesn't exist on any document
        '_other_fields': {'$first': '$$ROOT'},
    }},
    {
      // per the $mergeObjects docs, when several documents contain the same field,
      // the value from the last document merged wins, so the new deduped array is used
      '$replaceRoot': {'newRoot': {'$mergeObjects': ['$_other_fields', "$$ROOT"]}}
    },
    {'$project': {'_other_fields': 0}}
])    
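
The "last document wins" behaviour that this pipeline relies on can be sketched in plain JavaScript with Object.assign, which resolves duplicate keys the same way. The sample values and variable names below are made up for illustration.

```javascript
// Stand-in for the original document captured by { "$first": "$$ROOT" }.
// After $unwind, its bill_codes field holds a single element, not the array.
var otherFields = { "_id": 1, "name": "ABC. Net", "bill_codes": 190215 };

// Stand-in for the $group output, which carries the de-duplicated array.
var grouped = { "_id": 1, "bill_codes": [190215, 44124, 147708] };

// Later sources overwrite earlier ones, as with $mergeObjects.
var merged = Object.assign({}, otherFields, grouped);
console.log(merged.bill_codes); // [ 190215, 44124, 147708 ]
console.log(merged.name);       // ABC. Net
```

This is why the pipeline puts $$ROOT last in the $mergeObjects list: the grouped array has to override the single unwound value kept in _other_fields.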
1
On

Starting in MongoDB 4.2, the update parameter of the collection's updateMany method can also be an aggregation pipeline (instead of a document). The pipeline supports the $set, $unset and $replaceWith stages. Using the $setIntersection aggregation operator within the $set stage, you can remove the duplicates from an array field and update the collection in a single operation.

An example:

arrays collection:

{ "_id" : 0, "a" : [ 3, 5, 5, 3 ] }
{ "_id" : 1, "a" : [ 1, 2, 3, 2, 4 ] }

From the mongo shell:

db.arrays.updateMany(
   {  },
   [
      { $set: { a: { $setIntersection: [ "$a", "$a" ] } } }
   ]
)

The updated arrays collection:

{ "_id" : 0, "a" : [ 3, 5 ] }
{ "_id" : 1, "a" : [ 1, 2, 3, 4 ] }

The other update methods, update(), updateOne() and findAndModify(), also have this feature.
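
If most documents are already free of duplicates, the empty filter could be replaced so that only the affected documents are rewritten. The following is a sketch under assumptions, not something taken from the example above: it relies on $expr (MongoDB 3.6+) in the update filter, and on the field a being an array in every document, since $size raises an error otherwise.

```javascript
// De-duplicated form of the array, reused in the filter and in the update.
var deduped = { "$setIntersection": [ "$a", "$a" ] };

// Match only documents whose array shrinks once duplicates are removed.
var filter = { "$expr": { "$ne": [ { "$size": "$a" }, { "$size": deduped } ] } };

// Same aggregation-pipeline update as in the example above.
var pipeline = [ { "$set": { "a": deduped } } ];

// In the mongo shell this would be run as:
// db.arrays.updateMany(filter, pipeline)
console.log(JSON.stringify(filter));
```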