How to find Duplicate documents in Cosmos DB

11.8k Views Asked by At

I have seen like a huge amount of data write to cosmos DB from stream analytics job on a particular day. It was not supposed to write huge amount of documents in a day. I have to check if there is duplication of documents on that particular day.

Is there any query/any way to find out duplicate records in cosmos DB?

2

There are 2 best solutions below

1
On

It is possible if you know the properties to check for duplicates. We had a nasty production issue causing many duplicate records as well. Upon contacting MS Support to help us identify the duplicate documents, they gave us the following query;

Bear in mind: property A and B together define the uniqueness in our case. So if two documents have the same value for A and B, they are duplicate. You can then use the output of this query to, for example, delete the oldest ones but keep the recent (based on _ts)

SELECT d.A, d.B From 
   (SELECT c.A, c.B, count(c._ts) as counts FROM c
    GROUP BY c.Discriminator, c.EndDateTime) AS d
WHERE d.counts > 1
6
On

Is there any query/any way to find out duplicate records in cosmos DB?

Quick answer is YES.Please use distinct keyword in the cosmos db query sql.And filter the _ts(System generated unix timestamp:https://learn.microsoft.com/en-us/azure/cosmos-db/databases-containers-items#properties-of-an-item)

Something like:

Select distinct c.X,c.Y,C.Z....(all columns you want to check) from c where c._ts = particular day

Then you could delete the duplicate data using this bulk delete lib:https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/tree/master/BulkDeleteSample.