Deduplication / matching in CouchDB?

I have documents in CouchDB with the following schema:

userId
email
personal_blog_url
telephone

I treat two users as the same person if any one of the following fields is identical:

  • email
  • personal_blog_url
  • telephone

I have created 3 views, which map email/blog_url/telephone to userIds and then group the userIds that share the same key, e.g.,

_view/by_email:
----------------------------------
key                   values     
[email protected]    [123, 345]
[email protected]    [23, 45, 333]

_view/by_blog_url:
----------------------------------
key                   values     
http://myblog.com    [23, 45]
http://mysite.com/ss [2, 123, 345]

_view/by_telephone:
----------------------------------
key                   values     
232-932-9088          [2, 123]
000-111-9999          [45, 1234]
999-999-0000          [1]

My questions:

  • How can I merge the results from the 3 different views into a final user table/view that contains no duplicates?
  • Is it good practice to do this kind of deduplication in CouchDB?
  • If not, what would be a good way to deduplicate in CouchDB?

P.S. In the final view, suppose that for each set of dupes we keep only the smallest userId.

Thanks.

There are 2 best solutions below

4
On BEST ANSWER

Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).

  • Merge the views into one (emit different fields in one map):

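    function (doc) {
      // Skip documents that lack any of the three fields.
      if (!doc.email || !doc.personal_blog_url || !doc.telephone) return;
      // The numeric prefix keeps the three field types apart within one view.
      emit([1, doc.email], [doc._id]);
      emit([2, doc.personal_blog_url], [doc._id]);
      emit([3, doc.telephone], [doc._id]);
    }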

  • Merge the lists of ids in the reduce function.

  • When a new doc arrives in the changes feed, query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id in the merged list is smaller than the changed doc's id, update the changed doc's realId field; otherwise, update the documents in the list with the changed doc's id.
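
The reduce and merge steps above could be sketched roughly as follows. This is only an illustration: the names mergeIds and resolveRealId are hypothetical, the ids are assumed to be numeric as in the question (a real doc._id is usually a string), and rows is assumed to be the rows array returned by a grouped query on the merged view.

```javascript
// Reduce sketch: merge the id lists emitted under the same key.
// On rereduce, `values` holds earlier reduce outputs, which are also id
// lists, so the same flattening works in both cases.
function mergeIds(keys, values, rereduce) {
  var seen = {};
  var merged = [];
  values.forEach(function (ids) {
    ids.forEach(function (id) {
      if (!seen[id]) {
        seen[id] = true;
        merged.push(id);
      }
    });
  });
  return merged;
}

// Changes-feed step sketch: given the id of the changed doc and the view
// rows matching its email / blog URL / telephone, decide which id is the
// "real" one and which docs need their realId updated.
// Assumption: ids are numbers, so "smallest" is a numeric comparison.
function resolveRealId(changedId, rows) {
  var lists = rows.map(function (row) { return row.value; });
  var others = mergeIds(null, lists, false).filter(function (id) {
    return id !== changedId;
  });
  if (others.length === 0) {
    return { realId: changedId, docsToUpdate: [] };
  }
  var minId = Math.min.apply(null, others);
  if (minId < changedId) {
    // An older (smaller) id already exists: only the changed doc points to it.
    return { realId: minId, docsToUpdate: [changedId] };
  }
  // The changed doc has the smallest id: the other docs should point to it.
  return { realId: changedId, docsToUpdate: others };
}
```

One caveat: a reduce that returns ever-growing lists may be rejected by CouchDB when the reduce_limit setting is enabled, so this approach probably only suits small groups of ids.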

I suggest using a separate document to store the { userId, realId } relation.
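
To build such { userId, realId } records for existing data in one pass, one option (not something this answer prescribes) is a union-find over the id groups returned by the views, run in a script outside CouchDB. The sketch below assumes numeric userIds as in the question; buildRealIdMap is a hypothetical name, and groups stands for the value lists collected from the by_* views.

```javascript
// Union-find sketch: every group of ids sharing an email, blog URL or
// telephone is merged into one component whose root is its smallest id.
function buildRealIdMap(groups) {
  const parent = new Map();

  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    let root = x;
    while (parent.get(root) !== root) root = parent.get(root);
    // Path compression: point every node on the path directly at the root.
    let cur = x;
    while (parent.get(cur) !== root) {
      const next = parent.get(cur);
      parent.set(cur, root);
      cur = next;
    }
    return root;
  };

  const union = (a, b) => {
    const ra = find(a);
    const rb = find(b);
    if (ra === rb) return;
    // Keep the smaller id as the root so that realId is the minimum.
    if (ra < rb) parent.set(rb, ra);
    else parent.set(ra, rb);
  };

  for (const ids of groups) {
    ids.forEach((id) => union(ids[0], id));
  }

  return [...parent.keys()]
    .map((id) => ({ userId: id, realId: find(id) }))
    .sort((a, b) => a.userId - b.userId);
}

// Groups taken from the three example views in the question.
const exampleGroups = [
  [123, 345], [23, 45, 333],       // by_email
  [23, 45], [2, 123, 345],         // by_blog_url
  [2, 123], [45, 1234], [1],       // by_telephone
];
const exampleMap = buildRealIdMap(exampleGroups);
```

With the question's sample data, users 123 and 345 map to realId 2, users 45, 333 and 1234 map to realId 23, and user 1 maps to itself.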

1
On

You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.

Here's one idea.

Instead of creating 3 views, you could create one view (that indexes the data if it exists):

Key                             Values
---                             ------
[userId, 'phone']               777-555-1212
[userId, 'email']               [email protected]
[userId, 'url']                 favorite.url.example.com

I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
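
A map function for such a view might look like the sketch below. The field names come from the question's schema; the name mapUserFields is only there for readability (CouchDB view functions are normally anonymous), and emit is the function CouchDB's view server provides.

```javascript
// Map sketch: one row per field that exists on the document,
// keyed by [userId, fieldType] with the raw value as the row value.
function mapUserFields(doc) {
  if (doc.userId === undefined) return;
  if (doc.telephone) emit([doc.userId, 'phone'], doc.telephone);
  if (doc.email) emit([doc.userId, 'email'], doc.email);
  if (doc.personal_blog_url) emit([doc.userId, 'url'], doc.personal_blog_url);
}
```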

Then, to query, you could do something like:

...startkey=[userId]&endkey=[userId,{}]

That would give you all of the indexed information for that userId as a series of rows. You'd still need to parse it apart to see if there were duplicates. But this way, the results would be nicely merged into a single CouchDB call.
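
As a rough sketch of the query and the parsing step: the keys have to be sent as URL-encoded JSON, and the returned rows can then be grouped by field type. Both function names are hypothetical, and rows is assumed to be the rows array from the view response.

```javascript
// Build the startkey/endkey part of the query string for one user.
// [userId, {}] sorts after every [userId, <string>] key, because objects
// collate after strings in CouchDB's view ordering.
function buildUserRangeQuery(userId) {
  var startkey = JSON.stringify([userId]);
  var endkey = JSON.stringify([userId, {}]);
  return 'startkey=' + encodeURIComponent(startkey) +
    '&endkey=' + encodeURIComponent(endkey);
}

// Group the rows of the response by field type ('phone', 'email', 'url').
function collectUserFields(rows) {
  var fields = {};
  rows.forEach(function (row) {
    var fieldType = row.key[1];
    if (!fields[fieldType]) fields[fieldType] = [];
    fields[fieldType].push(row.value);
  });
  return fields;
}
```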

Here's a nice example of using arrays as keys on StackOverflow.

You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.

Once duplicates are discovered, you could consider cleaning up the data on the fly and preventing new duplicates from occurring as new data is entered into your application.