I have documents in CouchDB. The schema looks like this:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person if any one of the following is identical:
- email, or
- personal_blog_url, or
- telephone
I have created 3 views. Each one maps email/blog_url/telephone to userIds and then groups the userIds under the same key, e.g.:
_view/by_email:
----------------------------------
key values
[email protected] [123, 345]
[email protected] [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
- How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
- Is it good practice to do this kind of deduplication in CouchDB?
- If not, what would be a good way to deduplicate in CouchDB?
PS: in the final view, suppose that for each set of duplicates we keep only the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
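function (doc) {
  // Emit one row per identifying field that is present.
  // The numeric prefix keeps the three key spaces apart.
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}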
Merge the lists of ids in the reduce function.
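A minimal sketch of what such a reduce could look like, assuming every emitted value is an array of ids as in the map above. The name mergeIds is only there so the function can be called outside CouchDB; in a design document it would be an anonymous function. Because the output has the same shape as the input values, the same code serves the rereduce pass.

```javascript
// Hypothetical reduce for the merged view: concatenates the id arrays,
// drops duplicates and returns the ids in ascending order.
function mergeIds(keys, values, rereduce) {
  var merged = [];
  // On both reduce and rereduce, each value is an array of ids.
  values.forEach(function (ids) {
    ids.forEach(function (id) {
      if (merged.indexOf(id) === -1) {
        merged.push(id);
      }
    });
  });
  // Works for all-numeric or all-string ids (strings sort lexicographically).
  return merged.sort(function (a, b) {
    return a < b ? -1 : a > b ? 1 : 0;
  });
}
```

Querying the view with group=true would then give one row per key with the merged id list as its value. Note that a reduce whose output grows with its input can trip CouchDB's reduce_limit check when a group gets large.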
When a document changes, query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id in the merged list is smaller than the changed doc's id, update the changed doc's realId field to that minimal id; otherwise update the documents in the list with the changed doc's id. I suggest using a separate document to store the { userId, realId } relation.
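The update rule above can be sketched as two pure functions, with no CouchDB calls involved. Both names (buildKeys, planRealIdUpdates) are made up for illustration, and the sketch assumes all ids are of the same type (all numbers or all strings). buildKeys produces the keys to send in the view query (together with group=true), and planRealIdUpdates takes the id lists the view returned and decides which { userId, realId } records to write.

```javascript
// Build the view keys for one user doc, matching the [1|2|3, value]
// keys emitted by the merged map function.
function buildKeys(doc) {
  var keys = [];
  if (doc.email) keys.push([1, doc.email]);
  if (doc.personal_blog_url) keys.push([2, doc.personal_blog_url]);
  if (doc.telephone) keys.push([3, doc.telephone]);
  return keys;
}

// changedId: id of the doc reported by the _changes feed.
// idLists:   the id lists returned by the view for that doc's keys.
// Returns the { userId, realId } records that should be written.
function planRealIdUpdates(changedId, idLists) {
  // Collect every other id that shares a field with the changed doc.
  var others = [];
  idLists.forEach(function (ids) {
    ids.forEach(function (id) {
      if (id !== changedId && others.indexOf(id) === -1) {
        others.push(id);
      }
    });
  });

  // No duplicates found: the doc is its own real user.
  if (others.length === 0) {
    return [{ userId: changedId, realId: changedId }];
  }

  var minOther = others.reduce(function (min, id) {
    return id < min ? id : min;
  });

  // A smaller id already exists: point the changed doc at it.
  if (minOther < changedId) {
    return [{ userId: changedId, realId: minOther }];
  }

  // The changed doc has the smallest id: point the others at it.
  return others.map(function (id) {
    return { userId: id, realId: changedId };
  });
}
```

With the example data from the question, a change to user 123 yields a single record pointing 123 at 2, while a change to user 2 yields records pointing 123 and 345 at 2. This follows the one-step rule only; it does not follow chains where a group's representative is later pointed at an even smaller id.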