I have a list like the following in python (the real one is huge and I cannot do this only by looking at it):
original1=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
['[email protected]', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
['[email protected]', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
['[email protected]', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
['[email protected]', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
['[email protected]', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]
I split it between data and names to work with data:
datos=original1[-(len(original1)-1):len(original1)]
I need to do a dictionary that has all the duplicates together, considering email and tel, but I need to apply transitivity: since line 0= line 2 if we consider email, but also line 1 if we consider tel, and line 1= line 3 if we consider email again, I need to get that all candidates in this case are 0,1,2 and 3, while 4 is alone.
I created the following code:
from collections import defaultdict
email_to_indices = defaultdict(list)
phone_to_indices = defaultdict(list)
for idx, row in enumerate(datos):
email = row[0].lower()
phone = row[1]
email_to_indices[email].append(idx)
phone_to_indices[phone].append(idx)
So now I need to apply transitivity rules, to get together 0 to 3 and alone 4.
If you print
print 'email', email_to_indices
print 'phone', phone_to_indices
You get:
email defaultdict(, {'[email protected]': [0, 2],'[email protected]': [1, 3], '[email protected]': [4]})
phone defaultdict(, {'1': [0, 1], '3': [3], '2': [2], '4': [4]})
Don't know how to get the union of those considering the transitive property. I need to get something like:
first_group: [0, 1, 2 , 3]
second_group: [4]
Thanks!
Here you have a graph, or Bipartite graph to be more precise. The nodes are of two types: emails and phones. Two nodes are connected if there exists a record with that email and phone. Or we even can say that the record itself is the edge which connects two nodes.
The task is to find Connected components of this graph. By following the links you can find algorithms which can do it in linear time.
Of course some quick and dirty solutions can also be invented, and even may be deemed appropriate if your dataset is small enough.
You can find some Python implementations here: Python connected components
UPDATE: Here is an example of how you can construct the graph:
So every node has a type (
EMAILorPHONE, they can actually be just integers, for example 0 and 1, I made them strings only for nice printing) and a value. Graph is a dictionary with nodes as keys and sets of connected nodes as values.