Reduce the computation and storage burden for row-wise comparison in Python


Suppose we have the following data frame:

      message
0       ABC
1       abc
2       cba
3      abcd
4      dcsa
5      adcd
6      abcd
7       cba                   

I want to perform row-wise comparisons, then finally group the data using the similarity vector for some metric calculation.

For now, I am doing the following:

     message      similarity
0       ABC  [1, 0, 0, 0, 1, 0, 1, 0]
1       abc  [1, 0, 1, 0, 1, 0, 1, 0]
2       cba  [1, 0, 0, 0, 1, 0, 1, 0]
3      abcd  [1, 0, 1, 0, 1, 0, 1, 0]
4      dcsa  [1, 0, 0, 0, 1, 0, 1, 0]
5      adcd  [1, 0, 0, 0, 1, 0, 1, 0]
6      abcd  [1, 1, 0, 0, 1, 0, 1, 0]
7       cba  [1, 1, 0, 0, 1, 0, 1, 0]

From the above, it is clear that I am doing a brute-force job. Starting from row 0, I compare each row with all the rows (including itself); if two rows are sufficiently similar I store a 1, otherwise a 0. The similarity vector therefore has length N (the number of rows), making this an O(N^2) problem. For a small dataset this is acceptable, but it does not scale: with 1M rows, each row would carry a 1M-element similarity vector, which is really crazy.

Therefore, I am here for input regarding two questions:

  1. Can we reduce the O(N^2) complexity? My current thought is to order the message column and compare on a reduced basis, e.g. comparing row 1 with rows 2 through N, row 2 with rows 3 through N, and so on. But then the similarity vectors would vary in length across rows.

  2. Is there a better way to store the similarity vector? We need it for grouping purposes, so such a vector seems unavoidable. Can we use another method to handle it?

A minimal reproducible example is provided below to play with.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'message': ['ABC', 'ABCD', 'DCBA', 'abcde', 'CBDA', 'abcde', 'edcba']})

def similarity_score(s1, s2):
    coef = SequenceMatcher(None, s1, s2).ratio()
    if coef >= 0.75:
        return 1
    else:
        return 0

def similarity(x, df):
    sim_score = []
    for i in df['message']:
        sim_score.append(similarity_score(x, i))
    return sim_score

df['similarity'] = df['message'].apply(lambda x: similarity(x, df)).astype(str)
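As a sketch of the idea in question 1: comparing only index pairs i < j and mirroring the result halves the number of SequenceMatcher calls while still producing full fixed-length vectors. This is a minimal illustration (not the original code), reusing the 0.75 threshold and message list from the example above:

```python
from difflib import SequenceMatcher
from itertools import combinations

messages = ['ABC', 'ABCD', 'DCBA', 'abcde', 'CBDA', 'abcde', 'edcba']
n = len(messages)

# Full N x N similarity matrix, but each unordered pair is scored only once.
sim = [[0] * n for _ in range(n)]
for i in range(n):
    sim[i][i] = 1  # every string is trivially similar to itself
for i, j in combinations(range(n), 2):
    if SequenceMatcher(None, messages[i], messages[j]).ratio() >= 0.75:
        sim[i][j] = sim[j][i] = 1  # mirror the result: similarity is symmetric

print(sim[0])  # similarity vector for 'ABC'
```

This cuts the constant factor in half but is still O(N^2) comparisons overall; the vectors stay the same length as before, avoiding the ragged-vector problem mentioned in question 1.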

Below I'd like to illustrate my motivation for the questions above using a reproducible example. For each message, there is a count (cnt). After the similarity vector (similarity) is created, I use it as a grouping variable to categorize the messages into different groups (group), and finally sum up the count per group.

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'message': ['ABC','abc','cba','abcd','dcsa','adcd','abcd','cba'], 
                  'cnt': [1, 2, 3, 4, 5, 6, 7, 8]})

def similarity_score(s1, s2):
    coef = SequenceMatcher(None, s1, s2).ratio()
    if coef >= 0.80:
        return 1
    else:
        return 0

def similarity(x,df):
    sim_score = []
    for i in df['message']:
        sim_score.append(similarity_score(x, i))
    return sim_score

df['similarity'] = df['message'].apply(lambda x: similarity(x, df)).astype(str)
df["group"] = pd.factorize(df["similarity"].astype(str))[0] + 1
print(df)

The output is as follows:

   message  cnt         similarity          group
0     ABC    1  [1, 0, 0, 0, 0, 0, 0, 0]      1
1     abc    2  [0, 1, 0, 1, 0, 0, 1, 0]      2
2     cba    3  [0, 0, 1, 0, 0, 0, 0, 1]      3
3    abcd    4  [0, 1, 0, 1, 0, 0, 1, 0]      2
4    dcsa    5  [0, 0, 0, 0, 1, 0, 0, 0]      4
5    adcd    6  [0, 0, 0, 0, 0, 1, 0, 0]      5
6    abcd    7  [0, 1, 0, 1, 0, 0, 1, 0]      2
7     cba    8  [0, 0, 1, 0, 0, 0, 0, 1]      3

Finally, I sum up the count by group as follows (that is why, at the beginning of the question, I said "group the data using the similarity vector for some metric calculation").

df_final = df.groupby("group")["cnt"].sum()
print(df_final)

The output is as follows:

group
1     1
2    13
3    11
4     5
5     6
Name: cnt, dtype: int64

Hopefully the added example is clear enough. Thank you.

Answer by Kyle F. Hartzenberg:

It sounds like the core problem is grouping strings with other strings based on their similarity. The similarity vectors seem like an (optional) means to an end, so one can drop them entirely and go straight to grouping. The following solution takes nC2 = n(n - 1)/2 loop iterations, i.e. n! / (2!(n - 2)!). For example, 5 rows take 10 loop iterations, 8 rows take 28 loop iterations, and so on. This is the minimum number of comparisons that must be made to cover all 2-element subsets of the set of rows.
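As a quick sanity check on those pair counts, Python's standard-library math.comb gives the same numbers:

```python
import math

# Number of unordered pairs among n rows: C(n, 2) = n * (n - 1) / 2
for n in (5, 8, 1_000_000):
    print(n, math.comb(n, 2))
```

Even at 1M rows the pair count is on the order of 5 * 10^11, so halving the work helps but does not change the quadratic growth.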

Given the use of a dictionary, the solution below assumes no duplicate strings. I presume that you can pre-process your data to remove exact duplicates.

Solution

from difflib import SequenceMatcher

import pandas as pd

def similar(s1, s2, threshold=0.75):
    # Same similarity test as in the question, returned as a boolean
    return SequenceMatcher(None, s1, s2).ratio() >= threshold

messages = ["ABC", "ABCD", "DCBA", "abcde", "CBDA", "abcd", "edcba"]
results = {message: [] for message in messages}

for idx, message in enumerate(messages):
    for comp_message in messages[idx + 1 :]:
        if similar(message, comp_message):
            results[message].append(comp_message)
            results[comp_message].append(message)

df = pd.DataFrame(results.items(), columns=["Message", "Similar-Message"])
print(df)

Output

  Message Similar-Message
0     ABC          [ABCD]
1    ABCD           [ABC]
2    DCBA          [CBDA]
3   abcde          [abcd]
4    CBDA          [DCBA]
5    abcd         [abcde]
6   edcba              []

Additional Info

Here is a small visualisation for 5 rows of strings showing which combinations of strings are compared, based on their index. Comparisons are marked with * and are carried out column-wise, then row-wise, by the above solution (i.e. c1r2, c1r3, ... c1r5, c2r3, c2r4, c2r5, c3r4, c3r5, c4r5). When string 0 is compared with string 1 (i.e. 01) and they are similar, string 1 is stored in string 0's similar-string list and string 0 is stored in string 1's, so the symmetric 10 is handled implicitly. This minimises the number of iterations and similarity calculations that have to be performed.

00  10  20  30  40
01* 11  21  31  41
02* 12* 22  32  42
03* 13* 23* 33  43
04* 14* 24* 34* 44
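To connect this back to the question's goal of summing cnt per group, one option (not part of the answer above, just a hedged sketch) is to merge the similar pairs with a small union-find and then group on the resulting root labels. The cnt values below are hypothetical; note that union-find groups transitively (if A is similar to B and B to C, all three land in one group), which can differ from the question's exact-vector grouping:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'message': ['ABC', 'ABCD', 'DCBA', 'abcde', 'CBDA', 'abcd', 'edcba'],
                   'cnt': [1, 2, 3, 4, 5, 6, 7]})

# Union-find: each row starts in its own group; similar pairs are merged.
parent = list(range(len(df)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps trees shallow
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Same nC2 pair loop as the answer, but merging groups instead of listing neighbours.
for i, j in combinations(range(len(df)), 2):
    if SequenceMatcher(None, df.loc[i, 'message'], df.loc[j, 'message']).ratio() >= 0.75:
        union(i, j)

# Relabel the union-find roots as consecutive group numbers, as in the question.
df['group'] = pd.factorize([find(i) for i in range(len(df))])[0] + 1
print(df.groupby('group')['cnt'].sum())
```

This avoids storing any N-length similarity vectors at all: the per-row state is a single integer parent pointer, which addresses question 2's storage concern.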