Difflib's SequenceMatcher - Customized equality

2.8k Views Asked by At

I've been trying to create a nested or recursive effect with SequenceMatcher.

The final goal is comparing two sequences, both may contain instances of different types.

For example, the sequences could be:

l1 = [1, "Foo", "Bar", 3]
l2 = [1, "Fo", "Bak", 2]

Normally, SequenceMatcher will identify only [1] as a common sub-sequence for l1 and l2.

I'd like SequnceMatcher to be applied twice for string instances, so that "Foo" and "Fo" will be considered equal, as well as "Bar" and "Bak", and the longest common sub-sequence will be of length 3 [1, Foo/Fo, Bar/Bak]. That is, I'd like SequenceMatcher to be more forgiving when comparing string members.

What I tried doing is write a wrapper for the built-in str class:

from difflib import SequenceMatcher
class myString:
    def __init__(self, string):
        self.string = string
    def __hash__(self):
        return hash(self.string)
    def __eq__(self, other):
        return SequenceMatcher(a=self.string, b=self.string).ratio() > 0.5

Edit: perhaps a more elegant way is:

class myString(str):
    def __eq__(self, other):
        return SequenceMatcher(a=self, b=other).ratio() > 0.5

By doing this, the following is made possible:

>>> Foo = myString("Foo")
>>> Fo = myString("Fo")
>>> Bar = myString("Bar")
>>> Bak = myString("Bak")
>>> l1 = [1, Foo, Bar, 3]
>>> l2 = [1, Fo, Bak, 2]
>>> SequenceMatcher(a=l1, b=l2).ratio()
0.75

So, evidently it's working, but I have a bad feeling about overriding the hash function. When is the hash used? Where can it come back and bite me?

SequenceMatcher's documentation states the following:

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

And by definition hashable elements are required to fulfill the following requirement:

Hashable objects which compare equal must have the same hash value.

In addition, do I need to override cmp as well?

I'd love to hear about other solutions that come to mind.

Thanks.

1

There are 1 best solutions below

5
On

Your solution isn't bad - you could also look at re-working the SequenceMatcher to recursively apply when elements of a sequence are themselves iterables, with some custom logic. That would be sort of a pain. If you only want this subset of SequenceMatcher's functionality, writing a custom diff tool might not be a bad idea either.

Overriding __hash__ to make "Foo" and "Fo" equal will cause collisions in dictionaries (hash tables) and such. If you're literally only interested in the first 2 characters and are set on using SequenceMatcher, returning cls.super(self[2:]) might be the way to go.

All that said, your best bet is probably a one-off diff tool. I can sketch out the basics of something like that if you're interested. You just need to know what the constraints are in the circumstances (does the subsequence always start on the first element, that kind of thing).