Percentage of how similar strings are in Python?

6.8k Views Asked by At

I don't know how to do a program that gives a percentage of how similar two strings of the same length are.

For example, for abcd and abce it should give 75%.

The order matters, I don't want that it gives me that abcd and dcab have a 100%.

I know that Levenshtein module does that, but I want a program that does it.

5

There are 5 best solutions below

0
On

Assuming that s1 and s2 have the same length:

from numpy import mean
mean([s1[i]==s2[i] for i in xrange(len(s1))])
0
On

How about this:

>>> a = list('abce')
>>> b = list('abcd')
>>> ( 100 - (sum(i != j for i, j in zip(a, b)) / float(len(a))) * 100 )
75.0
>>> a = list('abce')
>>> b = list('bdce')
>>> ( 100 - (sum(i != j for i, j in zip(a, b)) / float(len(a))) * 100 )
50.0
>>>
0
On

Wikipedia has pseudocode for Levenshtein_distance or Edit Distance It is a simple algorithm. Why don't you give it a try and ask specific question if you are stuck.

4
On
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, 'abcd', 'abce').ratio()
0.75

Read the docs for more. You can read the description in the docs to figure out how to do it yourself, but you're going to end up coding some kind of alignment algorithm from scratch.

1
On

The word similar has varied context, but looking at your examples, I am definite, you are looking for the

Match% = 2* Longest_Common_Substring(a, b) / (len(a) + len(b)) * 100

Just google for Longest Common Substring and you are sure to find loads of Python Implementation.

One such Python Implementation from Wikibook : Algorithm Implementation/Strings/Longest common substring is as follows

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in xrange(1, 1 + len(s1)):
        for y in xrange(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

wrapping it over with a similarity function, the result conforms to your expectation

>>> def similarity(s1, s2):
     return 2. * len(longest_common_substring(s1, s2)) / (len(s1) + len(s2)) * 100

>>> similarity("abcd","abce")
75.0
>>> similarity("abcd","dcba")
25.0