How to sort a list while being resistant to spelling mistakes?

74 Views Asked by At

I m looking for a sorting algorithm tolerant to errors.

Suppose I have a list A and a list B with some spelling errors. Here zanana instead of banana:

A = ["raspberry","apple", "banana", "peach"] 
B = ["raspberry","apple", "zanana", "peach"] 

I would like a Sort function, which returns the same new list C using A or B as input.

Sort(A) = C
Sort(B) = C 

For instance:

C = [3,2,1,0]

If I use a classical lexicographic order, I would not have the same result because zanana and banana would have different index.

I'm trying to think whether I can use a hamming distance or something like that, but without success.

Note: This is a toy example. In real example, instead of fruits, I have long string of differents sizes. Instead of banana and zanana, i have :

 - 4df7e8ea5e6ee731ec4877a99fffc8da
 - 5df7e8ea5e6ee731ec4877a99fffc8da
2

There are 2 best solutions below

0
dridk On

One solution proposed above works pretty much .

def encode(x):
   return sum([b.bit_count() for b in x.encode()])


a = ['peach', 'banana', 'apple']
b = ['peach', 'zanana', 'apple']

c_a = sorted(a, key = encode)
c_b = sorted(n, key = encode)

print(c_a) # ['peach', 'apple', 'banana']
print(c_b) # ['peach', 'apple', 'zanana']
0
dridk On

I found a solution which works as expected :

  • vectorize each word of the list into n-gram space

  • Project each word to one space using Sparse PCA

  • Sort word according eigen value of this space

    a  = ["banana", "apple","kiwi", "orange", "table", "chair", "charly"]
    b  = ["banana", "apple","kiwi", "Zrange", "table", "Zhair", "charly"]
    
    
    def sort_ngram(array):
        # Vectorize
        model = CountVectorizer(ngram_range = (2,2), analyzer="char")
        matrix = model.fit_transform(array).toarray()
    
        df = pd.DataFrame(matrix)
        df.apply(lambda x: sorted(x))
    
        # Project to one space 
        model = SparsePCA(n_components=1)
        df = pd.DataFrame(model.fit_transform(df))
        return df.sort_values(0).index.to_list()
    
    print(sort_ngram(a))
    # ['apple', 'table', 'chair', 'kiwi', 'banana', 'charly', 'orange']
    
    print(sort_ngram(b))
    # ['apple', 'table', 'Zhair', 'kiwi', 'banana', 'charly', 'Zrange']