How to sort a list while being resistant to spelling mistakes?

74 Views Asked by dridk At 12 July 2023 at 16:59

I m looking for a sorting algorithm tolerant to errors.

Suppose I have a list A and a list B with some spelling errors. Here zanana instead of banana:

A = ["raspberry","apple", "banana", "peach"] 
B = ["raspberry","apple", "zanana", "peach"]

I would like a Sort function, which returns the same new list C using A or B as input.

Sort(A) = C
Sort(B) = C

For instance:

C = [3,2,1,0]

If I use a classical lexicographic order, I would not have the same result because zanana and banana would have different index.

I'm trying to think whether I can use a hamming distance or something like that, but without success.

Note: This is a toy example. In real example, instead of fruits, I have long string of differents sizes. Instead of banana and zanana, i have :

 - 4df7e8ea5e6ee731ec4877a99fffc8da
 - 5df7e8ea5e6ee731ec4877a99fffc8da

Original Q&A

There are 2 best solutions below

dridk On 12 July 2023 at 22:35

One solution proposed above works pretty much .

def encode(x):
   return sum([b.bit_count() for b in x.encode()])


a = ['peach', 'banana', 'apple']
b = ['peach', 'zanana', 'apple']

c_a = sorted(a, key = encode)
c_b = sorted(n, key = encode)

print(c_a) # ['peach', 'apple', 'banana']
print(c_b) # ['peach', 'apple', 'zanana']

dridk On 14 July 2023 at 13:09

I found a solution which works as expected :

vectorize each word of the list into n-gram space
Project each word to one space using Sparse PCA

Sort word according eigen value of this space

a  = ["banana", "apple","kiwi", "orange", "table", "chair", "charly"]
b  = ["banana", "apple","kiwi", "Zrange", "table", "Zhair", "charly"]


def sort_ngram(array):
    # Vectorize
    model = CountVectorizer(ngram_range = (2,2), analyzer="char")
    matrix = model.fit_transform(array).toarray()

    df = pd.DataFrame(matrix)
    df.apply(lambda x: sorted(x))

    # Project to one space 
    model = SparsePCA(n_components=1)
    df = pd.DataFrame(model.fit_transform(df))
    return df.sort_values(0).index.to_list()

print(sort_ngram(a))
# ['apple', 'table', 'chair', 'kiwi', 'banana', 'charly', 'orange']

print(sort_ngram(b))
# ['apple', 'table', 'Zhair', 'kiwi', 'banana', 'charly', 'Zrange']

How to sort a list while being resistant to spelling mistakes?

There are 2 best solutions below

Related Questions in ALGORITHM

Related Questions in SORTING

Related Questions in TEXT

Related Questions in HAMMING-DISTANCE

Trending Questions

Popular # Hahtags

Popular Questions