Levenshtein Distance using Azure Search


I am working on a project that requires indexing documents in an Azure Search service. The index is later searched against other documents uploaded by users, to find matches or similarities between the uploaded document and the documents that are already indexed. We have a requirement that the matching be based on the Levenshtein algorithm.

Azure Search supports "Fuzzy Search", which uses a similar approach, but the results and scores it returns cannot be interpreted as a Levenshtein distance.

I tried using an Azure Cognitive Search "skillset" to see whether I could direct Azure to provide Levenshtein-distance-based scores, but I didn't find any way of doing that.

For example, for the following text:

source text: "Company have its head quarter in Vienna City"

It returns a matching result, but the score cannot be interpreted as a Levenshtein distance.

result:

{
      "@search.score": 4.399799,
      "id": "8eddb05d-8359-4a99-a629-e098d93ae296",
      "content": "Deloite have its head quarter in Vienna City."
}

However, I expect a score like the following:

Levenshtein score: 12

Is there any way to get the expected scores?

Rishabh Meshram

Azure Cognitive Search does not provide a built-in way to return search results with a Levenshtein distance score. A custom skillset won't help either: skillsets are designed for processing and transforming data during indexing, not for scoring search queries.

However, you can work around this by applying your own scoring function to the results on the client side, after the search has run.

Once you have the query results, compare the input text with each result's content to calculate the distance. In Python, you can use the python-Levenshtein package for this.

# Requires the python-Levenshtein package (pip install python-Levenshtein)
import Levenshtein

def calculate_levenshtein_distance(source_text, result_text):
    return Levenshtein.distance(source_text, result_text)

# Example search result
search_result = {
    "@search.score": 4.399799,
    "id": "8eddb05d-8359-4a99-a629-e098d93ae296",
    "content": "Deloitte have its head quarter in Vienna City."
}

source_text = "Company have its head quarter in Vienna City"

# Calculate Levenshtein distance
levenshtein_distance = calculate_levenshtein_distance(source_text, search_result["content"])

print(f"Levenshtein distance: {levenshtein_distance}")
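
If you would rather avoid the extra dependency, or want to score several hits at once, the same post-processing idea can be sketched in plain Python. This is only an illustration: the function names `levenshtein`, `similarity` and `rerank_by_distance` are made up for this example, the second hit in `hits` is hypothetical, and the list of dicts stands in for whatever your search call actually returns.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance scaled to 0..1 by the longer string; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rerank_by_distance(source_text, hits, field="content"):
    """Attach a Levenshtein distance to each hit and sort by it, smallest first."""
    scored = [
        {**hit, "levenshtein_distance": levenshtein(source_text, hit[field])}
        for hit in hits
    ]
    return sorted(scored, key=lambda hit: hit["levenshtein_distance"])


source_text = "Company have its head quarter in Vienna City"

# Stand-in for the documents returned by the search query.
hits = [
    {"id": "8eddb05d-8359-4a99-a629-e098d93ae296",
     "content": "Deloitte have its head quarter in Vienna City."},
    # Hypothetical second hit, added only to show the ordering.
    {"id": "hypothetical-id",
     "content": "Company have its head quarter in Vienna City"},
]

for hit in rerank_by_distance(source_text, hits):
    print(hit["levenshtein_distance"], hit["content"])
```

Keep in mind that the distance is computed over the whole `content` string, so for long documents you may want to compare sentence by sentence, or only the top few hits, to keep the cost manageable.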