I am using the Damerau-Levenshtein code available from here in my similarity measurements. The problem is that when I apply Damerau-Levenshtein to two strings such as "cat sat on a mat" and "dog sat mat", I get an edit distance of 8. Depending on the number of insertions, deletions and substitutions, the result can be any non-negative integer: 0, 1, 2, and so on. Is there a way to find or assume a maximum for this distance so that it can be converted to a value between 0 and 1, or to set a max value so that I can at least say: distance = 1 - similarity?
The reason for this post is that I am setting a threshold for several distance metrics, such as cosine, Levenshtein and Damerau-Levenshtein, and the outputs of all of them should be between 0 and 1.
How to choose the proper maximum value for Damerau-Levenshtein distance?
Asked by Bilgin
The difficulty is that Damerau-Levenshtein distance has no fixed upper bound: it grows with the length of the strings, so arbitrarily long strings can be arbitrarily far apart. In practice, of course, we can't make infinite strings.
To be completely safe, you could map the range from 0 to the maximum possible string length onto the range 0 to 1. But the maximum string length depends on how much memory you have (assuming a 64-bit system), so I'd recommend not doing this.
In practice, you can check all of the strings you are about to compare and use the length of the longest string in that list as the max value. Another option is to compute all of the scores beforehand and apply the conversion factor once you know the max score. Some code that could do that:
These happen to give identical answers because most of the words are very different from each other, but either approach should work in most cases.
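A minimal sketch of both approaches (not the answerer's original code), assuming a plain-Python implementation of the optimal string alignment variant of Damerau-Levenshtein with unit costs. The `words` list is hypothetical sample data, and the `max_value` and `top` parameters of `adjustscore` are illustrative names:

```python
from itertools import combinations


def damerau_levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, substitute and
    adjacent transposition (the optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def adjustscore(x, max_value, top=1):
    # Map a raw distance x from [0, max_value] onto [0, top].
    return (x / max_value) * top


words = ["cat", "dogs", "mat"]  # hypothetical sample data
scores = [damerau_levenshtein(a, b) for a, b in combinations(words, 2)]

# Approach 1: scale by the length of the longest string in the list.
longest = max(len(w) for w in words)
by_length = [adjustscore(s, longest) for s in scores]

# Approach 2: compute every score first, then scale by the largest score.
max_score = max(scores)
by_max_score = [adjustscore(s, max_score) for s in scores]

print(scores)        # [4, 1, 4]
print(by_length)     # [1.0, 0.25, 1.0]
print(by_max_score)  # [1.0, 0.25, 1.0]
```

Note that approach 2 would divide by zero if every pair of strings were identical (max score 0), so real code would probably want a guard for that case.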
Long story short, the max distance between two strings is the length of the longer string.
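Stated as a formula, assuming unit costs for insertion, deletion, substitution and adjacent transposition, and writing $|a|$ for the length of string $a$, the bound gives a per-pair normalization (a sketch of one common choice, not the only one):

```latex
0 \le d(a, b) \le \max(|a|, |b|)
\quad\Longrightarrow\quad
\mathrm{sim}(a, b) = 1 - \frac{d(a, b)}{\max(|a|, |b|)} \in [0, 1]
```

For the strings in the question, the lengths are 16 and 11 and the reported distance is 8, so the normalized distance is 8/16 = 0.5 and the similarity is 1 - 0.5 = 0.5. When both strings are empty the denominator is 0, so that case needs a convention of its own, for example a similarity of 1.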
Notes: if this maps in the wrong direction (i.e. high scores show up as low and vice versa), just add "1-" between the open bracket and x in adjustscore.
Also, if you want it to map to a different range, replace the 1 with a different max value.