Find element similarity within a collection of strings without evaluating all element pairs

116 Views Asked by At

So the problem collection is something like:

A = {'abc', 'abc', 'abd', 'bcde', 'acbdg', ...}

Using some type of string metric like Levenshtein distance, it's simple enough to find some sort of heuristic of string similarity between 2 strings.

However, I would like to determine, without evaluating all pairs of strings in the collection (an O(N^2) problem), some type of heuristic based on the entire collection that gives me a good idea of the overall similarity between all the strings.

The brute force approach is:

                          Sum(Metric(All Pairs in A))
CollectionSimilarity(A) = ---------------------------
                                 N*(N+1)/2

Is there a way to evaluate the similarity of the entire collection of A without evaluating every pair?

2

There are 2 best solutions below

2
ElKamina On

You can always use some approximation (eg. sampling pairs). Depending on how large N is, this value should converge with NlogN samples.

0
pkuderov On

Since every string is a vector in some metric space (where every char is particular coordinate), my solution is to find the distance between the set A and some point P.

Let's look at one metric's property - the triangle inequality:

Distance(x, y) <= Distance(x, *P*) + Distance(y, *P*)

So we can find an upper bound of Sum(Distance(All pairs in A)) as |A| * Sum(Distance(All elements in A to point P):

  Sum(Distance(x, y))      N * Sum(x, *P*)     Sum(x, *P*)
---------------------- <= ----------------- = ------------
     N*(N+1)/2               N*(N+1)/2          (N+1)/2

This point P can be random point or the center of mass (in this case you find an average radius of set) of set A or empty string (zero point) or anything else. Generally speaking P may be any hyperplane. Anyway you'll find some kind of average radius (or diameter) of your set.
Maybe some linear pre-transformation [of set or coordinate system, which is the same] is good. Or iterate multiple times and on every iteration find the distance to new random hyperplane.

Hope this may help!