Finding multisets of numbers within a defined range, efficiently


The problem consists of a collection of N arrays of positive real numbers; the total count of numbers in the collection is K. The arrays can be of variable length. All K numbers must be partitioned into subsets: each number is drawn exactly once, but the same value can occur more than once in the collection (so we are looking for multisets, not sets). The requirement is that all numbers in a multiset lie within a range "width", where the width of a multiset is the distance between its smallest and largest number.

It can be assumed that no such multiset exists within any individual array: the numbers within each array are always spaced further apart than "width". The arrays can be assumed to be ascending. Each multiset contains between 1 and N numbers and may not contain more than one number from the same array.

Some example data with N=4 and K=28: (the real problem is closer to N=1000 and K=10e6)

input = [[484.6, 512.2, 623.4, 784.3, 785.2, 786.1, 812.9],
        [452.2, 484.5, 512.3, 623.5, 782.6, 784.1, 786.3, 884.2],
        [452.1, 512.1, 623.4, 784.2, 785.2, 812.9],
        [484.8, 512.2, 623.3, 684.3, 785.3, 786.2, 812.9]]
width = 0.3
solution = [[452.2, 452.1],
     [484.6, 484.5, 484.8],
     [512.2, 512.3, 512.1, 512.2],
     [623.4, 623.5, 623.4, 623.3],
     [684.3],
     [782.6],
     [784.3, 784.1, 784.2],
     [785.2, 785.2, 785.3],
     [786.1, 786.3, 786.2],
     [812.9, 812.9, 812.9],
     [884.2]]

Finding one solution is enough; there is no need to find all possible solutions. It would also be desirable for the algorithm to easily support returning, as a solution, the coordinates of the numbers as they appear in the original 2D input rather than the numbers themselves.
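The constraints above translate directly into a validity check. A minimal sketch (the function name `check_solution` and the small float tolerance `eps` are my own additions) that verifies the width, size, and exactly-once requirements for a candidate solution:

```python
def check_solution(data: list[list[float]],
                   solution: list[list[float]],
                   width: float = 0.3,
                   eps: float = 1e-9) -> bool:
    # width constraint: max - min of each multiset must not exceed width
    # (eps absorbs floating-point rounding, e.g. 484.8 - 484.5 evaluates
    # slightly above 0.3 in binary floating point)
    if any(max(ms) - min(ms) > width + eps for ms in solution):
        return False
    # a multiset draws at most one number per array, so at most N numbers
    if any(len(ms) > len(data) for ms in solution):
        return False
    # every number in the collection is used exactly once
    flat = sorted(x for row in data for x in row)
    used = sorted(x for ms in solution for x in ms)
    return flat == used
```

This compares values only; a solution expressed as coordinates into the 2D input would additionally allow checking the one-number-per-array rule exactly even when values repeat.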

I have the following algorithm implemented in Python:

Initialise the list of sets from the first vector. For each remaining vector, for each value: check whether the value is within range "width" of the mean of any existing set. If yes, add the value to the first encountered set that meets the width requirement and update that set's mean. If not, start a new set containing just that value. The sets to be found are called features here.

def find_features(data: list[list[float]], width: float = 0.3) -> list[list[float]]:
    
    if not data:
        return []
    
    # initialize features list as the first vector
    features = [[x] for x in data[0]]
    features_mean = [x for x in data[0]]
    
    # loop through the remaining vectors starting from the second
    for vector in data[1:]:
        for number in vector:
            added = False
            for i, feature in enumerate(features_mean):
                # if the value is within tolerance width of a feature's mean,
                # add it to that feature and update the mean
                if abs(feature - number) < width:
                    features[i].append(number)
                    features_mean[i] = sum(features[i]) / len(features[i])
                    added = True
                    break
            # if not added to any existing feature, add it as a new one
            if not added:
                features.append([number])
                features_mean.append(number)
    
    return features

One optimisation I tried was initially sorting the features list and then, instead of iterating through the features_mean list, using bisect to quickly find the closest mean. The problem is that the list doesn't necessarily stay sorted as you insert new values and update the means (it does in this toy example, but may not on big datasets), so you would have to re-sort after every update, which defeats the speed gain from the binary search.
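One way the bisect idea could be patched is to re-insert only the feature whose mean drifted, rather than re-sorting the whole list. A sketch under that assumption (the function name is my own; note that `list.insert` still makes each insertion O(n), so this saves the linear scan over means, not the element shifting):

```python
from bisect import bisect_left


def find_features_bisect(data: list[list[float]],
                         width: float = 0.3) -> list[list[float]]:
    means: list[float] = []          # feature means, kept sorted
    features: list[list[float]] = [] # features[i] holds members of means[i]
    for vector in data:
        for number in vector:
            i = bisect_left(means, number)
            # since means is sorted, only the neighbours at i-1 and i
            # can possibly be within width of this number
            best = None
            for j in (i - 1, i):
                if 0 <= j < len(means) and abs(means[j] - number) < width:
                    if best is None or abs(means[j] - number) < abs(means[best] - number):
                        best = j
            if best is None:
                # no feature is close enough: start a new one, in order
                means.insert(i, number)
                features.insert(i, [number])
            else:
                # add to the closest feature, then re-insert just this
                # one (mean, members) pair at its new sorted position
                members = features.pop(best)
                means.pop(best)
                members.append(number)
                new_mean = sum(members) / len(members)
                k = bisect_left(means, new_mean)
                means.insert(k, new_mean)
                features.insert(k, members)
    return features
```

A sorted-container structure with O(log n) insertion (e.g. a balanced tree) would be needed to make the update itself sublinear as well.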


1 Answer

Answer by Dave:

Sort each of the input lists ascending.

Put the first element of each list in a min heap.

Initialize a working set, initially empty.

Repeatedly do the following:

  • Pop the smallest element from the min heap, and push the next element from the matching list, if any.
  • Add the popped element to the working set if it fits (an O(1) check, since we only track and compare against the set's minimum, which is the first element added to the working set).
  • If the element is too big to fit in the working set, set the working set aside, then start a new working set initialized with this element.

E.g.

input = [[512.2, 623.4, 484.6, 784.3, 785.2, 786.1, 812.9],
    [512.3, 623.5, 484.5, 784.1, 884.2, 786.3, 452.2, 782.6],
    [512.1, 623.4, 452.1, 784.2, 785.2, 812.9],
    [512.2, 623.3, 484.8, 684.3, 785.3, 786.2, 812.9]]

step 1: sort input arrays
[484.6, 512.2, 623.4, 784.3, 785.2, 786.1, 812.9]
[452.2, 484.5, 512.3, 623.5, 782.6, 784.1, 786.3, 884.2]
[452.1, 512.1, 623.4, 784.2, 785.2, 812.9]
[484.8, 512.2, 623.3, 684.3, 785.3, 786.2, 812.9]

Repeated Step: 'ws' == working set
ws: {452.1, 452.2}. Then we set this aside and start a new working set, because the next element in the min heap is 484.5, which is more than width greater than 452.1.
ws2: {484.5, 484.6, 484.8}. Then we set this aside for the same reason, and so on...
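The steps above can be sketched with Python's heapq (the function name and the (value, list index, element index) heap entries are my own; the small `eps` tolerance absorbs floating-point rounding, e.g. 484.8 - 484.5 evaluating slightly above 0.3):

```python
import heapq


def find_features_heap(data: list[list[float]],
                       width: float = 0.3,
                       eps: float = 1e-9) -> list[list[float]]:
    data = [sorted(row) for row in data]            # step 1: sort each list
    # seed the heap with the first element of every non-empty list;
    # entries are (value, list index, element index)
    heap = [(row[0], i, 0) for i, row in enumerate(data) if row]
    heapq.heapify(heap)
    features: list[list[float]] = []
    ws: list[float] = []    # current working set; ws[0] is its minimum
    while heap:
        value, i, j = heapq.heappop(heap)
        if j + 1 < len(data[i]):                    # refill from the same list
            heapq.heappush(heap, (data[i][j + 1], i, j + 1))
        if ws and value - ws[0] > width + eps:      # too big: close the set
            features.append(ws)
            ws = [value]
        else:                                       # fits: extend the set
            ws.append(value)
    if ws:
        features.append(ws)
    return features
```

Carrying the list and element indices in the heap entries also makes it straightforward to report coordinates instead of values, as the question asks. And since numbers within a single array are spaced more than width apart, a working set can never contain two numbers from the same list, so the one-number-per-array constraint is satisfied automatically.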