The problem consists of a collection of N arrays of positive real numbers, with K numbers in total across the collection. The arrays can be of variable length. All K numbers must be divided over groups; each number is drawn exactly once, but the same value can occur more than once in the collection (so we are looking for multisets rather than sets). The requirement is that all numbers in a multiset lie within a range "width", where the width of a multiset is the distance between its smallest and largest number.
It can be assumed that no such multiset exists within a single array: the numbers within each array are always spaced more than "width" apart. The arrays can be assumed to be sorted ascending. Each multiset contains between 1 and N numbers and may not contain more than one number from the same array.
Some example data with N=4 and K=28 (the real problem is closer to N=1000 and K=10e6):
input = [[484.6, 512.2, 623.4, 784.3, 785.2, 786.1, 812.9],
[452.2, 484.5, 512.3, 623.5, 782.6, 784.1, 786.3, 884.2],
[452.1, 512.1, 623.4, 784.2, 785.2, 812.9],
[484.8, 512.2, 623.3, 684.3, 785.3, 786.2, 812.9]]
width = 0.3
solution = [[452.2, 452.1],
[484.6, 484.5, 484.8],
[512.2, 512.3, 512.1, 512.2],
[623.4, 623.5, 623.4, 623.3],
[684.3],
[782.6],
[784.3, 784.1, 784.2],
[785.2, 785.2, 785.3],
[786.1, 786.3, 786.2],
[812.9, 812.9, 812.9],
[884.2]]
Finding one solution is enough; there is no need to find all possible solutions. It would also be desirable if the algorithm could easily return, instead of the numbers themselves, the coordinates of the numbers as they appear in the original 2D input; a variant along those lines is sketched after the code below.
I have the following algorithm implemented in Python:
Initialise the list of sets from the first vector. For each remaining vector, and for each of its values: check whether the value is within "width" of the mean of any existing set. If it is, add the value to the first set that meets the width requirement and update that set's mean. If not, start a new set with that value. The sets to be found are called features here.
def find_features(data: list[list[float]], width: float = 0.3) -> list[list[float]]:
    if not data:
        return []
    # initialize the features list from the first vector
    features = [[x] for x in data[0]]
    features_mean = list(data[0])
    # loop through the remaining vectors starting from the second
    for vector in data[1:]:
        for number in vector:
            added = False
            for i, mean in enumerate(features_mean):
                # if a value is within tolerance width of a feature's mean,
                # add it to that feature and update the mean of that feature
                if abs(mean - number) < width:
                    features[i].append(number)
                    features_mean[i] = sum(features[i]) / len(features[i])
                    added = True
                    break
            # if not added to any existing feature, add it as a new one
            if not added:
                features.append([number])
                features_mean.append(number)
    return features
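For the coordinate requirement mentioned above, a minimal sketch of the same greedy loop that stores (row, column) pairs instead of values could look like the following (find_feature_indices is a made-up name; the means are recomputed from the original data):
import bisect  # not needed here; see the bisect sketch further below

def find_feature_indices(data: list[list[float]], width: float = 0.3) -> list[list[tuple[int, int]]]:
    if not data:
        return []
    # each feature stores (row, column) coordinates into data instead of the values
    features = [[(0, j)] for j in range(len(data[0]))]
    features_mean = list(data[0])
    for i, vector in enumerate(data[1:], start=1):
        for j, number in enumerate(vector):
            for k, mean in enumerate(features_mean):
                if abs(mean - number) < width:
                    features[k].append((i, j))
                    # recompute the mean from the coordinates stored so far
                    values = [data[r][c] for r, c in features[k]]
                    features_mean[k] = sum(values) / len(values)
                    break
            else:
                # no existing feature was close enough: start a new one
                features.append([(i, j)])
                features_mean.append(number)
    return features
The values themselves can always be recovered afterwards with data[r][c] for each stored coordinate.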
One optimisation I tried was to sort the features list initially and then use bisect to quickly find the closest mean instead of iterating through features_mean. The problem is that the list does not necessarily stay sorted as new values are inserted and the means are updated (it does in this toy example, but may not on big datasets), so it would have to be re-sorted after every mean update, which defeats the speed gain from the binary search.
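For reference, the lookup step of that attempt looks roughly like the following (closest_mean_index is a made-up helper; it assumes means is sorted ascending, which is exactly the invariant that breaks once the means drift):
import bisect

def closest_mean_index(means: list[float], number: float) -> int | None:
    # means must be sorted ascending; return the index of the nearest mean, if any
    pos = bisect.bisect_left(means, number)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(means)]
    if not candidates:
        return None
    return min(candidates, key=lambda i: abs(means[i] - number))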
Sort each of the input lists ascending.
Put the first element of each list in a min heap.
Initialize a working set, initially empty.
Repeatedly do the following: pop the smallest value from the heap; if the working set is non-empty and the popped value is more than width above the working set's smallest value, emit the working set as a feature and start a new, empty one; add the popped value to the working set; then push the next value from the list the popped value came from, if that list is not yet exhausted. When the heap is empty, emit the final working set.
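A minimal sketch of that merge, under the flush-on-overflow reading of the repeat step above (find_features_heap is a placeholder name; note it groups by "largest minus smallest within width" rather than by distance to a running mean):
import heapq

def find_features_heap(data: list[list[float]], width: float = 0.3) -> list[list[float]]:
    # seed the heap with (value, list index, position) for the first element of every list
    heap = [(lst[0], i, 0) for i, lst in enumerate(data) if lst]
    heapq.heapify(heap)
    features: list[list[float]] = []
    working: list[float] = []
    while heap:
        value, i, j = heapq.heappop(heap)
        # values come out in ascending order, so working[0] is the working set's minimum;
        # if the new value no longer fits within width of it, emit the set and start over
        if working and value - working[0] > width:
            features.append(working)
            working = []
        working.append(value)
        # push the next element from the same list, if any
        if j + 1 < len(data[i]):
            heapq.heappush(heap, (data[i][j + 1], i, j + 1))
    if working:
        features.append(working)
    return features
Since each heap entry already carries the (list index, position) pair, returning coordinates instead of values only means appending (i, j) rather than value; the whole pass is O(K log N).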
E.g.