Split a list of numbers by sum and count

107 Views Asked by At

I would like to split a list of numbers and create two subsets based on two conditions:

  1. The sum of the numbers of the two subsets should have a particular ratio (e.g. 45%-55%)
  2. The count of the numbers of the two subsets, should be nearly the same (nearly means to diverge at most 5%)

What I have done by now is merely a probabilistic rather than a dev approach which could be used for small lists after adding a while loop that will keep the count condition. However I would like to have a more robust approach, without running the code 100 iterations in order to find (probably find) the best split.

import numpy as np
from scipy.stats import norm

def split_into_normal_distributions(numbers, target_percentage):
    # Calculate the means and standard deviation of the two normal distributions
    mean_1 = np.mean(numbers) * target_percentage
    mean_2 = np.mean(numbers) *(1-target_percentage)
    std = np.std(numbers)

    # Initialize subsets
    subset_1 = []
    subset_2 = []

    for num in numbers:
        # Calculate probability densities for each number in both distributions
        pdf_1 = norm.pdf(num, loc=mean_1, scale=std)
        pdf_2 = norm.pdf(num, loc=mean_2, scale=std)

        # Calculate the ratio of probabilities for assignment
        ratio = pdf_1 / (pdf_1 + pdf_2)
        pdf_sum = pdf_1 + pdf_2

        # Assign numbers to subsets based on the calculated ratio
        if np.random.rand() < ratio:
            subset_1.append(num)
        else:
            subset_2.append(num)

    return subset_1, subset_2

# Sample list of numbers
numbers = [10, 20, 30, 40, 50, 60, 70, 80,10,20,25,20,21,26,65,95,84,65,2,3,6,198,16,651,984,651,35,61,651,16,56,651,651,651,2,32,615,651,984,615,351,651,651,3,5]

# Split into two normal distributions with specified means and standard deviation
subset_1, subset_2 = split_into_normal_distributions(numbers, 0.4)

print("Subset 1 (40% mean):", subset_1, sum(subset_1)/sum(numbers), len(subset_1))
print("Subset 2 (60% mean):", subset_2, sum(subset_2)/sum(numbers), len(subset_2))
len(numbers)

Thank you

2

There are 2 best solutions below

0
On

Here's what I came up with. This is a simple greedy approach to produce a split with even totals. Not quite what you asked for but hopefully it helps.

import bisect


class Cluster:
    def __init__(self, nums):
        self.nums = nums
        self.total = sum(nums)


def main():
    nums = [10, 20, 30, 40, 50, 60, 70, 80, 10, 20, 25, 20, 21, 26, 65, 95, 84, 65, 2, 3, 6, 198, 16, 651, 984, 651, 35, 61, 651, 16, 56, 651, 651, 651, 2, 32, 615, 651, 984, 615, 351, 651, 651, 3, 5]
    clusters = [Cluster([n]) for n in nums]
    while len(clusters) > 1:
        pairs = list(zip(clusters, clusters[1:]))
        best_pair_index = min(
            range(len(pairs)), key=lambda i: abs(pairs[i][0].total - pairs[i][1].total)
        )
        best_pair = pairs[best_pair_index]
        combined = Cluster(best_pair[0].nums + [-n for n in best_pair[1].nums])
        del clusters[best_pair_index : best_pair_index + 2]
        bisect.insort(clusters, combined, key=lambda c: c.total)
    [cluster] = clusters
    left = [n for n in cluster.nums if n > 0]
    right = [-n for n in cluster.nums if n < 0]
    print(sum(left))
    print(sum(right))
    print(len(left))
    print(len(right))


main()
0
On

I will post my try here for future refence. I think that Alex Hall's answer is suberb, but a bit complicated. This function can pretty well split a list with the criteria I want and tackle lists as in @RomanPerekhrest's comment.

def find_optimal_batch(numbers, ratio):

# Calculate the sum of the original list
total_sum = sum(numbers)

# Initialize variables to track the optimal batch and ratio
optimal_batch = []
optimal_difference = float('inf')  # Initialize to positive infinity

# Precompute the range of batch sizes
num_numbers = len(numbers)
batch_sizes = range(1, num_numbers + 1)

# Iterate through the batches
for batch_size in batch_sizes:

    # Iterate and calculate sums for each batch
    for i in range(num_numbers - batch_size + 1):

        batch_sum = sum(numbers[i:i+batch_size])
        batch_ratio = batch_sum / total_sum
        difference = abs(batch_ratio - ratio)

        # Check if the current batch has a closer ratio to the desired ratio
        if difference < optimal_difference:

            batch_1 = numbers[i:i+batch_size]
            batch_2 = numbers[0:i]+numbers[i+batch_size:]
            optimal_difference = difference
return batch_1, batch_2