Counting the Number of Combinations that Match a Certain Condition

361 Views Asked by At

Here is my problem:

  • There are 5 candies (1,2,3,4,5) - each candy has the same "value" associated with its number (i.e. candy_1 has value=1, candy_2 has value=2, etc.).

  • In the future, I want to be able to answer questions such as: Is it possible to pick "x" number of candies such that the values of the chosen candies sum to less than "y" and the values of the unchosen candies sum to less than "z"?

The first way to solve this problem that comes to mind is to use the "power set" (https://en.wikipedia.org/wiki/Power_set). That is, if there are "n" candies, the total number of combinations that can be made are 2^n combinations.

Step 1: For example, suppose I have 5 candies - here is a function that will identify the power set:

a <- c(1,2,3,4,5)

power_set <- function(x) {
    c(list(NULL), unlist(lapply(seq_along(x), function(i) combn(x, i, simplify = FALSE)), recursive = FALSE), list(x))
}
  
pset <- power_set(a)

Step 2: Then, for each element in the power set, I need to identify the "unchosen" candies:

find_missing <- function(x, full_set) {
  full_set[!full_set %in% x]
}

diffset <- lapply(pset, find_missing, full_set = a)

Step 3: Finally, I can use some data manipulation to bring everything into a data frame and summarize different properties of the selections (e.g. sums and counts):

pset_char <- sapply(pset, function(x) paste(x, collapse = ","))
diffset_char <- sapply(diffset, function(x) paste(x, collapse = ","))

df <- data.frame(pset = pset_char, diff = diffset_char)

df$sum_pset <- sapply(pset, sum)
df$sum_diffset <- sapply(diffset, sum)
df$count_pset <- sapply(pset, length)
df$count_diffset <- sapply(diffset, length)

The final result would look something like this:

  pset      diff sum_pset sum_diffset count_pset count_diffset
1      1,2,3,4,5        0          15          0             5
2    1   2,3,4,5        1          14          1             4
3    2   1,3,4,5        2          13          1             4
4    3   1,2,4,5        3          12          1             4
5    4   1,2,3,5        4          11          1             4
6    5   1,2,3,4        5          10          1             4

From here, I could then simply select rows that match a certain criteria (e.g. select from df where count_pset <= x & sum_pset <=y & sum_diffset <=z;)

My Question: The obvious problem with this approach is that the Power Set grows very large as the number of elements increase and this approach will become completely impractical.

Suppose I have 1000 candies (e.g. candy_1 has value = 1,... candy_999 has value = 999, candy_1000 has value = 100) and I am only interested in finding some limited number of combinations that match a certain criteria (i.e. not all of them).

Is there some other approach I can use to avoid generating the entire power set and then searching the results? For example, is there something involving recursion, backtracking, and dynamic programming that can be done to make this problem solvable? Perhaps Evolutionary Algorithms?

7

There are 7 best solutions below

4
On BEST ANSWER

From you description, it seems you would like to enumerate all possible subset combinations that satisfies some sum constraints. I think this a variant of the "subset sum" problem, and what you want is to list all the feasible subsets

Code

Below is a function hopefully might solve your problem, where a recursion helper function is devised to find the constrained subsets

f <- function(a, x, y, z) {
  helper <- function(vec, s, k) {
    v <- vec[vec <= s]
    if (k == 1) {
      return(as.list(v))
    }
    unlist(
      lapply(v, \(p) Map(c, p, helper(v[v > p], s - p, k - 1))),
      recursive = FALSE
    )
  }

  Filter(
    \(v) sum(a) - sum(v) <= z,
    unlist(
      lapply(
        0:x,
        \(k) helper(a, y, k)
      ),
      recursive = FALSE
    )
  )
}

Output Example

Given input as below

a <- 1:5
x <- 3
y <- 12
z <- 10

you will obtain the all feasible subsets

> f(a, x, y, z)
[[1]]
[1] 5

[[2]]
[1] 1 4

[[3]]
[1] 1 5

[[4]]
[1] 2 3

[[5]]
[1] 2 4

[[6]]
[1] 2 5

[[7]]
[1] 3 4

[[8]]
[1] 3 5

[[9]]
[1] 4 5

[[10]]
[1] 1 2 3

[[11]]
[1] 1 2 4

[[12]]
[1] 1 2 5

[[13]]
[1] 1 3 4

[[14]]
[1] 1 3 5

[[15]]
[1] 1 4 5

[[16]]
[1] 2 3 4

[[17]]
[1] 2 3 5

[[18]]
[1] 2 4 5

[[19]]
[1] 3 4 5

Bonus

Further, if you want the output presented in a data.frame, you can run the following steps in addition

lst <- f(a, x, y, z)
dfout <- do.call(
  rbind,
  lapply(
    lst,
    \(v)
    data.frame(
      pset = toString(v),
      diffset = toString(setdiff(a, v)),
      sum_pset = sum(v),
      sum_diffset = sum(a) - sum(v),
      cnt_pset = length(v),
      cnt_diffset = length(a) - length(v)
    )
  )
)

which gives

> dfout
      pset    diffset sum_pset sum_diffset cnt_pset cnt_diffset
1        5 1, 2, 3, 4        5          10        1           4
2     1, 4    2, 3, 5        5          10        2           3
3     1, 5    2, 3, 4        6           9        2           3
4     2, 3    1, 4, 5        5          10        2           3
5     2, 4    1, 3, 5        6           9        2           3
6     2, 5    1, 3, 4        7           8        2           3
7     3, 4    1, 2, 5        7           8        2           3
8     3, 5    1, 2, 4        8           7        2           3
9     4, 5    1, 2, 3        9           6        2           3
10 1, 2, 3       4, 5        6           9        3           2
11 1, 2, 4       3, 5        7           8        3           2
12 1, 2, 5       3, 4        8           7        3           2
13 1, 3, 4       2, 5        8           7        3           2
14 1, 3, 5       2, 4        9           6        3           2
15 1, 4, 5       2, 3       10           5        3           2
16 2, 3, 4       1, 5        9           6        3           2
17 2, 3, 5       1, 4       10           5        3           2
18 2, 4, 5       1, 3       11           4        3           2
19 3, 4, 5       1, 2       12           3        3           2
5
On

I realized that I have not actually answered the question asked (see @ThomasIsCoding's comment below). I've decided to leave my answer up, however, in case any readers are interested in linear programming problems where some or all the variables are restricted to integer values, here zero and one.




The example problem is a classic mixed integer linear program (MIP) with binary variables. Such problems can be solved with a MIP optimization package, provided the problem is not overly large or has a troublesome model structure, characteristics that may be difficult to judge in advance. Whereas linear programming problems (LP's) with thousands and variables and constraints may consistently be solved in very little time, some MIPs with far fewer variables and constraints could take years to solve.

A MIP formulation of the example problem is as follows.

Let xi equal 1 if candy i is picked and 0 if it is not. Suppose vi equals the value of candy i. There are three constraints:

Σixi = x

Σivixi <= y

Σivi(1-xi) <= z

where Σi indicates the operands are to be summed over all indices i. x, y and z are given. In addition, each xi must equal 0 or 1. The third constraint normalizes to

Σi-vixi <= z - Σivi

Such models have a linear objective to be minimized or maximized. Here that would be, say,

min Σi0*xi

since only a feasible solution is required. The MIP solver will return a feasible solution if one exists. For some similar problems it could prove benficial to have an objective function to be minimized or maximized. In those cases an optimal solution is returned so long as a feasible solution exists.

Binary variables, in general, provide a powerful way to enforce a wide range of logical requirements.

3
On

There are surely some better options, but absent deeper knowledge about these algorithms, one way would be to depend on existing algorithms like in adagio::knapsack. For n=20, your approach produced all possible solutions in about 20 seconds. This adagio::knapsack approach, on the other hand, identifies 1,000 random solutions (including duplicates) which hit or approach the target sum in about half a second.

If you just need to find some of the workable solutions, this approach scales much better, and can find a solution for n=1000 in a few seconds, even though the set of potential solutions is unworkably large. (For instance, for a = 1:1000, the total number of possible combinations is 2^1000 which is ~10^301, which is an unfathomably large number beyond calculating. Even if only a trillionth of a trillionth of a trillionth those fit constraints, the number is still way too big to even list them all.)

Here, I run the knapsack algorithm five times with random weights, aiming for the sum of the elements to be up to half the total. For n=1000, it takes about 40 seconds on my computer to come up with 5 solutions. It's possible, but improbable at this size, that there could be repeated solutions within those, so if you need to guarantee a specific number of unique solutions or specific sums, you might want some more logic to discard runs that aren't satisfactory.

a = 1:1000
library(adagio)
set.seed(42)
solutions <- list()
for(i in 1:5) {
  solutions[[i]] <- knapsack(a, runif(length(a)), floor(0.5*sum(a)))
}

sol1 <- solutions[[1]]$indices
sol1_unpicked <- dplyr::setdiff(a, sol1)
sum(sol1)
#[1] 250250
sum(sol1_unpicked)
#[1] 250250

Here's a tidyverse approach to take the list of solutions and put them into a data frame to de-dupe them and do any further analysis.

library(tidyverse)
enframe(solutions) |>
  unnest_wider(value) |>
  mutate(soln = paste0(indices, collapse = ","), .by = name) |>
  arrange(soln) |>
  distinct(soln, .keep_all = TRUE)
0
On

In all your examples, the values are 1, 2, ..., n. My O(1) answer assumes this is guaranteed.

f(n,x,y,z) = true if it is possible to pick "x" number of candies such that the values of the chosen candies sum to less than "y" and the values of the unchosen candies sum to less than "z"?

g(n) = 1+2+...+n = n(n+1)/2

The sum of x candies can be any value between g(x) and g(n) - g(n-x). If y-1 isn't between those values, return false. If g(n) - y >= z, return false. Otherwise return true.

0
On

I am just aware that you want some (instead of all) feasible combinations.


Possible Methodology

I guess you can use a randomization approach, e.g., Monte Carlo, which avoids generating the entire power set and keeps the randomly generated combinations only as long as they are valid, for example

f <- function(a, x, y, z, K) {
  res <- c()
  repeat {
    if (length(res) == K) {
      return(res)
    }
    pset <- sort(sample(a, sample.int(x, 1)))
    if (sum(pset) <= y && sum(a) - sum(pset) <= z && !list(pset) %in% res) {
      res <- c(res, list(pset))
    }
  }
}

Output Example

If you want to have K=8 distinct valid combinations, you can run

a <- 1:5
x <- 3
y <- 12
z <- 10
f(a, x, y, z, K = 8)

and you will obtain a list like

[[1]]
[1] 2 3

[[2]]
[1] 2 4 5

[[3]]
[1] 1 2 3

[[4]]
[1] 3 4 5

[[5]]
[1] 1 3 5

[[6]]
[1] 3 5

[[7]]
[1] 2 4

[[8]]
[1] 1 4 5
0
On

Assuming each value is unique, i.e. 2 only appears once

This is where you use recursion with save to build up the possibilities. Even and odd values are slightly different. Every time you step up to the next even or odd value, you add 2, which breaks down to 0+2 or 1+1.

Initialize

3: only (1, 2)

4: (2,2) & (1,3)

Now we have enough history to apply an algorithm. Even numbers will always have an (n,n) pair, and odd numbers will not.

5: We have to use 0+2 and 1+1 for the previous odd number's possibilities:
(1,2) --> (1,4) using 0+2
(1,2) --> (2,3) using 1+1

6: Same as above, but increase the (n,n) pair by 1+1, and 0+2 for all of them:
1+1 case: (2,2) --> (3,3)
0+2 cases: (2,2) --> (2,4) and (1,3) --> (1,5)

7: Repeat the odd number logic for 5:
(1,4) --> (1,6)
(1,4) --> (2,5)
(2,3) --> (2,5) duplicate, drop
(2,3) --> (3,4)

8: Repeat the even pattern
(3,3) --> (4,4)
(3,3) --> (3,5)
(2,4) --> (2,6)
(1,5) --> (1,7)

etc.

There's no way around the possibilities exploding, but if you save the combinations at each step, you never need to re-compute values.

For example, you have n = 19 and so far you've only calculated up to n = 15. The function will call itself for n = 17, which you also don't have, so it calls itself with n = 15, finds it, returns that values, calculates up to 17, returns that, calculates up to 19.

0
On

Since you state

finding some limited number of combinations that match a certain criteria

you can use the statement

In a collection of n positive numbers that sum up to S, at least one of them will be less than S divided by n (S/n)

The algorithm is the following:

Target price is K(eg 1000) and you need n (eg 10) total candies.

  1. Select a candy randomly where (price < 1000/10). Get its price, lets say X.
  2. Select a candy randomly where price < (1000 - X)/9. Get its price, assume Y.
  3. Select a candy randomly where price < (1000 - X - Y)/8.

Repeat till you find 9 candies. For the last candy, select one where price = remaining price.

The advantage is that

  • this is quick. Even if you do not find an exact match in the first run,
  • you can run it multiple times and with a lot of candies of different prices, you can get an exact result after some runs.

The disadvantages are that

  • I am not sure these collections produced are random(from all possible subsets maybe some are produced more frequently),
  • you may not find results(maybe a result does not even exist)