How to extend the iteration process into large size in Python 3

63 Views Asked by At

After tossed a coin n times (n=10), there are 2^10=1024 possible outcomes. I used

lst = [list(i) for i in itertools.product([0, 1], repeat=n)] 

to obtain n=10 all possible outcomes, and then I want to find out the number of group (m) which is defined as a maximum consecutive sequence for identical side of coin. For example, HHHTTHTTHT, the number of groups for this outcome is 6, which are (HHH)(TT)(H)(TT)(H)(T). I employed

group=[len([len(list(grp)) for k, grp in groupby(x)]) for x in lst]

to find out the number of groups (m) for each corresponding combination of 10 tossed coin. Finally, I obtain the number of possible combinations whose the number of groups is greater than 6 (m), such as,

Group6=len(list(filter(lambda x:x>m,group)))

However, when the number of tosses increases, for instance, (n=200,m=110) or (n=500, m=260). I still used the same code as the above in Python, but it time-consumed and I think it exceeded the memory size of python. Could someone please help me to figure out how to address this issue, when n and m are quite large? Thanks

1

There are 1 best solutions below

0
On

Iterating over all possible tosses is definitely not going to work for large n (and here n is already large from 20 I guess). One can only calculate this efficiently using some combinatorics.

Let's first start with a simple example: minimum two groups for four tosses: if we want to calculate the number of tosses that result in two our more groups. There are a few possibilities (we number the groups like 1, 2, etc.):

Two groups:

1112
1122
1222

Three groups:

1123
1223
1233

Four groups:

1234

We know that the value of group G1 will be different then the one for G2 and the one for G2 different from G3 and so on. So it holds that:

G1 != G2 != G3 != ... != Gn

Since there are only two values (heads and tails) this means that if we determine the first group, we have deterimed the value for all groups. If G1 is heads, all even groups are tails and all odd groups are heads.

So this means that for every combination above, there are exactly two configurations. This means that for the case n=4,m=1, there are 2×7=14 configurations (which is exactly what we get with get when we work it out with the program in your question).

Now the only problem we still have to face is how we are going to count these super-configurations. We can do this by introducing what I would call upgrade notation:

You notate an increase of the group with 1 and the same group with 0.

So now 1223 becomes 0101: we upgrade the group on the second and the fourth index. And 1234 is 0111. How does this help? Well in fact for k groups, we only have to count the number of combinations with k-1 upgrades. So this means the number of combinations (k-1 nCr n-1). With n the length of the string (or the total number of tosses), and k the number of groups*. Now (k-1 nCr n) is defined as: (n-1)!/((k-1)!*(n-k)!). With ! the factorial. I borrowed a way to calculate nCr from here:

import operator as op from functools import reduce

def ncr(n, r):
    r = min(r, n-r)
    if r == 0: return 1
    return reduce(op.mul,range(n-r+1,n+1))//reduce(op.mul,range(1,r+1))

And now the last step is only to calculate the number of combinations (times two) for every k higher than m up to n. So:

def group_num(n,m):
    total = 0
    for i in range(m+1,n+1):
        total += ncr(n-1,i-1)
    return 2*total

or putting it into a one-liner:

def group_num(n,m):
    return 2*sum(ncr(n-1,i-1) for i in range(m+1,n+1))

Now we can test our code:

>>> group_num(4,1)
14
>>> group_num(10,6)
260
>>> group_num(200,110)
125409844583698900384745448848066934911164089598228258053888
>>> group_num(500,260)
606609270097725645141493459934317664675833205307408583743573981944035656294544972476175251556943050783856517254296239386495518290119646417132819099088

Which are the expected numbers (for the first two). As you can see the total amount blows up enormously so even the fastest algorithm to count the number of tosses one at a time will easily be outperformed by this approach (the results were calculated in less than a second).

The ncr function can be calculated in O(n) (but this can even be improved by using a memoize on the factorial for instance); and the group_num function is thus computed (without memoize optimization) in O((n-m)×n). Which thus completely outfactors the exponential behavior.