Correctly seeding numpy random generator


For my scientific experiments, I usually seed using:

rng = np.random.Generator(np.random.PCG64(seed))

which, for current numpy versions, is equivalent to

rng = np.random.default_rng(seed)

As I repeat my experiments n times and average their results, I usually set the seed to each integer from 0 to n-1.
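Concretely, the pattern looks like this (run_experiment is a hypothetical stand-in for a single experiment):

import numpy as np

def run_experiment(rng):
    # stand-in for a real experiment: average a batch of random draws
    return rng.standard_normal(1000).mean()

n = 10
results = [run_experiment(np.random.default_rng(seed)) for seed in range(n)]
mean_result = sum(results) / n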

However, the documentation (here and here) states that

Seeds should be large positive integers.

or

We default to using a 128-bit integer using entropy gathered from the OS. This is a good amount of entropy to initialize all of the generators that we have in numpy. We do not recommend using small seeds below 32 bits for general use.

However, in the second reference, it also states

There will not be anything wrong with the results, per se; even a seed of 0 is perfectly fine thanks to the processing that SeedSequence does.

This feels contradictory, and I wonder whether small seeds are fine to use or whether one should move towards larger seeds. In particular: (i) at which point (if any) does a large seed make a difference compared to a small one, and (ii) for scientific experiments (e.g. machine learning / algorithmic research), should one prefer larger seeds, or does it not make a difference?
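For reference, SeedSequence is the component that expands whatever seed is passed into the initial bit-generator state; a minimal check (the seed values are arbitrary) shows that even 0 is hashed into well-spread words:

import numpy as np

for seed in (0, 42, 2**127):
    ss = np.random.SeedSequence(seed)
    print(seed, ss.generate_state(4))  # 4 well-mixed 32-bit words per seed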

PS: This question is highly related to Random number seed in numpy, but concerns the now-recommended Generator. Furthermore, the answer there does not seem in-depth enough, as it does not discuss high versus low seeds.


There are 2 answers below.

BEST ANSWER

The justification is in the quick start page which you linked:

We recommend using very large, unique numbers to ensure that your seed is different from anyone else’s. This is good practice to ensure that your results are statistically independent from theirs unless you are intentionally trying to reproduce their result.

In short, this is to avoid reproducing someone else's bias (if any) by generating the exact same dataset: humans are far more likely to pick small, memorable numbers by default (0, 11, 42) than very large ones, so two people who both choose, say, 42 end up with identical random streams.

In your use case this is probably not important.
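Still, if one does want such a seed, the numpy SeedSequence docs suggest drawing 128 bits from the OS once and recording the value alongside the results:

import secrets

import numpy as np

seed = secrets.randbits(128)  # fresh 128-bit seed; print/store it for reproducibility
print(seed)
rng = np.random.default_rng(seed)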

SECOND ANSWER

Frankly, you have to think about the internals of any pseudo-RNG and compare them with what you want. An RNG is a pretty simple construct.

There is an N-bit state as well as three functions:

  • state seed2state(seed) # converts a seed to an appropriate state
  • state nextstate(state) # advances the state by one step
  • uint64 state2output(state) # extracts "randomness" from the state, to be used e.g. for a U[0, 1) random variate

That is it, nothing more, nothing less. Everything else is a helpful add-on.
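As an illustration only (a toy 64-bit LCG using Knuth's well-known MMIX constants, not PCG64), the three functions might look like:

MASK64 = (1 << 64) - 1

def seed2state(seed):
    # convert the seed to a state; a real RNG mixes the bits much harder
    return seed & MASK64

def nextstate(state):
    # advance the state by one LCG step
    return (6364136223846793005 * state + 1442695040888963407) & MASK64

def state2output(state):
    # extract "randomness" from the state: here, simply the high 32 bits
    return state >> 32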

You are using PCG64 (the XSL-RR variant), which has a 128-bit state and generates 64-bit output.

The first problem is how you populate the state from the seed. The potential issue here is that you use a small integer, and not enough bit chopping/mixing happens right away after one call to nextstate() followed by state2output(). Basically, a small integer seed would leave a state of mostly zeros, which could affect your simulation.

Solution: run some 1K-10K calls to mix up your RNG, then save the PCG state as a dictionary/JSON and start from there; your state is now well mixed. This is actually the best approach: simulation code should take a state, not a seed, as input. You control the state outside of the simulation code.

import numpy as np

seed = np.uint64(13579754321)

pcg = np.random.PCG64(seed)
rng = np.random.Generator(pcg)

rng.bytes(64 * 10000)  # mixing/warm-up: advance well past the initial state

state = pcg.state      # save the state (a plain dict)
pcg.state = state      # ...and restore it later
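Since the saved state is a plain dict of Python ints and strings, it round-trips through JSON; a minimal sketch (the filename is arbitrary):

import json

with open("pcg_state.json", "w") as f:
    json.dump(pcg.state, f)   # persist the well-mixed state

with open("pcg_state.json") as f:
    pcg.state = json.load(f)  # restore it in a later run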

The second issue is ensuring independent sampling sequences for different simulations, such that there is no overlap and no correlations.

The advice to make seeds large so that there is no overlap works ONLY BY CHANCE. Why leave to chance things which could be perfectly controlled?

PCG64 has two features which can greatly help: advance and jumps. advance moves the state as if a given number (distance) of calls had been made. Internally it is pretty fast and works in log(distance) time.
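For example, advancing in one jump lands on exactly the same state as stepping one call at a time (distances compose additively):

pcg_a = np.random.PCG64(12345)
for _ in range(1000):
    pcg_a.advance(1)                # 1000 single steps

pcg_b = np.random.PCG64(12345)
pcg_b.advance(1000)                 # one jump of the same distance

assert pcg_a.state == pcg_b.state   # identical resulting states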

You have to calculate or check once how many RNG calls you need for a single simulation, add some reserve, and just use it.

Some pseudocode:

def simulation(params, state):
    pcg = np.random.PCG64(1)  # the seed here is irrelevant; the state is overwritten next
    pcg.state = state

    rng = np.random.Generator(pcg)
    ...
    return value

And the main function:

def main():

    nof_PCG_calls_per_simulation = 100000000  # reserved stream length per simulation

    pcg = np.random.PCG64(135797531)
    rng = np.random.Generator(pcg)

    rng.bytes(64 * 10000)  # mixing/warm-up

    state = pcg.state  # well-mixed starting state

    N = 1000  # number of simulations to run

    s = 0.0
    for k in range(N):
        pcg = np.random.PCG64(1)  # seed is irrelevant; the state is set next
        pcg.state = state

        # give simulation k its own non-overlapping block of the stream
        simstate = pcg.advance(k * nof_PCG_calls_per_simulation).state

        v = simulation(params, simstate)
        s += v

    return (s / N, N)

There is also a good PCG method to consider: jumps. It is semantically very close to advance; it returns a new PCG whose state differs as if 2^127 numbers had been generated.
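In numpy this is exposed as PCG64.jumped(), which returns a fresh bit generator; a sketch of one far-apart stream per simulation (k + 1 keeps the original generator's own block separate):

pcg = np.random.PCG64(135797531)

n_sims = 1000
rngs = [np.random.Generator(pcg.jumped(k + 1)) for k in range(n_sims)]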

This way you can be sure that the PCG state is well mixed and that the streams used in different simulations REALLY do not overlap.