How could these alternate numpy `uniform` vs `random` constructions possibly differ?


I had some code that random-initialized some numpy arrays with:

rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32)  # [-1.0, 1.0)
new_vectors /= vector_size

And all was working well, all project tests passing.

Unfortunately, uniform() returns np.float64, though downstream steps only want np.float32, and in some cases this array is very large (think millions of 400-dimensional word-vectors). So the np.float64 temporary returned by uniform(), alongside the float32 copy made by astype(), momentarily uses about 3X the RAM the final array needs.
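
For scale, some illustrative arithmetic (assuming 1 million 400-dimensional vectors):

n_vectors, dim = 1_000_000, 400
float64_temp = n_vectors * dim * 8   # bytes held by uniform()'s float64 return value
float32_final = n_vectors * dim * 4  # bytes for the float32 array actually kept
print(float64_temp / 1e9)            # ~3.2 GB temporary
print(float32_final / 1e9)           # ~1.6 GB kept; peak is ~4.8 GB, about 3X what's needed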

Thus, I replaced the above with what definitionally should be equivalent:

rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32)  # [0.0, 1.0)                                                 
new_vectors *= 2.0  # [0.0, 2.0)                                                                                  
new_vectors -= 1.0  # [-1.0, 1.0)
new_vectors /= vector_size

And after this change, all closely-related functional tests still pass, but a single distant, fringe test relying on far-downstream calculations from the vectors so initialized has started failing, and failing in a very reliable way. It's a stochastic test that passes with a large margin for error with the first construction above, but always fails with the second. So: something has changed, but in some very subtle way.

The superficial values of new_vectors seem properly and similarly distributed in both cases. And again, all the "close-up" tests of functionality still pass.

So I'd love theories for what non-intuitive effects this 3-line change may have had that could show up far downstream.

(I'm still trying to find a minimal test that detects whatever's different. If you'd enjoy doing a deep-dive into the affected project, the exact close-up tests that succeed, the one fringe test that fails, and the commits with/without the tiny change are all at https://github.com/RaRe-Technologies/gensim/pull/2944#issuecomment-704512389. But really, I'm just hoping a numpy expert might recognize some tiny corner-case where something non-intuitive happens, or offer some testable theories of same.)

Any ideas, proposed tests, or possible solutions?


There are 3 answers below.

BEST ANSWER

I ran your code with the following values:

seed = 0
target_shape = [100]
vector_size = 3

I noticed that the code in your first solution generated a different new_vectors than your second solution did.

Specifically, it looks like uniform keeps only every other value that random produces from the underlying generator with the same seed. This is probably an implementation detail of NumPy's generator: a float64 draw consumes more random bits per value than a float32 draw does.

In the following snippet I only inserted spaces to align similar values; there is probably also some float rounding going on that makes the aligned values appear not quite identical.

[            0.09130779,              -0.15347552,             -0.30601767,              -0.32231492,              0.20884682, ...]
[0.23374946, 0.09130772, 0.007424275, -0.1534756, -0.12811375, -0.30601773, -0.28317323, -0.32231498, -0.21648853, 0.20884681, ...]

Based on this, I speculate that your stochastic test case only tests your solution with one seed, and because the new solution generates a different sequence of values from that seed, that particular run happens to fail the test case.
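
For reference, here's a minimal script that reproduces the comparison above (using the same seed, target_shape, and vector_size listed at the start of this answer):

import numpy as np

seed = 0
target_shape = [100]
vector_size = 3

# First construction: float64 uniform, cast down to float32
rng = np.random.default_rng(seed=seed)
a = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32)
a /= vector_size

# Second construction: float32 random, rescaled to [-1.0, 1.0)
rng = np.random.default_rng(seed=seed)
b = rng.random(target_shape, dtype=np.float32)
b *= 2.0
b -= 1.0
b /= vector_size

print(a[:5])
print(b[:10])  # roughly every other value lines up with a value from the first construction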

ANSWER

Let's print new_vectors * 2**22 % 1 for both methods, i.e., let's look at what's left after the first 22 fractional bits (program is at the end). With the first method:

[[0.         0.5        0.25       0.         0.        ]
 [0.5        0.875      0.25       0.         0.25      ]
 [0.         0.25       0.         0.5        0.5       ]
 [0.6875     0.328125   0.75       0.5        0.52539062]
 [0.75       0.75       0.25       0.375      0.25      ]]

With the second method:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Quite a difference! The second method doesn't produce any numbers with 1-bits after the first 22 fractional bits.
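
One way to check this directly is to test which grid the samples fall on (a sketch; the 2**-24 grid is used here because float32 samples land on it whether the generator uses 23 or 24 random bits per value, while float64 samples generally do not):

import numpy as np

rng = np.random.default_rng(13)
x32 = rng.random(1_000_000, dtype=np.float32)
x64 = rng.random(1_000_000)  # float64

def on_grid(x, step):
    # True if every value is an exact integer multiple of `step`
    q = x / step
    return bool(np.all(q == np.floor(q)))

print(on_grid(x32, 2.0**-24))  # True: the float32 samples lie on a coarse grid
print(on_grid(x64, 2.0**-24))  # False: the float64 samples carry many more low-order bits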

Let's imagine we had a type float3 which could only hold three significant bits (think span of non-zero bits), so for example the numbers (in binary) 1.01 or 11100.0 or 0.0000111, but not 10.01 because that has four significant bits.

Then the random number generator for the range [0, 1) would pick from these eight numbers:

0.000
0.001
0.010
0.011
0.100
0.101
0.110
0.111

Wait, hold on. Why only from those eight? What about for example the aforementioned 0.0000111? That's in [0, 1) and can be represented, right?

Well yes, but note that that's in [0, 0.5). And there are no further representable numbers in the range [0.5, 1), as those numbers all start with "0.1" and thus any further 1-bits can only be at the second or third fractional bit. For example 0.1001 would not be representable, as that has four significant bits.

So if the generator also picked from any numbers other than those eight above, they'd all have to be in [0, 0.5), creating a bias. It could pick from four different numbers in that range instead, or possibly include all representable numbers in that range with proper probabilities, but either way you'd have a "gap bias", where numbers picked from [0, 0.5) can have smaller or larger gaps than numbers picked from [0.5, 1). Not sure "gap bias" is a thing or the correct term, but the point is that the distribution in [0, 0.5) would look different from that in [0.5, 1). The only way to make them look the same is to stick with picking from those equally-spaced eight numbers above. The distribution/possibilities in [0.5, 1) dictate what you should use in [0, 0.5).

So... a random number generator for float3 would pick from those eight numbers and never generate for example 0.0000111. But now imagine we also had a type float5, which could hold five significant bits. Then a random number generator for that could pick 0.00001. And if you then convert that to our float3, that would survive, you'd have 0.00001 as float3. But in the range [0.5, 1), this process of generating float5 numbers and converting them to float3 would still only produce the numbers 0.100, 0.101, 0.110 and 0.111, since float3 still can't represent any other numbers in that range.
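
To make the float3/float5 picture concrete, here's a small toy simulation (illustrative only; truncate_sig_bits is a hypothetical helper that keeps a given number of significant binary digits, standing in for the cast to a lower-precision float):

import numpy as np

def truncate_sig_bits(x, bits):
    # Truncate positive values to `bits` significant binary digits
    e = np.floor(np.log2(x))          # exponent of each value
    scale = 2.0 ** (e - (bits - 1))   # spacing of representable values at that exponent
    return np.floor(x / scale) * scale

rng = np.random.default_rng(13)
x = rng.random(100_000)               # high-precision draws in [0, 1), like "float5 or better"
x = x[x > 0]                          # avoid log2(0)
as_float3 = truncate_sig_bits(x, 3)   # "cast" down to the toy float3

# In [0.5, 1) only the coarse grid 0.100, 0.101, 0.110, 0.111 survives the cast...
print(sorted(float(v) for v in set(as_float3[as_float3 >= 0.5])))  # [0.5, 0.625, 0.75, 0.875]
# ...while near zero, much finer-grained values survive.
print(len(set(as_float3[as_float3 < 0.5])))                        # many distinct values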

So that's what you get, just with float32 and float64. Your two methods give you different distributions. I'd say the second method's distribution is actually better, as the first method has what I called "gap bias". So maybe it's not your new method that's broken, but the test. If that's the case, fix the test. Otherwise, an idea to fix your situation might be to use the old float64-to-float32 way, but not produce everything at once. Instead, prepare the float32 structure with just 0.0 everywhere, and then fill it in smaller chunks generated with your new way.

Small caveat, btw: Looks like there's a bug in NumPy for generating random float32 values, not using the lowest-position bit. So that might be another reason the test fails. You could try your second method with (rng.integers(0, 2**24, target_shape) / 2**24).astype(np.float32) instead of rng.random(target_shape, dtype=np.float32). I think that's equivalent to what the fixed version would be (since it's apparently currently doing it that way, except with 23 instead of 24).
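
Plugged into your second construction, that workaround would look something like this (a sketch based on the snippet above; note that rng.integers returns int64 and the division produces a float64 temporary, so it gives back some of the memory savings):

rng = np.random.default_rng(seed=seed)
new_vectors = (rng.integers(0, 2**24, target_shape) / 2**24).astype(np.float32)  # [0.0, 1.0), 24 random bits
new_vectors *= 2.0  # [0.0, 2.0)
new_vectors -= 1.0  # [-1.0, 1.0)
new_vectors /= vector_size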

The program for the experiment at the top (also at repl.it):

import numpy as np

# Some setup
seed = 13
target_shape = (5, 5)
vector_size = 1

# First way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.uniform(-1.0, 1.0, target_shape).astype(np.float32)  # [-1.0, 1.0)
new_vectors /= vector_size

print(new_vectors * 2**22 % 1)

# Second way
rng = np.random.default_rng(seed=seed)
new_vectors = rng.random(target_shape, dtype=np.float32)  # [0.0, 1.0)                                                 
new_vectors *= 2.0  # [0.0, 2.0)                                                                                  
new_vectors -= 1.0  # [-1.0, 1.0)
new_vectors /= vector_size

print(new_vectors * 2**22 % 1)
ANSWER

A way to maintain precision and conserve memory is to create your large target float32 array first, then fill it in block by block, generating each block at higher precision and letting the assignment downcast it.

For example:

def generate(shape, value, *, seed=None, step=10):
    # Allocate the final float32 array up front...
    arr = np.empty(shape, dtype=np.float32)
    rng = np.random.default_rng(seed=seed)
    (d0, *dr) = shape
    # ...then fill it `step` rows at a time; each float64 block returned by
    # uniform() is downcast to float32 by the assignment, so only a small
    # float64 temporary exists at any moment.
    for i in range(0, d0, step):
        j = min(d0, i + step)
        arr[i:j] = rng.uniform(-1 / value, 1 / value, size=[j - i] + dr)
    return arr

which can be used as:

generate((100, 1024, 1024), 7, seed=13)

You can tune the size of these blocks (via step) to maintain performance.
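
For a rough sense of the savings with the example shape above (illustrative arithmetic only): generating everything as float64 in one call would need an ~800 MiB temporary, while each step=10 block needs only ~80 MiB:

full_float64_temp = 100 * 1024 * 1024 * 8  # ~800 MiB if uniform() generated everything at once
per_block_temp = 10 * 1024 * 1024 * 8      # ~80 MiB per block with step=10
print(full_float64_temp / 2**20, per_block_temp / 2**20)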