Problem:
I stumbled across a problem that has deep (and very impactful) consequences in industry, and that does not seem to be documented anywhere:
It seems that in Python, matrix multiplication (using either @ or np.matmul) gives different answers for C-contiguous and F-contiguous arrays:
import platform
print(platform.platform()) # Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31
import numpy as np
print(np.__version__) # 1.23.5
np.random.seed(0) # will work with most seeds, I believe
M, N = 10, 5
X_c = np.random.normal(size=M*N).reshape(M,N)
X_f = np.asfortranarray(X_c)
assert (X_c == X_f).all()
p = np.random.normal(size=N)
assert (X_c @ p == X_f @ p).all() # FAIL
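The discrepancy itself is tiny, as can be checked by printing the largest absolute difference (the exact value will depend on the platform and BLAS build):
print(np.abs(X_c @ p - X_f @ p).max())  # tiny but nonzero, on the order of 1e-16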
To the best of your knowledge, is this problem documented anywhere?
Example of Consequence:
In some cases, these errors can become huge:
Example: In the sklearn.linear_model.LinearRegression class, the fit method will lead to very wrong parameters in the case of a near-singular matrix X that is F-contiguous (typically, one that came out of a pandas DataFrame).
This can lead to predictions all over the place and a negative R².
Summary
As I see it, the problem that you noticed is a consequence of the fundamental limitations of floating point arithmetic; in particular, of the fact that the order in which floating point values are summed affects the result.
Why the order of summation matters
Consider the example from the code below, in which we want to sum all elements of an array a. We will do this in two different orders: sum_1 sums all values from left to right, while sum_2 progressively sums all neighboring pairs of values until only one value remains. Mathematically, this should not make a difference, as summation is an associative operation on real numbers. It is not associative on the floating point numbers of a computer, however, as we can see from the output of the last line:
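A minimal version of this experiment might look as follows (the array a = [0.5, 0.5, eps/2, eps/2] is chosen so that the rounding steps are easy to trace):

import numpy as np

eps = np.finfo(float).eps                 # difference between 1.0 and the next larger float
a = np.array([0.5, 0.5, eps / 2, eps / 2])

sum_1 = ((a[0] + a[1]) + a[2]) + a[3]     # sum all values from left to right
sum_2 = (a[0] + a[1]) + (a[2] + a[3])     # sum neighboring pairs, then the pair sums
print(sum_1, sum_2)                       # 1.0 1.0000000000000002
print(sum_1 == sum_2)                     # False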
What is going on here? Notice the use of eps, which is defined as the difference between 1.0 and the smallest representable float that is larger than 1.0. For the float type, at least on my system, the value is ca. 2e-16, i.e. 0.0000000000000002. The definition implies that, if we add a value of eps/2 or smaller to 1.0, the result will still be 1.0!

In the constructed example from above, the latter happens in the calculation of sum_1 but not of sum_2, as we can see from their calculation steps:

sum_1:
- 0.5 + 0.5 → 1.0 (add a[0] and a[1]: works as expected)
- 1.0 + eps/2 → 1.0 (add a[2] to the previous sum: eps/2 is too small for the result to be represented as its own float value larger than 1.0, so the sum remains 1.0)
- 1.0 + eps/2 → 1.0 (add a[3] to the previous sum: same problem as in step 2)

sum_2:
- 0.5 + 0.5 → 1.0 (add a[0] and a[1]: works as expected)
- eps/2 + eps/2 → eps (add a[2] and a[3]: works as expected)
- 1.0 + eps → 1.0000000000000002 (add the results from steps 1 and 2: works as expected, because eps is large enough for the result to be represented as its own float value larger than 1.0)

What this has to do with matrix-vector products
You might ask yourself what your observation has to do with my example from above. Here is what is most likely going on:
For each element in the result vector of a matrix-vector product, as calculated in your code, we essentially compute a dot product of two vectors: (1) the corresponding row of the given matrix and (2) the given vector. The dot product, by definition, multiplies the corresponding elements of the given vectors, then sums these intermediate results. So the only place where identical input values can produce different results in a matrix-vector product is this summation; or rather, its order, as the last step of each dot product. And here is the thing: a C-contiguous and an F-contiguous matrix store the same values in a different memory layout, and NumPy (or rather the underlying BLAS routine) may choose a different traversal, and thus a different summation order, for each layout. As the example above shows, a different summation order can produce a (slightly) different result.
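To make the role of the summation order explicit, here is a small sketch in plain Python (row and p stand in for one row of your matrix and for your vector; which order the underlying BLAS routine actually uses is an implementation detail):

import numpy as np

rng = np.random.default_rng(0)
row = rng.normal(size=5)            # one row of the matrix
p = rng.normal(size=5)              # the vector

products = row * p                  # the elementwise products are the same for both layouts
d1 = 0.0
for value in products:              # accumulate left to right
    d1 += value
d2 = 0.0
for value in products[::-1]:        # accumulate right to left
    d2 += value
# d1 and d2 are mathematically equal, but, just like sum_1 and sum_2 above,
# they are not guaranteed to be bit-identical
print(d1, d2)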
Note that this is not a problem of the particular value of eps, but a general problem of summing large and small floating point values.

What is the impact of all this
I will try to summarize some points that should serve as key takeaways:
- Floating point imprecisions of this kind are usually tiny (on the order of eps from above). This does not mean that such imprecisions should be neglected, e.g. they may amplify over progressive calculations.
- When comparing floating point results, prefer np.allclose() over exact equality comparison (see the snippet below).
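Applied to the comparison in your question, that could look like this:

print(np.allclose(X_c @ p, X_f @ p))   # expected to print True: the results agree within tolerance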
I admit, I doubt your final conclusion a bit; namely, that the huge errors that you see in sklearn.linear_model.LinearRegression have the same cause. Maybe they are the result of the aforementioned amplification effects. Maybe you should provide a minimal reproducible example for further investigation there.