I have two arrays, a and b, with shapes (11, 1) and (46, 1). Every element of a and b has the same value.
Here is that value:
0.0435390887054011854750967813743045553565025329589843750
However, when I subtract the mean of one array from the mean of the other, MATLAB and Python give different results:
Matlab:
a = repmat(0.0435390887054011854750967813743045553565025329589843750, 1, 11);
b = repmat(0.0435390887054011854750967813743045553565025329589843750, 1, 46);
mean(a)-mean(b)
Output:
ans = 0
Python:
import numpy as np
a = np.array([0.0435390887054011854750967813743045553565025329589843750] * 11, dtype=np.float64)
b = np.array([0.0435390887054011854750967813743045553565025329589843750] * 46, dtype=np.float64)
np.mean(a)-np.mean(b)
Output:
-6.938893903907228e-18
Mathematically, the two means are identical, so the result should be 0. The Python result is evidently affected by floating-point rounding error.
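For scale, the discrepancy can be compared with the spacing between adjacent float64 values near the common element value. The sketch below is only an illustration: the names x, diff and ulp are introduced here, and the exact difference may vary with the NumPy version and its summation order.

```python
import numpy as np

x = 0.0435390887054011854750967813743045553565025329589843750
a = np.array([x] * 11, dtype=np.float64)
b = np.array([x] * 46, dtype=np.float64)

# Difference of the two means as computed in float64.
diff = np.mean(a) - np.mean(b)

# Gap between x and the next larger representable float64 (one ULP at x).
ulp = np.spacing(np.float64(x))

print("difference:", diff)
print("one ULP at x:", ulp)
print("difference in ULPs:", diff / ulp)
```

The reported value -6.938893903907228e-18 is 2^-57, which is the float64 spacing at this magnitude, so the two computed means differ by a single unit in the last place.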
If I work around the floating-point error by converting the arrays to decimal form, Python correctly outputs 0. However, the conversion sharply increases the run time.
Is there a better way to handle this?
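One cheaper alternative to decimal arithmetic, offered only as a sketch, is to subtract a common reference value from both arrays before averaging. The difference of means is mathematically unchanged by such a shift. The helper name shifted_mean_diff and the choice of a[0] as the reference are assumptions made for illustration, not part of NumPy.

```python
import numpy as np

x = 0.0435390887054011854750967813743045553565025329589843750
a = np.array([x] * 11, dtype=np.float64)
b = np.array([x] * 46, dtype=np.float64)


def shifted_mean_diff(a, b):
    """Return mean(a) - mean(b), computed after removing a common offset."""
    # Hypothetical choice: any value close to the data works as a reference.
    ref = a.flat[0]
    # For identical elements, a - ref and b - ref are exactly zero,
    # so no rounding error can accumulate in the sums.
    return np.mean(a - ref) - np.mean(b - ref)


diff = shifted_mean_diff(a, b)
print(diff)  # 0.0 for these constant arrays
```

For data that is not constant, the shift does not remove rounding error entirely; it only reduces the magnitude of the values being summed, so a small residual may remain.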
I originally ran into this problem while running a two-sample t-test. Running the t-test on a and b above gives different results in MATLAB and Python:
Matlab:
[tad,p_tmp] = ttest2(a, b)
Output:
tad = NaN
p_tmp = NaN
Python:
from scipy.stats import ttest_ind
_, p_tmp = ttest_ind(a, b)
Output:
p_tmp = 0.001731693698230969
I ran the same experiment on some other similar arrays: MATLAB outputs NaN or 1, while Python outputs very small p-values.
Through debugging, I found where the problem arises: ttest_ind in scipy.stats computes mean(a)-mean(b).
If mean(a)-mean(b) is 0, then p=1. However, as mentioned above, mean(a)-mean(b) gives a non-zero result in Python.
I also tried other Python libraries, such as statsmodels.stats.weightstats and pingouin, but they all behave similarly.
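A possible workaround, sketched here under the assumption that shifting both samples by the same constant is acceptable (the t statistic is mathematically invariant to it), is to remove a common offset before calling ttest_ind. The variable ref and the choice of a[0] are illustrative only.

```python
import numpy as np
from scipy.stats import ttest_ind

x = 0.0435390887054011854750967813743045553565025329589843750
a = np.array([x] * 11, dtype=np.float64)
b = np.array([x] * 46, dtype=np.float64)

# Hypothetical reference value: subtracting it from both samples leaves the
# t statistic unchanged in exact arithmetic.
ref = a[0]

# For these constant arrays both shifted samples are exactly zero, so the
# mean difference and the pooled variance are both exactly zero (0/0).
with np.errstate(divide="ignore", invalid="ignore"):
    t_stat, p_val = ttest_ind(a - ref, b - ref)

print(t_stat, p_val)
```

With both the numerator and the denominator exactly zero, the statistic is undefined rather than a small spurious number, which is in line with the NaN that MATLAB reports for the same input.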
In addition, I found another problem. Suppose there are two feature matrices, X1 and X2, holding the positive and negative samples respectively, where X1 has shape (11, 906) and X2 has shape (46, 906).
Let c and d be the same column taken from X1 and X2, respectively. In MATLAB, the p-value for that column is the same whether the t-test is run on c and d alone or on the entire matrices, but in Python the two results differ.
A simple example: if c and d are the data in column 420, then:
ttest_ind(c, d)
pvalue = 1.1031870941764164e-09
_, pvalue = ttest_ind(X1, X2)
pvalue[420] = 8.819570120388648e-09
In MATLAB, however, both cases give:
pvalue = 8.819570120388648e-09
Why does this difference occur, and how can I make the two cases give consistent results in Python?
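To help narrow this down, the two call styles can be compared on synthetic data. The sketch below uses randomly generated stand-ins for X1 and X2 (not the real feature matrices) and passes axis=0 explicitly so that the test runs column by column. The names rng, col, p_matrix and p_column are introduced only for this illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins with the same shapes as X1 and X2 in the question.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(11, 906))
X2 = rng.normal(loc=0.5, size=(46, 906))

col = 420

# Whole-matrix test: one p-value per column.
_, p_matrix = ttest_ind(X1, X2, axis=0)

# Single-column test on the same data.
_, p_column = ttest_ind(X1[:, col], X2[:, col])

print(p_matrix[col], p_column)
print(np.isclose(p_matrix[col], p_column))
```

If the two values agree on well-conditioned data like this but not on the real column, one possible explanation is that the real column is nearly constant, so that a different summation order in the 1-D and 2-D code paths changes the rounding noise that dominates the statistic. That is an assumption to check against the actual data, not a confirmed cause.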