Anomalous Parameter Estimation Following Gamma Distribution Fitting with scipy.stats

47 Views Asked by At

I have two sets of datasets (https://github.com/badal01/precipi_data/blob/main/Data1.xlsx). The range of the two datasets is more or less similar. The alpha value for the first dataset (first row in Data1.xlsx) is 5.002. However, the alpha value for the second row is 396111157.0657891, which is abnormal. I attempted to check if the data contains any absurd values, very high values, or negative values. However, the data looks completely normal and very similar to the first row.

import scipy.stats as stats
import pandas as pd
io = pd.read_excel('Data1.xlsx',header=None).to_numpy()
fit_alpha, fit_loc, fit_beta=stats.gamma.fit(io[0,:])
fit_alpha1, fit_loc1, fit_beta1=stats.gamma.fit(io[1,:])

I tried it with MATLAB, and it is working quite well. However, I do not understand what's happening in Python. Any help would be appreciated.

I tried with different library. I did not observe any different results.

1

There are 1 best solutions below

0
On

Here is complete code to go along with my comment above. (In the future, please include any data in plain text.)

Here's the fit you were happy with:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x1 = np.asarray([4.43662e-05,2.43413e-05,2.73413e-05,2.9961e-05,4.06934e-05,2.56346e-05,4.02114e-05,1.94386e-05,2.18386e-05,3.1264e-05,3.31055e-05,3.00496e-05,3.00685e-05,3.31455e-05,3.98705e-05,6.30289e-05,2.39244e-05,5.56489e-05,1.90789e-05,2.67739e-05,3.46394e-05,3.67507e-05,3.12904e-05,3.55849e-05,3.91662e-05,7.19415e-05,2.21005e-05,2.9677e-05,3.56736e-05,3.29453e-05,4.79659e-05,4.09342e-05,2.72911e-05,3.61119e-05,5.24038e-05,3.84542e-05,4.64852e-05,1.71126e-05,4.47858e-05,5.89909e-05,2.81931e-05,4.28806e-05,4.32288e-05,3.52785e-05,5.44103e-05,2.42965e-05,1.87145e-05,5.13848e-05,2.6523e-05,3.37923e-05,2.43713e-05,3.43567e-05,3.67469e-05,3.86891e-05,3.59647e-05,3.80963e-05,2.43904e-05,3.79046e-05,2.07185e-05,3.1492e-05,2.84438e-05,5.70566e-05,3.68865e-05,3.39524e-05,3.34151e-05,4.95275e-05,4.34074e-05,2.86148e-05,3.05382e-05,2.9697e-05,2.44529e-05,4.17229e-05,5.39806e-05,6.02317e-05,3.82269e-05,3.63218e-05,4.14949e-05,3.10743e-05,2.43437e-05,3.16157e-05,4.81429e-05,2.66055e-05,3.00394e-05,4.37599e-05,4.01972e-05,4.15613e-05])
x2 = np.asarray([5.48193e-05,4.79855e-05,7.32296e-05,6.02249e-05,5.51259e-05,4.70833e-05,4.3032e-05,5.08287e-05,4.90293e-05,5.65956e-05,2.10931e-05,5.3334e-05,4.45536e-05,6.39792e-05,6.37097e-05,6.71664e-05,6.67199e-05,6.53865e-05,4.05624e-05,2.86803e-05,5.09496e-05,5.64781e-05,5.00183e-05,5.86148e-05,5.84764e-05,4.27342e-05,8.01059e-05,3.5541e-05,7.83815e-05,7.64419e-05,6.71847e-05,4.83841e-05,5.31507e-05,5.55475e-05,4.5599e-05,8.18473e-05,7.35816e-05,5.1769e-05,5.91082e-05,6.43396e-05,4.32755e-05,7.08638e-05,6.88461e-05,6.31059e-05,4.04879e-05,6.10569e-05,6.51778e-05,3.58981e-05,7.56359e-05,7.48164e-05,5.36863e-05,5.94128e-05,3.60576e-05,4.93196e-05,4.86909e-05,4.92113e-05,2.95355e-05,4.34835e-05,5.83857e-05,7.798e-05,3.49355e-05,5.67644e-05,5.92727e-05,3.66655e-05,4.39566e-05,4.53161e-05,5.67604e-05,3.63818e-05,5.8882e-05,3.98616e-05,4.59374e-05,5.21106e-05,5.67607e-05,5.92866e-05,5.00237e-05,6.78354e-05,7.96478e-05,3.85134e-05,6.38701e-05,5.60584e-05,7.24924e-05,5.33234e-05,4.41117e-05,4.6092e-05,6.4302e-05,5.448e-05])

ptp= np.ptp(x1)
x = np.linspace(x1.min() - ptp/10, x1.max() + ptp/10, 300)

params1 = stats.gamma.fit(x1)
dist1 = stats.gamma(*params1)
plt.hist(x1, density=True)
plt.plot(x, dist1.pdf(x))

params1  # (5.005034504766963, 1.1531618825829435e-05, 4.9139272044517935e-06)

enter image description here

For the other data, you get an unexpected large shape parameter.

ptp= np.ptp(x2)
x = np.linspace(x2.min() - ptp/10, x2.max() + ptp/10, 300)

params2 = stats.gamma.fit(x2)
dist2 = stats.gamma(*params2)
plt.hist(x2, density=True)
plt.plot(x, dist2.pdf(x))
params2  # (398115981.5241561, -0.25897428431767466, 6.506374646288466e-10)

enter image description here

Keep in mind that fit provides the unconstrained maximum likelihood estimate of the three parameter gamma distribution; it does not consider any other constraints or desires you might have if you don't specify them.

If you are happy with a two-parameter gamma distribution, you can fix the location to zero with floc=0.

params2 = stats.gamma.fit(x2, floc=0)
dist2 = stats.gamma(*params2)
plt.hist(x2, density=True)
plt.plot(x, dist2.pdf(x))
params2  # (16.548028224665476, 0, 3.3166055685313535e-06)

enter image description here

If you want to put bounds on the fitted parameters, try scipy.stats.fit.

bounds = {'a': (1e-3, 100), 'loc': (0, 0), 'scale': (1e-12, 1)}
res = stats.fit(stats.gamma, x2, bounds=bounds)
print(res)
#  params: FitParams(a=13.076023225160911, loc=0.0, scale=4.194486773039213e-06)
# success: True
# message: 'Optimization terminated successfully.'
res.plot(plot_type='cdf')

enter image description here

That said, you might also consider whether your data is more consistent with a different distribution.