I have a Sample in OpenTURNS, and I want to fit a distribution to it. In order to take the number of parameters into account, I want to use the Bayesian Information Criterion (BIC). The BIC ranks a list of models according to a penalized maximum likelihood criterion that accounts for the sample size and the number of parameters of each distribution. A lower BIC score is better.
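For reference, the usual textbook definition can be written as follows, where $k$ is the number of estimated parameters, $n$ is the sample size and $\hat{L}$ is the maximized likelihood. This is the standard formula only; the value reported by OpenTURNS may be scaled differently (for example divided by the sample size), which would not change the ranking for a fixed sample.

$$
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
$$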
I know that FittingTest.BestModelBIC returns the model that best fits the data. However, I would like to see more than the best fit: perhaps the model with the second-best BIC has more physical meaning for me.
How can I do this in OpenTURNS?
PS Here is an example for BestModelBIC.
I suggest using GetContinuousUniVariateFactories to get the list of continuous distribution factories and combining it with the FittingTest.BIC function, which computes the BIC score of a distribution. Finally, the table can be sorted with the sortAccordingToAComponent method of the Sample.
Here is a detailed example of this method. We begin by generating a sample.
The GetContinuousUniVariateFactories static method returns a list of all available factories for continuous distributions. We could use this list without further processing, but the histogram would come first in the ranking, because it is specifically designed to follow the data. Hence, we do not include it in our computation of the BIC score. Printing the filtered list shows the remaining factories.
Next, we perform a for loop over all factories in the list that we previously created. We will later sort the BIC scores in increasing order. This is why we store the BIC score and the index of the factory in the score_array sample. The computation can take a while for some distributions, so we use the tqdm module to print a progress bar. Finally, some distributions cannot be built on this specific sample. To avoid breaking the for loop, we wrap the call to the BIC method in a try/except. If the distribution fitting fails, we set the BIC score to the maximum finite value of a floating point number (this is MaxScalar), which is approximately equal to $10^{308}$.
The key step is to sort the array containing the BIC scores.
There might be more than 30 distributions that can be built on the sample. Here, we limit the list to the 10 distributions with the lowest BIC scores. We will use Pandas to print the BIC scores nicely. To do this, we create the BIC_data list, which contains the name of each factory and the corresponding BIC score. This is where the index of the distribution in the first column of sorted_BIC_scores is used. However, the Sample stores floats: we have to convert each of them into an integer before using it as an index.
Now comes the easiest part, where we finally use Pandas' DataFrame. The resulting table has one row per factory, with its name and its BIC score.
We see that, luckily enough, the NormalFactory fits the Gaussian sample. The truncated normal factory ranks in 4th position, after the two kinds of Weibull distributions.