How to rank a list of distributions with BIC in OpenTURNS?

222 Views Asked by At

I have a Sample in OpenTURNS, and I want to fit a distribution on it. In order to take the number of parameters into account, I want to use the Bayesian Information Criteria (BIC). The Bayesian Information Criterion (BIC) ranks a list of models according to a weighted maximum likelihood criteria which takes into account for the sample size and the number of parameters of each distribution. A lower BIC score is better.

I know that FittingTest.BestModelBIC returns the model which best fit to the data. However, I would like to see more than the best fit: perhaps the second best BIC has a more physical meaning for me?

How to perform this in OpenTURNS?

PS Here is an example for BestModelBIC.

1

There are 1 best solutions below

0
On

I suggest to use GetContinuousUniVariateFactories to get the list of continuous distribution factories and to combine it to the FittingTest.BIC function which computes the BIC score of a distribution. Finally, sorting the table can be done with the sortAccordingToAComponent method of the Sample.

Here is a detailed example of this method. We begin by generating a sample.

import openturns as ot
import tqdm
import pandas as pd
sample = ot.Normal().getSample(100)

The GetContinuousUniVariateFactories static method returns a list of all available factories for continuous distributions. We could use this list without further processing, but the histogram would come first in the ranking, because it is specially designed for this purpose. Hence, we do not include it in our computation of the BIC score.

factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
marginalFactories = []
for factory in factories:
    if str(factory) != "HistogramFactory":
        print(factory)
        marginalFactories.append(factory)
number_of_factories = len(marginalFactories)
print("Number of selected factories:", number_of_factories)

This prints:

ArcsineFactory
BetaFactory
BurrFactory
[...]
Number of selected factories: 29

In the following script, we perform a for loop over all factories in the list that we previously created. We will later sort the BIC scores by increasing order. This is why we store the BIC score and the marginal index in the score_array sample. The computation can be quite long for some distributions. Hence we use the tqdm module to print a progress bar. Finally, some distribution do not build on this specific sample. In order to avoid to break the for loop, we wrap the call to the BIC method into a try/except. If the distribution fitting fails, we set the BIC score to the maximum finite value of a floating point number (this is MaxScalar), which is approximately equal to $10^{308}$.

score_array = ot.Sample(number_of_factories, 2)
for i in tqdm.tqdm(range(number_of_factories)):
    try:
        factory = marginalFactories[i]
        fitted_dist, bic_score = ot.FittingTest.BIC(sample, factory)
        score_array[i] = [i, bic_score]
    except TypeError:
        print("Cannot build ", factory)
        score_array[i] = [i, ot.SpecFunc.MaxScalar]

The key step is to sort the array containing the BIC scores.

sorted_BIC_scores = score_array.sortAccordingToAComponent(1)

There might be more than 30 distributions which can be built onto the sample. Here, we limit the list to the top 10 distributions which have the lowest BIC scores. We will use Pandas in order to print the BIC scores nicely. To do this, we create the BIC_data list, which contains the name of the factory and the corresponding BIC score. This is where the index of the distribution in the first column of sorted_BIC_scores is used. However, the Sample stores floats: we have to convert them into an integer before using it as an index.

BIC_data = []
rank = list(range(min(number_of_factories, 10)))
for i in rank:
    distribution_index = int(sorted_BIC_scores[i, 0])
    factory = marginalFactories[distribution_index]
    BIC_score = sorted_BIC_scores[i, 1]
    BIC_data.append([factory, BIC_score])

Now comes the easiest part, where we finally use Pandas' DataFrame.

columns = ["BIC", "Factory"]
df = pd.DataFrame(BIC_data, index=rank, columns=columns,)
print(df)

What this prints looks like this:

Rank BIC Factory
0 NormalFactory 2.902476
1 WeibullMaxFactory 2.933988
2 WeibullMinFactory 2.938930
3 TruncatedNormalFactory 2.939140
4 LogisticFactory 2.945102
5 LogNormalFactory 2.948479
6 StudentFactory 2.948733
7 TriangularFactory 2.968305
8 BetaFactory 3.033244
9 RayleighFactory 3.051117

We see that, luckily enough, the NormalFactory fits the gaussian sample. The truncated normal factory ranks in 4th position, after the two kinds of Weibull distributions.