How to rank a list of distributions with BIC in OpenTURNS?

Question

How to rank a list of distributions with BIC in OpenTURNS?

214 Views Asked by Michael Baudin At 28 July 2025 at 02:26

I have a Sample in OpenTURNS, and I want to fit a distribution on it. In order to take the number of parameters into account, I want to use the Bayesian Information Criteria (BIC). The Bayesian Information Criterion (BIC) ranks a list of models according to a weighted maximum likelihood criteria which takes into account for the sample size and the number of parameters of each distribution. A lower BIC score is better.

I know that FittingTest.BestModelBIC returns the model which best fit to the data. However, I would like to see more than the best fit: perhaps the second best BIC has a more physical meaning for me?

How to perform this in OpenTURNS?

PS Here is an example for BestModelBIC.

Original Q&A

There are 1 best solutions below

**Michael Baudin** · Answer 1

I suggest to use GetContinuousUniVariateFactories to get the list of continuous distribution factories and to combine it to the FittingTest.BIC function which computes the BIC score of a distribution. Finally, sorting the table can be done with the sortAccordingToAComponent method of the Sample.

Here is a detailed example of this method. We begin by generating a sample.

import openturns as ot
import tqdm
import pandas as pd
sample = ot.Normal().getSample(100)

The GetContinuousUniVariateFactories static method returns a list of all available factories for continuous distributions. We could use this list without further processing, but the histogram would come first in the ranking, because it is specially designed for this purpose. Hence, we do not include it in our computation of the BIC score.

factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
marginalFactories = []
for factory in factories:
    if str(factory) != "HistogramFactory":
        print(factory)
        marginalFactories.append(factory)
number_of_factories = len(marginalFactories)
print("Number of selected factories:", number_of_factories)

This prints:

ArcsineFactory
BetaFactory
BurrFactory
[...]
Number of selected factories: 29

In the following script, we perform a for loop over all factories in the list that we previously created. We will later sort the BIC scores by increasing order. This is why we store the BIC score and the marginal index in the score_array sample. The computation can be quite long for some distributions. Hence we use the tqdm module to print a progress bar. Finally, some distribution do not build on this specific sample. In order to avoid to break the for loop, we wrap the call to the BIC method into a try/except. If the distribution fitting fails, we set the BIC score to the maximum finite value of a floating point number (this is MaxScalar), which is approximately equal to $10^{308}$.

score_array = ot.Sample(number_of_factories, 2)
for i in tqdm.tqdm(range(number_of_factories)):
    try:
        factory = marginalFactories[i]
        fitted_dist, bic_score = ot.FittingTest.BIC(sample, factory)
        score_array[i] = [i, bic_score]
    except TypeError:
        print("Cannot build ", factory)
        score_array[i] = [i, ot.SpecFunc.MaxScalar]

The key step is to sort the array containing the BIC scores.

sorted_BIC_scores = score_array.sortAccordingToAComponent(1)

There might be more than 30 distributions which can be built onto the sample. Here, we limit the list to the top 10 distributions which have the lowest BIC scores. We will use Pandas in order to print the BIC scores nicely. To do this, we create the BIC_data list, which contains the name of the factory and the corresponding BIC score. This is where the index of the distribution in the first column of sorted_BIC_scores is used. However, the Sample stores floats: we have to convert them into an integer before using it as an index.

BIC_data = []
rank = list(range(min(number_of_factories, 10)))
for i in rank:
    distribution_index = int(sorted_BIC_scores[i, 0])
    factory = marginalFactories[distribution_index]
    BIC_score = sorted_BIC_scores[i, 1]
    BIC_data.append([factory, BIC_score])

Now comes the easiest part, where we finally use Pandas' DataFrame.

columns = ["BIC", "Factory"]
df = pd.DataFrame(BIC_data, index=rank, columns=columns,)
print(df)

What this prints looks like this:

Rank	BIC	Factory
0	NormalFactory	2.902476
1	WeibullMaxFactory	2.933988
2	WeibullMinFactory	2.938930
3	TruncatedNormalFactory	2.939140
4	LogisticFactory	2.945102
5	LogNormalFactory	2.948479
6	StudentFactory	2.948733
7	TriangularFactory	2.968305
8	BetaFactory	3.033244
9	RayleighFactory	3.051117

We see that, luckily enough, the NormalFactory fits the gaussian sample. The truncated normal factory ranks in 4th position, after the two kinds of Weibull distributions.

How to rank a list of distributions with BIC in OpenTURNS?

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in DISTRIBUTION

Related Questions in INFERENCE

Related Questions in OPENTURNS

Trending Questions

Popular # Hahtags

Popular Questions