How to incorporate prior information in pomegranate? In other words: Does pomegranate support incremental learning?


Say I fit a model using pomegranate to the data available at that time. Once more data comes in, I'd like to update the model accordingly. In other words, is it possible with pomegranate to update an existing model with new data without overriding the previous parameters? Just to be clear: I'm not referring to out-of-core learning, since my question relates to data becoming available at different time points rather than having too-large-for-memory data available at a single time point.

Here is what I tried:

>>> from pomegranate.distributions import BetaDistribution

>>> # suppose a coin generated the following data, where 1 is head and 0 is tail
>>> data1 = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]

>>> # as usual, we fit a Beta distribution to infer the bias of the coin
>>> model = BetaDistribution(1, 1)
>>> model.summarize(data1)  # compute sufficient statistics

>>> # presume we have seen all the data available so far,
>>> # we can now estimate the parameters
>>> model.from_summaries()

>>> # this results in the following model (so far so good)
>>> model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        3.0,
        7.0
    ],
    "frozen" :false
}

>>> # now suppose the coin is flipped a few more times, getting the following data
>>> data2 = [0, 1, 0, 0, 1]

>>> # we would like to update the model parameters accordingly
>>> model.summarize(data2)

>>> # but this fits only data2, overriding the previous parameters
>>> model.from_summaries()
>>> model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        2.0,
        3.0
    ],
    "frozen" :false
}


>>> # however I want to get the result that corresponds to the following,
>>> # but ideally without having to "drag along" data1
>>> data3 = data1 + data2
>>> model.fit(data3)
>>> model  # this should be the final model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        5.0,
        10.0
    ],
    "frozen" :false
}
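For reference, the target parameters above are just the head and tail counts of the combined data, which can be checked directly:

```python
data1 = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
data2 = [0, 1, 0, 0, 1]
data3 = data1 + data2

# counts of heads (1s) and tails (0s) in the combined data
print(data3.count(1), data3.count(0))  # 5 10
```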

Edit:

Another way to ask the question: Does pomegranate support incremental or online learning? Basically, I'm looking for something similar to scikit-learn's partial_fit() method.

Given that pomegranate supports out-of-core learning, I feel like I'm overlooking something. Any help?

There is 1 answer below.
It is actually from_summaries that is the issue. In the case of the Beta distribution, it ends with self.summaries = [0, 0]. All of the from_summaries methods are destructive: they replace the summaries with the parameters in the distribution. The summaries can always be updated with additional observations; the parameters cannot be.

I think this is a bad design. It would be better to treat the summaries as accumulators of observations, with the parameters as derived, cached values.
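That accumulator design could be sketched in plain Python. This is a hypothetical illustration, not pomegranate's actual implementation; the class name `BetaAccumulator` and its members are made up here:

```python
class BetaAccumulator:
    """Sketch: summaries accumulate observations; parameters are
    derived, cached values that never destroy the summaries."""

    def __init__(self):
        self.summaries = [0, 0]  # [count of 1s (heads), count of 0s (tails)]
        self._params = None      # cached derived parameters

    def summarize(self, data):
        # Accumulate sufficient statistics; never reset them here.
        self.summaries[0] += sum(1 for x in data if x == 1)
        self.summaries[1] += sum(1 for x in data if x == 0)
        self._params = None      # invalidate the cache

    @property
    def parameters(self):
        # Derive (and cache) parameters from the accumulated summaries.
        if self._params is None:
            self._params = (float(self.summaries[0]), float(self.summaries[1]))
        return self._params


acc = BetaAccumulator()
acc.summarize([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])  # data1
print(acc.parameters)                           # (3.0, 7.0)
acc.summarize([0, 1, 0, 0, 1])                  # data2
print(acc.parameters)                           # (5.0, 10.0)
```

With this shape, reading the parameters at any point is non-destructive, so incremental batches can keep arriving indefinitely.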

If you do:

model = BetaDistribution(1, 1)
model.summarize(data1)
model.summarize(data2)
model.from_summaries()
model

You will find that it produces the same result as model.fit(data1 + data2): the summaries accumulate across both batches, and the parameters are only computed (and the summaries reset) once, at the end.
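If you also want to read intermediate parameter estimates without losing the accumulated statistics, one possible workaround (a sketch, not a pomegranate API; `checkpoint_params` and the `StubBeta` stand-in are made up for illustration) is to copy the `summaries` list before calling `from_summaries()` and restore it afterwards:

```python
def checkpoint_params(model):
    # Hypothetical helper: estimate parameters now, but restore the
    # accumulated statistics so later summarize() calls keep adding to them.
    saved = list(model.summaries)   # copy before from_summaries() resets them
    model.from_summaries()
    model.summaries = saved         # undo the destructive reset
    return model.parameters


class StubBeta:
    # Minimal stand-in mimicking the destructive behavior described above;
    # with real pomegranate you would use BetaDistribution(1, 1) instead.
    def __init__(self):
        self.summaries = [0, 0]            # [count of 1s, count of 0s]
        self.parameters = (0.0, 0.0)

    def summarize(self, data):
        self.summaries[0] += data.count(1)
        self.summaries[1] += data.count(0)

    def from_summaries(self):
        self.parameters = (float(self.summaries[0]), float(self.summaries[1]))
        self.summaries = [0, 0]            # the destructive reset


model = StubBeta()
model.summarize([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])   # data1
print(checkpoint_params(model))                    # (3.0, 7.0)
model.summarize([0, 1, 0, 0, 1])                   # data2
print(checkpoint_params(model))                    # (5.0, 10.0)
```

Caveat: with the real library, from_summaries() may apply inertia or pseudocounts that this stub ignores, so verify the behavior against your pomegranate version before relying on it.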