Here you can see the plot of the newly fit model:
The bins show all the data that is now available, i.e. the initial data used to fit the model plus the new data. The new data does not include the higher values. These are the model parameters:
GaussianMixture(max_iter=10000, n_components=2, tol=0.0001, warm_start=True)
So warm_start is definitely set to True. When sampling from the model, I do not get the high values either, so it does not seem to be an error in the plot.
When fitting the model, which is called gmm, with new data, I simply do:
gmm_new = gmm.fit(new_data)
The new data is already expanded in dimensions so that this works. When I fit the model again with the new AND the old data, i.e. the whole dataset, the results look fine. But wouldn't that mean that I fitted the old data twice? Am I using warm_start wrong?
Well, as it turns out, the glossary holds the answer:
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset.
So it does make sense that the results look good when fitting again on the whole dataset.
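To illustrate, here is a minimal sketch of both refits. The data is synthetic and only stands in for my real dataset (an assumption), and the names old_data, new_data, full_data and gmm_b are made up for this example. As far as I understand it, warm_start only reuses the previous parameters as the starting point for EM; each call to fit still optimizes the likelihood of only the data passed to that call. So refitting on the whole dataset does not count the old data twice, while refitting on the new data alone lets the model forget the high values.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 1-D data (an assumption, not the real dataset):
# the old batch contains a high-valued cluster, the new batch does not.
old_data = np.concatenate([rng.normal(0.0, 1.0, 500),
                           rng.normal(10.0, 1.0, 500)]).reshape(-1, 1)
new_data = rng.normal(0.0, 1.0, 500).reshape(-1, 1)
full_data = np.vstack([old_data, new_data])

params = dict(n_components=2, max_iter=10000, tol=1e-4,
              warm_start=True, random_state=0)

# Case A: initial fit on the old data, then a warm-started refit on the
# whole dataset. EM starts from the previous solution, and every sample
# enters this second fit exactly once.
gmm = GaussianMixture(**params).fit(old_data)
gmm.fit(full_data)
means_full = np.sort(gmm.means_.ravel())

# Case B: initial fit on the old data, then a warm-started refit on the
# new data only. The old samples play no part in this second fit, so the
# high-valued component is not retained.
gmm_b = GaussianMixture(**params).fit(old_data)
gmm_b.fit(new_data)
means_new_only = np.sort(gmm_b.means_.ravel())

print("means after refit on full data:", means_full)
print("means after refit on new data only:", means_new_only)
```

Note also that fit returns the estimator itself, so in gmm_new = gmm.fit(new_data) both names refer to the same, refitted model.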