Huge datasets in machine learning sklearn


I have a dataset that grows on a daily basis, and I am concerned that it will soon reach a size that memory cannot accommodate. I am using random forest classifiers and regressors in my application. I have heard of partial fitting, but I don't know whether random forests can be trained that way. How do I ensure that the application doesn't break and continues to perform well even if the dataset grows beyond memory size? Also, would the scenario be any different if SVMs were used instead of random forests?


1 Answer


In general, you should look for methods that offer incremental or online training. With these, you don't have to present the complete dataset to the algorithm at once; instead, you feed in new data as it becomes available. That is essential if the data grows on a daily basis and your computational resources are limited. Stochastic gradient descent is a popular optimisation method that meets these requirements; see the sketch below.
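As a minimal sketch (the batch generator and its shapes are placeholders for your own data pipeline), scikit-learn's SGDClassifier supports partial_fit, so you can train it one chunk at a time. With loss="hinge" it behaves like a linear SVM trained online, which also touches on the SVM part of your question:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for a stream of daily data batches.
def daily_batches(n_batches=5, n_samples=1000, n_features=20, seed=0):
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.randn(n_samples, n_features)
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
        yield X, y

# loss="hinge" gives a linear SVM trained with stochastic gradient descent;
# other losses (e.g. logistic) are also available.
clf = SGDClassifier(loss="hinge", random_state=0)

classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call
for X_batch, y_batch in daily_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)
```

Each call to partial_fit updates the existing model instead of retraining from scratch, so memory usage is bounded by the size of a single batch.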

You could also use a variation of random forests called Mondrian forests. To quote from the abstract of the linked paper: "Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff." The code can be found on GitHub.
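If the scikit-garden package (a scikit-learn-compatible packaging of Mondrian forests) suits your setup, incremental training might look roughly like this; treat the class name and exact partial_fit signature as assumptions to verify against that project's documentation:

```python
import numpy as np
# Assumes the scikit-garden package (pip install scikit-garden);
# check the class name against that project's docs.
from skgarden import MondrianForestClassifier

rng = np.random.RandomState(0)
clf = MondrianForestClassifier(n_estimators=10, random_state=0)

# Train incrementally on chunks instead of the full dataset at once.
for i in range(5):
    X = rng.randn(500, 10)
    y = (X[:, 0] > 0).astype(int)
    if i == 0:
        clf.partial_fit(X, y, classes=np.array([0, 1]))
    else:
        clf.partial_fit(X, y)

print(clf.predict(rng.randn(3, 10)))
```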

Without knowing your data and the nature of your problem, it's impossible to say specifically what would perform better than a random forest. If you would like to stick with scikit-learn, check the article Strategies to scale computationally: bigger data, which covers out-of-core learning.
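A rough sketch of the out-of-core pattern described there, using a regressor since you mentioned regression as well ("data.csv" and the column names are placeholders for your own dataset):

```python
import pandas as pd
from sklearn.linear_model import SGDRegressor

reg = SGDRegressor(random_state=0)

# Read the file in fixed-size chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv("data.csv", chunksize=10_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    reg.partial_fit(X, y)
```

The same loop structure works with any estimator that exposes partial_fit, so the choice of model can change without changing how the data is streamed.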