Why does Random Forest give different results when run in parallel?

I am running Random Forest in Databricks (the scikit-learn implementation). When I run the model

rf = RandomForestRegressor(n_estimators=150, max_features=3, min_samples_leaf=5, min_samples_split=12, n_jobs=7, random_state=0)

one category at a time, it gives me good results. But when I run the categories in parallel (using the joblib library), it gives me much worse results (sometimes double what I expected). Why is that? Is there a way to run it in parallel and get the same results?

I know that Random Forests are non-deterministic, but when I run one category multiple times, it always gives me good results.
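One way to check where the run-to-run variation comes from: with a fixed random_state and identical input data, two fits of scikit-learn's RandomForestRegressor should give the same predictions, so variation between runs would point to the data preparation or the split rather than the forest itself. A minimal sketch, assuming synthetic data from make_regression in place of the real category data (the data and the helper name fit_predict are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for one category's data (an assumption, not the real data).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

def fit_predict(X, y):
    # Hyperparameters follow the question, except fewer trees and n_jobs=1
    # to keep this check small and fast.
    rf = RandomForestRegressor(
        n_estimators=20,
        max_features=3,
        min_samples_leaf=5,
        min_samples_split=12,
        n_jobs=1,
        random_state=0,
    )
    rf.fit(X, y)
    return rf.predict(X)

first = fit_predict(X, y)
second = fit_predict(X, y)

# True when both fits produced the same predictions.
same = np.allclose(first, second)
print(same)
```

If this prints True on the real data as well, the differences between runs are probably introduced before the model is fitted.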

EDIT:

Basically, I have different categories of products at a supermarket, such as A, B, C, ..., Z, and each category contains a range of products.

I have a function that takes a category, does some data manipulation, feature engineering, etc., then splits the data and applies the random forest regressor to it.

I am using joblib's Parallel so that I can train and test multiple categories at a time.

The thing I am noticing is that when I run it like this:

model("A")

it gives me good results. Every time I run it, the results differ slightly because the model is non-deterministic, but the values are always about the same. This holds for pretty much all categories and products.

When running with joblib, I am running like this:

from joblib import Parallel, delayed

parallel = Parallel(n_jobs=7, pre_dispatch="n_jobs", backend="threading")
out = parallel(delayed(lifecycle_model)(category) for category in list_of_categories)

And this gives me results that are way off, sometimes double what I was expecting.
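To narrow this down, it may help to compare a serial loop against the threaded run on a function that builds all of its data locally. The sketch below is only an illustration under stated assumptions: lifecycle_model_sketch is a hypothetical stand-in for lifecycle_model, the categories are replaced by integer seeds, and the data is synthetic. With the "threading" backend all workers share one process, so if the real function reads or modifies shared objects (for example a global DataFrame edited in place), that could be one source of the difference, although the question does not show enough to confirm it.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def lifecycle_model_sketch(category_seed):
    # Hypothetical stand-in for lifecycle_model: everything is created
    # locally, so threads do not share any mutable state.
    X, y = make_regression(
        n_samples=300, n_features=5, noise=1.0, random_state=category_seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    rf = RandomForestRegressor(
        n_estimators=20,
        max_features=3,
        min_samples_leaf=5,
        min_samples_split=12,
        n_jobs=1,  # parallelism is handled by the outer Parallel call
        random_state=0,
    )
    rf.fit(X_train, y_train)
    return mean_absolute_error(y_test, rf.predict(X_test))

categories = [1, 2, 3, 4]  # integer seeds standing in for "A", "B", ...

serial = [lifecycle_model_sketch(c) for c in categories]
threaded = Parallel(n_jobs=4, backend="threading")(
    delayed(lifecycle_model_sketch)(c) for c in categories
)

# True when the serial and threaded errors agree for every category.
match = np.allclose(serial, threaded)
print(match)
```

If a self-contained function like this agrees between the two modes but the real one does not, the next thing to inspect would be any state that the real function shares between calls.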
