xgboost CV and number of trees


I was going through the article here, but I do not fully understand the details of the cv function and the "number of trees" parameter in xgboost.

Suppose we start with a dataframe of features and target values. What does CV do in each round? If the CV result has 500 rows (i.e. there are 500 decision trees), how is each tree constructed? And how are the 500 trees combined to produce a single log-loss number?

If we can get a single prediction from the cv function, why do we need XGBClassifier.fit, which also produces a model (and thus a loss number)?

Thank you.


There are 2 answers below.

Answer 1

XGBoost is a gradient boosting method; as such, it adds a tree in every iteration to improve the prediction accuracy. See the introduction in this article to get an idea of how gradient boosting works: https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/. This should explain how the trees are constructed.

Prediction accuracy increases as you add more and more trees, until you start to overfit, at which point it decreases. So you need to find that optimum number of trees.

It is basically impossible to guess this number from the get-go. That is what xgboost.cv is for. It splits your training data into nfold subsets and, for each fold, trains on the remaining folds while using the held-out fold as a validation set. After each iteration (which adds one more tree), xgboost calculates the mean validation error across the folds. With that, xgboost can detect when it starts to overfit (when the validation error starts to increase). This gives you the optimal number of trees for a given set of hyperparameters.
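
Here is a minimal sketch of that workflow (the dataset, parameter values, and early-stopping window are illustrative assumptions, not recommendations):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Illustrative data; any binary-classification feature matrix and labels work.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 4, "eta": 0.1}

# 5-fold CV for up to 500 boosting rounds; stop once the mean
# validation log-loss has not improved for 20 rounds.
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=500,
    nfold=5,
    early_stopping_rounds=20,
    seed=42,
)

# cv_results has one row per boosting round; its length is the
# number of trees that survived early stopping.
print(len(cv_results), cv_results["test-logloss-mean"].iloc[-1])
```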

Note that xgboost.cv returns an evaluation history (one row of metrics per boosting round), whereas xgboost.train returns a booster.

Also note that XGBClassifier.fit is part of the sklearn wrapper, so it's better not to compare it to xgboost.cv, which is part of the xgboost learning API.
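
For comparison, a sketch of the sklearn-wrapper route (where early_stopping_rounds is passed depends on your xgboost version: recent versions take it in the constructor, older ones in fit()):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data and hold-out split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# In xgboost >= 1.6, early_stopping_rounds is a constructor argument;
# older versions expected it as a fit() keyword instead.
clf = XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="logloss",
    early_stopping_rounds=20,
)
clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(clf.best_iteration)  # boosting round with the best validation score
```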

And as a final note: you don't need xgboost.cv to find the optimal number of trees. You can also run xgboost.train with early_stopping_rounds set (this requires passing a validation set via the evals argument).
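
For example, a sketch using the learning API directly (the split and parameters here are illustrative):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data and hold-out split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "binary:logistic", "eval_metric": "logloss"}

# Early stopping monitors the validation set passed via `evals`.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,
    verbose_eval=False,
)
print(booster.best_iteration)  # round index of the best validation score
```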

If you have any questions let me know in the comments.

Answer 2

In Python, xgb.cv returns the evaluation history, and the sklearn wrapper stores a dictionary of all metrics recorded during the training and validation iterations in its evals_result_ attribute. You can plot these curves to see when the model starts to overfit.
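
A sketch of that (the data and model settings are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data and hold-out split.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = XGBClassifier(n_estimators=300, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)

# evals_result_ maps each eval set ("validation_0", "validation_1")
# to a dict of metric name -> list of per-round values.
history = clf.evals_result_
plt.plot(history["validation_0"]["logloss"], label="train")
plt.plot(history["validation_1"]["logloss"], label="validation")
plt.xlabel("boosting round")
plt.ylabel("log-loss")
plt.legend()
plt.show()
```

Where the validation curve bottoms out and turns upward is the point at which adding more trees starts to overfit.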