Given an XGBoost model with T trees, I'm exploring the performance implications of using only the first k trees. For this example, let T = 500 and k = 100. I realize these values are excessive for the Iris dataset; they're chosen purely for illustration.
I'm aware that one option is to retrain the model so that it stops at k trees (e.g. with n_estimators=k or early stopping). However, rather than retraining, I'm looking for a way to select a predetermined number of trees from the already-trained model and use them as a standalone xgb model.
In the example below, I want something that accomplishes the step marked <ADD tree to first_100_booster>.
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define XGBoost parameters
params = {
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,                # number of classes in the dataset
    'max_depth': 3,                # maximum depth of each tree
    'n_estimators': 500            # maximum number of trees to grow
}
# Train the XGBoost model
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
# Extract the first 100 trees
first_100_trees = model.get_booster().get_dump()[:100]
# Create a new Booster object with the first 100 trees
first_100_booster = xgb.Booster(model.get_xgb_params())
for tree in first_100_trees:
    """
    <ADD tree to first_100_booster>
    """
# Test using the first 100 trees
dtest = xgb.DMatrix(X_test)
y_pred_first_100 = first_100_booster.predict(dtest)
accuracy_first_100 = accuracy_score(y_test, y_pred_first_100)
print("Accuracy using the first 100 trees:", accuracy_first_100)
Since v1.4, the .predict() method for xgboost scikit-learn estimators supports an argument iteration_range. This takes a tuple describing a contiguous range of tree indices, so you can use it to achieve the behavior "generate predictions from a subset of trees". Consider this example with Python 3.11, xgboost==2.0.3, and scikit-learn==1.4.1. With this approach, it isn't necessary to create a new Booster object to evaluate the performance implications of using different numbers of trees.
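Here is a minimal sketch, reusing the setup from the question (num_class is omitted because the scikit-learn wrapper infers it from the labels):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split the data exactly as in the question
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the full 500-round model
model = xgb.XGBClassifier(objective='multi:softmax', max_depth=3, n_estimators=500)
model.fit(X_train, y_train)

# Predict with all boosting rounds
y_pred_full = model.predict(X_test)
print("Accuracy using all trees:", accuracy_score(y_test, y_pred_full))

# Predict using only the first 100 boosting rounds (half-open interval [0, 100))
y_pred_first_100 = model.predict(X_test, iteration_range=(0, 100))
print("Accuracy using the first 100 rounds:", accuracy_score(y_test, y_pred_first_100))

One caveat: iteration_range counts boosting rounds, not individual trees. With multi:softmax and 3 classes, each round grows one tree per class, so iteration_range=(0, 100) actually uses the first 300 trees; the first 100 trees correspond to roughly the first 33 rounds.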