Given an XGBoost model with T trees, I'm exploring the performance implications of using only the first k trees. For this example, let T = 500 and k = 100. I realize these values are excessive for the Iris dataset; they're chosen purely for illustration.
I'm aware that one option is to retrain the model so that it stops at k trees (e.g. with n_estimators=k or early stopping). However, rather than retraining, I'm looking for a way to select a predetermined number of trees from the already-trained model and use them as a standalone xgb model.
In the example below, I want something that accomplishes the step marked <ADD tree to first_100_booster>.
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define XGBoost parameters
params = {
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,                # number of classes in the dataset
    'max_depth': 3,                # maximum depth of each tree
    'n_estimators': 500            # maximum number of trees to grow
}
# Train the XGBoost model
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
# Extract the first 100 trees
first_100_trees = model.get_booster().get_dump()[:100]
# Create a new Booster object with the first 100 trees
first_100_booster = xgb.Booster(model.get_xgb_params())
for tree in first_100_trees:
    """
    <ADD tree to first_100_booster>
    """
# Test using the first 100 trees
dtest = xgb.DMatrix(X_test)
y_pred_first_100 = first_100_booster.predict(dtest)
accuracy_first_100 = accuracy_score(y_test, y_pred_first_100)
print("Accuracy using the first 100 trees:", accuracy_first_100)
Since v1.4, the .predict() method for xgboost scikit-learn estimators supports an argument iteration_range. This takes a tuple describing a contiguous range of tree indices, so you can use it to achieve the behavior "generate predictions from a subset of trees". Consider this example with Python 3.11, xgboost==2.0.3, and scikit-learn==1.4.1. With this approach, it isn't necessary to create a new Booster object to evaluate the performance implications of using different numbers of trees.
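Here is a minimal sketch, reusing the setup from the question (num_class is omitted because the scikit-learn wrapper infers it from the labels):

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split the data exactly as in the question
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the full 500-round model
model = xgb.XGBClassifier(objective='multi:softmax', max_depth=3, n_estimators=500)
model.fit(X_train, y_train)

# Predict with all boosting rounds
y_pred_full = model.predict(X_test)
print("Accuracy using all trees:", accuracy_score(y_test, y_pred_full))

# Predict using only the first 100 boosting rounds (half-open interval [0, 100))
y_pred_first_100 = model.predict(X_test, iteration_range=(0, 100))
print("Accuracy using the first 100 rounds:", accuracy_score(y_test, y_pred_first_100))

One caveat: iteration_range counts boosting rounds, not individual trees. With multi:softmax and 3 classes, each round grows one tree per class, so iteration_range=(0, 100) actually uses the first 300 trees; the first 100 trees correspond to roughly the first 33 rounds.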