How to calculate the values at each node in a scikit-learn GradientBoostingRegressor?


I am trying to manually calculate the values shown at each node of each tree of an ensemble returned by a GradientBoostingRegressor.

So here is how I train the model:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split


X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


# fit model

gbm = GradientBoostingRegressor(
    criterion="squared_error",
    n_estimators=2,
    max_depth=3,
    random_state=3,
)

gbm.fit(X_train, y_train)

Now I plot the trees:

plt.figure(figsize=(8, 8), dpi=300)

# first tree
plt.subplot(2, 1, 1)
plot_tree(
    decision_tree=gbm.estimators_[0][0],
    feature_names=X_train.columns.to_list(),
    filled=True,  # color the squares
    rounded=True,  # round squares
    precision = 20,
)
plt.title("First tree")


# second tree
plt.subplot(2, 1, 2)
plot_tree(
    decision_tree=gbm.estimators_[1][0],
    feature_names=X_train.columns.to_list(),
    filled=True,  # color the squares
    rounded=True,  # round squares
    precision = 20,
)
plt.title("Second tree")

plt.show()

And this is the result:

[Image: plot_tree output for the first and second tree]

The question is: how can I manually calculate the value of, say, the first (root) node of the first tree, and then of the second tree?

I tried this but the output does not match the values in the picture:

# first node first tree
value = np.mean(y_train - y_train.mean())

# first node second tree
residuals = y_train - 0.1 * gbm.estimators_[0][0].predict(X_train)

np.mean(residuals - np.mean(residuals))

What am I missing?

Thanks a lot!
