I am trying to manually calculate the values shown at each node of each tree of an ensemble returned by a GradientBoostingRegressor.
Here is how I train the model:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# fit model
gbm = GradientBoostingRegressor(
    criterion="squared_error",
    n_estimators=2,
    max_depth=3,
    random_state=3,
)
gbm.fit(X_train, y_train)
Now I plot the trees:
plt.figure(figsize=(8, 8), dpi=300)
# first tree
plt.subplot(2, 1, 1)
plot_tree(
    decision_tree=gbm.estimators_[0][0],
    feature_names=X_train.columns.to_list(),
    filled=True,  # color the node boxes
    rounded=True,  # round the box corners
    precision=20,
)
plt.title("First tree")
# second tree
plt.subplot(2, 1, 2)
plot_tree(
    decision_tree=gbm.estimators_[1][0],
    feature_names=X_train.columns.to_list(),
    filled=True,  # color the node boxes
    rounded=True,  # round the box corners
    precision=20,
)
plt.title("Second tree")
plt.show()
And this is the result:
My question is: how can I manually calculate the value of, say, the first (root) node of the first tree, and then of the second tree?
I tried the following, but the output does not match the values in the picture:
# first node first tree
value = np.mean(y_train - y_train.mean())
# first node second tree
residuals = y_train - 0.1 * gbm.estimators_[0][0].predict(X_train)
np.mean(residuals - np.mean(residuals))
What am I missing?
Thanks a lot!