I'm trying to add a learning rate hyperparameter to my train_model function, which uses XGBoost for regression in a Snowflake environment. However, whenever I include the learning rate parameter, the function fails with an error. The function works fine without the learning rate, so I suspect there's an issue with the way I'm adding the learning rate.
The main train_model function:
import os
from typing import Tuple

import joblib
import numpy as np
import snowflake.snowpark
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor


def train_model(session: snowflake.snowpark.Session,
                table: str,
                features: list,
                target_variable: str,
                cat_cols: list,
                num_cols: list) -> T.Variant:
    """
    Trains an XGBoost model using the provided Snowflake session, table, features, and target variable.

    Args:
        session (snowflake.snowpark.Session): Snowflake session object for connecting to Snowflake.
        table (str): Name of the table in Snowflake containing the training data.
        features (list): List of feature column names to be used for training.
        target_variable (str): Name of the target variable column.
        cat_cols (list): List of categorical column names in the feature set.
        num_cols (list): List of numerical column names in the feature set.

    Returns:
        float: Root mean squared error (RMSE) of the trained XGBoost model on the validation set.
    """
    # Load the Snowflake table
    snowdf = session.table(table)

    # Split the data into training and validation sets
    snowdf_train, snowdf_valid = snowdf.random_split([0.75, 0.25], seed=123)

    # Save the train and validation sets in Snowflake
    snowdf_train.write.mode("overwrite").save_as_table("lapse_data_train")
    snowdf_valid.write.mode("overwrite").save_as_table("lapse_data_valid")

    # Prepare the training and validation data
    train_x = snowdf_train[features].to_pandas()  # Feature columns only (labels excluded)
    train_y = snowdf_train.select(target_variable).to_pandas()
    valid_x = snowdf_valid[features].to_pandas()
    valid_y = snowdf_valid.select(target_variable).to_pandas()

    # Define the preprocessing steps for numerical and categorical features
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),  # Impute missing values with median
        ('std_scaler', StandardScaler()),  # Scale the numerical features
    ])
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_pipeline, num_cols),  # Apply the numerical pipeline to numerical features
            ('encoder', OneHotEncoder(handle_unknown="ignore"), cat_cols),  # One-hot encode categorical features
        ]
    )

    # Construct the pipeline with preprocessing and XGBoost model
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('xgboost', XGBRegressor(learning_rate=0.01)),  # XGBoost regression model
    ])

    # Train the model
    pipe.fit(train_x, train_y)

    # Make predictions on the validation set
    valid_preds = pipe.predict(valid_x)

    # Calculate the root mean squared error (RMSE) of the predictions
    rmse = mean_squared_error(valid_y, valid_preds, squared=False)

    # Save the trained model to a file and upload it to the stage
    model_file = os.path.join('/tmp', 'model.joblib')
    joblib.dump(pipe, model_file)
    session.file.put(model_file, "@SANDBOX_SGATE", overwrite=True)

    return rmse
The error when I try to write the stored procedure to Snowflake:
# Now create a stored procedure of the train function and export to Snowflake
train_model_sp = F.sproc(train_model,
                         session=session,
                         replace=True,
                         is_permanent=True,
                         name="xgboost_sproc",
                         stage_location="@SANDBOX_SGATE")
ProgrammingError: 091003 (22000): Failure using stage area. Cause: [SANDBOX_SGATE GET and PUT commands are not supported with external stage]
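Since the message refers to the stage rather than to XGBoost, one way to narrow things down is to check how Snowflake classifies SANDBOX_SGATE. The following is only a minimal sketch: the helper names stage_type and is_external_stage are hypothetical, and it assumes that SHOW STAGES exposes a lowercase type column whose value is EXTERNAL for external stages.

```python
def stage_type(session, stage_name: str) -> str:
    """Return the value of the `type` column that SHOW STAGES reports.

    Assumptions: `session` behaves like a snowflake.snowpark.Session, and
    `stage_name` is given without the leading "@".
    """
    rows = session.sql(f"SHOW STAGES LIKE '{stage_name}'").collect()
    if not rows:
        raise ValueError(f"No stage named {stage_name!r} is visible to this session")
    # Assumed: the result has a lowercase "type" column, "EXTERNAL" for
    # external stages and a value starting with "INTERNAL" otherwise.
    return rows[0]["type"]


def is_external_stage(session, stage_name: str) -> bool:
    """True if the stage is reported as external (hypothetical helper)."""
    return stage_type(session, stage_name).upper() == "EXTERNAL"
```

With a live session this would be called as is_external_stage(session, "SANDBOX_SGATE"). If it returns True, that would be consistent with the error text above, independent of any XGBoost hyperparameter.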
What I've tried so far:
- I've reviewed the XGBoost documentation for Python and verified that learning_rate is a valid parameter.
- I've checked my imports and made sure all required libraries, including the Snowpark Python Connector, are installed.
- I've tried several ways of passing the learning rate parameter, but it still results in an error.