XGBoost train_model function fails when adding a learning rate hyperparameter (using Snowpark)


I'm trying to add a learning rate hyperparameter to my train_model function, which uses XGBoost for regression in a Snowflake environment. However, whenever I include the learning rate parameter, the function fails with an error. The function works fine without the learning rate, so I suspect there's an issue with the way I'm adding the learning rate.

The main train_model function:

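import os

import joblib
import snowflake.snowpark
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor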

def train_model(session: snowflake.snowpark.Session,
                table: str,
                features: list,
                target_variable: str,
                cat_cols: list,
                num_cols: list) -> T.Variant:
    """
    Trains an XGBoost model using the provided Snowflake session, table, features, and target variable.

    Args:
        session (snowflake.snowpark.Session): Snowflake session object for connecting to Snowflake.
        table (str): Name of the table in Snowflake containing the training data.
        features (list): List of feature column names to be used for training.
        target_variable (str): Name of the target variable column.
        cat_cols (list): List of categorical column names in the feature set.
        num_cols (list): List of numerical column names in the feature set.

    Returns:
        float: Root mean squared error (RMSE) of the trained XGBoost model on the validation set.
    """

    # Load the Snowflake table
    snowdf = session.table(table)

    # Split the data into training and validation sets
    snowdf_train, snowdf_valid = snowdf.random_split([0.75, 0.25], seed=123)

    # Save the train and validation sets in Snowflake
    snowdf_train.write.mode("overwrite").save_as_table("lapse_data_train")
    snowdf_valid.write.mode("overwrite").save_as_table("lapse_data_valid")

    # Prepare the training and validation data
    train_x = snowdf_train[features].to_pandas()  # Keep only the feature columns for the training set
    train_y = snowdf_train.select(target_variable).to_pandas()
    valid_x = snowdf_valid[features].to_pandas()
    valid_y = snowdf_valid.select(target_variable).to_pandas()

    # Define the preprocessing steps for numerical and categorical features
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),  # Impute missing values with median
        ('std_scaler', StandardScaler()),  # Scale the numerical features
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_pipeline, num_cols),  # Apply the numerical pipeline to numerical features
            ('encoder', OneHotEncoder(handle_unknown="ignore"), cat_cols),  # One-hot encode categorical features
        ]
    )

    # Construct the pipeline with preprocessing and XGBoost model
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('xgboost', XGBRegressor(learning_rate=0.01)),  # XGBoost regression model
    ])

    # Train the model
    pipe.fit(train_x, train_y)

    # Make predictions on the validation set
    valid_preds = pipe.predict(valid_x)

    # Calculate the root mean squared error (RMSE) of the predictions
    rmse = mean_squared_error(valid_y, valid_preds, squared=False)
    
    # Save the trained model to a file
    model_file = os.path.join('/tmp', 'model.joblib')
    joblib.dump(pipe, model_file)
    session.file.put(model_file, "@SANDBOX_SGATE", overwrite=True)

    return rmse

The error when I try to write the stored procedure to Snowflake:

# Now create a stored procedure of the train function and export to Snowflake
train_model_sp = F.sproc(train_model, 
                         session=session, 
                         replace=True,
                         is_permanent=True, 
                         name="xgboost_sproc", 
                         stage_location="@SANDBOX_SGATE")

ProgrammingError: 091003 (22000): Failure using stage area. Cause: [SANDBOX_SGATE GET and PUT commands are not supported with external stage]

What I've tried so far:

  • I've reviewed the XGBoost documentation for Python and verified that the learning rate parameter is valid.
  • I've checked my imports and made sure all required libraries, including the Snowpark Python Connector, are installed.
  • I've attempted various ways of adding the learning rate parameter, but it still results in an error.
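
One more thing that may be worth checking: the error text names the stage (SANDBOX_SGATE) and says GET and PUT are not supported on an external stage. It does not mention XGBRegressor or learning_rate. A minimal sketch of a test that separates the two, assuming the role is allowed to create stages, is to register the procedure against an internal named stage instead. The helper name ensure_internal_stage and the stage name XGBOOST_SPROC_STAGE are hypothetical and not part of the original code.

```python
def ensure_internal_stage(session, stage_name: str) -> str:
    """Create an internal named stage if it is missing and return its '@' location.

    `session` is expected to be a snowflake.snowpark.Session (anything exposing
    .sql(...).collect() works). `stage_name` is interpolated directly into the
    statement, so it should come from trusted code, not from user input.
    """
    # CREATE STAGE without a URL clause creates an internal named stage.
    # IF NOT EXISTS leaves an existing stage of the same name untouched, so the
    # name must not collide with the existing external stage.
    session.sql(f"CREATE STAGE IF NOT EXISTS {stage_name}").collect()
    return f"@{stage_name}"


# Hypothetical usage with the objects from the question (needs a live session):
#
# stage = ensure_internal_stage(session, "XGBOOST_SPROC_STAGE")
# train_model_sp = F.sproc(train_model,
#                          session=session,
#                          replace=True,
#                          is_permanent=True,
#                          name="xgboost_sproc",
#                          stage_location=stage)
```

If registration succeeds against the internal stage both with and without learning_rate=0.01, the hyperparameter is probably not the cause. Note that train_model itself also calls session.file.put(model_file, "@SANDBOX_SGATE", overwrite=True), which is a PUT against the same stage and would presumably need the same change.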