I am running Databricks AutoML in a Python notebook with lookups from feature tables. However, the additional columns are always included, and all runs fail.
import databricks.automl

automl_feature_lookups = [
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.oil_10d_lag_ft",
        "lookup_key": "date",
        "feature_names": "lag10_oil_price"
    },
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.store_holidays_ft",
        "lookup_key": ["date", "store_nbr"]
    },
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.stores_ft",
        "lookup_key": "store_nbr",
        "feature_names": ["cluster", "store_type"]
    }
]

automl_data = raw_data.filter("date > '2016-12-31'")

summary = databricks.automl.regress(
    automl_data,
    target_col=label_name,
    time_col="date",
    timeout_minutes=60,
    feature_store_lookups=automl_feature_lookups)
It turns out that when creating a training set you have the option to specify features using feature_names. When creating the dictionary for AutoML, feature_names is not a valid option.
I tried removing feature_names, but it did not fix my issue.
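For reference, a minimal sketch of the lookups with feature_names stripped out. The set of accepted keys is an assumption based on the AutoML documentation (table_name, lookup_key, and timestamp_lookup_key for time series feature tables), and the helper name strip_unsupported_keys is made up for illustration. As noted above, this change alone did not make the runs succeed.

```python
# Keys the AutoML lookup dictionaries are ASSUMED to accept (verify against
# the docs for your runtime version). "feature_names" is deliberately absent.
ASSUMED_AUTOML_LOOKUP_KEYS = {"table_name", "lookup_key", "timestamp_lookup_key"}


def strip_unsupported_keys(lookups):
    """Return copies of the lookup dicts that keep only the assumed-valid keys."""
    return [
        {key: value for key, value in lookup.items()
         if key in ASSUMED_AUTOML_LOOKUP_KEYS}
        for lookup in lookups
    ]


automl_feature_lookups = [
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.oil_10d_lag_ft",
        "lookup_key": "date",
        "feature_names": "lag10_oil_price",
    },
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.store_holidays_ft",
        "lookup_key": ["date", "store_nbr"],
    },
    {
        "table_name": "lakehouse_in_action.favorita_forecasting.stores_ft",
        "lookup_key": "store_nbr",
        "feature_names": ["cluster", "store_type"],
    },
]

# Same tables and lookup keys, without the feature_names entries.
cleaned_lookups = strip_unsupported_keys(automl_feature_lookups)
```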
I added exclude_cols=['id','city','state','price_date'], but according to the error I received, columns from feature lookup tables cannot be excluded.
InvalidArgumentError: Dataset schema does not contain column with name 'city'. Please pass a valid column name for param: exclude_cols
Removing feature_store_lookups=automl_feature_lookups produces a successful AutoML experiment, indicating the issue is only in the lookups. The solution is to create a training set, load a dataframe, and execute AutoML that way.

Note: Filtering by date was only to shrink my data and isn't required.
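The training-set workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code from the notebook: it assumes the Unity Catalog client (databricks.feature_engineering) is available on the cluster, and the function names build_feature_lookup_specs and run_automl_from_training_set are made up. On a workspace feature store, FeatureStoreClient from databricks.feature_store would be the likely substitute.

```python
def build_feature_lookup_specs():
    """Lookup specs from the question; feature_names is valid for FeatureLookup."""
    return [
        {
            "table_name": "lakehouse_in_action.favorita_forecasting.oil_10d_lag_ft",
            "lookup_key": "date",
            "feature_names": "lag10_oil_price",
        },
        {
            # No feature_names: every feature in the table is looked up.
            "table_name": "lakehouse_in_action.favorita_forecasting.store_holidays_ft",
            "lookup_key": ["date", "store_nbr"],
        },
        {
            "table_name": "lakehouse_in_action.favorita_forecasting.stores_ft",
            "lookup_key": "store_nbr",
            "feature_names": ["cluster", "store_type"],
        },
    ]


def run_automl_from_training_set(raw_data, label_name, exclude_columns=None):
    """Join the features first, then hand AutoML a plain dataframe (sketch)."""
    # Imported here because these packages only exist on a Databricks ML runtime.
    import databricks.automl
    from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

    fe = FeatureEngineeringClient()
    lookups = [FeatureLookup(**spec) for spec in build_feature_lookup_specs()]

    # The training set performs the joins; unwanted columns are dropped here
    # instead of through AutoML's exclude_cols.
    training_set = fe.create_training_set(
        df=raw_data,
        feature_lookups=lookups,
        label=label_name,
        exclude_columns=exclude_columns,
    )
    training_df = training_set.load_df()

    # No feature_store_lookups argument: the features are already columns.
    return databricks.automl.regress(
        dataset=training_df,
        target_col=label_name,
        time_col="date",
        timeout_minutes=60,
    )
```

In the notebook this would be called as run_automl_from_training_set(raw_data, label_name, exclude_columns=["id", "city", "state", "price_date"]). Passing the unwanted columns to the training set is an assumption about how to replace the failing exclude_cols argument. Since the features end up as ordinary columns of the loaded dataframe, excluding them at that stage should avoid the schema error.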