I have 5 months (153 days) of daily sales data and want to forecast the next 30 days of sales. I used the Temporal Fusion Transformer to build a time series model. If I set max_prediction_length = 7 and max_encoder_length = 21, it works. When I increase max_prediction_length to 13, an error is reported:
"KeyError: "Unknown category '10' encountered. Set add_nan=True
to allow unknown categories"
"
Does anyone have any ideas or experience with this? Thank you.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

max_prediction_length = 13  # forecast horizon of 13 days
max_encoder_length = 30  # maximum length of history from which the model builds features before forecasting
holdout_cut = df["time_idx"].max() - max_prediction_length
data = df[lambda x: x.time_idx <= holdout_cut]
test_data = df[lambda x: x.time_idx > holdout_cut]
print(test_data.shape)
training_cutoff = data["time_idx"].max() - max_prediction_length
print('training_cutoff: ', training_cutoff)
def create_sdata(data):
    return TimeSeriesDataSet(
        # data[lambda x: x.time_idx <= training_cutoff],
        data,
        time_idx="time_idx",
        target="throughput",
        group_ids=['cat'],
        min_encoder_length=1,  # allow encoder windows as short as one step
        max_encoder_length=max_encoder_length,
        min_prediction_length=1,
        max_prediction_length=max_prediction_length,
        static_categoricals=['cat'],
        time_varying_known_categoricals=['month', 'day', 'week'],
        time_varying_known_reals=["time_idx"],
        time_varying_unknown_categoricals=[],
        time_varying_unknown_reals=['throughput', "log_throughput", "avg_throughput_by_cat"],
        target_normalizer=GroupNormalizer(groups=['cat'], transformation="softplus"),  # use softplus and normalize by group
        add_relative_time_idx=True,
        add_target_scales=True,
        add_encoder_length=True,
        allow_missing_timesteps=True,
        categorical_encoders={
            # 'cat': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True),
            # 'month': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True),
            # 'week': pytorch_forecasting.data.encoders.NaNLabelEncoder(add_nan=True),
        },
    )
# Possible cause: not all time series may be long enough to supply the 13 decoder steps (max_prediction_length) plus at least min_encoder_length encoder steps.
training = create_sdata(data[lambda x: x.time_idx <= training_cutoff])
# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)
# create dataloaders for model
batch_size = 32 # set this between 32 to 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=5)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=5)
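As a sanity check on the length hypothesis in the comment above, this sketch (assuming the df, 'cat', and 'time_idx' columns defined earlier) lists any group that cannot supply an encoder window plus the 13-step horizon:

# Each group needs at least min_encoder_length + max_prediction_length steps
# for the predict=True validation set; with gaps the span is an upper bound.
required_length = 1 + max_prediction_length  # min_encoder_length is 1 above
series_span = df.groupby("cat")["time_idx"].agg(lambda s: s.max() - s.min() + 1)
print(series_span[series_span < required_length])  # any group listed here is too short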
This error is very likely coming from your validation dataset. The model uses your training data (and the values its categorical columns take there) to define the categorical embeddings. Your validation set probably contains a categorical value that does not appear in the training set, so there is no embedding for that new category.
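To pinpoint the offending column, you can compare the category sets on either side of the training cutoff. A sketch using the column names from the question (adjust to your frame):

# List values that only appear after the training cutoff, per categorical column
for col in ["cat", "month", "day", "week"]:
    train_vals = set(data.loc[data["time_idx"] <= training_cutoff, col])
    unseen = set(data[col]) - train_vals
    if unseen:
        print(f"{col}: values unseen during training -> {unseen}")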
For each categorical variable with a value in the validation set that does not appear in the training set, you can either remove that value or treat it as a cold-start problem by specifying

categorical_encoders={COLUMN_NAME: NaNLabelEncoder(add_nan=True)}

(after importing NaNLabelEncoder from pytorch_forecasting.data.encoders) in your create_sdata() function.
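Concretely, with the setup above, that would look something like the snippet below. Which encoders you actually need depends on which column carries the unseen value (the '10' in the error suggests a calendar column such as month or day):

from pytorch_forecasting.data.encoders import NaNLabelEncoder

# inside create_sdata(), replace the empty categorical_encoders dict with:
categorical_encoders={
    'month': NaNLabelEncoder(add_nan=True),
    'day': NaNLabelEncoder(add_nan=True),
    'week': NaNLabelEncoder(add_nan=True),
},

Note that add_nan=True maps every unseen value to a single shared "unknown" class, so for known calendar categoricals it is a workaround; ensuring all values occur in the training window (e.g. by lengthening it) is usually cleaner.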