Multivariate Time Series Forecasting using PyTorch Forecasting TimeSeriesDataSet

I want to forecast a Target using its history and the history of three covariates (Cov1, Cov2, Cov3).

I have several samples (Id), each with 601 observations (time) of (Target, Cov1, Cov2, Cov3), and I want to train my model (a TemporalFusionTransformer) on the first 61 observations to predict the remaining 540 Target values.

I plan to train/validate my model using the PyTorch Forecasting TimeSeriesDataSet object and then test it on unseen samples.
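
For reference, before building any datasets I set aside a subset of the Ids as unseen test samples, along these lines (the 20% fraction and the df_all / df_test names are just placeholders I use here for illustration):

import pandas as pd

# df_all holds every sample; a fraction of the Ids is kept aside as the unseen test set
test_ids = pd.Series(df_all["id"].unique()).sample(frac=0.2, random_state=42)
df_test = df_all[df_all["id"].isin(test_ids)]    # never seen during training/validation
df = df_all[~df_all["id"].isin(test_ids)]        # this is the df used in the code below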

I have read a lot of TimeSeriesDataSet examples (pytorch-forecasting.readthedocs.io, Kaggle notebooks, data science posts like https://towardsdatascience.com/all-about-n-hits-the-latest-breakthrough-in-time-series-forecasting-a8ddcb27b0d5 ...), but most of them split a single time series into consecutive train/validation/test sets.

I can't find many examples that train on several samples and test on others, so my questions are about data preprocessing prior to fitting the model. Here is the code I used:

from pytorch_forecasting import TimeSeriesDataSet

max_prediction_length = 540
max_encoder_length = 61
training_cutoff = df["time"].max() - max_prediction_length

training = TimeSeriesDataSet(
    df[lambda x: x.time <= training_cutoff],
    time_idx="time",
    target="Target",
    group_ids=["id"],
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=max_prediction_length,
    max_prediction_length=max_prediction_length,
    time_varying_unknown_reals=["Cov1", "Cov2", "Cov3", "Target"],
)

# create the validation set (predict=True means: predict the last
# max_prediction_length points in time for each series)
validation = TimeSeriesDataSet.from_dataset(training, df, predict=True, stop_randomization=True)

# create dataloaders for model:
batch_size = 4
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
  1. I'm not sure whether I should filter the data with df[lambda x: x.time <= training_cutoff], as in the code examples I found, or pass df directly, given that I fix the window sizes with min_encoder_length = max_encoder_length and min_prediction_length = max_prediction_length. (The first sketch after this list spells out what I mean by passing df directly.)

  2. I still don't clearly understand how the boolean flags predict=True, stop_randomization=True and train=True/False work together to differentiate the training and validation sets. (The second sketch after this list summarizes my current understanding; please correct it if it's wrong.)
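
To make question 1 concrete, by "passing df directly" I mean keeping the same fixed window lengths but skipping the cutoff filter, roughly as below (the training_full name is just mine, for illustration):

# Same fixed window lengths (61 + 540 = 601 = full series length),
# but without the `time <= training_cutoff` pre-filter.
training_full = TimeSeriesDataSet(
    df,
    time_idx="time",
    target="Target",
    group_ids=["id"],
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=max_prediction_length,
    max_prediction_length=max_prediction_length,
    time_varying_unknown_reals=["Cov1", "Cov2", "Cov3", "Target"],
)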

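For question 2, here is my current understanding of what those flags do, written out as comments, plus a quick check I would run; please correct anything that is wrong:

# My current understanding of the flags (please correct me if this is wrong):
#   predict=True             -> the validation set keeps only one window per id, ending at
#                               the last time step, i.e. it predicts the last
#                               max_prediction_length points of each series
#   stop_randomization=True  -> encoder/decoder lengths are not randomized, so the
#                               validation samples are deterministic
#   train=True / train=False -> mainly controls shuffling of the sampled windows
#                               (shuffled for training, fixed order for validation)

# A quick sanity check of the predict=True behaviour:
print(len(training))    # number of training windows
print(len(validation))  # with predict=True I expect exactly one window per id
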
Any help would be appreciated!
