Pytorch Forecasting: How to convert a TimeSeriesDataSet object to pd.DataFrame?


I am new to time series forecasting and came across PyTorch Forecasting a few months back.

Is there any way to convert a TimeSeriesDataSet object, or the DataLoader built from it, back into a pd.DataFrame?

I saw in this post (How to convert torch tensor to pandas dataframe?) that it can be done for a single tensor; the difficult part is mapping the columns and understanding the different parts within the tensor objects.

Take the Temporal Fusion Transformer tutorial as an example: https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html

max_prediction_length = 6
max_encoder_length = 24
training_cutoff = data["time_idx"].max() - max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["agency", "sku"],
    static_reals=["avg_population_2017", "avg_yearly_household_income_2017"],
    time_varying_known_categoricals=["special_days", "month"],
    variable_groups={"special_days": special_days},  # group of categorical variables can be treated as one variable
    time_varying_known_reals=["time_idx", "price_regular", "discount_in_percent"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[
        "volume",
        "log_volume",
        "industry_volume",
        "soda_volume",
        "avg_max_temp",
        "avg_volume_by_agency",
        "avg_volume_by_sku",
    ],
    target_normalizer=GroupNormalizer(
        groups=["agency", "sku"], transformation="softplus"
    ),  # use softplus and normalize by group
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)

# create dataloaders for model
batch_size = 128  # set this between 32 and 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

# calculate baseline mean absolute error, i.e. predict next value as the last available value from the history
actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)
(actuals - baseline_predictions).abs().mean().item()
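To make sure I understand the last line above: it is just a mean absolute error between the actual values and the naive "repeat the last observed value" forecast. A toy illustration with plain numpy (the numbers are made up):

```python
import numpy as np

# Toy stand-ins: three actual values and a naive baseline that predicts
# each next value as the last observed one.
actuals = np.array([10.0, 12.0, 11.0])
baseline = np.array([9.0, 10.0, 11.0])

# Mean absolute error, same computation as the torch expression above.
mae = np.abs(actuals - baseline).mean()
print(mae)  # 1.0
```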

How would one convert validation or val_dataloader into a pd.DataFrame object?
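To make the "mapping" part concrete, this is the kind of reconstruction I am after, with plain numpy arrays standing in for the tensors in validation.data. The column-name lists here are my guesses at what the dataset would report (presumably something like validation.reals would supply the real ones):

```python
import numpy as np
import pandas as pd

# Stand-ins for validation.data["reals"] and validation.data["groups"]
# (converted via .numpy() in the real case). Values taken from the dump
# below; the column names are assumptions, not the library's actual output.
reals = np.array([
    [-0.9593, -0.6123, 0.0000],
    [-0.9593, -0.6123, 0.0000],
    [ 1.2221,  1.2074, 0.0000],
])
real_columns = ["avg_population_2017", "avg_yearly_household_income_2017", "relative_time_idx"]

groups = np.array([[0, 0], [0, 0], [57, 17]])
group_columns = ["agency", "sku"]

# Side-by-side concatenation: group labels first, then the scaled reals.
df = pd.concat(
    [
        pd.DataFrame(groups, columns=group_columns),
        pd.DataFrame(reals, columns=real_columns),
    ],
    axis=1,
)
print(df.shape)  # (3, 5)
```

The open question is how to get the correct column-name lists for each tensor from the TimeSeriesDataSet itself.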

Moreover, is it possible to convert the prediction result (a torch.Tensor) into a dataframe as well?
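For the predictions, I imagine something like the sketch below, again with numpy stand-ins: one row per decoder sample, one column per forecast step (max_prediction_length). The group/time labels in the index frame are made up for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in for baseline_predictions.numpy(): 2 series, 3 forecast steps.
predictions = np.array([
    [84.24, 84.24, 84.24],
    [43.85, 43.85, 43.85],
])
# Hypothetical index identifying each prediction row; in the real case this
# would have to come from the dataset/dataloader somehow.
index = pd.DataFrame({"agency": [0, 0], "sku": [0, 1], "time_idx": [57, 57]})

pred_df = pd.concat(
    [index, pd.DataFrame(predictions, columns=[f"step_{h}" for h in range(predictions.shape[1])])],
    axis=1,
)

# Optionally reshape to long format: one row per (series, horizon step).
long_df = pred_df.melt(
    id_vars=["agency", "sku", "time_idx"], var_name="step", value_name="prediction"
)
print(pred_df.shape)  # (2, 6)
```

What I do not know is how to recover that index frame from validation or val_dataloader.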

validation:

>>> validation.data
{'reals': tensor([[-0.9593, -0.6123,  0.0000,  ..., -2.9171, -1.0676,  1.0738],
         [-0.9593, -0.6123,  0.0000,  ..., -2.1644, -1.0561,  1.3626],
         [-0.9593, -0.6123,  0.0000,  ..., -0.9712, -1.0254,  1.6461],
         ...,
         [ 1.2221,  1.2074,  0.0000,  ..., -0.2065,  1.4175, -1.4105],
         [ 1.2221,  1.2074,  0.0000,  ..., -0.1767,  0.9821, -1.4105],
         [ 1.2221,  1.2074,  0.0000,  ..., -1.3639,  1.3839, -1.4105]]),
 'categoricals': tensor([[ 0,  0,  0,  ...,  0,  0,  0],
         [ 0,  0,  0,  ...,  0,  0,  4],
         [ 0,  0,  3,  ...,  0,  7,  5],
         ...,
         [57, 17,  0,  ...,  0,  0,  1],
         [57, 17,  0,  ...,  1,  0,  2],
         [57, 17,  0,  ...,  0,  0,  3]]),
 'groups': tensor([[ 0,  0],
         [ 0,  0],
         [ 0,  0],
         ...,
         [57, 17],
         [57, 17],
         [57, 17]]),
 'target': [tensor([8.0676e+01, 9.8064e+01, 1.3370e+02,  ..., 9.9000e-01, 9.0000e-02,
          2.2500e+00])],
 'weight': None,
 'time': tensor([ 0,  1,  2,  ..., 57, 58, 59])}

Attempt:

actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)

prediction_df = pd.DataFrame(baseline_predictions.numpy())
prediction_df.columns = validation.data.keys()
prediction_df

        reals   categoricals    groups     target   weight       time
0   84.239998   84.239998   84.239998   84.239998   84.239998   84.239998
1   43.848000   43.848000   43.848000   43.848000   43.848000   43.848000
2   25.718399   25.718399   25.718399   25.718399   25.718399   25.718399
3   15.208200   15.208200   15.208200   15.208200   15.208200   15.208200
4   25.240499   25.240499   25.240499   25.240499   25.240499   25.240499
... ... ... ... ... ... ...
345 349.228790  349.228790  349.228790  349.228790  349.228790  349.228790
346 2053.746094 2053.746094 2053.746094 2053.746094 2053.746094 2053.746094
347 2207.361816 2207.361816 2207.361816 2207.361816 2207.361816 2207.361816
348 77.437500   77.437500   77.437500   77.437500   77.437500   77.437500
349 2.520000    2.520000    2.520000    2.520000    2.520000    2.520000
350 rows × 6 columns

The ideal result would look something like:

(image: ideal dataframe)

The result does not make much sense: every column in a row holds the same value (I suspect the columns are really the max_prediction_length forecast steps, with Baseline repeating the last observed value), so the keys of validation.data are clearly not the right column labels. Am I missing something? How would I map the tensors back to the correct column names and indices?
