I am using the Temporal Fusion Transformer (TFT), a probabilistic model, for quantile forecasting in my work. I have fixed the history length to 72 and evaluated the model at different horizons by varying the prediction length (specifically, 1, 6, 12, 36, and 72). The evaluation metric I used is the mean weighted quantile loss (mean_wQuantileLoss).
However, I am not confident in the final results, because the TFT's short-term performance is significantly worse than its long-term performance (my implementation is based on GluonTS). For example, on the first dataset, the mean_wQuantileLoss values for the five prediction-length settings are 0.0240, 0.0039, 0.0028, 0.0030, and 0.0043; the metric at prediction length = 1 is almost an order of magnitude worse than at the other prediction lengths. I observed a similar pattern on the second dataset, where the mean_wQuantileLoss values are 0.0363, 0.02519, 0.0222, 0.0091, and 0.0085, with a large improvement from prediction length = 12 to prediction length = 36.
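For reference, this is my understanding of how the metric is computed (a from-scratch sketch with made-up toy numbers, not GluonTS's actual code): for each quantile, the quantile losses are summed over the forecast window, doubled, and normalized by the sum of absolute target values, and the result is averaged over the quantiles.

```python
def mean_wquantile_loss(target, quantile_forecasts, quantiles):
    """Sketch of mean_wQuantileLoss as I understand it (not the GluonTS source).

    target: list of true values over the forecast window.
    quantile_forecasts: dict mapping quantile level q -> list of predictions.
    quantiles: list of quantile levels to average over.
    """
    # Normalizer: sum of absolute target values over the window.
    abs_target_sum = sum(abs(y) for y in target)
    losses = []
    for q in quantiles:
        # Pinball (quantile) loss, summed over the window and doubled.
        ql = 2.0 * sum(
            abs((f - y) * ((1.0 if y <= f else 0.0) - q))
            for y, f in zip(target, quantile_forecasts[q])
        )
        losses.append(ql / abs_target_sum)
    # Average the weighted quantile losses over all quantile levels.
    return sum(losses) / len(losses)

# Hypothetical toy example: one series, horizon of 4, two quantile levels.
target = [10.0, 12.0, 11.0, 13.0]
forecasts = {
    0.5: [10.5, 11.5, 11.0, 12.5],
    0.9: [12.0, 13.5, 12.5, 14.0],
}
print(mean_wquantile_loss(target, forecasts, [0.5, 0.9]))  # ≈ 0.0293
```

One thing I notice from this definition is that at prediction length = 1 both the loss sum and the normalizer are computed from a single point per window, so the metric may simply be noisier there; I am not sure whether that alone explains the gap I am seeing.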
I would like to know whether this behavior is normal, and what could be causing it.
I searched for similar observations in other works based on the TFT, including the experiments in the original paper, but I could not find any evaluation of the TFT across different prediction lengths there.