Why does pmdarima's predict_in_sample not depend on the values of the exogenous variables?

Also posted as an issue on GitHub.

When a pmdarima model is fit with exogenous variables, the values of X passed to predict_in_sample do not appear to affect the predictions. Even X arrays with the wrong number of rows or columns are accepted without an error. What is going on here? Am I missing something? See below:

import pmdarima as pm
from pmdarima import model_selection
import numpy as np
import pandas as pd

np.random.seed(42)

y = pm.datasets.load_wineind()
df = pd.DataFrame(
    {
        "x1": y * np.random.uniform(0, 0.5, len(y)) + np.random.randint(1, 1000, len(y)),
        "x2": y * np.random.uniform(0.5, 0.7, len(y)) + np.random.randint(1, 10000, len(y)),
    }
)
df["y"] = y
train, test = model_selection.train_test_split(df, train_size=150)

arima = pm.auto_arima(
    train["y"],
    train.drop(columns="y"),
    error_action="ignore",
    trace=True,
    suppress_warnings=True,
    maxiter=5,
    seasonal=True,
    m=12,
)


# preds1 takes the expected X args
preds1 = arima.predict_in_sample(X=train.drop(columns="y"))

# preds2 takes X with the correct dims, but different values from those used for preds1
preds2 = arima.predict_in_sample(X=train.drop(columns="y") + 1000)

# preds3 takes only x2, not x1, and x2 is subset to only 10 observations
preds3 = arima.predict_in_sample(X=train[:10].drop(columns=["y", "x1"]))

len(preds1)  # 150
len(preds2)  # 150
len(preds3)  # 150

all(preds1 == preds2)  # True
all(preds2 == preds3)  # True

arima.summary()  # To confirm that indeed x1 and x2 are in the model

I expect the values of X passed to predict_in_sample to affect the predictions, and for arrays of the incorrect size to produce an error.

Note: it looks like predict_in_sample is using statsmodels.tsa.statespace.sarimax.SARIMAXResultsWrapper.predict() under the hood.
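
As a stopgap for the missing size check, X can be validated by hand before it is passed to predict_in_sample. The following is a minimal sketch, not a fix: check_exog_shape is a hypothetical helper (it is not part of pmdarima or statsmodels), and the expected shape of (150, 2) is simply the training set from the example above. It only raises on a wrong shape and does nothing to make the in-sample predictions depend on the values of X.

```python
import numpy as np


def check_exog_shape(X, n_obs, n_features):
    """Hypothetical helper: raise ValueError unless X is (n_obs, n_features)."""
    X = np.asarray(X)
    if X.ndim == 1:
        # Treat a 1-D input as a single exogenous column.
        X = X.reshape(-1, 1)
    if X.shape != (n_obs, n_features):
        raise ValueError(
            f"X has shape {X.shape}, expected {(n_obs, n_features)}"
        )
    return X


# Toy stand-in for train.drop(columns="y") in the example above.
X_train = np.zeros((150, 2))

# Correct shape: returned unchanged.
check_exog_shape(X_train, n_obs=150, n_features=2)

# Same mistake as preds3 (10 rows, 1 column): now raises instead of passing silently.
try:
    check_exog_shape(X_train[:10, 1:], n_obs=150, n_features=2)
except ValueError as err:
    print(err)
```

In the example above this would be called as check_exog_shape(train.drop(columns="y"), len(train), 2) before predict_in_sample.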
