I would like to understand if the procedure I'm following is standard or if I'm making some mistake.
I have a time series of 48 values (one value per month from 2018 to 2021), stored in the data frame df:
Amount
2018-01 125.6
... ...
2020-12 145.2
2021-01 148.4
... ...
2021-12 198.8
I would like to build a model that can forecast the amount for future months.
In short, I take the first three years (36 months) and use this data to train my model, and then test it on the last year (2021), as follows:
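import pandas as pd
import pmdarima as pm

# Chronological split: first 36 months (2018-2020) for training, last 12 (2021) for testing
df_train = df[:36]
df_test = df[36:]
# Seasonal ARIMA search with a 12-month season
arima = pm.auto_arima(df_train, error_action='ignore', trace=True,
                      suppress_warnings=True, maxiter=50,
                      seasonal=True, m=12,
                      random_state=1)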
# Best model: ARIMA(1,1,1)(0,1,0)[12]
predictions, conf_int = arima.predict(n_periods=12, return_conf_int=True)
df_predictions = pd.DataFrame(predictions, index=df_test.index)
df_predictions.columns = ['Predicted amount']
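As a sanity check on the split described above, here is a minimal pure-Python sketch (no pandas) that shows which months fall into each part. The `month_labels` helper and the generated labels are illustrative assumptions, not part of the original code; they only mirror the 48-month layout of df.

```python
def month_labels(start_year, n_months):
    """Return 'YYYY-MM' labels for n_months consecutive months,
    starting in January of start_year (hypothetical helper)."""
    labels = []
    for i in range(n_months):
        year = start_year + i // 12
        month = i % 12 + 1
        labels.append(f"{year}-{month:02d}")
    return labels


months = month_labels(2018, 48)   # 2018-01 ... 2021-12
train_months = months[:36]        # same slice as df[:36]
test_months = months[36:]         # same slice as df[36:]

print(train_months[0], train_months[-1])  # 2018-01 2020-12
print(test_months[0], test_months[-1])    # 2021-01 2021-12
```

The slice at position 36 therefore puts all of 2018-2020 in the training set and exactly the 12 months of 2021 in the test set, which matches n_periods=12 in the predict call.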
Then, I use:
from sklearn.metrics import r2_score
r2_score(df_test['Amount'], df_predictions['Predicted amount'])
getting about 0.92, so everything seems to be fine. Is this correct up to here?
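For reference, the score used here is the coefficient of determination, 1 - SS_res / SS_tot, where SS_tot measures the spread of the actual test values around their own mean. The sketch below computes it by hand in plain Python. The function name `r2` and the four sample values are made up for illustration and are not taken from my data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not from the real series)
actual = [10.0, 12.0, 14.0, 16.0]
perfect = list(actual)        # predictions identical to the actuals
mean_only = [13.0] * 4        # always predicting the mean of the actuals

print(r2(actual, perfect))    # 1.0
print(r2(actual, mean_only))  # 0.0
```

A score of 0.92 therefore means the forecast errors are small relative to how much the 2021 values vary around their mean. When the test year has a strong trend, SS_tot is large, so it may be worth looking at an absolute error measure as well.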
Finally, I want to forecast the 2022 amounts, for which I have no actual values to compare against. To do this, I update the model with the 2021 data and forecast again:
arima.update(df_test)
forecast_index = pd.date_range(start='2022-01-01', end='2022-12-01', freq='MS')
df_forecasts = pd.DataFrame(arima.predict(n_periods=12), index=forecast_index)
df_forecasts.columns = ['Forecasted amount']
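One thing that can be checked independently of the model is that the forecast index has exactly as many entries as n_periods. The pd.date_range call with freq='MS' yields one month-start date per month from 2022-01-01 to 2022-12-01. The plain-Python sketch below builds the same 12 dates; the `month_starts` helper is a hypothetical name used only for this illustration.

```python
from datetime import date


def month_starts(year):
    """First day of each month in the given year (hypothetical helper)."""
    return [date(year, month, 1) for month in range(1, 13)]


forecast_dates = month_starts(2022)

print(len(forecast_dates))                    # 12
print(forecast_dates[0], forecast_dates[-1])  # 2022-01-01 2022-12-01
```

Twelve dates line up with the twelve values returned by predict(n_periods=12), provided the model was updated with all of 2021 first so that the forecast starts in January 2022.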
I'm less sure about this last part. Is it correct?
I have made a very concise summary of the procedure, but I am interested in understanding if the path I have followed is standard and correct. Thanks to anyone who can answer me.