How to predict a multidimensional time series using Python and sklearn with unknown X values

While trying to predict future Bitcoin prices, I ran into the following predicament:

I can only predict the y label (for instance the Open price) by providing all the X features that I used to train my model. However, what I need is a prediction into the future, which means my X feature values are also unknown.

Here is a snippet of my data (6 feature columns, 1 label):

                       Open    High     Low    HL-PCT  PCT-change  Volume (BTC)   Label
2016-01-01 00:00:00  430.89  432.58  429.82  0.642129   -0.030161         41.32  434.87
2016-01-01 01:00:00  431.51  432.01  429.08  0.682856    0.348829         31.21  434.44
2016-01-01 02:00:00  430.00  431.69  430.00  0.393023   -0.132383         12.25  433.47
2016-01-01 03:00:00  430.50  433.37  430.03  0.776690   -0.662252         74.98  431.80
2016-01-01 04:00:00  433.34  435.72  432.55  0.732863   -0.406794        870.80  433.28
2016-01-01 05:00:00  435.11  436.00  434.47  0.352153   -0.066605         78.53  433.31
2016-01-01 06:00:00  435.44  435.44  430.08  1.246280    0.440569        177.11  433.39
2016-01-01 07:00:00  434.71  436.00  433.50  0.576701    0.126681        158.45  432.61
2016-01-01 08:00:00  433.82  434.19  431.00  0.740139   -0.059897        210.59  432.80
2016-01-01 09:00:00  433.99  433.99  431.23  0.640030    0.460648        129.68  432.17

Here is my code:

import math

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

# First get my own data
symbols = ["bitstamp_hourly_2016"]
timestamp = pd.date_range(start='2016-01-01 00:00', end='2016-12-23 09:00',
                          freq='1h')

df_all = bf.get_data2(symbols, timestamp)

# Feature slicing (copy so the new columns are not written to a view of df_all)
df = df_all[['Open', 'High', 'Low', 'Close', 'Volume (BTC)']].copy()

df['HL-PCT'] = (df['High'] - df['Low']) / df['Low'] * 100.0
df['PCT-change'] = (df['Open'] - df['Close']) / df['Close'] * 100.0

# Only relevant features
df = df[['Open', 'High', 'Low', 'HL-PCT', 'PCT-change', 'Volume (BTC)']].copy()

df.fillna(-99999, inplace=True)

#cut off the last 24 hours
forecast_out = int(math.ceil(0.0027*len(df)))

forecast_col = 'Open'
df['Label'] = df[forecast_col].shift(-forecast_out)

#X Features and y Label
X = np.array(df.drop(columns=['Label']))
X = preprocessing.scale(X)

#Last 24 hours
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
y = np.array(df['Label'])
y = y[:-forecast_out]

#Train and Test set
test_size= int(math.ceil(0.3*len(df)))
X_train, y_train = X[:-test_size], y[:-test_size]
X_test, y_test= X[-test_size:], y[-test_size:]

#use linear regression
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)

#BIG QUESTION: WHAT TO INSERT HERE TO GET THE REAL FUTURE VALUES
prediction = clf.predict(X_lately)

# The coefficients
print('Coefficients: \n', clf.coef_)
# The mean squared error
print("Mean squared error: %.4f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.4f' % clf.score(X_test, y_test))

Outcome:

How many Hours were predicted:  24
Coefficients: [ 5.30676009e+00  1.05641430e+02  1.44632212e+01  1.47255264e+00 -1.52247332e+00 -6.26777634e-03]
Mean squared error: 133.4017
Variance score: 0.9717

What I want to do is this: give the trained model just a new date and have it use its knowledge of the past to produce a reasonable prediction for, let's say, the next 24 hours (the actual future, for which I do not have data). So far, I can only call clf.predict() on past data.

This should be possible somehow with the Regression line, but how? I could also just use the Date as my X dataframe, but would that not make my model useless?

Thanks

Best answer:

If you want to stick with linear regression rather than using only the date, you can first forecast the regressors of your model (with whatever model you like) and then feed those forecasted values into the linear regression.
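To make the two-step idea concrete, here is a minimal sketch. It is not your code: the data is synthetic (two random-walk features standing in for columns like Open or Volume), the helper `forecast_feature` is a hypothetical name, and the choice of a simple autoregression on each feature's own lags is only one assumption among many possible forecasting models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forecast_feature(series, n_lags, horizon):
    """Recursively forecast one feature with an autoregression on its own lags.

    Hypothetical helper for illustration only.
    """
    # Row i holds series[i : i + n_lags]; its target is the next value.
    lagged = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    targets = series[n_lags:]
    ar_model = LinearRegression().fit(lagged, targets)

    window = list(series[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        # Predict one step ahead, then append it so the next step can use it.
        next_value = ar_model.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
        forecasts.append(next_value)
        window.append(next_value)
    return np.array(forecasts)


# Synthetic stand-in data (assumption): two random-walk features and a
# target that depends linearly on them at the same time step.
rng = np.random.default_rng(0)
n_obs, horizon, n_lags = 500, 24, 5
X = np.cumsum(rng.normal(size=(n_obs, 2)), axis=0)
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=n_obs)

# Step 1: fit the main regression on the known history.
main_model = LinearRegression().fit(X, y)

# Step 2: forecast every regressor over the horizon, one column at a time.
X_future = np.column_stack(
    [forecast_feature(X[:, j], n_lags, horizon) for j in range(X.shape[1])]
)

# Step 3: feed the forecasted regressors into the main regression.
y_future = main_model.predict(X_future)
print(y_future[:5])
```

Keep in mind that errors in the forecasted regressors carry over into the final prediction and tend to grow with the horizon, so the result is only as good as the feature forecasts.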

In any case, the kind of advice you need does not seem to be programming-related; I think your question is better suited to https://stats.stackexchange.com/