I am quite new to NLP.
I am building a regression model to predict numeric values (like price).
I am using XGBRegressor with word2vec, and it throws the error below when I try to fit the model.
My input to word2vec is a list of words:
[text, font, graphics, screenshot, gain]
from xgboost import XGBRegressor
xgb_model = XGBRegressor(
    objective = 'reg:squarederror',
    colsample_bytree = 0.5,
    learning_rate = 0.05,
    max_depth = 6,
    min_child_weight = 1,
    n_estimators = 1000,
    subsample = 0.7)
%time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)
y_pred_xgb = xgb_model.predict(x_test)
XGBoostError Traceback (most recent call last)
in ()
10 subsample = 0.7)
11
---> 12 get_ipython().magic('time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)')
13
14 y_pred_xgb = xgb_model.predict(x_test)
8 frames
<decorator-gen-53> in time(self, line, cell, local_ns)
<timed eval> in <module>()
/usr/local/lib/python3.7/dist-packages/xgboost/core.py in _check_call(ret)
174 """
175 if ret != 0:
--> 176 raise XGBoostError(py_str(_LIB.XGBGetLastError()))
177
178
XGBoostError: [01:43:27] /workspace/src/objective/regression_obj.cu:65: Check failed: preds.Size() == info.labels_.Size() (1 vs. 70812) : labels are not correctly providedpreds.size=1, label.size=70812
Stack trace:
[bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f45763dfcb4]
[bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::GetGradient(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*)+0x21e) [0x7f45765ea84e]
[bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x345) [0x7f4576479505]
[bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f45763dcaa5]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f45d2a66dae]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f45d2a6671f]
[bt] (6) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x28c) [0x7f45d2c7a5dc]
[bt] (7) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x109e3) [0x7f45d2c799e3]
[bt] (8) /usr/bin/python3(_PyObject_FastCallKeywords+0x92) [0x5559ff072902]
The error indicates that there's a problem with the dimensions of `x_train`: xgboost thinks that you've given it 1 training example in `x_train` and 70812 labels in `y_train`.

You need to check the shape of `x_train` and verify that you have a 2-dimensional array, with the first dimension being the number of training examples and the second dimension being the size of the embedding. The size of `y_train` should match the size of the first dimension of `x_train`.

When you say that your input to word2vec is a list of words, do you mean that each of your training examples is just one word, or that each example is a list of words? If you only have one word per example, then the encoded dataset should have dimensions of `(num_examples, embedding_dim)`. If each example is a sequence of words, then you will have `(num_examples, sequence_len, embedding_dim)`, which is too many dimensions, so you'll have to take the average of the embeddings over each sequence, or use sentence embeddings instead.

For example, given some randomly initialized numpy arrays:
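A minimal sketch, assuming each example is a sequence of 5 word vectors of size 100. Both numbers are illustrative, and the random data only stands in for real word2vec output. Averaging over the sequence axis gives the 2-dimensional array that xgboost expects:

```python
import numpy as np

# Assumed sizes for illustration; only num_examples comes from the error message.
num_examples, sequence_len, embedding_dim = 70812, 5, 100

rng = np.random.default_rng(0)

# Random stand-in for per-word embeddings: (num_examples, sequence_len, embedding_dim).
word_vectors = rng.random((num_examples, sequence_len, embedding_dim), dtype=np.float32)

# Average over the sequence axis to get one vector per example.
x_train = word_vectors.mean(axis=1)

# One label per training example.
y_train = rng.random(num_examples)

print(x_train.shape, y_train.shape)
```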
This should print `(70812, 100) (70812,)`. Here 70812 is the number of training examples, and 100 is the size of each vector.

Then you can fit the model as before: