I am quite new to NLP. I am building a regression model to predict numeric values (like price) using XGBRegressor + word2vec, and it throws the error below when trying to fit the model.
My input to word2vec is a list of words:

['text', 'font', 'graphics', 'screenshot', 'gain']

from xgboost import XGBRegressor

xgb_model = XGBRegressor(
        objective = 'reg:squarederror',
        colsample_bytree = 0.5,
        learning_rate = 0.05,
        max_depth = 6,
        min_child_weight = 1,
        n_estimators = 1000,
        subsample = 0.7)

%time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)

y_pred_xgb = xgb_model.predict(x_test)

XGBoostError                              Traceback (most recent call last)
<ipython-input> in <module>()
     10         subsample = 0.7)
     11 
---> 12 get_ipython().magic('time xgb_model.fit(list(x_train), y_train, early_stopping_rounds=5, verbose=False)')
     13 
     14 y_pred_xgb = xgb_model.predict(x_test)

8 frames
<decorator-gen-53> in time(self, line, cell, local_ns)

<timed eval> in <module>()

/usr/local/lib/python3.7/dist-packages/xgboost/core.py in _check_call(ret)
    174     """
    175     if ret != 0:
--> 176         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    177 
    178 

XGBoostError: [01:43:27] /workspace/src/objective/regression_obj.cu:65: Check failed: preds.Size() == info.labels_.Size() (1 vs. 70812) : labels are not correctly providedpreds.size=1, label.size=70812
Stack trace:
  [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f45763dfcb4]
  [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::GetGradient(xgboost::HostDeviceVector<float> const&, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*)+0x21e) [0x7f45765ea84e]
  [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x345) [0x7f4576479505]
  [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(XGBoosterUpdateOneIter+0x35) [0x7f45763dcaa5]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f45d2a66dae]
  [bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f45d2a6671f]
  [bt] (6) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x28c) [0x7f45d2c7a5dc]
  [bt] (7) /usr/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x109e3) [0x7f45d2c799e3]
  [bt] (8) /usr/bin/python3(_PyObject_FastCallKeywords+0x92) [0x5559ff072902]



The error indicates that there's a problem with the dimensions of x_train: XGBoost thinks you've given it 1 training example in x_train but 70812 labels in y_train.

You need to check the shape of x_train and verify that it is a 2-dimensional array, with the first dimension being the number of training examples and the second being the size of the embedding. The length of y_train should match the first dimension of x_train.
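A quick sanity check, using the x_train and y_train from the question (both assumed to be array-like):

import numpy as np

# Both shapes should agree on the first dimension:
# x_train -> (num_examples, embedding_dim), y_train -> (num_examples,)
print(np.asarray(x_train).shape, np.asarray(y_train).shape)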

When you say that your input to word2vec is a list of words, do you mean that each of your training examples is just one word, or that each example is a list of words? If you only have one word per example, then the encoded dataset should have dimensions of (num_examples, embedding_dim).
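In the one-word-per-example case, you can stack the individual word vectors directly. A minimal sketch using gensim, with placeholder names words and w2v (gensim 4 calls the dimension parameter vector_size; older versions called it size):

import numpy as np
from gensim.models import Word2Vec

# Toy data: one token per training example
words = ['text', 'font', 'graphics', 'screenshot', 'gain']
w2v = Word2Vec(sentences=[words], vector_size=100, min_count=1)

# Look up each word's vector and stack them into a 2-D array
x_train = np.vstack([w2v.wv[w] for w in words])
print(x_train.shape)  # (5, 100) -> (num_examples, embedding_dim)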

If each example is a sequence of words, then you will have (num_examples, sequence_len, embedding_dim), which is too many dimensions; you'll have to take the average of the embeddings over each sequence, or use sentence embeddings instead.
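Here is one way to do that averaging, again as a sketch with placeholder names (docs is a list of token lists, w2v a gensim model trained on them):

import numpy as np
from gensim.models import Word2Vec

# Toy data: each training example is a sequence of tokens
docs = [['text', 'font', 'graphics'], ['screenshot', 'gain']]
w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1)

def doc_vector(tokens, model):
    # Average the vectors of the tokens the model knows about
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:  # no in-vocabulary tokens: fall back to zeros
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

x_train = np.vstack([doc_vector(d, w2v) for d in docs])
print(x_train.shape)  # (2, 100): one fixed-size vector per example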

For example, given some randomly initialized numpy arrays:

import numpy as np

num_examples = 70812
embedding_dim = 100
x_train = np.random.rand(num_examples, embedding_dim)
y_train = np.random.rand(num_examples)
print(x_train.shape, y_train.shape)

This should print: (70812, 100) (70812,). 70812 is the number of training examples, and 100 is the size of each vector.

Then you can fit the model as before:

from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    objective = 'reg:squarederror',
    colsample_bytree = 0.5,
    learning_rate = 0.05,
    max_depth = 6,
    min_child_weight = 1,
    n_estimators = 1000,
    subsample = 0.7
)
xgb_model.fit(x_train, y_train)
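As a side note, the original fit() call also passes early_stopping_rounds=5 without an eval_set, and XGBoost's scikit-learn wrapper needs at least one evaluation set to monitor before it can stop early. If you want to keep early stopping, hold out a validation split, roughly like this (the split names here are hypothetical):

from sklearn.model_selection import train_test_split

# Hold out part of the training data so early stopping has a metric to watch
X_tr, X_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

xgb_model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=5,
    verbose=False,
)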