I'm using xgboost for a regression problem, but I'm getting an error about the response variable, which is the output sales. Its class is numeric initially, but xgboost raises an error when I use it. I want the output in numeric form only.

labels <- train$Item_Outlet_Sales  # train label
ts_label <- test$Item_Outlet_Sales  # test label

# converted into matrix ( one hot encoding )
new_tr <- model.matrix(~.+0,data = train[,-c("Item_Outlet_Sales"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("Item_Outlet_Sales"),with=F])

## checking class
class(labels)
[1] "numeric"

I created the label (response variable) in the test set as `test$Item_Outlet_Sales <- NA`:

class(test$Item_Outlet_Sales)
[1] "logical"

# converting `ts_label` to numeric, as it is initially logical
ts_label <- as.numeric(ts_label)-1
class(ts_label)
[1] "numeric"

Now I build the DMatrix objects and train the model:

 dtrain1 <- xgb.DMatrix(data = new_tr,label = labels) 
 dtest1 <- xgb.DMatrix(data = new_ts,label= ts_label)

 xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1,  subsample=0.9, 
                       objective="reg:logistic", booster="gbtree", eval_metric="rmse")

Error -

Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  [14:08:41] amalgamation/../src/objective/regression_obj.cc:108: 
  label must be in [0,1] for logistic regression

I then tried this:

xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1,  subsample=0.9, 
                      objective="reg:linear", booster="gbtree", eval_metric="rmse")

I got all values of the response variable equal to -1, and my RMSE score is infinite.

Please tell me how to implement xgboost effectively in this case, even with default settings, so that no error occurs.

I have 4 categorical variables in this dataset.

Here is a subset of the train dataset:

r <- train[1:3,]

r

   Item_Identifier Item_Fat_Content   Item_Type Item_MRP Outlet_Identifier
1:           FDA15          Low Fat       Dairy 249.8092            OUT049
2:           DRC01          Regular Soft Drinks  48.2692            OUT018
3:           FDN15          Low Fat        Meat 141.6180            OUT049
   Outlet_Establishment_Year Outlet_Location_Type       Outlet_Type Item_Outlet_Sales
1:                      1999               Tier 1 Supermarket Type1         3735.1380
2:                      2009               Tier 3 Supermarket Type2          443.4228
3:                      1999               Tier 1 Supermarket Type1         2097.2700
   Item_Weight Item_Visibility Outlet_Size
1:        9.30      0.01604730           2
2:        5.92      0.01927822           2
3:       17.50      0.01676007           2

There is 1 best solution below

I see two problems here:

  1. The algorithm expects labels to be either 0 or 1, but your code sets them to 0 or -1. Correct the line where you define the ts_label variable as follows:

    ts_label <- as.numeric(ts_label)
    
  2. You have a binary target and categorical predictors. Why do you want to do logistic regression? I feel "binary:logistic" may be a better objective here. "reg:linear" makes no sense and your loss function should be based on accuracy and not rmse.
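
If the target is in fact continuous, as the sample rows in the question suggest (Item_Outlet_Sales values such as 3735.1380), a squared-error regression objective may be the better fit. Below is a minimal sketch under that assumption. It uses made-up toy data, not the asker's dataset, so the column values are hypothetical. It also illustrates that arithmetic on NA stays NA in R, so test labels built from NA carry no information, and that a DMatrix used only for prediction can be built without a label. "reg:squarederror" is the name current xgboost releases use for what older releases called "reg:linear". The training part only runs if the xgboost package is installed.

```r
# Toy data standing in for the real train/test sets (hypothetical values).
set.seed(1)
train <- data.frame(
  Item_MRP    = runif(50, 30, 250),
  Outlet_Type = factor(sample(c("Grocery", "Supermarket"), 50, replace = TRUE))
)
train$Item_Outlet_Sales <- 15 * train$Item_MRP + rnorm(50, sd = 100)

test <- train[1:10, c("Item_MRP", "Outlet_Type")]
test$Item_Outlet_Sales <- NA  # unknown target, as in the question

# Arithmetic on NA stays NA, so these "labels" are all missing.
ts_label <- as.numeric(test$Item_Outlet_Sales) - 1
all(is.na(ts_label))

# One-hot encode the predictors only (the target column is left out).
predictors <- c("Item_MRP", "Outlet_Type")
new_tr <- model.matrix(~ . + 0, data = train[, predictors])
new_ts <- model.matrix(~ . + 0, data = test[, predictors])

# Train and predict only when xgboost is available.
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = new_tr, label = train$Item_Outlet_Sales)
  dtest  <- xgboost::xgb.DMatrix(data = new_ts)  # no label needed to predict

  params <- list(objective = "reg:squarederror", booster = "gbtree",
                 max_depth = 5, eta = 0.1, subsample = 0.9,
                 eval_metric = "rmse")
  model <- xgboost::xgb.train(params = params, data = dtrain, nrounds = 150)

  pred <- predict(model, dtest)  # numeric predictions on the sales scale
}
```

With a real test set whose sales are unknown, RMSE cannot be computed on it; holding out part of the training data for evaluation is one common alternative.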