I am using xgboost for a regression problem, but I am getting an error about the response variable, which is the output sales. It is numeric to begin with, yet xgboost raises an error when I use it. I want the output to stay numeric.
labels <- train$Item_Outlet_Sales    # train label
ts_label <- test$Item_Outlet_Sales   # test label
# converted into matrix (one-hot encoding)
new_tr <- model.matrix(~ . + 0, data = train[, -c("Item_Outlet_Sales"), with = FALSE])
new_ts <- model.matrix(~ . + 0, data = test[, -c("Item_Outlet_Sales"), with = FALSE])
## checking class
class(labels)
[1] "numeric"
I created the label (response variable) in the test set as:
test$Item_Outlet_Sales <- NA
class(test$Item_Outlet_Sales)
[1] "logical"
# converting `ts_label` into numeric, as it is initially logical
ts_label <- as.numeric(ts_label)-1
class(ts_label)
[1] "numeric"
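A quick base R check of what this conversion yields when every value is `NA`, as it is after the assignment above (a sketch; `demo_label` and `demo_numeric` are hypothetical stand-ins for the test column):

```r
# Stand-in for test$Item_Outlet_Sales after it was set to NA.
demo_label <- rep(NA, 3)
class(demo_label)                  # "logical"

# Arithmetic on NA stays NA, so the class changes but no real values appear.
demo_numeric <- as.numeric(demo_label) - 1
class(demo_numeric)                # "numeric"
print(demo_numeric)                # NA NA NA
```

So the vector is numeric in class but still holds no usable label values.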
Now I build the DMatrix objects and train the model:
dtrain1 <- xgb.DMatrix(data = new_tr,label = labels)
dtest1 <- xgb.DMatrix(data = new_ts,label= ts_label)
xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1, subsample=0.9,
objective="reg:logistic", booster="gbtree", eval_metric="rmse")
Error:
Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
[14:08:41] amalgamation/../src/objective/regression_obj.cc:108:
label must be in [0,1] for logistic regression
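The message concerns the range of the label values rather than their class. A small base R sketch of that check, using the three `Item_Outlet_Sales` values from the sample rows further down as stand-in labels. The min-max rescaling only illustrates what `reg:logistic` would accept; it is an assumption for demonstration, not a recommended fix:

```r
# Stand-in labels: the three Item_Outlet_Sales values from the sample rows.
demo_labels <- c(3735.1380, 443.4228, 2097.2700)

# reg:logistic requires every label to lie in [0, 1]; these do not.
in_unit_interval <- all(demo_labels >= 0 & demo_labels <= 1)
print(in_unit_interval)            # FALSE

# Min-max rescaling maps the labels into [0, 1] (illustration only).
rng <- range(demo_labels)
scaled <- (demo_labels - rng[1]) / (rng[2] - rng[1])
print(range(scaled))               # 0 1
```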
I then tried this:
xgbmodel1 = xgb.train(data=dtrain1, nround=150, max_depth=5, eta=0.1, subsample=0.9,
objective="reg:linear", booster="gbtree", eval_metric="rmse")
With this, every predicted value of the response variable is -1 and my rmse score is infinite.
Please tell me how to use xgboost effectively in this case, even with default settings, so that no error occurs.
I have 4 categorical variables in this dataset.
Here is a subset of the train dataset:
r <- train[1:3,]
r
   Item_Identifier Item_Fat_Content   Item_Type Item_MRP Outlet_Identifier
1:           FDA15          Low Fat       Dairy 249.8092            OUT049
2:           DRC01          Regular Soft Drinks  48.2692            OUT018
3:           FDN15          Low Fat        Meat 141.6180            OUT049
   Outlet_Establishment_Year Outlet_Location_Type       Outlet_Type Item_Outlet_Sales
1:                      1999               Tier 1 Supermarket Type1         3735.1380
2:                      2009               Tier 3 Supermarket Type2          443.4228
3:                      1999               Tier 1 Supermarket Type1         2097.2700
   Item_Weight Item_Visibility Outlet_Size
1:        9.30      0.01604730           2
2:        5.92      0.01927822           2
3:       17.50      0.01676007           2
I see two problems here:
1. The algorithm expects labels to be either 0 or 1, whereas your code sets them to 0 or -1. Correct the line where you define the `ts_label` variable so that the labels are 0 or 1.
2. You have a binary target and categorical predictors. Why do you want to do logistic regression? I feel "binary:logistic" may be a better objective here. "reg:linear" makes no sense, and your loss function should be based on accuracy rather than rmse.
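If `Item_Outlet_Sales` is in fact continuous, as the sample rows in the question suggest (values such as 3735.1380 and 443.4228), a regression objective remains a reasonable fit. The following is a minimal sketch under that assumption, not a definitive setup. The `toy` data frame and its values are made up for illustration, the held-out split is an assumption, and `reg:squarederror` is the name recent xgboost releases use for squared-error regression (older ones call it `reg:linear`):

```r
library(xgboost)

set.seed(42)
# Hypothetical stand-in data: one numeric and one categorical predictor.
toy <- data.frame(
  Item_MRP    = runif(200, 30, 270),
  Outlet_Type = factor(sample(c("Grocery", "Supermarket"), 200, replace = TRUE))
)
toy$Item_Outlet_Sales <- 15 * toy$Item_MRP + rnorm(200, sd = 100)

# One-hot encode the predictors only; the target stays a plain numeric vector.
x <- model.matrix(~ . + 0, data = toy[, c("Item_MRP", "Outlet_Type")])
y <- toy$Item_Outlet_Sales

# Hold out rows with known labels so the error can actually be measured.
idx    <- sample(nrow(toy), 150)
dtrain <- xgb.DMatrix(data = x[idx, ], label = y[idx])
dvalid <- xgb.DMatrix(data = x[-idx, ], label = y[-idx])

params <- list(objective = "reg:squarederror", max_depth = 5,
               eta = 0.1, subsample = 0.9)
model  <- xgb.train(params = params, data = dtrain, nrounds = 150, verbose = 0)

# Predictions are on the original numeric scale of the target.
pred <- predict(model, dvalid)
rmse <- sqrt(mean((pred - y[-idx])^2))
print(rmse)
```

Because the real test set has no known sales, `xgb.DMatrix(data = new_ts)` can be built without a `label` and passed to `predict()`; a label of `NA` gives nothing to score against.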