I want to predict the estimated time of resolution (ETR) for IT tickets. My ticketing data has the following features:
- Ticket title and description in text format
- Created hour, an integer between 0 and 23 (ETR would be lower for tickets created at 10 am than for tickets created at 4 pm)
- Business day, either 0 or 1 (weekdays are treated as business days (1) and weekends as non-business days (0))
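To make the feature definitions concrete, here is a minimal sketch of how one raw ticket could be turned into these three inputs. The function name `build_features` and the keys `created_hour` and `business_day` are hypothetical names chosen for illustration; only `combined_text` appears in my actual code, and I am assuming it is the title and description joined together.

```python
from datetime import datetime


def build_features(title: str, description: str, created_at: datetime) -> dict:
    """Derive the three model inputs from one raw ticket (illustrative only)."""
    return {
        # Title and description joined into a single text field (assumed layout).
        "combined_text": f"{title.strip()} {description.strip()}".strip(),
        # Hour of creation, an integer from 0 to 23.
        "created_hour": created_at.hour,
        # weekday() is 0 (Monday) to 6 (Sunday); Monday-Friday count as business days.
        "business_day": 1 if created_at.weekday() < 5 else 0,
    }


# A made-up ticket created on a Saturday afternoon.
row = build_features("VPN down", "Cannot connect from home", datetime(2024, 3, 9, 16, 30))
print(row)
```

Note that this treats every weekday as a business day, so public holidays falling on a weekday would still be encoded as 1.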
I have training data of about 100,000 tickets and their resolution times (labels). Since I have text in my dataset, I am using a CatBoost regressor model.
Out of the box, my model gave me a poor R2 and a high RMSE with the code below.
import catboost
from catboost import CatBoostRegressor

text_features = ["combined_text"]
train_dataset = catboost.Pool(train_data[feature_columns], train_data[target], text_features=text_features)
# Held-out pool; test_data is assumed to be a DataFrame kept separate from train_data
test_dataset = catboost.Pool(test_data[feature_columns], test_data[target], text_features=text_features)

# Fit the model
model = CatBoostRegressor(verbose=0, text_features=text_features, loss_function='RMSE', eval_metric='R2')
grid = {'iterations': [250, 300, 400],
        'learning_rate': [0.1, 0.01]}
# Cross-validated grid search over the training pool
model.grid_search(grid, train_dataset)
I am getting an R2 of just 0.28 and RMSE of 63.667.
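For reference, this is how I understand the two metrics, written out in plain Python on made-up numbers (the values below are illustrative, not from my dataset). By definition, an R2 of 0.28 means the model's squared error is 72% of what a constant prediction of the mean would give, and RMSE is expressed in the same units as the resolution time.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print("RMSE:", rmse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
# Predicting the mean for every row gives R2 = 0 by construction.
print("R2 of mean baseline:", r2(y_true, [25.0] * 4))
```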
When I analyzed my data further to check whether each predictor has a linear or non-linear relationship with the response, I got the scatter plot below for Created Hour (0-23) vs. Resolution Time.
[Scatter plot: Created Hour (0-23) vs. Resolution Time]
This confuses me even more. I had planned to add polynomial terms for this feature, but now I am not sure that makes sense.
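Since a raw scatter plot of 100,000 points over only 24 distinct hour values is hard to read, one alternative I could try is summarizing the resolution time per hour. A minimal sketch, assuming two parallel lists of created hours and resolution times; the function name `median_by_hour` and the sample values are hypothetical:

```python
from collections import defaultdict
from statistics import median


def median_by_hour(created_hours, resolution_times):
    """Group resolution times by created hour and return the median per hour."""
    buckets = defaultdict(list)
    for hour, value in zip(created_hours, resolution_times):
        buckets[hour].append(value)
    # Sort by hour so the summary reads from 0 to 23.
    return {hour: median(values) for hour, values in sorted(buckets.items())}


# Made-up data: three tickets created at 10:00 and three at 16:00.
hours = [10, 10, 10, 16, 16, 16]
times = [30.0, 45.0, 40.0, 120.0, 90.0, 200.0]

summary = median_by_hour(hours, times)
for hour, med in summary.items():
    print(f"{hour:02d}:00  median resolution time = {med}")
```

The median is used here rather than the mean so that a few very long-running tickets do not dominate the per-hour summary.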
My model is underfitting, and I cannot:
- Bring in more data, as this is all the data I have.
- Extract more features, as there is no room for it.
- Add polynomial terms for a feature, as I do not understand the relationship.
What other steps can I take?