Does CatBoost regression work well for time estimation?


I want to predict the estimated time of resolution (ETR) for IT tickets. My data has the following features:

  1. Ticket title and description in text format
  2. Created hour, between 0 and 23 (ETR would be lower for tickets created at 10 am than for tickets created at 4 pm)
  3. Business day: weekdays are treated as business days (1) and weekends as non-business days (0)
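
For reference, features 2 and 3 can be derived from a ticket's creation timestamp roughly as follows. This is a minimal sketch, assuming each ticket carries a `datetime` creation time; the function name `time_features` and the key names are hypothetical.

```python
from datetime import datetime

def time_features(created_at: datetime) -> dict:
    """Derive the hour-of-day and business-day features from a timestamp."""
    return {
        # Hour of creation, an integer in 0..23
        "created_hour": created_at.hour,
        # weekday() is 0 for Monday through 6 for Sunday;
        # Monday to Friday count as business days (1), weekends as 0
        "business_day": 1 if created_at.weekday() < 5 else 0,
    }
```

Public holidays are not handled here; they would need a separate calendar lookup.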

I have training data of about 100,000 tickets and their resolution times (labels). Since my dataset contains text, I am using a CatBoost regressor model.

Out of the box, my model gave a poor R2 and a high RMSE with the code below.

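    import catboost
    from catboost import CatBoostRegressor

    text_features = ["combined_text"]
    train_dataset = catboost.Pool(train_data[feature_columns], train_data[target], text_features=text_features)
    # Build the test pool from held-out rows, not from train_data
    # (test_data is assumed to be a separate split of the same DataFrame)
    test_dataset = catboost.Pool(test_data[feature_columns], test_data[target], text_features=text_features)

    # Define the model; grid_search below fits it
    model = CatBoostRegressor(verbose=0, text_features=text_features, loss_function='RMSE', eval_metric='R2')

    # Search over the number of trees and the learning rate
    grid = {'iterations': [250, 300, 400],
            'learning_rate': [0.1, 0.01]}
    model.grid_search(grid, train_dataset)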

I am getting an R2 of just 0.28 and RMSE of 63.667.
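
To make these two numbers concrete, here is a minimal sketch of how RMSE and R2 are computed from true and predicted resolution times. The helper names `rmse` and `r2` are hypothetical, and this uses the standard definitions rather than CatBoost's own implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

A model that always predicts the mean of the labels scores an R2 of 0 under this definition, so an R2 of 0.28 means the model removes about 28% of the squared error of that baseline.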

To check whether each feature has a linear or non-linear relationship with the response variable, I plotted Created Hour (0-23) against Resolution Time and got the scatter plot below.

[Image: scatter plot of Created Hour (0-23) vs. Resolution Time]

This confuses me even more. I had planned to add polynomial terms for this feature, but now I am not sure that makes sense.
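
A raw scatter plot of 100,000 points against 24 discrete hours can be hard to read, so one way to inspect the relationship is to summarise resolution time per hour. This is a minimal sketch; the function name `median_by_hour` is hypothetical and the inputs are assumed to be two equal-length sequences.

```python
from collections import defaultdict
from statistics import median

def median_by_hour(hours, resolution_times):
    """Return {hour: median resolution time}, ordered by hour."""
    groups = defaultdict(list)
    for hour, time in zip(hours, resolution_times):
        groups[hour].append(time)
    # The median is less sensitive than the mean to a few very slow tickets
    return {hour: median(times) for hour, times in sorted(groups.items())}
```

Plotting the returned medians against the hour shows whether there is any trend at all before a transformation is chosen.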

My model is underfitting, but I cannot:

  1. Bring in more data, as this is all the data I have.
  2. Extract more features, as there is no room for that.
  3. Add polynomial terms for a feature, as I do not understand its relationship with the target.

What other steps can I take?
