I am trying to use the CatBoost Java API but am running into high latency at scale. I run a multi-threaded system with around 300+ worker threads that query the CatBoost model multiple times per client request. Here is sample code:
void loadModel() throws CatBoostError {
    // load the model file; modelFilePath points to a ~1 GB model
    CatBoostModel model = CatBoostModel.loadModel(modelFilePath);
}
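For context, the model is loaded once and shared by all worker threads; as far as I can tell, a loaded CatBoostModel can serve concurrent predict calls. A sketch of the one-time lazy loading pattern I use, with a hypothetical `loadFromDisk` stand-in in place of `CatBoostModel.loadModel` (the stub `Model` class is illustrative only):

```java
// Initialization-on-demand holder: the model is loaded exactly once, the
// first time getModel() is called; thread safety is guaranteed by the JVM's
// class-initialization semantics, with no explicit locking on the hot path.
public class ModelHolder {
    // Stand-in for the real model type; replace with CatBoostModel.
    static class Model {
        final String path;
        Model(String path) { this.path = path; }
    }

    // Hypothetical loader; in real code this would be
    // CatBoostModel.loadModel(modelFilePath).
    private static Model loadFromDisk() {
        return new Model("/path/to/model.cbm");
    }

    private static class Holder {
        static final Model INSTANCE = loadFromDisk();
    }

    public static Model getModel() {
        return Holder.INSTANCE;
    }
}
```

All threads share the single instance returned by `getModel()`.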
...
...
void getPrediction() throws CatBoostError {
    // called from multiple threads, multiple times per user request
    // Predict for all inputs at once.
    // IMP: inputCount is always 1
    float[][] numericalFeatures = new float[inputCount][];
    String[][] catFeatures = new String[inputCount][];
    ...
    ...
    CatBoostPredictions prediction = model.predict(numericalFeatures, catFeatures);
    double result = sigmoid(prediction.get(0, 0));
}
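Since inputCount is always 1, every call pays the fixed per-call overhead (2D array wrapping plus the JNI crossing) to score a single row. One mitigation I'm considering is micro-batching: worker threads hand single rows to a shared queue, and a drainer thread scores everything that has accumulated with one batch predict call. A self-contained sketch, with a stand-in `batchPredict` in place of `model.predict(numericalFeatures, catFeatures)` (the feature-summing logic is purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

public class MicroBatcher {
    // One queued request: a feature row plus a future for its score.
    static final class Req {
        final float[] row;
        final CompletableFuture<Double> result = new CompletableFuture<>();
        Req(float[] row) { this.row = row; }
    }

    private final BlockingQueue<Req> queue = new LinkedBlockingQueue<>();
    private final int maxBatch;

    MicroBatcher(int maxBatch) {
        this.maxBatch = maxBatch;
        Thread drainer = new Thread(this::drainLoop, "model-batcher");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Called by worker threads; blocks until the row has been scored.
    double predict(float[] row) throws Exception {
        Req r = new Req(row);
        queue.put(r);
        return r.result.get();
    }

    private void drainLoop() {
        List<Req> batch = new ArrayList<>();
        while (true) {
            try {
                batch.add(queue.take());            // wait for at least one row
                queue.drainTo(batch, maxBatch - 1); // grab whatever else is queued
                float[][] rows = new float[batch.size()][];
                for (int i = 0; i < batch.size(); i++) rows[i] = batch.get(i).row;
                double[] scores = batchPredict(rows); // one call for the whole batch
                for (int i = 0; i < batch.size(); i++)
                    batch.get(i).result.complete(scores[i]);
            } catch (InterruptedException e) {
                return;
            } finally {
                batch.clear();
            }
        }
    }

    // Stand-in for model.predict(...): scores each row as the sum of its
    // features, just so the sketch is runnable without CatBoost.
    private static double[] batchPredict(float[][] rows) {
        double[] out = new double[rows.length];
        for (int i = 0; i < rows.length; i++)
            for (float v : rows[i]) out[i] += v;
        return out;
    }
}
```

Whether this helps depends on how much of the cost is fixed per-call overhead versus per-row tree evaluation.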
I have generated a flame graph, and it shows that a significant share of CPU time is spent inside the CatBoost prediction call.
I was expecting model prediction latency to stay under 1 ms, but it starts increasing sharply when the load on the server grows (from 9-10k QPS to 12-13k QPS, with 10-100 model queries per request).
Another thing I noticed is that the CPU load average also increases a lot, to 100+ on a 48-core server (even aside from the model usage).
I tried keeping 4 instances of the model and querying them in round-robin fashion, but saw no improvement.
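My current theory is that with 300+ threads making CPU-bound predict calls on 48 cores, the threads mostly contend for CPU, which would match the 100+ load average and explain why extra model instances didn't help (the shared model object isn't the bottleneck; the cores are). One idea I'm considering is capping in-flight predict calls at roughly the core count with a Semaphore. A self-contained sketch, with a stand-in `score()` in place of `model.predict` (the peak counter exists only to show the cap holds):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPredictor {
    private final Semaphore permits;
    // Track the highest number of concurrent predict calls observed,
    // purely to demonstrate that the cap is enforced.
    final AtomicInteger inFlight = new AtomicInteger();
    final AtomicInteger peak = new AtomicInteger();

    BoundedPredictor(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    double predict(float[] row) throws InterruptedException {
        permits.acquire();            // block while maxConcurrent calls are running
        try {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            return score(row);        // the CPU-bound call
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    // Stand-in for model.predict(...): sum of features, so the sketch
    // runs without CatBoost.
    private static double score(float[] row) {
        double s = 0;
        for (float v : row) s += v;
        return s;
    }
}
```

With the cap near the core count, excess worker threads queue briefly on the semaphore instead of inflating the OS run queue.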
Is there a way to optimize it?