Using tensorflow-decision-forests in a resource constrained environment


I have trained two RandomForestModels to perform regression targeting home team/away team scores in NBA games, and I hope to be able to run their predict methods on an input dataset of usually no more than 10 rows and 56 columns via an Airflow KubernetesPodOperator.

My first question: Is there a more optimized way of saving trained models than the SavedModel format? Saving in this format creates two artifacts that are 1.6GB each. I have to mount these into a Docker image, and the container running the predict method needs to load them into memory with tensorflow.keras.saving.load_model. It doesn't appear to be possible to use the .keras or HDF5 (.h5) formats, which would create much smaller artifacts.
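For reference, the save/load flow looks roughly like this; the model variable, paths, and input dataset are placeholders:

import tensorflow as tf
import tensorflow_decision_forests as tfdf  # needed at load time so the TF-DF custom ops are registered

# After training, each model is exported as a SavedModel (~1.6GB each).
home_model.save("/models/home_score_model")

# In the prediction container, the full SavedModel (and TensorFlow itself)
# has to be loaded back into memory before predict can run.
home_model = tf.keras.saving.load_model("/models/home_score_model")
predictions = home_model.predict(input_ds)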

And that leads me to the next questions. I've run a profiling tool against my program, and these are the results.

heap info: Partition of a set of 1179978 objects. Total size = 175722275 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 353630  30 49081242  28  49081242  28 str
     1  71591   6 28654472  16  77735714  44 types.CodeType
     2 237103  20 18383072  10  96118786  55 tuple
     3 141968  12 14190306   8 110309092  63 bytes
     4  69081   6 10500312   6 120809404  69 function
     5   7479   1  9944648   6 130754052  74 type
     6  23928   2  9309152   5 140063204  80 collections.OrderedDict
     7  19465   2  4184544   2 144247748  82 dict (no owner)
     8   3536   0  3862944   2 148110692  84 dict of module
     9   7479   1  2788544   2 150899236  86 dict of type
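For reference, this is the kind of summary produced by a heap profiler such as guppy3's heapy, collected roughly like this:

from guppy import hpy

hp = hpy()
print(hp.heap())  # surveys objects on the Python heap only; allocations made by native (C/C++) libraries are not counted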

If I'm reading this correctly, the program only uses about 175MB of memory. Yet when I run the same program in a KubernetesPodOperator in Airflow, the pod is OOMKilled. To test, I added a higher-memory node, added a toleration so the pod would schedule on that node, and requested 12GB of memory for the pod, and I still see OOMKilled. This makes me think that something is very wrong...
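For reference, the operator is configured roughly like this; the image name, taint key, and values are placeholders, and the exact import path depends on the cncf.kubernetes provider version:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

predict_task = KubernetesPodOperator(
    task_id="predict_nba_scores",
    name="predict-nba-scores",
    image="my-registry/nba-predict:latest",  # image with the SavedModel artifacts baked in
    container_resources=k8s.V1ResourceRequirements(
        requests={"memory": "12Gi"},
        limits={"memory": "12Gi"},
    ),
    tolerations=[
        k8s.V1Toleration(
            key="high-memory",   # matches the taint on the higher-memory node
            operator="Equal",
            value="true",
            effect="NoSchedule",
        )
    ],
)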

Why is there such a discrepancy between what the profiling tool says and what Kubernetes reports for the pod's memory usage? A memory leak? Is there a better approach to running just prediction, not training, in resource-constrained environments such as a Kubernetes cluster with smaller node types?


1 Answer


TF-DF developer here.

In a resource-constrained environment, you should use the ydf package, which does not depend on TensorFlow (a ~500MB dependency in itself) and is a lot faster than TF-DF. See the project website for more information about the library.

Both packages are front-ends for the same C++ codebase and are developed by the same team. The models are generally cross-compatible between the two.

To run a TF-DF model in YDF, use

import ydf

ydf_model = ydf.from_tensorflow_decision_forests("/path/to/tfdf_model")
# Optionally, use `ydf_model.describe()` to print information about the model

# Make predictions on a Dataframe, TensorFlow dataset, ...
ydf_predictions = ydf_model.predict(test_ds) 
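If you want the prediction container to avoid loading TensorFlow entirely, you can also persist the converted model in YDF's own format and load it back without TF; a minimal sketch, with placeholder paths:

# Save the converted model in YDF's format.
ydf_model.save("/path/to/ydf_model")

# ...later, in the prediction container (no TensorFlow needed):
loaded_model = ydf.load_model("/path/to/ydf_model")
ydf_predictions = loaded_model.predict(test_ds)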

If you're not using any TensorFlow-specific code, you can also consider using YDF for training itself, for better training speed and more features.
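A minimal sketch of training a regression model directly in YDF; the label column and file paths are placeholders:

import pandas as pd
import ydf

train_df = pd.read_csv("/path/to/train.csv")  # placeholder training data

model = ydf.RandomForestLearner(
    label="home_team_score",   # placeholder target column
    task=ydf.Task.REGRESSION,
).train(train_df)

model.save("/path/to/home_score_ydf_model")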