I have trained two RandomForestModels to perform regression targeting home team/away team scores in NBA games, and I hope to be able to run their predict methods on an input dataset of usually no more than 10 rows and 56 columns via an Airflow KubernetesPodOperator.
My first question:
Is there a more optimized way of saving trained models than the SavedModel format? Saving in this format creates two artifacts of 1.6 GB each. I have to mount these in a Docker image, and the container running the predict method needs to load them into memory with tensorflow.keras.saving.load_model. It doesn't appear to be possible to use the Keras or HDF5 formats, which would create much smaller artifacts.
And that leads me to the next questions. I've ran a profiling tool against my program and these are the results.
heap info: Partition of a set of 1179978 objects. Total size = 175722275 bytes.
 Index   Count   %      Size   % Cumulative   %  Kind (class / dict of class)
     0  353630  30  49081242  28   49081242  28  str
     1   71591   6  28654472  16   77735714  44  types.CodeType
     2  237103  20  18383072  10   96118786  55  tuple
     3  141968  12  14190306   8  110309092  63  bytes
     4   69081   6  10500312   6  120809404  69  function
     5    7479   1   9944648   6  130754052  74  type
     6   23928   2   9309152   5  140063204  80  collections.OrderedDict
     7   19465   2   4184544   2  144247748  82  dict (no owner)
     8    3536   0   3862944   2  148110692  84  dict of module
     9    7479   1   2788544   2  150899236  86  dict of type
If I'm reading this correctly, it only uses about 175 MB of memory. Yet when I run the same program in a KubernetesPodOperator in Airflow, the pod is OOMKilled. To test, I added a higher-memory node, added a toleration to schedule this pod on that node, and requested 12 GB of memory for the pod, and I still see OOMKilled. This makes me think that something is very wrong...
Why is there such a discrepancy between what the profiling tool reports and what Kubernetes reports for memory usage? A memory leak? Is there a better approach to running just prediction (not training) in resource-constrained environments such as a Kubernetes cluster with smaller node types?
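One likely source of the discrepancy: Python heap profilers like the one shown above only count Python objects, while Kubernetes enforces the process's total RSS, which also includes memory allocated by native (C/C++) code such as the TensorFlow runtime and the loaded model. A minimal, stdlib-only illustration (the malloc here merely stands in for native allocations a library would make):

```python
import ctypes
import resource
import tracemalloc

tracemalloc.start()

# Allocate 100 MB outside the Python object heap, as a C++ runtime would.
libc = ctypes.CDLL(None)  # libc symbols on Linux
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]
size = 100 * 1024 * 1024
ptr = libc.malloc(size)
ctypes.memset(ptr, 1, size)  # touch the pages so they count toward RSS

# tracemalloc only sees allocations made through Python's allocator...
python_heap, _ = tracemalloc.get_traced_memory()
# ...while the OS-level peak RSS (what the kernel's OOM killer acts on)
# counts everything. ru_maxrss is reported in KB on Linux.
rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"tracked Python heap: {python_heap / 1e6:.1f} MB")
print(f"process peak RSS:    {rss_mb:.0f} MB")
libc.free(ptr)
```

The Python-side number stays tiny while the process RSS jumps by ~100 MB, which mirrors a 175 MB heap report versus a multi-GB pod footprint.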
TF-DF developer here.
In a resource-constrained environment, you should use the ydf package, which has no TensorFlow dependency (~500 MB of it) and is a lot faster than TF-DF. See the project website for more information about the library. Both packages are front-ends for the same C++ codebase and are developed by the same team, and the models are generally cross-compatible between the two.
To run a TF-DF model in YDF, load it with ydf.from_tensorflow_decision_forests.
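A minimal sketch, assuming the ydf package is installed and using hypothetical paths for the two SavedModel directories from the question (the import guard is only there so the sketch degrades gracefully where ydf is absent):

```python
try:
    import ydf  # pip install ydf -- no TensorFlow dependency
except ImportError:
    ydf = None

def load_score_models(home_dir="/models/home_rf", away_dir="/models/away_rf"):
    """Load both TF-DF SavedModels through YDF instead of TensorFlow."""
    if ydf is None:
        raise RuntimeError("The ydf package is required: pip install ydf")
    # from_tensorflow_decision_forests reads a TF-DF SavedModel directory
    # and returns a YDF model with a predict() method.
    return (ydf.from_tensorflow_decision_forests(home_dir),
            ydf.from_tensorflow_decision_forests(away_dir))
```

The returned models accept a pandas DataFrame (or dict of columns) in predict, so the 10-row, 56-column inputs can be passed directly.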
If you're not using any TensorFlow-specific code, you can also consider using YDF for training, for better training speed and more features.
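For completeness, a hedged sketch of training directly in YDF; the column name "home_score" is hypothetical, and the learner/task names follow the YDF documentation:

```python
try:
    import ydf  # pip install ydf
except ImportError:
    ydf = None

def train_home_score_model(train_df):
    """Train a regression random forest on a DataFrame with a 'home_score' column."""
    if ydf is None:
        raise RuntimeError("The ydf package is required: pip install ydf")
    learner = ydf.RandomForestLearner(label="home_score",
                                      task=ydf.Task.REGRESSION)
    return learner.train(train_df)
```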