Note: I'm using TF 2.1.0 and the tf.keras API. I've experienced the issue below with all Horovod versions between 0.18 and 0.19.2.
Are we supposed to call hvd.load_model() on all ranks when resuming from a tf.keras h5 checkpoint, or are we only supposed to call it on rank 0 and let the BroadcastGlobalVariablesCallback callback share the loaded weights with the other workers? Is approach 1 (loading on all ranks) incorrect/invalid, in that it will mess up training or produce different results than approach 2 (loading only on rank 0)?
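For concreteness, here's a minimal sketch of approach 1 (every rank calls hvd.load_model itself). The checkpoint path and training dataset are placeholders, not our real code:

```python
import horovod.tensorflow.keras as hvd

hvd.init()

# Approach 1: every rank independently restores the full model (weights +
# optimizer state) from the same h5 checkpoint. hvd.load_model() also wraps
# the restored optimizer in hvd.DistributedOptimizer.
model = hvd.load_model('checkpoint.h5')  # placeholder path

model.fit(
    train_data,  # placeholder for the real training dataset
    epochs=10,
    # The broadcast should be a no-op here, since all ranks start identical.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```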
I'm currently training a ResNet-based model with some BatchNorm layers. If we load the model only on the first rank (and build/compile the model on the other ranks), we hit a stalled-tensor issue (https://github.com/horovod/horovod/issues/1271). If we instead call hvd.load_model on all ranks when resuming, training resumes normally but seems to diverge almost immediately. So I'm confused: can loading the checkpoint model on all ranks (with hvd.load_model) somehow cause training to diverge? At the same time, we can't load it only on rank 0 because of https://github.com/horovod/horovod/issues/1271, which causes BatchNorm to hang in Horovod. Has anyone successfully called hvd.load_model only on rank 0 when using tf.keras BatchNorm layers? Can someone please provide some tips here?
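For reference, here's roughly what our approach 2 attempt looks like (the variant that hangs for us). build_model() is a hypothetical stand-in for whatever reconstructs the same architecture; the checkpoint path and dataset are again placeholders:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

if hvd.rank() == 0:
    # Only rank 0 restores weights and optimizer state from the checkpoint.
    model = hvd.load_model('checkpoint.h5')  # placeholder path
else:
    # The other ranks build the same architecture from scratch; their weights
    # should be overwritten by rank 0's via the broadcast callback below.
    model = build_model()  # hypothetical helper returning the same architecture
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.001 * hvd.size()))
    model.compile(
        loss='categorical_crossentropy',
        optimizer=opt,
        # Horovod's TF 2.1 examples pass this so the optimizer override
        # takes effect under tf.function.
        experimental_run_tf_function=False,
    )

model.fit(
    train_data,  # placeholder for the real training dataset
    epochs=10,
    # Broadcast rank 0's variables to all other ranks before the first step.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,
)
```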
Thanks!
According to https://github.com/horovod/horovod/issues/120, this is the solution: