How to merge models from distributed training

Here is my code for distributed training with spark-tensorflow-distributor, which uses TensorFlow's MultiWorkerMirroredStrategy to train across multiple servers:

https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-distributor/spark_tensorflow_distributor/mirrored_strategy_runner.py

import sys
from spark_tensorflow_distributor import MirroredStrategyRunner
import mlflow.keras
mlflow.keras.autolog()
mlflow.log_param("learning_rate", 0.001)
import tensorflow as tf
import time
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer 

def train():
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  #tf.distribute.experimental.CollectiveCommunication.NCCL
  model = None
  with strategy.scope():

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
    N, D = X_train.shape # number of observation and variables
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    model = tf.keras.models.Sequential([
      tf.keras.layers.Input(shape=(D,)),
      tf.keras.layers.Dense(1, activation='sigmoid') # single sigmoid unit: outputs the probability of the positive class
    ])

    model.compile(optimizer='adam', # Adam: adaptive moment estimation
        loss='binary_crossentropy',
        metrics=['accuracy']) 

    # Train the Model
    r = model.fit(X_train, y_train, validation_data=(X_test, y_test))

    mlflow.keras.log_model(model, "mymodel")

MirroredStrategyRunner(num_slots=4, use_custom_strategy=True).run(train)

I noticed that saving via mlflow.keras.log_model produces 4 models in the Databricks experiment, and none of the 4 models is a good predictor.

If I change num_slots from 4 to 1, only 1 model is saved in the Databricks experiment, and that model predicts well at inference time.

My question is:

Do I need an extra step to merge the 4 models into 1 model that predicts as well as the one trained with num_slots=1? Or am I doing something wrong? I was expecting only the chief node to save a model.

There is 1 answer below

You do not want to call log_model in all 4 of the TensorFlow workers; you want to log the model from just 1 of them. I believe you would use https://www.tensorflow.org/api_docs/python/tf/distribute/get_replica_context to figure out which worker you are, and perhaps only log if you are worker 0. That is what I do when using Horovod for a similar purpose.
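
As one possible way to implement the "log from only one worker" idea, the sketch below decides whether the current process is the chief by reading the TF_CONFIG environment variable instead of the replica context. It assumes TF_CONFIG is set in each worker process using the standard cluster/task JSON layout that MultiWorkerMirroredStrategy reads. The helper name is_chief is hypothetical and not part of TensorFlow, MLflow, or spark-tensorflow-distributor.

```python
import json
import os


def is_chief(tf_config_json=None):
    """Return True if this process should be the one that logs the model.

    Hypothetical helper: treats the process as chief when TF_CONFIG is
    absent (single-worker run), when its task type is "chief", or when the
    cluster has no dedicated chief and this is worker 0.
    """
    if tf_config_json is None:
        # Assumption: the launcher exported TF_CONFIG for this process.
        tf_config_json = os.environ.get("TF_CONFIG", "")
    if not tf_config_json:
        # No cluster description at all: single-process training.
        return True

    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type")
    task_index = int(task.get("index", 0))

    if "chief" in cluster:
        # A dedicated chief exists, so only that task qualifies.
        return task_type == "chief"
    # Otherwise worker 0 takes over the chief role.
    return task_type == "worker" and task_index == 0
```

Inside train(), the logging call could then be guarded with `if is_chief(): mlflow.keras.log_model(model, "mymodel")`. One caveat: the TensorFlow multi-worker tutorial recommends calling save on every worker, with non-chief workers writing to a temporary directory, so guarding the save itself may need some care.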

You do not merge the models; they are the same model in all 4 replicas. That's the point of what this is doing.

If the model is 'worse' than with 1 replica, I would suspect other, subtler issues are at play. For example, with 4 workers your effective batch size has changed unless you compensate for that. See https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#train_the_model for a discussion.
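
To make the batch-size point concrete, here is a small arithmetic sketch. It follows the linked tutorial's convention that the global batch size is the per-worker batch size multiplied by the number of workers. The function names are hypothetical, and the 398 training rows and batch size of 32 are illustrative assumptions (roughly 70% of a 569-row dataset, and the Keras default batch size), not measurements from the question's run.

```python
import math


def global_batch_size(per_worker_batch_size, num_workers):
    """Global batch size when every worker keeps the same local batch."""
    return per_worker_batch_size * num_workers


def steps_per_epoch(num_examples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)


# Illustrative numbers only (assumptions, see the note above).
num_train_examples = 398
per_worker_batch = 32

single = global_batch_size(per_worker_batch, num_workers=1)  # 32
multi = global_batch_size(per_worker_batch, num_workers=4)   # 128

print(steps_per_epoch(num_train_examples, single))  # 13 updates per epoch
print(steps_per_epoch(num_train_examples, multi))   # 4 updates per epoch
```

Under these assumptions, a single epoch with 4 workers performs far fewer weight updates than with 1 worker, which is one way a model could end up less well trained unless the epochs, batch size, or learning rate are adjusted.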