How to train a model on a cluster with multiple GPUs (single node)


I wish to train a model on a multi-GPU, single-node cluster on AML (Azure ML). I have my cluster (STANDARD_ND40RS_V2, which has 8 V100 GPUs), my Dataset, my Environment (Docker), and my script.

When running the script as a pipeline step, I expected it to run only once, create the relevant batches, and then have each of the 8 GPUs process a single batch in parallel. I have some prints in the main script, but in the log file created by AML those prints appear 8 times instead of once. It seems like the entire script is executed once per GPU instead of only once overall.
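
For context, I believe the step's runconfig is relevant here: as far as I understand the v1 SDK, a distributed runconfig is built around an MpiConfiguration, and process_count_per_node controls how many copies of the entry script AML launches on the node. A minimal sketch of such a runconfig (illustrative only; my actual construction code is not shown here):

    from azureml.core.runconfig import MpiConfiguration, RunConfiguration

    # Sketch: with process_count_per_node=8, AML starts 8 MPI processes on
    # the node, and each process executes the whole entry script.
    runconfig = RunConfiguration(communicator="OpenMpi")
    runconfig.mpi = MpiConfiguration()
    runconfig.mpi.process_count_per_node = 8
    runconfig.node_count = 1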

This is the code I used:

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    script_step = PythonScriptStep(
        name=stage_name,
        script_name="stage_script.py",  # the script name must be a string
        source_directory=self.project_root_path,
        arguments=arguments,
        inputs=data_inputs,
        outputs=[self.data_outputs[stage_name]],
        compute_target=compute_target,
        runconfig=runconfig,
        allow_reuse=True
    )
    pipeline_steps.append(script_step)
    pipeline = Pipeline(workspace=self.ws, steps=pipeline_steps)
    experiment = Experiment(self.ws, self.experiment_name)
    pipeline_run = experiment.submit(pipeline, tags=self.experiment_tag)
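
After submitting, the run can be followed from the submitting process with the standard PipelineRun API:

    pipeline_run.wait_for_completion(show_output=True)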

This is the script stage_script.py:

    import json

    def run_script_stage(input_ds, output_ds, args):
        input_ls, output_ls = print_stage_input_output(input_ds, output_ds, args)
        params = json.loads(args.params)
        print(f"params: {params}")
        # some lines for accessing the data from the blob, and other lines for training the model

The only thing that does run just once is the training process itself. So what is happening here? How should I use a multi-GPU instance for training in a pipeline?
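
From what I can tell, if AML really does launch one process per GPU, the usual pattern would be to guard once-per-job work behind a rank check. A sketch of what I mean (assuming the launcher exposes the process rank via an environment variable such as OMPI_COMM_WORLD_RANK or RANK):

    import os

    # Each launched process gets its own rank; rank 0 is conventionally
    # treated as the "main" process for once-only work such as logging.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", os.environ.get("RANK", "0")))

    if rank == 0:
        print("this print would appear once, from the main process only")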
