Can Vertex AI Training be used for distributed training with the Hugging Face Trainer and DeepSpeed? All the examples I have seen use the native PyTorch distributed strategy.

It would be very helpful if someone could tell me:

  1. Whether DeepSpeed is supported
  2. How to integrate DeepSpeed for multi-node training on Vertex AI

There is 1 best solution below

You can build a custom training image that contains your DeepSpeed training code, push the Docker image to Artifact Registry, and then run the fine-tuning as a custom training job on Vertex AI.
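To make the multi-node part concrete, here is a minimal sketch of the worker pool layout such a job could use. Vertex AI expects the first worker pool to hold exactly one replica (the primary), with the remaining nodes in a second pool. The helper name `build_worker_pool_specs`, the image URI, the machine type, and the accelerator settings are all assumptions for illustration, not values from the linked post.

```python
def build_worker_pool_specs(image_uri, num_nodes, machine_type,
                            accelerator_type, accelerator_count):
    """Return worker pool specs for a multi-node custom container job.

    Pool 0 is the primary node (one replica); pool 1 holds the other nodes.
    Every node runs the same DeepSpeed training image.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")

    def pool(replica_count):
        return {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri},
        }

    specs = [pool(1)]  # primary node
    if num_nodes > 1:
        specs.append(pool(num_nodes - 1))  # remaining worker nodes
    return specs


# Hypothetical values; replace with your own project, repository, and hardware.
specs = build_worker_pool_specs(
    image_uri="us-docker.pkg.dev/my-project/my-repo/deepspeed-train:latest",
    num_nodes=4,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
)
print(specs)
```

With the `google-cloud-aiplatform` SDK installed, a list like this can be passed as the `worker_pool_specs` argument of `aiplatform.CustomJob`. How the container discovers its peers and launches DeepSpeed on each node depends on your entrypoint, so treat this only as the job-submission half.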

This post on Fine-tuning with DeepSpeed and Vertex AI explains it pretty well.
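On the Hugging Face side, the Trainer picks up DeepSpeed through a config file passed to `TrainingArguments`. The sketch below writes a small ZeRO stage 2 config; the stage, the file name `ds_config.json`, and the helper `write_ds_config` are assumptions chosen for illustration. The `"auto"` values let the Trainer fill in settings from its own arguments.

```python
import json
import tempfile
from pathlib import Path

# Minimal DeepSpeed config for use with the Hugging Face Trainer.
# ZeRO stage 2 is an assumed choice; pick the stage that fits your model.
ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "fp16": {"enabled": "auto"},
}


def write_ds_config(directory, filename="ds_config.json"):
    """Write the config as JSON and return the path to the file."""
    path = Path(directory) / filename
    path.write_text(json.dumps(ds_config, indent=2))
    return str(path)


config_path = write_ds_config(tempfile.gettempdir())
print(config_path)

# In the training script inside the image (needs transformers and deepspeed):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", deepspeed=config_path)
```

In a real image you would normally copy the config file into the container at build time rather than generate it at runtime.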