We’re training a big Temporal Fusion Transformer using PyTorch.
We’re looking into using distributed training to accelerate our training jobs on SageMaker.
Does anyone have any examples of this? Any pattern you can recommend?
Copyright © 2021 Jogjafile Inc.
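Although there is no direct example for the Temporal Fusion Transformer, you should be able to follow the documentation below on adapting a PyTorch Lightning (PL) training script for the SageMaker distributed data parallel library: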
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html
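To give an idea of what that page asks you to change, here is a minimal sketch of the Trainer setup, assuming a recent Lightning release and that the script runs inside a SageMaker training container with the data parallel library installed. The helper names `cluster_layout` and `build_trainer` are made up for illustration, and the Lightning import paths differ between versions, so check them against the version you install.

```python
import os


def cluster_layout(environ=os.environ):
    """Derive the process layout from variables set inside the training container."""
    world_size = int(environ["WORLD_SIZE"])   # total number of training processes
    global_rank = int(environ["RANK"])        # index of this process
    num_gpus = int(environ["SM_NUM_GPUS"])    # GPUs on this instance
    return {
        "world_size": world_size,
        "global_rank": global_rank,
        "num_gpus": num_gpus,
        "num_nodes": world_size // num_gpus,
    }


def build_trainer(max_epochs=10):
    """Hypothetical helper: a Lightning Trainer that uses the smddp backend."""
    # Deferred imports: these packages are only expected inside the container.
    import lightning as pl
    from lightning.fabric.plugins.environments.lightning import LightningEnvironment
    from lightning.pytorch.strategies import DDPStrategy
    import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401 (registers "smddp")

    layout = cluster_layout()

    # Tell Lightning about the process layout that SageMaker already set up.
    env = LightningEnvironment()
    env.world_size = lambda: layout["world_size"]
    env.global_rank = lambda: layout["global_rank"]

    strategy = DDPStrategy(
        cluster_environment=env,
        process_group_backend="smddp",
        accelerator="gpu",
    )
    return pl.Trainer(
        max_epochs=max_epochs,
        strategy=strategy,
        devices=layout["num_gpus"],
        num_nodes=layout["num_nodes"],
    )
```

In the training script you would then call `build_trainer().fit(model, datamodule)`, where `model` is your own LightningModule (the TFT in this case) and `datamodule` is your data pipeline. The model code itself should not need changes for data parallelism.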
For a full example of using SageMaker DDP with PyTorch Lightning, see the notebook below:
https://github.com/aws-samples/sagemaker-distributed-training-workshop/blob/main/1_data_parallel/PyTorch%20Lightning%20on%20SageMaker.ipynb
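On the launcher side, the job is usually started from the SageMaker Python SDK with the data parallel library switched on through the `distribution` argument. The sketch below is not taken from the notebook; the entry point, source directory, instance settings and framework versions are placeholder assumptions, and `smddp_estimator_kwargs` and `launch` are hypothetical helper names. Check which instance types and PyTorch versions the library currently supports before using any of these values.

```python
def smddp_estimator_kwargs(
    role,
    entry_point="train.py",            # placeholder: your Lightning training script
    source_dir="src",                  # placeholder: folder with the script and requirements.txt
    instance_count=2,
    instance_type="ml.p4d.24xlarge",   # assumption: a multi-GPU instance type
    framework_version="2.0.1",         # assumption: verify against the supported versions
    py_version="py310",
    hyperparameters=None,
):
    """Build keyword arguments for sagemaker.pytorch.PyTorch with SageMaker DDP enabled."""
    return {
        "role": role,
        "entry_point": entry_point,
        "source_dir": source_dir,
        "instance_count": instance_count,
        "instance_type": instance_type,
        "framework_version": framework_version,
        "py_version": py_version,
        "hyperparameters": hyperparameters or {},
        # This switch enables the SageMaker distributed data parallel library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def launch(role, train_s3_uri):
    """Hypothetical helper: start the training job (needs the sagemaker SDK and AWS credentials)."""
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(**smddp_estimator_kwargs(role))
    estimator.fit({"train": train_s3_uri})
    return estimator
```

Keeping the arguments in a plain dictionary makes it easy to inspect the configuration, or to change `instance_count` for a scaling test, before any job is submitted.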