According to the API reference, one way to optimize data ingestion for distributed training is to use ShardedByS3Key.
Are there code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes are necessary to, e.g., PyTorch's DistributedSampler (should it be used at all?) or TensorFlow's tf.data pipeline?
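For context, ShardedByS3Key splits a channel's S3 objects across the training instances, so each instance downloads only its own subset. A simplified sketch of that behavior (the round-robin assignment and the `shard_by_s3_key` helper are assumptions for illustration, not a SageMaker API):

```python
# Simplified model of ShardedByS3Key: the channel's objects are divided
# among instances, so each instance sees only part of the dataset.
# Round-robin over sorted keys is an assumption made for illustration.
def shard_by_s3_key(keys, num_instances):
    ordered = sorted(keys)
    return [ordered[i::num_instances] for i in range(num_instances)]

shards = shard_by_s3_key(
    ["part-0000", "part-0001", "part-0002", "part-0003"],
    num_instances=2,
)
# -> [["part-0000", "part-0002"], ["part-0001", "part-0003"]]

# Consequence for PyTorch: because the data is already split per node,
# a DistributedSampler configured over the global world size would shard
# a second time; it would then have to shard only across the local GPUs.
```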
If you are using the "Sharded Data Parallelism" technique:
Then simply leave the default mode
FullyReplicated in your TrainingInput's distribution param, because the parallelism does not happen at the level of data division across the upstream instances, but later on the GPUs. See the guide "How to apply sharded data parallelism to your training job" or the full example notebook "Train GPT-2 with near-linear scaling using Sharded Data Parallelism technique in SageMaker Model Parallelism Library". The notebook sets all the required parameters explicitly, step by step.
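For reference, a minimal sketch of such an input channel, keeping the default FullyReplicated mode (the bucket and prefix are placeholders):

```python
# Input channel configuration for the training job.
# With sharded data parallelism you keep the default distribution mode,
# since sharding happens on the GPUs, not at the S3-channel level.
train_channel = {
    "s3_data": "s3://my-bucket/train",   # hypothetical bucket/prefix
    "distribution": "FullyReplicated",   # the default; shown explicitly here
}

# Roughly equivalent to the SDK call (not executed here):
# from sagemaker.inputs import TrainingInput
# train_input = TrainingInput(**train_channel)
```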
For example, you have to set at least the
distribution dict param on the PyTorch (or TensorFlow) estimator to enable SageMaker sharded data parallelism:
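A minimal sketch of that distribution dict, following the sharded-data-parallelism setup in the notebook above (the degree, instance settings, and role are placeholder assumptions; adjust to your cluster):

```python
# Sharded data parallelism is configured through the model parallelism
# library in the estimator's distribution dict. Values are placeholders.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                # Shard optimizer states/gradients across 8 GPUs (assumption).
                "sharded_data_parallel_degree": 8,
            },
        },
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # one process per GPU (assumption)
    },
}

# Passed to the estimator roughly like this (not executed here):
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",
#     role=role,                        # your IAM role
#     instance_type="ml.p4d.24xlarge",  # hypothetical instance type
#     instance_count=1,
#     framework_version="1.12",
#     py_version="py38",
#     distribution=distribution,
# )
```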