According to the API reference, one way to optimize data ingestion for distributed training is to use ShardedByS3Key.
Are there code samples for using ShardedByS3Key in the context of distributed training? Concretely, what changes are necessary to, e.g., PyTorch's DistributedSampler (should it be used at all?) or TensorFlow's tf.data pipeline?
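For context, ShardedByS3Key splits a channel's S3 objects across the training instances, so each instance downloads only its own subset. A simplified sketch of that behavior (the round-robin assignment and the `shard_by_s3_key` helper are assumptions for illustration, not a SageMaker API):

```python
# Simplified model of ShardedByS3Key: the channel's objects are divided
# among instances, so each instance sees only part of the dataset.
# Round-robin over sorted keys is an assumption made for illustration.
def shard_by_s3_key(keys, num_instances):
    ordered = sorted(keys)
    return [ordered[i::num_instances] for i in range(num_instances)]

shards = shard_by_s3_key(
    ["part-0000", "part-0001", "part-0002", "part-0003"],
    num_instances=2,
)
# -> [["part-0000", "part-0002"], ["part-0001", "part-0003"]]

# Consequence for PyTorch: because the data is already split per node,
# a DistributedSampler configured over the global world size would shard
# a second time; it would then have to shard only across the local GPUs.
```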
If you are using the "Sharded Data Parallelism" technique:
Then simply leave the default mode
FullyReplicated in your TrainingInput's distribution param, because the parallelism does not happen at the level of data division across the upstream instances, but later on the GPUs. See the guide "How to apply sharded data parallelism to your training job" or the full example notebook "Train GPT-2 with near-linear scaling using Sharded Data Parallelism technique in SageMaker Model Parallelism Library". The notebook sets all the required parameters explicitly, step by step.
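For reference, a minimal sketch of such an input channel, keeping the default FullyReplicated mode (the bucket and prefix are placeholders):

```python
# Input channel configuration for the training job.
# With sharded data parallelism you keep the default distribution mode,
# since sharding happens on the GPUs, not at the S3-channel level.
train_channel = {
    "s3_data": "s3://my-bucket/train",   # hypothetical bucket/prefix
    "distribution": "FullyReplicated",   # the default; shown explicitly here
}

# Roughly equivalent to the SDK call (not executed here):
# from sagemaker.inputs import TrainingInput
# train_input = TrainingInput(**train_channel)
```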
For example, you have to set at least the
distribution dict param on the PyTorch (or TensorFlow) estimator to enable SageMaker sharded data parallelism:
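A minimal sketch of that distribution dict, following the sharded-data-parallelism setup in the notebook above (the degree, instance settings, and role are placeholder assumptions; adjust to your cluster):

```python
# Sharded data parallelism is configured through the model parallelism
# library in the estimator's distribution dict. Values are placeholders.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                # Shard optimizer states/gradients across 8 GPUs (assumption).
                "sharded_data_parallel_degree": 8,
            },
        },
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,  # one process per GPU (assumption)
    },
}

# Passed to the estimator roughly like this (not executed here):
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",
#     role=role,                        # your IAM role
#     instance_type="ml.p4d.24xlarge",  # hypothetical instance type
#     instance_count=1,
#     framework_version="1.12",
#     py_version="py38",
#     distribution=distribution,
# )
```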