I need to set up GPU-backed instances on AWS Batch. Here's my CloudFormation template (.yaml):
```yaml
GPULargeLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      UserData:
        Fn::Base64:
          Fn::Sub: |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"

            runcmd:
              - yum install -y aws-cfn-bootstrap
              - echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
              - echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config
              - /opt/aws/bin/cfn-init -v --region us-west-2 --stack cool_stack --resource LaunchConfiguration
              - echo "DEVS=/dev/xvda" > /etc/sysconfig/docker-storage-setup
              - echo "VG=docker" >> /etc/sysconfig/docker-storage-setup
              - echo "DATA_SIZE=99%FREE" >> /etc/sysconfig/docker-storage-setup
              - echo "AUTO_EXTEND_POOL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "LV_ERROR_WHEN_FULL=yes" >> /etc/sysconfig/docker-storage-setup
              - echo "EXTRA_STORAGE_OPTIONS=\"--storage-opt dm.fs=ext4 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker-storage-setup
              - /usr/bin/docker-storage-setup
              - yum update -y
              - echo "OPTIONS=\"--default-ulimit nofile=1024000:1024000 --storage-opt dm.basesize=64G\"" >> /etc/sysconfig/docker
              - /etc/init.d/docker restart

            --==BOUNDARY==--
    LaunchTemplateName: GPULargeLaunchTemplate

GPULargeBatchComputeEnvironment:
  DependsOn:
    - ComputeRole
    - ComputeInstanceProfile
  Type: AWS::Batch::ComputeEnvironment
  Properties:
    Type: MANAGED
    ComputeResources:
      ImageId: ami-GPU-optimized-AMI-ID
      AllocationStrategy: BEST_FIT_PROGRESSIVE
      LaunchTemplate:
        LaunchTemplateId:
          Ref: GPULargeLaunchTemplate
        Version:
          Fn::GetAtt:
            - GPULargeLaunchTemplate
            - LatestVersionNumber
      InstanceRole:
        Ref: ComputeInstanceProfile
      InstanceTypes:
        - g4dn.xlarge
      MaxvCpus: 768
      MinvCpus: 1
      SecurityGroupIds:
        - Fn::GetAtt:
            - ComputeSecurityGroup
            - GroupId
      Subnets:
        - Ref: ComputePrivateSubnetA
      Type: EC2
      UpdateToLatestImageVersion: True

MyGPUBatchJobQueue:
  Type: AWS::Batch::JobQueue
  Properties:
    ComputeEnvironmentOrder:
      - ComputeEnvironment:
          Ref: GPULargeBatchComputeEnvironment
        Order: 1
    Priority: 5
    JobQueueName: MyGPUBatchJobQueue
    State: ENABLED

MyGPUJobDefinition:
  Type: AWS::Batch::JobDefinition
  Properties:
    Type: container
    ContainerProperties:
      Command:
        - "/opt/bin/python3"
        - "/opt/bin/start.py"
        - "--retry_count"
        - "Ref::batchRetryCount"
        - "--retry_limit"
        - "Ref::batchRetryLimit"
      Environment:
        - Name: "Region"
          Value: "us-west-2"
        - Name: "LANG"
          Value: "en_US.UTF-8"
      Image:
        Fn::Sub: "cool_1234_abc.dkr.ecr.us-west-2.amazonaws.com/my-image"
      JobRoleArn:
        Fn::Sub: "arn:aws:iam::cool_1234_abc:role/ComputeRole"
      Memory: 16000
      Vcpus: 1
      ResourceRequirements:
        - Type: GPU
          Value: '1'
    JobDefinitionName: MyGPUJobDefinition
    Timeout:
      AttemptDurationSeconds: 500
```
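For context, a submission against this queue and job definition would look roughly like the sketch below; the job name and parameter values are just placeholders, and the `batchRetryCount`/`batchRetryLimit` parameters feed the `Ref::` placeholders in the job definition's Command.

```bash
# Hypothetical submission; resource names come from the template above,
# the job name and parameter values are placeholders.
aws batch submit-job \
  --region us-west-2 \
  --job-name my-gpu-test-job \
  --job-queue MyGPUBatchJobQueue \
  --job-definition MyGPUJobDefinition \
  --parameters batchRetryCount=1,batchRetryLimit=3
```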
When I submit a job, it gets stuck in the RUNNABLE state forever. Here is what I have tried so far:

- When I swapped the instance type to a normal CPU type, redeployed the CloudFormation stack, and submitted a job, the job ran and succeeded fine, so something must be missing or wrong in how I'm using these GPU instance types on AWS Batch.
- I then found this post, so I added an `ImageId` field to my ComputeEnvironment with a known GPU-optimized AMI, but still no luck.
- I did a side-by-side comparison of the jobs between the working CPU setup and the non-working GPU setup via `aws batch describe-jobs --jobs AWS_BATCH_JOB_EXECUTION_ID --region us-west-2`, and found that the `containerInstanceArn` and `taskArn` fields are simply missing from the non-working GPU job (see the CLI sketch after this list).
- The GPU instance does appear in the Auto Scaling Group created by the compute environment, but when I go to ECS and open this GPU cluster, there are no container instances associated with it, unlike the working CPU case, where the ECS cluster does have container instances.
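For reference, the comparison I mean looks roughly like this; the job IDs and the ECS cluster name are placeholders, not values from my stack:

```bash
# Compare a working CPU job with a stuck GPU job: on the GPU job,
# container.containerInstanceArn and container.taskArn are simply missing.
aws batch describe-jobs --jobs <CPU_JOB_ID> <GPU_JOB_ID> --region us-west-2 \
  --query 'jobs[].{id:jobId,status:status,containerInstanceArn:container.containerInstanceArn,taskArn:container.taskArn}'

# Check whether any container instances have registered with the ECS cluster
# backing the GPU compute environment (an empty list means nothing joined).
aws ecs list-container-instances --cluster <GPU_BATCH_ECS_CLUSTER_NAME> --region us-west-2
```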
Any ideas how to fix this would be greatly appreciated!
This was definitely a great learning experience. Here is what I did, what I found, and how I resolved the issue:

- Run the AWSSupport-TroubleshootAWSBatchJob runbook, which turned out to be helpful (make sure you choose the right region before running it).
- Set `ImageId: ami-019d947e77874eaee` in my template and redeploy. Then you can use a few commands to check the status of your GPU EC2 instance:
  - `systemctl status ecs` should show the ECS agent up and running, so that your GPU instance can join your ECS cluster;
  - `sudo docker info` should return info showing that Docker is running;
  - `nvidia-smi` should return info showing that your NVIDIA driver is properly installed and running (it prints the driver version, CUDA version, and the attached GPU).
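Collected in one place, a minimal sketch of those on-instance checks (assuming you can reach the instance, e.g. over SSH or SSM Session Manager, which the template above does not set up):

```bash
# Run these on the GPU instance itself (access via SSH or SSM Session
# Manager is an assumption, not part of the original setup).

# The ECS agent must be active; if it isn't, the instance never registers
# with the ECS cluster and Batch jobs stay in RUNNABLE.
systemctl status ecs

# Docker should be up and able to report its runtime configuration.
sudo docker info

# The NVIDIA driver should be installed and able to see the GPU.
nvidia-smi
```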