I am trying to implement the following example:
https://medium.com/@sayons/transfer-learning-with-amazon-sagemaker-and-fsx-for-lustre-378fa8977cc1
but I am getting the following error:
UnexpectedStatusException: Error for Training job tensorflow-training-2023-04-14-XX-XX-XX-XXX: Failed. Reason: ClientError: Artifact upload failed:Please ensure that the subnet's route table has a route to an S3 VPC endpoint or a NAT device, and both the security groups and the subnet's network ACL allow uploading data to all output URIs
How can I fix it or at least how can I analyze what the specific issue is?
It seems:
- I have set up the file system successfully.
- I linked the file system with a folder in an S3 bucket.
- I have an ACL-enabled bucket.
- My route table has the following two entries. How can I find the values that should be there so that I have access to my S3 bucket?

Destination       Target
172.31.0.0/16     local
0.0.0.0/0         igw-XXXXXXXX

I am not sure what an "S3 VPC endpoint" is, how to check whether I have a functioning one, or how to link it with the route table. I also don't know how to "ensure that both the security groups and the subnet's network ACL allow uploading data to all output URIs" (see the diagnostic sketch after this list).
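Here is a minimal boto3 diagnostic sketch for inspecting the VPC endpoints, the subnet's route table, the security group egress rules, and the network ACL. The subnet and security group IDs are the same placeholders as in my training code below; the script only prints what is configured and changes nothing.

import boto3

ec2 = boto3.client("ec2")

subnet_id = "subnet-XXXXXXXX"       # placeholder, same as in the training code below
security_group_id = "sg-XXXXXXX"    # placeholder

# The VPC the training subnet belongs to.
vpc_id = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]["VpcId"]

# 1) VPC endpoints in this VPC (look for a Gateway endpoint whose service name ends in ".s3").
for ep in ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])["VpcEndpoints"]:
    print(ep["VpcEndpointId"], ep["VpcEndpointType"], ep["ServiceName"], ep["RouteTableIds"])

# 2) Routes of the route table associated with the subnet.
for rt in ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])["RouteTables"]:
    for route in rt["Routes"]:
        destination = route.get("DestinationCidrBlock") or route.get("DestinationPrefixListId")
        target = route.get("GatewayId") or route.get("NatGatewayId")
        print(rt["RouteTableId"], destination, target)

# 3) Security group egress rules (the default rule allows all outbound traffic).
sg = ec2.describe_security_groups(GroupIds=[security_group_id])["SecurityGroups"][0]
print(sg["IpPermissionsEgress"])

# 4) Outbound (egress) entries of the subnet's network ACL; HTTPS (port 443) to 0.0.0.0/0 must be allowed.
for acl in ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])["NetworkAcls"]:
    for entry in acl["Entries"]:
        if entry["Egress"]:
            print(acl["NetworkAclId"], entry["RuleNumber"], entry["RuleAction"], entry.get("CidrBlock"))

Note that if the subnet has no explicit route table association, the route table query returns nothing and the VPC's main route table applies instead.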
My code is:
from sagemaker.inputs import FileSystemInput
# Specify file system id.
file_system_id = "fs-XXXXXXXXXXXXXXXXX"
#FSx_SM_Input
# Specify directory path associated with the file system. You need to provide normalized and absolute path here.
file_system_directory_path = "/YYYYYYY/XXXX"
# Specify the access mode of the mount of the directory associated with the file system.
# Directory can be mounted either in 'ro'(read-only) or 'rw' (read-write).
file_system_access_mode = "rw"
# Specify your file system type, "EFS" or "FSxLustre".
file_system_type = "FSxLustre"
# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.
security_groups_ids = ["sg-XXXXXXX"]
subnets = ["subnet-XXXXXXXX"]
fs_train_input = FileSystemInput(
    file_system_id=file_system_id,
    file_system_type=file_system_type,
    directory_path=file_system_directory_path,
    file_system_access_mode=file_system_access_mode,
)
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
hyperparameters = {"input_mode": "File", #"FastFile", "Pipe", "File" -- FOR FSx "File" ONLY!!!
"shards_on_input": 4 # 4, 120 -- Shards in S3
}
train_input = sagemaker.inputs.TrainingInput("s3://mnist-tdrecords/train/{}".format(hyperparameters["shards_on_input"]),
input_mode = hyperparameters["input_mode"],
distribution = 'FullyReplicated' #'ShardedByS3Key', 'FullyReplicated'
)
tf_estimator = TensorFlow(
    entry_point="AWS_DataPipping_TFMirroredStrategy.py",
    source_dir="./",
    framework_version="2.3",
    py_version="py37",
    instance_type="ml.p3.2xlarge",  # "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge"
    instance_count=1,
    role=sagemaker.get_execution_role(),
    subnets=subnets,
    security_group_ids=security_groups_ids,
    hyperparameters=hyperparameters,
    output_path="s3://mnist-tdrecords/output",
    input_mode=hyperparameters["input_mode"],  # "File", "Pipe", "FastFile"
)
s3_data_channels = {"train": fs_train_input}
#s3_data_channels = {"train": "s3://mnist-tdrecords/train/{}".format(hyperparameters["shards_on_input"])}
#"validation": f"s3://{bucket_name}/data/validation",}
tf_estimator.fit(s3_data_channels)
Thx, SebTac
It turns out that I just had to create a new VPC endpoint of type Gateway. The error message is very misleading: it talks about an "S3 VPC endpoint", but there is no resource with that name in the console; there are only "VPC endpoints". Realizing that tells you where to look for the solution. Reading the VPC endpoint docs, you can see that the documentation for the Gateway type talks a lot about its use with S3. Once you create one, you can specify the route table associated with it and thereby provide the missing entry in the route table.
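For reference, here is a minimal sketch of that fix with boto3, assuming placeholder VPC and route table IDs and us-east-1 only as an example region; the same can be done in the VPC console under "Endpoints" > "Create endpoint".

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # example region

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",                    # Gateway is the endpoint type used for S3
    VpcId="vpc-XXXXXXXX",                         # the VPC containing the training subnet
    ServiceName="com.amazonaws.us-east-1.s3",     # S3 service name for the region
    RouteTableIds=["rtb-XXXXXXXX"],               # the route table associated with the subnet
)
print(response["VpcEndpoint"]["VpcEndpointId"])

# The route table then gains an entry whose destination is the S3 prefix list
# (pl-...) and whose target is the new vpce-... endpoint.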
The other difficulty in getting to the solution was that the AWS terms used in the post can refer to completely different things online, or are discussed in completely different contexts.
A big ask to the AWS team: standardize the nomenclature, and provide more meaningful error messages and docs.