Accessing S3 bucket object in TFX pipeline with S3FS


I'm building a TFX pipeline that takes images from an S3 bucket as input. In the TF Transform component, I'm attempting to read in a series of images whose URLs are stored in TFX's SparseTensor format. I'm using the s3fs Python module to do so, since I've been using it in other components of my pipeline and have heard that mixing Boto3 and s3fs can cause issues (this is beside the point, I think).

Anyway, I've established a connection to the S3 bucket and am attempting to read in images. Here is my code (or at least the part of it I think is germane to the issue):

  s3 = s3fs.S3FileSystem()

  for key in CV_FEATURES:
    with s3.open(str(inputs[key]), 'rb') as f:
      img = np.array(Image.open(io.BytesIO(f.read())))
      img = tf.image.rgb_to_grayscale(img)
      img = tf.divide(img, 255)
      img = tf.image.resize_with_pad(img, 224, 224)
      outputs[_fill_in_missing(key)] = img

  s3.clear_instance_cache()

Running this gives me the standard error message I've seen when trying to access buckets with invalid characters:

ParamValidationError: Parameter validation failed: Invalid bucket name "SparseTensor(indices=Tensor("inputs": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).:(s3|s3-object-lambda):[a-z-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-]{1,63}$|^arn:(aws).:s3-outposts:[a-z-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9-]{1,63}$"
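For what it's worth, the bucket-name check appears to be botocore's parameter validation: whatever text precedes the first path separator is matched against that regex as a bucket name. A quick sketch shows why the stringified SparseTensor fails it (the pattern below is copied from the error message, with the hyphen escaped since the message's rendering likely dropped a backslash; the sample strings are illustrative):

```python
import re

# Bucket-name pattern as it appears in the ParamValidationError
# (hyphen escaped -- the error text likely lost the backslash).
BUCKET_RE = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")

print(bool(BUCKET_RE.match("my-bucket")))  # True: a plain bucket name passes
print(bool(BUCKET_RE.match('SparseTensor(indices=Tensor("inputs')))  # False: repr text fails
```

So botocore is literally being handed the repr of the tensor, not the path stored inside it.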

The error indicates the problem is with the line with s3.open(str(inputs[key]), 'rb') as f:, so somehow I need to represent the S3 URL correctly. The URLs are stored in the format bucket_name\key\file.jpg in a column called image_path in the original CSV dataset (converted to a SparseTensor before this point and represented in the code above as inputs[key]).
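Assuming the raw path string can be recovered from the sparse tensor's values (rather than calling str() on the whole tensor), something like the following hypothetical helper would normalize the stored backslash-separated path into the bucket/key form that s3fs.open() accepts:

```python
def to_s3_path(raw):
    """Normalize a stored path like b'bucket_name\\key\\file.jpg' into
    'bucket_name/key/file.jpg', the form s3fs.open() accepts.
    (Hypothetical helper, for illustration only.)"""
    path = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)
    return path.replace("\\", "/")

print(to_s3_path(b"bucket_name\\key\\file.jpg"))  # bucket_name/key/file.jpg
```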

I don't think the issue is with the SparseTensor format, but rather the URL.
