I have an AWS Lambda function (Python 3.8) with pyarrow 9.0.0 and s3fs bundled together in a layer.

The function reads multiple JSON files one by one and writes them out as a Parquet dataset, partitioned by year, month, and day, to an S3 location.

When I execute it, AWS reports "Calling the invoke API action failed with this message: Network Error", or, if I retry soon after, "Calling the invoke API action failed with this message: Rate Exceeded.".

The function is as follows:

import pyarrow.parquet as pq
from pyarrow import fs
import pyarrow

# The bucket does exist, but the partition folders do not always.
s3_fs = fs.S3FileSystem(region='eu-west-1', allow_bucket_creation=True)

pq.write_to_dataset(ddf,  # the data assembled from the JSON files
                    s3uri_parquet_dataset,  # "s3://" prefix removed beforehand
                    use_legacy_dataset=False,
                    filesystem=s3_fs,
                    compression="gzip",
                    partition_cols=["partitionDateYear", "partitionDateMonth", "partitionDateDay"],
                    basename_template=get_unique_name(),  # must contain "{i}" or pyarrow raises ValueError
                    )

I have a faint feeling that this might have to do with multiprocessing's Pool not being available on AWS Lambda (the runtime has no shared memory), but I have nothing to prove it, so it might just as well be something else; other people don't seem to have this problem.
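
If that hunch is right, it should be testable by forcing pyarrow into single-threaded mode. Here is a minimal sketch of what I intend to try, assuming pq.write_to_dataset with use_legacy_dataset=False accepts a use_threads flag and forwards it to the dataset writer (which the pyarrow 9 docs suggest):

import pyarrow as pa

# Shrink pyarrow's internal thread pools before writing.
pa.set_cpu_count(1)        # CPU-bound work (encoding, compression)
pa.set_io_thread_count(1)  # I/O work (the S3 uploads)

pq.write_to_dataset(ddf,
                    s3uri_parquet_dataset,
                    use_legacy_dataset=False,
                    filesystem=s3_fs,
                    compression="gzip",
                    partition_cols=["partitionDateYear", "partitionDateMonth", "partitionDateDay"],
                    basename_template=get_unique_name(),
                    use_threads=False)  # disable parallelism in the dataset writer itself

If this succeeds where the original call fails, that would point at pyarrow's parallelism clashing with the Lambda runtime rather than at the partitioning logic itself.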

One hint: when I comment out the partition_cols parameter, the Lambda function does its job and succeeds, but the outcome is of course an unpartitioned dataset, which is not what I want.
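
Since the unpartitioned write works, a workaround I am considering is to sidestep the dataset writer entirely and write one partition at a time with plain pq.write_table. This is only a sketch under my assumptions (ddf is a pyarrow Table, and Table.group_by with an empty aggregation list yields the distinct partition key combinations, as available since pyarrow 7):

import uuid
import pyarrow.compute as pc

partition_cols = ["partitionDateYear", "partitionDateMonth", "partitionDateDay"]

# Distinct combinations of the partition column values.
combos = ddf.group_by(partition_cols).aggregate([]).to_pylist()

for combo in combos:
    # Boolean mask selecting the rows belonging to this partition.
    mask = pc.equal(ddf[partition_cols[0]], combo[partition_cols[0]])
    for col in partition_cols[1:]:
        mask = pc.and_(mask, pc.equal(ddf[col], combo[col]))
    part = ddf.filter(mask).drop(partition_cols)

    # Build the Hive-style partition path by hand and write a single file into it.
    path = (f"{s3uri_parquet_dataset}"
            f"/partitionDateYear={combo['partitionDateYear']}"
            f"/partitionDateMonth={combo['partitionDateMonth']}"
            f"/partitionDateDay={combo['partitionDateDay']}"
            f"/part-{uuid.uuid4().hex}.parquet")
    pq.write_table(part, path, filesystem=s3_fs, compression="gzip")

This keeps everything on the single pq.write_table code path that already works for me, at the cost of one write per partition.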

Please do offer some suggestions to try; I am fairly new to AWS and this is quite an undertaking!
