I have an AWS Lambda function (Python 3.8) with pyarrow 9.0.0 and s3fs bundled together in a layer.
The function reads multiple JSON files one by one and writes them out as a Parquet dataset, partitioned by year, month and day, to an S3 location.
When it runs, AWS reports "Calling the invoke API action failed with this message: Network Error", or, if I retry soon after, "Calling the invoke API action failed with this message: Rate Exceeded.".
The function is as follows:
import pyarrow
import pyarrow.parquet as pq
from pyarrow import fs

# The bucket does exist, but the partitioning folders do not always.
s3_fs = fs.S3FileSystem(region='eu-west-1', allow_bucket_creation=True)

pq.write_to_dataset(
    ddf,
    s3uri_parquet_dataset,  # "s3://" prefix removed prior to this call
    use_legacy_dataset=False,
    filesystem=s3_fs,
    compression="gzip",
    partition_cols=["partitionDateYear", "partitionDateMonth", "partitionDateDay"],
    basename_template=get_unique_name(),
)
I have a faint suspicion that this might have to do with multiprocessing's Pool not being available on AWS Lambda (there is no shared memory), but I have nothing to prove it, so it might just as well be something else; other people don't seem to have this problem.
One hint: when I comment out the partition_cols parameter, the Lambda function does its job and succeeds, but the outcome is then an unpartitioned dataset, which is undesired.
Please do offer some suggestions to try; I am fairly new to AWS and this is quite an undertaking!