I am running an AWS Lambda function written in Node.js. This Lambda receives JSON input that I need to transform into Parquet format before writing it to S3.
Currently, I'm using the parquetjs library to convert the JSON data to Parquet. However, the parquet.ParquetWriter.openFile function creates a local file, and I need to write directly to S3.
Ideally, I would like to convert the JSON data to Parquet in memory and then send it directly to S3. Since this Lambda function will be heavily used, I need to optimize it for high loads and avoid writing to a local disk.
What would be the best practice for achieving this?
Thank you in advance for your help!
Using the out-of-the-box dependencies, you will have to write the JSON-to-Parquet conversion output to a local file first. You can then stream-read that file and upload it to S3.
AWS Lambda includes a 512 MB temporary file system (/tmp) for your code, and writing to it doesn't cause any noticeable performance hit. Depending on the size of your payload, you may need to increase it, up to 10 GB. Pseudo-code (1):
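A minimal sketch of that approach, assuming an API Gateway-style event whose body is an array of flat records; the schema, bucket, and key are placeholders, and the parquetjs and AWS SDK v3 calls may need adjusting to your versions:

```js
// Sketch only: write Parquet to /tmp, then stream the file to S3.
// The schema, bucket, and key below are placeholders.
const fs = require('fs');
const parquet = require('parquetjs');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

exports.handler = async (event) => {
  // Assuming the incoming JSON payload is an array of flat records in event.body.
  const rows = JSON.parse(event.body);

  // Placeholder schema -- must match your actual fields.
  const schema = new parquet.ParquetSchema({
    id:    { type: 'UTF8' },
    value: { type: 'DOUBLE' },
  });

  const tmpPath = '/tmp/output.parquet';
  const writer = await parquet.ParquetWriter.openFile(schema, tmpPath);
  for (const row of rows) {
    await writer.appendRow(row);
  }
  await writer.close();

  // Stream the temporary file to S3 rather than buffering it in memory.
  await s3.send(new PutObjectCommand({
    Bucket: 'my-output-bucket', // placeholder
    Key: `parquet/${Date.now()}.parquet`,
    Body: fs.createReadStream(tmpPath),
    ContentLength: fs.statSync(tmpPath).size,
  }));

  // Clean up /tmp so warm invocations don't fill the 512 MB volume.
  fs.unlinkSync(tmpPath);
};
```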
Depending on the throughput of requests, you may need an SQS queue between the services to perform the transformation in batches. For example:
Request -> Lambda -> S3/json -> S3 Notification -> SQS (batching, e.g., 50 messages) -> Lambda transformation -> S3/parquet
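A rough sketch of the batch transformation Lambda consuming the SQS messages, assuming each message carries the S3 key of one JSON object; the bucket names, schema, and message format are placeholders:

```js
// Sketch only: SQS-triggered Lambda that reads JSON objects from S3,
// converts the batch to a single Parquet file, and writes it back to S3.
const fs = require('fs');
const parquet = require('parquetjs');
const { S3Client, GetObjectCommand, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

exports.handler = async (event) => {
  // Placeholder schema -- must match your actual fields.
  const schema = new parquet.ParquetSchema({
    id:    { type: 'UTF8' },
    value: { type: 'DOUBLE' },
  });

  const tmpPath = '/tmp/batch.parquet';
  const writer = await parquet.ParquetWriter.openFile(schema, tmpPath);

  // event.Records holds up to the configured batch size (e.g. 50) of SQS messages,
  // each assumed to contain the S3 key of one JSON object.
  for (const record of event.Records) {
    const { key } = JSON.parse(record.body);
    const obj = await s3.send(new GetObjectCommand({ Bucket: 'my-json-bucket', Key: key }));
    const rows = JSON.parse(await obj.Body.transformToString());
    for (const row of rows) {
      await writer.appendRow(row);
    }
  }
  await writer.close();

  await s3.send(new PutObjectCommand({
    Bucket: 'my-parquet-bucket', // placeholder
    Key: `parquet/${Date.now()}.parquet`,
    Body: fs.createReadStream(tmpPath),
    ContentLength: fs.statSync(tmpPath).size,
  }));

  fs.unlinkSync(tmpPath);
};
```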
Another solution would be to use AWS Glue to transform the S3 objects from JSON to Parquet: https://hkdemircan.medium.com/how-can-we-json-css-files-transform-to-parquet-through-aws-glue-465773b43dad

The flow would be:

Request -> Lambda -> S3/json and S3/json <- Glue Crawler -> S3/parquet

You can run the crawler on a schedule (every X minutes) or trigger it via S3 events.
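If you go the Glue route and want to trigger it from S3 events rather than a schedule, a minimal sketch of an S3-triggered Lambda that starts an existing crawler (the crawler name is a placeholder and the crawler is assumed to be already configured):

```js
// Sketch only: S3-event-triggered Lambda that starts an existing Glue crawler.
// 'json-to-parquet-crawler' is a placeholder name.
const { GlueClient, StartCrawlerCommand } = require('@aws-sdk/client-glue');

const glue = new GlueClient({});

exports.handler = async () => {
  try {
    await glue.send(new StartCrawlerCommand({ Name: 'json-to-parquet-crawler' }));
  } catch (err) {
    // The crawler may already be running from a previous S3 event; ignore that case.
    if (err.name !== 'CrawlerRunningException') {
      throw err;
    }
  }
};
```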