Situation: A csv lands into AWS S3 every month. The vendor adds/removes/modifies columns from the file as they please. So the schema is not known ahead of time. The requirement is to create a table on-the-fly in Snowflake and load the data into said table. Matillion is our ELT tool.
This is what I have done so far.
- Setup a Lambda to detect the arrival of the file, convert it to JSON, upload to another S3 dir and adds filename to SQS.
- Matillion detects SQS message and loads the file with the JSON Data into Variant column in a SF table.
- SF Stored proc takes the variant column and generates a table based on the number of fields in the JSON data. The VARIANT column in SF only works in this way if its JSON data. CSV is sadly not supported.
This works with 10,000 rows. The problem arises when I run this with a full file which is over 1GB, which is over 10M rows. It crashes the lambda job with an out of disk space error at runtime.
These are the alternatives I have thought of so far:
- Attach an EFS volume to the lambda and use it to store the JSON file prior to the upload to S3. JSON data files are so much larger than their CSV counterparts, I expect the json file to be around 10-20GB since the file has over 10M rows.
- Matillion has an Excel Query component where it can take the headers and create a table on the fly and load the file. I was thinking I can convert the header row from the CSV into a XLX file within the Lambda, pass it to over to Matillion, have it create the structures and then load the csv file once the structure is created.
What are my other options here? Considerations include a nice repeatable design pattern to be used for future large CSVs or similar requirements, costs of the EFS, am I making the best use of the tools that I are avaialable to me? Thanks!!!
Why are you converting CSV into JSON; CSV is directly being loaded into table without doing any data transformation specifically required in case of JSON, the lateral flatten to convert json into relational data rows; and why not use Snowflake Snowpipe feature to load data directly into Snowflake without use of Matallion. You can split large csv files into smaller chunks before loading into Snowflake ; this will help in distributing the data processing loads across SF Warehouses.