Data Ingestion: Load Dynamic Files from S3 to Snowflake


Situation: A CSV file lands in AWS S3 every month. The vendor adds, removes, or modifies columns in the file as they please, so the schema is not known ahead of time. The requirement is to create a table on the fly in Snowflake and load the data into it. Matillion is our ELT tool.

This is what I have done so far.

  1. Set up a Lambda that detects the arrival of the file, converts it to JSON, uploads it to another S3 dir, and adds the filename to SQS.
  2. Matillion detects the SQS message and loads the file with the JSON data into a VARIANT column in a Snowflake table.
  3. A Snowflake stored procedure takes the VARIANT column and generates a table based on the number of fields in the JSON data (a rough sketch of this step follows the list). The VARIANT column in Snowflake only works this way if it holds JSON data; CSV is sadly not supported.
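For illustration, here is roughly what step 3 amounts to, sketched with the Snowflake Python connector rather than as a stored procedure. The table and column names (raw_json, v, vendor_data) are placeholders, and every derived column is simply typed VARCHAR:

```python
# Sketch: derive a table definition from the top-level JSON keys found in a
# VARIANT column, then load it. Names below are placeholders for illustration.
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="load_wh", database="ingest", schema="public",
).cursor()

# Collect the distinct top-level keys present across the JSON documents.
cur.execute("""
    SELECT DISTINCT f.key
    FROM raw_json, LATERAL FLATTEN(input => v) f
""")
keys = [row[0] for row in cur.fetchall()]

# Build and run a CREATE TABLE plus an INSERT ... SELECT that pulls each key
# out of the VARIANT column as a string (assumes keys contain no double quotes).
cols = ", ".join(f'"{k}" VARCHAR' for k in keys)
selects = ", ".join(f'v:"{k}"::VARCHAR' for k in keys)
cur.execute(f"CREATE OR REPLACE TABLE vendor_data ({cols})")
cur.execute(f"INSERT INTO vendor_data SELECT {selects} FROM raw_json")
```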

This works with 10,000 rows. The problem arises when I run it against a full file, which is over 1 GB (more than 10M rows): the Lambda crashes with an out-of-disk-space error at runtime.

These are the alternatives I have thought of so far:

  1. Attach an EFS volume to the Lambda and use it to store the JSON file prior to the upload to S3. JSON data files are so much larger than their CSV counterparts that I expect the JSON file to be around 10-20 GB, since the file has over 10M rows. (A streaming approach that avoids local storage entirely is sketched after this list.)
  2. Matillion has an Excel Query component that can take the headers, create a table on the fly, and load the file. I was thinking I could convert the header row of the CSV into an XLSX file within the Lambda, pass it over to Matillion, have it create the structure, and then load the CSV file once the structure is created.
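On option 1: one way to sidestep the local-storage question is to never materialize the JSON at all, streaming the CSV out of S3 line by line and writing newline-delimited JSON back through an S3 multipart upload. A minimal sketch, assuming the bucket/key names below and that CSV fields contain no embedded newlines:

```python
# Sketch: stream the CSV from S3 and write NDJSON back with a multipart upload,
# so neither /tmp nor EFS is needed. Bucket, keys and part size are assumptions.
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
SRC_BUCKET, SRC_KEY = "vendor-landing", "incoming/file.csv"
DST_BUCKET, DST_KEY = "vendor-landing", "json/file.ndjson"
PART_SIZE = 8 * 1024 * 1024  # multipart parts must be >= 5 MB (except the last)

body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]
lines = (line.decode("utf-8") for line in body.iter_lines())
reader = csv.DictReader(lines)  # header row becomes the JSON keys

upload = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)
parts, part_no, buf = [], 1, io.BytesIO()

def flush():
    """Upload the buffered bytes as the next part and reset the buffer."""
    global part_no, buf
    resp = s3.upload_part(Bucket=DST_BUCKET, Key=DST_KEY,
                          UploadId=upload["UploadId"],
                          PartNumber=part_no, Body=buf.getvalue())
    parts.append({"ETag": resp["ETag"], "PartNumber": part_no})
    part_no += 1
    buf = io.BytesIO()

for row in reader:
    buf.write((json.dumps(row) + "\n").encode("utf-8"))
    if buf.tell() >= PART_SIZE:
        flush()
if buf.tell():
    flush()  # the final part may be smaller than 5 MB

s3.complete_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY,
                             UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```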

What are my other options here? Considerations include a nice repeatable design pattern that can be reused for future large CSVs or similar requirements, the cost of EFS, and whether I am making the best use of the tools available to me. Thanks!!!


There are 4 best solutions below

Answer 1:

Why are you converting CSV into JSON? CSV can be loaded directly into a table without the data transformation that JSON specifically requires (the LATERAL FLATTEN to turn JSON into relational rows). And why not use the Snowflake Snowpipe feature to load the data directly into Snowflake without using Matillion? You can also split large CSV files into smaller chunks before loading into Snowflake; this will help distribute the processing load across Snowflake warehouses.
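A hedged sketch of the Snowpipe suggestion, with placeholder stage/pipe/table names, assuming the landing table already exists (which is the catch here, given the unknown schema):

```python
# Sketch: skip the JSON conversion and let Snowpipe COPY the CSV straight into
# a landing table. AUTO_INGEST also requires the S3 bucket's event notifications
# to be wired to the pipe's SQS queue (SHOW PIPES reveals the queue ARN).
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="load_wh", database="ingest", schema="public",
).cursor()

cur.execute("""
    CREATE FILE FORMAT IF NOT EXISTS vendor_csv
      TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1
""")

# The pipe turns every new file landing on the stage into an automatic COPY INTO.
cur.execute("""
    CREATE PIPE IF NOT EXISTS vendor_pipe AUTO_INGEST = TRUE AS
      COPY INTO vendor_landing
      FROM @vendor_stage
      FILE_FORMAT = (FORMAT_NAME = 'vendor_csv')
""")
```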

Answer 2:

Why not split the initial csv file into multiple files and then process each file in the same way you currently are?
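A minimal sketch of that idea: stream the large CSV from S3 and re-upload it in fixed-size chunks, repeating the header in each chunk so the existing Lambda/Matillion flow can process them one at a time. Bucket names, key layout, and chunk size are assumptions, and fields are assumed to contain no embedded newlines:

```python
# Sketch: split a large CSV in S3 into smaller CSVs, each carrying the header.
import boto3

s3 = boto3.client("s3")
SRC_BUCKET, SRC_KEY = "vendor-landing", "incoming/file.csv"
CHUNK_ROWS = 500_000  # data rows per output file, excluding the header

body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]
lines = body.iter_lines()
header = next(lines)  # keep the original header for every chunk

def ship(rows, part_no):
    """Upload one chunk, prefixed with the original header row."""
    key = f"incoming/chunks/file_{part_no:04d}.csv"
    s3.put_object(Bucket=SRC_BUCKET, Key=key,
                  Body=b"\n".join([header] + rows) + b"\n")

chunk, part = [], 0
for line in lines:
    chunk.append(line)
    if len(chunk) == CHUNK_ROWS:
        ship(chunk, part)
        chunk, part = [], part + 1
if chunk:
    ship(chunk, part)
```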

Answer 3:

I also load CSV files from SFTP into Snowflake, using Matillion, with no idea of the schema.

In my process, I create a "temp" table in Snowflake with 50 VARCHAR columns (our files should never exceed 50 columns). Our data always contains text, dates, or numbers, so VARCHAR isn't a problem. I can then load the .csv file into the temp table. I believe this should work for files coming from S3 as well.

That will at least get the data into Snowflake. How to create the "final" table however, given your scenario, I'm not sure. I can imagine being able to use the header row, and/or doing some analysis on the 'type' of data contained in each column, to determine the column type needed.

But if you can get the 'final' table created, you could move the data over from temp. Or alter the temp table itself.
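A rough sketch of that pattern, with placeholder stage and table names; ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE is what lets a file with fewer than 50 columns land in the wide table:

```python
# Sketch: a generic 50-column VARCHAR landing table plus a COPY that tolerates
# files with fewer columns than the table. Names are placeholders.
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="load_wh", database="ingest", schema="public",
).cursor()

cols = ", ".join(f"c{i:02d} VARCHAR" for i in range(1, 51))
cur.execute(f"CREATE OR REPLACE TABLE temp_landing ({cols})")

# SKIP_HEADER = 0 keeps the header as row 1, so its values can later be used
# to name the columns of the "final" table; missing trailing columns load as NULL.
cur.execute("""
    COPY INTO temp_landing
    FROM @vendor_stage/file.csv
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 0
                   ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE)
""")
```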

Answer 4:

This can be achieved using an external table that maps each file row to a single column, with the newline character as the only delimiter. The external table also exposes a special virtual column (VALUE) that can be processed to extract all the columns dynamically, so a stored procedure can then create a table based on however many columns exist at any given time. There is an interesting video that talks about this limitation in Snowflake (https://youtu.be/hRNu58E6Kmg).
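If I have understood the technique, it might look roughly like the following. The stage, file format, and table names are placeholders, and the assumption is that a CSV file format with FIELD_DELIMITER = NONE makes each line arrive as a single field (c1) in the external table's VALUE column:

```python
# Sketch: external table over the stage where each CSV line lands as one string,
# which SPLIT can break apart so a stored procedure could build the real table.
import snowflake.connector

cur = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="load_wh", database="ingest", schema="public",
).cursor()

cur.execute("""
    CREATE FILE FORMAT IF NOT EXISTS one_col_csv
      TYPE = CSV FIELD_DELIMITER = NONE
""")

cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE raw_lines
      WITH LOCATION = @vendor_stage/incoming/
      AUTO_REFRESH = FALSE
      FILE_FORMAT = (FORMAT_NAME = 'one_col_csv')
""")
cur.execute("ALTER EXTERNAL TABLE raw_lines REFRESH")  # register files added later

# Assumed behavior: each row's VALUE is {"c1": "<whole line>"}; SPLIT turns the
# line into an array of field values that can be inspected column by column.
cur.execute("""
    SELECT SPLIT(value:c1::STRING, ',') AS fields
    FROM raw_lines
    LIMIT 5
""")
print(cur.fetchall())
```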