Design ideas for importing multiple customer CSV files into a transactional database


I am working on redesigning a data pipeline that imports customer data in CSV format from cloud buckets the customers own (we already have the connection details) into a transactional database that we own. Constraints:

  1. We import ~10 million files from ~10 thousand customers every day. Each customer can start multiple imports at any time.
  2. File sizes vary from small (~10 records) to large (millions of records).

The current design runs imports as they arrive, and we are seeing issues with error handling, retries, and backpressure management. Ideas so far:

  1. Use a staging blob store to stage imports and schedule jobs to import data.
  2. Use an orchestrator like Airflow with sensors to do a DAG run for each import (a rough sketch of this idea follows the list).
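
A minimal sketch of the Airflow idea, assuming Airflow 2.4+ with the Amazon provider installed; the bucket name, manifest key pattern, and import task body are hypothetical placeholders rather than our actual pipeline:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def customer_import():
    # Wait for the customer to drop a manifest file signalling a new import.
    wait_for_manifest = S3KeySensor(
        task_id="wait_for_manifest",
        bucket_name="customer-bucket",          # hypothetical
        bucket_key="imports/*/manifest.json",   # hypothetical
        wildcard_match=True,
        poke_interval=60,
        timeout=60 * 60,
    )

    @task
    def import_csv_files():
        # Placeholder: list the CSVs referenced by the manifest and load them
        # into the transactional database in batches.
        ...

    wait_for_manifest >> import_csv_files()


customer_import()
```
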

Curious to see how the community would implement this.

1 Answer

You already have the files in cloud buckets, so put a queue in front of them and hang an appropriate number of worker threads on it. You can fine-tune your throughput by increasing or decreasing the number of workers until you find the capacity of your database.
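
A rough sketch of that approach, assuming S3 buckets with event notifications feeding an SQS queue; the queue URL, event parsing, and database insert are hypothetical placeholders:

```python
import csv
import io
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-imports"  # hypothetical


def import_one(message):
    # Each S3 event notification names the bucket and key of one new CSV file.
    event = json.loads(message["Body"])
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # For very large files you would stream rather than read() into memory.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(body)):
        ...  # batch-insert into the transactional database here

    # Delete only after a successful import; on failure the message becomes
    # visible again after the visibility timeout and is retried.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])


def run_workers(worker_count=8):
    # worker_count is the main tuning knob: raise it until the database
    # becomes the bottleneck, then back off.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,  # long polling
            )
            for message in resp.get("Messages", []):
                pool.submit(import_one, message)


if __name__ == "__main__":
    run_workers()
```

Deleting the message only after a successful import means a crashed worker's message simply reappears after the visibility timeout and is retried, without any extra bookkeeping on your side.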

Most cloud-based queues (SQS, Azure Queue Storage, GCP Cloud Tasks) have error retries, backpressure handling, etc. built in.
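
For example, with SQS the retry and dead-letter behaviour is just queue configuration; the ARNs and numbers below are illustrative:

```python
import json

import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/csv-imports",  # hypothetical
    Attributes={
        "VisibilityTimeout": "900",  # give a worker 15 minutes per file before retry
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:csv-imports-dlq",
            "maxReceiveCount": "5",  # after 5 failed attempts, park the message in the DLQ
        }),
    },
)
```

Files that keep failing end up in the dead-letter queue, where you can inspect and replay them instead of having them block the rest of the pipeline.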