I am working on redesigning a data pipeline that imports customer data in CSV format from cloud buckets that the customers own (we already have the connection details) into a transactional database that we own. Constraints:
- We import ~10 million files from ~10 thousand customers every day. Each customer can start multiple imports at any time.
- File sizes vary from small (10 records) to large (~millions of records).
The current design processes imports as they arrive, and we are seeing issues with error handling, retries, and backpressure management. Ideas:
- Use a staging blob store to stage imports and schedule jobs to import data.
- Use an orchestrator like Airflow with sensors to do a DAG run for each import (rough sketch after this list).
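For the second idea, here is a minimal sketch of what a per-import DAG could look like, assuming Airflow 2.4+ with a recent Amazon provider package. The DAG id, `load_csv_to_db`, and the `dag_run.conf` shape are placeholder assumptions; each import would trigger a run (via the REST API or CLI) carrying the bucket/key in its conf:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def load_csv_to_db(**context):
    """Placeholder: stream the staged CSV into the transactional database."""
    conf = context["dag_run"].conf  # assumed shape: {"bucket": "...", "key": "..."}
    # chunked COPY / batched INSERTs would go here


with DAG(
    dag_id="customer_csv_import",    # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule=None,                   # no schedule: each import triggers its own run
    catchup=False,
    max_active_runs=500,             # caps concurrency, which acts as crude backpressure
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="{{ dag_run.conf['bucket'] }}",
        bucket_key="{{ dag_run.conf['key'] }}",
        poke_interval=60,
        timeout=60 * 60,
    )

    import_file = PythonOperator(
        task_id="import_file",
        python_callable=load_csv_to_db,
        retries=3,                   # Airflow-level retry handling
    )

    wait_for_file >> import_file
```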
Curious to see how the community would implement this.
You already have the records in cloud buckets: put a queue in front of them and hang an appropriate number of worker threads on it. You can then fine-tune your throughput by increasing or decreasing the number of worker threads until you find the capacity of your database.
Most cloud queues (SQS, Azure Queue, GCP Task Queues) have error retries, backpressure handling, etc. built in.
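A minimal sketch of that approach with SQS and boto3, assuming something on the intake side (an event notification, a bucket lister, or your import API) enqueues one message per CSV object. The queue URL, the `{"bucket": ..., "key": ...}` message shape, and `load_csv_into_db()` are all placeholder assumptions:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-imports"  # hypothetical
NUM_WORKERS = 32  # tune up/down until you find the DB's write capacity

sqs = boto3.client("sqs")
s3 = boto3.client("s3")


def load_csv_into_db(csv_bytes: bytes) -> None:
    """Placeholder: batch/stream the CSV rows into the transactional DB."""


def handle_message(msg: dict) -> None:
    ref = json.loads(msg["Body"])  # assumed shape: {"bucket": "...", "key": "..."}
    obj = s3.get_object(Bucket=ref["bucket"], Key=ref["key"])
    load_csv_into_db(obj["Body"].read())  # for million-row files, stream instead of .read()
    # Delete only after a successful import; a failed message becomes visible
    # again after the visibility timeout, gets retried, and is eventually
    # redriven to a dead-letter queue once it exceeds the queue's maxReceiveCount.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


def worker() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # batch receive
            WaitTimeSeconds=20,      # long polling
        )
        for msg in resp.get("Messages", []):
            try:
                handle_message(msg)
            except Exception:
                # Leave the message un-deleted so SQS retries it.
                pass


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        for _ in range(NUM_WORKERS):
            pool.submit(worker)
```

Backpressure falls out naturally because unprocessed messages just sit in the queue, retries come from the visibility timeout plus a dead-letter redrive policy, and scaling is a matter of changing `NUM_WORKERS` (or the number of worker hosts).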