Dependency-based ETL flow in AWS


We want to create a dynamic flow based on input data in S3. Based on the data available in S3, along with its metadata, we want to create clusters and tasks/transformation jobs dynamically, and some of the jobs depend on others. I am sharing the expected flow here and want to know how we can do this efficiently using AWS services.

I am exploring AWS SWF, Data Pipeline, and Lambda, but I'm not sure how to handle dynamic tasks and dynamic dependencies. Any thoughts on this?

The data flow is explained in the attached image (see the ETL Flow diagram).


2 Answers

Best answer:

AWS Step Functions with S3 triggers should get the job done in a cost-effective and scalable manner.
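A minimal sketch of the trigger side, assuming a Lambda function subscribed to `s3:ObjectCreated:*` events on the input bucket starts one Step Functions execution per object; the `STATE_MACHINE_ARN` environment variable is a placeholder name, not anything Step Functions requires:

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Start one Step Functions execution per object dropped into S3."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Pass the object location (and any metadata you look up here) as the
        # execution input, so the state machine can build its tasks from it.
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "key": key}),
        )
```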

All steps are defined using the Amazon States Language.

https://states-language.net/spec.html

You can run jobs in parallel and wait for them to finish before you start your next job.

Below is a sample of the kind of state machine AWS Step Functions supports.
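A minimal sketch, assuming the transformation and load jobs are Lambda functions; the function ARNs, role ARN, and state names are placeholders. Two independent transforms run in a Parallel state, and the dependent load step runs only after both branches finish:

```python
import json

import boto3

definition = {
    "Comment": "Run independent transforms in parallel, then the dependent load",
    "StartAt": "Transforms",
    "States": {
        "Transforms": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "TransformA",
                    "States": {
                        "TransformA": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-a",
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "TransformB",
                    "States": {
                        "TransformB": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-b",
                            "End": True,
                        }
                    },
                },
            ],
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",
            "End": True,
        },
    },
}

# Register the state machine once; the S3-triggered Lambda above starts
# executions of it per input object.
sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-flow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-step-functions-role",
)
```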

Another answer:

If you use the AWS Flow Framework that is part of the official SWF client, then modeling such a dynamic flow is pretty straightforward. You define an object model, write code that instantiates it based on your pipeline definition, and execute it using the framework (the general idea is sketched below). See the Deployment Sample for an example of such a dynamic workflow implementation.
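This is not AWS Flow Framework code (that client is Java); it is only a plain-Python sketch of the underlying idea: build a task object model from a pipeline definition at run time and execute each task only after its dependencies have finished. The pipeline contents are hypothetical:

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(pipeline):
    """pipeline maps task name -> (callable, [names of tasks it depends on])."""
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    # static_order() yields tasks so that every dependency comes first.
    for name in TopologicalSorter(graph).static_order():
        func, _ = pipeline[name]
        func()  # in a real workflow this would schedule an SWF activity


# Hypothetical pipeline built from metadata discovered in S3.
pipeline = {
    "extract": (lambda: print("extract"), []),
    "transform_a": (lambda: print("transform_a"), ["extract"]),
    "transform_b": (lambda: print("transform_b"), ["extract"]),
    "load": (lambda: print("load"), ["transform_a", "transform_b"]),
}

run_pipeline(pipeline)
```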