Is it possible to keep the data entirely on premises while still using a cloud-managed service like Amazon Managed Workflows for Apache Airflow (MWAA)? Or does this require some kind of data transfer to the cloud?

Are there additional security concerns with this hybrid approach?



Yes and No.

Airflow allows you to "connect" everything everywhere. You can define a Connection to an on-premise or cloud resource and build ETL pipelines that query it or write to it. That said, as you mention, in some cases there are security or authorization issues that are not related to Airflow itself but to the policies of your organization.
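As a rough sketch of what "defining a Connection" can look like: Airflow can read connections from environment variables named `AIRFLOW_CONN_<CONN_ID>` whose value is a connection URI. The helper below only builds that name/value pair with the standard library. The connection id, host, credentials, and database name are all hypothetical placeholders, and how you actually inject the value into a managed service such as MWAA depends on your setup.

```python
from urllib.parse import quote


def airflow_conn_env(conn_id, conn_type, login, password, host, port, schema):
    """Return (env var name, connection URI) for an Airflow connection."""
    # Airflow looks for env vars named AIRFLOW_CONN_ + the upper-cased connection id.
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    # Percent-encode credentials so characters like '@' or '/' do not break the URI.
    uri = (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(schema, safe='')}"
    )
    return name, uri


# Hypothetical values: none of these come from the question.
name, uri = airflow_conn_env(
    conn_id="onprem_finance_db",
    conn_type="postgres",
    login="etl_user",
    password="p@ss/word",
    host="db.internal.example.com",
    port=5432,
    schema="finance",
)
print(f"{name}={uri}")
```

A task would then refer to the connection by its id (`onprem_finance_db` here) and the query runs against the on-premise host; only the query results pass through the Airflow worker.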

For example, if you add a connection to query a financial database, all of your Airflow users will be able to use this database if they wish. Airflow has no built-in authorization mechanism that specifies who is allowed to use a given connection and who is not. This can be a source of trouble, because you wouldn't want all your Airflow users to be able to query sensitive data. Another issue may arise if your on-premise resource is designed to reject access from external addresses (allow/deny lists, etc.).
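To make the allow-list point concrete, here is a minimal sketch of the kind of check an on-premise firewall or database effectively performs on the source address of an incoming connection. The CIDR ranges and IPs are made-up documentation addresses; in a real setup you would need to find out which outbound address your cloud Airflow workers use and have it added to the list.

```python
import ipaddress


def is_source_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical allow list: an internal range plus one external range.
ALLOWED_CIDRS = ["10.0.0.0/8", "203.0.113.0/24"]

print(is_source_allowed("203.0.113.25", ALLOWED_CIDRS))  # True
print(is_source_allowed("198.51.100.7", ALLOWED_CIDRS))  # False
```

If the address your workers connect from is not on the list, the connection is refused no matter how the Airflow Connection is configured.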

To summarize: Airflow allows you to do that. Issues can arise depending on your company's procedures around resources and access control. I would suggest doing a POC to get a sense of how it can work for your organization. If specific issues come up, ask about them and see whether there are workarounds.

What we did to protect limited-access databases was simply to run two distinct Airflow instances. The protected connections are defined in only one of them, so we effectively moved the permission handling from the resource level to the Airflow level.
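One way to keep such a split honest is a small check that no protected connection id has crept into the general-purpose instance. This is only a sketch: the connection ids are hypothetical, and it assumes you can export the list of connection ids from each instance in whatever way suits your setup.

```python
def misplaced_connections(instance_conn_ids, protected_conn_ids):
    """Return protected connection ids that are defined on the given instance."""
    return sorted(set(instance_conn_ids) & set(protected_conn_ids))


# Hypothetical connection ids for illustration only.
PROTECTED = {"finance_db", "hr_db"}
general_instance = ["s3_landing", "reporting_db"]
restricted_instance = ["s3_landing", "finance_db", "hr_db"]

print(misplaced_connections(general_instance, PROTECTED))     # []
print(misplaced_connections(restricted_instance, PROTECTED))  # ['finance_db', 'hr_db']
```

An empty list for the general instance means only users of the restricted instance can reach the protected databases.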