We are migrating an on-prem solution in which we currently receive files from vendors via FTP into a directory on our Unix server, and a program on that server then moves each file to another path and performs ETL on it.
We have containerized the application and it still expects to read a file from a directory.
What we want to do generally is this:
- Vendor delivers file to S3 bucket
- Container sees the zipped file (constantly polling)
- Container moves and extracts the file to some persistent disk (EFS?), roughly as sketched below
- Container does ETL
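To make the polling and extraction steps concrete, here is a rough sketch of what we picture running inside the container (the bucket name, prefix, and /mnt/efs/ingest mount path are placeholders we made up, and it assumes boto3 plus an IAM role that can read the bucket):

```python
import os
import time
import zipfile

import boto3

S3_BUCKET = "vendor-drop-bucket"   # placeholder bucket name
INCOMING_PREFIX = "incoming/"      # placeholder prefix the vendor writes to
EFS_MOUNT = "/mnt/efs/ingest"      # placeholder path where the EFS volume is mounted
POLL_SECONDS = 60

s3 = boto3.client("s3")


def poll_once(seen: set) -> None:
    """List the bucket, download any new zip, and extract it onto the EFS mount."""
    resp = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=INCOMING_PREFIX)
    for obj in resp.get("Contents", []):   # pagination omitted for brevity
        key = obj["Key"]
        if key in seen or not key.endswith(".zip"):
            continue
        local_zip = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(S3_BUCKET, key, local_zip)
        with zipfile.ZipFile(local_zip) as zf:
            zf.extractall(EFS_MOUNT)       # the ETL step reads from this directory
        os.remove(local_zip)
        seen.add(key)


if __name__ == "__main__":
    seen_keys: set = set()
    while True:
        poll_once(seen_keys)
        time.sleep(POLL_SECONDS)
```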
A couple of ideas have come to mind, and we wanted to get advice since this is new to us:
Option 1:
S3 -> Lambda function pushes the file to EFS -> EFS is exposed as a PV that is mounted to a directory in the container.
Observation: we feel there is excessive overhead in housekeeping the file in two places (S3 and EFS).
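For Option 1, we imagine the Lambda looking roughly like the sketch below, assuming the function has an EFS access point mounted at /mnt/efs/ingest (that path and the key handling are placeholders):

```python
import os
import urllib.parse

import boto3

EFS_MOUNT = "/mnt/efs/ingest"   # placeholder; Lambda mounts EFS access points under /mnt/...

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by an S3 ObjectCreated event; copies the object onto the EFS mount."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        dest = os.path.join(EFS_MOUNT, os.path.basename(key))
        s3.download_file(bucket, key, dest)
        # Optionally delete the S3 copy here if we don't want two copies lying around:
        # s3.delete_object(Bucket=bucket, Key=key)
```

The commented-out delete is one way to deal with the double-housekeeping concern; an S3 lifecycle rule that expires objects after a few days would be another.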
Option 2:
Look into AWS DataSync to see whether it can replace the need for Lambda.
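DataSync is configured rather than coded, but for discussion purposes the rough shape of that setup via boto3 would be something like this (all ARNs are placeholders, and the IAM role/subnet/security group plumbing is the real work):

```python
import boto3

datasync = boto3.client("datasync")

# All ARNs below are placeholders for illustration only.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::vendor-drop-bucket",
    Subdirectory="/incoming",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/datasync-s3-access"},
)

efs_location = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-12345678",
    Subdirectory="/ingest",
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-12345678",
        "SecurityGroupArns": ["arn:aws:ec2:us-east-1:123456789012:security-group/sg-12345678"],
    },
)

task = datasync.create_task(
    SourceLocationArn=s3_location["LocationArn"],
    DestinationLocationArn=efs_location["LocationArn"],
    Name="vendor-drop-to-efs",
    Schedule={"ScheduleExpression": "rate(1 hour)"},  # runs on a schedule, not per object
)
```

One caveat we've noticed: DataSync tasks run on a schedule (an hour appears to be the minimum interval) rather than reacting to each object, so if we need near-real-time pickup the Lambda route may fit better.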
To reiterate constraints:
- File must be delivered from the vendor to the S3 bucket
- The container must be able to read it from some directory that appears local to it (mount points are OK); a sketch of that consumer side follows this list
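For reference, once either option lands the file on the EFS-backed mount, the consumer inside the container could stay close to what the on-prem program already does; a minimal sketch, with placeholder paths:

```python
import os
import time

WATCH_DIR = "/data/ingest"   # placeholder: the PV mount point inside the container
POLL_SECONDS = 30


def run_etl(path: str) -> None:
    """Stand-in for the existing ETL program."""
    print(f"processing {path}")


if __name__ == "__main__":
    processed = set()
    while True:
        for name in sorted(os.listdir(WATCH_DIR)):
            path = os.path.join(WATCH_DIR, name)
            if path not in processed and os.path.isfile(path):
                run_etl(path)
                processed.add(path)
        time.sleep(POLL_SECONDS)
```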
Any ideas or suggestions on a simple/reusable design? This is a pattern we plan to use for a large number of ingestions from different sources.