I am trying to create a data lake using S3, with data coming from Aurora (and eventually other sources); however, I am having trouble designing a cost-efficient solution.
I have been looking into using the Database Migration Service (DMS) to stream data from Aurora to S3.
Data arrives in S3 as CSV in one of two shapes:
(1) The initial load: (column1, column2, column3, ...)
(2) Ongoing changes (CDC): (D/U/I, column1, column2, column3, ...) [D: delete, U: update, I: insert]
Because of this, I have been trying to transform the data so that the final S3 bucket only contains rows without the D/U/I prefix.
To perform this transformation, I have thought of using the following model:
Aurora -> Dirty S3 Bucket -> Curated S3 Bucket -> Athena querying from Curated S3 Bucket
The Dirty S3 Bucket contains data in both formats, (column1, column2, column3, ...) and (D/U/I, column1, column2, column3, ...). When a new file arrives in this bucket, an SNS message is sent out and a Lambda function reads the newly arrived file, transforms the data, and places it into the Curated S3 Bucket, which only contains data in the format (column1, column2, column3, ...).
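A minimal sketch of that Lambda in Python/boto3, assuming the function is wired directly to the S3 event notification (if it is subscribed through SNS, the S3 event JSON is nested inside the SNS message), that the op flag is always the first CSV column, and that `CURATED_BUCKET` is a placeholder name. Note this only drops D rows and strips the U/I flag; it does not reconcile updates against the initial load:

```python
import csv
import io
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Placeholder -- set the curated bucket name as a Lambda environment variable.
CURATED_BUCKET = os.environ.get("CURATED_BUCKET", "my-curated-bucket")
OP_FLAGS = {"D", "U", "I"}


def handler(event, context):
    """Fires when DMS drops a new CSV file into the dirty bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        out = io.StringIO()
        writer = csv.writer(out)
        for row in csv.reader(io.StringIO(body)):
            if not row:
                continue
            if row[0] in OP_FLAGS:
                if row[0] == "D":
                    continue      # drop deleted rows
                row = row[1:]     # strip the U/I flag from change rows
            writer.writerow(row)  # full-load rows pass through unchanged

        # Reuse the original key so any table/year/month/day/hour prefix
        # carries over to the curated bucket.
        s3.put_object(
            Bucket=CURATED_BUCKET,
            Key=key,
            Body=out.getvalue().encode("utf-8"),
        )
```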
Assuming that all data from Aurora will have a created_at and updated_at column, I have thought of:
- Partitioning the data in S3 by /year/month/day/hour (using created_at):
- However, if the data is partitioned by /year/month/day/hour, then I assume the costs would be too high.
- This is based on the following assumptions:
- 40000 requests per day (inserts/updates/deletes)
- 0.05 GB average file size
- $0.01 per GB (for data returned by S3)
- $0.01 per 1,000 requests (PUT, COPY, or POST Requests)
- $0.001 per 1,000 requests (GET, SELECT and all other Requests)
- Cost of retrieving data per day = 40000 rpd * 0.05 GB * $0.01 per GB = $20.00
- Cost of requests per day = (40000 rpd / 1,000) * $0.001 = $0.04
- Total cost ≈ $20.04 per day (a quick check of this arithmetic is sketched after this list)
Note that this calculation is just for one table; there will be more than five tables with similar rates, so the monthly cost will be greater than $3,000.
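For reference, a quick check of the arithmetic, where all prices are the assumed figures above rather than quoted S3 pricing:

```python
# Reproduce the per-day and per-month estimates from the list above.
requests_per_day = 40000
avg_file_gb = 0.05
price_per_gb_returned = 0.01        # $ per GB returned by S3
price_per_1k_get_requests = 0.001   # $ per 1,000 GET/SELECT requests

retrieval_cost = requests_per_day * avg_file_gb * price_per_gb_returned  # $20.00
request_cost = (requests_per_day / 1000) * price_per_1k_get_requests     # $0.04
daily_per_table = retrieval_cost + request_cost                          # $20.04

print(daily_per_table * 5 * 30)  # 5 tables, 30 days -> ~$3,006 per month
```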
Is there a more cost-efficient way to stream data from Aurora to S3?
Or would I have to resort to taking a daily snapshot of the Aurora DB in order to reduce costs?
Yes, you can write a Lambda function that connects to Aurora and copies the data to S3. In Parameter Store (inside Systems Manager) you can save the datetime of the last record you brought over for each table, so your Lambda only queries for updates and then writes the timestamp of the latest record back to Parameter Store.
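A minimal sketch of that approach, assuming a MySQL-compatible Aurora cluster, a single table I've called `orders` whose last column is `updated_at`, and placeholder bucket/parameter names:

```python
import csv
import io
import os
from datetime import datetime

import boto3
import pymysql  # must be bundled with the deployment package or a Lambda layer

ssm = boto3.client("ssm")
s3 = boto3.client("s3")

# Placeholders -- adjust to your environment.
TABLE = "orders"
PARAM_NAME = f"/datalake/last_updated_at/{TABLE}"
BUCKET = os.environ.get("DATALAKE_BUCKET", "my-datalake-bucket")


def handler(event, context):
    # 1. Read the watermark (last updated_at already exported) from Parameter Store.
    last_seen = ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]

    # 2. Pull only the rows that changed since then.
    conn = pymysql.connect(
        host=os.environ["AURORA_HOST"],
        user=os.environ["AURORA_USER"],
        password=os.environ["AURORA_PASSWORD"],
        database=os.environ["AURORA_DB"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {TABLE} WHERE updated_at > %s ORDER BY updated_at",
                (last_seen,),
            )
            rows = cur.fetchall()
    finally:
        conn.close()

    if not rows:
        return

    # 3. Write the whole batch as one CSV object, partitioned by load time.
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    now = datetime.utcnow()
    key = f"{TABLE}/{now:%Y/%m/%d/%H}/{now:%Y%m%d%H%M%S}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))

    # 4. Advance the watermark (assumes updated_at is the last column).
    newest = str(rows[-1][-1])
    ssm.put_parameter(Name=PARAM_NAME, Value=newest, Type="String", Overwrite=True)
```

Running this on a schedule (for example an EventBridge rule every few minutes or once a day) means one PUT per table per run instead of one object per change, which is where most of the savings over per-change DMS files would come from.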