I have a data lake in AWS S3. The data format is Parquet, and the daily workload is ~70 GB. I want to build some ad-hoc analytics on top of that data. To do that I see 2 options:
- Use AWS Athena to query the data in place with SQL (HiveQL-compatible DDL), using the AWS Glue Data Catalog for table definitions.
- Move the data from S3 into Redshift as a data warehouse and query Redshift to perform the ad-hoc analysis.
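For context, option 1 would look roughly like this. The table, bucket, column, and partition names below are placeholders for illustration, not my actual schema:

```sql
-- Hypothetical external table over the S3 data lake, registered in the Glue Data Catalog
CREATE EXTERNAL TABLE events (
  event_id   string,
  event_time timestamp,
  payload    string
)
PARTITIONED BY (dt string)   -- daily partitions keep scans close to the ~70 GB/day slice
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';

-- Example ad-hoc query; the partition predicate limits how much S3 data Athena scans
SELECT count(*)
FROM events
WHERE dt = '2019-01-01';
```

Since Athena bills per byte scanned, partitioning by day plus the columnar Parquet format should keep most ad-hoc queries from touching the full data set.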
What is the best way to do ad-hoc analysis in my case? Is there a more efficient way? And what are the pros and cons of the options mentioned?
PS: After 6 months I'm going to move the data from S3 to Amazon Glacier, so the maximum data volume to query in S3/Redshift would be ~13 TB.