I exported a DynamoDB table using an AWS Data Pipeline with DataNodes > S3BackupLocation > Compression set to GZIP. I expected compressed output with a .gz extension, but I got uncompressed output with no extension.
Further reading reveals that the compression field "is only supported for use with Amazon Redshift and when you use S3DataNode with CopyActivity."
How can I get a gzipped backup of my DynamoDB table into S3? Do I have to resort to downloading all the files, gzipping them, and uploading them? Is there a way to make the pipeline work with CopyActivity? Is there a better approach?
I've been experimenting with using Hive for the export, but I haven't yet found a way to get the formatting right on the output. It needs to match the format below so EMR jobs can read it alongside data from another source.
{"col1":{"n":"596487.0550532"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxx-xxxx"}}
{"col1":{"n":"234573.7390354"},"col2":{"s":"xxxx-xxxx-xxxx"},"col3":{"s":"xxxx-xxxxx-xx"}}
{"col2":{"s":"xxxx-xxxx-xxxx"},"col1":{"n":"6765424.7390354"},"col3":{"s":"xxxx-xxxxx-xx"}}
I too have been looking for how to do this. It's such a basic request that I'm surprised it isn't part of a standard Data Pipeline workflow.
After days of investigation and experimentation, I've found two mechanisms:
1) Use a ShellCommandActivity to launch a couple of AWS CLI commands (aws s3 cp, gzip): download the export from S3, gzip it locally, then re-upload it to S3. The relevant part is the shell command the activity runs; see the sketch below.
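Something along these lines, assuming a hypothetical bucket and prefixes for the raw export and the gzipped copy:

    # Hypothetical bucket/prefixes: adjust to wherever your export landed.
    # Pull the uncompressed export down, gzip each file (adds the .gz extension), push it back up.
    aws s3 cp s3://my-bucket/dynamodb-export/ /tmp/export/ --recursive && \
    gzip -r /tmp/export && \
    aws s3 cp /tmp/export/ s3://my-bucket/dynamodb-export-gz/ --recursive

Any small Ec2Resource (or an existing worker group) is enough to run this on.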
2) Create a separate EMR cluster, then create a data pipeline that uses that cluster to run S3DistCp (s3-dist-cp); see the step sketched below.
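The EMR step boils down to an s3-dist-cp invocation along these lines (bucket and prefixes are again placeholders); --outputCodec does the gzipping:

    # Hypothetical bucket/prefixes: run this as a step on the EMR cluster.
    # --outputCodec compresses the output; --deleteOnSuccess removes the source files.
    s3-dist-cp --src=s3://my-bucket/dynamodb-export/ \
               --dest=s3://my-bucket/dynamodb-export-gz/ \
               --outputCodec=gzip \
               --deleteOnSuccess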
Between the two of them, I like the second because s3-dist-cp can automatically delete the source S3 files. However, it requires a separate EMR cluster to run (higher cost). Alternatively, you can add an additional step to #1 to do the deletion.
Also, if you want to parameterize, you may need to inline the values directly so that you can take advantage of expressions like #{format(@scheduledStartTime,'YYYY-MM-dd_hh.mm')}.
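For example, with the expression inlined straight into the ShellCommandActivity command (hypothetical destination bucket and prefix), Data Pipeline expands the #{...} expression before the shell ever runs:

    # Hypothetical destination: the #{...} expression is evaluated by Data Pipeline, not by bash.
    aws s3 cp /tmp/export/ s3://my-bucket/backups/#{format(@scheduledStartTime,'YYYY-MM-dd_hh.mm')}/ --recursive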