How can I generate 1 TB of TPC-DS benchmark data directly in AWS S3?


I want to generate TPC-DS data (1 TB and 10 TB) directly in AWS S3, without first generating it on a local machine and then transferring it to S3. What is the easiest way to do that?


There is 1 best solution below


I did similar work several months ago; hive-testbench is one option. See its README.md for how to set it up.

You need to set fs.defaultFS in $HADOOP_HOME/etc/hadoop/core-site.xml to point to your AWS S3 bucket; the data will then be generated directly in S3.
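
As a rough illustration of that setting, the Python sketch below renders the property element that would go inside the <configuration> block of core-site.xml. The s3a:// scheme and the bucket name my-tpcds-bucket are assumptions, not something stated above; the S3A connector also needs the hadoop-aws module and AWS credentials configured separately.

```python
import xml.etree.ElementTree as ET


def default_fs_property(bucket: str) -> str:
    """Return a <property> element that sets fs.defaultFS to an S3 bucket URI."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    # Assumption: the S3A connector is used, hence the s3a:// scheme.
    ET.SubElement(prop, "value").text = f"s3a://{bucket}"
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # "my-tpcds-bucket" is a placeholder; substitute your own bucket name.
    print(default_fs_property("my-tpcds-bucket"))
```

Paste the printed element into core-site.xml by hand; the script does not modify any Hadoop configuration itself.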
Pass the data scale parameter to ./tpcds-setup.sh to generate data at different scales.
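
For the two sizes in the question, the command lines could be built as in the sketch below. It assumes the scale parameter is the first argument and is expressed in gigabytes (so 1000 for 1 TB and 10000 for 10 TB), which is how I remember the hive-testbench README describing it; check the README before relying on this. The helper name tpcds_setup_command is made up for the example.

```python
from typing import List


def tpcds_setup_command(size_tb: int) -> List[str]:
    """Build the tpcds-setup.sh command line for a data size given in terabytes."""
    if size_tb <= 0:
        raise ValueError("size_tb must be a positive integer")
    # Assumption: the script takes the scale factor in GB as its first argument.
    scale_gb = size_tb * 1000
    return ["./tpcds-setup.sh", str(scale_gb)]


if __name__ == "__main__":
    for size_tb in (1, 10):
        # Only prints the command; run it from the hive-testbench directory.
        print(" ".join(tpcds_setup_command(size_tb)))
```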