How can I save data from HDFS to Amazon S3?


I am working on web archives and extracting some data. Initially I stored this data as text files in HDFS, but because of its massive size I now have to store the output in Amazon S3 buckets. How can I achieve this? I have tried the s3a connector, but it throws an error saying the credentials are wrong. The text output is in the TBs. Alternatively, is there any way I can keep storing it in HDFS as before, upload it to S3, and then delete it from HDFS, or is there some other effective way of doing this?

for bucket in buckets[4:5]:
    filenames = get_bucket_warcs(bucket)
    print("==================================================")
    print(f"bucket: {bucket}, filenames: {len(filenames)}")
    print("==================================================")

    # counters reported after the job finishes
    jsonld_count = sc.accumulator(0)
    records_count = sc.accumulator(0)
    exceptions_count = sc.accumulator(0)

    # one partition per WARC file, then extract the JSON-LD records
    rdd_filenames = sc.parallelize(filenames, len(filenames))
    rdd_jsonld = rdd_filenames.flatMap(lambda f: get_jsonld_records(bucket, f))

    # currently written to HDFS; this is the output I want to land in S3
    rdd_jsonld.saveAsTextFile(f"{hdfs_path}/webarchive-jsonld-{bucket}")

    print(f"records processed: {records_count.value}",
          f"jsonld: {jsonld_count.value}",
          f"exceptions: {exceptions_count.value}")

    sc.stop()

This is my code; I would like to save rdd_jsonld to an Amazon S3 bucket instead of HDFS.

There is 1 answer below.


If the s3a connector reports that the credentials are wrong, then either you haven't set up the credentials or you have configured the client to talk to the wrong public/private S3 store.
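If you do want to write straight from Spark, here is a minimal sketch of how credentials can be wired through the Spark/Hadoop configuration and the output written to an s3a:// path. It assumes the hadoop-aws jar and its matching AWS SDK jar are on the classpath; the key values and the bucket name my-output-bucket are placeholders.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("webarchive-jsonld-to-s3")
    # Credentials reach the S3A filesystem through the Hadoop configuration;
    # in practice prefer environment variables, an instance profile, or a
    # credential provider over hard-coded keys.
    .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
)
sc = SparkContext(conf=conf)

# rdd_jsonld would be built exactly as in the question; a tiny stand-in RDD
# is used here so the sketch is self-contained.
rdd_jsonld = sc.parallelize(['{"@context": "https://schema.org"}'])

# Writing to an s3a:// URI instead of the HDFS path sends the output
# directly to the bucket, so no separate upload-and-delete step is needed.
rdd_jsonld.saveAsTextFile("s3a://my-output-bucket/webarchive-jsonld-example")

sc.stop()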

Look up the online documentation for the S3 connector you are using (Hadoop S3A or EMR S3) and read it, especially the sections on authentication and troubleshooting.
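If you would rather keep writing to HDFS first, as in the question, and only upload afterwards, Hadoop's distcp tool can copy the finished output to the bucket, after which the HDFS copy can be removed. A hedged sketch, assuming the hadoop CLI is on the PATH, the S3A credentials are configured (for example in core-site.xml), and both paths are placeholders:

import subprocess

hdfs_dir = "hdfs:///user/me/webarchive-jsonld-example"       # placeholder path
s3_dir = "s3a://my-output-bucket/webarchive-jsonld-example"  # placeholder path

# Copy the finished output from HDFS to the S3 bucket.
subprocess.run(["hadoop", "distcp", hdfs_dir, s3_dir], check=True)

# Remove the HDFS copy once the transfer has succeeded.
subprocess.run(["hdfs", "dfs", "-rm", "-r", hdfs_dir], check=True)

distcp runs as a MapReduce job, so a multi-terabyte copy is parallelised across the cluster rather than funnelled through a single machine.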