
I am working on web archives and extracting some data from them. Initially I stored this data as text files in HDFS, but because of its massive size I now have to store the output in Amazon S3 buckets. How can I achieve this? I have tried the s3a connector, but it throws an error saying the credentials are wrong. The text output is terabytes in size. Is there any way I can keep writing it to HDFS as before, upload it to S3, and then delete it from HDFS, or is there a more effective way of doing this?

for bucket in buckets[4:5]:
    filenames = get_bucket_warcs(bucket)
    print("==================================================")
    print(f"bucket: {bucket}, filenames: {len(filenames)}")
    print("==================================================")
    # Accumulators are incremented inside get_jsonld_records on the executors.
    jsonld_count = sc.accumulator(0)
    records_count = sc.accumulator(0)
    exceptions_count = sc.accumulator(0)
    # One partition per WARC file.
    rdd_filenames = sc.parallelize(filenames, len(filenames))
    rdd_jsonld = rdd_filenames.flatMap(lambda f: get_jsonld_records(bucket, f))
    rdd_jsonld.saveAsTextFile(f"{hdfs_path}/webarchive-jsonld-{bucket}")

    print(f"records processed: {records_count.value}",
          f"jsonld: {jsonld_count.value}",
          f"exceptions: {exceptions_count.value}")

# Stop the context only after all buckets are processed; stopping it inside
# the loop would break any subsequent iteration.
sc.stop()

This is my code, and I would like to save rdd_jsonld to an Amazon S3 bucket instead.
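For context, here is a minimal sketch of both options mentioned in the question, assuming the s3a credentials are configured correctly (see the answer below). It reuses bucket, rdd_jsonld, and hdfs_path from the code above; the bucket name my-output-bucket is a placeholder, not from the original post.

import subprocess

# Option 1: write directly to S3. saveAsTextFile accepts any Hadoop-supported
# filesystem URI, so only the output path changes.
rdd_jsonld.saveAsTextFile(f"s3a://my-output-bucket/webarchive-jsonld-{bucket}")

# Option 2: keep the existing HDFS write, then copy to S3 and delete.
# distcp runs the copy as a distributed MapReduce job, the usual tool for
# moving terabyte-scale data between HDFS and S3.
hdfs_dir = f"{hdfs_path}/webarchive-jsonld-{bucket}"
subprocess.run(["hadoop", "distcp", hdfs_dir,
                f"s3a://my-output-bucket/webarchive-jsonld-{bucket}"], check=True)
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", hdfs_dir], check=True)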

  • How are WARCs involved in this question? WARCs are sometimes used by archives to contain extracted text from webpages, but those are generally around 1 gigabyte in size, not terabytes. Commented Dec 7, 2023 at 17:21
  • Because I am working on WARCs. Yes, each WARC is in the gigabyte range, but each of my buckets has tens of thousands of WARC files (70k or more), each of which contains numerous records (HTML, JSON-LD, and other types), and there are around 30+ buckets, which adds up to that size. Commented Dec 7, 2023 at 22:49

1 Answer


If the s3a connector reports that the credentials are wrong, then either you haven't set up the credentials or you have configured the client to talk to the wrong public/private S3 store.

Look up the online documentation for the S3 connector you are using (Hadoop s3a or EMR s3) and read it, especially the sections on authentication and troubleshooting.
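For reference, a minimal sketch of supplying static credentials to the s3a connector from PySpark. The property names are the standard Hadoop s3a configuration keys; the values are placeholders. On EMR, credentials typically come from the instance's IAM role instead, in which case none of these keys needs to be set.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    # Placeholder credentials; substitute your own or, better, rely on the
    # default credential provider chain (environment variables, IAM role).
    .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    # If the target is a private S3-compatible store rather than AWS, the
    # endpoint must point at it, or authentication will fail.
    .set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
)
sc = SparkContext(conf=conf)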


1 Comment

OK, I will check that and see if it works. Thanks.
