
I am working on web archives and extracting some data from them. Initially I stored this data as text files in HDFS, but because of its massive size I now have to store the output in Amazon S3 buckets. How can I achieve this? I have tried the s3a connector, but it throws an error saying the credentials are wrong. The text output is terabytes in size. Is there any way I can keep writing it to HDFS as before, upload it to S3, and then delete it from HDFS, or is there a more effective way of doing this?

for bucket in buckets[4:5]:
    filenames = get_bucket_warcs(bucket)
    print("==================================================")
    print(f"bucket: {bucket}, filenames: {len(filenames)}")
    print("==================================================")
    # Accumulators are incremented inside get_jsonld_records on the executors.
    jsonld_count = sc.accumulator(0)
    records_count = sc.accumulator(0)
    exceptions_count = sc.accumulator(0)
    # One partition per WARC file.
    rdd_filenames = sc.parallelize(filenames, len(filenames))
    rdd_jsonld = rdd_filenames.flatMap(lambda f: get_jsonld_records(bucket, f))
    rdd_jsonld.saveAsTextFile(f"{hdfs_path}/webarchive-jsonld-{bucket}")

    print(f"records processed: {records_count.value}",
          f"jsonld: {jsonld_count.value}",
          f"exceptions: {exceptions_count.value}")

# Stop the context only after all buckets are processed; stopping it inside
# the loop would break any subsequent iteration.
sc.stop()

This is my code, and I would like to save rdd_jsonld to an Amazon S3 bucket instead.
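For context, here is a minimal sketch of both options mentioned in the question, assuming the s3a credentials are configured correctly (see the answer below). It reuses bucket, rdd_jsonld, and hdfs_path from the code above; the bucket name my-output-bucket is a placeholder, not from the original post.

import subprocess

# Option 1: write directly to S3. saveAsTextFile accepts any Hadoop-supported
# filesystem URI, so only the output path changes.
rdd_jsonld.saveAsTextFile(f"s3a://my-output-bucket/webarchive-jsonld-{bucket}")

# Option 2: keep the existing HDFS write, then copy to S3 and delete.
# distcp runs the copy as a distributed MapReduce job, the usual tool for
# moving terabyte-scale data between HDFS and S3.
hdfs_dir = f"{hdfs_path}/webarchive-jsonld-{bucket}"
subprocess.run(["hadoop", "distcp", hdfs_dir,
                f"s3a://my-output-bucket/webarchive-jsonld-{bucket}"], check=True)
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", hdfs_dir], check=True)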

  • How are WARCs involved in this question? WARCs are sometimes used by archives to contain extracted text from webpages, but those are generally around 1 gigabyte in size, not terabytes. Commented Dec 7, 2023 at 17:21
  • Because I am working on WARCs. Yes, each WARC is in the gigabyte range, but each of my buckets has tens of thousands of WARC files (70k or more), each of which contains numerous records (HTML, JSON-LD, and other types), and there are around 30+ buckets, which adds up to that size. Commented Dec 7, 2023 at 22:49

1 Answer


If the s3a connector reports that the credentials are wrong, then either you haven't set up the credentials or you have configured the client to talk to the wrong public/private S3 store.

Look up the online documentation for the S3 connector you are using (Hadoop s3a or EMR s3) and read it, especially the sections on authentication and troubleshooting.
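For reference, a minimal sketch of supplying static credentials to the s3a connector from PySpark. The property names are the standard Hadoop s3a configuration keys; the values are placeholders. On EMR, credentials typically come from the instance's IAM role instead, in which case none of these keys needs to be set.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    # Placeholder credentials; substitute your own or, better, rely on the
    # default credential provider chain (environment variables, IAM role).
    .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    # If the target is a private S3-compatible store rather than AWS, the
    # endpoint must point at it, or authentication will fail.
    .set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
)
sc = SparkContext(conf=conf)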


1 Comment

OK, I will check that and see if it works. Thanks.
