I am working with web archives and extracting some data from them. I initially stored this data as text files in HDFS, but because of its massive size I now have to store the output in Amazon S3 buckets. How can I achieve this? I have tried the s3a connector, but it throws an error saying the credentials are wrong. The text output is several TB in size. Is there any way I can keep writing to HDFS as I was doing before, upload the result to S3 and then delete it from HDFS, or is there a more effective way of doing this?
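When I tried the s3a connector, my setup looked roughly like the sketch below (keys redacted; the SparkContext creation is only there to make the snippet self-contained, since in my job sc already exists), so it is possible I am passing the credentials in the wrong place:

from pyspark import SparkContext

sc = SparkContext(appName="webarchive-jsonld")

# Pass the S3 credentials to the s3a filesystem via the Hadoop configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # redacted
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # redacted
# Optional: point s3a at a specific endpoint/region.
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")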
for bucket in buckets[4:5]:
    filenames = get_bucket_warcs(bucket)
    print("==================================================")
    print(f"bucket: {bucket}, filenames: {len(filenames)}")
    print("==================================================")

    # Counters for records processed, JSON-LD hits and exceptions
    # (presumably updated inside get_jsonld_records on the workers).
    jsonld_count = sc.accumulator(0)
    records_count = sc.accumulator(0)
    exceptions_count = sc.accumulator(0)

    # One partition per WARC file, then extract the JSON-LD records
    # and write them out as text.
    rdd_filenames = sc.parallelize(filenames, len(filenames))
    rdd_jsonld = rdd_filenames.flatMap(lambda f: get_jsonld_records(bucket, f))
    rdd_jsonld.saveAsTextFile(f"{hdfs_path}/webarchive-jsonld-{bucket}")

    print(f"records processed: {records_count.value}",
          f"jsonld: {jsonld_count.value}",
          f"exceptions: {exceptions_count.value}")

sc.stop()
This is my code, and I would like to save rdd_jsonld to an Amazon S3 bucket.
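Concretely, inside the loop I would like the save step to be something like the sketch below ("my-output-bucket" is a placeholder name), or alternatively keep the HDFS write as it is and copy the finished output to S3 afterwards before deleting it from HDFS:

# Option 1: write the RDD directly to S3 via s3a instead of HDFS.
rdd_jsonld.saveAsTextFile(f"s3a://my-output-bucket/webarchive-jsonld-{bucket}")

# Option 2: keep writing to HDFS as before, then copy the finished output
# to S3 and remove it from HDFS (sketch using the hadoop/hdfs CLIs).
import subprocess
subprocess.run(
    ["hadoop", "distcp",
     f"{hdfs_path}/webarchive-jsonld-{bucket}",
     f"s3a://my-output-bucket/webarchive-jsonld-{bucket}"],
    check=True,
)
subprocess.run(
    ["hdfs", "dfs", "-rm", "-r", f"{hdfs_path}/webarchive-jsonld-{bucket}"],
    check=True,
)

Which of these is more practical for TB-scale output, and how do I get the s3a credentials accepted in the first place?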