
How do I generate and load multiple S3 file paths in Scala so that I can use:

   sqlContext.read.json("s3://..../*/*/*")

I know I can use wildcards to read multiple files, but is there any way to generate the paths? For example, my file structure looks like BucketName/year/month/day/files:

       s3://testBucket/2016/10/16/part00000

These files are all JSON. The issue is that I need to load only a specific duration of files. For example, for a 16-day duration with a start day of Oct 16, I need to load files from Oct 1 to Oct 16.

With a 28-day duration and the same start day, I would like to read from Sep 18.

Can someone tell me a way to do this?


2 Answers


You can take a look at this answer. You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

Or you can use the AWS API to get the list of file locations and read those files using Spark.

You can look at this answer about AWS S3 file search.
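The list-and-filter approach might look like the sketch below. `keysInRange` is a hypothetical helper (not part of any Spark or AWS API), and the key list is stubbed in where a real S3 listing call (e.g. the AWS SDK's `listObjects`) would go:

```scala
import java.time.LocalDate
import scala.util.Try

// Sketch only: given object keys laid out as year/month/day/files (as in the
// question), keep only those whose date falls in [from, to] inclusive.
def keysInRange(keys: Seq[String], from: LocalDate, to: LocalDate): Seq[String] =
  keys.filter { key =>
    key.split("/").take(3) match {
      case Array(y, m, d) =>
        Try(LocalDate.of(y.toInt, m.toInt, d.toInt))
          .toOption
          .exists(date => !date.isBefore(from) && !date.isAfter(to))
      case _ => false
    }
  }
```

You would then prefix the surviving keys with the bucket and pass them to `sqlContext.read.json`.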


4 Comments

My issue is how to get the list of keys from the start day, i.e., how to generate all the paths dynamically, from the start day back 28 days.
All I get as input is the start date and end date. Is there any utility to generate the dates between them so that I can build a list by concatenation?
I updated my question based on your solution. Is that the right way to do this?
You have three stars at the end of the URL. It will read all the files under the 2016 folder.

You can generate a comma-separated path list:

sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
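Generating that list with `java.time` might look like the sketch below. `dailyPaths` is a hypothetical helper, and it assumes zero-padded month/day segments; adjust the format if your bucket layout differs:

```scala
import java.time.LocalDate

// Sketch only: builds one S3 prefix per day for a window of `days` days
// ending at `start` (inclusive), earliest day first.
def dailyPaths(bucket: String, start: LocalDate, days: Int): Seq[String] =
  (0 until days).reverse.map { i =>
    val d = start.minusDays(i)
    f"s3://$bucket/${d.getYear}/${d.getMonthValue}%02d/${d.getDayOfMonth}%02d/"
  }

// Recent Spark versions accept the paths as varargs; on older APIs join them
// with mkString(",") into a single string:
// sqlContext.read.json(dailyPaths("testBucket", LocalDate.of(2016, 10, 16), 16): _*)
```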

