
How do I generate and load multiple S3 file paths in Scala so that I can use:

   sqlContext.read.json("s3://..../*/*/*")

I know I can use wildcards to read multiple files, but is there any way to generate the paths? For example, my file structure looks like BucketName/year/month/day/files:

       s3://testBucket/2016/10/16/part00000

These files are all JSON. The issue is that I need to load only a specific duration of files. For example, for a 16-day duration with a start day of Oct 16, I need to load files from Oct 1 to Oct 16.

With a 28-day duration and the same start day, I would like to read from Sep 18.

Can someone tell me a way to do this?


2 Answers


You can take a look at this answer. You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

Or you can use the AWS API to get the list of file locations and read those files using Spark.

You can look at this answer about AWS S3 file search.
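The list-and-filter approach might look like the sketch below. `keysInRange` is a hypothetical helper (not part of any Spark or AWS API), and the key list is stubbed in where a real S3 listing call (e.g. the AWS SDK's `listObjects`) would go:

```scala
import java.time.LocalDate
import scala.util.Try

// Sketch only: given object keys laid out as year/month/day/files (as in the
// question), keep only those whose date falls in [from, to] inclusive.
def keysInRange(keys: Seq[String], from: LocalDate, to: LocalDate): Seq[String] =
  keys.filter { key =>
    key.split("/").take(3) match {
      case Array(y, m, d) =>
        Try(LocalDate.of(y.toInt, m.toInt, d.toInt))
          .toOption
          .exists(date => !date.isBefore(from) && !date.isAfter(to))
      case _ => false
    }
  }
```

You would then prefix the surviving keys with the bucket and pass them to `sqlContext.read.json`.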


4 Comments

My issue is how to get the list of keys from the start day, i.e., how to generate all the paths dynamically, from the start day back 28 days.
All I get as input is the start date and end date. Is there any utility to generate the dates between them so that I can build a list by concatenation?
I updated my question based on your solution. Is that the right way to do this?
You have three stars at the end of the URL. It will read all the files under the 2016 folder.

You can generate a comma-separated path list:

sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
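Generating that list with `java.time` might look like the sketch below. `dailyPaths` is a hypothetical helper, and it assumes zero-padded month/day segments; adjust the format if your bucket layout differs:

```scala
import java.time.LocalDate

// Sketch only: builds one S3 prefix per day for a window of `days` days
// ending at `start` (inclusive), earliest day first.
def dailyPaths(bucket: String, start: LocalDate, days: Int): Seq[String] =
  (0 until days).reverse.map { i =>
    val d = start.minusDays(i)
    f"s3://$bucket/${d.getYear}/${d.getMonthValue}%02d/${d.getDayOfMonth}%02d/"
  }

// Recent Spark versions accept the paths as varargs; on older APIs join them
// with mkString(",") into a single string:
// sqlContext.read.json(dailyPaths("testBucket", LocalDate.of(2016, 10, 16), 16): _*)
```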

