
I am writing a Spark job and trying to read a text file using Scala. The following works fine on my local machine:

  import scala.io.Source

  // myHashMap is a mutable Map[String, Double] defined elsewhere
  val myFile = "myLocalPath/myFile.csv"
  for (line <- Source.fromFile(myFile).getLines()) {
    val data = line.split(",")
    myHashMap.put(data(0), data(1).toDouble)
  }

Then I tried to make it work on AWS. I did the following, but it doesn't seem to read the entire file properly. What is the proper way to read such a text file from S3? Thanks a lot!

import java.io.{BufferedReader, InputStreamReader}
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest

val credentials = new BasicAWSCredentials("myKey", "mySecretKey")
val s3Client = new AmazonS3Client(credentials)
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"))

val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()))

var line = ""
while ((line = reader.readLine()) != null) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
  println(line)
}

3 Answers


I think I got it to work like below:

    import scala.io.Source

    // s3Client is constructed as in the question (AWS SDK v1)
    val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))

    val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
    for (line <- myData) {
        val data = line.split(",")
        myMap.put(data(0), data(1).toDouble)
    }

    println(" my map : " + myMap.toString())


Read in the CSV file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:

val data = sc.textFile("s3://myBucket/myFile.csv")

val myHashMap = data.collect()
                    .map(line => {
                      val substrings = line.split(",")
                      (substrings(0), substrings(1).toDouble)})
                    .toMap

You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
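For illustration, a minimal sketch of that broadcast step (here `input` is a hypothetical RDD[String] of the rows you want to enrich):

    // Broadcast the map once from the driver; tasks read it via .value.
    val broadcastMap = sc.broadcast(myHashMap)

    val output = input.map { line =>
      val key = line.split(",")(0)
      broadcastMap.value.getOrElse(key, 0.0)   // look up without reshipping the map per task
    }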

(Note that you can of course also use the Databricks "spark-csv" package to read in the CSV file if you prefer.)
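If you go the spark-csv route, usage looks roughly like this (Spark 1.x API; the com.databricks:spark-csv artifact must be on the classpath):

    // Reads the CSV into a DataFrame instead of an RDD[String].
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("s3://myBucket/myFile.csv")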

5 Comments

My utility function needs myHashMap, so my code looks like output = input.map { t => myUtilityFunction(myHashMap, t) }. Is it possible to avoid passing myHashMap to myUtilityFunction each time? Is there a way to broadcast myHashMap and let myUtilityFunction access it directly? Thanks a lot!
Also, I didn't want to use sc.textFile("s3://myBucket/myFile.csv") because I want to keep the code generic so it also works without a Spark context. Thanks.
You do realize that if you let your utility function read the map directly, and you use the utility function like you describe (output = input.map { t => myUtilityFunction(...) }), the map will be read and created for every single row of your input RDD. I really don't think you want that. If you broadcast the variable (using sc.broadcast) on the other hand, you read and create the map only once on your driver, and then all your workers have direct access to it. Why do you not want to pass the map to the utility function? That seems odd to me.
Are you sure the map is created per single row of the input RDD, not per task? The reasons I don't want to pass the HashMap are: 1. cleanliness of the code; 2. I want the same code to be usable in other scenarios where reading the input data inside the utility function is trivial.
If you use map and not mapPartitions, the utility function is applied to each row, and thus if your utility function is in charge of creating the map, it will be created for each row. If you use mapPartitions it will only create the map once per partition, but (depending on the size of your data, of course) that could still easily add significant overhead (I/O is never cheap). IMO, you should focus on writing code that is optimal for parallel processing (Spark) and be less concerned with other trivial (non-parallel) uses of the code.
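To make the trade-off discussed above concrete, a rough sketch (buildMapFromS3 and myUtilityFunction are placeholder names, not part of the original code):

    // Per partition: the map is rebuilt once per partition -- better than per row, still redundant I/O.
    val perPartition = input.mapPartitions { rows =>
      val lookup = buildMapFromS3()                     // runs once per partition
      rows.map(t => myUtilityFunction(lookup, t))
    }

    // Broadcast: the map is built once on the driver and shared with every executor.
    val lookupBroadcast = sc.broadcast(buildMapFromS3())
    val viaBroadcast = input.map(t => myUtilityFunction(lookupBroadcast.value, t))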

This can be achieved even without importing the Amazon S3 libraries, by using SparkContext's textFile. Use the code below:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"

for (line <- sc.textFile(filePath).collect()) {
  val data = line.split(",")
  val value1 = data(0)
  val value2 = data(1).toDouble
  // e.g. put the pair into a map here: myHashMap.put(value1, value2)
}

In the above code, sc.textFile reads the file into an RDD of lines, and collect() brings those lines back to the driver. Inside the loop each line is split on , into a local array, and you can then access the individual values by index.
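If embedding the keys in the URL causes trouble (for example when the secret key contains a / character), an alternative sketch is to set the credentials on the Hadoop configuration instead (property names below assume the s3n connector; use the fs.s3a.* equivalents for s3a):

    // Assumes the s3n (or s3a) Hadoop connector is available on the classpath.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AccessKey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "Securitykey")

    val lines = sc.textFile("s3n://Externalbucket/Myfolder/myscv.csv")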

3 Comments

This code returns the error "java.io.IOException: No FileSystem for scheme: s3"
Can you explain the answer? I am also getting a java.io.FileNotFoundException.
Mitkash and ibaalf, please share your code so I can debug it. There could be a typo somewhere, because this is working perfectly for me.
