I'm working through the Kaggle Titanic example using Spark ML and Scala. When I try to load the training file, I run into a strange error:

java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/jake/Development/titanicExample/src/main/resources/data/titanic/train.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 81, 13, 10]

The file is a .csv, so I'm not sure why it's expecting a Parquet file.

Here is my code:

import org.apache.spark.sql.SparkSession

object App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("liveOrDie")
    .getOrCreate()

  def main(args: Array[String]): Unit = {

    val rawTrainingData = spark.read
      .option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .load("src/main/resources/data/titanic/train.csv")

//    rawTrainingData.show()
  }
}

4 Answers

You're missing the input format, so Spark falls back to its default data source, which is Parquet. Either:

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("src/main/resources/data/titanic/train.csv")

or

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .format("csv")
  .load("src/main/resources/data/titanic/train.csv")

1 Comment

Yeah, I noticed that right after I posted the question. I added the line .format("com.databricks.spark.csv"), but that didn't do the trick. I also tried both of your solutions; neither worked.
The conflict turned out to be between Scala versions in my pom.xml, NOT in my original code. My pom.xml had dependencies built against multiple Scala versions, which seemingly caused the issue. I updated all Scala-dependent dependencies to the same version using a dynamic property <scala.dep.version>2.11</scala.dep.version>, and that fixed the problem.
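For reference, here is a sketch of what that pom.xml pattern looks like. The artifact names and versions are illustrative, not taken from the original project; the point is that every Scala-dependent artifact references the same property:

```xml
<properties>
  <!-- one place to change the Scala binary version -->
  <scala.dep.version>2.11</scala.dep.version>
</properties>

<dependencies>
  <!-- every _2.xx-suffixed artifact uses the shared property -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>
```

Mixing, say, a _2.10 and a _2.11 artifact on the same classpath is a classic source of confusing runtime errors in Spark projects.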

3 Comments

That was not my suggestion; the code you wrote doesn't make any sense, since load does not return a DataFrameReader.
Anyway, I would remove the comment directed at me and then accept this as the answer so other SO users can learn from your experience.
Also, did fixing your dependency issue allow the code from your original question to work, or was it one of the other solutions? Please describe a) what needed to be fixed and b) what code you are using to load CSVs now.
It is expecting a Parquet file because that is the default file format for load.

If you are using Spark < 2.0, you will need the spark-csv package. If you are using Spark 2.0+, the DataFrameReader supports CSV natively: use .csv(..fname..) instead of .load(..fname..).
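As a side note on the error message itself: the "expected magic number at tail" bytes decode to the ASCII string "PAR1", which is the four-byte marker Parquet writes at the end of every file, while the bytes actually found are plain CSV text. A quick sketch (the byte values are copied from the error in the question):

```scala
// Parquet readers validate the 4-byte magic marker "PAR1" at the tail
// of the file; a CSV file naturally ends with ordinary text instead.
val expected = Array[Byte](80, 65, 82, 49) // from the error message
val found    = Array[Byte](44, 81, 13, 10) // from the error message

println(new String(expected, "US-ASCII")) // prints PAR1
// 44 = ',', 81 = 'Q', 13 = '\r', 10 = '\n': CSV content, not Parquet
println(new String(found, "US-ASCII"))
```

That is why the reader immediately rejects train.csv when it is opened through the Parquet code path.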

You have to add the Databricks spark-csv dependency jar to your pom. Lower versions of Spark don't provide an API to read CSV out of the box. Once you add it, you can write something like the following:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("cars.csv")

Ref url: https://github.com/databricks/spark-csv/blob/master/README.md
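For completeness, a typical Maven entry for the package looks like this. The version shown is illustrative; check the README linked above for the current one, and match the artifact suffix to your Scala version:

```xml
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.11</artifactId>
  <version>1.5.0</version>
</dependency>
```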

2 Comments

I had that in there. In fact, now that it's working, I can comment the Databricks dependency out and the program still works. I should note that I'm using Spark 2.0.
That's awesome, and thanks for letting me know about the Spark 2.0 support.
