I'm working through the Kaggle Titanic example using Spark ML and Scala. When I try to load the training file, I run into a strange error:

java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/Users/jake/Development/titanicExample/src/main/resources/data/titanic/train.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [44, 81, 13, 10]

The file is a .csv, so I'm not sure why it's expecting a Parquet file.

Here is my code:

import org.apache.spark.sql.SparkSession

object App {

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("liveOrDie")
    .getOrCreate()

  def main(args: Array[String]): Unit = {

    val rawTrainingData = spark.read
      .option("header", "true")
      .option("delimiter", ",")
      .option("inferSchema", "true")
      .load("src/main/resources/data/titanic/train.csv")

//    rawTrainingData.show()
  }
}

4 Answers

You're missing the input format, so Spark falls back to its default data source, which is Parquet. Either:

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("src/main/resources/data/titanic/train.csv")

or

val rawTrainingData = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .format("csv")
  .load("src/main/resources/data/titanic/train.csv")

1 Comment

Yeah, I noticed that right after I posted the question. I added the line .format("com.databricks.spark.csv"), but that didn't do the trick. I also tried both of your solutions; neither worked.
The conflict turned out to be between Scala versions in my pom.xml, NOT in my original code. My pom.xml had dependencies built against multiple Scala versions, which seemingly caused the issue. I updated all Scala-dependent dependencies to the same version using a dynamic property <scala.dep.version>2.11</scala.dep.version>, and that fixed the problem.
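For reference, here is a sketch of what that pom.xml pattern looks like. The artifact names and versions are illustrative, not taken from the original project; the point is that every Scala-dependent artifact references the same property:

```xml
<properties>
  <!-- one place to change the Scala binary version -->
  <scala.dep.version>2.11</scala.dep.version>
</properties>

<dependencies>
  <!-- every _2.xx-suffixed artifact uses the shared property -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.dep.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>
```

Mixing, say, a _2.10 and a _2.11 artifact on the same classpath is a classic source of confusing runtime errors in Spark projects.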

3 Comments

That was not my suggestion; the code you wrote doesn't make any sense, since load does not return a DataFrameReader.
Anyway, I would remove the comment directed at me and then accept this as the answer so other SO users can learn from your experience.
Also, did fixing your dependency issue allow the code from your original question to work, or was it one of the other solutions? Please describe a) what needed to be fixed and b) what code you are using to load CSVs now.
It is expecting a Parquet file because that is the default file format for load.

If you are using Spark < 2.0, you will need the spark-csv package. If you are using Spark 2.0+, the DataFrameReader supports CSV natively: use .csv(..fname..) instead of .load(..fname..).
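As a side note on the error message itself: the "expected magic number at tail" bytes decode to the ASCII string "PAR1", which is the four-byte marker Parquet writes at the end of every file, while the bytes actually found are plain CSV text. A quick sketch (the byte values are copied from the error in the question):

```scala
// Parquet readers validate the 4-byte magic marker "PAR1" at the tail
// of the file; a CSV file naturally ends with ordinary text instead.
val expected = Array[Byte](80, 65, 82, 49) // from the error message
val found    = Array[Byte](44, 81, 13, 10) // from the error message

println(new String(expected, "US-ASCII")) // prints PAR1
// 44 = ',', 81 = 'Q', 13 = '\r', 10 = '\n': CSV content, not Parquet
println(new String(found, "US-ASCII"))
```

That is why the reader immediately rejects train.csv when it is opened through the Parquet code path.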

You have to add the Databricks spark-csv dependency jar to your pom. Lower versions of Spark don't provide an API to read CSV out of the box. Once you add it, you can write something like the following:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use the first line of all files as the header
  .option("inferSchema", "true") // automatically infer data types
  .load("cars.csv")

Ref url: https://github.com/databricks/spark-csv/blob/master/README.md
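For completeness, a typical Maven entry for the package looks like this. The version shown is illustrative; check the README linked above for the current one, and match the artifact suffix to your Scala version:

```xml
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.11</artifactId>
  <version>1.5.0</version>
</dependency>
```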

2 Comments

I had that in there. In fact, now that it's working, I can comment the Databricks dependency out and the program still works. I should note that I'm using Spark 2.0.
That's awesome, and thanks for letting me know about the Spark 2.0 support.
