Efficient way to load csv file in spark/scala

Question

I am trying to load a CSV file in Scala using Spark. I see that this can be done with either of the two syntaxes below:

  sqlContext.read.format("csv").options(option).load(path)
  sqlContext.read.options(option).csv(path)

What is the difference between the two, and which performs better? Thanks

Tzach Zohar · Accepted Answer · 2017-06-13 17:58:04Z

There's no difference.

So why do both exist?

The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without recompiling Spark: you can register aliases for custom Data Source implementations and have Spark use them. "csv" used to be such a custom implementation (outside the packaged Spark binaries), but it is now part of the project.
There are shorthand methods for the "built-in" data sources (csv, parquet, json, ...) which make the code a bit simpler and are verified at compile time.
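
To illustrate the compile-time point, here is a minimal sketch, assuming Spark 2.x or later and a local SparkSession (the question uses sqlContext, which exposes the same read API). The object name, the helper canLoad, the misspelled format "cvs" and the sample file are all made up for the example. Since the format name is only a string, a typo is not noticed until load runs, whereas a misspelled shorthand method would be rejected by the compiler.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.SparkSession

object FormatTypoSketch {
  // Hypothetical helper: true when Spark can resolve the format name
  // and load the given path with it.
  def canLoad(spark: SparkSession, format: String, path: String): Boolean =
    Try(spark.read.format(format).load(path)).isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("format-typo-sketch")
      .getOrCreate()

    // Made-up sample data written to a temporary file.
    val file = Files.createTempFile("sample", ".csv")
    Files.write(file, "id,name\n1,a\n2,b\n".getBytes("UTF-8"))
    val path = file.toString

    // The correct name is resolved when load() runs.
    println("csv loads: " + canLoad(spark, "csv", path))

    // The typo compiles fine; Spark only rejects it when load() runs.
    println("cvs loads: " + canLoad(spark, "cvs", path))

    // spark.read.cvs(path) // would not compile: DataFrameReader has no such method

    spark.stop()
  }
}
```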

Ultimately, both create a CSV Data Source and use it to load the data.
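
If you want to check this yourself, something like the following sketch should do, assuming Spark 2.x or later with a local SparkSession. The object name, the sample file and the options map are hypothetical ("header" and "inferSchema" are standard CSV reader options). It loads the same file both ways, compares schema and rows, and prints the physical plans so you can compare the scan nodes.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvLoadSketch {
  // Hypothetical options, shared by both call styles.
  val csvOptions: Map[String, String] =
    Map("header" -> "true", "inferSchema" -> "true")

  // Generic, string-keyed entry point.
  def viaFormat(spark: SparkSession, path: String): DataFrame =
    spark.read.format("csv").options(csvOptions).load(path)

  // Shorthand for the built-in CSV source.
  def viaShorthand(spark: SparkSession, path: String): DataFrame =
    spark.read.options(csvOptions).csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-load-sketch")
      .getOrCreate()

    // Made-up sample data written to a temporary file.
    val file = Files.createTempFile("sample", ".csv")
    Files.write(file, "id,name\n1,a\n2,b\n".getBytes("UTF-8"))
    val path = file.toString

    val a = viaFormat(spark, path)
    val b = viaShorthand(spark, path)

    // Compare the inferred schemas and the loaded rows.
    println("same schema: " + (a.schema == b.schema))
    println("same rows: " + (a.collect().toSeq == b.collect().toSeq))

    // Print both physical plans for a side-by-side look.
    a.explain()
    b.explain()

    spark.stop()
  }
}
```

This only shows that the two calls behave the same on one small file; it is not a benchmark.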

Bottom line, for any supported format, you should opt for the "shorthand" method, e.g. csv(path).
