I believe you are confused about how Spark behaves; I would recommend reading the official documentation and/or a tutorial first.
Nevertheless, I hope this answers your question.
This code saves a DataFrame as a single CSV file on a local filesystem.
It was tested with Spark 2.4.0 and Scala 2.12.8 on an Ubuntu 18.04 laptop.
import org.apache.spark.sql.SparkSession

val spark =
  SparkSession
    .builder
    .master("local[*]")
    .appName("CSV Writer Test")
    .getOrCreate()

import spark.implicits._

val df =
  Seq(
    ("Alex",  "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
    ("Bob",   "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
    ("Mark",  "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
    ("Mark",  "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
    ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
  ).toDF("NAME", "START_DATE", "END_DATE", "STATUS")

df.printSchema
// root
//  |-- NAME: string (nullable = true)
//  |-- START_DATE: string (nullable = true)
//  |-- END_DATE: string (nullable = true)
//  |-- STATUS: string (nullable = true)

// coalesce(1) moves all the data to a single partition, so only one CSV file is written.
df.coalesce(numPartitions = 1)
  .write
  .option(key = "header", value = "true")
  .option(key = "sep", value = ",")
  .option(key = "encoding", value = "UTF-8")
  .option(key = "compression", value = "none")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "file:///home/balmungsan/dailyReport/") // Change the path. Note there are three slashes: the first two belong to the file:// protocol, the third one marks the root folder.

spark.stop()
Now, let's check the saved file.
balmungsan@BalmungSan:dailyReport $ pwd
/home/balmungsan/dailyReport
balmungsan@BalmungSan:dailyReport $ ls
part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv _SUCCESS
balmungsan@BalmungSan:dailyReport $ cat part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv
NAME,START_DATE,END_DATE,STATUS
Alex,2018-01-01 00:00:00,2018-02-01 00:00:00,OUT
Bob,2018-02-01 00:00:00,2018-02-05 00:00:00,IN
Mark,2018-02-01 00:00:00,2018-03-01 00:00:00,IN
Mark,2018-05-01 00:00:00,2018-08-01 00:00:00,OUT
Meggy,2018-02-01 00:00:00,2018-02-01 00:00:00,OUT
The _SUCCESS file exists to signal that the write succeeded.
Important notes:
- You need to specify the file:// protocol to save to the local filesystem instead of HDFS.
- The path specifies the folder in which the partitions of the file are saved, not the name of the file itself; inside that folder there will be one file per partition. If you want to read the file again with Spark, you only need to specify the folder and Spark will understand the partition files. If not, I would recommend renaming the file afterwards; as far as I know, there is no way to control the file name from Spark. See the sketch after these notes.
- If the df is too big to fit in the memory of just one node, the job will fail.
- If you run this in a distributed way (e.g. with master yarn), the file will not be saved on the master node but on one of the worker nodes. If you really need it on the master node, you can collect the DataFrame and write it with plain Scala, as Dmitry suggests below.
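For example, here is a minimal sketch of both options: reading the folder back with Spark, and renaming the single part file with plain Java IO. The folder path matches the example above, the target name dailyReport.csv is just a made-up example, and the sketch assumes the write succeeded and produced exactly one part file.

// Option 1: read the folder back with Spark (in an active SparkSession, i.e. before spark.stop()).
// Spark understands the partition files inside the folder.
val dailyReport =
  spark
    .read
    .option("header", "true")
    .csv("file:///home/balmungsan/dailyReport/")

// Option 2: rename the single part file with plain Java IO.
import java.nio.file.{Files, StandardCopyOption}

val folder = new java.io.File("/home/balmungsan/dailyReport/")
val partFile =
  folder
    .listFiles
    .find(_.getName.startsWith("part-"))
    .get // assumes the write produced exactly one part file

Files.move(
  partFile.toPath,
  folder.toPath.resolve("dailyReport.csv"),
  StandardCopyOption.REPLACE_EXISTING
)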
Regarding dataFrame.write.option("header", "true").csv("/home/reports/"): you should not mix up the local file system and HDFS. The path you specified is a path on HDFS, and your user doesn't have permission to write to that location; dataFrame.write.csv(path) saves data to HDFS or S3.

If you want to save your DataFrame to a local directory on the server, you should collect the data first with val dataToSave = dataFrame.collect(). Calling collect on the DataFrame gives you an Array[Row], and all the data of the DataFrame goes into memory on the Spark master node, so make sure you have enough memory for that. After that you can save the data with the standard Scala/Java IO API, for example PrintWriter or FileWriter. To build the lines you can do val linesToSave: Array[String] = dataToSave.map(_.toSeq.mkString(";")). Also, you can check au.com.bytecode.opencsv.CSVWriter for creating CSV files from Array[Row].
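Putting that together, a minimal sketch of the collect-and-write approach, assuming dataFrame is the DataFrame from the question; the output path /home/reports/dailyReport.csv is just an example, and the plain mkString(";") does no CSV quoting or escaping, so it only works for values that contain no separator themselves.

import java.io.PrintWriter

// Bring all rows to the master node - only safe if the DataFrame fits in its memory.
val dataToSave = dataFrame.collect()

// Turn each Row into a ;-separated line (no quoting/escaping).
val linesToSave: Array[String] = dataToSave.map(_.toSeq.mkString(";"))

val writer = new PrintWriter("/home/reports/dailyReport.csv")
try {
  writer.println(dataFrame.columns.mkString(";")) // header line
  linesToSave.foreach(writer.println)
} finally {
  writer.close()
}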