
Task: read a CSV file, add two columns holding lowercased values, sort, and save the file. Problem: when sorting is applied, Spark creates multiple output files. Can someone please explain what is happening here?

import org.apache.spark.sql.functions.{col, lower}

var df = spark.read
  .format("csv")
  .option("header", "true")
  .load(i_file)
  .select("Id", "Name", "Address")

df = df.withColumn("x_name", lower(col("Name")))
df = df.withColumn("x_address", lower(col("Address")))
df = df.orderBy("x_name") // <--- this line causes multiple output files
df.write.option("header", "true").csv(o_file)

If I remove the orderBy, only one file is created.

  • Hmm, maybe it does not matter; let Spark store these as partitioned files. That is my understanding! Commented Sep 13, 2018 at 13:59
  • Thanks @Dima, that answers my question. Sorry for the duplicate; not sure why I could not find that one! Commented Sep 13, 2018 at 15:53
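
For readers landing here, a minimal sketch of what is probably going on, under some assumptions. A global orderBy makes Spark shuffle the rows into range partitions, and the writer emits one part file per non-empty partition. The partition count after the shuffle is governed by spark.sql.shuffle.partitions (200 by default, and possibly fewer in practice). Without the sort, a small CSV is typically read as a single partition, hence the single file. The paths, the object name, and the local master below are hypothetical and only for illustration; coalesce(1) is one common way to get a single file, not necessarily what the linked duplicate recommends.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

object SortedCsvSingleFile {
  def main(args: Array[String]): Unit = {
    // Hypothetical input file and output directory, for illustration only.
    val iFile = "input.csv"
    val oFile = "output_dir"

    // Assumption: a local session; on a cluster the session usually already exists.
    val spark = SparkSession.builder()
      .appName("sorted-csv")
      .master("local[*]")
      .getOrCreate()

    val sorted = spark.read
      .format("csv")
      .option("header", "true")
      .load(iFile)
      .select("Id", "Name", "Address")
      .withColumn("x_name", lower(col("Name")))
      .withColumn("x_address", lower(col("Address")))
      .orderBy("x_name") // global sort: triggers a range-partitioning shuffle

    // Number of partitions after the sort; each non-empty one becomes a part file.
    println(s"partitions after orderBy: ${sorted.rdd.getNumPartitions}")

    // Merge the sorted partitions into one before writing, so that the
    // output directory contains a single part file.
    sorted.coalesce(1)
      .write
      .option("header", "true")
      .csv(oFile)

    spark.stop()
  }
}
```

Note that the output path is still a directory; with coalesce(1) it should contain one part-* file instead of many. Funnelling everything through one partition means a single task writes all the data, so this is only reasonable for small outputs.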
