
The CSV file is written with extra double quotes, which results in all columns being read into a single column.

There are four columns, with a header and two data rows:

"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
  • Please look at this forum post; it says that the delimiter can only be a single character. Hence, you either need to specify the schema on your own, or read the file as an RDD, cleanse the records, and convert it to a DataFrame (see the sketch below). Commented Mar 6, 2018 at 3:03
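
For reference, the "specify the schema on your own" route suggested in the comment would look roughly like the sketch below (the column types are assumptions based on the sample data). With this particular file it still leaves each physical line parsed as one quoted field, which is why the answer below cleanses the file as an RDD first.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

//hand-written schema for the four expected columns (types assumed from the sample)
val customSchema = StructType(Seq(
  StructField("SlNo", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("contact", StringType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("bank.csv")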

1 Answer


What you can do is read the file with sparkContext, replace all " characters with empty strings, and use zipWithIndex() to separate the header from the data, so that a custom schema and a row RDD can be created. Finally, pass the row RDD and the schema to sqlContext's createDataFrame API.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

//reading the text file, stripping the quotes, splitting on commas and zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating the header to form the schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating the data to form the row RDD
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)

You should be getting

+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+

I hope the answer is helpful


6 Comments

You would need to import Row as import org.apache.spark.sql.Row (shown at the top of the snippet).
The header statement filters out the first line's array of the text file. The schema statement creates a schema for the dataframe from that header array, and the rddData statement converts the data line arrays into Rows to be converted to a dataframe. I have added comments for explanation :)
Thank you, really .. I got the output successfully. Can you suggest any website or textbook for Spark SQL and Scala? I need to learn in-depth coding.
If the answer is helpful then you should consider accepting it. For learning Spark SQL you can go through the official Spark website; Scala I am learning myself as well. :)
Sir, in the schema statement all columns are taken as string type. If I want to specify different data types for the columns ... it's not working.
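
Regarding the last comment: since every field comes out of the split as a string, one option is to cast the columns after the DataFrame is created. A minimal sketch, assuming SlNo and Age should be integers (reusing rddData and schema from the answer above):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

//building the all-string dataframe as in the answer, then casting the numeric columns
val typed = sqlContext.createDataFrame(rddData, schema)
  .withColumn("SlNo", col("SlNo").cast(IntegerType))
  .withColumn("Age", col("Age").cast(IntegerType))

typed.printSchema()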
