
The CSV file is written with extra double quotes, which results in all columns being read into a single column.

There are four columns, with a header and two data rows:

"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
  • Please look at this forum post; it says that the delimiter can only be a single character. Hence, you either need to specify the schema on your own, or read the file as an RDD, cleanse the records, and convert it to a DataFrame (see the sketch below). Commented Mar 6, 2018 at 3:03
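
For reference, the "specify the schema on your own" route suggested in the comment would look roughly like the sketch below (the column types are assumptions based on the sample data). With this particular file it still leaves each physical line parsed as one quoted field, which is why the answer below cleanses the file as an RDD first.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

//hand-written schema for the four expected columns (types assumed from the sample)
val customSchema = StructType(Seq(
  StructField("SlNo", IntegerType, true),
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("contact", StringType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("bank.csv")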

1 Answer


What you can do is read the file with sparkContext, replace all " characters with empty strings, and use zipWithIndex() to separate the header from the data, so that a custom schema and a row RDD can be created. Finally, pass the row RDD and the schema to sqlContext's createDataFrame API.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

//reading the text file, stripping the quotes, splitting on commas and zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating the header to form the schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating the data to form the row RDD
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)

You should be getting

+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+

I hope the answer is helpful


6 Comments

You would need to import Row as import org.apache.spark.sql.Row (shown at the top of the snippet).
The header statement filters out the first line's array of the text file. The schema statement creates a schema for the dataframe from that header array, and the rddData statement converts the data line arrays into Rows to be converted to a dataframe. I have added comments for explanation :)
Thank you, really .. I got the output successfully. Can you suggest any website or textbook for Spark SQL and Scala? I need to learn in-depth coding.
If the answer is helpful then you should consider accepting it. For learning Spark SQL you can go through the official Spark website; Scala I am learning myself as well. :)
Sir, in the schema statement all columns are taken as string type. If I want to specify different data types for the columns ... it's not working.
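
Regarding the last comment: since every field comes out of the split as a string, one option is to cast the columns after the DataFrame is created. A minimal sketch, assuming SlNo and Age should be integers (reusing rddData and schema from the answer above):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

//building the all-string dataframe as in the answer, then casting the numeric columns
val typed = sqlContext.createDataFrame(rddData, schema)
  .withColumn("SlNo", col("SlNo").cast(IntegerType))
  .withColumn("Age", col("Age").cast(IntegerType))

typed.printSchema()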
