How to read CSV files directly into Spark DataFrames without using the Databricks CSV API?
I know there is the Databricks CSV API, but I can't use that API.
I know I could use a case class and map the columns by position with cols(0), cols(1), and so on (as sketched below), but the problem is that I have more than 22 columns, and case classes are limited to 22 fields.
I know there is StructType to define a schema, but I feel it would be very lengthy code to define 40 columns in a StructType.
I am looking for a way to read the file into a DataFrame with a read method, but Spark has no direct support for CSV files, so I would have to parse it myself. But how, if I have more than 40 columns?
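For context, the case class route I mean looks roughly like the sketch below (the column names are made up, and only 3 of the 40+ columns are shown); on Scala 2.10 this stops working past 22 fields:

```scala
// Hypothetical 3-column example; with 40+ columns this case class
// cannot be written on Scala 2.10 because of the 22-field limit.
case class Record(col1: String, col2: String, col3: String)

import sqlContext.implicits._                      // assumes a Spark 1.x spark-shell

val df = sc.textFile("/path/to/file.csv")          // placeholder path
  .map(_.split(","))
  .map(cols => Record(cols(0), cols(1), cols(2)))
  .toDF()
```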
- What is wrong with the Databricks CSV API? – Himaprasoon, Jul 5, 2016 at 5:19
- @Himaprasoon, nothing is wrong with the Databricks CSV API. Actually I have to take the Hortonworks HDPCD Spark certification exam, and in the exam they don't provide the Databricks API; we can only use Spark's built-in APIs. – Devender Prakash, Jul 5, 2016 at 18:38
- Was my answer helpful? If not, what else have you found, if anything? – Ram Ghadiyaram, Oct 2, 2016 at 8:05
2 Answers
I've also looked into this and ended up writing a Python script to generate the Scala code for the parse(line) function and the schema definition. Yes, this may become a lengthy blob of code.
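For illustration, the generated Scala usually follows a pattern like the sketch below (the column names and types are invented here; the real generator just emits one StructField line and one cast per column):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Generated schema definition: one StructField per CSV column.
val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("amount", DoubleType, true)
  // the generator emits one more line per remaining column
))

// Generated parse(line): split the CSV line and cast each field.
// Note: a plain split does not handle quoted fields containing commas.
def parse(line: String): Row = {
  val f = line.split(",", -1)
  Row(f(0).toInt, f(1), f(2).toDouble)
}

val df = sqlContext.createDataFrame(sc.textFile("/path/to/file.csv").map(parse), schema)
```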
Another path you can take if your data is not too big: use Python pandas. Start up PySpark, read your data into a pandas DataFrame, and then create a Spark DataFrame from it. Save it (e.g. as a Parquet file), and load that Parquet file in Scala Spark.
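Once the PySpark/pandas step has written the data out as Parquet, the Scala side is short, since the schema travels inside the Parquet file itself. A minimal sketch (the path is a placeholder):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                      // Spark 1.x; already available in spark-shell
val df = sqlContext.read.parquet("/tmp/mydata.parquet")  // placeholder path written by the PySpark step
df.printSchema()                                         // all 40+ columns come back with their types
```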
Comments
It seems that from Scala 2.11.x onwards the arity limit issue is fixed; please have a look at https://issues.scala-lang.org/browse/SI-7296
To overcome this on < 2.11, see my answer, which extends Product and overrides the methods productArity, productElement, and canEqual(that: Any).
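A minimal sketch of that Product-based workaround, trimmed to 3 fields for readability (a real class would list all 40+ columns both in the constructor and in productElement):

```scala
// Plain class (not a case class) extending Product, so Spark can infer a
// schema from the constructor parameters and read values via productElement.
class Record(val col1: String, val col2: Int, val col3: Double)
    extends Product with Serializable {

  override def productArity: Int = 3

  override def productElement(n: Int): Any = n match {
    case 0 => col1
    case 1 => col2
    case 2 => col3
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}

// Usage sketch: parse each CSV line into a Record (Spark 1.x SQLContext assumed).
val df = sqlContext.createDataFrame(
  sc.textFile("/path/to/file.csv")                       // placeholder path
    .map(_.split(","))
    .map(f => new Record(f(0), f(1).toInt, f(2).toDouble))
)
```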