How to read CSV files directly into Spark DataFrames without using the Databricks CSV API?
I know there is the Databricks CSV API, but I can't use that API.
I know I could use a case class and map the columns by position with cols(0), cols(1), and so on (as sketched below), but the problem is that I have more than 22 columns, and case classes are limited to 22 fields.
I know there is StructType to define a schema, but I feel it would be very lengthy code to define 40 columns in a StructType.
I am looking for a way to read the file into a DataFrame with a read method, but Spark has no direct support for CSV files, so I would have to parse it myself. But how, if I have more than 40 columns?
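For context, the case class route I mean looks roughly like the sketch below (the column names are made up, and only 3 of the 40+ columns are shown); on Scala 2.10 this stops working past 22 fields:

```scala
// Hypothetical 3-column example; with 40+ columns this case class
// cannot be written on Scala 2.10 because of the 22-field limit.
case class Record(col1: String, col2: String, col3: String)

import sqlContext.implicits._                      // assumes a Spark 1.x spark-shell

val df = sc.textFile("/path/to/file.csv")          // placeholder path
  .map(_.split(","))
  .map(cols => Record(cols(0), cols(1), cols(2)))
  .toDF()
```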
- What is wrong with the Databricks CSV API? – Himaprasoon, Jul 5, 2016 at 5:19
- @Himaprasoon, nothing is wrong with the Databricks CSV API. Actually I have to take the Hortonworks HDPCD Spark certification exam, and in the exam they don't provide the Databricks API; we can only use Spark's built-in APIs. – Devender Prakash, Jul 5, 2016 at 18:38
- Was my answer helpful? If not, what else have you found, if anything? – Ram Ghadiyaram, Oct 2, 2016 at 8:05
2 Answers
I've also looked into this and ended up writing a Python script to generate the Scala code for the parse(line) function and the schema definition. Yes, this may become a lengthy blob of code.
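For illustration, the generated Scala usually follows a pattern like the sketch below (the column names and types are invented here; the real generator just emits one StructField line and one cast per column):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Generated schema definition: one StructField per CSV column.
val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("amount", DoubleType, true)
  // the generator emits one more line per remaining column
))

// Generated parse(line): split the CSV line and cast each field.
// Note: a plain split does not handle quoted fields containing commas.
def parse(line: String): Row = {
  val f = line.split(",", -1)
  Row(f(0).toInt, f(1), f(2).toDouble)
}

val df = sqlContext.createDataFrame(sc.textFile("/path/to/file.csv").map(parse), schema)
```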
Another path you can take if your data is not too big: use Python pandas. Start up PySpark, read your data into a pandas DataFrame, and then create a Spark DataFrame from it. Save it (e.g. as a Parquet file), and load that Parquet file in Scala Spark.
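Once the PySpark/pandas step has written the data out as Parquet, the Scala side is short, since the schema travels inside the Parquet file itself. A minimal sketch (the path is a placeholder):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                      // Spark 1.x; already available in spark-shell
val df = sqlContext.read.parquet("/tmp/mydata.parquet")  // placeholder path written by the PySpark step
df.printSchema()                                         // all 40+ columns come back with their types
```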
Comments
It seems that from Scala 2.11.x onwards the arity limit issue is fixed; please have a look at https://issues.scala-lang.org/browse/SI-7296
To overcome this on < 2.11, see my answer, which extends Product and overrides the methods productArity, productElement, and canEqual(that: Any).
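A minimal sketch of that Product-based workaround, trimmed to 3 fields for readability (a real class would list all 40+ columns both in the constructor and in productElement):

```scala
// Plain class (not a case class) extending Product, so Spark can infer a
// schema from the constructor parameters and read values via productElement.
class Record(val col1: String, val col2: Int, val col3: Double)
    extends Product with Serializable {

  override def productArity: Int = 3

  override def productElement(n: Int): Any = n match {
    case 0 => col1
    case 1 => col2
    case 2 => col3
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}

// Usage sketch: parse each CSV line into a Record (Spark 1.x SQLContext assumed).
val df = sqlContext.createDataFrame(
  sc.textFile("/path/to/file.csv")                       // placeholder path
    .map(_.split(","))
    .map(f => new Record(f(0), f(1).toInt, f(2).toDouble))
)
```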