I am using Spark 1.5.2 and want to create a DataFrame from Scala objects using one of the following approaches. My goal is to build data for unit testing.
class Address(first: String = null, second: String = null, zip: String = null)
class Person(id: String = null, name: String = null, address: Seq[Address] = null)
def test() = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val persons = Seq(
    new Person(id = "1", name = "Salim",
      address = Seq(new Address(first = "1st street"))),
    new Person(name = "Sana",
      address = Seq(new Address(zip = "60088")))
  )

  // The code can't infer the schema automatically
  val claimDF = sqlContext.createDataFrame(sc.parallelize(persons, 2), classOf[Person])
  claimDF.printSchema() // This prints only "root", not the schema of Person.
}
If instead I convert Person and Address to case classes, Spark can infer the schema automatically, either with the syntax above, with sc.parallelize(persons, 2).toDF, or with sqlContext.createDataFrame and an explicit StructType.
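For reference, here is a minimal sketch of the case-class variant that does infer the schema (the names PersonCC and AddressCC are just placeholders for this example; my real classes have far more fields):

case class AddressCC(first: String = null, second: String = null, zip: String = null)
case class PersonCC(id: String = null, name: String = null, address: Seq[AddressCC] = null)

val personsCC = Seq(
  PersonCC(id = "1", name = "Salim", address = Seq(AddressCC(first = "1st street"))),
  PersonCC(name = "Sana", address = Seq(AddressCC(zip = "60088")))
)

import sqlContext.implicits._
val df = sc.parallelize(personsCC, 2).toDF()
df.printSchema() // prints the nested Person/Address schema, not just "root"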
I can't use case classes because they can't hold more than 22 fields and my classes have many more than that. Using a StructType is possible but very inconvenient, while a case class would be the most convenient option if it could hold that many properties.
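To show what I mean by inconvenient, this is roughly what the StructType route looks like for just these few fields (a sketch only; my real classes would need dozens of StructFields and hand-built Rows):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val addressType = StructType(Seq(
  StructField("first", StringType, nullable = true),
  StructField("second", StringType, nullable = true),
  StructField("zip", StringType, nullable = true)
))
val personType = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("address", ArrayType(addressType), nullable = true)
))

val rowRDD = sc.parallelize(Seq(
  Row("1", "Salim", Seq(Row("1st street", null, null))),
  Row(null, "Sana", Seq(Row(null, null, "60088")))
), 2)

val df = sqlContext.createDataFrame(rowRDD, personType)
df.printSchema() // the schema comes out right, but every field had to be written out by hand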
Please help, thanks in advance.
For reference, the overload that performs the inference requires a Product: createDataFrame[A <: Product](data: Seq[A])