
I have a fixed-length file (a sample is shown below) and I want to read it with the DataFrames API in Spark, using Scala (not Python or Java). The DataFrames API has ways to read text files, JSON files, and so on, but I am not sure whether there is a way to read a fixed-length file. While searching the internet I found a GitHub link that required downloading spark-fixedwidth-assembly-1.0.jar for this purpose, but I was unable to find that jar anywhere. I am completely lost here and need your suggestions and help. There are a couple of posts on Stack Overflow, but they are not relevant to Scala and the DataFrame API.

Here is the file:

56 apple     TRUE 0.56
45 pear      FALSE1.34
34 raspberry TRUE 2.43
34 plum      TRUE 1.31
53 cherry    TRUE 1.4 
23 orange    FALSE2.34
56 persimmon FALSE23.2

The fixed widths of the columns are 3, 10, 5, and 4.
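
For reference, those widths translate into substring boundaries as follows (a quick sketch; the value names are just illustrative):

val widths = Seq(3, 10, 5, 4)
val bounds = widths.scanLeft(0)(_ + _)   // Seq(0, 3, 13, 18, 22)
val slices = bounds.zip(bounds.tail)     // Seq((0,3), (3,13), (13,18), (18,22))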

Any suggestions would be appreciated.

2 Answers


The fixed-length format is very old, and I could not find a good Scala library for it... so I created my own.

You can check it out here: https://github.com/atais/Fixed-Length

Usage with Spark is quite simple: you get a Dataset of your objects!

You first need to create a description of your objects, e.g.:

case class Employee(name: String, number: Option[Int], manager: Boolean)

object Employee {

    import com.github.atais.fixedlength._   // Codec, Parser, Alignment (assumed package name)
    import com.github.atais.util.Read._
    import cats.implicits._
    import com.github.atais.util.Write._
    import Codec._

    // each `fixed` maps a [start, end) character range to one field
    implicit val employeeCodec: Codec[Employee] = {
      fixed[String](0, 10) <<:
        fixed[Option[Int]](10, 13, Alignment.Right) <<:
        fixed[Boolean](13, 18)
    }.as[Employee]
}
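
Before wiring this into Spark, you can exercise the codec on a single line. A minimal sketch (the sample line is made up to fit the 0-18 character layout above, and the error type on the Left is whatever the library exposes):

val line = "John Doe   42true "
Parser.decode[Employee](line) match {
  case Right(emp) => println(s"parsed: $emp")
  case Left(err)  => System.err.println(s"failed: $err")
}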

And later just use the parser:

// `sql` is your existing SparkSession
import sql.implicits._   // supplies the Encoder that createDataset needs

val input = sql.sparkContext.textFile(file)
               .filter(_.trim.nonEmpty)        // skip blank lines
               .map(Parser.decode[Employee])   // Either of error or Employee
               .flatMap {
                  case Right(x) => Some(x)
                  case Left(e) =>
                    System.err.println(s"Failed to process file $file, error: $e")
                    None
               }
sql.createDataset(input)
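
Applied to the file in the question (widths 3, 10, 5, 4), a codec might look like the sketch below. Fruit and its field names are made up, and it assumes the library ships Read instances for Int, Boolean and Double and strips the padding; worth verifying against the README:

case class Fruit(id: Int, name: String, fresh: Boolean, price: Double)

object Fruit {
  import com.github.atais.fixedlength._   // assumed package name
  import com.github.atais.util.Read._
  import cats.implicits._
  import com.github.atais.util.Write._
  import Codec._

  // boundaries follow the widths 3, 10, 5, 4 from the question
  implicit val fruitCodec: Codec[Fruit] = {
    fixed[Int](0, 3) <<:
      fixed[String](3, 13) <<:
      fixed[Boolean](13, 18) <<:
      fixed[Double](18, 22)
  }.as[Fruit]
}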


Well... use substring to break the lines, then trim to remove whitespace, and then do whatever you want.

case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

// in spark-shell the implicits are already in scope;
// in an application you need `import spark.implicits._` for .toDF
sc.textFile("your_file_path")
  .map(l => (l.substring(0, 3).trim(), l.substring(3, 13).trim(), l.substring(13, 18).trim(), l.substring(18, 22).trim()))
  .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
  .toDF
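
For completeness, the same approach as a self-contained application; a sketch in which the object name, app name, and file path are placeholders:

import org.apache.spark.sql.SparkSession

object FixedWidthExample {
  case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width").master("local[*]").getOrCreate()
    import spark.implicits._   // enables .toDF on RDDs of case classes

    val df = spark.sparkContext.textFile("your_file_path")
      .map(l => (l.substring(0, 3).trim, l.substring(3, 13).trim, l.substring(13, 18).trim, l.substring(18, 22).trim))
      .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
      .toDF()

    df.show()
    spark.stop()
  }
}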

3 Comments

I tried it in the REPL but got an error. Can you please suggest how to make it work in the REPL?

<console>:32: error: wrong number of parameters; expected = 1
val mapRDD = file.map(l => (l.substring(0, 4).trim(), l.substring(4, 14).trim(), l.substring(14, 19).trim(), l.substring(19, 23).trim())).map((e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble)).toDF
Should be fixed now. Try running each map step by step in the REPL.
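
The error in the comment comes from passing a four-parameter function to map, which expects a single argument; destructuring the tuple with `case` fixes it. A minimal sketch:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

// compiles: `case` destructures the single tuple argument
pairs.map { case (n, s) => s * n }

// does not compile: map expects a function of one parameter
// pairs.map((n, s) => s * n)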
