
I have a fixed-length file (a sample is shown below) and I want to read it with the DataFrames API in Spark, using Scala (not Python or Java). The DataFrames API has ways to read text files, JSON files, and so on, but I am not sure whether there is a way to read a fixed-length file. While searching the internet I found a GitHub link that required downloading spark-fixedwidth-assembly-1.0.jar for this purpose, but I was unable to find that jar anywhere. I am completely lost here and need your suggestions and help. There are a couple of posts on Stack Overflow, but they are not relevant to Scala and the DataFrame API.

Here is the file:

56 apple     TRUE 0.56
45 pear      FALSE1.34
34 raspberry TRUE 2.43
34 plum      TRUE 1.31
53 cherry    TRUE 1.4 
23 orange    FALSE2.34
56 persimmon FALSE23.2

The fixed widths of the columns are 3, 10, 5, and 4.
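
For reference, those widths translate into substring boundaries as follows (a quick sketch; the value names are just illustrative):

val widths = Seq(3, 10, 5, 4)
val bounds = widths.scanLeft(0)(_ + _)   // Seq(0, 3, 13, 18, 22)
val slices = bounds.zip(bounds.tail)     // Seq((0,3), (3,13), (13,18), (18,22))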

Any suggestions would be appreciated.

2 Answers


The fixed-length format is very old, and I could not find a good Scala library for it... so I created my own.

You can check it out here: https://github.com/atais/Fixed-Length

Usage with Spark is quite simple: you get a Dataset of your objects!

You first need to create a description of your objects, e.g.:

case class Employee(name: String, number: Option[Int], manager: Boolean)

object Employee {

    import com.github.atais.fixedlength._   // Codec, Parser, Alignment (assumed package name)
    import com.github.atais.util.Read._
    import cats.implicits._
    import com.github.atais.util.Write._
    import Codec._

    // each `fixed` maps a [start, end) character range to one field
    implicit val employeeCodec: Codec[Employee] = {
      fixed[String](0, 10) <<:
        fixed[Option[Int]](10, 13, Alignment.Right) <<:
        fixed[Boolean](13, 18)
    }.as[Employee]
}
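
Before wiring this into Spark, you can exercise the codec on a single line. A minimal sketch (the sample line is made up to fit the 0-18 character layout above, and the error type on the Left is whatever the library exposes):

val line = "John Doe   42true "
Parser.decode[Employee](line) match {
  case Right(emp) => println(s"parsed: $emp")
  case Left(err)  => System.err.println(s"failed: $err")
}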

And later just use the parser:

// `sql` is your existing SparkSession
import sql.implicits._   // supplies the Encoder that createDataset needs

val input = sql.sparkContext.textFile(file)
               .filter(_.trim.nonEmpty)        // skip blank lines
               .map(Parser.decode[Employee])   // Either of error or Employee
               .flatMap {
                  case Right(x) => Some(x)
                  case Left(e) =>
                    System.err.println(s"Failed to process file $file, error: $e")
                    None
               }
sql.createDataset(input)
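
Applied to the file in the question (widths 3, 10, 5, 4), a codec might look like the sketch below. Fruit and its field names are made up, and it assumes the library ships Read instances for Int, Boolean and Double and strips the padding; worth verifying against the README:

case class Fruit(id: Int, name: String, fresh: Boolean, price: Double)

object Fruit {
  import com.github.atais.fixedlength._   // assumed package name
  import com.github.atais.util.Read._
  import cats.implicits._
  import com.github.atais.util.Write._
  import Codec._

  // boundaries follow the widths 3, 10, 5, 4 from the question
  implicit val fruitCodec: Codec[Fruit] = {
    fixed[Int](0, 3) <<:
      fixed[String](3, 13) <<:
      fixed[Boolean](13, 18) <<:
      fixed[Double](18, 22)
  }.as[Fruit]
}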


Well... use substring to break the lines, then trim to remove whitespace, and then do whatever you want.

case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

// in spark-shell the implicits are already in scope;
// in an application you need `import spark.implicits._` for .toDF
sc.textFile("your_file_path")
  .map(l => (l.substring(0, 3).trim(), l.substring(3, 13).trim(), l.substring(13, 18).trim(), l.substring(18, 22).trim()))
  .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
  .toDF
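
For completeness, the same approach as a self-contained application; a sketch in which the object name, app name, and file path are placeholders:

import org.apache.spark.sql.SparkSession

object FixedWidthExample {
  case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width").master("local[*]").getOrCreate()
    import spark.implicits._   // enables .toDF on RDDs of case classes

    val df = spark.sparkContext.textFile("your_file_path")
      .map(l => (l.substring(0, 3).trim, l.substring(3, 13).trim, l.substring(13, 18).trim, l.substring(18, 22).trim))
      .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
      .toDF()

    df.show()
    spark.stop()
  }
}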

3 Comments

I tried it in the REPL but got an error. Can you please suggest how to make it work in the REPL?

<console>:32: error: wrong number of parameters; expected = 1
val mapRDD = file.map(l => (l.substring(0, 4).trim(), l.substring(4, 14).trim(), l.substring(14, 19).trim(), l.substring(19, 23).trim())).map((e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble)).toDF
Should be fixed now. Try running each map step by step in the REPL.
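
The error in the comment comes from passing a four-parameter function to map, which expects a single argument; destructuring the tuple with `case` fixes it. A minimal sketch:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

// compiles: `case` destructures the single tuple argument
pairs.map { case (n, s) => s * n }

// does not compile: map expects a function of one parameter
// pairs.map((n, s) => s * n)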
