
I am trying to read a CSV file in Spark using the CSV reader API, but I am encountering an ArrayIndexOutOfBoundsException.

Validation:

There is no issue with the input file. All the rows have the same number of columns: 65.

Below is the code that I tried:

sparkSess.read.option("header", "true").option("delimiter", "|").csv(filePath)

Expected result - dataFrame.show()

Actual Error -

19/03/28 10:42:51 INFO FileScanRDD: Reading File path: file:///C:/Users/testing/workspace_xxxx/abc_Reports/src/test/java/report1.csv, range: 0-10542, partition values: [empty row]
19/03/28 10:42:51 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.lang.ArrayIndexOutOfBoundsException: 63
    at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
    at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Input Data:

A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|AA|BB|CC|DD|EE|FF|GG|HH|II|JJ|KK|LL|MM|NN|OO|PP|QQ|RR|SS|TT|UU|VV|WW|XX|YY|ZZ|TGHJ|HG|EEE|ASD|EFFDCLDT|QSAS|WWW|DATIME|JOBNM|VFDCXS|REWE|XCVVCX|ASDFF
QW|8|2344|H02|1002|              |1|2019-01-20|9999-12-31|  |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |IN|9|1234444| |        |        |10|QQ|8|BMX10290M|EWR|   |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|  |          |          |      |RE|WW|  |RQ|   |   |   |        |     |        |  | |1901-01-01|0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000|        |
  • Can you please share the sample CSV file content you are using? Commented Mar 28, 2019 at 3:01
  • @KZapagol - added sample data as requested! Commented Mar 28, 2019 at 3:15
  • @Dasarathy .. I am able to read the CSV file with your sample data. See my updated comment. Commented Mar 28, 2019 at 3:38
  • @KZapagol Can you please set header to true and try? Commented Mar 28, 2019 at 3:59
  • @Dasarathy .. I have tried with header and it works fine. Please see my updated comment. Commented Mar 28, 2019 at 5:15

2 Answers


You can use com.databricks.spark.csv to read CSV files. Please find sample code below.

import org.apache.spark.sql.SparkSession

object SparkCSVTest extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("File_Streaming")
    .getOrCreate()

  val df = spark.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "|")
    .option("inferSchema", "false")
    .load("C:/Users/KZAPAGOL/Desktop/CSV/csvSample.csv")

  df.show()

}

CSV file used:

    A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|AA|BB|CC|DD|EE|FF|GG|HH|II|JJ|KK|LL|MM|NN|OO|PP|QQ|RR|SS|TT|UU|VV|WW|XX|YY|ZZ|TGHJ|HG|EEE|ASD|EFFDCLDT|QSAS|WWW|DATIME|JOBNM|VFDCXS|REWE|XCVVCX|ASDFF
QW|8|2344|H02|1002|              |1|2019-01-20|9999-12-31|  |EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |IN|9|1234444| |        |        |10|QQ|8|BMX10290M|EWR|   |.000000000|00|M |2027-01-20|2027-01-20| |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|  |          |          |      |RE|WW|  |RQ|   |   |   |        |     |        |  | |1901-01-01|0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:34.823000000|        |

With Header:

+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
|  A|  B|   C|  D|   E|             F|  G|         H|         I|  J|  K|         L|         M|         N|                   O|  P|  Q|      R|  S|       T|       U|  V|  W|  X|        Y|  Z| AA|        BB| CC| DD|        EE|        FF| GG| HH| II| JJ| KK|        LL|        MM|     NN| OO|        PP|        QQ|    RR| SS| TT| UU| VV| WW| XX| YY|      ZZ| TGHJ|      HG|EEE|ASD|  EFFDCLDT|QSAS|          WWW|             DATIME|     JOBNM|  VFDCXS|                REWE|  XCVVCX|ASDFF|
+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
| QW|  8|2344|H02|1002|              |  1|2019-01-20|9999-12-31|   | EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            | IN|  9|1234444|   |        |        | 10| QQ|  8|BMX10290M|EWR|   |.000000000| 00| M |2027-01-20|2027-01-20|   |.00|.00|.00|.00|2014-01-20|1901-01-01|3423.25|   |          |          |      | RE| WW|   | RQ|   |   |   |        |     |        |   |   |1901-01-01|   0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...|        | null|
+---+---+----+---+----+--------------+---+----------+----------+---+---+----------+----------+----------+--------------------+---+---+-------+---+--------+--------+---+---+---+---------+---+---+----------+---+---+----------+----------+---+---+---+---+---+----------+----------+-------+---+----------+----------+------+---+---+---+---+---+---+---+--------+-----+--------+---+---+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+

Without Header:

+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
|_c0|_c1| _c2|_c3| _c4|           _c5|_c6|       _c7|       _c8|_c9|_c10|      _c11|      _c12|      _c13|                _c14|_c15|_c16|   _c17|_c18|    _c19|    _c20|_c21|_c22|_c23|     _c24|_c25|_c26|      _c27|_c28|_c29|      _c30|      _c31|_c32|_c33|_c34|_c35|_c36|      _c37|      _c38|   _c39|_c40|      _c41|      _c42|  _c43|_c44|_c45|_c46|_c47|_c48|_c49|_c50|    _c51| _c52|    _c53|_c54|_c55|      _c56|_c57|         _c58|               _c59|      _c60|    _c61|                _c62|    _c63| _c64|
+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+
|  A|  B|   C|  D|   E|             F|  G|         H|         I|  J|   K|         L|         M|         N|                   O|   P|   Q|      R|   S|       T|       U|   V|   W|   X|        Y|   Z|  AA|        BB|  CC|  DD|        EE|        FF|  GG|  HH|  II|  JJ|  KK|        LL|        MM|     NN|  OO|        PP|        QQ|    RR|  SS|  TT|  UU|  VV|  WW|  XX|  YY|      ZZ| TGHJ|      HG| EEE| ASD|  EFFDCLDT|QSAS|          WWW|             DATIME|     JOBNM|  VFDCXS|                REWE|  XCVVCX|ASDFF|
| QW|  8|2344|H02|1002|              |  1|2019-01-20|9999-12-31|   |  EE|2014-01-20|2014-01-20|2014-01-20|CNB22345            |  IN|   9|1234444|    |        |        |  10|  QQ|   8|BMX10290M| EWR|    |.000000000|  00|  M |2027-01-20|2027-01-20|    | .00| .00| .00| .00|2014-01-20|1901-01-01|3423.25|    |          |          |      |  RE|  WW|    |  RQ|    |    |    |        |     |        |    |    |1901-01-01|   0|SED2233345   |2019-01-15 22:10:23|213EDSFDS |78978775|2019-03-23 07:38:...|        | null|
+---+---+----+---+----+--------------+---+----------+----------+---+----+----------+----------+----------+--------------------+----+----+-------+----+--------+--------+----+----+----+---------+----+----+----------+----+----+----------+----------+----+----+----+----+----+----------+----------+-------+----+----------+----------+------+----+----+----+----+----+----+----+--------+-----+--------+----+----+----------+----+-------------+-------------------+----------+--------+--------------------+--------+-----+

build.sbt

    "com.databricks" %% "spark-csv" % "1.5.0",
    "org.apache.spark" %% "spark-core" % "2.2.2",
    "org.apache.spark" %% "spark-sql" % "2.2.2"


Hope it helps!


6 Comments

  • I have already tried the same thing. The only difference is the inferSchema part.
  • And the same code works flawlessly for the other CSV files.
  • @DasarathyDR Does my answer work for you? Are you still facing any issues?
  • I see I have some discrepancy in my input business data itself. Your answer is right; only one particular file read is failing for me, and all 3 other files work seamlessly.
  • Could you please accept the answer if it is right? I hope it helps! Thanks.

Just found the exact issue.

Actually, 10 of the 13 CSV files I was trying to read were UTF-8 encoded, and those were not causing the issue. The other 3 files were UCS-2 encoded, and these were the ones causing the CSV read to fail with the above-mentioned error.

UTF-8 ==> Unicode Transformation Format Encoding.
UCS-2 ==> Universal Coded Character Set Encoding.
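
To check which of the files are UCS-2, sniffing the first bytes is usually enough. A minimal sketch in plain Java I/O, assuming the UCS-2 files carry a byte-order mark (looksLikeUcs2 is a hypothetical helper, not a library function):

    import java.io.FileInputStream

    // UCS-2/UTF-16 files typically start with a byte-order mark of FF FE
    // (little-endian) or FE FF (big-endian); UTF-8 files either have no BOM
    // or start with EF BB BF.
    def looksLikeUcs2(path: String): Boolean = {
      val in = new FileInputStream(path)
      try {
        val b0 = in.read()
        val b1 = in.read()
        (b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)
      } finally in.close()
    }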

From this, I learnt that the Databricks CSV reader supports UTF encodings but runs into issues with UCS encodings. Hence, I saved the files in UTF-8 format and tried reading them again. It worked like a charm.
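
For reference, the one-off re-encode can also be scripted with plain java.nio. A sketch, assuming the UCS-2 files carry a BOM so the UTF-16 decoder picks the right byte order (both paths are hypothetical):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    val inPath  = Paths.get("report1_ucs2.csv") // hypothetical UCS-2 input
    val outPath = Paths.get("report1_utf8.csv") // hypothetical UTF-8 output

    // Decode as UTF-16 (BOM-aware, which covers BOM'd UCS-2 files),
    // then write the text back out as UTF-8 for Spark to read.
    val text = new String(Files.readAllBytes(inPath), StandardCharsets.UTF_16)
    Files.write(outPath, text.getBytes(StandardCharsets.UTF_8))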

Feel free to add more insights on this, if any.

1 Comment

You can use the charset option to read files with other encodings. For example, if you want to read a Shift-JIS encoded file, you can set the charset option as .option("charset", "Shift-JIS").
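
For the UCS-2 case above, that would look roughly like the sketch below, assuming the files are little-endian UTF-16 with a BOM; depending on the Spark version, non-UTF-8 encodings may also require the multiLine option or a prior re-encode to UTF-8:

    val df = spark.read
      .option("header", "true")
      .option("delimiter", "|")
      .option("charset", "UTF-16LE") // assumption: files are little-endian
      .csv(filePath)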
