I have a large Excel (xlsx and xls) file with multiple sheets and I need to convert it to an RDD or DataFrame so that it can be joined to another DataFrame later. I was thinking of using Apache POI to save it as a CSV and then reading the CSV into a DataFrame. But if there is any library or API that can help in this process, it would be easier. Any help is highly appreciated.
- Check this answer for newbies with steps: stackoverflow.com/a/47721326/2112382 – vijayraj34, Dec 8, 2017
- Actually, I have to store a Spark DataFrame in an Excel file format with a few columns as read-only. Can you guide me regarding the same? – kanishk kashyap, Apr 28, 2022
5 Answers
The solution to your problem is to use the Spark Excel dependency in your project.
Spark Excel has flexible options to play with.
I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:
def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "false")
  .load()
val data = readExcel("path to your excel file")
data.show(false)
You can pass the sheet name as an option if your Excel file has multiple sheets:
.option("sheetName", "Sheet2")
I hope it's helpful.
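Since the question also wants to join the result with another DataFrame later, here is a minimal sketch of that step; otherDf, its source path, and the join column "id" are hypothetical placeholders, not from the question:
val excelDf = readExcel("path to your excel file")
val otherDf = sqlContext.read.parquet("path to your other data") // any other source works the same way
val joined = excelDf.join(otherDf, Seq("id"), "inner")           // join on a shared column named "id"
joined.show(false)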
4 Comments
spark.read.format("com.crealytics.spark.excel").option("location","/home/mylocation/myfile.xlsx").load() but got java.lang.IllegalArgumentException: Parameter "path" is missing in options.'sheetName' doesn't work anymore. You have to use 'dataAddress' - github.com/crealytics/spark-excel/issues/118Here are read and write examples to read from and write into excel with full set of options...
Source spark-excel from crealytics
Scala API Spark 2.0+:
Create a DataFrame from an Excel file
import org.apache.spark.sql._
val spark: SparkSession = ???
val df = spark.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily") // Required
.option("useHeader", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
Write a DataFrame to an Excel file
df.write
.format("com.crealytics.spark.excel")
.option("sheetName", "Daily")
.option("useHeader", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("overwrite")
.save("Worktime2.xlsx")
Note: Instead of sheet1 or sheet2 you can also use the actual sheet names; in the example given above, Daily is the sheet name.
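As noted in the comments above, newer spark-excel versions replace sheetName with the dataAddress option (it also appears in the PySpark example further down). A minimal sketch, reusing the Worktime.xlsx file from above; check which option names your spark-excel version supports:
val dfDaily = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'Daily'!A1") // sheet name plus the top-left cell of the data
  .option("header", "true")            // called "useHeader" in some older versions
  .load("Worktime.xlsx")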
- If you want to use it from spark shell...
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1
- Dependencies need to be added (in the case of Maven, etc.):
groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.13.1
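If your build uses sbt instead of Maven, the same coordinates shown above would look roughly like this in build.sbt:
// build.sbt: same artifact and version as the Maven coordinates above
libraryDependencies += "com.crealytics" % "spark-excel_2.11" % "0.13.1"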
Tip: This is a very useful approach, particularly for writing Maven test cases, where you can place Excel sheets with sample data in the
src/main/resources folder and access them in your unit tests (Scala/Java), which create DataFrame[s] out of the Excel sheets (see the sketch after this tip).
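A minimal sketch of such a test, assuming ScalaTest is on the test classpath and a file named sample.xlsx with a sheet called Daily has been placed under src/main/resources (the class, file, and sheet names here are hypothetical):
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical unit test: reads an Excel file bundled as a resource into a DataFrame.
class ExcelReadSuite extends AnyFunSuite {
  test("creates a DataFrame from a resource Excel sheet") {
    val spark = SparkSession.builder().master("local[*]").appName("excel-test").getOrCreate()
    val path = getClass.getResource("/sample.xlsx").getPath
    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("sheetName", "Daily")
      .option("useHeader", "true")
      .load(path)
    assert(df.columns.nonEmpty)
    spark.stop()
  }
}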
- Another option you could consider is spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library. This Spark datasource assumes at least Spark 2.0.1. However, the HadoopOffice library can also be used directly from Spark 1.x. Currently this datasource supports the following formats of the HadoopOffice library:
Excel Datasource format: org.zuinnote.spark.office.excel
Loading and saving of old Excel (.xls) and new Excel (.xlsx). This datasource is available on Spark-packages.org and on Maven Central.
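A minimal read sketch with this datasource, using the format identifier given above; the exact option names and the shape of the returned rows should be checked against the HadoopOffice wiki:
// Hypothetical example: read an .xlsx file with the HadoopOffice Excel datasource.
val officeDf = spark.read
  .format("org.zuinnote.spark.office.excel")
  .load("Worktime.xlsx")
officeDf.show(false)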
PySpark users sample:
Since we don't have Maven-style dependency management in PySpark, we specify the package we want in the SparkSession configuration; it will be downloaded and placed in the Ivy cache.
Here I am creating a sample DataFrame and saving it as an Excel file:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
def main(output_path):
    spark = SparkSession.builder \
        .appName("Excel Writer") \
        .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
        .getOrCreate()

    schema = StructType([
        StructField("ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Country", StringType(), True)
    ])

    data = [
        (1, "Ram Ghadiyaram", 47, "USA"),
        (2, "Adam", 31, "UK"),
        (3, "Arindam", 25, "Canada"),
        (4, "Rachel Zane", 29, "USA")
    ]

    print("Creating a sample DataFrame...")
    df = spark.createDataFrame(data, schema)

    print("Sample DataFrame:")
    df.show()

    print("Writing DataFrame to Excel file...")
    df.write.format("com.crealytics.spark.excel") \
        .option("dataAddress", "'Sheet1'!A1") \
        .option("header", "true") \
        .option("addColorColumns", "true") \
        .mode("overwrite") \
        .save(output_path)

    print(f"Excel file written to {output_path}")
    spark.stop()


if __name__ == "__main__":
    output_file = "sample_output.xlsx"
    main(output_file)
Log:
C:\Users\ramgh\AppData\Local\Microsoft\WindowsApps\python3.8.exe C:\Users\ramgh\Downloads\spark-3.1.2-bin-hadoop3.2\python_pyspark\Pyspark_excel.py
:: loading settings :: url = jar:file:/C:/Users/ramgh/Downloads/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\ramgh\.ivy2\cache
The jars for the packages stored in: C:\Users\ramgh\.ivy2\jars
com.crealytics#spark-excel_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dac60d07-afdd-430f-beb2-841d2c79578a;1.0
confs: [default]
found com.crealytics#spark-excel_2.12;0.13.5 in central
found org.apache.poi#poi;4.1.2 in central
found commons-codec#commons-codec;1.13 in central
found org.apache.commons#commons-collections4;4.4 in central
found org.apache.commons#commons-math3;3.6.1 in central
found com.zaxxer#SparseBitSet;1.2 in central
found org.apache.poi#poi-ooxml;4.1.2 in central
found org.apache.poi#poi-ooxml-schemas;4.1.2 in central
found org.apache.xmlbeans#xmlbeans;3.1.0 in central
found com.github.virtuald#curvesapi;1.06 in central
found com.norbitltd#spoiwo_2.12;1.7.0 in central
found org.scala-lang.modules#scala-xml_2.12;1.2.0 in local-m2-cache
found com.github.pjfanning#excel-streaming-reader;2.3.4 in central
found com.github.pjfanning#poi-shared-strings;1.0.4 in central
found com.h2database#h2;1.4.200 in central
found org.apache.commons#commons-text;1.8 in central
found org.apache.commons#commons-lang3;3.9 in local-m2-cache
found xml-apis#xml-apis;1.4.01 in central
found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
found org.apache.commons#commons-compress;1.20 in central
found com.fasterxml.jackson.core#jackson-core;2.8.8 in central
:: resolution report :: resolve 5432ms :: artifacts dl 545ms
:: modules in use:
com.crealytics#spark-excel_2.12;0.13.5 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.8.8 from central in [default]
com.github.pjfanning#excel-streaming-reader;2.3.4 from central in [default]
com.github.pjfanning#poi-shared-strings;1.0.4 from central in [default]
com.github.virtuald#curvesapi;1.06 from central in [default]
com.h2database#h2;1.4.200 from central in [default]
com.norbitltd#spoiwo_2.12;1.7.0 from central in [default]
com.zaxxer#SparseBitSet;1.2 from central in [default]
commons-codec#commons-codec;1.13 from central in [default]
org.apache.commons#commons-collections4;4.4 from central in [default]
org.apache.commons#commons-compress;1.20 from central in [default]
org.apache.commons#commons-lang3;3.9 from local-m2-cache in [default]
org.apache.commons#commons-math3;3.6.1 from central in [default]
org.apache.commons#commons-text;1.8 from central in [default]
org.apache.poi#poi;4.1.2 from central in [default]
org.apache.poi#poi-ooxml;4.1.2 from central in [default]
org.apache.poi#poi-ooxml-schemas;4.1.2 from central in [default]
org.apache.xmlbeans#xmlbeans;3.1.0 from central in [default]
org.scala-lang.modules#scala-xml_2.12;1.2.0 from local-m2-cache in [default]
org.slf4j#slf4j-api;1.7.30 from local-m2-cache in [default]
xml-apis#xml-apis;1.4.01 from central in [default]
:: evicted modules:
org.apache.commons#commons-compress;1.19 by [org.apache.commons#commons-compress;1.20] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 22 | 1 | 1 | 1 || 21 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: ERRORS
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
:: retrieving :: org.apache.spark#spark-submit-parent-dac60d07-afdd-430f-beb2-841d2c79578a
confs: [default]
0 artifacts copied, 21 already retrieved (0kB/73ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/27 14:00:45 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Creating a sample DataFrame...
Sample DataFrame:
+---+--------------+---+-------+
| ID| Name|Age|Country|
+---+--------------+---+-------+
| 1|Ram Ghadiyaram| 47| USA|
| 2| Adam| 31| UK|
| 3| Arindam| 25| Canada|
| 4| Rachel Zane| 29| USA|
+---+--------------+---+-------+
Writing DataFrame to Excel file...
Excel file written to sample_output.xlsx
2 Comments
spark.read.format("com.crealytics.spark.excel").option("location","/home/mylocation/myfile.xlsx").load() but got java.lang.IllegalArgumentException: Parameter "path" is missing in options.Alternatively, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki), which supports also encrypted Excel documents and linked workbooks, amongst other features. Of course Spark is also supported.
4 Comments
I have used the com.crealytics.spark.excel 0.11 version jar and wrote this in Spark Java; it would be the same in Scala too, you just need to change JavaSparkContext to SparkContext (a Scala transliteration is sketched after the code below).
Dataset<Row> tempTable = new SQLContext(javaSparkContext).read()
.format("com.crealytics.spark.excel")
.option("sheetName", "sheet1")
.option("useHeader", "false") // Required
.option("treatEmptyValuesAsNulls","false") // Optional, default: true
.option("inferSchema", "false") //Optional, default: false
.option("addColorColumns", "false") //Required
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff] .schema(schema)
.schema(schema)
.load("hdfs://localhost:8020/user/tester/my.xlsx");
1 Comment
Hope this helps.
val df_excel = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .load(file_path)

display(df_excel) // display() is a Databricks notebook function; use df_excel.show() elsewhere
