
I am currently using com.crealytics.spark.excel to read an Excel file, but this library does not let me write the Dataset back out to an Excel file.

This link says that the HadoopOffice library (org.zuinnote.spark.office.excel) can both read and write Excel files.

Please help me write a Dataset object to an Excel file in Spark with Java.

  • What's wrong with CSV? Commented Jun 28, 2017 at 10:35
  • CSV files don't handle some content well (e.g. if a single cell contains a comma or other special characters, that cell can be split into multiple cells). Hence I need to write to an Excel file; is there any approach to do the same using Apache Spark? Commented Jun 28, 2017 at 10:52
  • Using existing tools, you could save the data to Hive, and then use Hue to download/generate an Excel file. You can also check what logic in Hue actually provides this Excel file. Commented Jun 28, 2017 at 11:23
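As an aside on the CSV concern raised above: well-formed CSV handles embedded commas by quoting, and Spark's CSV writer applies this quoting automatically. Below is a minimal, self-contained sketch of the RFC 4180 quoting rule itself; the class and method names (`CsvQuote`, `quoteField`) are hypothetical illustrations, not part of Spark's API.

```java
// Hypothetical sketch of the RFC 4180 quoting rule that CSV writers
// (including Spark's built-in one) apply: a field containing the delimiter,
// a double quote, or a newline is wrapped in quotes, and any embedded
// double quotes are doubled.
public class CsvQuote {
    static String quoteField(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(quoteField("plain"));      // prints: plain
        System.out.println(quoteField("a,b"));        // prints: "a,b"
        System.out.println(quoteField("say \"hi\"")); // prints: "say ""hi"""
    }
}
```

With this rule a cell like `a,b` survives a round trip as a single cell, so quoting (rather than switching formats) is one answer to the comma problem, though Excel output may still be wanted for other reasons.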

1 Answer


You can use org.zuinnote.spark.office.excel to both read and write Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read an Excel file into a Dataset and try to write it to another Excel file. Please see the issue and a Scala workaround at https://github.com/ZuInnoTe/hadoopoffice/issues/12.

I have written a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. Please see if this helps you.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkExcel {
    public static void main(String[] args) {
        //spark session
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkExcel")
                .master("local[*]")
                .getOrCreate();

        //Read
        Dataset<Row> df = spark
                .read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.locale.bcp47", "de")
                .load("c:\\temp\\test1.xlsx");

        //Print
        df.show();
        df.printSchema();

        //FlatMap function: flattens the nested spreadsheet rows of each
        //row into plain String[] records (the workaround from the issue above)
        FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
            @Override
            public Iterator<String[]> call(Row row) throws Exception {
                ArrayList<String[]> rowList = new ArrayList<String[]>();
                List<Row> spreadSheetRows = row.getList(0);
                for (Row srow : spreadSheetRows) {
                    ArrayList<String> arr = new ArrayList<String>();
                    arr.add(srow.getString(0));
                    arr.add(srow.getString(1));
                    arr.add(srow.getString(2));
                    arr.add(srow.getString(3));
                    arr.add(srow.getString(4));
                    rowList.add(arr.toArray(new String[] {}));
                }
                return rowList.iterator();
            }
        };

        //Apply flatMap function
        Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());

        //Write
        df2.write()
           .mode(SaveMode.Overwrite)
           .format("org.zuinnote.spark.office.excel")
           .option("write.locale.bcp47", "de")
           .save("c:\\temp\\test2.xlsx");

    }
}

I have tested this code with Java 8 and Spark 2.1.0. I am using Maven and added the dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3.
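For reference, the corresponding Maven dependency (coordinates taken from the mvnrepository link above) would look like this in your pom.xml:

```xml
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.0.3</version>
</dependency>
```

Note the `_2.11` suffix, which must match the Scala version your Spark build uses.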


3 Comments

We are not able to add header information to the Excel file we are writing. Are there any options to write headers to the output Excel file? Please help me do the same.
How are you creating the Dataset - by reading an existing file, or as a new Dataset with a schema? If you are reading an existing file, then the first row could be the header. Can you please post a separate question with details and sample code?
I just want to notify you that the current HadoopOffice library, 1.0.4, supports reading/writing of headers, as well as reading a file into a DataFrame with simple datatypes (cf. github.com/ZuInnoTe/spark-hadoopoffice-ds). Other notable things in 1.0.4 are support for templates when writing (to include diagrams etc.) and reading/writing in low-footprint mode (with a low memory/CPU footprint) based on the Event and Streaming APIs of Apache POI.
