
I am currently using com.crealytics.spark.excel to read an Excel file, but this library does not let me write the Dataset back out to an Excel file.

This link says that the HadoopOffice library (org.zuinnote.spark.office.excel) can both read and write Excel files.

Please help me write a Dataset object to an Excel file in Spark with Java.

  • What's wrong with CSV? Commented Jun 28, 2017 at 10:35
  • CSV files don't handle some content well (e.g. if a single cell contains a comma or other special characters, that cell can be split into multiple cells). Hence I need to write to an Excel file; is there any approach to do the same using Apache Spark? Commented Jun 28, 2017 at 10:52
  • Using existing tools, you could save the data to Hive, and then use Hue to download/generate an Excel file. You can also check what logic in Hue actually provides this Excel file. Commented Jun 28, 2017 at 11:23
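As an aside on the CSV concern raised above: well-formed CSV handles embedded commas by quoting, and Spark's CSV writer applies this quoting automatically. Below is a minimal, self-contained sketch of the RFC 4180 quoting rule itself; the class and method names (`CsvQuote`, `quoteField`) are hypothetical illustrations, not part of Spark's API.

```java
// Hypothetical sketch of the RFC 4180 quoting rule that CSV writers
// (including Spark's built-in one) apply: a field containing the delimiter,
// a double quote, or a newline is wrapped in quotes, and any embedded
// double quotes are doubled.
public class CsvQuote {
    static String quoteField(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(quoteField("plain"));      // prints: plain
        System.out.println(quoteField("a,b"));        // prints: "a,b"
        System.out.println(quoteField("say \"hi\"")); // prints: "say ""hi"""
    }
}
```

With this rule a cell like `a,b` survives a round trip as a single cell, so quoting (rather than switching formats) is one answer to the comma problem, though Excel output may still be wanted for other reasons.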

1 Answer


You can use org.zuinnote.spark.office.excel to both read and write Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read an Excel file into a Dataset and try to write it to another Excel file. Please see the issue and a Scala workaround at https://github.com/ZuInnoTe/hadoopoffice/issues/12.

I have written a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. Please see if this helps you.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkExcel {
    public static void main(String[] args) {
        //spark session
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkExcel")
                .master("local[*]")
                .getOrCreate();

        //Read
        Dataset<Row> df = spark
                .read()
                .format("org.zuinnote.spark.office.excel")
                .option("read.locale.bcp47", "de")
                .load("c:\\temp\\test1.xlsx");

        //Print
        df.show();
        df.printSchema();

        //FlatMap function: flattens the nested spreadsheet rows of each
        //row into plain String[] records (the workaround from the issue above)
        FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
            @Override
            public Iterator<String[]> call(Row row) throws Exception {
                ArrayList<String[]> rowList = new ArrayList<String[]>();
                List<Row> spreadSheetRows = row.getList(0);
                for (Row srow : spreadSheetRows) {
                    ArrayList<String> arr = new ArrayList<String>();
                    arr.add(srow.getString(0));
                    arr.add(srow.getString(1));
                    arr.add(srow.getString(2));
                    arr.add(srow.getString(3));
                    arr.add(srow.getString(4));
                    rowList.add(arr.toArray(new String[] {}));
                }
                return rowList.iterator();
            }
        };

        //Apply flatMap function
        Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());

        //Write
        df2.write()
           .mode(SaveMode.Overwrite)
           .format("org.zuinnote.spark.office.excel")
           .option("write.locale.bcp47", "de")
           .save("c:\\temp\\test2.xlsx");

    }
}

I have tested this code with Java 8 and Spark 2.1.0. I am using Maven and added the dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3.
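For reference, the corresponding Maven dependency (coordinates taken from the mvnrepository link above) would look like this in your pom.xml:

```xml
<dependency>
    <groupId>com.github.zuinnote</groupId>
    <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
    <version>1.0.3</version>
</dependency>
```

Note the `_2.11` suffix, which must match the Scala version your Spark build uses.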


3 Comments

We are not able to add header information to the Excel file we are writing. Are there any options to write headers to the output Excel file? Please help me do the same.
How are you creating the Dataset - by reading an existing file, or as a new Dataset with a schema? If you are reading an existing file, then the first row could be the header. Can you please post a separate question with details and sample code?
I just want to notify you that the current HadoopOffice library, 1.0.4, supports reading/writing of headers, as well as reading a file into a DataFrame with simple datatypes (cf. github.com/ZuInnoTe/spark-hadoopoffice-ds). Other notable things in 1.0.4 are support for templates when writing (to include diagrams etc.) and reading/writing in low-footprint mode (with a low memory/CPU footprint) based on the Event and Streaming APIs of Apache POI.
