
I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala. I need to create a dataframe from the data read from the Excel file and apply/write SQL queries on top of it to do some analysis. The Excel file has column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)" etc. which have spaces in the headers themselves, and these spaces are causing me problems when applying any SQL queries on the loaded dataframe.

I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like this:

val empFile = "C:\\EmpDatasets.xlsx"

val employeesDF = sc.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

employeesDF.createOrReplaceTempView("EMP")

I want to apply group by and other aggregate functions on these columns, and I am facing issues like the one below. My requirement is to apply group by on the time_spend_company column and get a count of it.

val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show

I need help with:

  1. Is there a better way to load the Excel file and assign custom column names to it while creating a dataframe?
  2. How do I write SQL queries for column names which have spaces in them?

Note: I need to read it as an Excel file only; I can't convert it into CSV or any other file format.

4 Answers


For version 0.13.5 you will need a different set of parameters:

def readExcel(file: String): DataFrame = {
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default: None. If set, uses a streaming reader which can help with big files
      .load(file)
  }

Maven dependency:

<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
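If the build uses sbt rather than Maven, the equivalent dependency line (assuming the same Scala 2.11 binary version as above) would be:

```scala
// sbt equivalent of the Maven coordinates above (Scala 2.11 assumed)
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```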



Answer to your question 2: instead of ', you need to use ` (backticks) around column names that contain spaces. Try the query below; it will work:

val expLevel = sc.sqlContext.sql("Select `time_spend_company (Years)` as `Years_spent_in_company`,count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`")
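As a small sketch of the quoting rule (the quoteCol helper below is hypothetical, not part of Spark), a column name containing spaces can be wrapped in backticks before being spliced into a SQL string; Spark SQL treats `...` as a quoted identifier, so spaces and parentheses are allowed inside it:

```scala
// Hypothetical helper: wrap a raw column name in backticks for use in Spark SQL.
def quoteCol(name: String): String = s"`$name`"

val years = quoteCol("time_spend_company (Years)")
// Build the same kind of query as above, with the quoted identifier spliced in
val query =
  s"SELECT $years AS Years_spent_in_company, count(1) FROM EMP WHERE left_company = 1 GROUP BY $years"
println(query)
```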

Question 1: Loading Excel using "com.crealytics.spark.excel" is fine; I am also using it, though there are other options too. For assigning different column names, you can use a StructType to define the schema and impose it while loading the data into the dataframe, e.g.:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val newSchema = StructType(
    List(StructField("a", IntegerType, nullable = true),
         StructField("b", IntegerType, nullable = true),
         StructField("c", IntegerType, nullable = true),
         StructField("d", IntegerType, nullable = true))
  )

val employeesDF = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

The first four columns can now be accessed as a, b, c and d. Run the query below and it will work with the new column names.

sc.sqlContext.sql("select a,b,c,d from EMP").show()

7 Comments

I tried defining the schema as per your suggestion and loaded the dataset, but I am getting the exception like below:- Exception in thread "main" org.apache.spark.sql.AnalysisException: com.crealytics.spark.excel does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) at EmpPerformanceAnalysis$.main(EmpPerformanceAnalysis.scala:58)
val empSchema = StructType( List(StructField("satisfaction_level", FloatType, nullable = true), StructField("last_evaluation", FloatType, nullable = true), .. StructField("salary", StringType, nullable = true)) ) val employeesDF = sc.sqlContext.read.schema(empSchema) .format("com.crealytics.spark.excel") .option("sheetName", "Sheet1") .option("useHeader", "true") .option("treatEmptyValuesAsNulls", "false") .option("inferSchema", "false") .option("location", empFile) .option("addColorColumns", "False") .load()
@Krishnan I am running the same on Spark 2.1.0 and Scala 2.11. I am not getting the exception that you have mentioned. For the code you have posted, I am getting the exception "java.lang.IllegalArgumentException: Parameter "path" is missing in options.". Then I added the path option in place of the location option and it works fine.
This is what i am running : val employeesDF2 = spark.sqlContext.read.schema(empSchema).format("com.crealytics.spark.excel").option("sheetName", "Sheet1").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("location", "csvdatatest/Book1.xlsx").option("addColorColumns", "False").load("csvdatatest/Book1.xlsx")
printschema result is : scala> employeesDF2.printSchema root |-- satisfaction_level: float (nullable = true) |-- last_evaluation: float (nullable = true) |-- salary: string (nullable = true)
  1. Spark has good support for working with CSVs. So if your Excel file has just one sheet, you can export it to CSV (e.g. as EmpDatasets.csv) and work with that file instead.

Once you have your file as CSV, you can read it with spark.read.csv(pathToCSV) and supply many options, e.g. to read/skip the header, or to supply a schema for the dataset with spark.read.schema(schema).csv(pathToCSV).

Here the schema can be created manually with StructType/StructField, or extracted from a case class using Spark SQL Encoders: Encoders.product[case_class_name].schema

  2. You can remove spaces from the column names like this:

val employeesDFColumns = employeesDF.columns.map(x => col(x).alias(x.replaceAll(" ", "")))

And apply these new column names on the dataframe.

val employeesDFRenamed = employeesDF.select(employeesDFColumns: _*)
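The renaming rule itself is plain string manipulation and can be checked without Spark; a minimal sketch, applying the same replaceAll to the header names from the question:

```scala
// Sketch: applying the space-stripping rule to the (assumed) header names
val headers = Seq("time_spend_company (Years)", "average_monthly_hours (hours)")
val cleaned = headers.map(_.replaceAll(" ", ""))
println(cleaned) // List(time_spend_company(Years), average_monthly_hours(hours))
```

Note that the parentheses survive this rule, so such columns would still need backtick quoting in raw SQL.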

3 Comments

I need to read it as an Excel file only; I can't convert it into CSV or any other file format. Can you please suggest anything for reading/parsing Excel files?
The library you are using for Excel seems perfectly alright, and it is one of the recommended ways to read Excel. What problems are you facing in using it?
Please refer to my question: I need to assign custom column names and create a dataframe, and I need to know how to write SQL queries for column names which have spaces in them.

We can save PySpark data to an Excel file using the pandas library, which provides functionality to write data in the Excel format.

from pyspark.sql import SparkSession
import pandas as pd

# Create a Spark session
spark = SparkSession.builder.appName("PySpark to Excel").getOrCreate()

# Create a PySpark dataframe
df = spark.createDataFrame([(1, "John Doe", 30), (2, "Jane Doe", 35), (3, "Jim Brown", 40)], 
                           ["id", "name", "age"])

# Convert the PySpark dataframe to a pandas dataframe
pandas_df = df.toPandas()

# Write the pandas dataframe to an Excel file (writing .xlsx requires the openpyxl package)
pandas_df.to_excel("output.xlsx", index=False)

# Stop the Spark session
spark.stop()

