
I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala. I need to create a dataframe from the data read from the Excel file and apply/write SQL queries on top of it to do some analysis. The Excel file has column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)" etc. which have spaces in the headers themselves, and these spaces are causing me problems when applying any SQL queries on the loaded dataframe.

I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like this:

val empFile = "C:\\EmpDatasets.xlsx"

val employeesDF = sc.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

employeesDF.createOrReplaceTempView("EMP")

I want to apply group by and other aggregate functions on these columns, and I am facing issues like the one below. My requirement is to apply group by on the time_spend_company column and get a count of it.

val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show

I need help with:

  1. Is there a better way to load the Excel file and assign custom column names to it while creating a dataframe?
  2. How do I write SQL queries for column names which have spaces in them?

Note: I need to read it as an Excel file only; I can't convert it into CSV or any other file format.

4 Answers


For version 0.13.5 you will need a different set of parameters:

def readExcel(file: String): DataFrame = {
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default: None. If set, uses a streaming reader which can help with big files
      .load(file)
  }

Maven dependency:

<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
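If the build uses sbt rather than Maven, the equivalent dependency line (assuming the same Scala 2.11 binary version as above) would be:

```scala
// sbt equivalent of the Maven coordinates above (Scala 2.11 assumed)
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```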



Answer to your question 2: instead of ', you need to use ` (backticks) around column names that contain spaces. Try the query below; it will work:

val expLevel = sc.sqlContext.sql("Select `time_spend_company (Years)` as `Years_spent_in_company`,count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`")
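As a small sketch of the quoting rule (the quoteCol helper below is hypothetical, not part of Spark), a column name containing spaces can be wrapped in backticks before being spliced into a SQL string; Spark SQL treats `...` as a quoted identifier, so spaces and parentheses are allowed inside it:

```scala
// Hypothetical helper: wrap a raw column name in backticks for use in Spark SQL.
def quoteCol(name: String): String = s"`$name`"

val years = quoteCol("time_spend_company (Years)")
// Build the same kind of query as above, with the quoted identifier spliced in
val query =
  s"SELECT $years AS Years_spent_in_company, count(1) FROM EMP WHERE left_company = 1 GROUP BY $years"
println(query)
```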

Question 1: Loading Excel using "com.crealytics.spark.excel" is fine; I am also using it, though there are other options too. For assigning different column names, you can use a StructType to define the schema and impose it while loading the data into the dataframe, e.g.:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val newSchema = StructType(
    List(StructField("a", IntegerType, nullable = true),
         StructField("b", IntegerType, nullable = true),
         StructField("c", IntegerType, nullable = true),
         StructField("d", IntegerType, nullable = true))
  )

val employeesDF = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

The first four columns can now be accessed as a, b, c and d. Run the query below and it will work with the new column names.

sc.sqlContext.sql("select a,b,c,d from EMP").show()

7 Comments

I tried defining the schema as per your suggestion and loaded the dataset, but I am getting the exception like below:- Exception in thread "main" org.apache.spark.sql.AnalysisException: com.crealytics.spark.excel does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) at EmpPerformanceAnalysis$.main(EmpPerformanceAnalysis.scala:58)
val empSchema = StructType( List(StructField("satisfaction_level", FloatType, nullable = true), StructField("last_evaluation", FloatType, nullable = true), .. StructField("salary", StringType, nullable = true)) ) val employeesDF = sc.sqlContext.read.schema(empSchema) .format("com.crealytics.spark.excel") .option("sheetName", "Sheet1") .option("useHeader", "true") .option("treatEmptyValuesAsNulls", "false") .option("inferSchema", "false") .option("location", empFile) .option("addColorColumns", "False") .load()
@Krishnan I am running the same on Spark 2.1.0 and Scala 2.11. I am not getting the exception that you have mentioned. For the code you have posted, I am getting the exception "java.lang.IllegalArgumentException: Parameter "path" is missing in options.". Then I added the path option in place of the location option and it works fine.
This is what i am running : val employeesDF2 = spark.sqlContext.read.schema(empSchema).format("com.crealytics.spark.excel").option("sheetName", "Sheet1").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("location", "csvdatatest/Book1.xlsx").option("addColorColumns", "False").load("csvdatatest/Book1.xlsx")
printschema result is : scala> employeesDF2.printSchema root |-- satisfaction_level: float (nullable = true) |-- last_evaluation: float (nullable = true) |-- salary: string (nullable = true)
  1. Spark has good support for working with CSVs. So if your Excel file has just one sheet, you can export it to CSV (e.g. as EmpDatasets.csv) and work with that file instead.

Once you have your file as CSV, you can read it with spark.read.csv(pathToCSV) and supply many options, e.g. to read/skip the header, or to supply a schema for the dataset with spark.read.schema(schema).csv(pathToCSV).

Here the schema can be created manually with StructType/StructField, or extracted from a case class using Spark SQL Encoders: Encoders.product[case_class_name].schema

  2. You can remove spaces from the column names like this:

val employeesDFColumns = employeesDF.columns.map(x => col(x).alias(x.replaceAll(" ", "")))

And apply these new column names on the dataframe.

val employeesDFRenamed = employeesDF.select(employeesDFColumns: _*)
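The renaming rule itself is plain string manipulation and can be checked without Spark; a minimal sketch, applying the same replaceAll to the header names from the question:

```scala
// Sketch: applying the space-stripping rule to the (assumed) header names
val headers = Seq("time_spend_company (Years)", "average_monthly_hours (hours)")
val cleaned = headers.map(_.replaceAll(" ", ""))
println(cleaned) // List(time_spend_company(Years), average_monthly_hours(hours))
```

Note that the parentheses survive this rule, so such columns would still need backtick quoting in raw SQL.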

3 Comments

I need to read it as an Excel file only; I can't convert it into CSV or any other file format. Can you please suggest anything for reading/parsing Excel files?
The library you are using for Excel seems perfectly alright, and it is one of the recommended ways to read Excel. What problems are you facing in using it?
Please refer to my question: I need to assign custom column names and create a dataframe, and I need to know how to write SQL queries for column names which have spaces in them.

We can save PySpark data to an Excel file using the pandas library, which provides functionality to write data in the Excel format.

from pyspark.sql import SparkSession
import pandas as pd

# Create a Spark session
spark = SparkSession.builder.appName("PySpark to Excel").getOrCreate()

# Create a PySpark dataframe
df = spark.createDataFrame([(1, "John Doe", 30), (2, "Jane Doe", 35), (3, "Jim Brown", 40)], 
                           ["id", "name", "age"])

# Convert the PySpark dataframe to a pandas dataframe
pandas_df = df.toPandas()

# Write the pandas dataframe to an Excel file (writing .xlsx requires the openpyxl package)
pandas_df.to_excel("output.xlsx", index=False)

# Stop the Spark session
spark.stop()

