 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data")
 inputfile: org.apache.spark.sql.DataFrame = [a: string, b: bigint, c: boolean]
 val outputfile = inputfile.groupBy($"a", $"b").max("c")

The above code fails because c is a boolean column and numeric aggregates like max cannot be applied to booleans. Is there a function in Spark that converts true to 1 and false to 0 for an entire column of a Spark DataFrame?

I tried the following (source: How to change column types in Spark SQL's DataFrame?, where toInt is a UDF defined in the linked answer):

 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data")
 val tempfile = inputfile.select("a", "b", "c").withColumn("c", toInt(inputfile("c")))
 val outputfile = tempfile.groupBy($"a", $"b").max("c")

The following question, Casting a new derived column in a DataFrame from boolean to integer, answers this for PySpark, but I need a solution specifically for Scala.

Any help is appreciated.

3 Answers


You don't need a UDF for this. To convert boolean values to integers, typecast the column to Int:

val df2 = df1
  .withColumn("boolAsInt", $"bool".cast("Int"))


implicit def bool2int(b: Boolean): Int = if (b) 1 else 0

scala> false:Int
res4: Int = 0

scala> true:Int
res5: Int = 1

scala> val b=true
b: Boolean = true


scala> 2*b+1
res2: Int = 3

The implicit conversion is what lets the compiler treat true as 1 in the last example (2*b+1 evaluates to 3). To apply the same conversion to a DataFrame column, register the function as a UDF:

import org.apache.spark.sql.functions.udf

val bool2int_udf = udf(bool2int _)

val tempfile = inputfile.select("a", "b", "c").withColumn("c", bool2int_udf($"c"))

2 Comments

Hi @Achyuth, thanks for looking into this problem, but this does not work as written: bool2int is a function that takes a Boolean as its argument, whereas I need one that takes an org.apache.spark.sql.Column.
UDFs should be a last resort since they are less efficient. Casting is the better option here.

The code below worked for me. @Achyuth's answer provided part of the solution; then, taking ideas from this question: Applying function to Spark Dataframe Column, I was able to apply the function from Achyuth's answer to the full column of the DataFrame using a UDF. Here is the full code.

 import org.apache.spark.sql.functions.udf

 implicit def bool2int(b: Boolean): Int = if (b) 1 else 0
 val bool2int_udf = udf(bool2int _)
 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data") 
 val tempfile = inputfile.select("a", "b", "c").withColumn("c", bool2int_udf($"c"))
 val outputfile = tempfile.groupBy($"a", $"b").max("c")
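
For comparison, the same result can be had without a UDF by using the cast from the top answer (a sketch assuming the same inputfile as above):

 val outputfile = inputfile
        .select("a", "b", "c")
        .withColumn("c", $"c".cast("Int")) // built-in cast instead of the UDF
        .groupBy($"a", $"b")
        .max("c")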

