 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data")
 inputfile: org.apache.spark.sql.DataFrame = [a: string, b: bigint, c: boolean]
 val outputfile = inputfile.groupBy($"a", $"b").max("c")

The above code fails because c is a boolean column and numeric aggregates like max cannot be applied to booleans. Is there a function in Spark that converts true to 1 and false to 0 for an entire column of a Spark DataFrame?

I tried the following (source: How to change column types in Spark SQL's DataFrame?, where toInt is a UDF defined in the linked answer):

 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data")
 val tempfile = inputfile.select("a", "b", "c").withColumn("c", toInt(inputfile("c")))
 val outputfile = tempfile.groupBy($"a", $"b").max("c")

The following question, Casting a new derived column in a DataFrame from boolean to integer, answers this for PySpark, but I need a solution specifically for Scala.

Any help is appreciated.

3 Answers


You don't need a UDF for this. To convert boolean values to integers, typecast the column to Int:

val df2 = df1
  .withColumn("boolAsInt", $"bool".cast("Int"))


implicit def bool2int(b: Boolean): Int = if (b) 1 else 0

scala> false:Int
res4: Int = 0

scala> true:Int
res5: Int = 1

scala> val b=true
b: Boolean = true


scala> 2*b+1
res2: Int = 3

The implicit conversion is what lets the compiler treat true as 1 in the last example (2*b+1 evaluates to 3). To apply the same conversion to a DataFrame column, register the function as a UDF:

import org.apache.spark.sql.functions.udf

val bool2int_udf = udf(bool2int _)

val tempfile = inputfile.select("a", "b", "c").withColumn("c", bool2int_udf($"c"))

2 Comments

Hi @Achyuth, thanks for looking into this problem, but this does not work as written: bool2int is a function that takes a Boolean as its argument, whereas I need one that takes an org.apache.spark.sql.Column.
UDFs should be a last resort since they are less efficient. Casting is the better option here.

The code below worked for me. @Achyuth's answer provided part of the solution; then, taking ideas from this question: Applying function to Spark Dataframe Column, I was able to apply the function from Achyuth's answer to the full column of the DataFrame using a UDF. Here is the full code.

 import org.apache.spark.sql.functions.udf

 implicit def bool2int(b: Boolean): Int = if (b) 1 else 0
 val bool2int_udf = udf(bool2int _)
 val inputfile = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") 
        .option("inferSchema", "true") 
        .option("delimiter", "\t")
        .load("data") 
 val tempfile = inputfile.select("a", "b", "c").withColumn("c", bool2int_udf($"c"))
 val outputfile = tempfile.groupBy($"a", $"b").max("c")
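
For comparison, the same result can be had without a UDF by using the cast from the top answer (a sketch assuming the same inputfile as above):

 val outputfile = inputfile
        .select("a", "b", "c")
        .withColumn("c", $"c".cast("Int")) // built-in cast instead of the UDF
        .groupBy($"a", $"b")
        .max("c")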

