
How can we collapse a boolean column to a single row with an OR operation using Scala?

Part 1:

A  true
A  false
B  false
B  false
C  true
B  false
A  true
C  true

Desired Output


B  false
A  true
C  true

A solution I could think of was to group the rows by the first column, filter the true and false rows into separate data frames, drop duplicates, and finally append one data frame (false) to the other (true) while checking whether the letter (e.g. A) already exists in the true data frame.

This solution is quite messy, and I don't know whether it would work for all edge cases. Is there some smart way to do this?
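To make the intended semantics concrete, here is the collapse expressed on plain Scala collections (just a sketch of what I want, not a Spark solution):

```scala
// Plain-Scala sketch (not Spark): collapsing a boolean column per key
// with OR is a groupBy followed by exists over each group.
val rows = List(
  ("A", true), ("A", false),
  ("B", false), ("B", false),
  ("C", true), ("B", false),
  ("A", true), ("C", true))

// For each key, the collapsed value is true iff any row for that key is true
val collapsed: Map[String, Boolean] =
  rows.groupBy(_._1).map { case (key, group) => key -> group.exists(_._2) }

println(collapsed) // A -> true, B -> false, C -> true (map order may vary)
```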

I'm an absolute beginner, any help is appreciated.

Edit: the given answers work for the scenario above but don't work for the following one. Any way to achieve the desired output?

Part 2:

A  true    "Apple"
A  false   ""
B  false   ""
B  false   ""
C  true    "Cat"
C  true    "Cotton"
C  false   ""

Desired Output


B  false []
A  true  ["Apple"]
C  true  ["Cat","Cotton"]

I tried to achieve this by grouping by col1 and col2 and then collapsing col3 using collect_set, i.e.:

  1. Group by 1st column
  2. Collect 2nd column as Set of boolean
  3. Check if there's a single true; if yes, your OR expression will always evaluate to true.

but this leads to losing col3_set altogether.
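For clarity, the Part 2 semantics I'm after, expressed on plain Scala collections (a sketch only, not Spark): group by the key alone, OR the booleans, and collect the distinct non-empty strings.

```scala
// Plain-Scala sketch of Part 2 (not Spark): group by the key only,
// OR the booleans, keep the distinct non-empty strings per key.
val rows = List(
  ("A", true, "Apple"), ("A", false, ""),
  ("B", false, ""), ("B", false, ""),
  ("C", true, "Cat"), ("C", true, "Cotton"), ("C", false, ""))

val collapsed: Map[String, (Boolean, Set[String])] =
  rows.groupBy(_._1).map { case (key, group) =>
    (key, (group.exists(_._2), group.map(_._3).filter(_.nonEmpty).toSet))
  }

println(collapsed("C")) // (true, Set(Cat, Cotton))
```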

2 Answers

  1. Group by 1st column
  2. Collect 2nd column as Set of Boolean
  3. Collect 3rd column as Set of String
  4. Check if there's a single true; if yes, your OR expression will always evaluate to true.
  5. Remove empty string "" from col3_set
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupByAgg {

  def main(args: Array[String]): Unit = {

    // Local SparkSession (the original used a project-specific helper)
    val spark = SparkSession.builder()
      .appName("GroupByAgg")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = List(
      ("A", true, "Apple"),
      ("A", false, ""),
      ("B", false, ""),
      ("B", false, ""),
      ("C", true, "Cat"),
      ("C", true, "Cotton"),
      ("C", false, "")).toDF("Col1", "Col2", "Col3")

    // Group by 1st column
    df.groupBy("Col1")
      // Collect unique values of Col2 and Col3
      .agg(collect_set("Col2").as("Col2_set"), collect_set("Col3").as("Col3_set"))
      // Check if the boolean set contains a single true
      .withColumn("OutputCol2", when(array_contains(col("Col2_set"), true), true)
        .otherwise(false))
      // Remove the empty string "" from the string set (Spark 2.4+)
      .withColumn("OutputCol3", array_remove(col("Col3_set"), lit("")))
      // Alternative without array_remove, using a higher-order function:
      // .withColumn("OutputCol3", expr("filter(Col3_set, x -> x != '')"))
      .drop("Col2_set")
      .drop("Col3_set")
      .show()
  }

}

Output:

+----+----------+-------------+
|Col1|OutputCol2|   OutputCol3|
+----+----------+-------------+
|   B|     false|           []|
|   C|      true|[Cat, Cotton]|
|   A|      true|      [Apple]|
+----+----------+-------------+

5 Comments

I have edited the question; this works for Part 1. How can I achieve the Part 2 desired output?
@Nikita Updated the answer. You should have posted a new question instead of editing the old one, because reviewers might downvote the old answers as they no longer match the expected output.
didn't know that, new here, will take care. thanks for updating the answer :)
is there a way to achieve this without using array_remove()?
@Nikita added an alternate solution, .withColumn("OutputCol3",expr("filter(Col3_set, x -> x != '')")), in the answer

Try this:

  1. Group by on col1
  2. collect_set on col2 -- for a boolean column, collect_set gives you at most 2 elements, which is good for performance too

Pass the Set of booleans collected in step 2 to a UDF, which does a simple reduceLeft to OR all the elements together.

scala> val df = List(
     | ("A",  true),
     | ("A",  false),
     | ("B",  false),
     | ("B",  false),
     | ("C",  true),
     | ("B",  false),
     | ("A",  true),
     | ("C",  true)
     | ).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: boolean]

scala> df.show
+----+-----+
|col1| col2|
+----+-----+
|   A| true|
|   A|false|
|   B|false|
|   B|false|
|   C| true|
|   B|false|
|   A| true|
|   C| true|
+----+-----+


scala> val aggOr = udf((a:Seq[Boolean])=>{a.reduceLeft(_||_)})
aggOr: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(BooleanType,false))))

scala> df.groupBy("col1").agg(aggOr(collect_set("col2")).as("col2Or")).show
+----+------+
|col1|col2Or|
+----+------+
|   B| false|
|   C|  true|
|   A|  true|
+----+------+
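A side note not in the original answer: since the standard Boolean ordering puts false below true, the reduceLeft OR is the same as taking a maximum, so Spark's built-in max aggregate (e.g. df.groupBy("col1").agg(max("col2"))) should produce the same result without a UDF; treat that Spark call as an untested suggestion. The underlying equivalence in plain Scala:

```scala
// Because false < true under the standard Boolean ordering,
// folding with || over a non-empty collection equals taking its max.
val groups = List(Seq(true, false), Seq(false, false), Seq(true, true))
val equivalent = groups.forall(g => g.reduceLeft(_ || _) == g.max)
println(equivalent) // true
```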

