
How can we collapse a boolean column to a single row with an OR operation using Scala?

Part 1:

A  true
A  false
B  false
B  false
C  true
B  false
A  true
C  true

Desired Output


B  false
A  true
C  true

A solution I could think of was to group the rows by the first column, filter the true and false rows into separate data frames, drop duplicates, and finally append one data frame (false) to the other (true) while checking whether the letter (e.g. A) already exists in the true data frame.

This solution is quite messy, and I don't know whether it would work for all edge cases. Is there some smart way to do this?
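To make the intended semantics concrete, here is the collapse expressed on plain Scala collections (just a sketch of what I want, not a Spark solution):

```scala
// Plain-Scala sketch (not Spark): collapsing a boolean column per key
// with OR is a groupBy followed by exists over each group.
val rows = List(
  ("A", true), ("A", false),
  ("B", false), ("B", false),
  ("C", true), ("B", false),
  ("A", true), ("C", true))

// For each key, the collapsed value is true iff any row for that key is true
val collapsed: Map[String, Boolean] =
  rows.groupBy(_._1).map { case (key, group) => key -> group.exists(_._2) }

println(collapsed) // A -> true, B -> false, C -> true (map order may vary)
```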

I'm an absolute beginner, any help is appreciated.

Edit: the given answers work for the scenario above but don't work for the following one. Any way to achieve the desired output?

Part 2:

A  true    "Apple"
A  false   ""
B  false   ""
B  false   ""
C  true    "Cat"
C  true    "Cotton"
C  false   ""

Desired Output


B  false []
A  true  ["Apple"]
C  true  ["Cat","Cotton"]

I tried to achieve this by grouping by col1 and col2 and then collapsing col3 using collect_set, i.e.:

  1. Group by 1st column
  2. Collect 2nd column as Set of boolean
  3. Check if there's a single true; if yes, your OR expression will always evaluate to true.

but this leads to losing col3_set altogether.
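For clarity, the Part 2 semantics I'm after, expressed on plain Scala collections (a sketch only, not Spark): group by the key alone, OR the booleans, and collect the distinct non-empty strings.

```scala
// Plain-Scala sketch of Part 2 (not Spark): group by the key only,
// OR the booleans, keep the distinct non-empty strings per key.
val rows = List(
  ("A", true, "Apple"), ("A", false, ""),
  ("B", false, ""), ("B", false, ""),
  ("C", true, "Cat"), ("C", true, "Cotton"), ("C", false, ""))

val collapsed: Map[String, (Boolean, Set[String])] =
  rows.groupBy(_._1).map { case (key, group) =>
    (key, (group.exists(_._2), group.map(_._3).filter(_.nonEmpty).toSet))
  }

println(collapsed("C")) // (true, Set(Cat, Cotton))
```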

2 Answers

  1. Group by 1st column
  2. Collect 2nd column as Set of Boolean
  3. Collect 3rd column as Set of String
  4. Check if there's a single true; if yes, your OR expression will always evaluate to true.
  5. Remove empty string "" from col3_set
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupByAgg {

  def main(args: Array[String]): Unit = {

    // Local SparkSession (the original used a project-specific helper)
    val spark = SparkSession.builder()
      .appName("GroupByAgg")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = List(
      ("A", true, "Apple"),
      ("A", false, ""),
      ("B", false, ""),
      ("B", false, ""),
      ("C", true, "Cat"),
      ("C", true, "Cotton"),
      ("C", false, "")).toDF("Col1", "Col2", "Col3")

    // Group by 1st column
    df.groupBy("Col1")
      // Collect unique values of Col2 and Col3
      .agg(collect_set("Col2").as("Col2_set"), collect_set("Col3").as("Col3_set"))
      // Check if the boolean set contains a single true
      .withColumn("OutputCol2", when(array_contains(col("Col2_set"), true), true)
        .otherwise(false))
      // Remove the empty string "" from the string set (Spark 2.4+)
      .withColumn("OutputCol3", array_remove(col("Col3_set"), lit("")))
      // Alternative without array_remove, using a higher-order function:
      // .withColumn("OutputCol3", expr("filter(Col3_set, x -> x != '')"))
      .drop("Col2_set")
      .drop("Col3_set")
      .show()
  }

}

Output:

+----+----------+-------------+
|Col1|OutputCol2|   OutputCol3|
+----+----------+-------------+
|   B|     false|           []|
|   C|      true|[Cat, Cotton]|
|   A|      true|      [Apple]|
+----+----------+-------------+

5 Comments

I have edited the question; this works for Part 1. How can I achieve the Part 2 desired output?
@Nikita Updated the answer. You should have posted a new question instead of editing the old one, because reviewers might downvote the old answers as they no longer match the expected output.
didn't know that, new here, will take care. thanks for updating the answer :)
is there a way to achieve this without using array_remove()?
@Nikita added an alternate solution, .withColumn("OutputCol3",expr("filter(Col3_set, x -> x != '')")), in the answer

Try this:

  1. Group by on col1
  2. collect_set on col2 -- for a boolean column, collect_set gives you at most 2 elements, which is good for performance too

Pass the Set of booleans collected in step 2 to a UDF, which does a simple reduceLeft to OR all the elements together.

scala> val df = List(
     | ("A",  true),
     | ("A",  false),
     | ("B",  false),
     | ("B",  false),
     | ("C",  true),
     | ("B",  false),
     | ("A",  true),
     | ("C",  true)
     | ).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: boolean]

scala> df.show
+----+-----+
|col1| col2|
+----+-----+
|   A| true|
|   A|false|
|   B|false|
|   B|false|
|   C| true|
|   B|false|
|   A| true|
|   C| true|
+----+-----+


scala> val aggOr = udf((a:Seq[Boolean])=>{a.reduceLeft(_||_)})
aggOr: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(BooleanType,false))))

scala> df.groupBy("col1").agg(aggOr(collect_set("col2")).as("col2Or")).show
+----+------+
|col1|col2Or|
+----+------+
|   B| false|
|   C|  true|
|   A|  true|
+----+------+
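A side note not in the original answer: since the standard Boolean ordering puts false below true, the reduceLeft OR is the same as taking a maximum, so Spark's built-in max aggregate (e.g. df.groupBy("col1").agg(max("col2"))) should produce the same result without a UDF; treat that Spark call as an untested suggestion. The underlying equivalence in plain Scala:

```scala
// Because false < true under the standard Boolean ordering,
// folding with || over a non-empty collection equals taking its max.
val groups = List(Seq(true, false), Seq(false, false), Seq(true, true))
val equivalent = groups.forall(g => g.reduceLeft(_ || _) == g.max)
println(equivalent) // true
```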

