Applying UDF only on rows where value is not null or not an empty string not working as expected

Question

What's the best (fastest) way to apply a UDF only when a value is neither null nor an empty string?

I've added a simple example.

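from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumes an existing SparkSession named `spark`.
df = spark.createDataFrame(
    [["John Jones"], ["Tracey Smith"], [None], ["Amy Sanders"], [""]]
).toDF("Name")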


def upperCase(str):
    return str.upper()


upperCaseUDF = udf(lambda z: upperCase(z), StringType())

df.withColumn(
    "Created Name",
    F.when(
        # Both conditions must hold: not null AND not blank.
        ((F.col("Name").isNotNull()) & (F.trim(F.col("Name")) != "")),
        upperCaseUDF(F.col("Name")),
    ),
)

AttributeError: 'NoneType' object has no attribute 'upper'.

I don't think the when clause works properly (or at least not as I would expect).
I get an error for the Null value.

I expect the UDF not to be executed on a null value.
The question is not how to handle the null value, but why the when clause doesn't behave as I would expect.

Your problem lies with upperCaseUDF = udf(lambda z: upperCase(z), StringType()), since one of your rows has a value of None, and NoneType has no attribute upper. You can fix this easily by updating the function upperCase to detect a None value and return something, and otherwise return value.upper().
– itprorh66, Commented Jul 12, 2021 at 13:25
@itprorh66 As stated, I understand the error, but why is the UDF applied when there's a null value? It looks like the when clause is ignored. I don't want to check for null values in the UDF; I just want the UDF not to be applied when the value is null or an empty string.
– John Doe, Commented Jul 12, 2021 at 13:31
I think that the optimizer, in order to save computation time, computes both the true and false outputs and then selects the proper one depending on the result of the when condition.
– Steven, Commented Jul 12, 2021 at 13:36
@Steven: Would this depend on the size of the data set? My real case is a (very) large data set and I noticed the same behaviour.
– John Doe, Commented Jul 13, 2021 at 4:29
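
The hypothesis in the comments above can be illustrated with a small plain-Python model. This is only a sketch of the two evaluation strategies and not Spark's actual implementation: the helper names (is_usable, eager_when, guarded_when) are hypothetical and exist only for this illustration. It assumes the UDF is run over every row before the condition is used to pick the result, which would explain why the None row reaches upper().

```python
# Plain-Python model (not Spark code) of two ways to evaluate "when(cond, f(x))".
names = ["John Jones", "Tracey Smith", None, "Amy Sanders", ""]


def is_usable(value):
    """Mirror the intended condition: not null and not blank."""
    return value is not None and value.strip() != ""


def upper_case(value):
    return value.upper()


def eager_when(values, predicate, func):
    """Run func on every row first, then keep results where predicate holds."""
    computed = [func(v) for v in values]  # func is called with None here
    return [c if predicate(v) else None for v, c in zip(values, computed)]


def guarded_when(values, predicate, func):
    """Check predicate first and call func only on rows that pass."""
    return [func(v) if predicate(v) else None for v in values]


try:
    eager_when(names, is_usable, upper_case)
    eager_error = None
except AttributeError as exc:
    eager_error = str(exc)

guarded_result = guarded_when(names, is_usable, upper_case)
print(eager_error)
print(guarded_result)
```

In this model the eager variant fails with the same AttributeError as in the question, while the guarded variant never calls the function on None. Since the order of evaluation inside Spark cannot be relied on, the practical consequence is the same either way: the UDF itself has to tolerate null input.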

Steven · Accepted Answer · 2021-07-13 09:21:41Z

2

I would advise you to assume that your UDF will be applied to the whole DataFrame and to adapt the code accordingly:

@F.udf
def upperCase(in_string):
    return in_string.upper() if in_string else in_string


df.withColumn(
    "Created_Name",
    upperCase(F.col("Name")),
).show()

+------------+------------+
|        Name|Created_Name|
+------------+------------+
|  John Jones|  JOHN JONES|
|Tracey Smith|TRACEY SMITH|
|        null|        null|
| Amy Sanders| AMY SANDERS|
|            |            |
+------------+------------+
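
If several UDFs need the same protection, the guard can be factored out into a decorator. The following is a minimal sketch, not part of the original answer: null_safe is a hypothetical helper name, and unlike the check above it also treats whitespace-only strings as empty, mirroring the trim in the question. The Spark registration is shown only as a comment so that the block runs without a Spark session.

```python
import functools


def null_safe(func):
    """Wrap func so that None and blank strings are returned unchanged."""

    @functools.wraps(func)
    def wrapper(value):
        if value is None or value.strip() == "":
            return value
        return func(value)

    return wrapper


@null_safe
def upper_case(value):
    return value.upper()


# With PySpark available, the wrapped function could be registered as usual
# (assumption: F and StringType are imported as in the question):
#   upper_case_udf = F.udf(upper_case, StringType())
#   df.withColumn("Created_Name", upper_case_udf(F.col("Name")))

results = [upper_case(v) for v in ["John Jones", None, "", "  "]]
print(results)
```

The wrapped function behaves the same whether or not the surrounding when condition has already excluded the row, so it does not depend on the order in which Spark evaluates the expression.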

NB: Your original UDF works if you filter out the null rows first:

df.where(F.col("Name").isNotNull()).select(upperCaseUDF(F.col("Name"))).show()
+--------------+
|<lambda>(Name)|
+--------------+
|    JOHN JONES|
|  TRACEY SMITH|
|   AMY SANDERS|
|              |
+--------------+

edited Jul 13, 2021 at 9:21

answered Jul 13, 2021 at 9:16

