What's the best (fastest) way to apply a UDF only when a value is neither null nor an empty string?

I've added a simple example.

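from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# `spark` is assumed to be an existing SparkSession.
df = spark.createDataFrame(
    [["John Jones"], ["Tracey Smith"], [None], ["Amy Sanders"], [""]]
).toDF("Name")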


def upperCase(str):
    return str.upper()


upperCaseUDF = udf(lambda z: upperCase(z), StringType())

df.withColumn(
    "Cureated Name",
    F.when(
        ((F.col("Name").isNotNull()) | (F.trim(F.col("name")) != "")),
        upperCaseUDF(F.col("Name")),
    ),
)

AttributeError: 'NoneType' object has no attribute 'upper'. 

I don't think the when clause works properly (or at least not as I would expect): I get an error for the null value.

I expect the UDF not to be executed on a null value.
My question is not how to handle the null value, but why the when clause doesn't behave as I would expect.

  • Your problem lies with upperCaseUDF = udf(lambda z: upperCase(z), StringType()), since the column contains a value of None, and NoneType has no attribute upper. You can fix this easily by updating the function upperCase to detect a None value and return it unchanged, otherwise return value.upper(). Commented Jul 12, 2021 at 13:25
  • @itprorh66 As stated, I understand the error, but why is the UDF applied when there's a null value? It looks like the when clause is ignored. I don't want to check for null values in the UDF; I just want the UDF not to be applied when the value is null or an empty string. Commented Jul 12, 2021 at 13:31
  • I think that the optimizer, in order to save computation time, computes both the true and the false output and then selects the proper one depending on the result of the when condition. Commented Jul 12, 2021 at 13:36
  • @Steven : Would this depend on the size of the data set? My real case is a (very) large data set and I noticed the same behaviour. Commented Jul 13, 2021 at 4:29
  • @JohnDoe It is independent of the size. Commented Jul 13, 2021 at 9:11
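
The behaviour described in these comments can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's actual implementation: it assumes the Python function is evaluated for every row first and that the when condition merely selects among the already computed values. The helper names eager_when and lazy_when are hypothetical and exist only for this illustration.

```python
# Toy model in plain Python (an assumption for illustration, not Spark internals).


def upper_case(value):
    # Same logic as the question's UDF: fails when value is None.
    return value.upper()


def eager_when(rows, condition, func):
    # Step 1: the function runs on ALL rows, including the None row.
    computed = [func(row) for row in rows]
    # Step 2: the condition only picks among values that already exist.
    return [out if condition(row) else None for row, out in zip(rows, computed)]


def lazy_when(rows, condition, func):
    # What the question expected: the function runs only where the condition holds.
    return [func(row) if condition(row) else None for row in rows]


def is_usable(value):
    # True only for values that are neither None nor blank.
    return value is not None and value.strip() != ""


names = ["John Jones", "Tracey Smith", None, "Amy Sanders", ""]

try:
    eager_when(names, is_usable, upper_case)
    eager_failed = False
except AttributeError:
    # Same error as in the question: 'NoneType' object has no attribute 'upper'
    eager_failed = True

lazy_result = lazy_when(names, is_usable, upper_case)
```

In this model the eager variant fails on the None row even though the condition would have excluded it, which matches the error in the question. The PySpark documentation for udf also notes that user-defined functions do not support conditional expressions or short-circuiting, and recommends moving the condition into the function.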

1 Answer

I would advise you to write your UDF so that it can safely be applied to the whole DataFrame, and to adapt the code accordingly:

@F.udf
def upperCase(in_string):
    return in_string.upper() if in_string else in_string


df.withColumn(
    "Created_Name",
    upperCase(F.col("Name")),
).show()

+------------+------------+
|        Name|Created_Name|
+------------+------------+
|  John Jones|  JOHN JONES|
|Tracey Smith|TRACEY SMITH|
|        null|        null|
| Amy Sanders| AMY SANDERS|
|            |            |
+------------+------------+

NB: Your original UDF works if you filter out the null rows first:

df.where(F.col("Name").isNotNull()).select(upperCaseUDF(F.col("Name"))).show()
+--------------+
|<lambda>(Name)|
+--------------+
|    JOHN JONES|
|  TRACEY SMITH|
|   AMY SANDERS|
|              |
+--------------+
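
If you would rather not put the null check inside each function, one option is to move it into a reusable wrapper. The following is a minimal sketch, assuming hypothetical helpers named skip_blank and as_spark_udf (neither is part of PySpark). The guard still runs inside the Python function, so it does not rely on when to skip rows.

```python
from functools import wraps


def skip_blank(func):
    """Hypothetical helper: return None or blank strings unchanged, else call func."""

    @wraps(func)
    def wrapper(value):
        if value is None or value.strip() == "":
            return value
        return func(value)

    return wrapper


@skip_blank
def upper_case(value):
    # No null handling needed here; the decorator takes care of it.
    return value.upper()


def as_spark_udf(func):
    # Hypothetical convenience wrapper. PySpark is imported lazily so the
    # plain-Python helper above can be used and tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(func, StringType())
```

With a DataFrame like the one in the question, this could be used as df.withColumn("Created_Name", as_spark_udf(upper_case)(F.col("Name"))). Note that in this sketch whitespace-only strings are also returned unchanged.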