It is well documented on SO (link 1, link 2, link 3, ...) how to cast a single column to string type in PySpark:

from pyspark.sql.types import StringType
spark_df = spark_df.withColumn('name_of_column', spark_df['name_of_column'].cast(StringType()))

However, when you have several columns that you want to transform to string type, there are several ways to achieve it:

Using for loops -- Successful approach in my code:

Trivial example:

to_str = ['age', 'weight', 'name', 'id']
for col in to_str:
  spark_df = spark_df.withColumn(col, spark_df[col].cast(StringType()))

which is a valid method, but I believe not the optimal one I am looking for.
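For what it's worth, the same loop can also be collapsed into a single expression with functools.reduce; a minimal sketch, equivalent to the loop above:

from functools import reduce
from pyspark.sql.types import StringType

to_str = ['age', 'weight', 'name', 'id']
# fold the column list into a chain of withColumn casts
spark_df = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(StringType())),
    to_str,
    spark_df,
)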

Using list comprehensions -- Not successful in my code:

My wrong example:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str))

Not successful, as I receive the error message:

TypeError: 'str' object is not callable

My question then would be: what is the optimal way to transform several columns to string in PySpark, based on a list of column names like to_str in my example?

Thanks in advance for your advice.

EDIT (CLARIFICATION):

Thanks to @Rumoku's and @pault's feedback:

Both code lines are correct:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)) # My initial list comprehension expression is correct.

and

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str]) # Initial answer proposed by @Rumoku is correct.

I was receiving the error messages from PySpark because I had earlier renamed the object to_str to col. As @pault explains: col (the list with the desired string variables) had the same name as the function col used in the list comprehension, which is why PySpark complained. Simply renaming the list back to to_str, and refreshing the Spark notebook, fixed everything.
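To make the failure mode concrete, here is a minimal sketch of the shadowing (with illustrative names, not my actual code):

from pyspark.sql.functions import col

to_str = ['age', 'weight']
for col in to_str:   # rebinds the name `col` to a plain string
    pass

col('age')           # TypeError: 'str' object is not callable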

5 Comments

  • Spark is lazy, so your for loop will build a single query and execute it later. There should not be a big difference between a for loop and a list comprehension in terms of performance. Commented May 16, 2018 at 10:04
  • Hi @Rumoku, thanks for your answer. Do you know the correct syntax for the list-comprehension option in PySpark that didn't work in my case? The line I used was: spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)), with error message: "TypeError: 'str' object is not callable" Commented May 16, 2018 at 11:12
  • Somewhere you have overwritten a variable as a string. My guess is perhaps col. Can you do print(type(col))? Commented May 16, 2018 at 12:52
  • @NuValue that is exactly it. In your for-loop version you assigned col to a string (for col in to_str:). Then you tried to use it later as a function (col(c).cast()). Commented May 16, 2018 at 12:55
  • @pault, you are totally right. I needed to refresh my Spark notebook, because I had removed the object named 'col' afterwards. Thanks a lot! Commented May 16, 2018 at 12:57

2 Answers


It should be:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
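
For reference, a self-contained sketch (the sample data is hypothetical, not from the question) showing the cast end to end:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
# hypothetical sample data, for illustration only
spark_df = spark.createDataFrame(
    [(1, 30, 70.5, 'Alice')], ['id', 'age', 'weight', 'name']
)

to_str = ['age', 'weight', 'name', 'id']
spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
spark_df.printSchema()  # every column should now report `string`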

5 Comments

Hi @Rumoku, thanks for your answer. After executing your line of code I get the following message: "TypeError: 'str' object is not callable". I wonder if the StringType() function is the one causing the problem in that line? I had previously imported it with: "from pyspark.sql.types import StringType".
Do print(spark_df) on a previous line. I wonder if it's a DataFrame at all.
Thanks; print(spark_df) prints the following output: DataFrame[ID: bigint, Source: string, NickName: string, EqType: bigint]
How did you import col?
Now it works perfectly, @Rumoku; refreshing the Spark notebook was what was needed. I will mark your answer as correct.

I am not sure what col() is in the list-comprehension part of your solution, but anyone looking for a solution can try this:

from pyspark.sql.types import StringType 

to_str = ['age', 'weight', 'name', 'id']

spark_df = spark_df.select(
  [spark_df[c].cast(StringType()).alias(c) for c in to_str]
)

To cast all the columns to string type, replace to_str with spark_df.columns.
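For example, a minimal sketch of that all-columns variant:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# cast every column, whatever its current type, to string, keeping the names
spark_df = spark_df.select(
    [col(c).cast(StringType()).alias(c) for c in spark_df.columns]
)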

