It is well documented on SO (link 1, link 2, link 3, ...) how to cast a single column to string type in PySpark:

from pyspark.sql.types import StringType
spark_df = spark_df.withColumn('name_of_column', spark_df['name_of_column'].cast(StringType()))

However, when you have several columns that you want to transform to string type, there are several ways to achieve it:

Using for loops -- Successful approach in my code:

Trivial example:

to_str = ['age', 'weight', 'name', 'id']
for col in to_str:
  spark_df = spark_df.withColumn(col, spark_df[col].cast(StringType()))

which is a valid method, but I believe not the optimal one I am looking for.
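For what it's worth, the same loop can also be collapsed into a single expression with functools.reduce; a minimal sketch, equivalent to the loop above:

from functools import reduce
from pyspark.sql.types import StringType

to_str = ['age', 'weight', 'name', 'id']
# fold the column list into a chain of withColumn casts
spark_df = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(StringType())),
    to_str,
    spark_df,
)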

Using list comprehensions -- Not successful in my code:

My wrong example:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str))

Not successful, as I receive the error message:

TypeError: 'str' object is not callable

My question then would be: what is the optimal way to transform several columns to string in PySpark, based on a list of column names like to_str in my example?

Thanks in advance for your advice.

EDIT (CLARIFICATION):

Thanks to @Rumoku's and @pault's feedback:

Both code lines are correct:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)) # My initial list comprehension expression is correct.

and

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str]) # Initial answer proposed by @Rumoku is correct.

I was receiving the error messages from PySpark because I had earlier renamed the object to_str to col. As @pault explains: col (the list with the desired string variables) had the same name as the function col used in the list comprehension, which is why PySpark complained. Simply renaming the list back to to_str, and refreshing the Spark notebook, fixed everything.
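To make the failure mode concrete, here is a minimal sketch of the shadowing (with illustrative names, not my actual code):

from pyspark.sql.functions import col

to_str = ['age', 'weight']
for col in to_str:   # rebinds the name `col` to a plain string
    pass

col('age')           # TypeError: 'str' object is not callable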

5 Comments

  • Spark is lazy, so your for loop will build a single query and execute it later. There should not be a big difference between a for loop and a list comprehension in terms of performance. Commented May 16, 2018 at 10:04
  • Hi @Rumoku, thanks for your answer. Do you know the correct syntax for the list-comprehension option in PySpark that didn't work in my case? The line I used was: spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)), with error message: "TypeError: 'str' object is not callable" Commented May 16, 2018 at 11:12
  • Somewhere you have overwritten a variable as a string. My guess is perhaps col. Can you do print(type(col))? Commented May 16, 2018 at 12:52
  • @NuValue that is exactly it. In your for-loop version you assigned col to a string (for col in to_str:). Then you tried to use it later as a function (col(c).cast()). Commented May 16, 2018 at 12:55
  • @pault, you are totally right. I needed to refresh my Spark notebook, because I had removed the object named 'col' afterwards. Thanks a lot! Commented May 16, 2018 at 12:57

2 Answers


It should be:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
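
For reference, a self-contained sketch (the sample data is hypothetical, not from the question) showing the cast end to end:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
# hypothetical sample data, for illustration only
spark_df = spark.createDataFrame(
    [(1, 30, 70.5, 'Alice')], ['id', 'age', 'weight', 'name']
)

to_str = ['age', 'weight', 'name', 'id']
spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
spark_df.printSchema()  # every column should now report `string`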

5 Comments

Hi @Rumoku, thanks for your answer. After executing your line of code I get the following message: "TypeError: 'str' object is not callable". I wonder if the StringType() function is the one causing the problem in that line? I had previously imported it with: "from pyspark.sql.types import StringType".
Do print(spark_df) on a previous line. I wonder if it's a DataFrame at all.
Thanks; print(spark_df) prints the following output: DataFrame[ID: bigint, Source: string, NickName: string, EqType: bigint]
How did you import col?
Now it works perfectly, @Rumoku; refreshing the Spark notebook was what was needed. I will mark your answer as correct.

I am not sure what col() is in the list-comprehension part of your solution, but anyone looking for a solution can try this:

from pyspark.sql.types import StringType 

to_str = ['age', 'weight', 'name', 'id']

spark_df = spark_df.select(
  [spark_df[c].cast(StringType()).alias(c) for c in to_str]
)

To cast all the columns to string type, replace to_str with spark_df.columns.
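For example, a minimal sketch of that all-columns variant:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# cast every column, whatever its current type, to string, keeping the names
spark_df = spark_df.select(
    [col(c).cast(StringType()).alias(c) for c in spark_df.columns]
)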

