10

I have an input dataframe (ip_df); the data in this dataframe looks like this:

id            col_value
1               10
2               11
3               12

The data type of both id and col_value is String.

I need to get another dataframe (output_df), with the id column as string and the col_value column as decimal(15,4). There is no data transformation, just a data type conversion. Can I do this using PySpark? Any help will be appreciated.

3 Answers

12

Try using the cast method:

from pyspark.sql.types import DecimalType

output_df = ip_df.withColumn("col_value", ip_df["col_value"].cast(DecimalType(15, 4)))
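As an aside, decimal(15,4) means at most 15 digits in total, 4 of them after the decimal point. Python's own decimal module can sketch the quantization such a cast performs (the helper name below is made up for illustration; Spark's actual cast rounds half-up when trimming excess fractional digits):

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical helper illustrating decimal(15,4) semantics:
# up to 15 total digits, 4 of them after the decimal point.
def to_decimal_15_4(s):
    # Quantize to 4 fractional digits, rounding half-up,
    # mirroring how a cast to decimal(15,4) behaves.
    return Decimal(s).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

print(to_decimal_15_4("10"))        # 10.0000
print(to_decimal_15_4("11.56789"))  # 11.5679
```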

3 Comments

It is giving the error: name 'DecimalType' is not defined
You need to import it first: from pyspark.sql.types import DecimalType
5

Try the statement below; you can pass the SQL type name directly to cast as a string, with no import needed:

output_df = ip_df.withColumn("col_value", ip_df["col_value"].cast('decimal(15,4)'))
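A side note on why a fixed-point decimal target (rather than float) matters for this question: binary floats cannot represent most decimal fractions exactly, while Decimal values stay exact in base 10. A quick illustration in plain Python:

```python
from decimal import Decimal

# Binary floats accumulate representation error...
print(0.1 + 0.2)                        # 0.30000000000000004
# ...while decimals stay exact in base 10.
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```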

Comments

1

You can change multiple column types:

  • Using withColumn()
from pyspark.sql.types import DecimalType, StringType

output_df = ip_df \
  .withColumn("col_value", ip_df["col_value"].cast(DecimalType(15, 4))) \
  .withColumn("id", ip_df["id"].cast(StringType()))
  • Using select()
from pyspark.sql.types import DecimalType, StringType

output_df = ip_df.select(
  ip_df.id.cast(StringType()).alias('id'),
  ip_df.col_value.cast(DecimalType(15, 4)).alias('col_value')
)
  • Using spark.sql()
ip_df.createOrReplaceTempView("ip_df_view")

output_df = spark.sql('''
SELECT
    CAST(id AS STRING) AS id,
    CAST(col_value AS DECIMAL(15,4)) AS col_value
FROM ip_df_view
''')

Comments
