
I would like to append a new column to dataframe "df" using the function get_distance:

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result

df = df.withColumn(
    "distance",
    lit(get_distance(df["column1"], df["column2"]))
)

But, I get this:

TypeError: 'Column' object is not callable

I think this happens because x and y are Column objects and need to be converted to String to be used in my query. Am I right? If so, how can I do this?

2 Answers


Spark needs to know that the function you are using is not an ordinary function but a UDF.

There are two ways to use a UDF on DataFrames.

Method-1: With @udf annotation

from pyspark.sql.functions import udf

@udf
def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result[0][0]  # take(1) returns a list of Rows; extract the value

df = df.withColumn(
    "distance",
    get_distance(df["column1"], df["column2"])  # a udf call already returns a Column, no lit() needed
)

Method-2: Registering the udf with pyspark.sql.functions.udf

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result[0][0]  # take(1) returns a list of Rows; extract the value

calculate_distance_udf = udf(get_distance, IntegerType())

df = df.withColumn(
    "distance",
    calculate_distance_udf(df["column1"], df["column2"])  # no lit() needed
)

  • You cannot use a Python function on Column objects directly, unless it is intended to operate on Column objects / expressions. You need a udf for that:

    @udf
    def get_distance(x, y):
        ...
    
  • But you cannot use SQLContext (or hiveContext) inside a udf (or in a mapper in general), because it exists only on the driver, not on the executors.

  • Just join:

    tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3"))
    df.join(tab, ["column1", "column2"])
    
