I am trying to convert the following SQL query into PySpark:

SELECT COUNT( CASE WHEN COALESCE(data.pred,0) != 0 AND COALESCE(data.val,0) != 0 
    AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
    THEN data.pred END) / COUNT(*) AS Result

The code I have in PySpark right now is this:

Result = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0) & 
        (coalesce(data["val"], lit(0)) != 0) & 
        (abs(
             coalesce(data["pred"], lit(0)) - 
             coalesce(data["val"], lit(0))
            ) / coalesce(data["val"], lit(0)) > 0.1
        )
    )
)
aux_2 = aux_1.select(aux_1.column_name.cast("float"))

aux_3 = aux_2.head()[0]

Deviation = (aux_3 / data.count())*100

However, this simply returns the number of rows in the "data" DataFrame, which I know isn't correct. I am very new to PySpark; can anyone help me solve this?

1 Answer

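count() on a boolean column counts every non-null value, true or false alike, which is why you get the full row count back. Filter the rows that satisfy the condition first, then count them. DataFrame.count() returns a Python integer, so the division can be done in Python: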

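# Note: this abs is Spark's column function and shadows Python's built-in abs.
from pyspark.sql.functions import abs, coalesce, lit

# Keep only the rows that meet the condition, then divide the two Python ints.
Result = data.filter(
    (coalesce(data["pred"], lit(0)) != 0) &
    (coalesce(data["val"], lit(0)) != 0) &
    (abs(
         coalesce(data["pred"], lit(0)) -
         coalesce(data["val"], lit(0))
        ) / coalesce(data["val"], lit(0)) > 0.1
    )
).count() / data.count()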

6 Comments

I tried to implement that, but I got the following error: "TypeError: unsupported operand type(s) for /: 'list' and 'int'". I tried indexing the list, but that did not work either. Do you know what the solution might be?
@Johanna oops, sorry, I forgot to get the first element of the collected results.
PS: I edited the code in the question; I had forgotten the last part (the division).
It is still returning the total number of rows in the DataFrame. At first I thought it was correct, but looking at the values I can see it is not.
I think the PySpark version of the SQL query doesn't give the expected result because you're counting a column of boolean values. You should filter for the values that are true and count only those.