
The normal SQL query:

    SELECT DISTINCT(county_geoid), state_geoid, sum(PredResponse), sum(prop_count) FROM table_a GROUP BY county_geoid;

This gives me the expected output. However, the Spark SQL version of the same query, used in PySpark, gives me an error.

    result_county_performance_alpha = spark.sql("SELECT distinct(county_geoid), sum(PredResponse), sum(prop_count), state_geoid FROM table_a group by county_geoid")

This gives the following error:

    AnalysisException: u"expression 'tract_alpha.`state_geoid`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How can I resolve this issue?

1 Answer


Your "normal" query should not work anywhere. The correct way to write the query is:

    SELECT county_geoid, state_geoid, sum(PredResponse), sum(prop_count)
    FROM table_a
    GROUP BY county_geoid, state_geoid;

This should work on any database (where the columns and tables are defined and of the right types).
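
In PySpark the same fix applies to the spark.sql call. A minimal sketch, assuming the spark session and the table_a temporary view from the question already exist (the column aliases are just placeholders):

    # Every non-aggregated column in the SELECT (county_geoid, state_geoid)
    # also appears in the GROUP BY, so the analyzer no longer complains.
    result_county_performance_alpha = spark.sql("""
        SELECT county_geoid,
               state_geoid,
               sum(PredResponse) AS sum_pred_response,
               sum(prop_count)   AS sum_prop_count
        FROM table_a
        GROUP BY county_geoid, state_geoid
    """)
    result_county_performance_alpha.show()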

Your version has state_geoid in the SELECT, but it is neither in the GROUP BY nor wrapped in an aggregate function. That is not valid SQL. It might happen to work in MySQL, but only because of a (mis)feature of that database: MySQL traditionally let you select non-aggregated columns that are not in the GROUP BY and returned an arbitrary value for them. That behavior is finally being fixed; the ONLY_FULL_GROUP_BY SQL mode, which rejects such queries, is enabled by default since MySQL 5.7.

Also, you almost never want to combine SELECT DISTINCT with GROUP BY. And the parentheses after DISTINCT make no difference: DISTINCT is not a function, the construct is simply SELECT DISTINCT.
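
If you prefer the DataFrame API to building SQL strings, the equivalent aggregation is a groupBy followed by agg. This is only a sketch under the same assumptions (table_a is available as a table or temporary view, column names as in the question):

    from pyspark.sql import functions as F

    # Group on both key columns and aggregate the two measures.
    df = spark.table("table_a")
    result = (
        df.groupBy("county_geoid", "state_geoid")
          .agg(
              F.sum("PredResponse").alias("sum_pred_response"),
              F.sum("prop_count").alias("sum_prop_count"),
          )
    )
    result.show()

Alternatively, if state_geoid is functionally dependent on county_geoid and you only want one row per county, you can keep GROUP BY county_geoid alone and wrap state_geoid in first(), as the error message itself suggests; just be aware that first() returns an arbitrary value whenever that dependency does not hold.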


2 Comments

Which feature? Could you explain more about why it works only in MySQL?
I understand that using DISTINCT is a mistake, but in MySQL the result varies depending on whether state_geoid is included in the GROUP BY or not.
