Thread: question re. count, group by, and having

question re. count, group by, and having

From
"Rick Schumeyer"
Date:
<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">The following query returns an error (“column c does not exist”) in pg 8.0.3:</span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">(The column ‘state’ is the two letter abbreviation for a </span></font><font face="Arial"
size="2"><spanstyle="font-size:10.0pt;font-family:Arial">US</span></font><font face="Arial" size="2"><span
style="font-size:10.0pt;font-family:Arial">state)</span></font><p class="MsoNormal"><font face="Arial" size="2"><span
style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">-- get the number of rows for each state; list in descending order; include only states with at
least6 rows</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial">select state, count(state) as c from t group by state having c > 5 order by c desc; -- gives
error</span></font><pclass="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">If I leave the having clause out, I get the expected results:</span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">select state, count(state) as c from t group by state order by c desc; -- this works</span></font><p
class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">Is this a bug or a feature?  I’m not sure why I can use ‘c’ in the order by clause but not the
havingclause.  pg is much happier with the full “having count(state) > 5”.  Will this cause count to be evaluated
twice?</span></font><pclass="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial">If it matters, state is varchar(2).</span></font><p class="MsoNormal"><font face="Arial"
size="2"><spanstyle="font-size:10.0pt; 
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt;
font-family:Arial"> </span></font></div>

Re: question re. count, group by, and having

From
Sean Davis
Date:
On 10/11/05 8:50 AM, "Rick Schumeyer" <rschumeyer@ieee.org> wrote:

> The following query returns an error ("column c does not exist") in pg 8.0.3:
> 
> 
> 
> (The column 'state' is the two letter abbreviation for a US state)
> 
> 
> 
> -- get the number of rows for each state; list in descending order; include
> only states with at least 6 rows
> 
> select state, count(state) as c from t group by state having c > 5 order by c
> desc; -- gives error
> 
> 
> 
> If I leave the having clause out, I get the expected results:
> 
> 
> 
> select state, count(state) as c from t group by state order by c desc; -- this
> works
> 
> 
> 
> Is this a bug or a feature?  I'm not sure why I can use 'c' in the order by
> clause but not the having clause.  pg is much happier with the full "having
> count(state) > 5".  Will this cause count to be evaluated twice?

I think that postgres is smart enough to do the evaluation only once, but
this might be version-dependent, but one of the gurus will have to comment
on which version (if there is a version dependence) first made this
improvement.  

Sean



Re: question re. count, group by, and having

From
Tom Lane
Date:
"Rick Schumeyer" <rschumeyer@ieee.org> writes:
> Is this a bug or a feature?  I'm not sure why I can use 'c' in the order by
> clause but not the having clause.  pg is much happier with the full "having
> count(state) > 5".

Actually, referring to any of the output columns in any of the modifier
clauses is logically suspect.  Original SQL (back around 89 or so)
required ORDER BY items to be output column names, thus wiring in an
assumption that sorting happens after calculation of the output values,
but that is surely not true for any of the other clauses.  And it's
pretty bogus even for sorting, since you might wish to sort on a value
you're not displaying.

If we were working in a green field we'd doubtless get rid of the
output-column-reference feature entirely.  But for backward
compatibility's sake we're stuck with allowing ORDER BY items to
be simple output column names, per ancient versions of the SQL spec.
At one point or another somebody thought it a good idea to propagate
that special rule into GROUP BY; which in hindsight was an awful idea.
(It's not in the spec.  I'm not sure if this is just a Postgres-ism
or if we borrowed someone else's bad idea.)  But we're stuck with
supporting that odd case too, now.  We certainly aren't going to add
more.

> Will this cause count to be evaluated twice?

Recent versions of PG are smart enough to merge duplicate aggregates.
This isn't necessarily true for other forms of common subexpressions,
but it works for aggregate functions.
        regards, tom lane


pg, mysql comparison with "group by" clause

From
"Rick Schumeyer"
Date:
I'm not sure what I was thinking, but I tried the following query in pg:

SELECT * FROM t GROUP BY state;

pg returns an error.

Mysql, OTOH, returns the first row for each state.  (The first row with
"AK", the first row with "PA", etc.)  

I'm no SQL expert, but it seems to me that the pg behavior is correct, and
the mysql result is just weird.  Am I correct?



Re: pg, mysql comparison with "group by" clause

From
Scott Marlowe
Date:
On Tue, 2005-10-11 at 16:12, Rick Schumeyer wrote:
> I'm not sure what I was thinking, but I tried the following query in pg:
> 
> SELECT * FROM t GROUP BY state;
> 
> pg returns an error.
> 
> Mysql, OTOH, returns the first row for each state.  (The first row with
> "AK", the first row with "PA", etc.)  
> 
> I'm no SQL expert, but it seems to me that the pg behavior is correct, and
> the mysql result is just weird.  Am I correct?

Yes, you are correct.  The SQL standard is quite clear that in order to
appear in the select list, an entry must either be in the group by or be
operated upon by an aggregate function.  PostgreSQL supports this same
action by way of the non-standard

select distinct on(<fieldlist>), <fieldlist> from ....

Since you don't know for sure which answer you'll get each time, it's
better to KNOW you're doing something that may not be reproduceable than
to accidentally do it when your database SHOULD be throwing an error.

That's just one of many many things MySQL does that makes my head hurt. 
For more, search google for "mysql gotchas"


Re: question re. count, group by, and having

From
Muralidharan Ramakrishnan
Date:
Hi
 
  Logically HAVING is executed after the GROUP BY and it must contain only the columns in the GROUP BY or aggregated function.
 
select state, count(state) as c from t group by state having c > 5
 
The above query grouped only on state and HAVING can be used only with the column state or the aggregated function count(state).
 
Regards,
R.Muralidharan
 

Rick Schumeyer <rschumeyer@ieee.org> wrote:

The following query returns an error (“column c does not exist”) in pg 8.0.3:

 

(The column ‘state’ is the two letter abbreviation for a US state)

 

-- get the number of rows for each state; list in descending order; include only states with at least 6 rows

select state, count(state) as c from t group by state having c > 5 order by c desc; -- gives error

 

If I leave the having clause out, I get the expected results:

 

select state, count(state) as c from t group by state order by c desc; -- this works

 

Is this a bug or a feature?  I’m not sure why I can use ‘c’ in the order by clause but not the having clause.  pg is much happier with the full “having count(state) > 5”.  Will this cause count to be evaluated twice?

 

If it matters, state is varchar(2).

 

 

 


Yahoo! India Matrimony: Find your partner online.

Re: pg, mysql comparison with "group by" clause

From
Stephan Szabo
Date:
On Tue, 11 Oct 2005, Rick Schumeyer wrote:

> I'm not sure what I was thinking, but I tried the following query in pg:
>
> SELECT * FROM t GROUP BY state;
>
> pg returns an error.
>
> Mysql, OTOH, returns the first row for each state.  (The first row with
> "AK", the first row with "PA", etc.)
>
> I'm no SQL expert, but it seems to me that the pg behavior is correct, and
> the mysql result is just weird.  Am I correct?

In your case, it sounds like the mysql result is wrong. I believe SQL99
would allow it if the other columns were functionally dependant upon state
(as there'd by definition only be one value for the other columns per
group).



Re: pg, mysql comparison with "group by" clause

From
Greg Stark
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> On Tue, 11 Oct 2005, Rick Schumeyer wrote:
> 
> > I'm not sure what I was thinking, but I tried the following query in pg:
> >
> > SELECT * FROM t GROUP BY state;
> >
> > pg returns an error.
> >
> > Mysql, OTOH, returns the first row for each state.  (The first row with
> > "AK", the first row with "PA", etc.)
> >
> > I'm no SQL expert, but it seems to me that the pg behavior is correct, and
> > the mysql result is just weird.  Am I correct?
> 
> In your case, it sounds like the mysql result is wrong. I believe SQL99
> would allow it if the other columns were functionally dependant upon state
> (as there'd by definition only be one value for the other columns per
> group).

I believe this is a documented feature.

MySQL treats "select a,b from t group by a" equivalently to Postgres's 
"select distinct on (a) a,b from t"

I suppose "equivalent" isn't quite true. It's more general since it allows
aggregate functions as well. The equivalently general Postgres syntax is to
have a first() aggregate function and do "select a,first(b) from t group by a".

I'm sure it's very convenient.

-- 
greg



Re: pg, mysql comparison with "group by" clause

From
Scott Marlowe
Date:
On Wed, 2005-10-12 at 16:54, Greg Stark wrote:
> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> 
> > On Tue, 11 Oct 2005, Rick Schumeyer wrote:
> > 
> > > I'm not sure what I was thinking, but I tried the following query in pg:
> > >
> > > SELECT * FROM t GROUP BY state;
> > >
> > > pg returns an error.
> > >
> > > Mysql, OTOH, returns the first row for each state.  (The first row with
> > > "AK", the first row with "PA", etc.)
> > >
> > > I'm no SQL expert, but it seems to me that the pg behavior is correct, and
> > > the mysql result is just weird.  Am I correct?
> > 
> > In your case, it sounds like the mysql result is wrong. I believe SQL99
> > would allow it if the other columns were functionally dependant upon state
> > (as there'd by definition only be one value for the other columns per
> > group).
> 
> I believe this is a documented feature.

Hehe.  When I turn on my windshield wipers and my airbag deploys, is it
a documented "feature" if the dealership told me about this behaviour
ahead of time?  In much the same way, while this behaviour may be
documented by MySQL, I can't imagine it really being called a feature. 
But at least this misbehaviour is documented.  However, I think most
people in the MySQL universe just stumble onto it by accident when they
try it and it works.  I'd at least prefer it to throw a warning or
notice or something.

> MySQL treats "select a,b from t group by a" equivalently to Postgres's 
> "select distinct on (a) a,b from t"
> 
> I suppose "equivalent" isn't quite true. It's more general since it allows
> aggregate functions as well. The equivalently general Postgres syntax is to
> have a first() aggregate function and do "select a,first(b) from t group by a".

A Subselect would let you do such a thing as well, and while it's more
complicated to write, it is likely to be easier to tell just what it's
doing.

> I'm sure it's very convenient.

Not always, but I'd rather get the right answer with difficulty than the
wrong one with ease. :)


Re: pg, mysql comparison with "group by" clause

From
Greg Stark
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:

> Hehe.  When I turn on my windshield wipers and my airbag deploys, is it
> a documented "feature" if the dealership told me about this behaviour
> ahead of time?  

Well it's more like my car where the dashboard dims when I turn on my
headlights which annoys me to no end since I learned to always put my
headlights on even in the day.

> In much the same way, while this behaviour may be documented by MySQL, I
> can't imagine it really being called a feature. But at least this
> misbehaviour is documented. However, I think most people in the MySQL
> universe just stumble onto it by accident when they try it and it works. I'd
> at least prefer it to throw a warning or notice or something.

I don't see why you think people stumble on this by accident. I think it's
actually an extremely common need. So common that Postgres has the same
feature (though less general) and invented a whole syntax to handle it.

I think most MySQL users don't stumble on it, they learn it as the way to
handle the common use case when you join a master table against a detail table
and then want to aggregate all the detail records. In standard SQL you have to
write GROUP BY ... and list every single column you need from the master
table. Forcing the database to do a lot of redundant comparisons and sort on
uselessly long keys where in fact you only really need it to sort and group by
the primary key.

Remember, most MySQL users learn MySQL first, and only later learn what is
standard SQL and what isn't. 

> A Subselect would let you do such a thing as well, and while it's more
> complicated to write, it is likely to be easier to tell just what it's
> doing.

Subselects have their own problems here. Mainly Postgres's optimizer, as good
as it is, doesn't treat them with the same code paths as joins and can't find
all the same plans for them. But in any case you cannot always write a
subselect that's equivalent to an arbitrary join.


-- 
greg



Re: pg, mysql comparison with "group by" clause

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I think most MySQL users don't stumble on it, they learn it as the way
> to handle the common use case when you join a master table against a
> detail table and then want to aggregate all the detail records. In
> standard SQL you have to write GROUP BY ... and list every single
> column you need from the master table. Forcing the database to do a
> lot of redundant comparisons and sort on uselessly long keys where in
> fact you only really need it to sort and group by the primary key.

Actually, if you're grouping by a table's primary key, the SQL99 spec
says you shouldn't have to explicitly list the other columns from that
table --- they are guaranteed to have unique values per group anyway.
This is a single case in the "functional dependency" stuff.  That
verbiage is incredibly dense and I don't think we want to tackle all of
it any time soon, but the primary-key case probably wouldn't be very
hard to implement.  We really ought to have this in TODO ... I'm sure
it's been discussed before.
        regards, tom lane


Re: pg, mysql comparison with "group by" clause

From
Scott Marlowe
Date:
On Wed, 2005-10-12 at 20:13, Greg Stark wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> 
> > Hehe.  When I turn on my windshield wipers and my airbag deploys, is it
> > a documented "feature" if the dealership told me about this behaviour
> > ahead of time?  
> 
> Well it's more like my car where the dashboard dims when I turn on my
> headlights which annoys me to no end since I learned to always put my
> headlights on even in the day.

Sorry, but it's worse than that.  It is quite possible that two people
could run this query at the same time and get different data from the
same set and the same point in time.  That shouldn't happen accidentally
in SQL, you should know it's coming.

But it's not the same as the air bag deploying, it's like a different
random part of my car activates / deactivates each time.  The horn, a
turn indicator, the trunk opens.  

> > In much the same way, while this behaviour may be documented by MySQL, I
> > can't imagine it really being called a feature. But at least this
> > misbehaviour is documented. However, I think most people in the MySQL
> > universe just stumble onto it by accident when they try it and it works. I'd
> > at least prefer it to throw a warning or notice or something.
> 
> I don't see why you think people stumble on this by accident. I think it's
> actually an extremely common need. So common that Postgres has the same
> feature (though less general) and invented a whole syntax to handle it.

Because I answer a boatload of questions on phpbuilder, where there are
tons of MySQL and PostgreSQL beginners who learn by stumbling around in
their database of choice.

Most MySQL users think that the select a,b,c from table group by a is a
valid query, and don't even realize that they are getting theoretically
different results each time.  It's one of those many things they learn
wrong on MySQL and have to unlearn everywhere else.

They didn't go looking for this behaviour, and almost none of them
realized when they were doing it that they could get a different answer
each time.

> I think most MySQL users don't stumble on it, they learn it as the way to
> handle the common use case when you join a master table against a detail table
> and then want to aggregate all the detail records.

This isn't really common sense all the time though.  It's more about the
law of unintended consequences.  People write these queries, and never
realize that they are actually random responses coming back.  And if
they aren't random responses, then their data likely isn't normalized.

> In standard SQL you have to
> write GROUP BY ... and list every single column you need from the master
> table. Forcing the database to do a lot of redundant comparisons and sort on
> uselessly long keys where in fact you only really need it to sort and group by
> the primary key.

But again, you're getting whatever row the database feels like giving
you.  A use of a simple, stupid aggregate like an any() aggregate would
be fine here, and wouldn't require a lot of overhead, and would meet the
SQL spec.

The real reason this thing exists today and not an any() aggregate or
some equivalent in MySQL is because of all the legacy code using the
messed up group by syntax.  It's hard to change that kind of stuff when
you've got a lot of market share to hold on to.

> Remember, most MySQL users learn MySQL first, and only later learn what is
> standard SQL and what isn't. 

Hehe.  I'll never forget, remember, I answer questions on databases on
phpbuilder.  I'm always amazed at how many folks come there who are just
starting out and making the same mistakes we all made when beginning.

> > A Subselect would let you do such a thing as well, and while it's more
> > complicated to write, it is likely to be easier to tell just what it's
> > doing.
> 
> Subselects have their own problems here. Mainly Postgres's optimizer, as good
> as it is, doesn't treat them with the same code paths as joins and can't find
> all the same plans for them. But in any case you cannot always write a
> subselect that's equivalent to an arbitrary join.

Actually, for things like aggregates, I've often been able to improve
performance with sub selects in PostgreSQL.  Although, back in the 7.2
days it was still pretty pokey at that kind of stuff.  


Re: pg, mysql comparison with "group by" clause

From
Greg Stark
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:

> Sorry, but it's worse than that.  It is quite possible that two people
> could run this query at the same time and get different data from the
> same set and the same point in time.  That shouldn't happen accidentally
> in SQL, you should know it's coming.

I'm pretty unsympathetic to the "we should make a language less powerful and
more awkward because someone might use it wrong" argument.

> > In standard SQL you have to
> > write GROUP BY ... and list every single column you need from the master
> > table. Forcing the database to do a lot of redundant comparisons and sort on
> > uselessly long keys where in fact you only really need it to sort and group by
> > the primary key.
> 
> But again, you're getting whatever row the database feels like giving
> you.  A use of a simple, stupid aggregate like an any() aggregate would
> be fine here, and wouldn't require a lot of overhead, and would meet the
> SQL spec.

Great, so I have a user table with, oh, say, 40 columns. And I want to return
all those columns plus their current account balance in a single query.

The syntax under discussion would be:

select user.*, sum(money) from user join user_money using (user_id) group by user_id

You would prefer:

select user_id,       any(username) as username, any(firstname) as firstname,       any(lastname) as lastname,
any(address)as address,      any(city) as city, any(street) as street, any(phone) as phone,      any(last_update) as
last_update,any(last_login) as last_login,      any(referrer_id) as referrer_id, any(register_date) as register_date,
  ...      sum(money) as balance,      count(money) as num_txns from user join user_money using (user_id) group by
user_id


Having a safeties is fine but when I have to disengage the safety for every
single column it starts to get more than a little annoying. 

Note that you cannot write the above as a subquery since there are two
aggregates. You could write it as a join against a view but don't expect to
get the same plans from Postgres for that.


> Actually, for things like aggregates, I've often been able to improve
> performance with sub selects in PostgreSQL.  

If your experience is like mine it's a case of two wrongs cancelling each
other out. The optimizer underestimates the efficiency of nested loops which
is another problem. Since subqueries' only eligible plan is basically a nested
loop it often turns out to be faster than the more exotic plans a join can
reach.

In an ideal world subqueries would be transformed into the equivalent join (or
some more general join structure that can cover both sets of semantics) and
then planned through the same code path. In an ideal world the user should be
guaranteed that equivalent queries would always result in the same plan
regardless of how they're written.

-- 
greg



Re: pg, mysql comparison with "group by" clause

From
Scott Marlowe
Date:
On Thu, 2005-10-13 at 13:26, Greg Stark wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> 
> > Sorry, but it's worse than that.  It is quite possible that two people
> > could run this query at the same time and get different data from the
> > same set and the same point in time.  That shouldn't happen accidentally
> > in SQL, you should know it's coming.
> 
> I'm pretty unsympathetic to the "we should make a language less powerful and
> more awkward because someone might use it wrong" argument.

I'm in favor of getting the right answer.  If my database can't do that,
then it's not worth the bits to download it.  But I'm funny that way.

> > > In standard SQL you have to
> > > write GROUP BY ... and list every single column you need from the master
> > > table. Forcing the database to do a lot of redundant comparisons and sort on
> > > uselessly long keys where in fact you only really need it to sort and group by
> > > the primary key.
> > 
> > But again, you're getting whatever row the database feels like giving
> > you.  A use of a simple, stupid aggregate like an any() aggregate would
> > be fine here, and wouldn't require a lot of overhead, and would meet the
> > SQL spec.
> 
> Great, so I have a user table with, oh, say, 40 columns. And I want to return
> all those columns plus their current account balance in a single query.
> 
> The syntax under discussion would be:
> 
> select user.*, sum(money) from user join user_money using (user_id) group by user_id

> You would prefer:
> 
> select user_id, 
>        any(username) as username, any(firstname) as firstname, 
>        any(lastname) as lastname, any(address) as address,
>        any(city) as city, any(street) as street, any(phone) as phone,
>        any(last_update) as last_update, any(last_login) as last_login,
>        any(referrer_id) as referrer_id, any(register_date) as register_date,
>        ...
>        sum(money) as balance,
>        count(money) as num_txns
>   from user join user_money using (user_id) group by user_id

I's select the SINGLE entries from a child table that matched the parent
id and add the sum(money) to it.  Then, there'd be no need for aggregate
functions on those fields, or inaccurate / possibly random data.

But I'm funny that way.

> Having a safeties is fine but when I have to disengage the safety for every
> single column it starts to get more than a little annoying. 
> 
> Note that you cannot write the above as a subquery since there are two
> aggregates. You could write it as a join against a view but don't expect to
> get the same plans from Postgres for that.

I'd just write is a big join.  Again, getting the right answer is
important to me.


Re: pg, mysql comparison with "group by" clause

From
Andrew Sullivan
Date:
On Thu, Oct 13, 2005 at 02:26:58PM -0400, Greg Stark wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > could run this query at the same time and get different data from the
> > same set and the same point in time. 
> 
> I'm pretty unsympathetic to the "we should make a language less powerful and
> more awkward because someone might use it wrong" argument.

That's not what Scott's saying.  Scott is saying that the syntax
you're talking about is _formally wrong_.  That's surely not "more
powerful", except in the sense that stepping on a land mine is more
powerful than many other ways you could shoot yourself in the foot.  

> path. In an ideal world the user should be guaranteed that
> equivalent queries would always result in the same plan regardless
> of how they're written.

And again, I say it sounds like you're actually arguing for "the
optimiser needs to get better".  Special-purpose, formally wrong
syntax is surely not better than making the optimiser get the right
syntax right every time, is it?

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
When my information changes, I alter my conclusions.  What do you do sir?    --attr. John Maynard Keynes


Re: pg, mysql comparison with "group by" clause

From
Tom Lane
Date:
>>>> In standard SQL you have to
>>>> write GROUP BY ... and list every single column you need from the master
>>>> table.

This thread seems to have gone off on a tangent that depends on the
assumption that the above is a correct statement.  It's not.  It *was*
true, in SQL92, but SQL99 lets you omit unnecessary GROUP BY columns.

The gripe against mysql, I think, is that they don't enforce the
conditions that guarantee the query will give a unique result.

The gripe against postgres is that we haven't implemented the SQL99
semantics yet.
        regards, tom lane