Thread: question re. count, group by, and having
<div class="Section1"><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">The following query returns an error (“column c does not exist”) in pg 8.0.3:</span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">(The column ‘state’ is the two letter abbreviation for a </span></font><font face="Arial" size="2"><spanstyle="font-size:10.0pt;font-family:Arial">US</span></font><font face="Arial" size="2"><span style="font-size:10.0pt;font-family:Arial">state)</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">-- get the number of rows for each state; list in descending order; include only states with at least6 rows</span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">select state, count(state) as c from t group by state having c > 5 order by c desc; -- gives error</span></font><pclass="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">If I leave the having clause out, I get the expected results:</span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">select state, count(state) as c from t group by state order by c desc; -- this works</span></font><p class="MsoNormal"><fontface="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">Is this a bug or a feature? I’m not sure why I can use ‘c’ in the order by clause but not the havingclause. pg is much happier with the full “having count(state) > 5”. Will this cause count to be evaluated twice?</span></font><pclass="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial">If it matters, state is varchar(2).</span></font><p class="MsoNormal"><font face="Arial" size="2"><spanstyle="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font><p class="MsoNormal"><font face="Arial" size="2"><span style="font-size:10.0pt; font-family:Arial"> </span></font></div>
On 10/11/05 8:50 AM, "Rick Schumeyer" <rschumeyer@ieee.org> wrote: > The following query returns an error ("column c does not exist") in pg 8.0.3: > > > > (The column 'state' is the two letter abbreviation for a US state) > > > > -- get the number of rows for each state; list in descending order; include > only states with at least 6 rows > > select state, count(state) as c from t group by state having c > 5 order by c > desc; -- gives error > > > > If I leave the having clause out, I get the expected results: > > > > select state, count(state) as c from t group by state order by c desc; -- this > works > > > > Is this a bug or a feature? I'm not sure why I can use 'c' in the order by > clause but not the having clause. pg is much happier with the full "having > count(state) > 5". Will this cause count to be evaluated twice? I think that postgres is smart enough to do the evaluation only once, but this might be version-dependent, but one of the gurus will have to comment on which version (if there is a version dependence) first made this improvement. Sean
"Rick Schumeyer" <rschumeyer@ieee.org> writes: > Is this a bug or a feature? I'm not sure why I can use 'c' in the order by > clause but not the having clause. pg is much happier with the full "having > count(state) > 5". Actually, referring to any of the output columns in any of the modifier clauses is logically suspect. Original SQL (back around 89 or so) required ORDER BY items to be output column names, thus wiring in an assumption that sorting happens after calculation of the output values, but that is surely not true for any of the other clauses. And it's pretty bogus even for sorting, since you might wish to sort on a value you're not displaying. If we were working in a green field we'd doubtless get rid of the output-column-reference feature entirely. But for backward compatibility's sake we're stuck with allowing ORDER BY items to be simple output column names, per ancient versions of the SQL spec. At one point or another somebody thought it a good idea to propagate that special rule into GROUP BY; which in hindsight was an awful idea. (It's not in the spec. I'm not sure if this is just a Postgres-ism or if we borrowed someone else's bad idea.) But we're stuck with supporting that odd case too, now. We certainly aren't going to add more. > Will this cause count to be evaluated twice? Recent versions of PG are smart enough to merge duplicate aggregates. This isn't necessarily true for other forms of common subexpressions, but it works for aggregate functions. regards, tom lane
I'm not sure what I was thinking, but I tried the following query in pg: SELECT * FROM t GROUP BY state; pg returns an error. Mysql, OTOH, returns the first row for each state. (The first row with "AK", the first row with "PA", etc.) I'm no SQL expert, but it seems to me that the pg behavior is correct, and the mysql result is just weird. Am I correct?
On Tue, 2005-10-11 at 16:12, Rick Schumeyer wrote: > I'm not sure what I was thinking, but I tried the following query in pg: > > SELECT * FROM t GROUP BY state; > > pg returns an error. > > Mysql, OTOH, returns the first row for each state. (The first row with > "AK", the first row with "PA", etc.) > > I'm no SQL expert, but it seems to me that the pg behavior is correct, and > the mysql result is just weird. Am I correct? Yes, you are correct. The SQL standard is quite clear that in order to appear in the select list, an entry must either be in the group by or be operated upon by an aggregate function. PostgreSQL supports this same action by way of the non-standard select distinct on(<fieldlist>), <fieldlist> from .... Since you don't know for sure which answer you'll get each time, it's better to KNOW you're doing something that may not be reproduceable than to accidentally do it when your database SHOULD be throwing an error. That's just one of many many things MySQL does that makes my head hurt. For more, search google for "mysql gotchas"
Rick Schumeyer <rschumeyer@ieee.org> wrote:
The following query returns an error (column c does not exist) in pg 8.0.3:
(The column state is the two letter abbreviation for a US state)
-- get the number of rows for each state; list in descending order; include only states with at least 6 rows
select state, count(state) as c from t group by state having c > 5 order by c desc; -- gives error
If I leave the having clause out, I get the expected results:
select state, count(state) as c from t group by state order by c desc; -- this works
Is this a bug or a feature? Im not sure why I can use c in the order by clause but not the having clause. pg is much happier with the full having count(state) > 5. Will this cause count to be evaluated twice?
If it matters, state is varchar(2).
Yahoo! India Matrimony: Find your partner online.
On Tue, 11 Oct 2005, Rick Schumeyer wrote: > I'm not sure what I was thinking, but I tried the following query in pg: > > SELECT * FROM t GROUP BY state; > > pg returns an error. > > Mysql, OTOH, returns the first row for each state. (The first row with > "AK", the first row with "PA", etc.) > > I'm no SQL expert, but it seems to me that the pg behavior is correct, and > the mysql result is just weird. Am I correct? In your case, it sounds like the mysql result is wrong. I believe SQL99 would allow it if the other columns were functionally dependant upon state (as there'd by definition only be one value for the other columns per group).
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > On Tue, 11 Oct 2005, Rick Schumeyer wrote: > > > I'm not sure what I was thinking, but I tried the following query in pg: > > > > SELECT * FROM t GROUP BY state; > > > > pg returns an error. > > > > Mysql, OTOH, returns the first row for each state. (The first row with > > "AK", the first row with "PA", etc.) > > > > I'm no SQL expert, but it seems to me that the pg behavior is correct, and > > the mysql result is just weird. Am I correct? > > In your case, it sounds like the mysql result is wrong. I believe SQL99 > would allow it if the other columns were functionally dependant upon state > (as there'd by definition only be one value for the other columns per > group). I believe this is a documented feature. MySQL treats "select a,b from t group by a" equivalently to Postgres's "select distinct on (a) a,b from t" I suppose "equivalent" isn't quite true. It's more general since it allows aggregate functions as well. The equivalently general Postgres syntax is to have a first() aggregate function and do "select a,first(b) from t group by a". I'm sure it's very convenient. -- greg
On Wed, 2005-10-12 at 16:54, Greg Stark wrote: > Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > > > On Tue, 11 Oct 2005, Rick Schumeyer wrote: > > > > > I'm not sure what I was thinking, but I tried the following query in pg: > > > > > > SELECT * FROM t GROUP BY state; > > > > > > pg returns an error. > > > > > > Mysql, OTOH, returns the first row for each state. (The first row with > > > "AK", the first row with "PA", etc.) > > > > > > I'm no SQL expert, but it seems to me that the pg behavior is correct, and > > > the mysql result is just weird. Am I correct? > > > > In your case, it sounds like the mysql result is wrong. I believe SQL99 > > would allow it if the other columns were functionally dependant upon state > > (as there'd by definition only be one value for the other columns per > > group). > > I believe this is a documented feature. Hehe. When I turn on my windshield wipers and my airbag deploys, is it a documented "feature" if the dealership told me about this behaviour ahead of time? In much the same way, while this behaviour may be documented by MySQL, I can't imagine it really being called a feature. But at least this misbehaviour is documented. However, I think most people in the MySQL universe just stumble onto it by accident when they try it and it works. I'd at least prefer it to throw a warning or notice or something. > MySQL treats "select a,b from t group by a" equivalently to Postgres's > "select distinct on (a) a,b from t" > > I suppose "equivalent" isn't quite true. It's more general since it allows > aggregate functions as well. The equivalently general Postgres syntax is to > have a first() aggregate function and do "select a,first(b) from t group by a". A Subselect would let you do such a thing as well, and while it's more complicated to write, it is likely to be easier to tell just what it's doing. > I'm sure it's very convenient. Not always, but I'd rather get the right answer with difficulty than the wrong one with ease. :)
Scott Marlowe <smarlowe@g2switchworks.com> writes: > Hehe. When I turn on my windshield wipers and my airbag deploys, is it > a documented "feature" if the dealership told me about this behaviour > ahead of time? Well it's more like my car where the dashboard dims when I turn on my headlights which annoys me to no end since I learned to always put my headlights on even in the day. > In much the same way, while this behaviour may be documented by MySQL, I > can't imagine it really being called a feature. But at least this > misbehaviour is documented. However, I think most people in the MySQL > universe just stumble onto it by accident when they try it and it works. I'd > at least prefer it to throw a warning or notice or something. I don't see why you think people stumble on this by accident. I think it's actually an extremely common need. So common that Postgres has the same feature (though less general) and invented a whole syntax to handle it. I think most MySQL users don't stumble on it, they learn it as the way to handle the common use case when you join a master table against a detail table and then want to aggregate all the detail records. In standard SQL you have to write GROUP BY ... and list every single column you need from the master table. Forcing the database to do a lot of redundant comparisons and sort on uselessly long keys where in fact you only really need it to sort and group by the primary key. Remember, most MySQL users learn MySQL first, and only later learn what is standard SQL and what isn't. > A Subselect would let you do such a thing as well, and while it's more > complicated to write, it is likely to be easier to tell just what it's > doing. Subselects have their own problems here. Mainly Postgres's optimizer, as good as it is, doesn't treat them with the same code paths as joins and can't find all the same plans for them. But in any case you cannot always write a subselect that's equivalent to an arbitrary join. -- greg
Greg Stark <gsstark@mit.edu> writes: > I think most MySQL users don't stumble on it, they learn it as the way > to handle the common use case when you join a master table against a > detail table and then want to aggregate all the detail records. In > standard SQL you have to write GROUP BY ... and list every single > column you need from the master table. Forcing the database to do a > lot of redundant comparisons and sort on uselessly long keys where in > fact you only really need it to sort and group by the primary key. Actually, if you're grouping by a table's primary key, the SQL99 spec says you shouldn't have to explicitly list the other columns from that table --- they are guaranteed to have unique values per group anyway. This is a single case in the "functional dependency" stuff. That verbiage is incredibly dense and I don't think we want to tackle all of it any time soon, but the primary-key case probably wouldn't be very hard to implement. We really ought to have this in TODO ... I'm sure it's been discussed before. regards, tom lane
On Wed, 2005-10-12 at 20:13, Greg Stark wrote: > Scott Marlowe <smarlowe@g2switchworks.com> writes: > > > Hehe. When I turn on my windshield wipers and my airbag deploys, is it > > a documented "feature" if the dealership told me about this behaviour > > ahead of time? > > Well it's more like my car where the dashboard dims when I turn on my > headlights which annoys me to no end since I learned to always put my > headlights on even in the day. Sorry, but it's worse than that. It is quite possible that two people could run this query at the same time and get different data from the same set and the same point in time. That shouldn't happen accidentally in SQL, you should know it's coming. But it's not the same as the air bag deploying, it's like a different random part of my car activates / deactivates each time. The horn, a turn indicator, the trunk opens. > > In much the same way, while this behaviour may be documented by MySQL, I > > can't imagine it really being called a feature. But at least this > > misbehaviour is documented. However, I think most people in the MySQL > > universe just stumble onto it by accident when they try it and it works. I'd > > at least prefer it to throw a warning or notice or something. > > I don't see why you think people stumble on this by accident. I think it's > actually an extremely common need. So common that Postgres has the same > feature (though less general) and invented a whole syntax to handle it. Because I answer a boatload of questions on phpbuilder, where there are tons of MySQL and PostgreSQL beginners who learn by stumbling around in their database of choice. Most MySQL users think that the select a,b,c from table group by a is a valid query, and don't even realize that they are getting theoretically different results each time. It's one of those many things they learn wrong on MySQL and have to unlearn everywhere else. They didn't go looking for this behaviour, and almost none of them realized when they were doing it that they could get a different answer each time. > I think most MySQL users don't stumble on it, they learn it as the way to > handle the common use case when you join a master table against a detail table > and then want to aggregate all the detail records. This isn't really common sense all the time though. It's more about the law of unintended consequences. People write these queries, and never realize that they are actually random responses coming back. And if they aren't random responses, then their data likely isn't normalized. > In standard SQL you have to > write GROUP BY ... and list every single column you need from the master > table. Forcing the database to do a lot of redundant comparisons and sort on > uselessly long keys where in fact you only really need it to sort and group by > the primary key. But again, you're getting whatever row the database feels like giving you. A use of a simple, stupid aggregate like an any() aggregate would be fine here, and wouldn't require a lot of overhead, and would meet the SQL spec. The real reason this thing exists today and not an any() aggregate or some equivalent in MySQL is because of all the legacy code using the messed up group by syntax. It's hard to change that kind of stuff when you've got a lot of market share to hold on to. > Remember, most MySQL users learn MySQL first, and only later learn what is > standard SQL and what isn't. Hehe. I'll never forget, remember, I answer questions on databases on phpbuilder. I'm always amazed at how many folks come there who are just starting out and making the same mistakes we all made when beginning. > > A Subselect would let you do such a thing as well, and while it's more > > complicated to write, it is likely to be easier to tell just what it's > > doing. > > Subselects have their own problems here. Mainly Postgres's optimizer, as good > as it is, doesn't treat them with the same code paths as joins and can't find > all the same plans for them. But in any case you cannot always write a > subselect that's equivalent to an arbitrary join. Actually, for things like aggregates, I've often been able to improve performance with sub selects in PostgreSQL. Although, back in the 7.2 days it was still pretty pokey at that kind of stuff.
Scott Marlowe <smarlowe@g2switchworks.com> writes: > Sorry, but it's worse than that. It is quite possible that two people > could run this query at the same time and get different data from the > same set and the same point in time. That shouldn't happen accidentally > in SQL, you should know it's coming. I'm pretty unsympathetic to the "we should make a language less powerful and more awkward because someone might use it wrong" argument. > > In standard SQL you have to > > write GROUP BY ... and list every single column you need from the master > > table. Forcing the database to do a lot of redundant comparisons and sort on > > uselessly long keys where in fact you only really need it to sort and group by > > the primary key. > > But again, you're getting whatever row the database feels like giving > you. A use of a simple, stupid aggregate like an any() aggregate would > be fine here, and wouldn't require a lot of overhead, and would meet the > SQL spec. Great, so I have a user table with, oh, say, 40 columns. And I want to return all those columns plus their current account balance in a single query. The syntax under discussion would be: select user.*, sum(money) from user join user_money using (user_id) group by user_id You would prefer: select user_id, any(username) as username, any(firstname) as firstname, any(lastname) as lastname, any(address)as address, any(city) as city, any(street) as street, any(phone) as phone, any(last_update) as last_update,any(last_login) as last_login, any(referrer_id) as referrer_id, any(register_date) as register_date, ... sum(money) as balance, count(money) as num_txns from user join user_money using (user_id) group by user_id Having a safeties is fine but when I have to disengage the safety for every single column it starts to get more than a little annoying. Note that you cannot write the above as a subquery since there are two aggregates. You could write it as a join against a view but don't expect to get the same plans from Postgres for that. > Actually, for things like aggregates, I've often been able to improve > performance with sub selects in PostgreSQL. If your experience is like mine it's a case of two wrongs cancelling each other out. The optimizer underestimates the efficiency of nested loops which is another problem. Since subqueries' only eligible plan is basically a nested loop it often turns out to be faster than the more exotic plans a join can reach. In an ideal world subqueries would be transformed into the equivalent join (or some more general join structure that can cover both sets of semantics) and then planned through the same code path. In an ideal world the user should be guaranteed that equivalent queries would always result in the same plan regardless of how they're written. -- greg
On Thu, 2005-10-13 at 13:26, Greg Stark wrote: > Scott Marlowe <smarlowe@g2switchworks.com> writes: > > > Sorry, but it's worse than that. It is quite possible that two people > > could run this query at the same time and get different data from the > > same set and the same point in time. That shouldn't happen accidentally > > in SQL, you should know it's coming. > > I'm pretty unsympathetic to the "we should make a language less powerful and > more awkward because someone might use it wrong" argument. I'm in favor of getting the right answer. If my database can't do that, then it's not worth the bits to download it. But I'm funny that way. > > > In standard SQL you have to > > > write GROUP BY ... and list every single column you need from the master > > > table. Forcing the database to do a lot of redundant comparisons and sort on > > > uselessly long keys where in fact you only really need it to sort and group by > > > the primary key. > > > > But again, you're getting whatever row the database feels like giving > > you. A use of a simple, stupid aggregate like an any() aggregate would > > be fine here, and wouldn't require a lot of overhead, and would meet the > > SQL spec. > > Great, so I have a user table with, oh, say, 40 columns. And I want to return > all those columns plus their current account balance in a single query. > > The syntax under discussion would be: > > select user.*, sum(money) from user join user_money using (user_id) group by user_id > You would prefer: > > select user_id, > any(username) as username, any(firstname) as firstname, > any(lastname) as lastname, any(address) as address, > any(city) as city, any(street) as street, any(phone) as phone, > any(last_update) as last_update, any(last_login) as last_login, > any(referrer_id) as referrer_id, any(register_date) as register_date, > ... > sum(money) as balance, > count(money) as num_txns > from user join user_money using (user_id) group by user_id I's select the SINGLE entries from a child table that matched the parent id and add the sum(money) to it. Then, there'd be no need for aggregate functions on those fields, or inaccurate / possibly random data. But I'm funny that way. > Having a safeties is fine but when I have to disengage the safety for every > single column it starts to get more than a little annoying. > > Note that you cannot write the above as a subquery since there are two > aggregates. You could write it as a join against a view but don't expect to > get the same plans from Postgres for that. I'd just write is a big join. Again, getting the right answer is important to me.
On Thu, Oct 13, 2005 at 02:26:58PM -0400, Greg Stark wrote: > Scott Marlowe <smarlowe@g2switchworks.com> writes: > > could run this query at the same time and get different data from the > > same set and the same point in time. > > I'm pretty unsympathetic to the "we should make a language less powerful and > more awkward because someone might use it wrong" argument. That's not what Scott's saying. Scott is saying that the syntax you're talking about is _formally wrong_. That's surely not "more powerful", except in the sense that stepping on a land mine is more powerful than many other ways you could shoot yourself in the foot. > path. In an ideal world the user should be guaranteed that > equivalent queries would always result in the same plan regardless > of how they're written. And again, I say it sounds like you're actually arguing for "the optimiser needs to get better". Special-purpose, formally wrong syntax is surely not better than making the optimiser get the right syntax right every time, is it? A -- Andrew Sullivan | ajs@crankycanuck.ca When my information changes, I alter my conclusions. What do you do sir? --attr. John Maynard Keynes
>>>> In standard SQL you have to >>>> write GROUP BY ... and list every single column you need from the master >>>> table. This thread seems to have gone off on a tangent that depends on the assumption that the above is a correct statement. It's not. It *was* true, in SQL92, but SQL99 lets you omit unnecessary GROUP BY columns. The gripe against mysql, I think, is that they don't enforce the conditions that guarantee the query will give a unique result. The gripe against postgres is that we haven't implemented the SQL99 semantics yet. regards, tom lane