Re: pg, mysql comparison with "group by" clause - Mailing list pgsql-sql

From Anthony Molinaro
Subject Re: pg, mysql comparison with "group by" clause
Date
Msg-id 3C6C2B281FD3E74C9F7C9D5B1EDA4582182615@wgexch01.wgenhq.net
In response to pg, mysql comparison with "group by" clause  ("Rick Schumeyer" <rschumeyer@ieee.org>)
List pgsql-sql
Greg,

Ok, I think I see where you're going (I don't agree, but I think
I get you now).

So, using your example of:
"dept_name is guaranteed to be the same for all records with the
same dept_id."

Here:

select d.deptno,d.dname
  from emp e, dept d
 where e.deptno=d.deptno

DEPTNO DNAME
------ --------------
    10 ACCOUNTING
    10 ACCOUNTING
    10 ACCOUNTING
    20 RESEARCH
    20 RESEARCH
    20 RESEARCH
    20 RESEARCH
    20 RESEARCH
    30 SALES
    30 SALES
    30 SALES
    30 SALES
    30 SALES
    30 SALES

ok, so there's your scenario.

And you're suggesting that one should be able to
run the following query?

select d.deptno,d.dname,count(*)
  from emp e, dept d
 where e.deptno=d.deptno


if that's what you suggest, then we'll just have to agree to disagree.

That query needs a group by. What you're suggesting is, imo,
a wholly unnecessary shortcut (almost as bad as that ridiculous "natural
join" - whoever came up with that should be tarred and feathered).

I think I see your point now, I just disagree.
You're depending on syntax to work based on data integrity?
Hmmm.... don't think I like that idea.
What performance improvement do you get from leaving group by out?
Look at the query above: doesn't a count of distinct deptno,dname pairs
have to take place anyway? What do you save by excluding group by?
Are you suggesting COUNT be computed for each row (windowed) or that
COUNT is computed for each group?

If you want repeating rows, then you want windowing.
For example:

select d.deptno,d.dname,
       count(*) over (partition by d.deptno,d.dname) cnt
  from emp e, dept d
 where e.deptno=d.deptno

DEPTNO DNAME          CNT
------ -------------- ---
    10 ACCOUNTING       3
    10 ACCOUNTING       3
    10 ACCOUNTING       3
    20 RESEARCH         5
    20 RESEARCH         5
    20 RESEARCH         5
    20 RESEARCH         5
    20 RESEARCH         5
    30 SALES            6
    30 SALES            6
    30 SALES            6
    30 SALES            6
    30 SALES            6
    30 SALES            6



if you want "groups", then use group by:

select d.deptno,d.dname,count(*) cnt
  from emp e, dept d
 where e.deptno=d.deptno
 group by d.deptno,d.dname

DEPTNO DNAME          CNT
------ -------------- ---
    10 ACCOUNTING       3
    20 RESEARCH         5
    30 SALES            6
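
To make the windowed and grouped forms easy to try side by side, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration: the table definitions and empno values are assumptions made up to reproduce the 3/5/6 headcounts above, and it needs an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dept (deptno integer primary key, dname text);
    create table emp  (empno integer primary key,
                       deptno integer references dept);
    insert into dept values (10,'ACCOUNTING'),(20,'RESEARCH'),(30,'SALES');
""")

# 3, 5 and 6 employees per department, matching the counts in the thread.
# The empno values are made up for this sketch.
headcount = {10: 3, 20: 5, 30: 6}
empno = 1
for deptno, n in headcount.items():
    for _ in range(n):
        conn.execute("insert into emp values (?, ?)", (empno, deptno))
        empno += 1

# Windowed: one output row per joined row, each carrying its partition count.
windowed = conn.execute("""
    select d.deptno, d.dname,
           count(*) over (partition by d.deptno, d.dname) cnt
      from emp e, dept d
     where e.deptno = d.deptno
     order by d.deptno
""").fetchall()

# Grouped: one output row per (deptno, dname) group.
grouped = conn.execute("""
    select d.deptno, d.dname, count(*) cnt
      from emp e, dept d
     where e.deptno = d.deptno
     group by d.deptno, d.dname
     order by d.deptno
""").fetchall()

print(len(windowed), "windowed rows")
for row in grouped:
    print(row)
```

The windowed query keeps all 14 joined rows, while the grouped query collapses them to three; the distinct windowed rows are the same as the grouped ones.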


what you're suggesting doesn't seem to fit in at all,
particularly when pg implements window functions.

If you're suggesting the pg optimizer isn't doing the right thing
with group by queries, then this is an optimizer issue and
that should be hacked, not group by. If you're suggesting certain
rows be ditched or shortcuts be taken, then the optimizer should do
that, not the programmer writing sql.

DB2 and Oracle have no problem doing these queries; I don't see
why pg should have a problem.

imo, the only items that should not be listed in the group by
are:

1. constants and deterministic functions
2. scalar subqueries
3. window functions

1 - because the value is same for each row
2&3 - because they are evaluated after the grouping takes place
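
As a rough illustration of the three cases above, the sketch below puts a constant, a scalar subquery, and a window function in the select list of a grouped query while grouping on deptno alone. It runs against SQLite through Python's sqlite3 module (3.25 or later for the window function); the emp table and the column aliases are assumptions made up for the example, and other databases may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table emp (empno integer primary key, deptno integer);
    insert into emp (deptno) values (10),(10),(10),
                                    (20),(20),(20),(20),(20),
                                    (30),(30),(30),(30),(30),(30);
""")

# Only deptno appears in the GROUP BY; the other select items are
# a constant, a scalar subquery, and a window function over the groups.
rows = conn.execute("""
    select deptno,
           'headcount'                as label,      -- 1. constant
           (select count(*) from emp) as total,      -- 2. scalar subquery
           sum(count(*)) over ()      as total_win,  -- 3. window function
           count(*)                   as cnt
      from emp
     group by deptno
     order by deptno
""").fetchall()

for row in rows:
    print(row)
```

Each of the three output rows carries the same label and the same overall total of 14, next to its own per-department count.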

regards, Anthony

-----Original Message-----
From: gsstark@mit.edu [mailto:gsstark@mit.edu]
Sent: Thursday, October 13, 2005 12:25 AM
To: Anthony Molinaro
Cc: gsstark@mit.edu; Tom Lane; Scott Marlowe; Stephan Szabo; Rick
Schumeyer; pgsql-sql@postgresql.org
Subject: Re: [SQL] pg, mysql comparison with "group by" clause

"Anthony Molinaro" <amolinaro@wgen.net> writes:

> Greg,
>   You'll have to pardon me...
>
> I saw this comment:
>
> "I don't see why you think people stumble on this by accident.
> I think it's actually an extremely common need."
>
> Which, if referring to the ability to have items in the select that do not
> need to be included in the group, (excluding constants and the like) is just
> silly.

Well the "constants and the like" are precisely the point. There are plenty of
cases where adding the column to the GROUP BY is unnecessary and since
Postgres makes no attempt to prune them out, inefficient. And constants aren't
the only such case. The most common case is columns that are coming from a
table where the primary key is already included in the GROUP BY list.

In the case of columns coming from a table where the primary key is already in
the GROUP BY list it's possible for the database to deduce that it's
unnecessary to group on that column.

But it's also possible to have cases where the programmer has out of band
knowledge that it's unnecessary but the database doesn't have that knowledge.
The most obvious case that comes to mind is a denormalized data model that
includes a redundant column.
 select dept_id, dept_name, count(*) from employee_list

For example if dept_name is guaranteed to be the same for all records with the
same dept_id. Of course that's generally considered poor design but it doesn't
mean there aren't thousands of databases out there with data models like that.
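
A small sketch of the denormalized case, run against SQLite through Python's sqlite3 module. The employee_list rows and the emp_name column are made up for illustration. It compares two portable spellings: listing the redundant column in the GROUP BY, and grouping on dept_id alone with the dependent column wrapped in min().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee_list (emp_name text, dept_id integer, dept_name text);
    insert into employee_list values
        ('ann', 1, 'Sales'), ('bob', 1, 'Sales'), ('cy', 2, 'Research');
""")

# Spelled-out form: the redundant column is listed in the GROUP BY.
explicit = conn.execute("""
    select dept_id, dept_name, count(*)
      from employee_list
     group by dept_id, dept_name
     order by dept_id
""").fetchall()

# Group on dept_id alone; min() satisfies the grouping rule for dept_name.
via_min = conn.execute("""
    select dept_id, min(dept_name), count(*)
      from employee_list
     group by dept_id
     order by dept_id
""").fetchall()

print(explicit)
print(via_min)
```

The two queries return the same rows only as long as dept_name really is the same for every record with a given dept_id; if that guarantee is broken, the first form yields extra groups while the second silently picks one name.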

--
greg


