Thread: column order in GROUP BY

column order in GROUP BY

From
Neil Conway
Date:
The query optimizer currently does not consider reordering a query's
grouping columns. While the order in which ORDER BY columns are
specified affects the semantics of the query, AFAICS GROUP BY's column
order does not. Reordering a query's grouping columns would allow the
optimizer to avoid some unnecessary sorts; for example, given an index
on (a, b), we should be able to avoid a sort in this query:

SELECT a, b, max(c) FROM t1 GROUP BY b, a;

which the optimizer is currently incapable of doing.

I think fixing this properly would require teaching the planner that
certain PathKeys are unordered, so the planner can pick whichever order
is best. That looks like a fairly invasive change: the assumption that
PathKeyItems are ordered looks pretty widespread.

A simple hack might help with a subset of this problem, though. For
queries with both ORDER BY and GROUP BY clauses, we can sort the
grouping columns according to their position in the ORDER BY list. So,
given a query like:

SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b;

We can avoid the redundant sort for the ORDER BY by grouping by (b, a)
instead. Attached is a proof-of-concept patch that implements this,
although it's an enormous kludge.

Thoughts?

-Neil


Attachment

Re: column order in GROUP BY

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> A simple hack might help with a subset of this problem, though. For
> queries with both ORDER BY and GROUP BY clauses, we can sort the
> grouping columns according to their position in the ORDER BY list. So,
> given a query like:

> SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b;

> We can avoid the redundant sort for the ORDER BY by grouping by (b, a)
> instead. Attached is a proof-of-concept patch that implements this,
> although it's an enormous kludge.

I think that's the wrong place.  transformGroupClause is the right
place.   It already does some hacking to try to make the GROUP BY
semantics match ORDER BY, but it doesn't think to try reordering
the GROUP BY items.
        regards, tom lane


Re: column order in GROUP BY

From
Lukas Smith
Date:
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>> A simple hack might help with a subset of this problem, though. For
>> queries with both ORDER BY and GROUP BY clauses, we can sort the
>> grouping columns according to their position in the ORDER BY list. So,
>> given a query like:
> 
>> SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b;
> 
>> We can avoid the redundant sort for the ORDER BY by grouping by (b, a)
>> instead. Attached is a proof-of-concept patch that implements this,
>> although it's an enormous kludge.
> 
> I think that's the wrong place.  transformGroupClause is the right
> place.   It already does some hacking to try to make the GROUP BY
> semantics match ORDER BY, but it doesn't think to try reordering
> the GROUP BY items.

Does it also throw out unnecessary columns in the GROUP BY? Like when 
the GROUP BY contains multiple columns of which one (or a set) already 
uniquely identifies every row.

regards,
Lukas


Re: column order in GROUP BY

From
Tom Lane
Date:
Lukas Smith <smith@pooteeweet.org> writes:
> Tom Lane wrote:
>> I think that's the wrong place.  transformGroupClause is the right
>> place.   It already does some hacking to try to make the GROUP BY
>> semantics match ORDER BY, but it doesn't think to try reordering
>> the GROUP BY items.

> Does it also throw out unnecessary columns in the GROUP BY? Like when 
> the GROUP BY contains multiple columns of which one (or a set) already 
> uniquely identifies every row.

No, and it would be quite inappropriate to do that in the parser, since
the constraints making it a valid transformation might get dropped
before the query is planned/used.  It'd be OK to throw out trivial
duplicates ("GROUP BY x,x") but I doubt that it's worth the cycles even
to try --- if you write a query that stupid you shouldn't complain that
it doesn't run efficiently.

There's a fairly fundamental point here, which is that the parser is
responsible for determining semantics --- in this case, what is the
semantic meaning of GROUP BY, in particular which operators should
implement it --- and then the planner is responsible for optimization
without changing those semantics.  Given the system design assumption
that GROUP BY is associated with a specific sort ordering, changing the
column order is a semantic change and so it's reasonable for the parser
to do it.  If we got rid of that design assumption then it'd become
planner territory, but as Neil observes that's not exactly low-hanging
fruit.
        regards, tom lane