Thread: column order in GROUP BY
The query optimizer currently does not consider reordering a query's grouping columns. While the order in which ORDER BY columns are specified affects the semantics of the query, AFAICS GROUP BY's column order does not. Reordering a query's grouping columns would allow the optimizer to avoid some unnecessary sorts; for example, given an index on (a, b), we should be able to avoid a sort in this query: SELECT a, b, max(c) FROM t1 GROUP BY b, a; which the optimizer is currently incapable of doing. I think fixing this properly would require teaching the planner that certain PathKeys are unordered, so the planner can pick whichever order is best. That looks like a fairly invasive change: the assumption that PathKeyItems are ordered looks pretty widespread. A simple hack might help with a subset of this problem, though. For queries with both ORDER BY and GROUP BY clauses, we can sort the grouping columns according to their position in the ORDER BY list. So, given a query like: SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b; We can avoid the redundant sort for the ORDER BY by grouping by (b, a) instead. Attached is a proof-of-concept patch that implements this, although it's an enormous kludge. Thoughts? -Neil
Attachment
Neil Conway <neilc@samurai.com> writes: > A simple hack might help with a subset of this problem, though. For > queries with both ORDER BY and GROUP BY clauses, we can sort the > grouping columns according to their position in the ORDER BY list. So, > given a query like: > SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b; > We can avoid the redundant sort for the ORDER BY by grouping by (b, a) > instead. Attached is a proof-of-concept patch that implements this, > although it's an enormous kludge. I think that's the wrong place. transformGroupClause is the right place. It already does some hacking to try to make the GROUP BY semantics match ORDER BY, but it doesn't think to try reordering the GROUP BY items. regards, tom lane
Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: >> A simple hack might help with a subset of this problem, though. For >> queries with both ORDER BY and GROUP BY clauses, we can sort the >> grouping columns according to their position in the ORDER BY list. So, >> given a query like: > >> SELECT a, b, max(c) FROM t1 GROUP BY a, b ORDER BY b; > >> We can avoid the redundant sort for the ORDER BY by grouping by (b, a) >> instead. Attached is a proof-of-concept patch that implements this, >> although it's an enormous kludge. > > I think that's the wrong place. transformGroupClause is the right > place. It already does some hacking to try to make the GROUP BY > semantics match ORDER BY, but it doesn't think to try reordering > the GROUP BY items. Does it also throw out unnecessary columns in the GROUP BY? Like when the GROUP BY contains multiple columns of which one (or a set) already uniquely identifies every row. regards, Lukas
Lukas Smith <smith@pooteeweet.org> writes: > Tom Lane wrote: >> I think that's the wrong place. transformGroupClause is the right >> place. It already does some hacking to try to make the GROUP BY >> semantics match ORDER BY, but it doesn't think to try reordering >> the GROUP BY items. > Does it also throw out unnecessary columns in the GROUP BY? Like when > the GROUP BY contains multiple columns of which one (or a set) already > uniquely identifies every row. No, and it would be quite inappropriate to do that in the parser, since the constraints making it a valid transformation might get dropped before the query is planned/used. It'd be OK to throw out trivial duplicates ("GROUP BY x,x") but I doubt that it's worth the cycles even to try --- if you write a query that stupid you shouldn't complain that it doesn't run efficiently. There's a fairly fundamental point here, which is that the parser is responsible for determining semantics --- in this case, what is the semantic meaning of GROUP BY, in particular which operators should implement it --- and then the planner is responsible for optimization without changing those semantics. Given the system design assumption that GROUP BY is associated with a specific sort ordering, changing the column order is a semantic change and so it's reasonable for the parser to do it. If we got rid of that design assumption then it'd become planner territory, but as Neil observes that's not exactly low-hanging fruit. regards, tom lane