WIP patch for consolidating misplaced-aggregate checks - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | WIP patch for consolidating misplaced-aggregate checks |
Date | |
Msg-id | 4164.1344527793@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: WIP patch for consolidating misplaced-aggregate checks
|
List | pgsql-hackers |
I wrote: > I find it fairly annoying though that parseCheckAggregates (and likewise > parseCheckWindowFuncs) have to dig through previously parsed query trees > to look for misplaced aggregates; so adding even more of that is grating > on me. It would be a lot cleaner if transformAggregateCall and > transformWindowFuncCall could throw these errors immediately. The > reason they can't is lack of context about what portion of the query we > are currently parsing. I'm thinking it'd be worthwhile to add an enum > field to ParseState that shows whether we're currently parsing the > associated query level's target list, WHERE clause, GROUP BY clause, > etc. The easiest way to ensure this gets set for all cases should be to > add the enum value as another argument to transformExpr(), which > would then save it into the ParseState for access by subsidiary > expression transformation functions. I've been fooling around with this concept, and attach a draft patch that I'm not terribly satisfied with yet. It's not making the code any shorter, although it holds some promise of being able to do that if we push the concept even further. There are some judgment calls and things that need discussion; in particular I need some input from people who do message translation work. In the attached patch, I invented an EXPR_KIND_XXX enum value for each part of a SELECT/INSERT/UPDATE/DELETE query that we could want to distinguish, but lumped all utility-statement cases into one EXPR_KIND_UTILITY entry. That means the utility-statement callers still all need to do their own checking/reporting of disallowed aggregates and window functions, so that they can produce appropriate messages. I'm tempted to enlarge the set of EXPR_KINDs to include cases for utility statements so that those checks can be moved too; but that would make the enum much more in need of continuing maintenance in the future. On the other hand, if we did that then we could also push the responsibility for rejecting sub-selects (which practically all of those callers also do) into transformSubLink; which seems like a nice thing. At the moment, the patch faithfully preserves (well, 99% preserves) the current spellings of the error messages, so that no regression test entries change. Once all those messages were brought together, it became painfully obvious that we have been less than consistent about how to phrase them: errmsg("aggregates not allowed in JOIN conditions"), errmsg("aggregates not allowed in FROM clause"), errmsg("cannot use aggregate function in function expression in FROM"), errmsg("aggregates not allowed in WHERE clause"), errmsg("argument of RANGE must not contain aggregate functions"), errmsg("argument of ROWS must not contain aggregate functions"), errmsg("cannot use aggregate function in INSERT"), errmsg("cannot use aggregate function in UPDATE"), errmsg("aggregates not allowed in GROUP BY clause"), errmsg("argument of LIMIT must not contain aggregate functions"), errmsg("argument of OFFSET must not contain aggregate functions"), errmsg("cannot use aggregate function in RETURNING"), errmsg("cannot use aggregate function in VALUES"), (and that's not even counting the utility-statement callers, which I've not unified yet). I think obviously this should be fixed, but what phrasing shall we settle on? At the moment I'm leaning to "aggregate functions are not allowed in <context>", which is not exactly like any of these but seems better English. Also, at least for the cases where the <context> is just a SQL keyword, it is mighty tempting to use just one message string, ie "aggregate functions are not allowed in %s". I know that we do this in assorted places, but I've never had a good grasp of exactly how painful that is for translators to deal with. Does anyone better-informed than me want to vote on whether to collapse the multiple messages like that? If we do go with the %s-for-a-SQL-keyword approach, it would then become tempting to force-fit all of the cases into that style. We would lose a certain amount of specificity of the messages if we did that; for example, "function expression in FROM" would probably become just "FROM", and messages that currently mention "index expression" or "index predicate" would probably say "CREATE INDEX". But I think this is all right considering we'd be producing an error cursor pointing at the aggregate or window function, which the utility-statement call sites generally don't produce at the moment. I had hoped that we might be able to get rid of after-the-fact rechecking of parse trees entirely, and thereby get rid of contain_aggs_of_level and locate_agg_of_level; which would go a long way towards making this a net code savings. It turns out that this is basically impossible for the case of detecting nested aggregates, because you have to parse-analyze an aggregate's arguments (so as to determine their datatypes) before you can look up the function and discover that it is indeed an aggregate. The other place where it's problematic is "select max(x) from ... group by 1". That has to be rejected, but at the time you're parsing max(x) you only know it's a targetlist entry, so it looks fine. The GROUP BY code has to go back and recheck when accepting a GROUP BY reference to an existing tlist entry. But I did manage to get rid of parse tree rescans in other places. Also, it turned out that we were searching an aggregate's arguments up to *three* times: once to look for variables to determine its semantic level, once to look for nested aggregates, and once to look for nested window functions. So I got rid of find_minimum_var_level, which never proved to have any other use anyway, and replaced it with a dedicated function check_agg_arguments that does all that work in one pass. So even though this isn't any smaller, it should be a bit faster than the existing code, at least for complex queries. One other thing worth mentioning is that this exercise has so far revealed two bugs: 1. The parser fails to detect nested outer aggregates, for instance (with un-patched code): regression=# select (select max(max(q1))) from int8_tbl; ERROR: aggregate function calls cannot be nested That looks like it's all right, but where's the error cursor? If you turn on VERBOSE you find that actually that's coming out of the planner, which is being paranoid and rechecking for possible nesting. If somebody were to remove that check or reduce it to an Assert, we'd have a problem. Since the backstop is there in existing branches, this doesn't seem to need a back-patch, but if we don't apply some form of the attached patch I think we need to fix this some other way in HEAD. 2. We forgot to add a no-window-functions check to DefineIndex, so the executor fails instead: regression=# create index fooey on int8_tbl (max(q1) over ()); ERROR: WindowFunc found in non-WindowAgg plan node Again, I don't feel a strong need to back-patch a fix, but we'd want to repair this oversight in HEAD. Thoughts, opinions, better ideas? regards, tom lane
Attachment
pgsql-hackers by date: