Thread: Calculation of unused columns
Dear PostgreSQL developers, I'm confused about the absence of a very simple optimization in PostgreSQL. Suppose we have a VIEW where some columns are expensive to be calculated: CREATE VIEW a AS SELECT (... expensive calculation ...) as expensive, count(*) as cheap FROM x; where "x" is a sufficiently large table. I would expect the following query to be very fast: SELECT cheap FROM a; However, it takes the same time as "SELECT * FROM a;". In other words: The column "expensive" is calculated although it hasn't been asked for. Of course, there are work-arounds for that, but I wonder why PostgreSQL doesn't perform this small optimization by itself. I checked that behaviour with PostgreSQL 8.3.7 (Debian/Etch) and 8.4.1 (Debian/Lenny). Greets, Volker -- Volker Grabsch ---<<(())>>--- Administrator NotJustHosting GbR
Volker Grabsch <vog@notjusthosting.com> writes: > I'm confused about the absence of a very simple optimization > in PostgreSQL. Suppose we have a VIEW where some columns are > expensive to be calculated: > CREATE VIEW a AS > SELECT > (... expensive calculation ...) as expensive, > count(*) as cheap > FROM > x; > where "x" is a sufficiently large table. I would expect the > following query to be very fast: > SELECT cheap FROM a; > However, it takes the same time as "SELECT * FROM a;". I think your main error is in supposing that count(*) is cheap ... PG will suppress unused columns when a view can be flattened, but typically not otherwise. In particular a view involving aggregates won't get flattened; but given that the aggregates will force a whole-table scan anyway, I can't get that excited about removing a subset of them. regards, tom lane
[ please keep the list cc'd ] Volker Grabsch <vog@notjusthosting.com> writes: > The "count(*)" in the example seems to be distracting. In fact, > it could be replaced with a simple constant value, the effect > is the same: > CREATE VIEW a AS > SELECT > (... expensive calculation ...) as expensive, > 1 as cheap > FROM > x; Well, if you try that case, you'll find that the "expensive" column *does* get thrown away. (Using EXPLAIN VERBOSE under 8.4 may help you see what's going on here.) It's only when there's some reason why the view can't get flattened that there's an issue. I've been thinking about this since your earlier mail, and I think it would probably be possible to suppress unused columns in a non-flattened subquery. I remain unconvinced that it's worth the trouble though. A real (not handwavy) example would help make the case here. As an example of why I'm not convinced, one thing we'd have to consider is whether it is okay to suppress calculation of columns containing volatile functions. I'd be inclined to think not, since that could cause side-effects to not happen that the user might be expecting to happen. (We got beat up in the past for letting the planner be cavalier about that consideration.) I suspect, though, that that case and other non-optimizable cases might account for the bulk of situations where the existing optimization doesn't happen. regards, tom lane
On Sat, Oct 17, 2009 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've been thinking about this since your earlier mail, and I think it > would probably be possible to suppress unused columns in a non-flattened > subquery. I remain unconvinced that it's worth the trouble though. > A real (not handwavy) example would help make the case here. Are there any situations where this would enable join removal that otherwise wouldn't be possible? Maybe a query involving an unflattenable VIEW? ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Oct 17, 2009 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I've been thinking about this since your earlier mail, and I think it >> would probably be possible to suppress unused columns in a non-flattened >> subquery. �I remain unconvinced that it's worth the trouble though. >> A real (not handwavy) example would help make the case here. > Are there any situations where this would enable join removal that > otherwise wouldn't be possible? Hmm, yeah maybe, now that we have some join removal logic. Seems a bit of a stretch to claim that's a common real-world case though. regards, tom lane
I have a very common example which would illustrate the above problem a bit more. Guess the following view on a company table, which references the country of that company in another table. The view itself just returns the company-id and the country-name, create view companys_and_countries as select company.id, country.name from company left join country on (company.country_id = country.id); Pleaso note we have a left join here, so the contents of country do by no means affect the contents of the "id" row in that view. Lets see what happens when we just query for the ids: explain select id from companys_and_countries; The join is done anyway, even if its removed (At least on Postgres 8.3). The more common usecase would be having Display-Tables, where are foreign keys are dereferenced to their values. One could store this in a view, and then query only the columns one needs (This is especially useful if the user is able to configure its client for which columns he needs). I would like if unnessecary joins would be cut off here, as well as unnessecary columns. I know this would be a performance hit, so maybe a session option would be the right way here? Regards, Daniel Migowski
Daniel Migowski <dmigowski@ikoffice.de> writes: > I have a very common example which would illustrate the above problem a > bit more. This is (a) still handwaving, not a testable example, and (b) unrelated to the question at hand, because the suggested view is flattenable. regards, tom lane
Daniel Migowski <dmigowski@ikoffice.de> wrote: > I have a very common example which would illustrate the > above problem a bit more. Guess the following view on a > company table, which references the country of that company > in another table. The view itself just returns the > company-id and the country-name, > create view companys_and_countries as > select company.id, country.name from company left join > country on (company.country_id = country.id); > Pleaso note we have a left join here, so the contents of > country do by no means affect the contents of the "id" row > in that view. Lets see what happens when we just query for > the ids: > explain select id from companys_and_countries; > The join is done anyway, even if its removed (At least on > Postgres 8.3). [...] How could that be done otherwise? PostgreSQL *must* look at country to determine how many rows the left join produces. Tim
On Sun, Oct 18, 2009 at 10:35 AM, Tim Landscheidt <tim@tim-landscheidt.de> wrote: > Daniel Migowski <dmigowski@ikoffice.de> wrote: > >> I have a very common example which would illustrate the >> above problem a bit more. Guess the following view on a >> company table, which references the country of that company >> in another table. The view itself just returns the >> company-id and the country-name, > >> create view companys_and_countries as >> select company.id, country.name from company left join >> country on (company.country_id = country.id); > >> Pleaso note we have a left join here, so the contents of >> country do by no means affect the contents of the "id" row >> in that view. Lets see what happens when we just query for >> the ids: > >> explain select id from companys_and_countries; > >> The join is done anyway, even if its removed (At least on >> Postgres 8.3). [...] > > How could that be done otherwise? PostgreSQL *must* look at > country to determine how many rows the left join produces. Even if country.id is a primary or unique key? Jeff
On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Sun, Oct 18, 2009 at 10:35 AM, Tim Landscheidt > <tim@tim-landscheidt.de> wrote: >> Daniel Migowski <dmigowski@ikoffice.de> wrote: >> >>> I have a very common example which would illustrate the >>> above problem a bit more. Guess the following view on a >>> company table, which references the country of that company >>> in another table. The view itself just returns the >>> company-id and the country-name, >> >>> create view companys_and_countries as >>> select company.id, country.name from company left join >>> country on (company.country_id = country.id); >> >>> Pleaso note we have a left join here, so the contents of >>> country do by no means affect the contents of the "id" row >>> in that view. Lets see what happens when we just query for >>> the ids: >> >>> explain select id from companys_and_countries; >> >>> The join is done anyway, even if its removed (At least on >>> Postgres 8.3). [...] >> >> How could that be done otherwise? PostgreSQL *must* look at >> country to determine how many rows the left join produces. > > Even if country.id is a primary or unique key? Well, we currently don't have any logic for making inferences based on unique constraints. I have dreams of fixing that at some point (or maybe I'll get lucky and someone else will beat me to it) but it's currently in the category of "things for which I won't get paid but would like to spend some of my spare time in the evenings on", so it may be a while (unless of course it moves into the category of "things people are paying me a lot of money to get done", in which case it will likely happen quite a bit sooner...). ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Even if country.id is a primary or unique key? > Well, we currently don't have any logic for making inferences based on > unique constraints. Huh? http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php Admittedly it's just one case and there's lots more to be done, but it's more than nothing. So this is a *potential* argument for trying to trim subquery outputs. What I'm not sure about is whether there are common cases where this would be applicable below a non-flattenable subquery. regards, tom lane
On Sun, Oct 18, 2009 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> Even if country.id is a primary or unique key? > >> Well, we currently don't have any logic for making inferences based on >> unique constraints. > > Huh? > http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php > > Admittedly it's just one case and there's lots more to be done, but it's > more than nothing. So this is a *potential* argument for trying to trim > subquery outputs. What I'm not sure about is whether there are common > cases where this would be applicable below a non-flattenable subquery. Sorry, I have to stop writing emails when I'm half-asleep. Obviously what we don't have is logic for making deductions based on *foreign key* constraints, but that's not relevant here. Maybe I should shut up before I say any more dumb things, but one possible case where we don't currently do join removal but it would be nice if we did is: SELECT ... FROM a.x LEFT JOIN (SELECT bb.x, SUM(1) FROM bb GROUP BY bb.x) b ON a.x = b.x; Or even: SELECT ... FROM a.x LEFT JOIN (SELECT DISTINCT ON (bb.x) ... FROM bb) b ON a.x = b.x; Your commit message for the join removal patch mentions machine-generated SQL, but where join removal really comes up a lot for me is when using views. I like to define a view that includes all the columns that seem potentially useful and then let the user pick which ones they'd like to see. The trouble is that you don't want to incur the cost of computing the columns that the user doesn't select. It's probably true that in MOST of the cases where this comes up, the subquery can be flattened, from_collapse_limit permitting. But I think there are other cases, too. Another thing to keep in mind is that, in OLTP environments, it's sometimes important to minimize the number of server round-trips. The type of construction suggested by the OP might be someone's way of gather two somewhat-unrelated values with a single query. Except sometimes they only need one of them, but they end up paying for both anyway. They could probably work around this with a little bit different setup, but I don't think they're entirely wrong to find the current behavior a little bit surprising. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > It's probably true that in MOST of the cases where this comes up, the > subquery can be flattened, from_collapse_limit permitting. But I > think there are other cases, too. Right ... and from_collapse_limit is not relevant here; only the form of the subquery is. So I'd sure like to see some actual use cases before we decide to expend planning cycles on this. Just for fun, I hacked together a first cut at this. It's only about 120 lines but it's a bit cheesy (the limitation to not handling appendrel members in particular). It passes regression tests and seems to do what's wanted, but I'm not convinced it's worth the extra cycles as-is, let alone with the appendrel limitation fixed. regards, tom lane Index: allpaths.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/optimizer/path/allpaths.c,v retrieving revision 1.187 diff -c -r1.187 allpaths.c *** allpaths.c 12 Oct 2009 18:10:45 -0000 1.187 --- allpaths.c 18 Oct 2009 23:26:12 -0000 *************** *** 17,22 **** --- 17,23 ---- #include <math.h> + #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #ifdef OPTIMIZER_DEBUG #include "nodes/print.h" *************** *** 75,80 **** --- 76,83 ---- RangeTblEntry *rte, Index rti, Node *qual); static void recurse_push_qual(Node *setOp, Query *topquery, RangeTblEntry *rte, Index rti, Node *qual); + static void remove_unused_subquery_outputs(PlannerInfo *root, RelOptInfo *rel, + Query *subquery); /* *************** *** 611,616 **** --- 614,624 ---- pfree(differentTypes); /* + * Remove unreferenced subquery outputs, if possible. + */ + remove_unused_subquery_outputs(root, rel, subquery); + + /* * We can safely pass the outer tuple_fraction down to the subquery if the * outer level has no joining, aggregation, or sorting to do. Otherwise * we'd better tell the subquery to plan for full retrieval. (XXX This *************** *** 1308,1313 **** --- 1316,1437 ---- } /***************************************************************************** + * REMOVING UNUSED SUBQUERY OUTPUTS + *****************************************************************************/ + + /* + * remove_unused_subquery_outputs + * Modify the subquery's targetlist to suppress useless output columns + * + * Views very often produce output columns that are unused by a particular + * calling query. Removing unused outputs can be useful if it avoids the + * need to calculate expensive functions or permits other optimizations + * (such as join removal) within the subquery. + * + * To avoid messing up the correlation between upper-query Vars and subquery + * output resnos, we don't actually delete unwanted tlist items, but replace + * them with null Consts. It's okay to modify the subquery tlist in-place + * because caller already made a copy of it. + */ + static void + remove_unused_subquery_outputs(PlannerInfo *root, RelOptInfo *rel, + Query *subquery) + { + Bitmapset *atts_used; + ListCell *lc; + int i; + + /* + * Can't do anything for "other rels" (appendrel members), because we + * don't track attr_needed for them --- see set_append_rel_pathlist. + * This could be fixed by making that routine do more work, or by + * having this routine find the appendrel parent and consult its + * attr_needed. (The attr numbers should be the same, in cases where + * a subquery could appear in an appendrel.) For the moment it's not + * clear that handling the case is worth any extra trouble. + */ + if (rel->reloptkind != RELOPT_BASEREL) + return; + + /* + * Can't do anything if subquery involves a setop or DISTINCT (but not + * DISTINCT ON), because then the tlist values are all important to the + * subquery's internal semantics. UNION ALL could possibly be handled, + * but we will normally not see that here because it will get flattened + * into an appendrel, so there's little point in expending effort on it. + * We would actually reject DISTINCT below anyhow, but there's no point + * in spending cycles just to discover that all the columns are needed. + */ + if (subquery->setOperations || + (subquery->distinctClause && !subquery->hasDistinctOn)) + return; + + /* + * Identify all the subquery outputs that are used by the outer query, + * and build a bitmapset of their attnos. attr_needed tells us about + * everything used in joinquals or tlist, and we have to add anything + * used in the surviving restriction clauses. (So it's important that + * this runs only after pushing down any pushable restrictclauses.) + * Subqueries can't have negative attnos, so we need not offset the + * attnos. + */ + atts_used = NULL; + for (i = rel->min_attr; i <= rel->max_attr; i++) + { + if (!bms_is_empty(rel->attr_needed[i - rel->min_attr])) + atts_used = bms_add_member(atts_used, i); + + } + foreach(lc, rel->baserestrictinfo) + { + RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); + List *vars; + ListCell *vl; + + vars = pull_var_clause((Node *) rinfo->clause, + PVC_RECURSE_PLACEHOLDERS); + foreach(vl, vars) + { + Var *var = (Var *) lfirst(vl); + + Assert(IsA(var, Var)); + atts_used = bms_add_member(atts_used, var->varattno); + } + list_free(vars); + } + + /* + * If there's a whole-row Var reference then we can't remove anything. + */ + if (bms_is_member(0, atts_used)) + return; + + /* + * Okay, scan the tlist and replace unwanted items with null constants + * of the same datatype. + */ + foreach(lc, subquery->targetList) + { + TargetEntry *tle = (TargetEntry *) lfirst(lc); + + /* + * TLEs with nonzero ressortgroupref mustn't be zapped because + * they are relevant to the subquery's internal semantics. We + * also don't risk zapping any resjunk entries, though that test + * is probably redundant with the ressortgroupref test. + */ + if (tle->ressortgroupref || tle->resjunk) + continue; + + if (bms_is_member(tle->resno, atts_used)) + continue; + + tle->expr = (Expr *) makeNullConst(exprType((Node *) tle->expr), + exprTypmod((Node *) tle->expr)); + } + } + + /***************************************************************************** * DEBUG SUPPORT *****************************************************************************/
I wrote: > Just for fun, I hacked together a first cut at this. Oh, just for the archives: I forgot about not suppressing volatile expressions --- checking that would increase the cost of this significantly, though it's only another line or two. regards, tom lane
On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote: > one thing we'd have to consider > is whether it is okay to suppress calculation of columns containing > volatile functions. I think we should have a 4th class of functions, volatile-without-side-effects (better name needed, obviously). That would allow us to optimize such calls away, if appropriate. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote: >> one thing we'd have to consider >> is whether it is okay to suppress calculation of columns containing >> volatile functions. > I think we should have a 4th class of functions, > volatile-without-side-effects (better name needed, obviously). What for? There wouldn't be that many, I think. random() and clock_timestamp(), yeah, but most volatile user-defined functions are either volatile-with-side-effects or misdeclared. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> wrote: > I think we should have a 4th class of functions, > volatile-without-side-effects Sounds reasonable to me. > (better name needed, obviously). Well, from this list (which is where volatile points), mutable seems closest to OK, but I'm not sure I like any of them. http://www.merriam-webster.com/thesaurus/fickle Anyone else have an idea? -Kevin
On Mon, 2009-10-19 at 13:43 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote: > >> one thing we'd have to consider > >> is whether it is okay to suppress calculation of columns containing > >> volatile functions. > > > I think we should have a 4th class of functions, > > volatile-without-side-effects (better name needed, obviously). > > What for? There wouldn't be that many, I think. random() and > clock_timestamp(), yeah, but most volatile user-defined functions > are either volatile-with-side-effects or misdeclared. Read only vs. read write? -- Simon Riggs www.2ndQuadrant.com
On Sun, 18 Oct 2009, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >>> Even if country.id is a primary or unique key? > >> Well, we currently don't have any logic for making inferences based on >> unique constraints. > > Huh? > http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php > > Admittedly it's just one case and there's lots more to be done, but it's > more than nothing. So this is a *potential* argument for trying to trim > subquery outputs. What I'm not sure about is whether there are common > cases where this would be applicable below a non-flattenable subquery. > Wow. That's IHMO a major improvement in the optimizer. Is this also valid for views? In the area of views this might even be a killer feature since one can define a view with many columns and when only e.g. one is used query is still optimal. I had today such a situation where I created a new view to be ~24 times faster (with a lot of left outer joins). Is the patch only for 8.5 or even backported to 8.4 and 8.3? Thnx. Ciao, Gerhard -- http://www.wiesinger.com/
Simon Riggs <simon@2ndQuadrant.com> writes: > On Mon, 2009-10-19 at 13:43 -0400, Tom Lane wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >>> I think we should have a 4th class of functions, >>> volatile-without-side-effects (better name needed, obviously). >> >> What for? There wouldn't be that many, I think. random() and >> clock_timestamp(), yeah, but most volatile user-defined functions >> are either volatile-with-side-effects or misdeclared. > Read only vs. read write? Most read-only functions are stable or even immutable. I don't say that there's zero usefulness in a fourth class, but I do say it's unlikely to be worth the trouble. (The only reason it even came up in this connection is that the default for user-defined functions is "volatile" which would defeat this optimization ... but we could hardly make the default anything else.) regards, tom lane
Gerhard Wiesinger <lists@wiesinger.com> writes: > Is the patch only for 8.5 or even backported to 8.4 and 8.3? That patch will *not* be backported. It hasn't even got through beta yet. regards, tom lane
On Mon, 2009-10-19 at 13:58 -0400, Tom Lane wrote: > > Most read-only functions are stable or even immutable. Huh? I mean a function that only contains SELECTs. (How would those ever be Stable or Immutable??) -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Mon, 2009-10-19 at 13:58 -0400, Tom Lane wrote: >> Most read-only functions are stable or even immutable. > Huh? I mean a function that only contains SELECTs. (How would those ever > be Stable or Immutable??) Uh, a function containing SELECTs is exactly the use-case for STABLE. Maybe you need to go re-read the definitions? regards, tom lane