Thread: Calculation of unused columns

Calculation of unused columns

From
Volker Grabsch
Date:
Dear PostgreSQL developers,

I'm confused about the absence of a very simple optimization
in PostgreSQL. Suppose we have a VIEW where some columns are
expensive to be calculated:

    CREATE VIEW a AS
    SELECT
        (... expensive calculation ...) as expensive,
        count(*) as cheap
    FROM
        x;

where "x" is a sufficiently large table. I would expect the
following query to be very fast:

    SELECT cheap FROM a;

However, it takes the same time as "SELECT * FROM a;".

In other words: The column "expensive" is calculated although
it hasn't been asked for. Of course, there are work-arounds
for that, but I wonder why PostgreSQL doesn't perform this
small optimization by itself.

I checked that behaviour with PostgreSQL 8.3.7 (Debian/Etch)
and 8.4.1 (Debian/Lenny).


Greets,

    Volker

--
Volker Grabsch
---<<(())>>---
Administrator
NotJustHosting GbR

Re: Calculation of unused columns

From
Tom Lane
Date:
Volker Grabsch <vog@notjusthosting.com> writes:
> I'm confused about the absence of a very simple optimization
> in PostgreSQL. Suppose we have a VIEW where some columns are
> expensive to be calculated:

>     CREATE VIEW a AS
>     SELECT
>         (... expensive calculation ...) as expensive,
>         count(*) as cheap
>     FROM
>         x;

> where "x" is a sufficiently large table. I would expect the
> following query to be very fast:

>     SELECT cheap FROM a;

> However, it takes the same time as "SELECT * FROM a;".

I think your main error is in supposing that count(*) is cheap ...

PG will suppress unused columns when a view can be flattened,
but typically not otherwise.  In particular a view involving
aggregates won't get flattened; but given that the aggregates
will force a whole-table scan anyway, I can't get that excited
about removing a subset of them.

            regards, tom lane

Re: Calculation of unused columns

From
Tom Lane
Date:
[ please keep the list cc'd ]

Volker Grabsch <vog@notjusthosting.com> writes:
> The "count(*)" in the example seems to be distracting. In fact,
> it could be replaced with a simple constant value, the effect
> is the same:

>     CREATE VIEW a AS
>     SELECT
>         (... expensive calculation ...) as expensive,
>         1 as cheap
>     FROM
>         x;

Well, if you try that case, you'll find that the "expensive" column
*does* get thrown away.  (Using EXPLAIN VERBOSE under 8.4 may help you
see what's going on here.)  It's only when there's some reason why the
view can't get flattened that there's an issue.

I've been thinking about this since your earlier mail, and I think it
would probably be possible to suppress unused columns in a non-flattened
subquery.  I remain unconvinced that it's worth the trouble though.
A real (not handwavy) example would help make the case here.

As an example of why I'm not convinced, one thing we'd have to consider
is whether it is okay to suppress calculation of columns containing
volatile functions.  I'd be inclined to think not, since that could
cause side-effects to not happen that the user might be expecting to
happen.  (We got beat up in the past for letting the planner be cavalier
about that consideration.)  I suspect, though, that that case and other
non-optimizable cases might account for the bulk of situations where the
existing optimization doesn't happen.

            regards, tom lane

Re: Calculation of unused columns

From
Robert Haas
Date:
On Sat, Oct 17, 2009 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been thinking about this since your earlier mail, and I think it
> would probably be possible to suppress unused columns in a non-flattened
> subquery.  I remain unconvinced that it's worth the trouble though.
> A real (not handwavy) example would help make the case here.

Are there any situations where this would enable join removal that
otherwise wouldn't be possible?  Maybe a query involving an
unflattenable VIEW?

...Robert

Re: Calculation of unused columns

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Oct 17, 2009 at 9:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I've been thinking about this since your earlier mail, and I think it
>> would probably be possible to suppress unused columns in a non-flattened
>> subquery. �I remain unconvinced that it's worth the trouble though.
>> A real (not handwavy) example would help make the case here.

> Are there any situations where this would enable join removal that
> otherwise wouldn't be possible?

Hmm, yeah maybe, now that we have some join removal logic.  Seems a bit
of a stretch to claim that's a common real-world case though.

            regards, tom lane

Re: Calculation of unused columns

From
Daniel Migowski
Date:
I have a very common example which would illustrate the above problem a
bit more. Guess the following view on a company table, which references
the country of that company in another table. The view itself just
returns the company-id and the country-name,

    create view companys_and_countries as
    select company.id, country.name from company left join country on
(company.country_id = country.id);

Pleaso note we have a left join here, so the contents of country do by
no means affect the contents of the "id" row in that view. Lets see what
happens when we just query for the ids:

    explain select id from companys_and_countries;

The join is done anyway, even if its removed (At least on Postgres 8.3).
The more common usecase would be having Display-Tables, where are
foreign keys are dereferenced to their values. One could store this in a
view, and then query only the columns one needs (This is especially
useful if the user is able to configure its client for which columns he
needs).

I would like if unnessecary joins would be cut off here, as well as
unnessecary columns. I know this would be a performance hit, so maybe a
session option would be the right way here?

Regards,
Daniel Migowski

Re: Calculation of unused columns

From
Tom Lane
Date:
Daniel Migowski <dmigowski@ikoffice.de> writes:
> I have a very common example which would illustrate the above problem a
> bit more.

This is (a) still handwaving, not a testable example, and (b) unrelated
to the question at hand, because the suggested view is flattenable.

            regards, tom lane

Re: Calculation of unused columns

From
Tim Landscheidt
Date:
Daniel Migowski <dmigowski@ikoffice.de> wrote:

> I have a very common example which would illustrate the
> above problem a bit more. Guess the following view on a
> company table, which references the country of that company
> in another table. The view itself just returns the
> company-id and the country-name,

>    create view companys_and_countries as
>    select company.id, country.name from company left join
> country on (company.country_id = country.id);

> Pleaso note we have a left join here, so the contents of
> country do by no means affect the contents of the "id" row
> in that view. Lets see what happens when we just query for
> the ids:

>    explain select id from companys_and_countries;

> The join is done anyway, even if its removed (At least on
> Postgres 8.3). [...]

How could that be done otherwise? PostgreSQL *must* look at
country to determine how many rows the left join produces.

Tim

Re: Calculation of unused columns

From
Jeff Janes
Date:
On Sun, Oct 18, 2009 at 10:35 AM, Tim Landscheidt
<tim@tim-landscheidt.de> wrote:
> Daniel Migowski <dmigowski@ikoffice.de> wrote:
>
>> I have a very common example which would illustrate the
>> above problem a bit more. Guess the following view on a
>> company table, which references the country of that company
>> in another table. The view itself just returns the
>> company-id and the country-name,
>
>>    create view companys_and_countries as
>>    select company.id, country.name from company left join
>> country on (company.country_id = country.id);
>
>> Pleaso note we have a left join here, so the contents of
>> country do by no means affect the contents of the "id" row
>> in that view. Lets see what happens when we just query for
>> the ids:
>
>>    explain select id from companys_and_countries;
>
>> The join is done anyway, even if its removed (At least on
>> Postgres 8.3). [...]
>
> How could that be done otherwise? PostgreSQL *must* look at
> country to determine how many rows the left join produces.

Even if country.id is a primary or unique key?

Jeff

Re: Calculation of unused columns

From
Robert Haas
Date:
On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sun, Oct 18, 2009 at 10:35 AM, Tim Landscheidt
> <tim@tim-landscheidt.de> wrote:
>> Daniel Migowski <dmigowski@ikoffice.de> wrote:
>>
>>> I have a very common example which would illustrate the
>>> above problem a bit more. Guess the following view on a
>>> company table, which references the country of that company
>>> in another table. The view itself just returns the
>>> company-id and the country-name,
>>
>>>    create view companys_and_countries as
>>>    select company.id, country.name from company left join
>>> country on (company.country_id = country.id);
>>
>>> Pleaso note we have a left join here, so the contents of
>>> country do by no means affect the contents of the "id" row
>>> in that view. Lets see what happens when we just query for
>>> the ids:
>>
>>>    explain select id from companys_and_countries;
>>
>>> The join is done anyway, even if its removed (At least on
>>> Postgres 8.3). [...]
>>
>> How could that be done otherwise? PostgreSQL *must* look at
>> country to determine how many rows the left join produces.
>
> Even if country.id is a primary or unique key?

Well, we currently don't have any logic for making inferences based on
unique constraints.  I have dreams of fixing that at some point (or
maybe I'll get lucky and someone else will beat me to it) but it's
currently in the category of "things for which I won't get paid but
would like to spend some of my spare time in the evenings on", so it
may be a while (unless of course it moves into the category of "things
people are paying me a lot of money to get done", in which case it
will likely happen quite a bit sooner...).

...Robert

Re: Calculation of unused columns

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Even if country.id is a primary or unique key?

> Well, we currently don't have any logic for making inferences based on
> unique constraints.

Huh?
http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php

Admittedly it's just one case and there's lots more to be done, but it's
more than nothing.  So this is a *potential* argument for trying to trim
subquery outputs.  What I'm not sure about is whether there are common
cases where this would be applicable below a non-flattenable subquery.

            regards, tom lane

Re: Calculation of unused columns

From
Robert Haas
Date:
On Sun, Oct 18, 2009 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Even if country.id is a primary or unique key?
>
>> Well, we currently don't have any logic for making inferences based on
>> unique constraints.
>
> Huh?
> http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php
>
> Admittedly it's just one case and there's lots more to be done, but it's
> more than nothing.  So this is a *potential* argument for trying to trim
> subquery outputs.  What I'm not sure about is whether there are common
> cases where this would be applicable below a non-flattenable subquery.

Sorry, I have to stop writing emails when I'm half-asleep.  Obviously
what we don't have is logic for making deductions based on *foreign
key* constraints, but that's not relevant here.

Maybe I should shut up before I say any more dumb things, but one
possible case where we don't currently do join removal but it would be
nice if we did is:

SELECT ... FROM a.x LEFT JOIN (SELECT bb.x, SUM(1) FROM bb GROUP BY
bb.x) b ON a.x = b.x;

Or even:

SELECT ... FROM a.x LEFT JOIN (SELECT DISTINCT ON (bb.x) ... FROM bb)
b ON a.x = b.x;

Your commit message for the join removal patch mentions
machine-generated SQL, but where join removal  really comes up a lot
for me is when using views.  I like to define a view that includes all
the columns that seem potentially useful and then let the user pick
which ones they'd like to see.  The trouble is that you don't want to
incur the cost of computing the columns that the user doesn't select.
It's probably true that in MOST of the cases where this comes up, the
subquery can be flattened, from_collapse_limit permitting.  But I
think there are other cases, too.

Another thing to keep in mind is that, in OLTP environments, it's
sometimes important to minimize the number of server round-trips.  The
type of construction suggested by the OP might be someone's way of
gather two somewhat-unrelated values with a single query.  Except
sometimes they only need one of them, but they end up paying for both
anyway.  They could probably work around this with a little bit
different setup, but I don't think they're entirely wrong to find the
current behavior a little bit surprising.

...Robert

Re: Calculation of unused columns

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It's probably true that in MOST of the cases where this comes up, the
> subquery can be flattened, from_collapse_limit permitting.  But I
> think there are other cases, too.

Right ... and from_collapse_limit is not relevant here; only the form of
the subquery is.  So I'd sure like to see some actual use cases before
we decide to expend planning cycles on this.

Just for fun, I hacked together a first cut at this.  It's only about
120 lines but it's a bit cheesy (the limitation to not handling
appendrel members in particular).  It passes regression tests and
seems to do what's wanted, but I'm not convinced it's worth the extra
cycles as-is, let alone with the appendrel limitation fixed.

            regards, tom lane

Index: allpaths.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.187
diff -c -r1.187 allpaths.c
*** allpaths.c    12 Oct 2009 18:10:45 -0000    1.187
--- allpaths.c    18 Oct 2009 23:26:12 -0000
***************
*** 17,22 ****
--- 17,23 ----

  #include <math.h>

+ #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
  #ifdef OPTIMIZER_DEBUG
  #include "nodes/print.h"
***************
*** 75,80 ****
--- 76,83 ----
                     RangeTblEntry *rte, Index rti, Node *qual);
  static void recurse_push_qual(Node *setOp, Query *topquery,
                    RangeTblEntry *rte, Index rti, Node *qual);
+ static void remove_unused_subquery_outputs(PlannerInfo *root, RelOptInfo *rel,
+                                Query *subquery);


  /*
***************
*** 611,616 ****
--- 614,624 ----
      pfree(differentTypes);

      /*
+      * Remove unreferenced subquery outputs, if possible.
+      */
+     remove_unused_subquery_outputs(root, rel, subquery);
+
+     /*
       * We can safely pass the outer tuple_fraction down to the subquery if the
       * outer level has no joining, aggregation, or sorting to do. Otherwise
       * we'd better tell the subquery to plan for full retrieval. (XXX This
***************
*** 1308,1313 ****
--- 1316,1437 ----
  }

  /*****************************************************************************
+  *            REMOVING UNUSED SUBQUERY OUTPUTS
+  *****************************************************************************/
+
+ /*
+  * remove_unused_subquery_outputs
+  *        Modify the subquery's targetlist to suppress useless output columns
+  *
+  * Views very often produce output columns that are unused by a particular
+  * calling query.  Removing unused outputs can be useful if it avoids the
+  * need to calculate expensive functions or permits other optimizations
+  * (such as join removal) within the subquery.
+  *
+  * To avoid messing up the correlation between upper-query Vars and subquery
+  * output resnos, we don't actually delete unwanted tlist items, but replace
+  * them with null Consts.  It's okay to modify the subquery tlist in-place
+  * because caller already made a copy of it.
+  */
+ static void
+ remove_unused_subquery_outputs(PlannerInfo *root, RelOptInfo *rel,
+                                Query *subquery)
+ {
+     Bitmapset  *atts_used;
+     ListCell   *lc;
+     int            i;
+
+     /*
+      * Can't do anything for "other rels" (appendrel members), because we
+      * don't track attr_needed for them --- see set_append_rel_pathlist.
+      * This could be fixed by making that routine do more work, or by
+      * having this routine find the appendrel parent and consult its
+      * attr_needed.  (The attr numbers should be the same, in cases where
+      * a subquery could appear in an appendrel.)  For the moment it's not
+      * clear that handling the case is worth any extra trouble.
+      */
+     if (rel->reloptkind != RELOPT_BASEREL)
+         return;
+
+     /*
+      * Can't do anything if subquery involves a setop or DISTINCT (but not
+      * DISTINCT ON), because then the tlist values are all important to the
+      * subquery's internal semantics.  UNION ALL could possibly be handled,
+      * but we will normally not see that here because it will get flattened
+      * into an appendrel, so there's little point in expending effort on it.
+      * We would actually reject DISTINCT below anyhow, but there's no point
+      * in spending cycles just to discover that all the columns are needed.
+      */
+     if (subquery->setOperations ||
+         (subquery->distinctClause && !subquery->hasDistinctOn))
+         return;
+
+     /*
+      * Identify all the subquery outputs that are used by the outer query,
+      * and build a bitmapset of their attnos.  attr_needed tells us about
+      * everything used in joinquals or tlist, and we have to add anything
+      * used in the surviving restriction clauses.  (So it's important that
+      * this runs only after pushing down any pushable restrictclauses.)
+      * Subqueries can't have negative attnos, so we need not offset the
+      * attnos.
+      */
+     atts_used = NULL;
+     for (i = rel->min_attr; i <= rel->max_attr; i++)
+     {
+         if (!bms_is_empty(rel->attr_needed[i - rel->min_attr]))
+             atts_used = bms_add_member(atts_used, i);
+
+     }
+     foreach(lc, rel->baserestrictinfo)
+     {
+         RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+         List       *vars;
+         ListCell   *vl;
+
+         vars = pull_var_clause((Node *) rinfo->clause,
+                                PVC_RECURSE_PLACEHOLDERS);
+         foreach(vl, vars)
+         {
+             Var           *var = (Var *) lfirst(vl);
+
+             Assert(IsA(var, Var));
+             atts_used = bms_add_member(atts_used, var->varattno);
+         }
+         list_free(vars);
+     }
+
+     /*
+      * If there's a whole-row Var reference then we can't remove anything.
+      */
+     if (bms_is_member(0, atts_used))
+         return;
+
+     /*
+      * Okay, scan the tlist and replace unwanted items with null constants
+      * of the same datatype.
+      */
+     foreach(lc, subquery->targetList)
+     {
+         TargetEntry *tle = (TargetEntry *) lfirst(lc);
+
+         /*
+          * TLEs with nonzero ressortgroupref mustn't be zapped because
+          * they are relevant to the subquery's internal semantics.  We
+          * also don't risk zapping any resjunk entries, though that test
+          * is probably redundant with the ressortgroupref test.
+          */
+         if (tle->ressortgroupref || tle->resjunk)
+             continue;
+
+         if (bms_is_member(tle->resno, atts_used))
+             continue;
+
+         tle->expr = (Expr *) makeNullConst(exprType((Node *) tle->expr),
+                                            exprTypmod((Node *) tle->expr));
+     }
+ }
+
+ /*****************************************************************************
   *            DEBUG SUPPORT
   *****************************************************************************/


Re: Calculation of unused columns

From
Tom Lane
Date:
I wrote:
> Just for fun, I hacked together a first cut at this.

Oh, just for the archives: I forgot about not suppressing volatile
expressions --- checking that would increase the cost of this
significantly, though it's only another line or two.

            regards, tom lane

Re: Calculation of unused columns

From
Simon Riggs
Date:
On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote:

> one thing we'd have to consider
> is whether it is okay to suppress calculation of columns containing
> volatile functions.

I think we should have a 4th class of functions,
volatile-without-side-effects (better name needed, obviously).

That would allow us to optimize such calls away, if appropriate.

--
 Simon Riggs           www.2ndQuadrant.com


Re: Calculation of unused columns

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote:
>> one thing we'd have to consider
>> is whether it is okay to suppress calculation of columns containing
>> volatile functions.

> I think we should have a 4th class of functions,
> volatile-without-side-effects (better name needed, obviously).

What for?  There wouldn't be that many, I think.  random() and
clock_timestamp(), yeah, but most volatile user-defined functions
are either volatile-with-side-effects or misdeclared.

            regards, tom lane

Re: Calculation of unused columns

From
"Kevin Grittner"
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:

> I think we should have a 4th class of functions,
> volatile-without-side-effects

Sounds reasonable to me.

> (better name needed, obviously).

Well, from this list (which is where volatile points), mutable seems
closest to OK, but I'm not sure I like any of them.

http://www.merriam-webster.com/thesaurus/fickle

Anyone else have an idea?

-Kevin

Re: Calculation of unused columns

From
Simon Riggs
Date:
On Mon, 2009-10-19 at 13:43 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sat, 2009-10-17 at 21:41 -0400, Tom Lane wrote:
> >> one thing we'd have to consider
> >> is whether it is okay to suppress calculation of columns containing
> >> volatile functions.
>
> > I think we should have a 4th class of functions,
> > volatile-without-side-effects (better name needed, obviously).
>
> What for?  There wouldn't be that many, I think.  random() and
> clock_timestamp(), yeah, but most volatile user-defined functions
> are either volatile-with-side-effects or misdeclared.

Read only vs. read write?

--
 Simon Riggs           www.2ndQuadrant.com


Re: Calculation of unused columns

From
Gerhard Wiesinger
Date:
On Sun, 18 Oct 2009, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Oct 18, 2009 at 1:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> Even if country.id is a primary or unique key?
>
>> Well, we currently don't have any logic for making inferences based on
>> unique constraints.
>
> Huh?
> http://archives.postgresql.org/pgsql-committers/2009-09/msg00159.php
>
> Admittedly it's just one case and there's lots more to be done, but it's
> more than nothing.  So this is a *potential* argument for trying to trim
> subquery outputs.  What I'm not sure about is whether there are common
> cases where this would be applicable below a non-flattenable subquery.
>

Wow. That's IHMO a major improvement in the optimizer. Is this also valid
for views?

In the area of views this might even be a killer feature since one can
define a view with many columns and when only e.g. one is used query is
still optimal. I had today such a situation where I created a new view to
be ~24 times faster (with a lot of left outer joins).

Is the patch only for 8.5 or even backported to 8.4 and 8.3?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/

Re: Calculation of unused columns

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-10-19 at 13:43 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> I think we should have a 4th class of functions,
>>> volatile-without-side-effects (better name needed, obviously).
>>
>> What for?  There wouldn't be that many, I think.  random() and
>> clock_timestamp(), yeah, but most volatile user-defined functions
>> are either volatile-with-side-effects or misdeclared.

> Read only vs. read write?

Most read-only functions are stable or even immutable.  I don't say
that there's zero usefulness in a fourth class, but I do say it's
unlikely to be worth the trouble.  (The only reason it even came
up in this connection is that the default for user-defined functions
is "volatile" which would defeat this optimization ... but we could
hardly make the default anything else.)

            regards, tom lane

Re: Calculation of unused columns

From
Tom Lane
Date:
Gerhard Wiesinger <lists@wiesinger.com> writes:
> Is the patch only for 8.5 or even backported to 8.4 and 8.3?

That patch will *not* be backported.  It hasn't even got through beta yet.

            regards, tom lane

Re: Calculation of unused columns

From
Simon Riggs
Date:
On Mon, 2009-10-19 at 13:58 -0400, Tom Lane wrote:
>
> Most read-only functions are stable or even immutable.

Huh? I mean a function that only contains SELECTs. (How would those ever
be Stable or Immutable??)

--
 Simon Riggs           www.2ndQuadrant.com


Re: Calculation of unused columns

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-10-19 at 13:58 -0400, Tom Lane wrote:
>> Most read-only functions are stable or even immutable.

> Huh? I mean a function that only contains SELECTs. (How would those ever
> be Stable or Immutable??)

Uh, a function containing SELECTs is exactly the use-case for STABLE.
Maybe you need to go re-read the definitions?

            regards, tom lane