Thread: Propogating conditions into a query

Propogating conditions into a query

From
Phil Endecott
Date:
Dear All,

I have a number of complex views for which the typical use is to select
exactly one row by id, e.g.  "select * from V where id=nnn".  Some of
these selects run orders of magnitude faster than others.  Looking at
the output of "explain analyse" it seems that in the fast cases the
"id=nnn" condition is passed down to the lower-level operations, while
in the slower cases the entire view is created and then filtered using
the condition as a final step.

I am trying to narrow down what types of query I can use in the views to
avoid the poor performance.  Here are a couple of things that I have
noticed:

- One query had a "distinct on (id)" at the top level.  This was only to
cope with an obscure case where what is normally a one-to-one join could
return multiple rows.  Removing the "distinct" and discarding the
duplicate rows in the calling code means that the "where id=nnn" is now
applied as a condition for an index scan where it previously wasn't,
reducing execution time by two orders of magnitude.  But I can't see a
reason why the "id=nnn" condition couldn't have been used inside the
query, even in the presence of the "distinct" clause.

- In another case I have a LEFT OUTER JOIN which can be made much faster
by instead using a normal JOIN.  Unfortunately a normal JOIN doesn't do
what I want, but I can't see why the condition is propogated into the
JOIN but not the LEFT OUTER JOIN.  Here is an outline of the query:

D left outer join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn

That does index scans on M and G and a merge join to create the complete
"M join G" table.  On the other hand, if I do

D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn

then it does conditional index scans on D.id=nnn and M.b=nnn and a
nested loop join returning one row, followed by a conditional index scan
on G.  This is an order of magnitude faster.

I don't think this is a problem with statistics; the row-count estimates
are all reasonable.  I imagine that the restriction is something missing
in the query optimiser.  Can I rewrite this query somehow?  Is there
anything else I can do about it?

This is with 7.4.2.

Cheers,  Phil.



Re: Propogating conditions into a query

From
Tom Lane
Date:
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> I have a number of complex views for which the typical use is to select
> exactly one row by id, e.g.  "select * from V where id=nnn".  Some of
> these selects run orders of magnitude faster than others.  Looking at
> the output of "explain analyse" it seems that in the fast cases the
> "id=nnn" condition is passed down to the lower-level operations, while
> in the slower cases the entire view is created and then filtered using
> the condition as a final step.

> I am trying to narrow down what types of query I can use in the views to
> avoid the poor performance.

When in doubt, use the source ;-) ... most sorts of things like this are
pretty well commented, if you can find the relevant code.  In this case
what you are wanting is that the view subquery either get "pulled up"
into the calling query, or that conditions from the calling query get
"pushed down" into the subquery.  The former transformation is done in
src/backend/optimizer/prep/prepjointree.c, and the principal conditions
are checked in is_simple_subquery():

    /*
     * Can't currently pull up a query with setops. Maybe after querytree
     * redesign...
     */
    if (subquery->setOperations)
        return false;

    /*
     * Can't pull up a subquery involving grouping, aggregation, sorting,
     * or limiting.
     */
    if (subquery->hasAggs ||
        subquery->groupClause ||
        subquery->havingQual ||
        subquery->sortClause ||
        subquery->distinctClause ||
        subquery->limitOffset ||
        subquery->limitCount)
        return false;

    /*
     * Don't pull up a subquery that has any set-returning functions in
     * its targetlist.    Otherwise we might well wind up inserting
     * set-returning functions into places where they mustn't go, such as
     * quals of higher queries.
     */
    if (expression_returns_set((Node *) subquery->targetList))
        return false;

    /*
     * Hack: don't try to pull up a subquery with an empty jointree.
     * query_planner() will correctly generate a Result plan for a
     * jointree that's totally empty, but I don't think the right things
     * happen if an empty FromExpr appears lower down in a jointree. Not
     * worth working hard on this, just to collapse SubqueryScan/Result
     * into Result...
     */
    if (subquery->jointree->fromlist == NIL)
        return false;

    return true;

The push-down optimization is done in
src/backend/optimizer/path/allpaths.c, and that makes tests on both the
subquery involved and the qualification condition to be pushed down:

/*
 * subquery_is_pushdown_safe - is a subquery safe for pushing down quals?
 *
 * subquery is the particular component query being checked.  topquery
 * is the top component of a set-operations tree (the same Query if no
 * set-op is involved).
 *
 * Conditions checked here:
 *
 * 1. If the subquery has a LIMIT clause, we must not push down any quals,
 * since that could change the set of rows returned.
 *
 * 2. If the subquery contains EXCEPT or EXCEPT ALL set ops we cannot push
 * quals into it, because that would change the results.
 *
 * 3. For subqueries using UNION/UNION ALL/INTERSECT/INTERSECT ALL, we can
 * push quals into each component query, but the quals can only reference
 * subquery columns that suffer no type coercions in the set operation.
 * Otherwise there are possible semantic gotchas.  So, we check the
 * component queries to see if any of them have different output types;
 * differentTypes[k] is set true if column k has different type in any
 * component.
 */

/*
 * qual_is_pushdown_safe - is a particular qual safe to push down?
 *
 * qual is a restriction clause applying to the given subquery (whose RTE
 * has index rti in the parent query).
 *
 * Conditions checked here:
 *
 * 1. The qual must not contain any subselects (mainly because I'm not sure
 * it will work correctly: sublinks will already have been transformed into
 * subplans in the qual, but not in the subquery).
 *
 * 2. The qual must not refer to any subquery output columns that were
 * found to have inconsistent types across a set operation tree by
 * subquery_is_pushdown_safe().
 *
 * 3. If the subquery uses DISTINCT ON, we must not push down any quals that
 * refer to non-DISTINCT output columns, because that could change the set
 * of rows returned.  This condition is vacuous for DISTINCT, because then
 * there are no non-DISTINCT output columns, but unfortunately it's fairly
 * expensive to tell the difference between DISTINCT and DISTINCT ON in the
 * parsetree representation.  It's cheaper to just make sure all the Vars
 * in the qual refer to DISTINCT columns.
 *
 * 4. We must not push down any quals that refer to subselect outputs that
 * return sets, else we'd introduce functions-returning-sets into the
 * subquery's WHERE/HAVING quals.
 */

            regards, tom lane

Re: Propogating conditions into a query

From
Phil Endecott
Date:
Tom Lane wrote:
> Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
>
>>I have a number of complex views for which the typical use is to select
>>exactly one row by id, e.g.  "select * from V where id=nnn".  Some of
>>these selects run orders of magnitude faster than others.  Looking at
>>the output of "explain analyse" it seems that in the fast cases the
>>"id=nnn" condition is passed down to the lower-level operations, while
>>in the slower cases the entire view is created and then filtered using
>>the condition as a final step.
>
> When in doubt, use the source ;-) ... most sorts of things like this are
> pretty well commented, if you can find the relevant code.

Good plan.

>  * 3. If the subquery uses DISTINCT ON, we must not push down any quals that
>  * refer to non-DISTINCT output columns, because that could change the set
>  * of rows returned.

OK, so this is why my query with DISTINCT ON was failing.  I can fix that.

I don't see anything in there about LEFT OUTER JOIN though.  Any ideas?

--Phil.


Re: Propogating conditions into a query

From
Tom Lane
Date:
Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
> I don't see anything in there about LEFT OUTER JOIN though.  Any ideas?

Oh, I missed that part of your message.  Hmm, I think the issue is that in

>> D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn

the planner deduces M.b=nnn by transitivity, but when the join is an
outer join it can't make the same deduction.

[ thinks some more... ]  If we distinguished conditions that hold below
the join from those that hold above it, we could deduce that M.b=nnn can
be enforced below the join even though it might not be true above it.
There's no such mechanism in existence now, though.

A possible workaround is to generate your query like

 D left join (M join G on (M.g=G.id)) on (D.id=M.b AND M.b=nnn) where D.id=nnn

but I don't know how practical that is for you.

            regards, tom lane

Re: Propogating conditions into a query

From
Phil Endecott
Date:
Tom Lane wrote:
> Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
>>>D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn
>
> A possible workaround is to generate your query like
>
>  D left join (M join G on (M.g=G.id)) on (D.id=M.b AND M.b=nnn) where D.id=nnn

I don't suppose it would work if I did

D left join (M join G on (M.g=G.id)) on (D.id=M.b)
     where (D.id=nnn AND (M.b=nnn or M.b IS NULL))

would it?

Otherwise it breaks the view, and makes the calling code rather more messy.

--Phil.



Re: Propogating conditions into a query

From
Kim Bisgaard
Date:
Hi Tom,

This sounds like the same "problem" which prevented PG from using the
indices, and thus giving abyssmal performance in this other thread:

> I have two BIG tables (virtually identical) with 3 NOT NULL columns
> Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id,
> TimeObs) and valid ANALYSE (set statistics=100). I want to join the
> two tables with a FULL OUTER JOIN.
>
> When I specify the query as:
>
> SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
>        FROM temp_dry_at_2m a
>        FULL OUTER JOIN temp_grass b        USING (station_id, timeobs)
>        WHERE station_id = 52981
>          AND timeobs = '2004-1-1 0:0:0'

Then I would also vote for improving the inteligence of the optimizer! :-)

Regards,
Kim.

Tom Lane wrote:

>Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
>
>
>>I don't see anything in there about LEFT OUTER JOIN though.  Any ideas?
>>
>>
>
>Oh, I missed that part of your message.  Hmm, I think the issue is that in
>
>
>
>>>D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn
>>>
>>>
>
>the planner deduces M.b=nnn by transitivity, but when the join is an
>outer join it can't make the same deduction.
>
>[ thinks some more... ]  If we distinguished conditions that hold below
>the join from those that hold above it, we could deduce that M.b=nnn can
>be enforced below the join even though it might not be true above it.
>There's no such mechanism in existence now, though.
>
>A possible workaround is to generate your query like
>
> D left join (M join G on (M.g=G.id)) on (D.id=M.b AND M.b=nnn) where D.id=nnn
>
>but I don't know how practical that is for you.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>
>
>

Re: Propogating conditions into a query

From
Kim Bisgaard
Date:
Hi Tom,

I have now completed the move to PG8.0.3, and feel that I have confirmed
that this problem is related to the problem I'm having:

Formulated like this, it is not performing:

SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
       FROM temp_dry_at_2m a
       FULL OUTER JOIN temp_grass b
       USING (station_id, timeobs)
       WHERE station_id = 52981
         AND timeobs = '2004-1-1 0:0:0';

Merge Full Join  (cost=1598312.83..11032924.48 rows=6956994 width=32) (actual time=119061.098..133314.306 rows=1
loops=1)
  Merge Cond: (("outer".timeobs = "inner".timeobs) AND ("outer".station_id = "inner".station_id))
  Filter: ((COALESCE("inner".station_id, "outer".station_id) = 52981) AND (COALESCE("inner".timeobs, "outer".timeobs) =
'2004-01-0100:00:00'::timestamp without time zone)) 
  ->  Sort  (cost=346429.38..352445.11 rows=2406292 width=16) (actual time=20315.241..23850.529 rows=2406292 loops=1)
        Sort Key: b.timeobs, b.station_id
        ->  Seq Scan on temp_grass b  (cost=0.00..41756.92 rows=2406292 width=16) (actual time=10.517..7003.468
rows=2406292loops=1) 
  ->  Sort  (cost=1251883.44..1269275.93 rows=6956994 width=16) (actual time=82122.354..92027.850 rows=6956994 loops=1)
        Sort Key: a.timeobs, a.station_id
        ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..117549.94 rows=6956994 width=16) (actual time=23.759..39930.741
rows=6956994loops=1) 
Total runtime: 133623.422 ms

But Postgresql can do the work, if it is reformulated into:

SELECT station_id, timeobs, temp_grass, temp_dry_at_2m
 FROM
   (SELECT station_id, timeobs, temp_dry_at_2m
     FROM temp_dry_at_2m
     WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') a
   FULL OUTER JOIN
   (SELECT station_id, timeobs, temp_grass
     FROM temp_grass
     WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') b
   USING (station_id, timeobs)

Merge Full Join  (cost=0.00..43023.64 rows=10614 width=32) (actual time=0.056..0.064 rows=1 loops=1)
  ->  Index Scan using temp_grass_idx on temp_grass  (cost=0.00..246.55 rows=61 width=16) (actual time=0.029..0.031
rows=1loops=1) 
        Index Cond: ((timeobs = '2004-01-01 00:00:00'::timestamp without time zone) AND (station_id = 52981))
  ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m  (cost=0.00..699.52 rows=174 width=16) (actual
time=0.017..0.020rows=1 loops=1) 
        Index Cond: ((timeobs = '2004-01-01 00:00:00'::timestamp without time zone) AND (station_id = 52981))
Total runtime: 0.163 ms


The reason the first query is not performing is because the query
optimizer does not push the conditions down into the sub-queries - right??

The reason that I do not just use the reformulated query, is that e.g.
the station_id comes from another table (and there can be more of them),
so it is bloody inconvenient to first select them, and then repeat them
a number of time in the above transformation (I need to outer join more
than two tables) ........

Best regards,


Kim Bisgaard wrote:

> Hi Tom,
>
> This sounds like the same "problem" which prevented PG from using the
> indices, and thus giving abyssmal performance in this other thread:
>
>> I have two BIG tables (virtually identical) with 3 NOT NULL columns
>> Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id,
>> TimeObs) and valid ANALYSE (set statistics=100). I want to join the
>> two tables with a FULL OUTER JOIN.
>>
>> When I specify the query as:
>>
>> SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
>>        FROM temp_dry_at_2m a
>>        FULL OUTER JOIN temp_grass b        USING (station_id, timeobs)
>>        WHERE station_id = 52981
>>          AND timeobs = '2004-1-1 0:0:0'
>
>
> Then I would also vote for improving the inteligence of the optimizer!
> :-)
>
> Regards,
> Kim.
>
> Tom Lane wrote:
>
>> Phil Endecott <spam_from_postgresql_general@chezphil.org> writes:
>>
>>
>>> I don't see anything in there about LEFT OUTER JOIN though.  Any ideas?
>>>
>>
>>
>> Oh, I missed that part of your message.  Hmm, I think the issue is
>> that in
>>
>>
>>
>>>> D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn
>>>>
>>>
>>
>> the planner deduces M.b=nnn by transitivity, but when the join is an
>> outer join it can't make the same deduction.
>>
>> [ thinks some more... ]  If we distinguished conditions that hold below
>> the join from those that hold above it, we could deduce that M.b=nnn can
>> be enforced below the join even though it might not be true above it.
>> There's no such mechanism in existence now, though.
>>
>> A possible workaround is to generate your query like
>>
>> D left join (M join G on (M.g=G.id)) on (D.id=M.b AND M.b=nnn) where
>> D.id=nnn
>>
>> but I don't know how practical that is for you.
>>
>>             regards, tom lane
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 5: Have you checked our extensive FAQ?
>>
>>               http://www.postgresql.org/docs/faq
>>
>>
>>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Re: Propogating conditions into a query

From
Tom Lane
Date:
Kim Bisgaard <kib+pg@dmi.dk> writes:
> The reason the first query is not performing is because the query
> optimizer does not push the conditions down into the sub-queries - right??

Well, it's not the same condition: the WHERE clause is constraining
the output variable of the FULL JOIN, which is logically and
implementationally distinct from either input table's variable.
I suppose it is arguable that we could transform the WHERE clause into
constraints on the input variables --- but we'd have to think carefully
about the conditions under which such a transformation is valid.
Offhand it seems like it might work for strict non-volatile constraint
expressions that refer only to output variables of the JOIN; if they
refer to other tables too then I'm not sure.

This is by no means a "bug fix", but if I have time over the next week
I'll see whether it is feasible to write for 8.1.

            regards, tom lane

Re: Propogating conditions into a query

From
Kim Bisgaard
Date:
Tom Lane wrote:

>Kim Bisgaard <kib+pg@dmi.dk> writes:
>
>
>>The reason the first query is not performing is because the query
>>optimizer does not push the conditions down into the sub-queries - right??
>>
>>
>
>Well, it's not the same condition: the WHERE clause is constraining
>the output variable of the FULL JOIN, which is logically and
>implementationally distinct from either input table's variable.
>I suppose it is arguable that we could transform the WHERE clause into
>constraints on the input variables --- but we'd have to think carefully
>about the conditions under which such a transformation is valid.
>Offhand it seems like it might work for strict non-volatile constraint
>expressions that refer only to output variables of the JOIN; if they
>refer to other tables too then I'm not sure.
>
>
I am shure you know what you are talking about - I am just showing
another case where I think the above transformations are worthwhile
(pushing my problem :-)  )

>This is by no means a "bug fix", but if I have time over the next week
>I'll see whether it is feasible to write for 8.1.
>
>
Hey thanks a lot - tell me if there is anything I can do (beer next time
in Copenhagen, work, ......)

Best regards,
Kim.
--
Computer Department                  Phone: +45 3915 7562 (direct)
Danish Meteorological Institute      Fax: +45 3915 7460 (division)