Thread: two seperate queries run faster than queries ORed together

two seperate queries run faster than queries ORed together

From
Joseph Shraibman
Date:
explain
SELECT COUNT(u.ukey) FROM u, d WHERE d.ukey = u.ukey AND u.pkey = 260
AND (u.status = 3 ) AND NOT u.boolfield ;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Aggregate  (cost=45707.84..45707.84 rows=1 width=4)
    ->  Nested Loop  (cost=0.00..45707.16 rows=273 width=4)
          ->  Seq Scan on usertable u  (cost=0.00..44774.97 rows=272
width=4)
                Filter: ((pkey = 260) AND (status = 3) AND (NOT boolfield))
          ->  Index Scan using d_pkey on d  (cost=0.00..3.41 rows=1 width=4)
                Index Cond: (d.ukey = "outer".ukey)


explain
SELECT COUNT(u.ukey) FROM u, d WHERE d.ukey = u.ukey AND u.pkey = 260
AND (d.status = 3 ) AND NOT u.boolfield ;

                                           QUERY PLAN
----------------------------------------------------------------------------------------------
  Aggregate  (cost=28271.38..28271.38 rows=1 width=4)
    ->  Nested Loop  (cost=0.00..28271.38 rows=1 width=4)
          ->  Seq Scan on d  (cost=0.00..28265.47 rows=1 width=4)
                Filter: (status = 3)
          ->  Index Scan using u_pkey on u  (cost=0.00..5.89 rows=1 width=4)
                Index Cond: (("outer".ukey = u.ukey) AND (u.pkey = 260))
                Filter: (NOT boolfield)


explain
SELECT COUNT(u.ukey) FROM u, d WHERE d.ukey = u.ukey AND u.pkey = 260
AND (u.status = 3 OR d.status = 3 ) AND NOT u.boolfield ;


                                       QUERY PLAN
---------------------------------------------------------------------------------------
  Aggregate  (cost=128867.45..128867.45 rows=1 width=4)
    ->  Hash Join  (cost=32301.47..128866.77 rows=272 width=4)
          Hash Cond: ("outer".ukey = "inner".ukey)
          Join Filter: (("inner".status = 3) OR ("outer".status = 3))
          ->  Seq Scan on u  (cost=0.00..41215.97 rows=407824 width=6)
                Filter: ((pkey = 260) AND (NOT boolfield))
          ->  Hash  (cost=25682.98..25682.98 rows=1032998 width=6)
                ->  Seq Scan on d  (cost=0.00..25682.98 rows=1032998
width=6)


... so what do I do?  It would be a real pain to rewrite this query to
run twice and add the results up, especially since I don't always know
beforehand when it will be faster based on different values to the query.

Re: two seperate queries run faster than queries ORed together

From
Richard Huxton
Date:
On Thursday 18 March 2004 21:21, Joseph Shraibman wrote:
> explain
> SELECT COUNT(u.ukey) FROM u, d WHERE d.ukey = u.ukey AND u.pkey = 260
> AND (u.status = 3 OR d.status = 3 ) AND NOT u.boolfield ;
>
>
>                                        QUERY PLAN
> ---------------------------------------------------------------------------
>------------ Aggregate  (cost=128867.45..128867.45 rows=1 width=4)
>     ->  Hash Join  (cost=32301.47..128866.77 rows=272 width=4)
>           Hash Cond: ("outer".ukey = "inner".ukey)
>           Join Filter: (("inner".status = 3) OR ("outer".status = 3))
>           ->  Seq Scan on u  (cost=0.00..41215.97 rows=407824 width=6)
>                 Filter: ((pkey = 260) AND (NOT boolfield))

There's your problem. For some reason it thinks it's getting 407,824 rows back
from that filtered seq-scan. I take it that pkey is a primary-key and is
defined as being UNIQUE? If you actually did have several hundred thousand
matches then a seq-scan might be sensible.

I'd start by analyze-ing the table in question, and if that doesn't have any
effect look at the column stats and see what spread of values it thinks you
have.

--
  Richard Huxton
  Archonet Ltd

Re: two seperate queries run faster than queries ORed together

From
Joseph Shraibman
Date:
Richard Huxton wrote:
> On Thursday 18 March 2004 21:21, Joseph Shraibman wrote:
>
>>explain
>>SELECT COUNT(u.ukey) FROM u, d WHERE d.ukey = u.ukey AND u.pkey = 260
>>AND (u.status = 3 OR d.status = 3 ) AND NOT u.boolfield ;
>>
>>
>>                                       QUERY PLAN
>>---------------------------------------------------------------------------
>>------------ Aggregate  (cost=128867.45..128867.45 rows=1 width=4)
>>    ->  Hash Join  (cost=32301.47..128866.77 rows=272 width=4)
>>          Hash Cond: ("outer".ukey = "inner".ukey)
>>          Join Filter: (("inner".status = 3) OR ("outer".status = 3))
>>          ->  Seq Scan on u  (cost=0.00..41215.97 rows=407824 width=6)
>>                Filter: ((pkey = 260) AND (NOT boolfield))
>
>
> There's your problem. For some reason it thinks it's getting 407,824 rows back
> from that filtered seq-scan. I take it that pkey is a primary-key and is
> defined as being UNIQUE? If you actually did have several hundred thousand
> matches then a seq-scan might be sensible.
>
No, pkey is not the primary key in this case. The number of entries in u
that have pkey 260 and not boolfield is 344706. The number of those that
have status == 3 is 7.  To total number of entries in d that have status
  == 3 is 4.

> I'd start by analyze-ing the table in question,
Is done every night.

The problem is that it seems the planner doesn't think to do the
different parts of the OR seperately and then combine the answers.

Re: two seperate queries run faster than queries ORed together

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> No, pkey is not the primary key in this case. The number of entries in u
> that have pkey 260 and not boolfield is 344706.

... and every one of those rows *must* be included in the join input,
regardless of its status value, because it might join to some d row that
has status=3.  Conversely, every single row of d must be considered in
the join because it might join to some u row with status=3.  So any way
you slice it, this query requires a large and expensive join operation,
no matter that there are only a few rows with the right status values in
the other table.

I'd rewrite the query if I were you.

            regards, tom lane

Re: two seperate queries run faster than queries ORed together

From
Joseph Shraibman
Date:
Tom Lane wrote:
> Joseph Shraibman <jks@selectacast.net> writes:
>
>>No, pkey is not the primary key in this case. The number of entries in u
>>that have pkey 260 and not boolfield is 344706.
>
>
> ... and every one of those rows *must* be included in the join input,

*If* you use one big join in the first place.  If postgres ran the query
to first get the values with status == 3 from u, then ran the query to
get the entries from d, then combined them, the result would be the same
but the output faster.  Instead it is doing seq scans on both tables and
doing an expensive join that returns only a few rows.


Re: two seperate queries run faster than queries ORed

From
Stephan Szabo
Date:
On Mon, 22 Mar 2004, Joseph Shraibman wrote:

> Tom Lane wrote:
> > Joseph Shraibman <jks@selectacast.net> writes:
> >
> >>No, pkey is not the primary key in this case. The number of entries in u
> >>that have pkey 260 and not boolfield is 344706.
> >
> >
> > ... and every one of those rows *must* be included in the join input,
>
> *If* you use one big join in the first place.  If postgres ran the query
> to first get the values with status == 3 from u, then ran the query to
> get the entries from d, then combined them, the result would be the same
> but the output faster.  Instead it is doing seq scans on both tables and

Well, you have to be careful on the combination to not give the wrong
answers if there's a row with u.status=3 that matches a row d.status=3.

Re: two seperate queries run faster than queries ORed together

From
Joseph Shraibman
Date:
Stephan Szabo wrote:
> On Mon, 22 Mar 2004, Joseph Shraibman wrote:
>
>
>>Tom Lane wrote:
>>
>>>Joseph Shraibman <jks@selectacast.net> writes:
>>>
>>>
>>>>No, pkey is not the primary key in this case. The number of entries in u
>>>>that have pkey 260 and not boolfield is 344706.
>>>
>>>
>>>... and every one of those rows *must* be included in the join input,
>>
>>*If* you use one big join in the first place.  If postgres ran the query
>>to first get the values with status == 3 from u, then ran the query to
>>get the entries from d, then combined them, the result would be the same
>>but the output faster.  Instead it is doing seq scans on both tables and
>
>
> Well, you have to be careful on the combination to not give the wrong
> answers if there's a row with u.status=3 that matches a row d.status=3.

Right you would have to avoid duplicates.  The existing DISTINCT code
should be able to handle that.

Re: two seperate queries run faster than queries ORed

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Well, you have to be careful on the combination to not give the wrong
> answers if there's a row with u.status=3 that matches a row d.status=3.

We could in theory handle that using something similar to the method
currently used for "OR" indexscans (that is, rather than doing either
UNION- or UNION-ALL-like processing, we drop tuples from later scans
that meet the qual tests of the earlier scans).  However I don't see any
clean way in the current planner to cost out both approaches and pick
the cheaper one.  It looks to me like we'd have to do over the *entire*
join planning process each way, which is ugly as well as unreasonably
expensive.  The problem is that the OR approach only wins when the
component clauses of the OR can drop down to lower levels of the plan
tree if they are considered separately.  But a plan tree with a
restriction at a low level and one without it are two different things,
and the dynamic-programming approach we use to build up join plans
doesn't yield the same solutions.  (As indeed it shouldn't, since the
whole point of Joseph's example is to get fundamentally different plans
for the two parts of the OR.)

We could possibly approach it heuristically, that is examine the clauses
and try to guess whether it's better to split them apart or not.  But
even assuming that we punt on that part of the problem, it seems like a
mess.  For instance suppose that there are additional relations in the
query that aren't mentioned in the OR clause.  The planner may want to
join some of those relations in advance of forming the join that the OR
itself describes.  Pushing down different parts of the OR might cause
the best join path to change.  How could you merge multiple scans if
some include extra relations and some don't?

In short, I see how such a plan could be executed, but I don't see any
effective approach for generating the plan ...

            regards, tom lane