Re: ordering of 'where' sub clauses - Mailing list pgsql-general

From Tom Lane
Subject Re: ordering of 'where' sub clauses
Date
Msg-id 18188.963935438@sss.pgh.pa.us
Whole thread Raw
In response to Re: ordering of 'where' sub clauses  (Steve Heaven <steve@thornet.co.uk>)
List pgsql-general
Steve Heaven <steve@thornet.co.uk> writes:
>  explain select m.* from main m, subset_table s
>  where m.stockno=s.stockno AND m.descrip ~ 'SEARCHTERM';
> NOTICE:  QUERY PLAN:

> Hash Join  (cost=118431.88 rows=714797 width=172)
>   -> Seq Scan on main m  (cost=79300.27 rows=714796 width=160)
>   -> Hash  (cost=8121.17 rows=203793 width=12)
>     -> Seq Scan on subset_table s  (cost=8121.17 rows=203793 width=12)

Given the sizes of the tables, I'm not sure that's such an unreasonable
plan.  Your subset seems to be more than a quarter of the rows in the
main table.  The only way to build a plan that visits *only* those rows
in main that are also in subset is to use an indexscan on main, right?
The cold hard fact is that an indexscan that touches a quarter of the
rows in a large table is going to be slower than a linear scan of the
whole table.  (Barring special cases such as a nearly-in-order table,
anyway.)  The cost of extra disk I/O will swamp out the extra
comparisons.

So I think the planner is doing the right thing to use a seqscan on
main, and then its only interesting choice is how to perform the join
with subset.  The above looks pretty good, bearing in mind again that
an indexscan that has to touch every row in subset will be mighty slow.

My advice is to forget about the subset table; it's not buying you
anything that's worth the trouble of doing the join.  If your original
subset condition is really expensive, you could run it and store the
result in a field added to the main table --- otherwise, why not just
evaluate it on the fly?  In any case, retrieving a subset that
represents more than a few percent of the total number of rows is
likely best done by a sequential scan, unsexy as that may sound.

BTW, it would be interesting to see if the 7.0.* planner produces a
materially different plan than the pre-7.0 version I observe you
are using.  Its cost estimates will be different, but I expect it
will arrive at the same conclusion that indexscan is not the way
to do this query.

            regards, tom lane

pgsql-general by date:

Previous
From: Bob Parkinson
Date:
Subject: disabling triggers?
Next
From: Tom Lane
Date:
Subject: Re: disabling triggers?