Thread: how can I direct the planner ?

how can I direct the planner ?

From
Andrei Popescu-Belis
Date:
Hello everybody,

I would like to know why, in the situation described below,
the execution planner fails to choose IndexScan, and how it
could be forced to choose this option rather than SeqScan.
The latter takes about 3 minutes to execute, as opposed to
1-2 seconds for the former. Details below.

Thank you very much,
Andrei Popescu-Belis

--
ISSCO, Universite de Geneve
tel: (41 22) 705 86 81       40, bd du Pont d'Arve
fax: (41 22) 705 86 89       CH-1211 Geneve 4
   http://www.issco.unige.ch/staff/andrei

-----------------------------------------------------------

I have a table T1 with about 4 million lines ; none of the
columns provides a unique index, but the combination of the
first three columns does - that is, each triple is unique.
Using the first column as an index seems quite counter-
productive, as some values correspond to only one entry, and
others to tens of thousands.
So, I have created a unique index on T1 using the first three
columns: it has about 17 MB, as compared to the 250 MB of T1.

Question 1
++++++++++
Given the two indexes (on C1 and on C1-2-3), how come the
following plan is generated ?

sylex2000=# explain select * from T1 where C1 = 17000 ;
NOTICE:  QUERY PLAN:
Seq Scan on sol_d  (cost=0.00..82357.56 rows=41431 width=30)
EXPLAIN

This query takes about 3 minutes to solve. To force the
use of Index Scan I found the following query (C2 > 0 and
C3 > 0 are always true).

sylex2000=# explain select * from T1 where C1 = 17000 and
C2 > 0 and C3 > 0 ;
NOTICE:  QUERY PLAN:
Index Scan using T1_idx on T1 (cost=0.00..104052.07 rows=4603 width=30)
EXPLAIN

Note that the estimated cost is higher than previous, but the
query is solved in ca. 2 seconds.

Question 2
++++++++++
Is it possible to *force* the planner to always choose
the Index Scan ?

Question 3
++++++++++
I was looking for a "trick" to make the planner use both
indexes when joining two such tables. The only one I found is
to use a temporary table, as shown below. Is there any method
to do it in only one query, and have two Index Scans ?


sylex2000=# explain select *
from T1 , T2
where T1.C1 = T2.C2 and T2.C1 = 'Basel' ;

NOTICE:  QUERY PLAN:
Merge Join  (cost=4194.68..707265.19 rows=73512414 width=70)
  ->  Index Scan using T1_idx on T1  (cost=0.00..651259.77 rows=4143085
width=30)
  ->  Sort  (cost=4194.68..4194.68 rows=1774 width=40)
        ->  Seq Scan on T2  (cost=0.00..4098.93 rows=1774 width=40)

===> This takes about 1 minute (T2 is smaller)


sylex2000=# explain select * from T1 , T2 where T2.C1 = 'Basel'
and T1.C1 = T2.C2 and T1.C2 > 0 and T1.C3 > 1 ;
NOTICE:  QUERY PLAN:

Merge Join  (cost=174668.31..180444.77 rows=8168046 width=70)
  ->  Sort  (cost=170473.63..170473.63 rows=460343 width=30)
        ->  Seq Scan on T1  (cost=0.00..103072.99 rows=460343 width=30)
  ->  Sort  (cost=4194.68..4194.68 rows=1774 width=40)
        ->  Seq Scan on T2  (cost=0.00..4098.93 rows=1774 width=40)

===> This takes about 3 minutes

===> The only (poor) solution, which uses IndexScan twice:

  create temporary table T3 as (select C1 from T1 where C1 = 'Basel' ) ;
  create index T3_idx on T3 ( C1 ) ;
  select * from T3 , T2 where T3.C1 = T2.C1 ;
  drop table T3 ;


REMARK
++++++
I have found this: "A Tour of PostgreSQL Internals"
at the address:     http://www.postgresql.org/osdn/tour.pdf
Author:             Tom Lane

The basic idea of the planner is cost- estimate- based selection
of the best plan tree for a given query.
Simple example:
SELECT * FROM t WHERE f1 < 100
Assuming there is an index on t( f1), two possible execution plans are
considered:
" sequential scan over all of t
" index scan with index restriction f1 < 100
Costs of each plan (in disk page fetches and CPU time) are estimated and
the plan with lower estimated cost is selected.
Note that the indexscan is not an automatic winner, and should not be.
The choice will depend on what fraction of the rows of t are estimated
to
be retrieved.
---------
But this doesn't really answer my questions...
Thanks for any help!
APB

Re: how can I direct the planner ?

From
Tom Lane
Date:
Andrei Popescu-Belis <Andrei.Popescu-Belis@issco.unige.ch> writes:
> Using the first column as an index seems quite counter-
> productive, as some values correspond to only one entry, and
> others to tens of thousands.

Mmm, that's the source of the problem.  Currently the planner's estimate
of the selectivity of "C1 = 17000" is driven off the most common value
in the column.  If you were asking for one of the values with tens of
thousands of hits, then indeed a sequential scan would be the way to go.
The planner has no idea that the value you want has only a few hits.
(Notice that the estimated result row count has nothing to do with
reality :-()

The long-term answer for this is to maintain better statistics, so that
we can know something more about the distribution of values in the
column.  I've heard of many examples where there are a small number
of very frequent entries, with everything else much less frequent.
If we stored the top ten or so entries, not just one, we'd be able to
realize that a value that's none of the top ten must have a low
frequency.

> Is it possible to *force* the planner to always choose
> the Index Scan ?

You could try experimenting with SET enable_seqscan TO OFF.
Be wary that you don't shoot yourself in the foot for other
queries, however.

            regards, tom lane