Thread: 7.0.x not using indices

From: Marcin Wolinski
Dear Sirs,

[This is in a way a reiteration of Marcin Zukowski's report of 19 Jan 2001
http://pgsql.service.net.pl/mhonarc/pgsql-bugs/2001-01/msg00110.html
which was ignored because of a somewhat unfortunate test case.]

I have created a new database and then have run this:
----------------------------------------------------------------------
create table w (wnr int4, wfid int4);
create table i (ifid int4, ilid int4);

\copy w from w.dump
\copy i from i.dump

create index w_wfid on w(wfid);
create index i_ilid on i(ilid);

vacuum verbose;
----------------------------------------------------------------------

[Please excuse the column names.  These are slices of tables from a
database containing a morphologically tagged corpus of the Polish
language: 'w' holds wordforms and 'i' their interpretations.]

After this the 'w' class contains 116170 rows.  For each value of wfid
there are no more than 7584 different values of wnr (with median of 1
value per wfid and only in about 1000 cases 10 values or more).

Class 'i' contains 47016 rows.  For each ilid there are no more than
481 different ifids (median 2, 800 cases with 10 or more).
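For the record, the frequency figures above can be reproduced with a
simple aggregate over the loaded tables (nothing here beyond the
schema shown earlier; the exact counts of course come from my data):

```sql
-- Rows sharing each wfid, largest groups first
select wfid, count(*) as cnt
from w
group by wfid
order by cnt desc;

-- wfid values occurring 10 or more times
-- (count the rows this returns)
select wfid, count(*) as cnt
from w
group by wfid
having count(*) >= 10;
```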

Now consider the following two queries:

TEST1:
----------------------------------------------------------------------
explain
select wnr, wfid
from w
where wfid in (57868, 99050, 121041)
order by wfid, wnr;
----------------------------------------------------------------------
On PostgreSQL 6.5.3 on Linux 2.2.17 this shows:
----------------------------------------------------------------------
Sort  (cost=6.15 rows=5 width=8)
  ->  Index Scan using w_wfid, w_wfid, w_wfid on w  (cost=6.15 rows=5 width=8)
----------------------------------------------------------------------
while on PostgreSQL 7.0.3 on Linux and 7.0.0 on M$ Windows:
----------------------------------------------------------------------
Sort  (cost=2860.73..2860.73 rows=3450 width=8)
  ->  Seq Scan on w  (cost=0.00..2657.98 rows=3450 width=8)
----------------------------------------------------------------------

We can estimate the number of rows at no more than 3% of the table's
content (and usually much, much less; this particular query returns 5
rows).  So using an index here does not seem a strange idea, yet
7.0.x does not do it.
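For what it's worth, 7.0 does let a session discourage sequential
scans, which is handy purely as a diagnostic to see what plan the
index would produce (enable_seqscan is a documented 7.0 run-time
setting; this does not fix the underlying estimation problem):

```sql
-- Diagnostic only: penalize sequential scans for this session
set enable_seqscan to off;

explain
select wnr, wfid
from w
where wfid in (57868, 99050, 121041)
order by wfid, wnr;

-- restore the default afterwards
set enable_seqscan to on;
```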

TEST2:
----------------------------------------------------------------------
explain
select wnr, wfid
from w, i
where wfid=ifid and ilid=99050
order by wfid, wnr;
----------------------------------------------------------------------
On PostgreSQL 6.5.3:
----------------------------------------------------------------------
Sort  (cost=6.15 rows=5 width=12)
  ->  Nested Loop  (cost=6.15 rows=5 width=12)
        ->  Index Scan using i_ilid on i  (cost=2.05 rows=2 width=4)
        ->  Index Scan using w_wfid on w  (cost=2.05 rows=116170 width=8)
----------------------------------------------------------------------
On PostgreSQL 7.0.3 on Linux and 7.0.0 on M$ Windows:
----------------------------------------------------------------------
Sort  (cost=77288.43..77288.43 rows=546185 width=12)
  ->  Merge Join  (cost=311.44..7899.66 rows=546185 width=12)
        ->  Index Scan using w_wfid on w  (cost=0.00..6130.21 rows=116170 width=8)
        ->  Sort  (cost=311.44..311.44 rows=470 width=4)
              ->  Index Scan using i_ilid on i  (cost=0.00..290.57 rows=470 width=4)
----------------------------------------------------------------------

In the typical case this query returns less than 1% of the rows from
the join.  So why not use the index?

The 'cost' values in 7.0.x are really pleasing to the eye.  And it
does in fact run as slowly as it predicts.  TEST2 on Postgres 6.5
runs in less than a second; on 7.0 on the same machine it takes 25
seconds.  This makes PostgreSQL 7.0 completely useless to me, since
my real query joining 5 tables runs for hours instead of completing
in seconds.

Am I doing something wrong, or is this a bug?  Is there a way to work
around it?

The queries above share a feature with Marcin's example: they require
looking up more than one value in an index.  Is this a problem for
7.0?
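One untested workaround for the multi-value case would be to split
the lookup into single-value probes, each of which the planner can
satisfy with one scan of w_wfid, and glue the results together:

```sql
-- Each branch probes the w_wfid index for a single value;
-- union all combines the branches without deduplication
select wnr, wfid from w where wfid = 57868
union all
select wnr, wfid from w where wfid = 99050
union all
select wnr, wfid from w where wfid = 121041
order by wfid, wnr;
```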

If you want to play with my data I can send all of it (~2MB), or I
can try to prepare a smaller test set.

BTW, how can I learn to read these query plans?


Regards (and thanks for an otherwise splendid product)
Marcin

----------------------------------------------------------------------
 Marcin Woli\'nski                        mailto:wolinski@mimuw.edu.pl
                                     http://www.mimuw.edu.pl/~wolinski
----------------------------------------------------------------------

Re: 7.0.x not using indices

From: Tom Lane
Marcin Wolinski <wolinski@mimuw.edu.pl> writes:
> After this the 'w' class contains 116170 rows.  For each value of wfid
> there are no more than 7584 different values of wnr (with median of 1
> value per wfid and only in about 1000 cases 10 values or more).

The problem here is the huge differential between the typical and
maximum frequency of wfid values.  Can you do something to get rid of
the outlier with 7584 values?

See past pghackers discussions about the need for more extensive
statistics to allow improvements in row-count estimation.  As long as
the only stat we have is the frequency of the most common value,
there's no way not to get fooled by this sort of data distribution :-(

> BTW. How can I learn to read these query plans?

See
http://www.postgresql.org/devel-corner/docs/postgres/performance-tips.html#USING-EXPLAIN

            regards, tom lane

Re: 7.0.x not using indices

From: Marcin Wolinski
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Marcin Wolinski <wolinski@mimuw.edu.pl> writes:
> > After this the 'w' class contains 116170 rows.  For each value of wfid
> > there are no more than 7584 different values of wnr (with median of 1
> > value per wfid and only in about 1000 cases 10 values or more).
>
> The problem here is the huge differential between the typical and
> maximum frequency of wfid values.  Can you do something to get rid of
> the outlier with 7584 values?

I've made some deletions in my test tables.  Now w contains 51560
records, no more than 10 records per wfid (median 1).  Table i
contains 33317 records, no more than 10 per ilid (median 1).

Under these circumstances
PostgreSQL 7.0.3 on i686-pc-linux-gnu, compiled by gcc 2.95.2
decided to use an index for TEST1.  However, for TEST2, which is

select wnr, wfid
from w, i
where wfid=ifid and ilid=99050
order by wfid, wnr;

it selects precisely the same query plan (though the estimated cost
has dropped significantly):

Sort  (cost=19886.17..19886.17 rows=171782 width=12)
  ->  Merge Join  (cost=214.83..3302.03 rows=171782 width=12)
        ->  Index Scan using w_wfid on w  (cost=0.00..2438.53 rows=51560 width=8)
        ->  Sort  (cost=214.83..214.83 rows=333 width=4)
              ->  Index Scan using i_ilid on i  (cost=0.00..200.87 rows=333 width=4)

When run with
    time psql test <test2.sql
it shows
real    0m2.960s
(I know this is terribly rough).

I've tried to make Postgres use the plan that 6.5.3 selected.  First
I used:
    set enable_mergejoin to off;
and PostgreSQL responded with

Sort  (cost=62183.01..62183.01 rows=171782 width=12)
  ->  Hash Join  (cost=201.70..45598.87 rows=171782 width=12)
        ->  Seq Scan on w  (cost=0.00..793.60 rows=51560 width=8)
        ->  Hash  (cost=200.87..200.87 rows=333 width=4)
              ->  Index Scan using i_ilid on i  (cost=0.00..200.87 rows=333 width=4)

and the time was
real    0m0.893s

Then I added
    set enable_hashjoin to off;
which produced 6.5.3's plan:

Sort  (cost=127071.79..127071.79 rows=171782 width=12)
  ->  Nested Loop  (cost=0.00..110487.65 rows=171782 width=12)
        ->  Index Scan using i_ilid on i  (cost=0.00..200.87 rows=333 width=4)
        ->  Index Scan using w_wfid on w  (cost=0.00..324.58 rows=516 width=8)

and
real    0m0.326s

To summarize: when forced to use the plan 6.5.3 selected, 7.0 runs
ten times faster than with the plan it thinks is best, yet it
estimates that plan's cost to be ten times higher than the ``best''
one...

I'm still not convinced there isn't some silly bug in there.  Why
were 6.5.3's estimates so much better?  It had no more information to
base its predictions on.

And with my original data, why doesn't 7.0 decide to use an index to
retrieve no more than 21000 rows out of 100000?  Is it a matter of
tweaking the COST_* variables?
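For completeness, the cost-related knobs are ordinary per-session
run-time settings in 7.0 (the names below are as documented for that
release; the values are purely illustrative, not recommendations):

```sql
-- Index fetches are costed as random page reads; 7.0's default
-- random_page_cost is 4.0, so lowering it favors index scans
set random_page_cost to 2;

-- Per-tuple CPU cost settings also exist (7.0 defaults shown)
set cpu_tuple_cost to 0.01;
set cpu_index_tuple_cost to 0.001;
```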


Regards
Marcin
