Re: Inconsistant use of index. - Mailing list pgsql-bugs

From Ron Mayer
Subject Re: Inconsistant use of index.
Date
Msg-id Pine.LNX.4.33.0203262144520.16667-100000@ron
Whole thread Raw
In response to Re: Inconsistant use of index.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
First off, thanks to everyone on the list who suggested useful workarounds
to me - and I wanted to start off by saying that with the workarounds my
application is working wonderfully again.

Anyway, here's some more information about the "=" vs. "<= and >=" question
I had earlier today...


On Tue, 26 Mar 2002, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> >> I'm particularly interested in the correlation estimate for the dat
> >> column.  [...]
> >
> > [...]
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> [...]
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing. [...]
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?

Correlation is 1.0, but the optimizer still does not want to use
the index.

I tried two different extreme attempts.... one with the optimal
ordering suggested above, and one with an exceptionally poor ordering
(sorted by time of the day ... so that every day probably appears in
every possible block).  As expected, pg_stats shows the good ordering
has a correlation of "1.0", and the poor ordering has a correlation
of "-0.00133352".


============================================================
= logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
= SELECT
= logs2=# CREATE TABLE fact_by_tim AS SELECT * FROM fact ORDER BY tim;
= SELECT
= logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
= CREATE
= logs2=# CREATE INDEX fact_by_tim__dat ON fact_by_tim(dat);
= CREATE
= logs2=# vacuum analyze fact_by_dat;
= VACUUM
= logs2=# vacuum analyze fact_by_tim;
= VACUUM
= logs2=#  explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
= Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
=   ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295
loops=1)
= Total runtime: 77785.28 msec
= EXPLAIN
= logs2=#  explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
= Aggregate  (cost=380341.09..380341.09 rows=1 width=0) (actual time=79308.22..79308.22 rows=1 loops=1)
=   ->  Seq Scan on fact_by_tim  (cost=0.00..379816.25 rows=209934 width=0) (actual time=24.35..78929.68 rows=180295
loops=1)
= Total runtime: 79308.35 msec
= EXPLAIN
= logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
=  correlation
= -------------
=            1
= (1 row)
=
= logs2=# select correlation from pg_stats where tablename='fact_by_tim' and attname='dat';
=  correlation
= -------------
=  -0.00133352
= (1 row)
=
============================================================

In neither case did it use the index.  However as shown below, in the case
where it was ordered by date the index would have helped a huge amount, while
in the case where it was ordered by time using the index hurts a huge amount.

============================================================
= logs2=# set enable_seqscan to off;
= SET VARIABLE
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
=   ->  Index Scan using fact_by_dat__dat on fact_by_dat  (cost=0.00..4974.99 rows=211036 width=0) (actual
time=90.24..2339.64rows=180295 loops=1) 
= Total runtime: 2693.87 msec
=
= EXPLAIN
= logs2=#  explain analyze select count(*) from fact_by_tim where
= dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=837849.27..837849.27 rows=1 width=0) (actual time=410705.02..410705.02 rows=1 loops=1)
=   ->  Index Scan using fact_by_tim__dat on fact_by_tim  (cost=0.00..837324.43 rows=209934 width=0) (actual
time=56.14..410271.50rows=180295 loops=1) 
= Total runtime: 410705.17 msec
=
= EXPLAIN
= logs2=#
=====================================================================


So with the ideally ordered table the index would have helped by a
factor of 30 (2.7 seconds vs. 77 seconds)... but with the bad
ordering it hurt by a factor of 5 (411 seconds vs. 79 seconds).

Very interesting...

Just for my own education, could you bare with me for a few
questions from a relative novice...

   *) Should the optimizer choose a plan that uses the index
      if the correlation is high enough?

   *) Instead of the overall correlation across the whole table,
      would a better metric be the average correlation for data
      within each page?   Then it could recognize that while I
      had a low overall correlation because I loaded Feb, then
      Jan, then Mar ... within each block the correlation was good.

   *) If the optimizer sees one alternative that may take
      from 3 - 400 seconds, and another that will produce
      a very consistent 80 seconds, is the first better because
      it could result in a 30x speedup vs. only 5x slowdown, or
      is the second better, because it could result in
      only 77sec faster vs 320 second slower.

      Or do I simply not understand optimizers at all. :-)

   *) Any reason why "a=b" and "a<=b and a>=b" would act
      differently?  I'm not complaining, because it seems
      like a convenient way to give the optimzer-hint that
      helps in my case :-), but it seems surprising.

Anyway, as I said before, thanks for all the workarounds that
got me up and running again!  I'd be glad to continue to look
if people want more information about these queries if anyone
thinks any more experimentation would be helpful.

  Thanks,
  Ron

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: Inconsistant use of index.
Next
From: pgsql-bugs@postgresql.org
Date:
Subject: Bug #625: bytea data type problem/bug