Re: Inconsistant use of index.
From: Ron Mayer
Subject: Re: Inconsistant use of index.
Date:
Msg-id: Pine.LNX.4.33.0203262144520.16667-100000@ron
In response to: Re: Inconsistant use of index. (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
First off, thanks to everyone on the list who suggested useful workarounds - I want to start by saying that with those workarounds my application is working wonderfully again. Anyway, here's some more information about the "=" vs. "<= and >=" question I had earlier today...

On Tue, 26 Mar 2002, Tom Lane wrote:

> Ron Mayer <ron@intervideo.com> writes:
>> I'm particularly interested in the correlation estimate for the dat
>> column. [...]
>
> [...]
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> [...]
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing. [...]
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?

Correlation is 1.0, but the optimizer still does not want to use the index.

I tried two extreme setups: one with the optimal ordering suggested above, and one with an exceptionally poor ordering (sorted by time of day, so that every day probably appears in every possible block). As expected, pg_stats shows a correlation of 1.0 for the good ordering and -0.00133352 for the poor one.

============================================================
= logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
= SELECT
= logs2=# CREATE TABLE fact_by_tim AS SELECT * FROM fact ORDER BY tim;
= SELECT
= logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
= CREATE
= logs2=# CREATE INDEX fact_by_tim__dat ON fact_by_tim(dat);
= CREATE
= logs2=# vacuum analyze fact_by_dat;
= VACUUM
= logs2=# vacuum analyze fact_by_tim;
= VACUUM
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE: QUERY PLAN:
= Aggregate (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
=   -> Seq Scan on fact_by_dat (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
= Total runtime: 77785.28 msec
= EXPLAIN
= logs2=# explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE: QUERY PLAN:
= Aggregate (cost=380341.09..380341.09 rows=1 width=0) (actual time=79308.22..79308.22 rows=1 loops=1)
=   -> Seq Scan on fact_by_tim (cost=0.00..379816.25 rows=209934 width=0) (actual time=24.35..78929.68 rows=180295 loops=1)
= Total runtime: 79308.35 msec
= EXPLAIN
= logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
=  correlation
= -------------
=            1
= (1 row)
=
= logs2=# select correlation from pg_stats where tablename='fact_by_tim' and attname='dat';
=  correlation
= -------------
=  -0.00133352
= (1 row)
============================================================

In neither case did the planner use the index.
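(Incidentally, this is where the "=" vs. "<= and >=" oddity from the top of this message comes in. A sketch of the two predicate forms, with an illustrative date - in my application's queries the equality form got the sequential scan, while the logically equivalent range form got the index scan:)

    -- equality form: for me the planner chose a sequential scan
    SELECT count(*) FROM fact WHERE dat = '2002-03-01';

    -- range form, equivalent for this predicate: the planner
    -- chose the index scan
    SELECT count(*) FROM fact
     WHERE dat >= '2002-03-01' AND dat <= '2002-03-01';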
However, as shown below, in the case where the table was ordered by date the index would have helped a huge amount, while in the case where it was ordered by time using the index hurt a huge amount.

============================================================
= logs2=# set enable_seqscan to off;
= SET VARIABLE
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE: QUERY PLAN:
=
= Aggregate (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
=   -> Index Scan using fact_by_dat__dat on fact_by_dat (cost=0.00..4974.99 rows=211036 width=0) (actual time=90.24..2339.64 rows=180295 loops=1)
= Total runtime: 2693.87 msec
=
= EXPLAIN
= logs2=# explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE: QUERY PLAN:
=
= Aggregate (cost=837849.27..837849.27 rows=1 width=0) (actual time=410705.02..410705.02 rows=1 loops=1)
=   -> Index Scan using fact_by_tim__dat on fact_by_tim (cost=0.00..837324.43 rows=209934 width=0) (actual time=56.14..410271.50 rows=180295 loops=1)
= Total runtime: 410705.17 msec
=
= EXPLAIN
= logs2=#
============================================================

So with the ideally ordered table the index would have helped by roughly a factor of 30 (2.7 seconds vs. 77 seconds), but with the bad ordering it hurt by roughly a factor of 5 (411 seconds vs. 79 seconds).

Very interesting... Just for my own education, could you bear with me for a few questions from a relative novice:

*) Should the optimizer choose a plan that uses the index if the
   correlation is high enough?

*) Instead of the overall correlation across the whole table, would a
   better metric be the average correlation of the data within each page?
   Then it could recognize that while I had a low overall correlation
   (because I loaded Feb, then Jan, then Mar), within each block the
   correlation was good.

*) If the optimizer sees one alternative that may take anywhere from 3 to
   400 seconds, and another that will take a very consistent 80 seconds,
   which is better: the first, because it could give a 30x speedup against
   only a 5x slowdown, or the second, because the first risks being 320
   seconds slower in exchange for being only 77 seconds faster? Or do I
   simply not understand optimizers at all. :-)

*) Any reason why "a=b" and "a<=b and a>=b" would act differently? I'm
   not complaining - it seems like a convenient way to give the optimizer
   the hint that helps in my case :-) - but it seems surprising.

Anyway, as I said before, thanks for all the workarounds that got me up and running again! I'd be glad to dig further into these queries if anyone thinks more experimentation would be helpful.

   Thanks,
   Ron
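P.S. In case anyone wants to reproduce this, here's the well-ordered half of the experiment collapsed into one script - just a sketch using the same names as above, with '2002-03-01' as the illustrative date:

    -- rebuild a copy of the table in date order; after the ANALYZE,
    -- pg_stats should show correlation 1.0 for dat
    CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
    CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
    VACUUM ANALYZE fact_by_dat;

    -- what the planner thinks of the physical ordering
    SELECT correlation FROM pg_stats
     WHERE tablename = 'fact_by_dat' AND attname = 'dat';

    -- compare the chosen plans with and without seq scans allowed
    EXPLAIN ANALYZE SELECT count(*) FROM fact_by_dat WHERE dat = '2002-03-01';
    SET enable_seqscan TO off;
    EXPLAIN ANALYZE SELECT count(*) FROM fact_by_dat WHERE dat = '2002-03-01';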