Re: Correlation in cost_index() - Mailing list pgsql-hackers

From Manfred Koizar
Subject Re: Correlation in cost_index()
Date
Msg-id ruu7jvo5vnjl064rm327ndqml0pmvgdsfa@4ax.com
Whole thread Raw
In response to Re: Correlation in cost_index()  (Sean Chittenden <sean@chittenden.org>)
Responses Re: Correlation in cost_index()
List pgsql-hackers
On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden
<sean@chittenden.org> wrote:
>[...] it'd seem as though an avg depth of
>nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
>would be closer than where we are now...

Index depth does not belong here because we walk down the index only
once per index scan not once per tuple.  It might be part of the
startup cost.

The rest of your formula doesn't seem right, too, because you get
higher costs for better correlation.  Did you meanrandom_io_cost * (1 - abs(indexCorrelation))?

FWIW, for small effective_cache_size max_IO_cost is almost equal to
tuples_fetched * random_page_cost.  So your formula (with the
corrections presumed above) boils down to ignoring
effective_cache_size and linear interpolation between 0 and
max_IO_cost.

>It's very possible that cost_index() is wrong, but it seems as though
>after some testing as if PostgreSQL _overly_ _favors_ the use of
>indexes:

Was this an unpatched backend?  What were the values of
effective_cache_size and random_page_cost?

># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME
ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>        startup_cost: 0.000000
>        indexCorrelation: 0.999729
>                                                                                    QUERY PLAN

>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc  (cost=0.00..21154.31
rows=705954width=64) (actual time=91.36..6625.79 rows=704840 loops=1)
 
>   Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 11292.68 msec
>(3 rows)

"actual time=91.36..6625.79" but "Total runtime: 11292.68 msec"!
Where did those 4.7 seconds go?

># SET enable_seqscan = true; SET enable_indexscan = false;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME
ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>        startup_cost: 100000000.000000
>        indexCorrelation: 0.999729
>                                                              QUERY PLAN

>---------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19
rows=704840loops=1)
 
>   Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 10506.44 msec
>(3 rows)

Same here: "actual time=1091.45..7441.19" but "Total runtime: 10506.44
msec"  -  more than 3 seconds lost.

When we ignore total runtime and look at actual time we get
            seq    idx
estimated   21473  21154
actual       7441   6626

This doesn't look too bad, IMHO.

BTW, I believe that with your example (single-column index, almost
perfect correlation, index cond selects almost all tuples) all
interpolation methods give an index cost estimation that is very close
to seq scan cost, and the actual runtimes show that this is correct.

>Which I find surprising and humorous given the popular belief is, mine
>included, contrary to those results.

How many tuples are in report_user_cat_count?  What are the stats for
report_user_cat_count.utc_date?

>  I can say with pretty high
>confidence that the patch to use a geometric mean isn't correct after
>having done real world testing as its break even point is vastly
>incorrect and only uses an index when there are less than 9,000 rows
>to fetch, a far cry from the 490K break even I found while testing.

Could you elaborate, please.  The intention of my patch was to favour
index scans more than the current implementation.  If it does not, you
have found a bug in my patch.  Did you test the other interpolation
methods?

>What I did find interesting, however, was that it does work better at
>determining the use of multi-column indexes,

Yes, because it computes the correlation for a two-column-index ascorrelation_of_first_index_column * 0.95
instead ofcorrelation_of_first_index_column / 2

> but I think that's
>because the geometric mean pessimizes the value of indexCorrelation,
>which gets pretty skewed when using a multi-column index.

I don't understand this.

># CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id,utc_date);
># CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
># ANALYZE report_user_cat_count;
># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date >
'2002-01-01'::TIMESTAMPWITH TIME ZONE;
 
>INFO:  cost_seqscan: run_cost: 23685.025000
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>        startup_cost: 0.000000
>        indexCorrelation: 0.500000                          ^^^
This is certainly not with my patch.  The current implementation gives
ca. 366000 for pages_fetched = 122000 and random_page_cost = 4, which
looks plausible for 133000 tuples and (too?) small
effective_cache_size.

>                                                             QUERY PLAN

>------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85
rows=129941loops=1)
 
>   Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 6649.21 msec
>(3 rows)
>
># SET enable_seqscan = false; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date >
'2002-01-01'::TIMESTAMPWITH TIME ZONE;
 
>INFO:  cost_seqscan: run_cost: 23685.025000
>        startup_cost: 100000000.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>        startup_cost: 0.000000
>        indexCorrelation: 0.500000
>                                                                                       QUERY PLAN

>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc  (cost=0.00..366295.02
rows=133918width=64) (actual time=53.91..3110.42 rows=129941 loops=1)
 
>   Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 3667.47 msec
>(3 rows)

Which shows that Postgres does not "_overly_ _favor_ the use of
indexes".

>If I manually set the indexCorrelation to 1.0, however, the planner
>chooses the right plan on its own

Ok, with indexCorrelation == 1.0 we dont have to discuss interpolation
methods, because they all return min_IO_cost.

>Which suggests to me that line 3964 in
>./src/backend/utils/adt/selfuncs.c isn't right for multi-column
>indexes, esp for indexes that are clustered.

Agreed.

>  I don't know how to
>address this though...

I guess there is no chance without index statistics.

>FWIW, this is an old data/schema from a defunct project that I can
>give people access to if they'd like. 

Is there a dump available for download?

ServusManfred


pgsql-hackers by date:

Previous
From: Bruno Wolff III
Date:
Subject: contrib/vacuumlo problem in cvs
Next
From: Bruce Momjian
Date:
Subject: Re: contrib/vacuumlo problem in cvs