Re: Correlation in cost_index() - Mailing list pgsql-hackers

From:           Manfred Koizar
Subject:        Re: Correlation in cost_index()
Msg-id:         ruu7jvo5vnjl064rm327ndqml0pmvgdsfa@4ax.com
In response to: Re: Correlation in cost_index() (Sean Chittenden <sean@chittenden.org>)
Responses:      Re: Correlation in cost_index()
List:           pgsql-hackers
On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden <sean@chittenden.org>
wrote:
>[...] it'd seem as though an avg depth of
>nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
>would be closer than where we are now...

Index depth does not belong here, because we walk down the index only
once per index scan, not once per tuple.  It might be part of the
startup cost.

The rest of your formula doesn't seem right either, because it gives
higher costs for better correlation.  Did you mean
	random_io_cost * (1 - abs(indexCorrelation))?

FWIW, for small effective_cache_size, max_IO_cost is almost equal to
tuples_fetched * random_page_cost.  So your formula (with the
correction presumed above) boils down to ignoring effective_cache_size
and interpolating linearly between 0 and max_IO_cost.

>It's very possible that cost_index() is wrong, but it seems as though
>after some testing as if PostgreSQL _overly_ _favors_ the use of
>indexes:

Was this an unpatched backend?  What were the values of
effective_cache_size and random_page_cost?

># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>       startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>       startup_cost: 0.000000
>       indexCorrelation: 0.999729
>                                   QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc  (cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 loops=1)
>   Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 11292.68 msec
>(3 rows)

"actual time=91.36..6625.79", but "Total runtime: 11292.68 msec"!
Where did those 4.7 seconds go?
># SET enable_seqscan = true; SET enable_indexscan = false;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>       startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>       startup_cost: 100000000.000000
>       indexCorrelation: 0.999729
>                QUERY PLAN
>---------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19 rows=704840 loops=1)
>   Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 10506.44 msec
>(3 rows)

Same here: "actual time=1091.45..7441.19" but "Total runtime: 10506.44
msec" -- more than 3 seconds lost.

When we ignore total runtime and look at actual time, we get

                  seq     idx
    estimated   21473   21154
    actual       7441    6626

This doesn't look too bad, IMHO.

BTW, I believe that with your example (single-column index, almost
perfect correlation, index cond selects almost all tuples) all
interpolation methods give an index cost estimation that is very close
to the seq scan cost, and the actual runtimes show that this is
correct.

>Which I find surprising and humorous given the popular belief is, mine
>included, contrary to those results.

How many tuples are in report_user_cat_count?  What are the stats for
report_user_cat_count.utc_date?

>                                     I can say with pretty high
>confidence that the patch to use a geometric mean isn't correct after
>having done real world testing as its break even point is vastly
>incorrect and only uses an index when there are less than 9,000 rows
>to fetch, a far cry from the 490K break even I found while testing.

Could you elaborate, please?  The intention of my patch was to favour
index scans more than the current implementation does.  If it does not,
you have found a bug in my patch.
Did you test the other interpolation methods?

>What I did find interesting, however, was that it does work better at
>determining the use of multi-column indexes,

Yes, because it computes the correlation for a two-column index as
	correlation_of_first_index_column * 0.95
instead of
	correlation_of_first_index_column / 2

>                                             but I think that's
>because the geometric mean pessimizes the value of indexCorrelation,
>which gets pretty skewed when using a multi-column index.

I don't understand this.

># CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id, utc_date);
># CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
># ANALYZE report_user_cat_count;
># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 23685.025000
>       startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>       startup_cost: 0.000000
>       indexCorrelation: 0.500000
                          ^^^
This is certainly not with my patch.  The current implementation gives
ca. 366000 for pages_fetched = 122000 and random_page_cost = 4, which
looks plausible for 133000 tuples and a (too?) small
effective_cache_size.
>                                   QUERY PLAN
>------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85 rows=129941 loops=1)
>   Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 6649.21 msec
>(3 rows)
>
># SET enable_seqscan = false; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 23685.025000
>       startup_cost: 100000000.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>       startup_cost: 0.000000
>       indexCorrelation: 0.500000
>                  QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc  (cost=0.00..366295.02 rows=133918 width=64) (actual time=53.91..3110.42 rows=129941 loops=1)
>   Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))
> Total runtime: 3667.47 msec
>(3 rows)

Which shows that Postgres does not "_overly_ _favor_ the use of
indexes".

>If I manually set the indexCorrelation to 1.0, however, the planner
>chooses the right plan on its own

Ok, with indexCorrelation == 1.0 we don't have to discuss interpolation
methods, because they all return min_IO_cost.

>Which suggests to me that line 3964 in
>./src/backend/utils/adt/selfuncs.c isn't right for multi-column
>indexes, esp for indexes that are clustered.

Agreed.

>                                             I don't know how to
>address this though...

I guess there is no chance without index statistics.

>FWIW, this is an old data/schema from a defunct project that I can
>give people access to if they'd like.
Is there a dump available for download?

Servus
 Manfred