Re: Correlation in cost_index() - Mailing list pgsql-hackers
From | Sean Chittenden |
---|---|
Subject | Re: Correlation in cost_index() |
Date | |
Msg-id | 20030808221006.GQ94710@perrin.int.nxad.com Whole thread Raw |
In response to | Re: Correlation in cost_index() (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Correlation in cost_index()
Re: Correlation in cost_index() |
List | pgsql-hackers |
> > Which suggests to me that line 3964 in > > ./src/backend/utils/adt/selfuncs.c isn't right for multi-column > > indexes, esp for indexes that are clustered. I don't know how to > > address this though... Tom, any hints? > > Yes, we knew that already. Oliver had suggested simply dropping the > division by nKeys, thus pretending that the first-column correlation > is close enough. That seems to me to be going too far in the other > direction, But is it really? > xbut clearly dividing by nKeys is far too pessimistic. I'd change > this in a moment if someone could point me to a formula with any > basis at all ... Got it, alright. I'd never paid attention to prior discussions as the planner had generally did the right thing (with a lowered random_page_cost ::grin::). In terms of statistics and setting indexCorrelation correctly, something like Spearman's rho calculation comes to mind, though I don't know how applicable that is to database theory. indexCorrelation is 1.0 for the 1st key in a multi-column index. The only thing different about a multi-column index and a single column index is the multi-column index takes up more space per key, resulting in fewer index entries per page and more pages being fetched than would be in a single column index, but the current cost_index() function takes increased number of page fetches into account when calculating cost. As things stand, however, if a multi-column key is used, the indexCorrelation is penalized by the size of the number of keys found in the multi-column index. As things stand the qual user_id = 42, on a CLUSTER'ed multi-column index (user_id,utc_date) has an indexCorrelation of 0.5, when in fact the correlation is 1.0. indexCorrelation == number of random page fetches, which could be next to free on a solid state drive, in this case, the page fetches aren't random, they're perfectly sequential. If it were 'user_id = 42 AND utc_date = NOW()', the correlation of a lookup of the user_id would still be 1.0 and the utc_date would be 1.0 because both values are looked up in the index key. A lookup of just the utc_date can never use the index and the planner correctly uses a sequential scan. Cost != Correlation. They're proportional, but not the same and indexCorrelation is the wrong place to handle cost as that's done by the Mackert and Lohman formula. Under what circumstances would it be correct to pessimize the use of indexCorrelation? An indexCorrelation of 0.0 doesn't mean that the index is useless either, just that we take the full hit of a completely random page read as opposed to some fraction of a random page cost. I tossed a different index on my test table to see how well things fare with a low correlation, and this was a bit disturbing: # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT; INFO: cost_seqscan: run_cost: 21472.687500 startup_cost: 0.000000 INFO: cost_index: run_cost: 112165.065458 startup_cost: 0.000000 indexCorrelation: 0.183380 QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------Seq Scanon report_user_cat_count rucc (cost=0.00..21472.69 rows=31893 width=64) (actual time=444.25..2489.27 rows=514 loops=1) Filter: (html_bytes > 8000000::bigint)Total runtime: 2492.36 msec (3 rows) # SET enable_seqscan = false; SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT; INFO: cost_seqscan: run_cost: 21472.687500 startup_cost: 100000000.000000 INFO: cost_index: run_cost: 112165.065458 startup_cost: 0.000000 indexCorrelation: 0.183380 QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index Scanusing report_user_cat_count_html_bytes_idx on report_user_cat_count rucc (cost=0.00..112165.07 rows=31893 width=64)(actual time=68.64..85.75 rows=514 loops=1) Index Cond: (html_bytes > 8000000::bigint)Total runtime: 97.75 msec (3 rows) *shrug* A low indexCorrelation overly pessimizes the cost of an index, but I'm not sure where to attribute this too. :-/ -sc -- Sean Chittenden
pgsql-hackers by date: