Re: Correlation in cost_index() - Mailing list pgsql-hackers
From | Sean Chittenden |
---|---|
Subject | Re: Correlation in cost_index() |
Date | |
Msg-id | 20030808180656.GL94710@perrin.int.nxad.com Whole thread Raw |
In response to | Re: Correlation in cost_index() (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Correlation in cost_index()
Re: Correlation in cost_index() |
List | pgsql-hackers |
> > Hrm, after an hour of searching and reading, I think one of the > > better papers on the subject can be found here: > > http://www.cs.ust.hk/faculty/dimitris/PAPERS/TKDE-NNmodels.pdf > > Interesting paper, but I don't see the connection to index order > correlation? Nothing that I found was nearly that specific, as close as I could find was the paper above on calculating the cost of fetching data from a disk, which I thought was the bigger problem at hand, but I digress... In one paper about large dimension index searches, they did suggest that cost was cumulative for the number of disk reads or nodes in the tree that weren't held in cache, which was the biggest hint that I had found on this specific topic. With that as a guiding light (or something faintly resembling it), it'd seem as though an avg depth of nodes in index * tuples_fetched * (random_io_cost * indexCorrelation) would be closer than where we are now... but now also think I/we're barking up the right tree with this thread. It's very possible that cost_index() is wrong, but it seems as though after some testing as if PostgreSQL _overly_ _favors_ the use of indexes: # SET enable_seqscan = true; SET enable_indexscan = true; SET SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE; INFO: cost_seqscan: run_cost: 21472.687500 startup_cost: 0.000000 INFO: cost_index: run_cost: 21154.308116 startup_cost: 0.000000 indexCorrelation: 0.999729 QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index Scanusing report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc (cost=0.00..21154.31 rows=705954 width=64)(actual time=91.36..6625.79 rows=704840 loops=1) Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp withtime zone)Total runtime: 11292.68 msec (3 rows) # SET enable_seqscan = true; SET enable_indexscan = false; SET SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE; INFO: cost_seqscan: run_cost: 21472.687500 startup_cost: 0.000000 INFO: cost_index: run_cost: 21154.308116 startup_cost: 100000000.000000 indexCorrelation: 0.999729 QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------Seq Scanon report_user_cat_count rucc (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19 rows=704840 loops=1) Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)Total runtime: 10506.44 msec (3 rows) Which I find surprising and humorous given the popular belief is, mine included, contrary to those results. I can say with pretty high confidence that the patch to use a geometric mean isn't correct after having done real world testing as its break even point is vastly incorrect and only uses an index when there are less than 9,000 rows to fetch, a far cry from the 490K break even I found while testing. What I did find interesting, however, was that it does work better at determining the use of multi-column indexes, but I think that's because the geometric mean pessimizes the value of indexCorrelation, which gets pretty skewed when using a multi-column index. # CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id,utc_date); # CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count; # ANALYZE report_user_cat_count; # SET enable_seqscan = true; SET enable_indexscan = true; SET SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMPWITH TIME ZONE; INFO: cost_seqscan: run_cost: 23685.025000 startup_cost: 0.000000 INFO: cost_index: run_cost: 366295.018684 startup_cost: 0.000000 indexCorrelation: 0.500000 QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------Seq Scanon report_user_cat_count rucc (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85 rows=129941 loops=1) Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))Total runtime: 6649.21msec (3 rows) # SET enable_seqscan = false; SET enable_indexscan = true; SET SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMPWITH TIME ZONE; INFO: cost_seqscan: run_cost: 23685.025000 startup_cost: 100000000.000000 INFO: cost_index: run_cost: 366295.018684 startup_cost: 0.000000 indexCorrelation: 0.500000 QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index Scanusing report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc (cost=0.00..366295.02 rows=133918 width=64)(actual time=53.91..3110.42 rows=129941 loops=1) Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestampwith time zone))Total runtime: 3667.47 msec (3 rows) If I manually set the indexCorrelation to 1.0, however, the planner chooses the right plan on its own, which is in effect what setting a higher random_page_cost had been compensating for, a poorly determined indexCorrelation. # SET enable_seqscan = true; SET enable_indexscan = true; SET SET # EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date > '2002-01-01'::TIMESTAMPWITH TIME ZONE; INFO: cost_seqscan: run_cost: 23685.025000 startup_cost: 0.000000 INFO: cost_index: run_cost: 4161.684248 startup_cost: 0.000000 indexCorrelation: 1.000000 QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index Scanusing report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc (cost=0.00..4161.68 rows=133918 width=64)(actual time=0.67..1176.63 rows=129941 loops=1) Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestampwith time zone))Total runtime: 1705.40 msec (3 rows) Which suggests to me that line 3964 in ./src/backend/utils/adt/selfuncs.c isn't right for multi-column indexes, esp for indexes that are clustered. I don't know how to address this though... Tom, any hints? FWIW, this is an old data/schema from a defunct project that I can give people access to if they'd like. -sc -- Sean Chittenden
pgsql-hackers by date: