Re: Correlation in cost_index() - Mailing list pgsql-hackers

From Sean Chittenden
Subject Re: Correlation in cost_index()
Date
Msg-id 20030808180656.GL94710@perrin.int.nxad.com
Whole thread Raw
In response to Re: Correlation in cost_index()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Correlation in cost_index()
Re: Correlation in cost_index()
List pgsql-hackers
> > Hrm, after an hour of searching and reading, I think one of the
> > better papers on the subject can be found here:
> > http://www.cs.ust.hk/faculty/dimitris/PAPERS/TKDE-NNmodels.pdf
> 
> Interesting paper, but I don't see the connection to index order
> correlation?

Nothing that I found was nearly that specific, as close as I could
find was the paper above on calculating the cost of fetching data from
a disk, which I thought was the bigger problem at hand, but I
digress...

In one paper about large dimension index searches, they did suggest
that cost was cumulative for the number of disk reads or nodes in the
tree that weren't held in cache, which was the biggest hint that I had
found on this specific topic.  With that as a guiding light (or
something faintly resembling it), it'd seem as though an avg depth of
nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
would be closer than where we are now... but now also think I/we're
barking up the right tree with this thread.

It's very possible that cost_index() is wrong, but it seems as though
after some testing as if PostgreSQL _overly_ _favors_ the use of
indexes:

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500       startup_cost: 0.000000

INFO:  cost_index: run_cost: 21154.308116       startup_cost: 0.000000       indexCorrelation: 0.999729
                                                                 QUERY PLAN
 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc  (cost=0.00..21154.31 rows=705954
width=64)(actual time=91.36..6625.79 rows=704840 loops=1)  Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp
withtime zone)Total runtime: 11292.68 msec
 
(3 rows)

# SET enable_seqscan = true; SET enable_indexscan = false;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > '2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500       startup_cost: 0.000000

INFO:  cost_index: run_cost: 21154.308116       startup_cost: 100000000.000000       indexCorrelation: 0.999729
                                                   QUERY PLAN
 

---------------------------------------------------------------------------------------------------------------------------------------Seq
Scanon report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) (actual time=1091.45..7441.19 rows=704840
loops=1) Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)Total runtime: 10506.44 msec
 
(3 rows)


Which I find surprising and humorous given the popular belief is, mine
included, contrary to those results.  I can say with pretty high
confidence that the patch to use a geometric mean isn't correct after
having done real world testing as its break even point is vastly
incorrect and only uses an index when there are less than 9,000 rows
to fetch, a far cry from the 490K break even I found while testing.
What I did find interesting, however, was that it does work better at
determining the use of multi-column indexes, but I think that's
because the geometric mean pessimizes the value of indexCorrelation,
which gets pretty skewed when using a multi-column index.

# CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count (user_id,utc_date);
# CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
# ANALYZE report_user_cat_count;
# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date >
'2002-01-01'::TIMESTAMPWITH TIME ZONE;
 
INFO:  cost_seqscan: run_cost: 23685.025000       startup_cost: 0.000000

INFO:  cost_index: run_cost: 366295.018684       startup_cost: 0.000000       indexCorrelation: 0.500000
                                           QUERY PLAN
 

------------------------------------------------------------------------------------------------------------------------------------Seq
Scanon report_user_cat_count rucc  (cost=0.00..23685.03 rows=133918 width=64) (actual time=0.28..6100.85 rows=129941
loops=1) Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with time zone))Total runtime:
6649.21msec
 
(3 rows)

# SET enable_seqscan = false; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date >
'2002-01-01'::TIMESTAMPWITH TIME ZONE;
 
INFO:  cost_seqscan: run_cost: 23685.025000       startup_cost: 100000000.000000

INFO:  cost_index: run_cost: 366295.018684       startup_cost: 0.000000       indexCorrelation: 0.500000
                                                                     QUERY PLAN
 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc  (cost=0.00..366295.02 rows=133918
width=64)(actual time=53.91..3110.42 rows=129941 loops=1)  Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01
00:00:00-08'::timestampwith time zone))Total runtime: 3667.47 msec
 
(3 rows)


If I manually set the indexCorrelation to 1.0, however, the planner
chooses the right plan on its own, which is in effect what setting a
higher random_page_cost had been compensating for, a poorly determined
indexCorrelation.

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND utc_date >
'2002-01-01'::TIMESTAMPWITH TIME ZONE;
 
INFO:  cost_seqscan: run_cost: 23685.025000       startup_cost: 0.000000

INFO:  cost_index: run_cost: 4161.684248       startup_cost: 0.000000       indexCorrelation: 1.000000
                                                                  QUERY PLAN
 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing report_user_cat_count_utc_date_user_id_idx on report_user_cat_count rucc  (cost=0.00..4161.68 rows=133918
width=64)(actual time=0.67..1176.63 rows=129941 loops=1)  Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01
00:00:00-08'::timestampwith time zone))Total runtime: 1705.40 msec
 
(3 rows)


Which suggests to me that line 3964 in
./src/backend/utils/adt/selfuncs.c isn't right for multi-column
indexes, esp for indexes that are clustered.  I don't know how to
address this though...  Tom, any hints?

FWIW, this is an old data/schema from a defunct project that I can
give people access to if they'd like. -sc

-- 
Sean Chittenden


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: TODO items
Next
From: Alvaro Herrera
Date:
Subject: Re: new psql \d command