Home > mailing lists

Re: Correlation in cost_index() - Mailing list pgsql-hackers

From	Sean Chittenden
Subject	Re: Correlation in cost_index()
Date	August 8, 2003 22:10:18
Msg-id	20030808221006.GQ94710@perrin.int.nxad.com Whole thread Raw
In response to	Re: Correlation in cost_index() (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Correlation in cost_index() Re: Correlation in cost_index()
List	pgsql-hackers

Tree view

> > Which suggests to me that line 3964 in
> > ./src/backend/utils/adt/selfuncs.c isn't right for multi-column
> > indexes, esp for indexes that are clustered.  I don't know how to
> > address this though...  Tom, any hints?
> 
> Yes, we knew that already.  Oliver had suggested simply dropping the
> division by nKeys, thus pretending that the first-column correlation
> is close enough.  That seems to me to be going too far in the other
> direction,

But is it really?

> xbut clearly dividing by nKeys is far too pessimistic.  I'd change
> this in a moment if someone could point me to a formula with any
> basis at all ...

Got it, alright.  I'd never paid attention to prior discussions as the
planner had generally did the right thing (with a lowered
random_page_cost ::grin::).  In terms of statistics and setting
indexCorrelation correctly, something like Spearman's rho calculation
comes to mind, though I don't know how applicable that is to database
theory.

indexCorrelation is 1.0 for the 1st key in a multi-column index.  The
only thing different about a multi-column index and a single column
index is the multi-column index takes up more space per key, resulting
in fewer index entries per page and more pages being fetched than
would be in a single column index, but the current cost_index()
function takes increased number of page fetches into account when
calculating cost.  As things stand, however, if a multi-column key is
used, the indexCorrelation is penalized by the size of the number of
keys found in the multi-column index.  As things stand the qual
user_id = 42, on a CLUSTER'ed multi-column index (user_id,utc_date)
has an indexCorrelation of 0.5, when in fact the correlation is 1.0.
indexCorrelation == number of random page fetches, which could be next
to free on a solid state drive, in this case, the page fetches aren't
random, they're perfectly sequential.  If it were 'user_id = 42 AND
utc_date = NOW()', the correlation of a lookup of the user_id would
still be 1.0 and the utc_date would be 1.0 because both values are
looked up in the index key.  A lookup of just the utc_date can never
use the index and the planner correctly uses a sequential scan.  Cost
!= Correlation.  They're proportional, but not the same and
indexCorrelation is the wrong place to handle cost as that's done by
the Mackert and Lohman formula.  Under what circumstances would it be
correct to pessimize the use of indexCorrelation?  An indexCorrelation
of 0.0 doesn't mean that the index is useless either, just that we
take the full hit of a completely random page read as opposed to some
fraction of a random page cost.

I tossed a different index on my test table to see how well things
fare with a low correlation, and this was a bit disturbing:

# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT;
INFO:  cost_seqscan: run_cost: 21472.687500       startup_cost: 0.000000

INFO:  cost_index: run_cost: 112165.065458       startup_cost: 0.000000       indexCorrelation: 0.183380
                                          QUERY PLAN
 

----------------------------------------------------------------------------------------------------------------------------------Seq
Scanon report_user_cat_count rucc  (cost=0.00..21472.69 rows=31893 width=64) (actual time=444.25..2489.27 rows=514
loops=1) Filter: (html_bytes > 8000000::bigint)Total runtime: 2492.36 msec
 
(3 rows)

# SET enable_seqscan = false;
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE rucc.html_bytes > 8000000::BIGINT;
INFO:  cost_seqscan: run_cost: 21472.687500       startup_cost: 100000000.000000

INFO:  cost_index: run_cost: 112165.065458       startup_cost: 0.000000       indexCorrelation: 0.183380
                                                               QUERY PLAN
 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------Index
Scanusing report_user_cat_count_html_bytes_idx on report_user_cat_count rucc  (cost=0.00..112165.07 rows=31893
width=64)(actual time=68.64..85.75 rows=514 loops=1)  Index Cond: (html_bytes > 8000000::bigint)Total runtime: 97.75
msec
(3 rows)


*shrug* A low indexCorrelation overly pessimizes the cost of an index,
but I'm not sure where to attribute this too.  :-/

-sc

-- 
Sean Chittenden

pgsql-hackers by date:

From: Tom Lane
Date: 08 August 2003, 21:35:15
Subject: Re: WITH HOLD and pooled connections

From: Tom Lane
Date: 08 August 2003, 22:26:00
Subject: Re: Correlation in cost_index()

Re: Correlation in cost_index() - Mailing list pgsql-hackers

Previous

Next