Re: Correlation in cost_index() - Mailing list pgsql-hackers

From Manfred Koizar
Subject Re: Correlation in cost_index()
Date
Msg-id ks5rpu8nf7ervanu2tfb4i6aabbr6nsinq@4ax.com
Whole thread Raw
In response to Re: Correlation in cost_index()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, 03 Oct 2002 14:50:00 -0400, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>> indexCorrelation is calculated by dividing the correlation of the
>> first index column by the number of index columns.
>
>Yeah, I concluded later that that was bogus.  I've been thinking of
>just using the correlation of the first index column and ignoring
>the rest; that would not be great, but it's probably better than what's
>there.  Have you got a better formula?

Unfortunately not.  I think such a formula does not exist for the
information we have.  What we'd need is a notion of correlation of the
nth (n > 1) index column for constant values of the first n-1 index
columns; or a combined correlation for the first n index columns (1 <
n <= number of index columns).

I try to understand the problem with the help of use cases.
[ Jump to the end, if this looks to long-winded. ]

1)  Have a look at invoice items with an index on (fyear, invno,
itemno).  Invoice numbers start at 1 for each financial year, item
numbers start at 1 for each invoice.  In a typical scenario
correlations for fyear, (fyear, invno), and (fyear, invno, itemno) are
close to 1;  invno correlation is expected to be low; and itemno looks
totally chaotic to the analyzer.

When we SELECT * FROM item WHERE fyear = 2001 AND invno < 1000
dividing the correlation of the first column by the number of columns
gives 1/3 which has nothing to do with what we want.  (And then the
current implementation of cost_index() squares this and gets 1/9 which
is even farther away from the truth.)  Just using the correlation of
the first index column seems right here.

2)  OTOH consider bookkeeping with enties identified by (fyear,
account, entryno).  Again fyear has a correlation near 1.  For account
we can expect something near 0, and entryno has a distribution
comparable to invno in the first use case, i.e. starting at 1 for each
year.
SELECT * from entry WHERE fyear = 2001 AND account = whatever
Taking the correlation of fyear would imply that the tuples we are
looking for are close to each other, which again can turn out to be
wrong.

So what do we know now?  Even less than before :-(

I have just one idea that might help a bit:  If correlation of the
first index column is near +/1, cost_index() should not use
baserel->pages, but baserel->pages * selectivity of the first column.

ServusManfred


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Bad rules
Next
From: Greg Copeland
Date:
Subject: Re: Threaded Sorting