Re: Correlation in cost_index() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Correlation in cost_index()
Date
Msg-id 1418.1060381541@sss.pgh.pa.us
Whole thread Raw
In response to Re: Correlation in cost_index()  (Sean Chittenden <sean@chittenden.org>)
Responses Re: Correlation in cost_index()
List pgsql-hackers
Sean Chittenden <sean@chittenden.org> writes:
> indexCorrelation is 1.0 for the 1st key in a multi-column index.

... only if it's perfectly correlated.

> As things stand, however, if a multi-column key is
> used, the indexCorrelation is penalized by the size of the number of
> keys found in the multi-column index.  As things stand the qual
> user_id = 42, on a CLUSTER'ed multi-column index (user_id,utc_date)
> has an indexCorrelation of 0.5, when in fact the correlation is 1.0.

Right, in the perfectly-correlated case this calculation is clearly
wrong.  However, what of cases where the first column shows good
correlation with the physical ordering, but the second does not?

The nasty part of this is that the correlation stat that ANALYZE
computed for the second column is of no value to us.  Two examples:
X    Y                X    Y
A    A                A    BA    B                A    CA    C                A    AB    A                B    AB    B
             B    CB    C                B    BC    A                C    CC    B                C    AC    C
    C    B
 

In both cases ANALYZE will calculate correlation 1.0 for column X,
and something near zero for column Y.  We would like to come out with
index correlation 1.0 for the left-hand case and something much less
(but, perhaps, not zero) for the right-hand case.  I don't really see
a way to do this without actually examining the multi-column ordering
relationship during ANALYZE.

> I tossed a different index on my test table to see how well things
> fare with a low correlation, and this was a bit disturbing:

Seems like most of the error in that estimate has to do with the poor
rowcount estimation.  There's very little percentage in trying to
analyze the effect of index correlation in examples where we don't have
the first-order stats correct ...
        regards, tom lane


pgsql-hackers by date:

Previous
From: Sean Chittenden
Date:
Subject: Re: Correlation in cost_index()
Next
From: Manfred Koizar
Date:
Subject: Re: Correlation in cost_index()