On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:
> Since I don't recall any ideas ever having been thrown out on how to do
> this...
>
> ISTM that we could gain additional insight on how many rows would likely
> result from a join
One thing we can do is to use cross-column relationships to improve the
estimation of Ndistinct.
If we have a table

  Order_line (PK: orderId, lineNum)

and we look at lineNum and see it has only about 10 distinct values, we
can use that to compute that orderId's Ndistinct should be -0.1: since
(orderId, lineNum) is unique, each orderId can repeat at most ~10 times,
so the number of distinct orderIds is proportional to the number of
rows, with a factor of 10.
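To make the arithmetic concrete, here's a toy calculation in Python
(all figures invented for illustration):

    # Hypothetical figures: a 1,000,000-row Order_line table whose PK
    # is (orderId, lineNum), where lineNum takes ~10 distinct values.
    ntuples    = 1_000_000
    nd_linenum = 10

    # The PK is unique, so each orderId can appear at most nd_linenum
    # times, which puts a floor under the number of distinct orderIds:
    min_nd_orderid = ntuples / nd_linenum            # 100,000

    # Expressed as a negative stadistinct (a fraction of the row count):
    stadistinct_orderid = -min_nd_orderid / ntuples  # -0.1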
Right now, if there are more than 10 lineNums per orderId on average,
distinct orderIds fall below the 10%-of-rows heuristic and we never
decide that orderId is a scalable statistic (i.e. a negative
stadistinct that tracks the row count).
I propose adding a final step to ANALYZE that applies a cross-column
rule after all columns have been analysed. If all except one column of
a PK have very low Ndistinct, we can use them to calculate a minimum
Ndistinct for the remaining column: since the PK is unique, the product
of the per-column Ndistinct values must be at least the number of rows.
If the in-isolation Ndistinct estimate for that column is less than
this minimum, we overlay the new value.
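Here's a minimal sketch of that final step, in Python rather than
backend C; the function name, the dict-of-estimates interface and the
"low Ndistinct" threshold are all made up for illustration:

    def cross_column_floor(pk_cols, ntuples, nd, low_threshold=100):
        # nd maps each PK column to its in-isolation Ndistinct estimate.
        high = [c for c in pk_cols if nd[c] > low_threshold]
        if len(high) != 1:
            return nd        # rule fires only with exactly one "high" column
        target = high[0]
        # The PK is unique, so the product of per-column Ndistinct values
        # must be at least ntuples; that bounds the high column from below.
        floor = ntuples
        for c in pk_cols:
            if c != target:
                floor /= nd[c]
        if nd[target] < floor:   # overlay only if the estimate fell short
            nd = dict(nd, **{target: floor})
        return nd

    stats = cross_column_floor(['orderId', 'lineNum'], 1_000_000,
                               {'orderId': 30_000, 'lineNum': 10})
    # stats['orderId'] is raised from 30,000 to the floor of 100,000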
This is desirable because the estimation of Ndistinct is very sensitive
to how often each value appears in the sample, so the estimates are
usually very poor when the true Ndistinct is large. The estimates for
low Ndistinct are much better, so we can use them, with their lower
standard error, to correct the in-isolation estimates of other columns.
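To see why, here's a toy sampling experiment (an assumed uniform
distribution over a fixed sample size, not the real ANALYZE sampler):

    import random

    # Draw a 30,000-row sample from a column spread evenly over
    # `ndistinct` values and count the distinct values the sample sees.
    def distinct_seen(ndistinct, sample_rows=30_000):
        return len({random.randrange(ndistinct) for _ in range(sample_rows)})

    print(distinct_seen(10))       # ~10: each value seen thousands of times
    print(distinct_seen(100_000))  # ~26,000: most values seen once or never

With only about a quarter of the 100,000 values visible in the sample,
any estimator has to extrapolate heavily from per-value counts of 0 or
1; the 10-value column, by contrast, is observed almost exactly.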
--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com