Thread: Cross-table statistics idea
Since I don't recall any ideas ever having been thrown out on how to do this...

ISTM that we could gain additional insight into how many rows would likely result from a join by comparing the "shape" of the histograms for the joining columns. For example, if the histogram arrays were exactly identical, we're essentially stuck looking at the ratio of reltuples between the two tables. (AFAIK that's the only estimate we make today.) If one histogram ended at a value smaller than the start of the other histogram, we would estimate that no rows would result from an equi-join.

Am I right about how our estimates work right now? Where can I look in the code? Has anyone looked down this path in the past?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
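To make the no-overlap case concrete, here is a minimal sketch in C of the comparison being proposed. All the type and function names (Histogram, histograms_disjoint, equijoin_rows_estimate) are invented for illustration, and the fallback shown is the textbook ntuples_a * ntuples_b / max(nd_a, nd_b) equi-join estimate, not anything taken from the PostgreSQL source:

    #include <math.h>
    #include <stdbool.h>

    typedef struct Histogram
    {
        double *bounds;     /* sorted bucket boundaries, length nbounds */
        int     nbounds;
    } Histogram;

    /* true when the two histogram ranges cannot contain a common value */
    static bool
    histograms_disjoint(const Histogram *a, const Histogram *b)
    {
        double a_max = a->bounds[a->nbounds - 1];
        double b_max = b->bounds[b->nbounds - 1];

        return a_max < b->bounds[0] || b_max < a->bounds[0];
    }

    /* estimated output rows for an equi-join of the two columns */
    static double
    equijoin_rows_estimate(const Histogram *a, double atuples, double nd_a,
                           const Histogram *b, double btuples, double nd_b)
    {
        if (histograms_disjoint(a, b))
            return 0.0;     /* one range ends before the other begins */

        /* fallback: the standard textbook equi-join estimate */
        return atuples * btuples / fmax(nd_a, nd_b);
    }

Note that the disjointness test only needs the first and last histogram entries, which is why even this crude comparison would be cheap.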
"Jim C. Nasby" <jim@nasby.net> writes: > ISTM that we could gain additional insight on how many rows would likely > result from a join be comparing the "shape" of the histogram for the > joining columns. eqjoinsel already does this for the case of comparing the MCV lists. If you're serious about using the histograms: well, maybe, but it seems to require far stronger assumptions about the behavior of the datatype than we use now. > Am I right about how our estimates work right now? Where can I look in > the code? Has anyone looked down this path in the past? src/backend/utils/adt/selfuncs.c regards, tom lane
On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:
> Since I don't recall any ideas ever having been thrown out on how to
> do this...
>
> ISTM that we could gain additional insight into how many rows would
> likely result from a join

One thing we can do is to use cross-column relationships to improve the estimation of Ndistinct.

Suppose we have a table

    Order_line (PK orderId, lineNum)

and we look at lineNum and see it has on average 10 distinct values per orderId. We can then use this to compute that Ndistinct for orderId should be -0.1, i.e. the number of distinct values is proportional to the number of rows, with a factor of 10.

Right now, if there are more than 10 lineNums per orderId on average, we never decide that orderId is a scalable statistic.

I propose adding a final step to ANALYZE that applies a cross-column rule after all columns have been analysed. If all except one column of a PK have very low Ndistinct, we can use that to calculate a minimum Ndistinct for the column with a high number of values. If that minimum is greater than the Ndistinct estimate made in isolation, we overlay the new value.

This is desirable because the estimation of Ndistinct is very sensitive to the number of matching rows in the sample, so Ndistinct estimates are usually very poor when the true Ndistinct is large. The estimates for low Ndistinct are much better, so we can use them, with their lower standard error, to correct the in-isolation estimates of other columns.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
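A sketch of the proposed overlay rule for the two-column case, using pg_statistic's stadistinct convention (a positive value is an absolute distinct count; a negative value means -stadistinct * ntuples, i.e. the count scales with table size). The function names are hypothetical, and this ignores details like null fractions and PKs with more than two columns:

    /* decode a stadistinct-style value into an absolute distinct count */
    static double
    absolute_ndistinct(double stadistinct, double ntuples)
    {
        return stadistinct < 0 ? -stadistinct * ntuples : stadistinct;
    }

    /*
     * For a PK (a, b), uniqueness of the pair implies
     *     ndistinct(a) * ndistinct(b) >= ntuples
     * so ndistinct(a) >= ntuples / ndistinct(b).  If that floor beats
     * the estimate made in isolation, overlay it.  Returns the result
     * in scaled (negative) form: with 10 lineNums per orderId, the
     * answer for orderId comes out as -0.1.
     */
    static double
    overlay_pk_ndistinct(double stadistinct_a, double stadistinct_b,
                         double ntuples)
    {
        double nd_b    = absolute_ndistinct(stadistinct_b, ntuples);
        double floor_a = ntuples / nd_b;    /* assumes nd_b > 0 */
        double nd_a    = absolute_ndistinct(stadistinct_a, ntuples);

        if (floor_a > nd_a)
            nd_a = floor_a;     /* in-isolation estimate was too low */

        return -(nd_a / ntuples);
    }

For simplicity this always returns the scaled form; the real ANALYZE only stores a negative stadistinct when it decides the statistic scales with the table.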
On Wed, Sep 27, 2006 at 12:30:43PM +0100, Simon Riggs wrote:
> On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:
> > Since I don't recall any ideas ever having been thrown out on how to
> > do this...
> >
> > ISTM that we could gain additional insight into how many rows would
> > likely result from a join
>
> One thing we can do is to use cross-column relationships to improve
> the estimation of Ndistinct.
>
> Suppose we have a table
>
>     Order_line (PK orderId, lineNum)
>
> and we look at lineNum and see it has on average 10 distinct values
> per orderId. We can then use this to compute that Ndistinct for
> orderId should be -0.1, i.e. the number of distinct values is
> proportional to the number of rows, with a factor of 10.
>
> Right now, if there are more than 10 lineNums per orderId on average,
> we never decide that orderId is a scalable statistic.
>
> I propose adding a final step to ANALYZE that applies a cross-column
> rule after all columns have been analysed. If all except one column of
> a PK have very low Ndistinct, we can use that to calculate a minimum
> Ndistinct for the column with a high number of values. If that minimum
> is greater than the Ndistinct estimate made in isolation, we overlay
> the new value.
>
> This is desirable because the estimation of Ndistinct is very
> sensitive to the number of matching rows in the sample, so Ndistinct
> estimates are usually very poor when the true Ndistinct is large. The
> estimates for low Ndistinct are much better, so we can use them, with
> their lower standard error, to correct the in-isolation estimates of
> other columns.

But wouldn't overlaying the value screw us if we wanted to look up something based on the unique field? (i.e., if there was a line_id in order_line and we wanted to look something up based on line_id)
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
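The reason an overlaid value could matter for single-column lookups is that equality selectivity on a non-MCV value is roughly 1/Ndistinct, so whatever number ends up stored feeds straight into the row estimate. A toy illustration of the concern, with made-up numbers:

    #include <stdio.h>

    int main(void)
    {
        double ntuples     = 1000000.0;
        double true_nd     = 1000000.0;  /* line_id really is unique */
        double overlaid_nd = 100000.0;   /* hypothetical bad overlay */

        /* rows expected from "WHERE line_id = ?" under each statistic */
        printf("true stats:     %.0f row(s)\n", ntuples / true_nd);
        printf("overlaid stats: %.0f row(s)\n", ntuples / overlaid_nd);
        return 0;
    }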