From Simon Riggs
Subject Re: Improving N-Distinct estimation by ANALYZE
Msg-id 1136420665.21025.223.camel@localhost.localdomain
In response to Re: Improving N-Distinct estimation by ANALYZE  ("Jim C. Nasby" <jnasby@pervasive.com>)
Responses Re: Improving N-Distinct estimation by ANALYZE  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
On Wed, 2006-01-04 at 17:57 -0600, Jim C. Nasby wrote:
> On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote:
> > 3. We should also apply multi-column heuristics to the estimation of D,
> > once we have estimated all columns. For column groups (pairs, triples,
> > etc.) that form part of a PK, we know it must be true that D1 *
> > D2 * ... * Dk >= N. In many cases we will be very confident of our
> > estimate when we decide D = d; i.e. when we have two columns, we can
> > use this to infer that D1 >= N/d when D2 = d. So we can do this in any
> > case where we have confident estimates of all but one column; the
> > required information is available at that time.
> > e.g. if line_item has primary key (l_orderkey, l_linenumber) and we
> > know that there are at most 10 l_linenumber values in the table, then
> > there should be at least N/10 values for l_orderkey, so raise the
> > estimate to that value if it is currently lower (only ever raise it).
> 
> Sorry if I'm pointing out the obvious, but I would do this for any
> unique index, not just a PK. (It should still hold for any unique index,
> right?)

Yes. It's just less common to have more than one unique index.
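
To make the clamp concrete, here is a quick sketch in Python (made-up
names and numbers, not the ANALYZE code itself): for a unique index over
k columns on a table of N rows, any single column's estimate can be
raised to N divided by the product of the other columns' estimates, and
is only ever raised, never lowered.

    from math import prod

    def clamp_ndistinct(n_rows, d_estimates):
        # For a unique index over these columns, d1 * d2 * ... * dk >= N
        # must hold, so each per-column estimate can be raised to the
        # implied lower bound once the others are trusted.
        # Illustrative sketch only.
        clamped = list(d_estimates)
        for i in range(len(clamped)):
            others = prod(d for j, d in enumerate(clamped) if j != i)
            lower_bound = n_rows / others
            if clamped[i] < lower_bound:   # raise only, never lower
                clamped[i] = lower_bound
        return clamped

    # line_item example with made-up figures: 6M rows, at most 10
    # l_linenumber values, so l_orderkey must have at least
    # 6M / 10 = 600k distinct values.
    print(clamp_ndistinct(6_000_000, [100_000, 10]))  # [600000.0, 10]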

> Also, was an approach of sampling random rows within random blocks
> considered? Something like:
> 
> until row sample size reached
>     read random block
>     sample x% of rows in that block randomly
> done

Yes. 

I was trying to maintain the existing approach as much as possible,
rather than ripping everything out and starting again, which could cause
just as many problems as it solves. So evolution, rather than
revolution.

The approach I suggested uses the existing technique for selecting
random blocks, then applies either an exhaustive check of all the rows
in a block or the existing random-row approach, depending upon available
memory. We need to check all of the rows in a reasonable sample of
blocks, otherwise we might miss clusters of rows in large tables; that
clustering is the source of the problems identified.
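
As a rough sketch of that per-block decision (Python again, with
hypothetical names and interfaces; the real code lives in
src/backend/commands/analyze.c):

    import random

    def sample_rows(read_block, n_blocks, target_rows, mem_budget_rows):
        # Pick random blocks (as ANALYZE already does), then keep every
        # row in the block when memory allows, catching value clusters
        # that per-row sampling can miss; otherwise fall back to sampling
        # rows within the block.  Illustrative sketch only.
        sample = []
        while len(sample) < target_rows:
            rows = read_block(random.randrange(n_blocks))
            if len(sample) + len(rows) <= mem_budget_rows:
                sample.extend(rows)            # exhaustive scan of block
            else:
                k = min(target_rows - len(sample), len(rows))
                sample.extend(random.sample(rows, k))  # random rows only
        return sample[:target_rows]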

The other reason was to increase the sample size, which is a win in any
form of statistics.

Best Regards, Simon Riggs


