Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 1137458470.3180.228.camel@localhost.localdomain
Whole thread Raw
In response to Re: Improving N-Distinct estimation by ANALYZE  (Manfred Koizar <mkoi-pg@aon.at>)
List pgsql-hackers
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote:
> On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
> <simon@2ndquadrant.com> wrote:
> >I enclose a patch for checking out block sampling.
> 
> Can't comment on the merits of block sampling and your implementation
> thereof.  

But your thoughts are welcome...

> |!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create
> 
> Linking the use of the Vitter algorithm to May 2004 is not quite
> appropriate.  We introduced two stage sampling at that time.

I'll redo some of the comments as you suggest and review coding points.

> |   * a random sample of targrows rows (or less, if there are less in the
> |!  * sample of blocks). In this case, targblocks is always the same as
> |!  * targrows, so we always read one row per block.
> 
> This is just wrong, unless you add "on average".  Even then it is a
> bit misleading, because in most cases we *read* more tuples than we
> use.

The current code reads a number of blocks == targrows, hence "one per
block" but I'll change the comment.

> |   * Although every row has an equal chance of ending up in the final
> |   * sample, this sampling method is not perfect: not every possible
> |   * sample has an equal chance of being selected.  For large relations
> |   * the number of different blocks represented by the sample tends to be
> |!  * too small.  In that case, block sampling should be used.
> 
> Is the last sentence a fact or personal opinion?

That is my contention, based upon research. The existing code clearly
identifies that case as requiring a solution. We use Chaudhuri et al's
1998 result for the sufficiency of sample size, yet overlook his
estimators and his ideas for random block selection.

My first point is that we need proportional sample size, not fixed size.
[I show that for large tables we currently use a sample that is 2-5
times too small, even based on Chaudhuri's work.]

ANALYZE is very I/O bound: random row sampling would need to scan the
whole table to take a 2-3% sample in most cases. Block sampling is a
realistic way of getting a larger sample size without loss of
performance, and also a way of getting a reasonable sample when
accessing very few blocks.

We would keep random row sampling for smaller relations where the
performance effects of sample collection are not noticeable.

> I haven't been following the discussion too closely but didn't you say
> that a block sampling algorithm somehow compensates for the fact that
> the sample is not random?

I haven't said that block sampling compensates for the sample being
non-random. There is a danger of bias in the sample and I made that
clear. I've tried to mitigate that by the use of random blocks, reusing
your code and taking note of the earlier problems you solved. However,
block sampling allows us to spot cases where rows are naturally grouped
together, which row sampling doesn't spot until very high sample sizes.
I explored in detail a common case where this causes the current method
to fall down very badly. 

Brutlag & Richardson [2000] give a good run down on block sampling and
I'm looking to implement some of his block estimators to complement the
current patch. (Credit to Andrew Dunstan for locating the paper...)
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps

[I did briefly explore the possibility of hybrid row/block sampling,
using row sampling as well some block sampling to identify grouping,
with an attempt to avoid swamping the sample with biased rows; but I
don't have a strong basis for that, so I've ignored that for now.]

This patch allows us to explore the possible bias that block sampling
gives, allowing us to make a later decision to include it, or not.

Best Regards, Simon Riggs



pgsql-hackers by date:

Previous
From: Michael Glaesemann
Date:
Subject: Re: [pgsql-www] source documentation tool doxygen
Next
From: Robert Treat
Date:
Subject: Re: [pgsql-www] source documentation tool doxygen