Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From	Manfred Koizar
Subject	Re: Improving N-Distinct estimation by ANALYZE
Date	January 16, 2006 16:25:02
Msg-id	50dns19bc4cs20tgg7aq4d3l80ph1f6vjq@4ax.com
In response to	Re: Improving N-Distinct estimation by ANALYZE (Simon Riggs <simon@2ndquadrant.com>)
Responses	Re: Improving N-Distinct estimation by ANALYZE
List	pgsql-hackers

On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
<simon@2ndquadrant.com> wrote:
>I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof.  Just some nitpicking:

|!  * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two stage sampling at that time.

|   * a random sample of targrows rows (or less, if there are less in the
|!  * sample of blocks). In this case, targblocks is always the same as
|!  * targrows, so we always read one row per block.

This is just wrong, unless you add "on average".  Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.
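
To illustrate the difference between rows read and rows used, here is a minimal sketch of reservoir sampling. It uses the simple Algorithm R, not Vitter's faster skipping variant, and it is not the code in analyze.c; the function name reservoir_sample and its parameters are made up for this example, and rand() with a modulo is only good enough for a demonstration.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: keep a uniform random sample of up to targrows
 * values out of nrows candidates (Algorithm R).  Every candidate row is
 * read, but at most targrows of them end up in the sample.
 * Returns the number of rows kept; *rows_read receives the number of
 * rows that were looked at.
 */
static size_t
reservoir_sample(const int *rows, size_t nrows,
                 int *sample, size_t targrows, size_t *rows_read)
{
    size_t kept = 0;
    size_t i;

    *rows_read = 0;
    for (i = 0; i < nrows; i++)
    {
        (*rows_read)++;
        if (kept < targrows)
        {
            /* Reservoir not full yet: take the row unconditionally. */
            sample[kept++] = rows[i];
        }
        else
        {
            /* Replace a random slot with probability targrows / (i + 1). */
            size_t j = (size_t) rand() % (i + 1);

            if (j < targrows)
                sample[j] = rows[i];
        }
    }
    return kept;
}
```

With 100 candidate rows and targrows = 10 this reads 100 rows and keeps 10, which is why "one row per block" describes at best an average over what is kept, not what is read.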

|   * Although every row has an equal chance of ending up in the final
|   * sample, this sampling method is not perfect: not every possible
|   * sample has an equal chance of being selected.  For large relations
|   * the number of different blocks represented by the sample tends to be
|!  * too small.  In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

|   * block.  The previous sampling method put too much credence in the row
|   * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
previous == before May 2004 == before two stage sampling.

|+         /* Assume that we will have rows at least 64 bytes wide 
|+          * Currently we're very unlikely to overflow availMem here 
|+          */
|+         if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying

	if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above; and it is something different if
the size of a pointer is not four bytes.
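
The pointer-size dependence can be checked with a small sketch. The helper names patch_check and rewritten_check are invented for this illustration, and the pointer size is passed in as a parameter (standing in for sizeof(HeapTuple)) so that both the 4-byte and the 8-byte case can be tried on any machine. Overflow of the multiplications is ignored.

```c
#include <stddef.h>

/* The test as written in the patch: pointer array vs. 1/16 of the budget. */
static int
patch_check(size_t allocrows, size_t ptr_size, size_t allowed_mem)
{
    return (allocrows * ptr_size) > (allowed_mem >> 4);
}

/* The rewritten form: one pointer plus 60 bytes per row vs. the budget. */
static int
rewritten_check(size_t allocrows, size_t ptr_size, size_t allowed_mem)
{
    return (allocrows * (ptr_size + 60)) > allowed_mem;
}
```

With ptr_size = 4 both forms reduce to allocrows * 64 > allowed_mem and always agree. With ptr_size = 8 the patch's form effectively assumes 128 bytes per row while the rewritten form assumes 68, so for allocrows = 10 and allowed_mem = 1024 the first fires and the second does not.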

|-     if (bs.m > 0)
|+     if (bs.m > 0 )

Oops.

|+         ereport(DEBUG2,
|+             (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u", 
|+                         stats->tupattnum,samplerows, nmultiple, f1, d)));
^                          missing space here and in some more places
 

I haven't been following the discussion too closely but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?
Servus
 Manfred
