Re: serious under-estimation of n_distinct for clustered distributions - Mailing list pgsql-performance

From Stefan Andreatta
Subject Re: serious under-estimation of n_distinct for clustered distributions
Date
Msg-id 50E081C4.2040004@synedra.com
In response to Re: serious under-estimation of n_distinct for clustered distributions  (Peter Geoghegan <peter@2ndquadrant.com>)
Responses Re: serious under-estimation of n_distinct for clustered distributions  (Stefan Andreatta <s.andreatta@synedra.com>)
List pgsql-performance
On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
> On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com> wrote:
>> Now, the 2005 discussion goes into great detail on the advantages and
>> disadvantages of this algorithm, particularly when using small sample sizes,
>> and several alternatives are discussed. I do not know whether anything has
>> been changed after that, but I know that the specific problem, which I
>> will focus on here, still persists.
>
> It's a really hard problem to solve satisfactorily. It's a problem
> that has been studied in much detail. Yes, the algorithm used is still
> the same. See the comments within src/backend/commands/analyze.c (IBM
> Research Report RJ 10025 is referenced there).

Thanks a lot for this information! I looked through the code a bit. The
Haas & Stokes formula is fine. The problem really lies with the two-phase
random selection procedure:

Starting from line 1039, there is a comment:
  * As of May 2004 we use a new two-stage method: Stage one selects up
  * to targrows random blocks (or all blocks, if there aren't so many).
  * Stage two scans these blocks and uses the Vitter algorithm to create
  * a random sample of targrows rows (or less, if there are less in the
  * sample of blocks). The two stages are executed simultaneously: each
  * block is processed as soon as stage one returns its number and while
  * the rows are read stage two controls which ones are to be inserted
  * into the sample.
  *
  * Although every row has an equal chance of ending up in the final
  * sample, this sampling method is not perfect: not every possible
  * sample has an equal chance of being selected. For large relations
  * the number of different blocks represented by the sample tends to be
  * too small. We can live with that for now. Improvements are welcome.
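
To make the mechanics concrete, here is a minimal Python sketch of that
two-stage idea - a toy model of my own, not the actual C implementation
(the function name, table layout and parameters are invented, and plain
reservoir sampling stands in for the Vitter algorithm):

import random

def two_stage_sample(table, targrows):
    """table: list of blocks (pages), each a list of row values."""
    # Stage one: select up to targrows distinct random blocks
    # (or all blocks, if there are not that many).
    block_ids = random.sample(range(len(table)), min(targrows, len(table)))

    # Stage two: reservoir-sample targrows rows from the chosen blocks,
    # processing each block as soon as stage one has picked it.
    sample, seen = [], 0
    for b in block_ids:
        for row in table[b]:
            seen += 1
            if len(sample) < targrows:
                sample.append(row)
            else:
                j = random.randrange(seen)   # uniform in [0, seen)
                if j < targrows:             # keep with prob. targrows/seen
                    sample[j] = row
    return sample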


Now the problem with clustered data is that the probability of sampling
a value twice is much higher when the same page is repeatedly sampled.
As stage one takes a random sample of pages, and stage two samples rows
only from these pages, the probability of drawing two rows (or more)
from the same page is much higher than if random rows were selected from
the whole table. Hence we get many more duplicate values for clustered
data, and we end up with the severe under-estimation we can see in those
cases.

Probabilities do my brain in, as usual, but I tested the procedure on
my test data with a simple Python script. There is absolutely nothing
wrong with the implementation; it seems to be a purely statistical problem.
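
To give a flavour of that check, here is a stripped-down version of the
kind of script one can run, reusing the two_stage_sample() sketch from
above (the table layout and all numbers are invented for illustration;
my actual script differs in detail):

import random

N_BLOCKS, ROWS_PER_BLOCK, TARGROWS = 10000, 100, 3000

# 20,000 distinct values, 50 rows each; "clustered" packs equal values
# together, "shuffled" spreads the same values in random order.
values = [i // 50 for i in range(N_BLOCKS * ROWS_PER_BLOCK)]

def build_table(vals):
    """Split a flat list of row values into pages of ROWS_PER_BLOCK rows."""
    return [vals[i:i + ROWS_PER_BLOCK]
            for i in range(0, len(vals), ROWS_PER_BLOCK)]

clustered = build_table(values)
shuffled = build_table(random.sample(values, len(values)))

for name, table in (("clustered", clustered), ("shuffled", shuffled)):
    sample = two_stage_sample(table, TARGROWS)
    print(name, "-", len(set(sample)), "distinct values in the sample")

In runs of this toy model the clustered layout consistently yields fewer
distinct values (i.e. more duplicates) in the sample than the shuffled
layout, although both tables contain exactly the same values - and that
surplus of duplicates is what drags the Haas & Stokes estimate down.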

Not everything may be hopeless though ;-) The problem could, in theory,
be avoided if random rows were selected from the whole table. Then
again, that may not be feasible - the two-phase approach was probably
not implemented for nothing.

Another possible solution would be to avoid much of the resampling (not
all of it) in phase two. For that, in theory, every page visited would
have to get a lower weight, so that revisiting it becomes no more likely
than if rows were selected from the whole table. That does not sound
easy or elegant to implement, but perhaps there is some clever algorithm
- unfortunately I do not know of one.
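
Purely as a thought experiment, a naive version of such a reweighting
might look like this (my own invention for illustration - not an
established algorithm, and certainly not what analyze.c does):

import random

def reweighted_sample(table, targrows, penalty=0.5):
    """Draw one row per step; each visit lowers the page's weight."""
    weights = [1.0] * len(table)                 # one weight per page
    sample = []
    for _ in range(targrows):
        # pick a page proportionally to its current weight
        b = random.choices(range(len(table)), weights=weights)[0]
        sample.append(random.choice(table[b]))   # one random row from it
        weights[b] *= penalty                    # discourage re-visits
    return sample

Note that this touches up to targrows different pages (losing the I/O
advantage of the block-based approach), and it is not obvious how the
penalty would have to be chosen so that the re-visit probability exactly
matches uniform row sampling - it mainly illustrates why the idea is not
straightforward.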


> The general advice here is:
>
> 1) Increase default_statistics_target for the column.
>
> 2) If that doesn't help, consider using the following DDL:
>
> alter table foo alter column bar set ( n_distinct = 5.0);

Yes, I will probably have to live with that for now - I will come back
to these workarounds with one or two questions.

Thanks again & Regards,
Stefan

