Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 1137179909.3180.91.camel@localhost.localdomain
In response to Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Improving N-Distinct estimation by ANALYZE
Re: Improving N-Distinct estimation by ANALYZE
List pgsql-hackers
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote:

> Anyway, since the proof is in the pudding, Simon and I will be working on
> some demo code for different sampling methods so that we can debate
> results rather than theory.

I enclose a patch for checking out block sampling. It is not
production-ready yet, but it works bug-free on cvstip. Code comments have
been fully updated to explain what's going on inside.

All you need to do is "set analyze_blocks=b" and ANALYZE will switch
over to the block sampling method and read all the rows in "b"
blocks. The sample size is also limited by maintenance_work_mem
(the memory limiting could be smarter; a rough sketch follows below).
This decouples the specification of the sample size from the specification
of the MCV/histogram size (stats_target).
[Right now, I'm not suggesting that we have a GUC named this - it just
exists for testing. If/when we agree to allow block sampling, then we
can discuss how to invoke/specify it]
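
To make the maintenance_work_mem cap concrete, here is the kind of clamp I
have in mind. This is illustrative only: the function and parameter names
are invented for this sketch and are not what the patch actually does.

#include "postgres.h"
#include "miscadmin.h"      /* maintenance_work_mem GUC, in kilobytes */

/*
 * Illustrative sketch, not the patch's code: limit the number of sampled
 * blocks so the rows collected should fit within maintenance_work_mem,
 * given rough estimates of rows per block and average row width.
 */
static int
clamp_sample_blocks(int requested_blocks,
                    double est_rows_per_block,
                    double est_row_width)
{
    double  max_rows = (double) maintenance_work_mem * 1024.0 / est_row_width;
    double  max_blocks = max_rows / est_rows_per_block;

    if ((double) requested_blocks > max_blocks)
        return (int) max_blocks;
    return requested_blocks;
}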

The stats calculations aren't touched - ANALYZE still uses the Haas-Stokes
estimator. If you "set log_min_messages=DEBUG2" you'll also get more useful
explanations of what the variables are and what decisions are made about
D for each column being analyzed.
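
For reference, the Haas-Stokes (Duj1) formula, as I read it in analyze.c, is

    n*d / (n - f1 + f1*n/N)

where n is the number of sampled rows, N the total number of rows in the
table, d the number of distinct values seen in the sample, and f1 the
number of distinct values that occurred exactly once in the sample.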

This patch has two main effects:
- ANALYZE runs more than 10x faster to retrieve the same-size sample
- you can specify much larger samples for bigger tables, without
increasing the stats targets

Generally, the larger samples will give better results for the
estimation. However, what is immediately apparent is that the
Haas-Stokes estimator actually gets even worse with block sampling in
the particular case I raised on-list (which is understandable, but not
desirable). ISTM this is a strike against Haas-Stokes, rather than a
strike against block sampling. So I'm looking at implementing the
Brutlag & Richardson estimator(s), which cope with values that appear
in only one block. Not surprisingly, that means some non-trivial
additional code to retrieve the blockid for each tuple and make decisions
accordingly. I plan to use a technique similar to the existing TupnoLink
array to match blockids (rough sketch below).
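
To illustrate the bookkeeping I mean (invented names, not the patch): keep
the source block of each sampled tuple next to its value, and after sorting
count the values whose tuples all came from a single block, which is the
quantity the B&R estimators work from.

#include "postgres.h"
#include "storage/block.h"

/* Illustrative sketch only: one sampled value plus its source block */
typedef struct SampleItem
{
    Datum       value;      /* sampled column value */
    BlockNumber blockid;    /* heap block this tuple was read from */
} SampleItem;

/*
 * After items[] has been sorted by value, walk each run of equal values
 * and count the runs whose tuples all came from one block.
 */
static int
count_single_block_values(SampleItem *items, int nitems,
                          bool (*values_equal) (Datum, Datum))
{
    int     nsingleblock = 0;
    int     i = 0;

    while (i < nitems)
    {
        BlockNumber firstblock = items[i].blockid;
        bool        single = true;
        int         j = i + 1;

        /* scan the run of items carrying the same value */
        while (j < nitems && values_equal(items[i].value, items[j].value))
        {
            if (items[j].blockid != firstblock)
                single = false;
            j++;
        }
        if (single)
            nsingleblock++;
        i = j;
    }
    return nsingleblock;
}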

The B&R estimators should allow a fairly fast, small sample to be taken,
making this more usable for dynamic sampling during query planning (when
desirable, see recent -perform thread).

It's also worth mentioning that for datatypes that only have an "="
operator, compute_minimal_stats is O(N^2) when the values are unique, so
increasing the sample size is a very bad idea in that case.
It may be possible to re-sample the sample, so that we get only one row
per block as with the current row sampling method. Another idea might be
to simply abort the analysis when the column looks fairly unique, rather
than churn through the whole sample (strawman sketch below).
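
As a strawman for the abort idea, something along these lines could be
checked inside the per-value loop. The names here are invented for
illustration; they are not the real analyze.c variables.

/*
 * Illustrative sketch only.  values_seen = non-null sampled values
 * examined so far, distinct_seen = how many of those matched no earlier
 * value.  If nearly everything is still new after a reasonable prefix of
 * the sample, give up on the O(N^2) matching.
 */
static bool
looks_unique(int values_seen, int distinct_seen)
{
    if (values_seen < 1000)             /* don't judge too early */
        return false;
    return distinct_seen > 0.95 * values_seen;
}

On a true return we could set stadistinct to -1.0 (the usual "number of
distinct values scales with the table" convention) and skip the rest of
the matching for that column.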

Best Regards, Simon Riggs


Attachment
