Re: Improving N-Distinct estimation by ANALYZE - Mailing list pgsql-hackers

From Trent Shipley
Subject Re: Improving N-Distinct estimation by ANALYZE
Date
Msg-id 200601041811.49432.tshipley@deru.com
In response to Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Improving N-Distinct estimation by ANALYZE
List pgsql-hackers
Sorry to interrupt.  The discussion is interesting, but I need some help to
follow along.

On Wednesday 2006-01-04 17:07, Josh Berkus wrote:
> Simon,
>
> > - Are there any performance issues that can be directly attributed to
> > mis-estimation of N-Distinct ("D") by the ANALYZE command?
>
> Yes.   There's at least one query (maybe two) from TPC-H which bombs
> because of bad N-distinct estimation, even with stats_target = 1000.  Based
> on my experience with data warehouses, this also occurs in the field.
>
> > - If so, can we do better than we currently achieve? How?
>
> Replace the current algorithm and broaden the sample.

Is "replace the algorithm" the same as saying "contextually use some estimate
of D that is not Chaudhuri?

> > - Would altering the estimate of D cause problems in other places?
>
> Unlike Index Cost Estimation, I wouldn't expect it to.  We make pretty
> "proper" use of D right now, it's just that for some common cases our
> estimates of D are bad.
>
> > The estimation of D is difficult and imprecise. The current method works
> > well in many cases, yet breaks down *badly* in one particular very
> > common use case of database design: a large dependent table with a
> > multi-column Primary Key. In some cases the estimate of D *decreases* as
> > the size of the table increases, though the estimate of D is an
> > underestimate in almost all cases, whatever the table design.
>
> Actually, the current estimator underestimates D for *any* large table's
> high-cardinality columns, primary key, multi-column, or not.
> Chaudhuri's calculation seems to be designed to yield the smallest
> number of cardinal values that could reasonably be expected to have
> produced the provided sample.   That is, if the estimated range within
> 2.0 standard deviations runs from 50,000 to 500,000 distinct values,
> Chaudhuri picks 50,000.
>
> This conservative approach makes sense when you're only considering join
> strategies.  That is, given an unreliable estimate you want to estimate
> D low so that you don't wrongly choose a nested loop, the cost of that
> mistake being much higher than the cost of performing an unnecessary
> hash join.   It's "conservative" in that sense.

So Chaudhuri's estimate of D is appropriate (and is working) when making
decisions about joins.
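
To make that trade-off concrete, here is a toy sketch (made-up table size,
not PostgreSQL internals; the 50,000-500,000 range is just Josh's example
above): the planner's expected duplicate count per value is roughly N / D,
so taking the low end of the plausible range for D inflates that count and
steers the planner away from a nested loop.

    # Toy sketch, not PostgreSQL code: why a low pick for D is
    # "conservative" for join planning.  Expected rows per distinct
    # value is roughly N / D, so underestimating D overestimates the
    # duplicate count and discourages a nested loop.
    N = 10_000_000                    # table rows (assumed)
    d_low, d_high = 50_000, 500_000   # plausible range from the sample

    def rows_per_value(n_rows, d_estimate):
        return n_rows / d_estimate

    print(rows_per_value(N, d_low))                  # 200.0  -> conservative pick
    print(rows_per_value(N, (d_low + d_high) / 2))   # ~36.4  -> median pick
    print(rows_per_value(N, d_high))                 # 20.0   -> optimistic pick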

> However, PostgreSQL now has a whole set of hash operations and other
> query types for which a too-low estimate of D causes query lockup.

Why?

> So for these operations,
> Chaudhuri ceases to be conservative and becomes high-risk.   FWIW, my
> testing with TPC-H showed that estimate error is usually OK within +/-
> 5x.  Beyond that you start to get bad query plans.
>
> (yes, I know all of the above begs examples.  I'm swamped.   I believe I
> posted examples when I first started talking about n-distinct estimation a
> year ago)
>
> So I think it's vital that we look at algorithms designed to deliver us
> the median estimated D, not the lowest reasonable, in addition to
> increasing sample size.  The block-based estimator functions which
> Andrew and I looked at seem designed to do that provided a sample of
> between 1% and 10%.

Do you *really* want the median estimate in these cases?  Are you certain you
do not want something with the opposite behavior of Chaudhuri's estimate so
that for small sample sizes the bias is toward a high estimate of D?
(Converges on D from the right instead of the left.)

Chaudhuri's <-----D------------------> needed
Estimate                               estimate

> > 1. *All* methods of statistical analysis are improved by larger sample
> > fractions. The D estimator method currently in use shows an optimum of
> > accuracy and sample fraction at around 5% of a table, as shown in the
> > author's original paper [Haas Stokes (1998)]. The current
> > implementation's error rates climb higher as table size increases.
>
> I read 5 different papers on ACM about sampling D.  All of them were
> united in saying that you couldn't get even slightly accurate estimates
> with less than 3% sampling.

These statements are at odds with my admittedly basic understanding of
statistics.  Isn't the power of a sample more related to the absolute size of
the sample than to the sample as a fraction of the population?  Why not just
pick a smallish sample size, say about 3000 rows, and apply it to all the
tables, even the ones with just a single row (adjusted appropriately for
block sampling)?
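
For what it's worth, a rough simulation sketch suggests why distinct-value
estimation may be the exception to that intuition (assumptions: ten rows per
value so the true D grows with the table, values drawn uniformly, a fixed
3,000-row sample; this is not any particular estimator, just the raw input
one would see):

    import random

    # Rough simulation (assumed workload): with a fixed 3,000-row sample,
    # the number of distinct values actually observed in the sample stops
    # tracking the true D once the table is large, so the raw input to any
    # estimator carries less and less information about D.
    SAMPLE_SIZE = 3000

    for n_rows in (100_000, 1_000_000, 10_000_000, 100_000_000):
        true_d = n_rows // 10   # ten rows per value (assumed)
        # With N much larger than the sample, drawing values with
        # replacement is a fine approximation of row sampling.
        sample = [random.randrange(true_d) for _ in range(SAMPLE_SIZE)]
        print(f"N={n_rows:>11,}  true D={true_d:>10,}  "
              f"distinct in sample={len(set(sample)):,}")

Past a certain table size the sample looks almost entirely distinct whether
D is one million or ten million, which seems consistent with the papers'
insistence on a minimum sampling *fraction* rather than a fixed sample size.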

> > 2. In terms of I/O, ANALYZE is relatively inefficient, since it uses a
> > row sampling technique rather than a block sampling technique. This
> > would translate directly into a performance drop from large sample
> > ratios, but since we currently use a fixed sample size this problem is
> > not yet visible for larger tables. With a 2GB table, we would typically
> > sample 1% of the blocks, yet around 0.025 - 0.05% of the rows.
>
> This would be a reason to use block-sampling ONLY, rather than hybrid
> sampling.
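
Simon's figures above check out with back-of-the-envelope arithmetic
(assumed values: 8 KB blocks, 25-50 rows per block, and a sample of roughly
3,000 rows):

    # Back-of-the-envelope check of the 2 GB example (assumed figures:
    # 8 KB blocks, 25-50 rows per block, ~3,000 sampled rows).
    BLOCK_SIZE = 8 * 1024
    blocks = (2 * 1024**3) // BLOCK_SIZE      # 262,144 blocks in a 2 GB table
    sample_rows = 3000

    for rows_per_block in (25, 50):
        total_rows = blocks * rows_per_block
        # Row sampling tends to land on a different block for nearly every
        # sampled row, so up to ~3,000 blocks get read.
        print(f"rows/block={rows_per_block}: "
              f"~{min(sample_rows, blocks) / blocks:.1%} of blocks read, "
              f"~{sample_rows / total_rows:.3%} of rows sampled")

That gives roughly 1.1% of blocks read but only 0.023-0.046% of rows sampled,
in the same ballpark as the 1% versus 0.025-0.05% quoted above.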
>
> > 3. Large values of statistics target (up to 1000) could cause a number
> > of problems with statistics catalog table growth (mentioned on
> > -perform). Setting these values appropriately can take significant
> > effort. Automatic scaling of such parameters is desirable.
>
> Well, decoupling the n-distinct sample size from the # of heuristics rows
> would fix that.

