Home > mailing lists

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

From	Simon Riggs
Subject	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date	April 26, 2005 18:06:20
Msg-id	1114549351.21529.363.camel@localhost.localdomain Whole thread Raw
In response to	Re: [PERFORM] Bad n_distinct estimation; hacks suggested? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: [PERFORM] Bad n_distinct estimation; hacks suggested? Citation for "Bad n_distinct estimation; hacks suggested?"
List	pgsql-hackers

Tree view

On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values.  I doubt anyone is really
> >> going to consider this a feasible answer for large tables.
>
> > Assuming you don't use the HashAgg plan, which seems very appropriate
> > for the task? (...but I understand the plan otherwise).
>
> The context here is a case with a very large number of distinct
> values...

Yes, but is there another way of doing this other than sampling a larger
proportion of the table? I don't like that answer either, for the
reasons you give.

The manual doesn't actually say this, but you can already alter the
sample size by setting one of the statistics targets higher, but all of
those samples are fixed sample sizes, not a proportion of the table
itself. It seems reasonable to allow an option to scan a higher
proportion of the table. (It would be even better if you could say "keep
going until you run out of memory, then stop", to avoid needing to have
an external sort mode added to ANALYZE).

Oracle and DB2 allow a proportion of the table to be specified as a
sample size during statistics collection. IBM seem to be ignoring their
own research note on estimating ndistinct...

> keep in mind also that we have to do this for *all* the
> columns of the table.

You can collect stats for individual columns. You need only use an
option to increase sample size when required.

Also, if you have a large table and the performance of ANALYZE worries
you, set some fields to 0. Perhaps that should be the default setting
for very long text columns, since analyzing those doesn't help much
(usually) and takes ages. (I'm aware we already don't analyze var length
column values > 1024 bytes).

> A full-table scan for each column seems
> right out to me.

Some systems analyze multiple columns simultaneously.

Best Regards, Simon Riggs

pgsql-hackers by date:

From: Josh Berkus
Date: 26 April 2005, 17:40:44
Subject: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

From: Andrew Dunstan
Date: 26 April 2005, 18:42:02
Subject: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

Previous

Next