Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date
Msg-id 1114549351.21529.363.camel@localhost.localdomain
Whole thread Raw
In response to Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku <manku@CS.Stanford.EDU>)
Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku <manku@CS.Stanford.EDU>)
List pgsql-hackers
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values.  I doubt anyone is really
> >> going to consider this a feasible answer for large tables.
>
> > Assuming you don't use the HashAgg plan, which seems very appropriate
> > for the task? (...but I understand the plan otherwise).
>
> The context here is a case with a very large number of distinct
> values...

Yes, but is there another way of doing this other than sampling a larger
proportion of the table? I don't like that answer either, for the
reasons you give.

The manual doesn't actually say this, but you can already alter the
sample size by setting one of the statistics targets higher, but all of
those samples are fixed sample sizes, not a proportion of the table
itself. It seems reasonable to allow an option to scan a higher
proportion of the table. (It would be even better if you could say "keep
going until you run out of memory, then stop", to avoid needing to have
an external sort mode added to ANALYZE).

Oracle and DB2 allow a proportion of the table to be specified as a
sample size during statistics collection. IBM seem to be ignoring their
own research note on estimating ndistinct...

> keep in mind also that we have to do this for *all* the
> columns of the table.

You can collect stats for individual columns. You need only use an
option to increase sample size when required.

Also, if you have a large table and the performance of ANALYZE worries
you, set some fields to 0. Perhaps that should be the default setting
for very long text columns, since analyzing those doesn't help much
(usually) and takes ages. (I'm aware we already don't analyze var length
column values > 1024 bytes).

> A full-table scan for each column seems
> right out to me.

Some systems analyze multiple columns simultaneously.

Best Regards, Simon Riggs


pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Next
From: Andrew Dunstan
Date:
Subject: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?