On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate is going to be proportional to the table size, not the
> constant sized sample that regular statistics need.
Agreed [I said exactly that in April]. The counter-argument at the time
was that proportional samples on large tables force external sorts in
many cases, since we need to sort the values for each attribute in
turn, and that would make the run times unacceptably long.
I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE, then that would be a very large proportion of a medium-sized
table (or partition).
Considering how few rows we sample at the moment, increasing the actual
sample size by a factor of many thousands would have a very beneficial
effect.
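To put some purely illustrative numbers on that (the 100-bytes-per-row
figure is an assumption on my part, not a measurement), a quick Python
back-of-envelope:

  # Back-of-envelope sketch; row width and table size are assumptions.
  maintenance_work_mem = 1 * 1024**3          # 1 GB ANALYZE budget
  bytes_per_sample_row = 100                  # assumed average width
  sample_rows = maintenance_work_mem // bytes_per_sample_row  # ~10 million

  table_rows = 100_000_000                    # a "medium sized" table
  print(sample_rows / table_rows)             # ~0.10, i.e. a 10% sample

Compare that with the few thousand rows we sample today; and because the
whole sample fits in maintenance_work_mem, the per-attribute sorts stay
internal.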
> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan
The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, about
the same memory, and probably less CPU, since no sorting is required.
The technique is essentially the same implementation as a HashAgg, just
one that loses rows in a predictable manner when it would otherwise
spill out of memory. It doesn't identify columns whose number of
distinct values scales with N, nor does it calculate correlation.
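For anyone who hasn't read the paper, here is a rough sketch of the idea
in Python; the names and the choice of hash are mine, and it glosses
over the per-value counts and predicate handling in the real algorithm:

  import hashlib

  def _level(value):
      # Level = number of trailing zero bits in a uniform 64-bit hash,
      # so P(level >= l) = 2**-l and the survivors are a uniform
      # subsample of the *distinct* values seen so far.
      h = int.from_bytes(hashlib.sha1(repr(value).encode()).digest()[:8], "big")
      return 64 if h == 0 else (h & -h).bit_length() - 1

  def distinct_sample(values, max_entries):
      """One-pass sample of distinct values, capped at max_entries."""
      level = 0
      sample = {}                     # value -> count of rows seen
      for v in values:
          if _level(v) >= level:
              sample[v] = sample.get(v, 0) + 1
          while len(sample) > max_entries:
              level += 1              # lose rows in a predictable manner
              sample = {k: c for k, c in sample.items() if _level(k) >= level}
      return sample, len(sample) * (2 ** level)   # sample, ndistinct estimate

So it really is a HashAgg that, instead of spilling, halves its contents
by hash level and scales the final count back up by 2**level.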
That's the same as rewriting COUNT(DISTINCT) to use hashing, which is a
TODO item. So perhaps the code could be planned to support the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.
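To make that connection concrete with the sketch above: an exact
hash-based COUNT(DISTINCT) is just the degenerate case where the hash
table never overflows, so the level stays at 0 and the "estimate" is the
true count; only when it would have spilled does it degrade to an
estimate.

  # Exact when the table fits (level stays 0)...
  _, d = distinct_sample(range(5_000), max_entries=10_000)      # d == 5000
  # ...and a predictable estimate once it would have spilled.
  _, d = distinct_sample(range(500_000), max_entries=10_000)    # d ~ 500000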
Best Regards,
Simon Riggs