Re: estimating # of distinct values - Mailing list pgsql-hackers

From Csaba Nagy
Subject Re: estimating # of distinct values
Date
Msg-id 1294657351.2962.37.camel@clnt-sysecm-cnagy
In response to Re: estimating # of distinct values  (tv@fuzzy.cz)
List pgsql-hackers
On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
> the problem is you will eventually need to drop the results and rebuild
> it, as the algorithms do not handle deletes (ok, Florian mentioned an
> algorithm L_0 described in one of the papers, but I'm not sure we can use
> it).

Yes, but even then you start with much better cards if you already have
an estimate of the distribution, thanks to the continuous updating. For
example, you will have a fairly good estimate of the bounds on the
number of distinct values, whereas starting truly from scratch you have
nothing to go on and must cover the whole range from "all values are
distinct" down to "there are only a few of them".
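Just to illustrate the kind of incrementally maintained estimator being
discussed (this is a toy sketch, not anything proposed for Postgres; the
names and constants are mine), a K-minimum-values counter keeps the K
smallest hashes seen so far and can be updated one tuple at a time:

```c
#include <stdint.h>
#include <string.h>

#define K 256  /* sketch size: relative error is roughly 1/sqrt(K) */

typedef struct {
    uint64_t h[K];   /* K smallest distinct hash values, kept sorted */
    int n;           /* number of slots currently filled */
} kmv_t;

/* splitmix64: a well-known 64-bit mixing function */
static uint64_t hash64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static void kmv_add(kmv_t *s, uint64_t value) {
    uint64_t h = hash64(value);
    if (s->n == K && h >= s->h[K - 1])
        return;                      /* not among the K smallest */
    /* binary search for the insertion point; skip duplicates */
    int lo = 0, hi = s->n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (s->h[mid] < h) lo = mid + 1; else hi = mid;
    }
    if (lo < s->n && s->h[lo] == h)
        return;                      /* value already seen */
    int last = (s->n < K) ? s->n : K - 1;
    memmove(&s->h[lo + 1], &s->h[lo], (last - lo) * sizeof(uint64_t));
    s->h[lo] = h;
    if (s->n < K) s->n++;
}

static double kmv_estimate(const kmv_t *s) {
    if (s->n < K)
        return s->n;                 /* fewer than K distinct values: exact */
    /* the K-th smallest hash, normalized to (0,1), estimates K / D */
    double u = (double) s->h[K - 1] / (double) UINT64_MAX;
    return (K - 1) / u;
}
```

The point relevant to the thread: even after a "rebuild", a structure
like this carries over a bounded estimate instead of the full
all-distinct-to-few-distinct range.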

> I'm not sure a constantly running background process is a good idea. I'd
> prefer storing an info about the modified tuples somewhere, and starting
> analyze only when a given threshold is reached. I'm not sure how to do
> that, though.
> 
> Another thing I'm not sure about is where to store those intermediate
> stats (used to get the current estimate, updated incrementally).

The forks implementation proposed in other responses is probably the
best idea, if it is usable. It would also solve the memory-limitation
problem, at the expense of more resources if the structure grows too
big, but it should still be fast enough as long as it stays small and
in cache.
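As for the threshold idea quoted above, something along the lines of
autovacuum's existing analyze trigger (a base threshold plus a scale
factor times the table size) would probably do; the function and
constant names below are made up for illustration:

```c
#include <stdbool.h>

/* illustrative constants, not actual GUC values */
#define NDISTINCT_BASE_THRESHOLD  50.0
#define NDISTINCT_SCALE_FACTOR     0.10

/*
 * Decide whether enough tuples changed since the last rebuild of the
 * distinct-value stats to make recomputing them worthwhile.
 */
static bool
ndistinct_needs_rebuild(double changes_since_rebuild, double reltuples)
{
    double threshold = NDISTINCT_BASE_THRESHOLD
                     + NDISTINCT_SCALE_FACTOR * reltuples;
    return changes_since_rebuild > threshold;
}
```

That would avoid a constantly running background process: the counter
of modified tuples gets checked, and the rebuild only fires past the
threshold.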

Cheers,
Csaba.



