Re: estimating # of distinct values - Mailing list pgsql-hackers

From tv@fuzzy.cz
Subject Re: estimating # of distinct values
Date
Msg-id 3f230db8df732b42dffcc8836be7e03e.squirrel@sq.gransy.com
Whole thread Raw
In response to estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
> On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
>> the problem is you will eventually need to drop the results and rebuild
>> it, as the algorithms do not handle deletes (ok, Florian mentioned an
>> algorithm L_0 described in one of the papers, but I'm not sure we can
>> use
>> it).
>
> Yes, but even then you can start with much better cards if you already
> have an estimate of what it looks like, based on the fact that you did
> continuous updating of it. For example you'll have a pretty good
> estimate of the bounds of the number of distinct values, while if you
> really start from scratch you have nothing to start with but assume that
> you must cope with the complete range between all values are distinct ->
> there's only a few of them.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so that you have to drop
the stats are rebuild it from scratch.

The biggest problem is not choosing a reasonable parameters (some of the
parameters can handle a few billions ndistinct values with something like
128kB of memory and less than 5% error). The really serious concern is I/O
generated by rebuilding the stats.

>> Another thing I'm not sure about is where to store those intermediate
>> stats (used to get the current estimate, updated incrementally).
>
> The forks implementation proposed in other responses is probably the
> best idea if usable. It will also solve you the problem of memory
> limitations, at the expense of more resources used if the structure
> grows too big, but it will still be probably fast enough if it stays
> small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better
than storing that directly in the memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate, it's
needed just when updating the stats.

So the last thing I'm not sure is how to store the changed rows, so that
the update process can get a list of new values. Someone already offered
LISTEN/NOTIFY, but I guess we could just write the ctids into a file
(maybe another fork)?

regards
Tomas



pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: hstore ?& operator versus mathematics
Next
From: Magnus Hagander
Date:
Subject: Re: Streaming base backups