Home > mailing lists

Re: estimating # of distinct values - Mailing list pgsql-hackers

From	tv@fuzzy.cz
Subject	Re: estimating # of distinct values
Date	January 10, 2011 10:56:34
Msg-id	3f230db8df732b42dffcc8836be7e03e.squirrel@sq.gransy.com Whole thread Raw
In response to	estimating # of distinct values (Tomas Vondra <tv@fuzzy.cz>)
List	pgsql-hackers

Tree view

> On Fri, 2011-01-07 at 12:32 +0100, tv@fuzzy.cz wrote:
>> the problem is you will eventually need to drop the results and rebuild
>> it, as the algorithms do not handle deletes (ok, Florian mentioned an
>> algorithm L_0 described in one of the papers, but I'm not sure we can
>> use
>> it).
>
> Yes, but even then you can start with much better cards if you already
> have an estimate of what it looks like, based on the fact that you did
> continuous updating of it. For example you'll have a pretty good
> estimate of the bounds of the number of distinct values, while if you
> really start from scratch you have nothing to start with but assume that
> you must cope with the complete range between all values are distinct ->
> there's only a few of them.

Sure, using the previous estimate is a good idea. I just wanted to point
out there is no reasonable way to handle deletes, so that you have to drop
the stats are rebuild it from scratch.

The biggest problem is not choosing a reasonable parameters (some of the
parameters can handle a few billions ndistinct values with something like
128kB of memory and less than 5% error). The really serious concern is I/O
generated by rebuilding the stats.

>> Another thing I'm not sure about is where to store those intermediate
>> stats (used to get the current estimate, updated incrementally).
>
> The forks implementation proposed in other responses is probably the
> best idea if usable. It will also solve you the problem of memory
> limitations, at the expense of more resources used if the structure
> grows too big, but it will still be probably fast enough if it stays
> small and in cache.

Hm, the forks seem to be an interesting option. It's probably much better
than storing that directly in the memory (not a good idea if there is a
lot of data). And you don't really need the data to get the estimate, it's
needed just when updating the stats.

So the last thing I'm not sure is how to store the changed rows, so that
the update process can get a list of new values. Someone already offered
LISTEN/NOTIFY, but I guess we could just write the ctids into a file
(maybe another fork)?

regards
Tomas

pgsql-hackers by date:

From: David Fetter
Date: 10 January 2011, 04:19:10
Subject: Re: hstore ?& operator versus mathematics

From: Magnus Hagander
Date: 10 January 2011, 13:09:37
Subject: Re: Streaming base backups

Re: estimating # of distinct values - Mailing list pgsql-hackers

Previous

Next