From: Tomas Vondra
Subject: Re: estimating # of distinct values
Msg-id: 4D31F4DB.9050305@fuzzy.cz
In response to: estimating # of distinct values (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers

Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I mentioned in my
previous posts. If you want to try it, the sources are available on
github at http://tvondra.github.com/pg_estimator (ignore the page, I
have to update it, just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with two
simple benchmarks:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it on different data, and if you notice
something strange, just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)
  - quite good performance, especially with respect to the memory
    requirements (95 and 495 B)
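
  For illustration, a minimal sketch of the probabilistic counting
  idea with a single bitmap (PCSA keeps several bitmaps and averages
  them to reduce the variance); the hash function is an illustrative
  stand-in, not the MD5 scheme used in the actual sources:

#include <stdint.h>
#include <math.h>

#define PHI 0.77351             /* correction constant from the paper */

static uint32_t bitmap = 0;

/* toy 32-bit mixer standing in for a real hash function */
static uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;  x *= 0x85ebca6b;
    x ^= x >> 13;  x *= 0xc2b2ae35;
    x ^= x >> 16;
    return x;
}

/* record the rank of the lowest set bit of the hash */
static void
add_element(uint32_t value)
{
    uint32_t h = hash32(value);
    int      r = (h == 0) ? 31 : __builtin_ctz(h);

    bitmap |= (1u << r);
}

/* R = index of the lowest zero bit; estimate = 2^R / PHI */
static double
estimate(void)
{
    int R = 0;

    while (R < 32 && (bitmap & (1u << R)))
        R++;
    return pow(2.0, R) / PHI;
}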

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)
  - very different algorithms, almost the same performance (especially
    with a large number of distinct values)
  - my personal favourites, as the memory requirements are still very
    reasonable (compared to the Bloom Counter)
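
  Again, just a sketch of the adaptive sampling idea, with a
  deliberately tiny buffer (the real implementation sizes it properly)
  and with the element assumed to be hashed already:

#include <stdint.h>
#include <stdbool.h>

#define CAP 64                  /* sample buffer capacity */

static uint32_t sample[CAP];
static int      nkept = 0;
static int      depth = 0;      /* keep hashes with 'depth' low zero bits */

static bool
in_sample(uint32_t h)
{
    int i;

    for (i = 0; i < nkept; i++)
        if (sample[i] == h)
            return true;
    return false;
}

static void
add_element(uint32_t h)         /* h = hash of the value */
{
    if (h & ((1u << depth) - 1))
        return;                 /* not sampled at the current depth */
    if (in_sample(h))
        return;                 /* duplicates must not count twice */
    sample[nkept++] = h;

    while (nkept == CAP)        /* overflow: halve the sampling rate */
    {
        int i, j = 0;

        depth++;
        for (i = 0; i < nkept; i++)
            if ((sample[i] & ((1u << depth) - 1)) == 0)
                sample[j++] = sample[i];
        nkept = j;
    }
}

/* each kept hash represents 2^depth distinct values */
static double
estimate(void)
{
    return (double) nkept * (double) (1u << depth);
}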

3) Rough Estimator (2010)
  - this algorithm is described as an "optimal algorithm" (not
    exactly - it's a building block of the optimal algorithm), but
    the results are not as good as with (1) and (2)

  - one of the problems is that it needs k-wise independent hash
    functions, and the MD5 hashing scheme used with the other
    algorithms probably does not satisfy this condition :-(

  - I've tried to implement such hash functions on my own (using a
    prime field and polynomials in modular arithmetic), but the
    results were quite bad - if you know how to get such hash
    functions, let me know (the textbook construction is sketched
    below)

  - this algorithm is much more complex than the other algorithms,
    so if someone could review it, that would be nice (maybe there
    is a bug resulting in the big estimation error)
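
  For the record, the textbook construction of a k-wise independent
  family is a random polynomial of degree k-1 evaluated over a prime
  field. A sketch of what I mean (the seeding is purely illustrative,
  and strictly speaking the x % p reduction already spoils the
  independence for the full 32-bit domain):

#include <stdint.h>
#include <stdlib.h>

#define P 2147483647ULL         /* Mersenne prime 2^31 - 1 */
#define K 4                     /* degree of independence wanted */

static uint64_t coef[K];        /* random coefficients in [0, p) */

static void
init_hash(void)
{
    int i;

    for (i = 0; i < K; i++)
        coef[i] = (((uint64_t) random() << 16) ^ random()) % P;
}

/* h(x) = (a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p, via Horner */
static uint32_t
kwise_hash(uint32_t x)
{
    uint64_t acc = 0;
    int      i;

    for (i = K - 1; i >= 0; i--)
        acc = (acc * (x % P) + coef[i]) % P;
    return (uint32_t) acc;
}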
 

4) Bloom Counter
  - basically a Bloom filter + a counter (incremented when a new
    element is added to the filter)
  - due to the design (no false negatives, a positive false positive
    rate) it's significantly biased (gives underestimates), but I
    think that can be easily fixed with a coefficient (depending on
    the number of items / the false positive error rate)

  - needs significantly more memory to get performance similar to the
    other algorithms (a Bloom counter that can track 1 million
    distinct values needs about 1.2MB of memory, while an S-Bitmap
    that tracks 10 million items needs just 65kB)

  - so unless we can use the special feature of the Bloom filter
    (i.e. the ability to detect items that are already in the data),
    this is a waste of memory

  I know Google uses Bloom filters in BigTable to limit I/O, i.e.
  before doing the I/O they ask the Bloom filter whether the data are
  there - if the answer is no, they can skip the I/O (false negatives
  are not allowed), and if the answer is yes they do the I/O.

  Not sure if we can / want to do something like this in PostgreSQL,
  as it significantly depends on the workload (and in most cases we
  know the data are there, so we have to do the I/O anyway). But I
  guess it might be a good idea for a contrib module (maintain the
  Bloom filter on your own and make the decision in the application).
  The tricky part here is keeping the Bloom filter up to date.
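
  A minimal sketch of the idea, where estimate() applies the standard
  fill-ratio correction n ~ -(m/k) * ln(1 - t/m) (t = number of set
  bits) - one possible way to implement the coefficient mentioned
  above, not necessarily the best one:

/* Bloom counter sketch: a Bloom filter plus a counter that is bumped
 * only when the added element was not already present in the filter.
 * False positives suppress some increments, hence the underestimate;
 * estimate() corrects for that using the fill ratio instead. */
#include <stdint.h>
#include <math.h>

#define M  (1 << 20)            /* filter size in bits */
#define KH 7                    /* number of hash functions */

static uint8_t  bits[M / 8];
static uint64_t counter = 0;

/* illustrative double hashing: h_i(x) = h1 + i*h2 (mod M) */
static void
hashes(uint32_t x, uint32_t *h1, uint32_t *h2)
{
    uint64_t z = x * 0x9e3779b97f4a7c15ULL;

    *h1 = (uint32_t) (z >> 32);
    *h2 = (uint32_t) z | 1;     /* odd, so the probes don't degenerate */
}

static void
add_element(uint32_t value)
{
    uint32_t h1, h2;
    int      was_new = 0;
    int      i;

    hashes(value, &h1, &h2);
    for (i = 0; i < KH; i++)
    {
        uint32_t pos = (h1 + (uint32_t) i * h2) % M;

        if (!(bits[pos / 8] & (1 << (pos % 8))))
        {
            bits[pos / 8] |= (1 << (pos % 8));
            was_new = 1;        /* at least one bit was unset */
        }
    }
    if (was_new)
        counter++;              /* false positives are missed here */
}

/* bias-corrected estimate from the fill ratio */
static double
estimate(void)
{
    uint64_t t = 0;
    size_t   i;

    for (i = 0; i < sizeof(bits); i++)
        t += __builtin_popcount(bits[i]);
    return -((double) M / KH) * log(1.0 - (double) t / M);
}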
 

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation fork' approach proposed as a
way to store the estimator.

regards
Tomas

