Re: estimating # of distinct values - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: estimating # of distinct values
Msg-id: 4D31F4DB.9050305@fuzzy.cz
In response to: estimating # of distinct values (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
Hi,

a short update regarding the ndistinct stream estimators - I've implemented the estimators described in the papers I mentioned in my previous posts. If you want to try it, the sources are available on github at http://tvondra.github.com/pg_estimator (ignore the page, I have to update it - just download the sources). You can build the estimators as standalone benchmarks (build.sh) or as aggregate functions (build-agg.sh) - I guess the latter is much easier to play with. The installation is described on my blog, along with two simple benchmarks:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it on different data, and if you notice something strange, just let me know. Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

- quite good performance, especially with respect to the memory requirements (95 B and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

- very different algorithms, almost the same performance (especially with a large number of distinct values)
- my personal favourites, as the memory requirements are still very reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

- this algorithm is described as an "optimal algorithm" (not exactly - it's a building block of the optimal algorithm), but the results are not as good as with (1) and (2)
- one of the problems is that it needs k-wise independent hash functions, and the MD5 hashing scheme used with the other algorithms probably does not satisfy this condition :-(
- I've tried to implement such hash functions on my own (using a prime field and polynomials in modular arithmetic), but the results were quite bad - if you know how to get such hash functions, let me know
- this algorithm is much more complex than the other algorithms, so if someone could do a review, that would be nice (maybe there is a bug causing the big estimation error)

4) Bloom Counter

- basically a Bloom filter + a counter (incremented when a new element is added to the filter)
- due to the design (no false negatives, possible false positives) it's significantly biased (gives underestimates), but I think that can be easily fixed with a coefficient (depending on the number of items / false positive error rate)
- needs significantly more memory to get performance similar to the other algorithms (a Bloom counter that can track 1 million distinct values needs about 1.2 MB of memory, while an S-Bitmap that tracks 10 million items needs just 65 kB)
- so unless we can use the special feature of the Bloom filter (i.e. the ability to detect items that are already in the data), this is a waste of memory

I know Google uses Bloom filters in BigTable to limit I/O, i.e. before doing the I/O they ask the Bloom filter whether the data are there - if the answer is no, they don't have to do the I/O (false negatives are not allowed), and if the answer is yes, they do the I/O. Not sure if we can / want to do something like this in PostgreSQL, as it depends significantly on the workload (and in most cases we know the data are there, so we have to do the I/O anyway). But I guess it might be a good idea for a contrib module (maintaining the Bloom filter yourself and making the decision in the application). The tricky part here is keeping the Bloom filter up to date.

I'll try to fix the optimal estimator (not sure how to do that right now), and I'll investigate the 'relation fork' proposed as a way to store the estimator.

regards
Tomas
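For readers who haven't seen the 1985 paper: the core of Probabilistic Counting from (1) fits in a few lines. This is only an illustrative sketch (a single Flajolet-Martin bitmap - PCSA averages many such bitmaps to reduce variance), not the code from the repository; the class name and sizes are made up:

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias correction constant


class FMSketch:
    """Single-bitmap Flajolet-Martin counter; PCSA averages many of
    these bitmaps to reduce variance.  Illustrative sketch only."""

    def __init__(self):
        self.bitmap = 0

    def add(self, item):
        # 32-bit hash of the item (an MD5 prefix, as in the benchmarks)
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        # rho(h) = index of the least significant set bit
        r = (h & -h).bit_length() - 1 if h else 31
        self.bitmap |= 1 << r

    def estimate(self):
        # R = index of the lowest zero bit in the bitmap
        R = 0
        while (self.bitmap >> R) & 1:
            R += 1
        return 2 ** R / PHI


fm = FMSketch()
for v in range(10000):
    fm.add(v)
# fm.estimate() is now within a small constant factor of 10000
```

A single bitmap has high variance (the estimate is only accurate to a binary order of magnitude or so), which is exactly why PCSA keeps a few hundred of them - hence the 95 B / 495 B figures above.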
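Regarding the k-wise independent hash functions needed by (3): the construction I mean is the Wegman-Carter one - a random polynomial of degree k-1 evaluated over a prime field. A minimal sketch (the prime, names and parameters are illustrative, not the code I actually benchmarked):

```python
import random

# h(x) = (a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p, then mod m.
# With the a_i drawn uniformly from GF(p), this family is k-wise
# independent (Wegman-Carter); the final "mod m" makes the range
# only approximately uniform, which is usually acceptable.
P = (1 << 61) - 1  # a Mersenne prime keeps the modular arithmetic cheap


def make_kwise_hash(k, m, rng=random.Random(42)):
    """Draw one hash function from a k-wise independent family."""
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial mod P
            acc = (acc * x + c) % P
        return acc % m

    return h


h = make_kwise_hash(k=4, m=1024)  # 4-wise independent, range [0, 1024)
```

If the results with this kind of construction are still bad, the problem is more likely in how the estimator consumes the hash bits than in the family itself.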
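The Bloom counter from (4) can be sketched like this (sizes and names are illustrative, not the actual implementation) - note how false positives can only push the count down, which is the underestimate bias described above:

```python
import hashlib


class BloomCounter:
    """Bloom filter + counter: the counter is incremented only when the
    element was not already (apparently) in the filter, so false
    positives can only lower the count.  Sizes are illustrative."""

    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)
        self.count = 0

    def _positions(self, item):
        digest = hashlib.md5(str(item).encode()).digest()
        for i in range(self.k):
            # carve k indices out of the 16-byte MD5 digest
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        new = False
        for p in self._positions(item):
            byte, bit = divmod(p, 8)
            if not (self.bits[byte] >> bit) & 1:
                new = True
                self.bits[byte] |= 1 << bit
        if new:
            self.count += 1


bc = BloomCounter()
for v in range(1000):
    bc.add(v)
# bc.count is close to 1000, but never above it
```

The correction coefficient mentioned above would scale `count` up by the expected false positive rate, which is computable from the filter's fill factor, so the bias really does look fixable.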