Hi everybody!
Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):
1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:
"Effective use of block-level sampling in statistics estimation"
by Chaudhuri, Das and Srivastava, SIGMOD 2004.
http://www-db.stanford.edu/~usriv/papers/block-sampling.pdf
2. In a single scan, it is possible to estimate n_distinct by using
a very simple algorithm:
"Distinct sampling for highly-accurate answers to distinct value
queries and event reports" by Gibbons, VLDB 2001.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
3. In fact, Gibbon's basic idea has been extended to "sliding windows"
(this extension is useful in streaming systems like Aurora / Stream):
"Distributed streams algorithms for sliding windows"
by Gibbons and Tirthapura, SPAA 2002.
http://home.eng.iastate.edu/~snt/research/tocs.pdf
Thanks,
Gurmeet
----------------------------------------------------
Gurmeet Singh Manku Google Inc.
http://www.cs.stanford.edu/~manku (650) 967 1890
----------------------------------------------------