"Jie Zhang" <jzhang@greenplum.com> writes:
> On 7/26/06 10:14 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> ... A nonuniform distribution would probably mean that some
>> of the bitmaps compress better-than-expected and others worse. I have
>> no idea how to model that and guess what the overall result is ...
> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> Wu et al gave an approximate answer for this question. Assume that there are
> c distinct values. Let the i-th value has a probability of p_i, the number
> of rows r, and the word size w. then the total size of the compressed bitmap
> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
> both \sum's, i is from 1 to c.
Hm, but that's still begging the question no? It's still assuming that
any one value is uniformly distributed. ISTM the cases that would break
my simplistic calculation involve clustering of particular values, such
that some areas of the bitmap are all-zero while other areas have lots
of ones.
regards, tom lane