Re: On-disk bitmap index patch - Mailing list pgsql-hackers

From Tom Lane
Subject Re: On-disk bitmap index patch
Date
Msg-id 748.1153983035@sss.pgh.pa.us
Whole thread Raw
In response to Re: On-disk bitmap index patch  ("Jie Zhang" <jzhang@greenplum.com>)
Responses Re: On-disk bitmap index patch  ("Jie Zhang" <jzhang@greenplum.com>)
List pgsql-hackers
"Jie Zhang" <jzhang@greenplum.com> writes:
> On 7/26/06 10:14 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> ... A nonuniform distribution would probably mean that some
>> of the bitmaps compress better-than-expected and others worse.  I have
>> no idea how to model that and guess what the overall result is ...

> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> Wu et al gave an approximate answer for this question. Assume that there are
> c distinct values. Let the i-th value has a probability of p_i, the number
> of rows r, and the word size w. then the total size of the compressed bitmap
> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
> both \sum's, i is from 1 to c.

Hm, but that's still begging the question no?  It's still assuming that
any one value is uniformly distributed.  ISTM the cases that would break
my simplistic calculation involve clustering of particular values, such
that some areas of the bitmap are all-zero while other areas have lots
of ones.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Jie Zhang"
Date:
Subject: Re: On-disk bitmap index patch
Next
From: Michael Glaesemann
Date:
Subject: Re: GUC with units, details