Re: On-disk bitmap index patch - Mailing list pgsql-hackers

From Jim C. Nasby
Subject Re: On-disk bitmap index patch
Date
Msg-id 20060728171714.GR66525@pervasive.com
Whole thread Raw
In response to Re: On-disk bitmap index patch  ("Jie Zhang" <jzhang@greenplum.com>)
Responses Re: On-disk bitmap index patch  ("Luke Lonergan" <llonergan@greenplum.com>)
List pgsql-hackers
On Thu, Jul 27, 2006 at 09:13:21AM -0700, Jie Zhang wrote:
> On 7/26/06 11:50 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> > "Jie Zhang" <jzhang@greenplum.com> writes:
> >> On 7/26/06 10:14 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >>> ... A nonuniform distribution would probably mean that some
> >>> of the bitmaps compress better-than-expected and others worse.  I have
> >>> no idea how to model that and guess what the overall result is ...
> > 
> >> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> >> Wu et al gave an approximate answer for this question. Assume that there are
> >> c distinct values. Let the i-th value has a probability of p_i, the number
> >> of rows r, and the word size w. then the total size of the compressed bitmap
> >> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
> >> both \sum's, i is from 1 to c.
> > 
> > Hm, but that's still begging the question no?  It's still assuming that
> > any one value is uniformly distributed.  ISTM the cases that would break
> > my simplistic calculation involve clustering of particular values, such
> > that some areas of the bitmap are all-zero while other areas have lots
> > of ones.
> 
> Yes, you are right -- each value is still uniformly distributed. But this
> will be the worst case in terms of the size of a bitmap vector. As for how
> to model the size of a bitmap vector for an non-uniformly distributed value,
> that's a good question. I don't really know. But we do know the best case
> and the worse case.

If the usefulness of bitmap indexes is still in doubt, could someone at
Greenplum provide data from actual data warehouses from actual
customers?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


pgsql-hackers by date:

Previous
From: "Jim C. Nasby"
Date:
Subject: Re: Hash indexes (was: On-disk bitmap index patch)
Next
From: "Marc G. Fournier"
Date:
Subject: Re: [CORE] Attack against postgresql.org ...