Re: Bitmap index thoughts - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Bitmap index thoughts
Date
Msg-id 1167247486.3783.470.camel@silverbirch.site
Whole thread Raw
In response to Re: Bitmap index thoughts  (Gavin Sherry <swm@linuxworld.com.au>)
List pgsql-hackers
On Wed, 2006-12-27 at 22:16 +1100, Gavin Sherry wrote:

> > But actually I'm not convinced we need to worry about efficient storage
> > of small bitmaps at all. The typical use case for bitmap indexes is
> > large tables with small number of distinct values, and the problem
> > doesn't really arise in that scenario. Let's keep it simple for now, we
> > can enhance it in later releases.
> 
> The scenario I'm concerned about is where a sales data base, say, has
> 100,000 products. However, only 500 or 1000 products are popular. They
> dominate, say >99% of the sales. The other 99,900 products consume a
> little bit over 8K each for very little benefit :-(.
> 
> This is pretty contrived but it seem real world enough...

Well, that seems to me to be the typical case. It's called a Zipf
Distribution and applies to categorical data everywhere. 

The main question is how you design your database.  We might think of
indexing the product group (a higher level of the Product Dimension),
since these tend to have a low number of values but these may have been
normalised (or "placed into a Dimension table"). This might leave you
with only the ProductId values in the large Fact table, which as Gavin
points out, may be sparsely populated.  

Another idea might to store the first N heap pointers in the auxiliary
heap, rather than allocating them a whole page for the first value. That
would be even more space efficient than allocating a fixed size part of
a page. At least that would provide some utility for that part of the
storage mechanism.

However, I think Heikki's KISG approach seems good for now. The
infrequent values will probably be infrequently accessed, so we may
never need to read them at all (wishful thinking?).

If we can get something ready for commit soon, that will leave lots of
time before 8.3 freeze to measure where things can be further improved
in terms of space and performance.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: pg_hba.conf hostname todo
Next
From: "Simon Riggs"
Date:
Subject: Re: TupleDescs and refcounts and such, again