Thread: plans for bitmap indexes?

plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

I'd like to know if there are any plans on introducing bitmap indexes
into postgresql. I think this could mean a big performance improvement
especially for datawarehousing applications. I know that there is an
index type hash but I don't know how both types are comparable due to
they are both best usable for equality expressions.

Thanks in advance!

Regards,
Yann


Re: plans for bitmap indexes?

From
Reini Urban
Date:
Yann Michel schrieb:
> I'd like to know if there are any plans on introducing bitmap indexes
> into postgresql. I think this could mean a big performance improvement
> especially for datawarehousing applications. I know that there is an
> index type hash but I don't know how both types are comparable due to
> they are both best usable for equality expressions.

Sure, and the next will come with musicbrainz.
Why not :)
If you don't want to code that in the app, the database backend is the 
solution. The database is the golden hammer. 
http://c2.com/cgi/wiki?GoldenHammer
-- 
Reini Urban
http://xarch.tu-graz.ac.at/home/rurban/


Re: plans for bitmap indexes?

From
Bruce Momjian
Date:
Yann Michel wrote:
> Hi,
> 
> I'd like to know if there are any plans on introducing bitmap indexes
> into postgresql. I think this could mean a big performance improvement
> especially for datawarehousing applications. I know that there is an
> index type hash but I don't know how both types are comparable due to
> they are both best usable for equality expressions.

Lots of people have talked about it but I don't know anyone coding it.
--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

On Thu, Oct 07, 2004 at 06:54:15PM -0400, Bruce Momjian wrote:
> > I'd like to know if there are any plans on introducing bitmap indexes
> > into postgresql. I think this could mean a big performance improvement
> > especially for datawarehousing applications. I know that there is an
> > index type hash but I don't know how both types are comparable due to
> > they are both best usable for equality expressions.
> 
> Lots of people have talked about it but I don't know anyone coding it.

have you ever discussed if bitmap indexes lead to better query
performance than hash indexes will do?

Regards,
Yann


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

On Thu, Oct 07, 2004 at 10:41:04PM +0200, Reini Urban wrote:
> >I'd like to know if there are any plans on introducing bitmap indexes
> >into postgresql. I think this could mean a big performance improvement
> >especially for datawarehousing applications. I know that there is an
> >index type hash but I don't know how both types are comparable due to
> >they are both best usable for equality expressions.
> 
> Sure, and the next will come with musicbrainz.
> Why not :)

I don't know what this means to my questin, but anyway...

> If you don't want to code that in the app, the database backend is the 
> solution. The database is the golden hammer. 


Therefore I asked wheather there are any thoughts of implementing them.

Regards,
Yann


Re: plans for bitmap indexes?

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Yann Michel
> Sent: 08 October 2004 08:28
> To: Reini Urban
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] plans for bitmap indexes?
>
>
> I don't know what this means to my questin, but anyway...
>
> > If you don't want to code that in the app, the database
> backend is the
> > solution. The database is the golden hammer.
>
>
> Therefore I asked wheather there are any thoughts of
> implementing them.

I think what Reini was asking was why do you think you need bitmap
indexes as opposed to any existing type?

Regards, Dave.


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

On Fri, Oct 08, 2004 at 10:09:18AM +0100, Dave Page wrote:
> I think what Reini was asking was why do you think you need bitmap
> indexes as opposed to any existing type? 

due to I'm developing a datawarehousing application we have lots of
fact-data in our central fact-table. As I know how to improve
performance on Oracle based datawarehouses, I'm used to add bitmap
indexes for atributes having only a few distinct values. 
So I was looking for any comparable indexing technology but didn't find
any so far.

Regards,
Yann


Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On R, 2004-10-08 at 12:45, Yann Michel wrote:
> Hi,
> 
> On Fri, Oct 08, 2004 at 10:09:18AM +0100, Dave Page wrote:
> > I think what Reini was asking was why do you think you need bitmap
> > indexes as opposed to any existing type? 
> 
> due to I'm developing a datawarehousing application we have lots of
> fact-data in our central fact-table. As I know how to improve
> performance on Oracle based datawarehouses, I'm used to add bitmap
> indexes for atributes having only a few distinct values. 
> So I was looking for any comparable indexing technology but didn't find
> any so far.

There is currently no suitable index type for this type of queries (huge
tables with a few distinct values).

You may try to optimise performance by partitioning your fact tables on
these few dimension values by using table inheritance or union all
views.

There was a discussion on partitioning postgres tables on
pgsql-performance list a few days ago.

-------------
Hannu



Re: plans for bitmap indexes?

From
Oleg Bartunov
Date:
On Fri, 8 Oct 2004, Yann Michel wrote:

> Hi,
>
> On Fri, Oct 08, 2004 at 10:09:18AM +0100, Dave Page wrote:
>> I think what Reini was asking was why do you think you need bitmap
>> indexes as opposed to any existing type?
>
> due to I'm developing a datawarehousing application we have lots of
> fact-data in our central fact-table. As I know how to improve
> performance on Oracle based datawarehouses, I'm used to add bitmap
> indexes for atributes having only a few distinct values.
> So I was looking for any comparable indexing technology but didn't find
> any so far.

Yann, you may try our contrib/intarray module

>
> Regards,
> Yann
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: plans for bitmap indexes?

From
Josh Berkus
Date:
Yann,

> > Lots of people have talked about it but I don't know anyone coding it.

I would love to have bitmap indexes in Postgres, as would a lot of other 
community members.   However, they are far from trivial to code.  Are you 
offering to help?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi Josh,

On Fri, Oct 08, 2004 at 09:59:41AM -0700, Josh Berkus wrote:
> 
> > > Lots of people have talked about it but I don't know anyone coding it.
> 
> I would love to have bitmap indexes in Postgres, as would a lot of other 
> community members.   However, they are far from trivial to code.  Are you 
> offering to help?

I'd like to help you, but I think, that my C-Experience is not good
enough for beeing able to. I mean, I coded some C-stuff and I know how
bitmap indexes (should) work but I guess that this won't be enough. In
addidtion I never took a look into postgresql's sources. 

Regards,
Yann


Re: plans for bitmap indexes?

From
Josh Berkus
Date:
Yann,

> I'd like to help you, but I think, that my C-Experience is not good
> enough for beeing able to. I mean, I coded some C-stuff and I know how
> bitmap indexes (should) work but I guess that this won't be enough. In
> addidtion I never took a look into postgresql's sources.

Well, there's no time like the present to get started!  ;-)

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi Josh,

On Fri, Oct 08, 2004 at 10:18:27AM -0700, Josh Berkus wrote:
> 
> > I'd like to help you, but I think, that my C-Experience is not good
> > enough for beeing able to. I mean, I coded some C-stuff and I know how
> > bitmap indexes (should) work but I guess that this won't be enough. In
> > addidtion I never took a look into postgresql's sources.
> 
> Well, there's no time like the present to get started!  ;-)

O.K. I downloaded it :-) 
We will see if and how I can help.... 

Regards,
Yann


Re: plans for bitmap indexes?

From
Sailesh Krishnamurthy
Date:
>>>>> "Yann" == Yann Michel <yann-postgresql@spline.de> writes:
   Yann> O.K. I downloaded it :-) We will see if and how I can   Yann> help....

FYI .. in case you aren't aware already:

http://portal.acm.org/citation.cfm?id=98720

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: plans for bitmap indexes?

From
Chris Browne
Date:
josh@agliodbs.com (Josh Berkus) writes:
>> > Lots of people have talked about it but I don't know anyone coding it.
>
> I would love to have bitmap indexes in Postgres, as would a lot of other 
> community members.   However, they are far from trivial to code.  Are you 
> offering to help?

I'm curious as to whether partial indexes might suffice as an
alternative.

There are doubtless cases where the optimizer won't use them where it
would be plausible to do so; that suggests, to me, possibilities for
enhancing the optimizer.
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.


Re: plans for bitmap indexes?

From
Chris Browne
Date:
yann-postgresql@spline.de (Yann Michel) writes:
> On Fri, Oct 08, 2004 at 10:09:18AM +0100, Dave Page wrote:
>> I think what Reini was asking was why do you think you need bitmap
>> indexes as opposed to any existing type? 
>
> due to I'm developing a datawarehousing application we have lots of
> fact-data in our central fact-table. As I know how to improve
> performance on Oracle based datawarehouses, I'm used to add bitmap
> indexes for atributes having only a few distinct values. So I was
> looking for any comparable indexing technology but didn't find any
> so far.

The most nearly comparable thing is be the notion of "partial
indexes," where, supposing you had 60 region codes (e.g. - 50 US
states, 10 Canadian provinces), you might set up indices thus:

TABLE=my_table
FIELD=stateprov
FILE=$HOME/regionlist.txt
for region in `cat $FILE`; do query="create index ${TABLE}_partial_on_${region} on $TABLE($FIELD) where $FIELD =
'$region';"echo $query | psql -d datawarehouse
 
done

That would set up 60 (or whatever $HOME/regionlist.txt indicated)
partial indices on the table on that field.

By the way, I thought ahead a little, in that; doing the same thing
for country codes might be as easy as replacing:

FIELD=country
FILE=$HOME/countrylist.txt

The partial indexes will not ALWAYS be useful; in cases where they
aren't, it is not inconceivable that there are improvements to be made
in the query optimizer...
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.


Re: plans for bitmap indexes?

From
Josh Berkus
Date:
Chris,

> The most nearly comparable thing is be the notion of "partial
> indexes," where, supposing you had 60 region codes (e.g. - 50 US
> states, 10 Canadian provinces), you might set up indices thus:

I'm afraid that you're mistaken about the functionality of bitmap indexes.   
The purpose of a bitmap index is not to partition an index, but to allow 
multiple indexes to be used in the same operation.

For example, imagine you have a table on a dating website with 18 columns 
representing 18 different characteristics for matching.  Imagine that you 
index each of those columns seperately. If you do:

SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = 
'San Francisco';

... then the planner can use an index on orientation OR on gender OR on city, 
but not all three.   Multicolumn indexes are no solution for this use case 
because you'd have to create a multicolumn index for each possible combo of 
two or three columns ( 18! ).   

The Bitmap index allows the query executor to use several indexes on the same 
operation, comparing them and selecting rows where they "overlap" like a Venn 
diagram.

... 

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: plans for bitmap indexes?

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = 
> 'San Francisco';

There are actually two TODOs here.

1) a "bitmap scan" that would be usable with any type of index. The tuple  locations can be read in for each criteria
andsorted by location and built  into bitmaps. The results can be combined using bitmap operations and the  tuples
scannedin physical order.
 

2) A persistent bitmap index that would enable skipping the first step of the  above.

In the case if all the columns were btree indexes it might still make sense to
scan through all the indexes and combine the results before reading in the
actual tuples. Especially if the tuples are very wide and each column
individually very unselective, but the combination very selective.

However it would work even better if gender and orientation could be stored on
disk in a bitmap representation. They're very low cardinality and could be
stored quite compactly. The result would read the data faster, skip the sort,
and be able to start returning tuples even before it finished reading the
entire index.

-- 
greg



Re: plans for bitmap indexes?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> The Bitmap index allows the query executor to use several indexes on
> the same operation, comparing them and selecting rows where they
> "overlap" like a Venn diagram.

Note that what Josh is describing is not really a distinct index type,
but a different way of using an index: that is, you pull candidate tuple
locations from several indexes and intersect or union those sets before
you go to the heap.  In principle this works whatever the index access
methods are.

I believe that the term "bitmap index" is also used with a different
meaning wherein it actually does describe a particular kind of on-disk
index structure, with one bit per table row.

IMHO building in-memory bitmaps (the first idea) is a very good idea to
pursue for Postgres.  I'm not at all sold on on-disk bitmap indexes,
though ... those I suspect *are* sufficiently replaced by partial
indexes.
        regards, tom lane


Re: plans for bitmap indexes?

From
Gaetano Mendola
Date:
Josh Berkus wrote:
> Chris,
> 
> 
>>The most nearly comparable thing is be the notion of "partial
>>indexes," where, supposing you had 60 region codes (e.g. - 50 US
>>states, 10 Canadian provinces), you might set up indices thus:
> 
> 
> I'm afraid that you're mistaken about the functionality of bitmap indexes.   
> The purpose of a bitmap index is not to partition an index, but to allow 
> multiple indexes to be used in the same operation.
> 
> For example, imagine you have a table on a dating website with 18 columns 
> representing 18 different characteristics for matching.  Imagine that you 
> index each of those columns seperately. If you do:
> 
> SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = 
> 'San Francisco';

> ... then the planner can use an index on orientation OR on gender OR on city, 
> but not all three.   Multicolumn indexes are no solution for this use case 
> because you'd have to create a multicolumn index for each possible combo of 
> two or three columns ( 18! ).    

I'm wondering if in some cases the following query could be more efficient

select * from people where orientation = 'gay'
intersect
select * from people where gender = 'male'
intersect
select * from people where city =





Regards
Gaetano Mendola



Re: plans for bitmap indexes?

From
Neil Conway
Date:
On Sun, 2004-10-10 at 03:36, Chris Browne wrote:
> There are doubtless cases where the optimizer won't use them where it
> would be plausible to do so; that suggests, to me, possibilities for
> enhancing the optimizer.

Speaking of which, if anyone has any examples of queries for which we
ought to be able to use a partial index but currently cannot, please
speak up (or mail me privately).

-Neil




Re: plans for bitmap indexes?

From
"Zeugswetter Andreas DAZ SD"
Date:
> > The most nearly comparable thing is be the notion of "partial
> > indexes," where, supposing you had 60 region codes (e.g. - 50 US
> > states, 10 Canadian provinces), you might set up indices thus:

> For example, imagine you have a table on a dating website with 18 columns
> representing 18 different characteristics for matching.  Imagine that you
> index each of those columns seperately. If you do:
>
> SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city =
> 'San Francisco';

I think bitmap indexes do have valid use cases, but partitioned indexes
are really a wonderful feature with a lot of use cases, maybe including this one.

Workable examples for useful partitioned indexes, that help here are:

create index people_male_ix on people (city) where gender = 'male';
create index people_gay_ix on people (city) where orientation = 'gay';
create index people_male_gay_ix on people (city) where gender = 'male' and orientation = 'gay';

Note, that the indexed column differs from the partitioning clause.
Note also, that the last index will perform way better than a combo of bitmap indexes.

Andreas


Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On K, 2004-10-13 at 00:09, Greg Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> 
> > SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = 
> > 'San Francisco';
> 
> There are actually two TODOs here.
> 
> 1) a "bitmap scan" that would be usable with any type of index. The tuple
>    locations can be read in for each criteria and sorted by location and built
>    into bitmaps. The results can be combined using bitmap operations and the
>    tuples scanned in physical order.
> 
> 2) A persistent bitmap index that would enable skipping the first step of the
>    above.
> 
> In the case if all the columns were btree indexes it might still make sense to
> scan through all the indexes and combine the results before reading in the
> actual tuples. Especially if the tuples are very wide and each column
> individually very unselective, but the combination very selective.
> 
> However it would work even better if gender and orientation could be stored on
> disk in a bitmap representation. They're very low cardinality and could be
> stored quite compactly. The result would read the data faster, skip the sort,
> and be able to start returning tuples even before it finished reading the
> entire index.

We could go even further and use the same bm indexes for selecting the
page where the tuple is stored (found by AND of all bitmap indexes plus 
fsm) and achieve "natural" clustering 

If this is done consistently we need only 1 bit/page in our index 
(straight bitmap for 1GB fits in  16384 kb or 4 database pages)

This approach may result in poor utilisation of database pages for small
tables, but one would not use bitmap indexes for small tables in the 
first place.

--------------
Hannu






Re: plans for bitmap indexes?

From
Reini Urban
Date:
Josh Berkus schrieb:
>>The most nearly comparable thing is be the notion of "partial
>>indexes," where, supposing you had 60 region codes (e.g. - 50 US
>>states, 10 Canadian provinces), you might set up indices thus:
> 
> I'm afraid that you're mistaken about the functionality of bitmap indexes.   
> The purpose of a bitmap index is not to partition an index, but to allow 
> multiple indexes to be used in the same operation.

uh. sorry! In my first harsh replay I didn't know that. I thought you 
wanted a new index type for binary images in BLOBS.
(just some hash, maybe optimized for image similarity)

> For example, imagine you have a table on a dating website with 18 columns 
> representing 18 different characteristics for matching.  Imagine that you 
> index each of those columns seperately. If you do:
> 
> SELECT * FROM people WHERE orientation = 'gay' AND gender = 'male' AND city = 
> 'San Francisco';
> 
> ... then the planner can use an index on orientation OR on gender OR on city, 
> but not all three.   Multicolumn indexes are no solution for this use case 
> because you'd have to create a multicolumn index for each possible combo of 
> two or three columns ( 18! ).   
> 
> The Bitmap index allows the query executor to use several indexes on the same 
> operation, comparing them and selecting rows where they "overlap" like a Venn 
> diagram.
-- 
Reini Urban
http://xarch.tu-graz.ac.at/home/rurban/


Re: plans for bitmap indexes?

From
Tom Lane
Date:
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
> Workable examples for useful partitioned indexes, that help here are:

> create index people_male_ix on people (city) where gender = 'male';
> create index people_gay_ix on people (city) where orientation = 'gay';
> create index people_male_gay_ix on people (city) where gender = 'male' and orientation = 'gay';

> Note, that the indexed column differs from the partitioning clause.
> Note also, that the last index will perform way better than a combo of bitmap indexes.

This is definitely a useful technique in some cases, but it's got its
limits.  You have to have only a fairly small number of interesting
conditions (else the number of indexes gets out of hand) and those
conditions have to be spelled out explicitly in the query.  That is,
the last index will indeed work forSELECT * FROM people WHERE orientation = 'gay'    AND gender = 'male' AND city =
'SanFrancisco';
 
but it will not work forSELECT * FROM people WHERE orientation = $1    AND gender = $2 AND city = $3;
which is the sort of thing that the planner is increasingly going to
have to deal with.

Combining bitmaps at runtime is certainly somewhat more expensive to
execute, but it can deal with cases where the specific values being
searched for are not known until runtime.
        regards, tom lane


Re: plans for bitmap indexes?

From
Josh Berkus
Date:
Andreas,

> I think bitmap indexes do have valid use cases, but partitioned indexes
> are really a wonderful feature with a lot of use cases, 

Sure, no question.  That's why we have them.

> maybe including 
> this one.

Nope, not at all.

> Workable examples for useful partitioned indexes, that help here are:
>
> create index people_male_ix on people (city) where gender = 'male';
> create index people_gay_ix on people (city) where orientation = 'gay';
> create index people_male_gay_ix on people (city) where gender = 'male' and
> orientation = 'gay';

You've forgotten part of my premise (based on a real case I discussed on IRC) 
that there are EIGHTEEN criteria columns.   That would require, by the method 
you have above, roughly 18(3rd factorial) indexes, times the number of values 
allowed by each column, which if it averaged, say 5 values, would be 24,480 
indexes.   A little impractical, hmmm?   I think that might even break a 
system limit somewhere.

Tom,

> Note that what Josh is describing is not really a distinct index type,
> but a different way of using an index: that is, you pull candidate tuple
> locations from several indexes and intersect or union those sets before
> you go to the heap.  In principle this works whatever the index access
> methods are. 

Yes, exactly.    They're known as "bitmap" indexes because that's how Oracle 
implemented them, and AFAIK only Oracle currently has anything analogous.   
I'd personally be interested in any scheme that allowed the relatively 
efficient usage of multiple indexes on a single operation.

BTW, Tom, I was talking to Sean last night and he was saying that our current 
planner cost calculations assume that a 2-column index fetch will be twice as 
expensive as a 1-column index fetch.   Is this right?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: plans for bitmap indexes?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> BTW, Tom, I was talking to Sean last night and he was saying that our
> current planner cost calculations assume that a 2-column index fetch
> will be twice as expensive as a 1-column index fetch.  Is this right?

No.
        regards, tom lane


Re: plans for bitmap indexes?

From
"Zeugswetter Andreas DAZ SD"
Date:
> > create index people_male_gay_ix on people (city) where gender = 'male' and
> > orientation = 'gay';
>
> You've forgotten part of my premise (based on a real case I discussed on IRC)
> that there are EIGHTEEN criteria columns.

That is why I said maybe :-) Whether it helps depends on the number of
actually (often) used access patterns.

>   That would require, by the method
> you have above, roughly 18(3rd factorial) indexes, times the number of values
> allowed by each column, which if it averaged, say 5 values, would be 24,480

Well, an index only needs to reduce the cost enough so that you can afford your
workload and have reasonable response times, so you might only need to create a
few of them.

I was actually only trying to help optimize this example without the "bitmap index"
feature, not trying to say that for this example "partial indexes" are better.
Especially since the first example, that mentioned partial indexes was not
"the way to do it" for a value that represents a large part of your table (here approx 50%).
(don't do: create index people_male on people (gender) where gender='male';)

Andreas


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

On Sat, Oct 09, 2004 at 01:31:36PM -0400, Chris Browne wrote:
> 
> The most nearly comparable thing is be the notion of "partial
> indexes," where, supposing you had 60 region codes (e.g. - 50 US
> states, 10 Canadian provinces), you might set up indices thus:
> 
[...]
> 
> The partial indexes will not ALWAYS be useful; in cases where they
> aren't, it is not inconceivable that there are improvements to be made
> in the query optimizer...

So what you are suggesting here is the "tree-fashioned-static way" of
real bitmap indexes. I.E. each time a new value is inserted vor any kind
of thus indexes column you have to create a new index which is not very
usefull as you can think of. In addition nothing about the real
granularity is known to the optimizer to let it guess the best execution
plan, i.e. to do a full table scan or use an index. That means if one
attributes value is representative for 80 percent it is usefull to do a
full table scan whereas if its value is representative for only 5
percent the index might be better. But as I understood the partial index
concept, no statistics for "value representation" are available.

Therefore I started to do read some articles and books about bitmap
index implementations to maby contribute... we will see...

BTW: Is there any more documented CVS-version available? I mean it would
be really nice to read some comments from time to time or at least more
comments about each function/method's purpose or functionality.

Regards,
Yann


Re: plans for bitmap indexes?

From
Alvaro Herrera
Date:
On Thu, Oct 14, 2004 at 11:08:54PM +0200, Yann Michel wrote:

> BTW: Is there any more documented CVS-version available? I mean it would
> be really nice to read some comments from time to time or at least more
> comments about each function/method's purpose or functionality.

Huh, the code is reasonably commented.  Certainly not following
Javadoc-like standards, but it can be read.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



Re: plans for bitmap indexes?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Thu, Oct 14, 2004 at 11:08:54PM +0200, Yann Michel wrote:
>> BTW: Is there any more documented CVS-version available? I mean it would
>> be really nice to read some comments from time to time or at least more
>> comments about each function/method's purpose or functionality.

> Huh, the code is reasonably commented.  Certainly not following
> Javadoc-like standards, but it can be read.

Also, have you read the Internals volume of the SGML docs?  Mostly
pretty high-level stuff, but that's what you need to get started.
The developer's FAQ is also required reading for newbies.
There are also README files in various parts of the source tree that
provide information about various sub-modules.
        regards, tom lane


Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi Tom,

On Fri, Oct 15, 2004 at 11:27:05AM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > On Thu, Oct 14, 2004 at 11:08:54PM +0200, Yann Michel wrote:
> >> BTW: Is there any more documented CVS-version available? I mean it would
> >> be really nice to read some comments from time to time or at least more
> >> comments about each function/method's purpose or functionality.
> 
> > Huh, the code is reasonably commented.  Certainly not following
> > Javadoc-like standards, but it can be read.
> 
> Also, have you read the Internals volume of the SGML docs?  Mostly
> pretty high-level stuff, but that's what you need to get started.
> The developer's FAQ is also required reading for newbies.
> There are also README files in various parts of the source tree that
> provide information about various sub-modules.

I have not jet been reading all of it but some of the README files. I
will keep that hint in mind but first of all I'll read something about
bitmap compression and other relevant techniques before starting to
discover the index internals of postgresql... ;-) I've been using all
kinds of functions in oracle for a long time but never had the
experience to implement any indexing strategies. The only thing I did
were some operating system extensions for minix during my os-studies
(scheduling, driver, acl etc.)

If there is anything additional/special to know further, I apreciate any
hints.

Regards and thanks,
Yann


Re: plans for bitmap indexes?

From
Mark Kirkwood
Date:
Tom Lane wrote:

>
>I believe that the term "bitmap index" is also used with a different
>meaning wherein it actually does describe a particular kind of on-disk
>index structure, with one bit per table row.
>
>IMHO building in-memory bitmaps (the first idea) is a very good idea to
>pursue for Postgres.  I'm not at all sold on on-disk bitmap indexes,
>though ... those I suspect *are* sufficiently replaced by partial
>indexes.
>
>  
>
I believe that the benefit of on-disk bitmap indexes is supposed to be 
reduced storage size (compared to btree).

In the cases where I have put them to use, they certainly occupy 
considerably less disk than a comparable btree index - provided there 
are not too many district values in the indexed column.

regards

Mark


Re: plans for bitmap indexes?

From
"Simon Riggs"
Date:
>Mark Kirkwood wrote
> > Tom Lane wrote:
> >
> >I believe that the term "bitmap index" is also used with a different
> >meaning wherein it actually does describe a particular kind of on-disk
> >index structure, with one bit per table row.
> >
> >IMHO building in-memory bitmaps (the first idea) is a very good idea to
> >pursue for Postgres.  I'm not at all sold on on-disk bitmap indexes,
> >though ... those I suspect *are* sufficiently replaced by partial
> >indexes.
> >

Well, if we could cache the bitmap after it was created the first time then
that might offer almost the same thing..... :-)

I was thinking about this recently, then realised that building the bitmap
would not be as easily, since PostgreSQL doesn't index null values. That
would mean that the sets of CTIDs in each index would be disjoint. My
thinking about dynamic bitmaps came from Teradata, which does index null
values.

How would you dynamically build the bit maps from the indexes?

Or would you:
- copy aside and sort the indexes on CTID
- merge join them all to find matching CTIDs
- probe into the main table

Hopefully, I've missed something that you've thought of !

> I believe that the benefit of on-disk bitmap indexes is supposed to be
> reduced storage size (compared to btree).
>
> In the cases where I have put them to use, they certainly occupy
> considerably less disk than a comparable btree index - provided there
> are not too many district values in the indexed column.
>

The main problem is the need for the table to be read-only. Until we have
partitioning, we wouldn't be able to easily guarantee parts of a table as
being (effectively) read-only.



Re: plans for bitmap indexes?

From
Alvaro Herrera
Date:
On Tue, Oct 19, 2004 at 11:22:31PM +0100, Simon Riggs wrote:

> I was thinking about this recently, then realised that building the bitmap
> would not be as easily, since PostgreSQL doesn't index null values. That
> would mean that the sets of CTIDs in each index would be disjoint. My
> thinking about dynamic bitmaps came from Teradata, which does index null
> values.

Huh, you are wrong.  At least btree does index null values, and one
other index method does too.  The other two index methods don't.  What
doesn't work is using an index with the IS NULL construct, because it's
not an operator.  Maybe that can be fixed by some other means ... some
parser magic perhaps.

> Or would you:
> - copy aside and sort the indexes on CTID
> - merge join them all to find matching CTIDs
> - probe into the main table

IIRC part of the trick was to build bitmaps to apply bitwise-AND/OR
operators.  This allows to use multiple indexes for one scan, for
example.


I don't understand your comment about read only tables ...

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"I call it GNU/Linux. Except the GNU/ is silent." (Ben Reiter)



Re: plans for bitmap indexes?

From
Mark Kirkwood
Date:
Simon Riggs wrote:

>
>>I believe that the benefit of on-disk bitmap indexes is supposed to be
>>reduced storage size (compared to btree).
>>
>>    
>>
>The main problem is the need for the table to be read-only. Until we have
>partitioning, we wouldn't be able to easily guarantee parts of a table as
>being (effectively) read-only.
>
>  
>
I don't believe that read only is required. The update/insert 
performance impact of bimap indexes is however very high (in Oracle's 
implementation anyway) - to the point where many sites drop them before 
adding in new data, and recreated 'em afterwards!

In the advent that there is a benefit for the small on-disk footprint, 
the insert/update throughput implications will need to be taken into 
account.

cheers

Mark


Re: plans for bitmap indexes?

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> I was thinking about this recently, then realised that building the bitmap
> would not be as easily, since PostgreSQL doesn't index null values.

As Alvaro already pointed out, this statement is bogus; and I'm not sure
what it has to do with the topic anyway.  All you care about is the rows
that the index fingers as matching your scan condition.  If the scan
condition is strict (which it usually is) it does not matter whether the
index stores entries for nulls or not.

> How would you dynamically build the bit maps from the indexes?

> Or would you:
> - copy aside and sort the indexes on CTID
> - merge join them all to find matching CTIDs
> - probe into the main table

I've been taking "bitmap" to be a rather handwavy way of saying "a
compact representation of sets of CTIDs that is readily amenable to
being ANDed and ORed with other sets".  I don't think it'll be a pure
bitmap with no other superstructure; at the very least you'd want to
apply some sort of sparse-bitmap and/or compression techniques.  I do
suspect a bitmappy kind of representation will be more effective than
sorting arrays of CTIDs per se, although in principle you could do it
that way too.

But yeah, the basic idea is to scan an index and build some sort of
in-memory set of CTIDs of selected rows; possibly AND or OR this with
other sets built from other indexes; and then scan the set and probe
into the heap at the indicated places.  One huge advantage is that the
actual heap visiting becomes efficient, eg you never visit the same page
more than once.  (What you lose is the ability to retrieve data in
index order, so this isn't a replacement for existing indexscan methods,
just another plan type to consider.)

One interesting thought is that the bitmappy representation could be
lossy.  For instance, once you get to the point of needing to examine
most of the rows on a particular page, it's probably not worth
remembering exactly which rows; you could just remember that that whole
page is a target, and sequentially scan all the rows on it when you do
visit the heap.  (ANDing and ORing still works.)  This can scale up to
visiting consecutive ranges of pages; in the limit the operation
degenerates to a seqscan.  With this idea you can guarantee that the
in-memory bitmaps never get impracticably large.  (Obviously if they get
so large as to push the system into swapping, or even run the backend
out of memory completely, you lose, so this is a real nice guarantee to
be able to make.)  The whole thing starts to look like a self-adaptive
interpolation between our present indexscan and seqscan techniques,
which takes a lot of pressure off the planner to correctly guess the
number of matching rows in advance.

I remember batting these ideas around with people at the 2001 "OSDB
summit" conference ... I didn't think it would take us this long to get
around to doing it ...
        regards, tom lane


Re: plans for bitmap indexes?

From
Josh Berkus
Date:
Tom,

> I've been taking "bitmap" to be a rather handwavy way of saying "a
> compact representation of sets of CTIDs that is readily amenable to
> being ANDed and ORed with other sets".  

Well, actually I think we're talking about two different features:

1) a way to use more than one index per operation;
2) a more compact and thus faster index representation

The fact that Oracle solved both problems with the same code doesn't, 
obviously mean that we have to.   There's been a lot of discussion around 
problem (2) on this thread, but I don't want to lose sight of problem 
(1) .... especially since that's the problem faced by several active 
community members right now.

You gave the impression that (1) could be implemented with regular BTree 
indexes in an earlier e-mail.   Would that be very hard to do?

> The whole thing starts to look like a self-adaptive 
> interpolation between our present indexscan and seqscan techniques,
> which takes a lot of pressure off the planner to correctly guess the
> number of matching rows in advance.

This would be way cool.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: plans for bitmap indexes?

From
Gavin Sherry
Date:
On Tue, 19 Oct 2004, Josh Berkus wrote:

> Tom,
>
> > I've been taking "bitmap" to be a rather handwavy way of saying "a
> > compact representation of sets of CTIDs that is readily amenable to
> > being ANDed and ORed with other sets".
>
> Well, actually I think we're talking about two different features:
>
> 1) a way to use more than one index per operation;
> 2) a more compact and thus faster index representation

For those interested, how this generally works is that for every distinct
value in the column being indexed, a bitmap of unique row identifiers (ie,
tids) is created. With compression, this can greatly reduce the size of
indexes on a large number of rows with a small number of distinct values
(a situation in which we're highly likely to use seq scan index of index
in Postgres).

For qualifications like: bitmapcol1 AND/OR bitmapcol2, we can use bitmap
and/or respectively. Of course, this is all in theory.

Bitmap indexes can suffer concurrency issues, depending on the granularity
of locking.

> You gave the impression that (1) could be implemented with regular BTree
> indexes in an earlier e-mail.   Would that be very hard to do?
>
> > The whole thing starts to look like a self-adaptive
> > interpolation between our present indexscan and seqscan techniques,
> > which takes a lot of pressure off the planner to correctly guess the
> > number of matching rows in advance.
>
> This would be way cool.

I think there's a lot of be gained by the technique above as an
alternative to our current access methods. Its just a feeling however, I
haven't prototyped this.

Thanks,

Gavin


Re: plans for bitmap indexes?

From
"Simon Riggs"
Date:
> Tom Lane
> "Simon Riggs" <simon@2ndquadrant.com> writes:

> > How would you dynamically build the bit maps from the indexes?
>
> > Or would you:
> > - copy aside and sort the indexes on CTID
> > - merge join them all to find matching CTIDs
> > - probe into the main table
>
> I've been taking "bitmap" to be a rather handwavy way of saying "a
> compact representation of sets of CTIDs that is readily amenable to
> being ANDed and ORed with other sets".  I don't think it'll be a pure
> bitmap with no other superstructure; at the very least you'd want to
> apply some sort of sparse-bitmap and/or compression techniques.  I do
> suspect a bitmappy kind of representation will be more effective than
> sorting arrays of CTIDs per se, although in principle you could do it
> that way too.
>

OK. You seemed to be implying that.

> (What you lose is the ability to retrieve data in
> index order, so this isn't a replacement for existing indexscan methods,
> just another plan type to consider.)

Never seen an application that required a bitmap plan and sorted output.
Have you? Mostly count(*), often sum() or avg(), but never sorted, surely.

Considering there would always be >1 index, which index order did we want
anyhow?

> One interesting thought is that the bitmappy representation could be
> lossy.  For instance, once you get to the point of needing to examine
> most of the rows on a particular page, it's probably not worth
> remembering exactly which rows; you could just remember that that whole
> page is a target, and sequentially scan all the rows on it when you do
> visit the heap.  (ANDing and ORing still works.)  This can scale up to
> visiting consecutive ranges of pages; in the limit the operation
> degenerates to a seqscan.  With this idea you can guarantee that the
> in-memory bitmaps never get impracticably large.  (Obviously if they get
> so large as to push the system into swapping, or even run the backend
> out of memory completely, you lose, so this is a real nice guarantee to
> be able to make.)  The whole thing starts to look like a self-adaptive
> interpolation between our present indexscan and seqscan techniques,
> which takes a lot of pressure off the planner to correctly guess the
> number of matching rows in advance.

Well, thats the best one yet. That's the solution, if ever I heard it.

The reduction in bitmap size makes their use much safer. Size matters, since
we're likely to start using these techniques on very large databases, which
imply obviously have very large CTID lists. The problem with guessing the
number of rows is you're never too sure whether its worth the startup
overhead of using the bitmap technique. ....my next question was going to
be, so how will you know when to use the technique?

Hmmm....think....you'd need to be clear that the cost of scanning a block
didn't make the whole thing impractical. Generally, since we're using this
technique to access infrequent row combinations, we'd be looking at no more
than one row per block usually anyway. So the technique is still I/O bound -
a bit extra post I/O cpu work won't hurt much. OK, cool.

> I remember batting these ideas around with people at the 2001 "OSDB
> summit" conference ... I didn't think it would take us this long to get
> around to doing it ...

...as if you haven't been busy... ;-)

Best Regards, Simon Riggs



Re: plans for bitmap indexes?

From
"Simon Riggs"
Date:
>Alvaro Herrera
> On Tue, Oct 19, 2004 at 11:22:31PM +0100, Simon Riggs wrote:
>
> > I was thinking about this recently, then realised that building the
bitmap
> > would not be as easily, since PostgreSQL doesn't index null values. That
> > would mean that the sets of CTIDs in each index would be disjoint. My
> > thinking about dynamic bitmaps came from Teradata, which does index null
> > values.
>
> Huh, you are wrong.

Always happy to learn. Thanks for letting me know.

> At least btree does index null values, and one
> other index method does too.  The other two index methods don't.  What
> doesn't work is using an index with the IS NULL construct, because it's
> not an operator.  Maybe that can be fixed by some other means ... some
> parser magic perhaps.

The manual says this (CREATE INDEX)
"Indexes are not used for IS NULL clauses by default. The best way to use
indexes in such cases is to create a partial index using an IS NULL
comparison. "

Perhaps we can find a better way of wording this to explain what actually
occurs, which after your comments, I'm less clear on than I was before.
Could you clarify further, so we can update the documentation to be very
specific, or at least clearer.

> > Or would you:
> > - copy aside and sort the indexes on CTID
> > - merge join them all to find matching CTIDs
> > - probe into the main table
>
> IIRC part of the trick was to build bitmaps to apply bitwise-AND/OR
> operators.  This allows to use multiple indexes for one scan, for
> example.

Yes, an implication of my question was "and would that then give greater
overhead for >2 indexes...

> I don't understand your comment about read only tables ...

These are restrictions on the Oracle implementation.

If you had a larger data warehouse table that grew over time, then typically
the older data wouldn't change much and so a "read-only" technique could be
sensibly applied.

Best Regards, Simon Riggs



Re: plans for bitmap indexes?

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
   Tom> One huge advantage is that the actual heap visiting becomes   Tom> efficient, eg you never visit the same page
morethan once.   Tom> (What you lose is the ability to retrieve data in index   Tom> order, so this isn't a replacement
forexisting indexscan   Tom> methods, just another plan type to consider.)
 

Even without bitmap indexes, without trying to use multiple indexes
etc. this (visiting a page only once) is useful. 

In other words, I'd like to see the indexscan broken up into: (1) an
operator that returns a list of TIDs, (2) Sort the TIDs and (3) an
operator that fetches heap tuples from the sorted TID list. 

Of course the resulting data from the heap will be out of order but
that often times is less important than unnecessarily visiting (and
possibly even re-fetching from disk) the same heap page twice for a
given index scan. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: plans for bitmap indexes?

From
Gavin Sherry
Date:
On Tue, 19 Oct 2004, Sailesh Krishnamurthy wrote:

> >>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>     Tom> One huge advantage is that the actual heap visiting becomes
>     Tom> efficient, eg you never visit the same page more than once.
>     Tom> (What you lose is the ability to retrieve data in index
>     Tom> order, so this isn't a replacement for existing indexscan
>     Tom> methods, just another plan type to consider.)
>
> Even without bitmap indexes, without trying to use multiple indexes
> etc. this (visiting a page only once) is useful.
>
> In other words, I'd like to see the indexscan broken up into: (1) an
> operator that returns a list of TIDs, (2) Sort the TIDs and (3) an
> operator that fetches heap tuples from the sorted TID list.

I'm uncertain about the potential benefit of this. Isn't/shouldn't the
effects of caching be assisting us here?

Gavin


Re: plans for bitmap indexes?

From
Sailesh Krishnamurthy
Date:
>>>>> "Gavin" == Gavin Sherry <swm@linuxworld.com.au> writes:
   Gavin> I'm uncertain about the potential benefit of   Gavin> this. Isn't/shouldn't the effects of caching be
assisting  Gavin> us here?
 

It all depends on how large your table is, and how busy the system is
(other pressures on the cache). While it's difficult to plan for the
latter you can plan for the former. For the latter you could assume
that the effective cache size is some fraction of the real size to
account for the effects of other queries. Unless you are real sure
that you will never kick out a page from the buffer cache .. 

I believe that for large enough tables this can certainly help .. it
sure is something that many other systems have implemented. 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: plans for bitmap indexes?

From
"Jim C. Nasby"
Date:
On Tue, Oct 19, 2004 at 07:38:49PM -0300, Alvaro Herrera wrote:
> Huh, you are wrong.  At least btree does index null values, and one
> other index method does too.  The other two index methods don't.  What
> doesn't work is using an index with the IS NULL construct, because it's
> not an operator.  Maybe that can be fixed by some other means ... some
> parser magic perhaps.

Wow, that's news to me, and important to remember! Where is it
documented? I don't see it in the index chapter...

I think it would be very helpful to have a chapter that gives
information like this for experienced DBAs who are trying to learn
PostgreSQL and don't want to read through all the documentation but need
to know important differences in how PostgreSQL works that would be easy
to miss. Not indexing NULLs (ala Oracle), or not using an index for IS
NULL would certainly be examples. Things listed at
http://sql-info.de/postgresql/postgres-gotchas.html would also be likely
candidates.
-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On K, 2004-10-20 at 03:03, Simon Riggs wrote:

> Well, thats the best one yet. That's the solution, if ever I heard it.
> 
> The reduction in bitmap size makes their use much safer. Size matters, since
> we're likely to start using these techniques on very large databases, which
> imply obviously have very large CTID lists. The problem with guessing the
> number of rows is you're never too sure whether its worth the startup
> overhead of using the bitmap technique. ....my next question was going to
> be, so how will you know when to use the technique?
> 
> Hmmm....think....you'd need to be clear that the cost of scanning a block
> didn't make the whole thing impractical. Generally, since we're using this
> technique to access infrequent row combinations, we'd be looking at no more
> than one row per block usually anyway. So the technique is still I/O bound -
> a bit extra post I/O cpu work won't hurt much. OK, cool.

I still think that an initial implementation could be done with "a plain
bitmap" kind of bitmap, if we are content with storing one bit per page
only - a simple page bitmap for 1TB table with 8kB pages takes only 16
MB, and that's backends private memory not more scarce shared memory.

It takes only 4mb for 32kb pages.

I guess that anyone working with terabyte size tables can afford a few
megabytes of RAM for query processing.

----------------
Hannu



Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On K, 2004-10-20 at 01:52, Mark Kirkwood wrote:

> I don't believe that read only is required. The update/insert 
> performance impact of bimap indexes is however very high (in Oracle's 
> implementation anyway) - to the point where many sites drop them before 
> adding in new data, and recreated 'em afterwards!
> 
> In the advent that there is a benefit for the small on-disk footprint, 
> the insert/update throughput implications will need to be taken into 
> account.

I repeat here my earlier proposal of making the bitmap indexes
page-level and clustering data automatically on AND of all defined
bitmap indexes. 

This would mostly solve this problem too, as there will be only one
insert per page per index (when the first tuple is inserted) and one
delete (when the page gets empty).

This has a downside of suboptimal space usage but this "should not" (tm)
be an issue for large tables, where most combinations of bits will get
enough hits to fill several pages.

Such clustering would also help (probably a lot) all queries actually
using these indexes.

-------------
Hannu


Re: plans for bitmap indexes?

From
Greg Stark
Date:
Hannu Krosing <hannu@tm.ee> writes:

> I repeat here my earlier proposal of making the bitmap indexes
> page-level and clustering data automatically on AND of all defined
> bitmap indexes. 

The problem with page-level bitmaps is that they could be much less effective.
Consider a query like 'WHERE foo = ? AND bar = ? AND baz = ?" where each of
those matches about 1% of your tuples. If you have 100 tuples per page then
each of those bitmaps will find a tuple in about half the pages. So the
resulting AND will find about 1/8th of the pages as candidates. In reality the
number of pages it should have to fetch should be more like 1 in a million.

The other problem is that for persist on-disk indexes they require more work
to update. You would have to recheck every other tuple in the page to
recalculate the bit value instead of just being able to flip one bit.

-- 
greg



Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On T, 2004-10-26 at 18:42, Greg Stark wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> 
> > I repeat here my earlier proposal of making the bitmap indexes
> > page-level and clustering data automatically on AND of all defined
> > bitmap indexes. 
> 
> The problem with page-level bitmaps is that they could be much less effective.
> Consider a query like 'WHERE foo = ? AND bar = ? AND baz = ?" where each of
> those matches about 1% of your tuples. If you have 100 tuples per page then
> each of those bitmaps will find a tuple in about half the pages. So the
> resulting AND will find about 1/8th of the pages as candidates. In reality the
> number of pages it should have to fetch should be more like 1 in a million.
> 
> The other problem is that for persist on-disk indexes they require more work
> to update. You would have to recheck every other tuple in the page to
> recalculate the bit value instead of just being able to flip one bit.

Read again ;)

the per-page clustering would make sure that all the tuples would be on
1 (or on a few) pages.

and what comes to updating the index, you have to do it only once per
100 pages ;)

----------------
Hannu



Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On T, 2004-10-26 at 23:53, Hannu Krosing wrote:
> On T, 2004-10-26 at 18:42, Greg Stark wrote:
> > Hannu Krosing <hannu@tm.ee> writes:
> > 
> > > I repeat here my earlier proposal of making the bitmap indexes
> > > page-level and clustering data automatically on AND of all defined
> > > bitmap indexes. 
> > 
> > The problem with page-level bitmaps is that they could be much less effective.
> > Consider a query like 'WHERE foo = ? AND bar = ? AND baz = ?" where each of
> > those matches about 1% of your tuples. If you have 100 tuples per page then
> > each of those bitmaps will find a tuple in about half the pages. So the
> > resulting AND will find about 1/8th of the pages as candidates. In reality the
> > number of pages it should have to fetch should be more like 1 in a million.

Another way to solve the unequal number of tuples per page problem was
to have a fixed length bitmap with rollower (mod length) for each page. 

So your 100 tuples per page on average table should get a 32 (or 64)
bits per page bitmap where bits 1, 33, 65 and 97 would all be in the
same position (for 32 bits), but one could still do fast ANDS and ORS
with high degree of accuracy.

I guess the per-page clustering idea described in my previous mail can
even be extended inside the pages (i.e. cluster on same bits in
2/4/8/16/32bit page bitmap) if simple per/page bitmaps would waste too
much space (many different values, few actual rows - i.e. not a good
candidate for real bitmap indexes ;-p )

> > The other problem is that for persist on-disk indexes they require more work
> > to update. You would have to recheck every other tuple in the page to
> > recalculate the bit value instead of just being able to flip one bit.
> 
> Read again ;)
> 
> the per-page clustering would make sure that all the tuples would be on
> 1 (or on a few) pages.
> 
> and what comes to updating the index, you have to do it only once per
> 100 pages ;)

This kind of clustering index works best when created on an empty table,
so all tuples can be inserted on their rightful pages.

If this kind of BM index is created on a table with some data, we need
an additional bitmap for "gray" pages - that is pages containing tuples
matching several combinations of index bits. 

The way to sharpen a table with "gray" pages would be either a CLUSTER
command or VACUUM (which could check for same-bit-combination-ness.

At least an empty page would be initially (or after becoming empty
during vacuum) marked non-gray and it should also never become gray
unless a new bitmap index is added.

-----------------
Hannu




Re: plans for bitmap indexes?

From
Andre Maasikas
Date:
Hannu Krosing wrote:
> the per-page clustering would make sure that all the tuples would be on
> 1 (or on a few) pages.

I understand that You can cluster on one column, but how do you do it for
indexes on other columns?
BTW, lossy variants also lose count(), group by only from index

> and what comes to updating the index, you have to do it only once per
> 100 pages ;)

Sorry, how does that work, if I update foo = 'bar'->'baz' - I can flip 
the 'baz' bit
on right away but I have to check every other row to see
if I can turn the 'bar' bit off

Andre


Re: plans for bitmap indexes?

From
Hannu Krosing
Date:
On K, 2004-10-27 at 00:58, Andre Maasikas wrote:
> Hannu Krosing wrote:
> > the per-page clustering would make sure that all the tuples would be on
> > 1 (or on a few) pages.
> 
> I understand that You can cluster on one column, but how do you do it for
> indexes on other columns?

Thanks to PostgreSQL's MVCC each update inserts a complete new tuple -
you just have to insert in the right page.

so if I change foo=1 to foo=2 on a tuple that has bar=2 and baz=3 then
the updated tuple will go to a page for which foo=2, bar=2 and baz=3.

if no such page has enough free space left (found by anding bitmaps for
foo=2, bar=2 and baz=3 and FSM) then a new page is inserted and the
three corresponding indexes are updated to include that page.

> BTW, lossy variants also lose count(), group by only from index

PostgreSQL has never been able to do these from index only, as the
visibility info is stored in the main relation, and not in index.

Someone brings up adding visibility info to index about once in 6 months
and is referred to previous discussions as to why it is a bad idea. The
only thing that as been added to index is a bit telling if a tuple is
definitely invisible (i.e. older than any pending transaction) which is
updated when such tuple is accessed using this index.


> > and what comes to updating the index, you have to do it only once per
> > 100 pages ;)
> 
> Sorry, how does that work, if I update foo = 'bar'->'baz' - I can flip 
> the 'baz' bit
> on right away but I have to check every other row to see
> if I can turn the 'bar' bit off

You don't touch indexes, instead you select the right page for new
tuple. The only times you touch indexes is when you insert a new page
(or when the page becomes empty during vacuum)

----------------
Hannu




Re: plans for bitmap indexes?

From
Greg Stark
Date:
Hannu Krosing <hannu@tm.ee> writes:

> so if I change foo=1 to foo=2 on a tuple that has bar=2 and baz=3 then
> the updated tuple will go to a page for which foo=2, bar=2 and baz=3.
> 
> if no such page has enough free space left (found by anding bitmaps for
> foo=2, bar=2 and baz=3 and FSM) then a new page is inserted and the
> three corresponding indexes are updated to include that page.

This is all thinking in terms of a single index though. What do you do if I
have a dozen bitmap indexes? Each could have a 10 distinct values. You would
need 100,000 pages, each of which might only have a few tuples in them.

In any case the user may prefer to have the data clustered around a btree
index using the existing CLUSTER command.

There's a logical separation between the idea of index methods and table
storage mechanisms. Trying to implement something like this that breaks that
abstraction will only make things far more confusing.

I think what you're trying to accomplish is better accomplished through
partitioned tables. Then the user can decide which keys to use to partition
the data and the optimizer can use the data to completely exclude some
partitions from consideration. And it wouldn't interfere with indexes to
access the data within a partition.

-- 
greg



Re: plans for bitmap indexes?

From
Yann Michel
Date:
Hi,

On Wed, Oct 27, 2004 at 10:13:56AM -0400, Greg Stark wrote:
> 
> There's a logical separation between the idea of index methods and table
> storage mechanisms. Trying to implement something like this that breaks that
> abstraction will only make things far more confusing.
> 
> I think what you're trying to accomplish is better accomplished through
> partitioned tables. Then the user can decide which keys to use to partition
> the data and the optimizer can use the data to completely exclude some
> partitions from consideration. And it wouldn't interfere with indexes to
> access the data within a partition.

this is not always the truth. In datawarehouosing applications you often
use data paritioning (time based) and bitmap indexes for fast
star-transformations. A very efficient way to solve that ist using
bitmap indexes.

Regards,
Yann


Re: plans for bitmap indexes?

From
Mark Kirkwood
Date:
Greg Stark wrote:

>I think what you're trying to accomplish is better accomplished through
>partitioned tables. Then the user can decide which keys to use to partition
>the data and the optimizer can use the data to completely exclude some
>partitions from consideration. And it wouldn't interfere with indexes to
>access the data within a partition.
>  
>
Though partitioning will help, you can only partition on one key (I 
guess the ability to partition *indexes* might help here).

I think that bitmap indexes provide a flexible may to get fact access to 
the result set for multiple low cardinality conditions - something that 
partitioning will generally not do.

regards

Mark


Re: plans for bitmap indexes?

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:

>
> I think that bitmap indexes provide a flexible may to get fact access 
> to the result set

that should be *fast* access to....sorry


Re: plans for bitmap indexes?

From
Bruce Momjian
Date:
Updated TODO:

* Allow the creation of bitmap indexes which can be quickly combined with other bitmap indexes
 Bitmap indexes index single columns that can be combined with other bitmap indexes to dynamically create a composite
indexto match a specific query. Each index is a bitmap, and the bitmaps are bitwise AND'ed or OR'ed to be combined.
Suchindexes could be more compact if there are few unique value.  Also, perhaps they can be lossy requiring a scan of
theheap page to find matching rows.
 

* Allow non-bitmap indexes to be combined
 Do lookups on non-bitmap indexes and create bitmaps in memory that can be combined with other indexes.


---------------------------------------------------------------------------

Simon Riggs wrote:
> > Tom Lane
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > > How would you dynamically build the bit maps from the indexes?
> >
> > > Or would you:
> > > - copy aside and sort the indexes on CTID
> > > - merge join them all to find matching CTIDs
> > > - probe into the main table
> >
> > I've been taking "bitmap" to be a rather handwavy way of saying "a
> > compact representation of sets of CTIDs that is readily amenable to
> > being ANDed and ORed with other sets".  I don't think it'll be a pure
> > bitmap with no other superstructure; at the very least you'd want to
> > apply some sort of sparse-bitmap and/or compression techniques.  I do
> > suspect a bitmappy kind of representation will be more effective than
> > sorting arrays of CTIDs per se, although in principle you could do it
> > that way too.
> >
> 
> OK. You seemed to be implying that.
> 
> > (What you lose is the ability to retrieve data in
> > index order, so this isn't a replacement for existing indexscan methods,
> > just another plan type to consider.)
> 
> Never seen an application that required a bitmap plan and sorted output.
> Have you? Mostly count(*), often sum() or avg(), but never sorted, surely.
> 
> Considering there would always be >1 index, which index order did we want
> anyhow?
> 
> > One interesting thought is that the bitmappy representation could be
> > lossy.  For instance, once you get to the point of needing to examine
> > most of the rows on a particular page, it's probably not worth
> > remembering exactly which rows; you could just remember that that whole
> > page is a target, and sequentially scan all the rows on it when you do
> > visit the heap.  (ANDing and ORing still works.)  This can scale up to
> > visiting consecutive ranges of pages; in the limit the operation
> > degenerates to a seqscan.  With this idea you can guarantee that the
> > in-memory bitmaps never get impracticably large.  (Obviously if they get
> > so large as to push the system into swapping, or even run the backend
> > out of memory completely, you lose, so this is a real nice guarantee to
> > be able to make.)  The whole thing starts to look like a self-adaptive
> > interpolation between our present indexscan and seqscan techniques,
> > which takes a lot of pressure off the planner to correctly guess the
> > number of matching rows in advance.
> 
> Well, thats the best one yet. That's the solution, if ever I heard it.
> 
> The reduction in bitmap size makes their use much safer. Size matters, since
> we're likely to start using these techniques on very large databases, which
> imply obviously have very large CTID lists. The problem with guessing the
> number of rows is you're never too sure whether its worth the startup
> overhead of using the bitmap technique. ....my next question was going to
> be, so how will you know when to use the technique?
> 
> Hmmm....think....you'd need to be clear that the cost of scanning a block
> didn't make the whole thing impractical. Generally, since we're using this
> technique to access infrequent row combinations, we'd be looking at no more
> than one row per block usually anyway. So the technique is still I/O bound -
> a bit extra post I/O cpu work won't hurt much. OK, cool.
> 
> > I remember batting these ideas around with people at the 2001 "OSDB
> > summit" conference ... I didn't think it would take us this long to get
> > around to doing it ...
> 
> ...as if you haven't been busy... ;-)
> 
> Best Regards, Simon Riggs
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: plans for bitmap indexes?

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Updated TODO:

> * Allow the creation of bitmap indexes which can be quickly combined
>   with other bitmap indexes

This TODO item description is fundamentally misleading.

The point was *not* about making "bitmap indexes", which on its face
suggests a persistent on-disk data structure comparable to our existing
index types.  The point was about using transient in-memory bitmaps as
an interface between the on-disk indexes and accessing the table proper.
        regards, tom lane


Re: plans for bitmap indexes?

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Updated TODO:
> 
> > * Allow the creation of bitmap indexes which can be quickly combined
> >   with other bitmap indexes
> 
> This TODO item description is fundamentally misleading.
> 
> The point was *not* about making "bitmap indexes", which on its face
> suggests a persistent on-disk data structure comparable to our existing
> index types.  The point was about using transient in-memory bitmaps as
> an interface between the on-disk indexes and accessing the table proper.

There are two separate issues --- on-disk bitmap indexes and on-the-fly
in-memory created ones.  I tried to mention both but obviously it wasn't
clear.  Here is new wording:

* Allow non-bitmap indexes to be combined by creating bitmaps in memory
 Bitmap indexes index single columns that can be combined with other bitmap indexes to dynamically create a composite
indexto match a specific query. Each index is a bitmap, and the bitmaps are bitwise AND'ed or OR'ed to be combined.
Theycan index by tid or can be lossy requiring a scan of the heap page to find matching rows.
 

* Allow the creation of on-disk bitmap indexes which can be quickly combined with other bitmap indexes

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073