Thread: State of the on-disk bitmap index

State of the on-disk bitmap index

From
Daniel Bausch
Date:
Hello Jonah, Simon, and the hackers,

I am going to implement a simple kind of "encoded bitmap indexes" (EBI).That is an index type where the bitmap columns
maynot only contain 
only a single '1' in the set of bits belonging to a tuple.  Instead, an
additional mapping table translates the distinct values of the table
column into a unique encoding.  To select for a given value all bitmap
columns must be compared instead of only one.  Queries that match
multiple different values (like IN lists or range queries) simplify to
less than the full set of bitmaps that needs to be compared because of
boolean logic.  The total number of bitmaps required to represent unique
encodings for all different values is ceil(ld(n)), where n is the number
of distinct values.  Compared to normal bitmap indexes this solves the
problem of high-cardinality columns.  It is targetet at data warehousing
scenarios with insert only data.

The respective scientific paper can be found at
http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf

I thought, it could be a good idea to base my work on the long proposed
on-disk bitmap index implementation.  Regarding to the wiki, you, Jonah
and Simon, were the last devs that touched this thing.  Unfortunately I
could not find the patch representing your state of that work.  I could
only capture the development history up to Gianni Ciolli & Gabriele
Bartolini from the old pgsql-patches archives.  Other people involved
were Jie Zhang, Gavin Sherry, Heikki Linnakangas, and Leonardo F.  Are
you and the others still interested in getting this into PG?  A rebase
of the most current bitmap index implementation onto master HEAD will be
the first 'byproduct' that I am going to deliver back to you.

1. Is anyone working on this currently?
2. Who has got the most current source code?
3. Is there a git of that or will I need to reconstruct the history from
the patches I collected?

Disclaimer: Although I read and manually processed most of the
executor's code during my last work, I would still call myself a newbie
and therefore will need some assistance most probably.

Regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



Re: State of the on-disk bitmap index

From
"Albe Laurenz"
Date:
Daniel Bausch wrote:
> Hello Jonah, Simon, and the hackers,
>
> I am going to implement a simple kind of "encoded bitmap indexes"
(EBI).
>  That is an index type where the bitmap columns may not only contain
> only a single '1' in the set of bits belonging to a tuple.  Instead,
an
> additional mapping table translates the distinct values of the table
> column into a unique encoding.  To select for a given value all bitmap
> columns must be compared instead of only one.  Queries that match
> multiple different values (like IN lists or range queries) simplify to
> less than the full set of bitmaps that needs to be compared because of
> boolean logic.  The total number of bitmaps required to represent
unique
> encodings for all different values is ceil(ld(n)), where n is the
number
> of distinct values.  Compared to normal bitmap indexes this solves the
> problem of high-cardinality columns.  It is targetet at data
warehousing
> scenarios with insert only data.
>
> The respective scientific paper can be found at
> http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf

I cannot answer your questions, but I read the paper and have some
questions myself.

1) As you mention, a WHERE clause that checks for only one value  will be more expensive with an encoded bitmap index
thanwith  a regular bitmap index.  If you want to implement encoded bitmap  indexes, wouldn't it be good to also
implementregular bitmap  indexes so that the user has a choice? 

2) The paper mentions that finding a good encoding and simplifying  bitmap access for a certain query are nontrivial
problems. Moreover, an encoding is good or bad only with respect to  certain queries, which the system does not know at
index creation time.  Do you have any ideas how to approach that?  If not, the paper suggests that, with enough values
tocheck for,  even a non-optimized encoded bitmap index should perform  much better than a normal bitmap index, so
maybethat's the way  to go (maybe only encode the NULL value as all zeros). 

Yours,
Laurenz Albe



Re: State of the on-disk bitmap index

From
Daniel Bausch
Date:
Am 20.08.2012 09:40, schrieb Albe Laurenz:
> Daniel Bausch wrote:
>> Hello Jonah, Simon, and the hackers,
>>
>> I am going to implement a simple kind of "encoded bitmap indexes"
> (EBI).
>>  That is an index type where the bitmap columns may not only contain
>> only a single '1' in the set of bits belonging to a tuple.  Instead,
> an
>> additional mapping table translates the distinct values of the table
>> column into a unique encoding.  To select for a given value all bitmap
>> columns must be compared instead of only one.  Queries that match
>> multiple different values (like IN lists or range queries) simplify to
>> less than the full set of bitmaps that needs to be compared because of
>> boolean logic.  The total number of bitmaps required to represent
> unique
>> encodings for all different values is ceil(ld(n)), where n is the
> number
>> of distinct values.  Compared to normal bitmap indexes this solves the
>> problem of high-cardinality columns.  It is targetet at data
> warehousing
>> scenarios with insert only data.
>>
>> The respective scientific paper can be found at
>> http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf
>
> I cannot answer your questions, but I read the paper and have some
> questions myself.
>
> 1) As you mention, a WHERE clause that checks for only one value
>    will be more expensive with an encoded bitmap index than with
>    a regular bitmap index.  If you want to implement encoded bitmap
>    indexes, wouldn't it be good to also implement regular bitmap
>    indexes so that the user has a choice?

Sorry if that one was not clear: The first thing, I am going to do, is
to work on the normal bitmap indexes (the one based on the Bizgres
patch).  I want to port it to master HEAD and give it back to the
community.  After that I want to base my EBI implementation on that.
Eventually, I will publish that implementation, too.  (After doing
tuning, experiments, and make sure it works well.)

> 2) The paper mentions that finding a good encoding and simplifying
>    bitmap access for a certain query are nontrivial problems.
>    Moreover, an encoding is good or bad only with respect to
>    certain queries, which the system does not know at index
>    creation time.

Actually, I was not involved in writing that paper.  I want to use that
idea to show something different.  I know of a follow up work by Golam
Rabilul Alam et al. that uses the query history and data mining on that
to optimize for the most common cases.  There may be others.  A more
detailed discussion of EBI can also be found in:

http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/query.TR.ps.gz

>    Do you have any ideas how to approach that?
>    If not, the paper suggests that, with enough values to check for,
>    even a non-optimized encoded bitmap index should perform
>    much better than a normal bitmap index, so maybe that's the way
>    to go (maybe only encode the NULL value as all zeros).

Actually "all zeros" is reserved for "non-existent" (a.k.a. "deleted" or
"invisible").

The thing with the "enough values" is a bit problematic, indeed, because
even a DBA cannot influence how the queries of the user or the user
application look like.  You will not use encoded bitmap indexes or
normal bitmap indexes for a column that is usually point accessed like
the ID column.  For that you will stick to hash or tree indexes.

Kind regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



Re: State of the on-disk bitmap index

From
Daniel Bausch
Date:
Am 20.08.2012 11:44, schrieb Daniel Bausch:
> Actually, I was not involved in writing that paper.  I want to use that
> idea to show something different.  I know of a follow up work by Golam
> Rabilul Alam et al. that uses the query history and data mining on that
> to optimize for the most common cases.  There may be others.  A more
> detailed discussion of EBI can also be found in:
>
> http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/query.TR.ps.gz

Oops, that was the wrong link.  I meant this one:
http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/bitmap.ps.gz

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



Re: State of the on-disk bitmap index

From
"Albe Laurenz"
Date:
Daniel Bausch wrote:
> I am going to implement a simple kind of "encoded bitmap indexes"
(EBI).

> I thought, it could be a good idea to base my work on the long
proposed
> on-disk bitmap index implementation.  Regarding to the wiki, you,
Jonah
> and Simon, were the last devs that touched this thing.  Unfortunately
I
> could not find the patch representing your state of that work.  I
could
> only capture the development history up to Gianni Ciolli & Gabriele
> Bartolini from the old pgsql-patches archives.  Other people involved
> were Jie Zhang, Gavin Sherry, Heikki Linnakangas, and Leonardo F.  Are
> you and the others still interested in getting this into PG?  A rebase
> of the most current bitmap index implementation onto master HEAD will
be
> the first 'byproduct' that I am going to deliver back to you.
>
> 1. Is anyone working on this currently?
> 2. Who has got the most current source code?
> 3. Is there a git of that or will I need to reconstruct the history
from
> the patches I collected?

It seems like you did not get any answers from any of the
people you mentioned ...

The latest version of the patch I found is
http://archives.postgresql.org/pgsql-patches/2006-12/msg00015.php
So that's probably the best you can get.

I want to encourage you to work on this.

You'd have to come up with a sound concept and discuss it on this
list, and it would be helpful to have some draft patch for
git master that can be used as a basis for discussion.

Expect to meet some resistance.  Nobody will want the extra
code and complexity unless you can show suffitient benefits.

One concern that came up in previous discussions is that
bitmap indexes are only useful for columns with low cardinality,
and in that case the result will likely be a significant portion
of the table anyway and a sequential scan would be faster.
I think that this is less true if you have more conditions,
and this is supposedly the case where encoded bitmap indexes
work better anyway.

Another criticism I can imagine is that PostgreSQL already
supports a bitmap index scan of b-tree indexes, so you would
have to show that on-disk bitmap indexes outperform that
in realistic scenarios.  This has probably become more
difficult with the recently introduced index-only scan
for b-tree indexes, which is particularly helpful in
data warehouse scenarios.

So you'd have to run some performance tests against a draft
implementation to get people convinced that it is worth the
effort.  Supporting index-only scans Would probably give
you an edge.

Yours,
Laurenz Albe



Re: State of the on-disk bitmap index

From
Daniel Bausch
Date:
Hi Albe and the list,

>> I am going to implement a simple kind of "encoded bitmap indexes" (EBI).
>>
>> I thought, it could be a good idea to base my work on the long proposed
>> on-disk bitmap index implementation.  Regarding to the wiki, you,
>> Jonah and Simon, were the last devs that touched this thing.  Unfortunately
>> I could not find the patch representing your state of that work.  I
>> could only capture the development history up to Gianni Ciolli & Gabriele
>> Bartolini from the old pgsql-patches archives.  Other people involved
>> were Jie Zhang, Gavin Sherry, Heikki Linnakangas, and Leonardo F.  Are
>> you and the others still interested in getting this into PG?  A rebase
>> of the most current bitmap index implementation onto master HEAD will
>> be the first 'byproduct' that I am going to deliver back to you.
>>
>> 1. Is anyone working on this currently?
>> 2. Who has got the most current source code?
>> 3. Is there a git of that or will I need to reconstruct the history
>> from
>> the patches I collected?
>
> It seems like you did not get any answers from any of the
> people you mentioned ...
>
> The latest version of the patch I found is
> http://archives.postgresql.org/pgsql-patches/2006-12/msg00015.php
> So that's probably the best you can get.
>
> I want to encourage you to work on this.

Yes I do.  Thank you for your support.

I used the (more recent) patches posted by Gianni Ciolli in 2008 and
currently am in the process of porting those to master HEAD -- like I
promised.  I will post the ported patches when I get them to compile and
the index seems to work (somehow).

Nevertheless, I am still interested in what Simon, Jonah, and Leonardo
did after that point in time.  So if someone knows details (code) about
their solutions to, for example, the VACUUM problems, please mail back.

> You'd have to come up with a sound concept and discuss it on this
> list, and it would be helpful to have some draft patch for
> git master that can be used as a basis for discussion.
>
> Expect to meet some resistance.  Nobody will want the extra
> code and complexity unless you can show suffitient benefits.

If noone wants that, it would be sad.  However, I will at least do all
the work required to run benchmark queries against it.  Nevertheless, I
appreciate any help.

Indeed, the patch is a big one and the approach seems a bit hacky at
some places.  I also suspect that the compression approach could be
improved/replaced by something that is more efficient compression wise.
However, I never could come up with an own solution that complete in the
time available for my current project.

> Another criticism I can imagine is that PostgreSQL already
> supports a bitmap index scan of b-tree indexes, so you would
> have to show that on-disk bitmap indexes outperform that
> in realistic scenarios.  This has probably become more
> difficult with the recently introduced index-only scan
> for b-tree indexes, which is particularly helpful in
> data warehouse scenarios.

IIRC, it was already shown that bitmap indexes can speed up TPC-H
queries.  I will compare B+-tree, bitmap, and encoded bitmap indexes.

> So you'd have to run some performance tests against a draft
> implementation to get people convinced that it is worth the
> effort.  Supporting index-only scans Would probably give
> you an edge.

Yes I will, because I am going to write about that.

Kind regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



Re: State of the on-disk bitmap index

From
Gianni Ciolli
Date:
Dear Albe and Daniel,

On Wed, Sep 05, 2012 at 11:28:18AM +0200, Daniel Bausch wrote:
> Hi Albe and the list,
> 
> >> I am going to implement a simple kind of "encoded bitmap indexes" (EBI).
> >> 
> >> I thought, it could be a good idea to base my work on the long proposed
> >> on-disk bitmap index implementation.  Regarding to the wiki, you,
> >> Jonah and Simon, were the last devs that touched this thing.  Unfortunately
> >> I could not find the patch representing your state of that work.  I
> >> could only capture the development history up to Gianni Ciolli & Gabriele
> >> Bartolini from the old pgsql-patches archives.  Other people involved
> >> were Jie Zhang, Gavin Sherry, Heikki Linnakangas, and Leonardo F.  Are
> >> you and the others still interested in getting this into PG?  A rebase
> >> of the most current bitmap index implementation onto master HEAD will
> >> be the first 'byproduct' that I am going to deliver back to you.
> >>
> >> 1. Is anyone working on this currently?
> >> 2. Who has got the most current source code?
> >> 3. Is there a git of that or will I need to reconstruct the history
> >> from
> >> the patches I collected?
> > 
> > It seems like you did not get any answers from any of the
> > people you mentioned ...

My fault: I missed the questions in August, but today my colleague
Gabriele drew my attention to them. I apologise.

> I used the (more recent) patches posted by Gianni Ciolli in 2008 and
> currently am in the process of porting those to master HEAD -- like I
> promised.

Back in 2008 the PostgreSQL project wasn't using git, and I wasn't
either; hence that patch is the best starting point I can find.

> > Another criticism I can imagine is that PostgreSQL already
> > supports a bitmap index scan of b-tree indexes, so you would
> > have to show that on-disk bitmap indexes outperform that
> > in realistic scenarios.  This has probably become more
> > difficult with the recently introduced index-only scan
> > for b-tree indexes, which is particularly helpful in
> > data warehouse scenarios.
> 
> IIRC, it was already shown that bitmap indexes can speed up TPC-H
> queries.  I will compare B+-tree, bitmap, and encoded bitmap indexes.

I think what Albe meant (also what we attempted back then, if memory
serves me, but without reaching completion) is a set of tests which
show a significant benefit of your implementation over the existing
index type implementations in PostgreSQL, to justify the increased
complexity of the source code.

The kind of test I have in mind is: a big table T with a
low-cardinality column C, such that a btree index on C is
significantly larger than the corresponding bitmap index on the same
column.

Create the bitmap index, and run a query like
 SELECT ... FROM T WHERE C = ... 

more than once; then you should notice that subsequent scans are much
faster than the first run, because the index is small enough to fit
the shared memory and will not need to be reloaded from disk at every
scan.

Then drop the bitmap index, and create a btree index on the same
column; this time the index will be too large and subsequent scans
will be slow, because the index blocks must be reloaded from disk at
every scan.

Hope that helps;
best regards,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.ciolli@2ndquadrant.it | www.2ndquadrant.it



Re: State of the on-disk bitmap index

From
Daniel Bausch
Date:
Hi Gianni!

Thank you for your attention and response!

>> I used the (more recent) patches posted by Gianni Ciolli in 2008 and
>> currently am in the process of porting those to master HEAD -- like I
>> promised.
>
> Back in 2008 the PostgreSQL project wasn't using git, and I wasn't
> either; hence that patch is the best starting point I can find.

Ok, fine.  However, while I do not find the mail at the moment, I think,
someone said, he fixed the VACUUM.  Additionally, the Wiki lists Simon
and Jonah as the last authors, pretending they prepared a patch for 8.5.

>> IIRC, it was already shown that bitmap indexes can speed up TPC-H
>> queries.  I will compare B+-tree, bitmap, and encoded bitmap indexes.
>
> I think what Albe meant (also what we attempted back then, if memory
> serves me, but without reaching completion) is a set of tests which
> show a significant benefit of your implementation over the existing
> index type implementations in PostgreSQL, to justify the increased
> complexity of the source code.
>
> The kind of test I have in mind is: a big table T with a
> low-cardinality column C, such that a btree index on C is
> significantly larger than the corresponding bitmap index on the same
> column.
>
> Create the bitmap index, and run a query like
>
>   SELECT ... FROM T WHERE C = ...
>
> more than once; then you should notice that subsequent scans are much
> faster than the first run, because the index is small enough to fit
> the shared memory and will not need to be reloaded from disk at every
> scan.
>
> Then drop the bitmap index, and create a btree index on the same
> column; this time the index will be too large and subsequent scans
> will be slow, because the index blocks must be reloaded from disk at
> every scan.
>
> Hope that helps;

Is that, what your bmi-perf-test.tar.gz from 2008 does?  I did not look
into that.  I will at least do something like you just described plus
some TPC-H test.  As the encoding helps against the cardinality
problems, I will also draw comparisons with different cardinalities.

Yours sincerely,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



Re: State of the on-disk bitmap index

From
Gianni Ciolli
Date:
Hi Daniel,

On Wed, Sep 05, 2012 at 01:37:59PM +0200, Daniel Bausch wrote:
> Is that, what your bmi-perf-test.tar.gz from 2008 does?  I did not
> look into that.

IIRC yes (but it's been a long time and I don't have a copy at hand
now).

Best regards,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.ciolli@2ndquadrant.it | www.2ndquadrant.it