Re: State of the on-disk bitmap index - Mailing list pgsql-hackers

From Daniel Bausch
Subject Re: State of the on-disk bitmap index
Date
Msg-id 5032070A.302@dvs.tu-darmstadt.de
Whole thread Raw
In response to Re: State of the on-disk bitmap index  ("Albe Laurenz" <laurenz.albe@wien.gv.at>)
Responses Re: State of the on-disk bitmap index
List pgsql-hackers
Am 20.08.2012 09:40, schrieb Albe Laurenz:
> Daniel Bausch wrote:
>> Hello Jonah, Simon, and the hackers,
>>
>> I am going to implement a simple kind of "encoded bitmap indexes"
> (EBI).
>>  That is an index type where the bitmap columns may not only contain
>> only a single '1' in the set of bits belonging to a tuple.  Instead,
> an
>> additional mapping table translates the distinct values of the table
>> column into a unique encoding.  To select for a given value all bitmap
>> columns must be compared instead of only one.  Queries that match
>> multiple different values (like IN lists or range queries) simplify to
>> less than the full set of bitmaps that needs to be compared because of
>> boolean logic.  The total number of bitmaps required to represent
> unique
>> encodings for all different values is ceil(ld(n)), where n is the
> number
>> of distinct values.  Compared to normal bitmap indexes this solves the
>> problem of high-cardinality columns.  It is targetet at data
> warehousing
>> scenarios with insert only data.
>>
>> The respective scientific paper can be found at
>> http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf
>
> I cannot answer your questions, but I read the paper and have some
> questions myself.
>
> 1) As you mention, a WHERE clause that checks for only one value
>    will be more expensive with an encoded bitmap index than with
>    a regular bitmap index.  If you want to implement encoded bitmap
>    indexes, wouldn't it be good to also implement regular bitmap
>    indexes so that the user has a choice?

Sorry if that one was not clear: The first thing, I am going to do, is
to work on the normal bitmap indexes (the one based on the Bizgres
patch).  I want to port it to master HEAD and give it back to the
community.  After that I want to base my EBI implementation on that.
Eventually, I will publish that implementation, too.  (After doing
tuning, experiments, and make sure it works well.)

> 2) The paper mentions that finding a good encoding and simplifying
>    bitmap access for a certain query are nontrivial problems.
>    Moreover, an encoding is good or bad only with respect to
>    certain queries, which the system does not know at index
>    creation time.

Actually, I was not involved in writing that paper.  I want to use that
idea to show something different.  I know of a follow up work by Golam
Rabilul Alam et al. that uses the query history and data mining on that
to optimize for the most common cases.  There may be others.  A more
detailed discussion of EBI can also be found in:

http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/query.TR.ps.gz

>    Do you have any ideas how to approach that?
>    If not, the paper suggests that, with enough values to check for,
>    even a non-optimized encoded bitmap index should perform
>    much better than a normal bitmap index, so maybe that's the way
>    to go (maybe only encode the NULL value as all zeros).

Actually "all zeros" is reserved for "non-existent" (a.k.a. "deleted" or
"invisible").

The thing with the "enough values" is a bit problematic, indeed, because
even a DBA cannot influence how the queries of the user or the user
application look like.  You will not use encoded bitmap indexes or
normal bitmap indexes for a column that is usually point accessed like
the ID column.  For that you will stick to hash or tree indexes.

Kind regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax:  +49 6151 16 6229



pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Re: [PATCH] Docs: Make notes on sequences and rollback more obvious
Next
From: Daniel Bausch
Date:
Subject: Re: State of the on-disk bitmap index