Thread: Best way to scan on-disk bitmaps

Best way to scan on-disk bitmaps

From
"Victor Y. Yegorov"
Date:
Greetings.

I have questions on how to implement on-disk bitmap scan.


I've been working on on-disk bitmaps as an ordinary Index Access Method, but
now it's clear to me, that it'll lose all it's strengths this way. One on-disk
bitmap has exactly one list of indexed table's TIDs and potentially unlimited
number of bitmaps (number of index attributes multiplied by attribute's
cardinality, to be precise).

So, for better performance, one should first retrieve all the needed bitmaps
from the index, then join all bitmaps according to the query clauses and, as
the last phase, retrieve TIDs from index, that matches final bitmap.

According to the docs "the index access method is responsible for
regurgitating the TIDs ...", but for on-disk bitmaps index scan is devided
into 3 phases. So, to perform the scan in phases, to my mind, executor
should be involved. (I'd like to mention again, that this is the first time
I got so deep inside postgres code).

I wanted to use Tom's nodeBitmap* stuff, but it's problematic a bit. Bitmaps
in nodeBitmap* are built upon list of TIDs retrieved during relation scan.
For on-disk bitmap indexes, there's no need for that, as all bitmaps are
already inside the index.

The question is: Is it possible to extend nodeBitmap functionality in such a
way, that Executor can either build bitmap after list of TIDs, obtained from
RelationScan, or ask index access method to give bitmaps it contain (and TIDs
at given position in the map later)?
This will, probably, require more functions in the pg_am catalog.

Or should I create a completely new node for on-disk bitmaps?


--

Victor Y. Yegorov


Re: Best way to scan on-disk bitmaps

From
Tom Lane
Date:
"Victor Y. Yegorov" <viy@mits.lv> writes:
> I have questions on how to implement on-disk bitmap scan.

I think your best plan might be

1. Be sure that all the indexable WHERE conditions are passed to the  indexscan as indexquals.  This might be,
say,WHEREa = 42 and b = 'foo'
 

2. Within the index AM, form the AND of the relevant bitmaps (here the  ones for a = 42 and b = 'foo').

3. Within the index AM, pick up the TIDs for the remaining one-bits,  and pass them back.

4. Let the existing machinery handle the OR-ing problem as well as  actual fetching of the heap rows.

This can be done without any restructuring of the index AM API.
        regards, tom lane


Re: Best way to scan on-disk bitmaps

From
"Victor Y. Yegorov"
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [12.05.2005 23:09]:
> 1. Be sure that all the indexable WHERE conditions are passed to the
>    indexscan as indexquals.  This might be, say,
>     WHERE a = 42 and b = 'foo'

If I have on-disk bitmapON (a, b, c)
will the planner pick an index scan forWHERE a = 42 AND b = 'foo'
(i.e. only part of the index attributes are involved)? Any modifications
needed to achieve this functionality?

To my mind, bitmap scan even for 1 attribute of a multi-column index would be
a win, though I haven't tested this yet.


-- 

Victor Y. Yegorov


Re: Best way to scan on-disk bitmaps

From
Tom Lane
Date:
"Victor Y. Yegorov" <viy@mits.lv> writes:
> If I have on-disk bitmap
>     ON (a, b, c)
> will the planner pick an index scan for
>     WHERE a = 42 AND b = 'foo'
> (i.e. only part of the index attributes are involved)? Any modifications
> needed to achieve this functionality?

Hmm.  That particular case will work, but the planner believes that only
consecutive columns in the index are usable --- that is, if you have
quals for a and c but not for b, it will think that the condition for c
isn't usable with the index.  This is true for btree and gist indexes,
so I suppose we'd need to introduce a pg_am column that tells what to
do.

[ thinks some more ... ]

Plan B would be to remove that restriction and teach btree and gist to
cope.  While a btree couldn't use a nonconsecutive restriction as part
of its where-to-scan logic, I don't see any good reason why it couldn't
still perform the test before returning the TID, thus possibly saving a
trip to the heap.  Offhand it seems this should be true of gist as well,
but I don't know that code well enough to be sure.
        regards, tom lane


Re: Best way to scan on-disk bitmaps

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Plan B would be to remove that restriction and teach btree and gist to
> cope.  While a btree couldn't use a nonconsecutive restriction as part
> of its where-to-scan logic, I don't see any good reason why it couldn't
> still perform the test before returning the TID, thus possibly saving a
> trip to the heap.  Offhand it seems this should be true of gist as well,
> but I don't know that code well enough to be sure.

Not long ago there was some discussion about how gist indexes don't really
handle multicolumn indexes usefully currently. They use only the first column
to determine page splits. The discussion wandered and it became clear that it
wasn't even clear what a multicolumn gist index should mean.

I suggested treating each column as independent axes. Independently ask each
column's datatype for the "distance" value and then calculate the inner
product as a kind of geometric n-dimensional distance. There was some
resistance since this limits gist indexes to always basing page splits on a
single "distance" based algorithm. (Though all current gist index methods in
the postgres source tree do work this way, mostly with copy-pasted code in
fact.)

In this model the columns listed in the gist index are unordered. Any subset
of columns can be used to perform an index lookup. Making it more like the
bitmap index behaviour you're looking at than the btree index behaviour.

-- 
greg



Re: Best way to scan on-disk bitmaps

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Plan B would be to remove that restriction and teach btree and gist to
>> cope.  While a btree couldn't use a nonconsecutive restriction as part
>> of its where-to-scan logic, I don't see any good reason why it couldn't
>> still perform the test before returning the TID, thus possibly saving a
>> trip to the heap.

> [ snip ]

> In this model the columns listed in the gist index are unordered. Any subset
> of columns can be used to perform an index lookup. Making it more like the
> bitmap index behaviour you're looking at than the btree index behaviour.

I thought some more about this since sending my earlier message.  As far
as I can recall at the moment, there really isn't anything fundamental
that depends on the consecutive-columns rule.  The one place where the
rubber meets the road is in the index cost estimation functions: if we
were to relax that rule, then btcostestimate would have to be taught to
include only the consecutive columns when estimating how much of a btree
index is going to be touched.

And more than that: if you've studied the btree code at all, you realize
that that's only an incomplete heuristic anyway.  For instance, if the
leading key is a > xxx, second keys like b > yyy and b < yyy act
completely differently in terms of indexscan cost, but btcostestimate
doesn't presently know that.

I wonder if we shouldn't migrate the amcostestimate functions into the
individual index AMs (which would mean adding a column to pg_am, but so
what).  btcostestimate could be much less phony about this if it had
access to the same infrastructure that _bt_first uses to examine the
index clauses.
        regards, tom lane


Re: Best way to scan on-disk bitmaps

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Victor Y. Yegorov" <viy@mits.lv> writes:
> > If I have on-disk bitmap
> >     ON (a, b, c)
> > will the planner pick an index scan for
> >     WHERE a = 42 AND b = 'foo'
> > (i.e. only part of the index attributes are involved)? Any modifications
> > needed to achieve this functionality?
> 
> Hmm.  That particular case will work, but the planner believes that only
> consecutive columns in the index are usable --- that is, if you have
> quals for a and c but not for b, it will think that the condition for c
> isn't usable with the index.  This is true for btree and gist indexes,
> so I suppose we'd need to introduce a pg_am column that tells what to
> do.

We do have a TODO for this:

* Use index to restrict rows returned by multi-key index when used with non-consecutive keys to reduce heap accesses
 For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and col3 = 9, spin though the index checking for col1
andcol3 matches, rather than just col1; also called skip-scanning.
 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Best way to scan on-disk bitmaps

From
Greg Stark
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:

> > Hmm.  That particular case will work, but the planner believes that only
> > consecutive columns in the index are usable --- that is, if you have
> > quals for a and c but not for b, it will think that the condition for c
> > isn't usable with the index.  This is true for btree and gist indexes,
> > so I suppose we'd need to introduce a pg_am column that tells what to
> > do.
> 
> We do have a TODO for this:
> 
> * Use index to restrict rows returned by multi-key index when used with
>   non-consecutive keys to reduce heap accesses
> 
>   For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and
>   col3 = 9, spin though the index checking for col1 and col3 matches,
>   rather than just col1; also called skip-scanning.

That TODO is something else. 

Though it is related in that it is another example of why the existing code is
too simplistic and will eventually need to be enhanced. Not only would bitmap
indexes and (possibly) gist indexes, but even btree indexes would need to do
so if this TODO were implemented.

-- 
greg



Re: Best way to scan on-disk bitmaps

From
Manfred Koizar
Date:
On Thu, 12 May 2005 17:40:06 -0400, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>the planner believes that only
>consecutive columns in the index are usable --- that is, if you have
>quals for a and c but not for b, it will think that the condition for c
>isn't usable with the index.  This is true for btree [...]

It's not difficult to setup a test case where an index is used, but
only with a=42 as an index condition, and c='foo' is applied as a
filter condition.  Now add a redundant condition on b... AND b BETWEEN minb AND maxb ...
and watch how c='foo' moves into the index condition.

I did this test some time ago and I believe that adding the condition
on b did not change the estimated cost, only the actual execution
time.

ServusManfred



Re: Best way to scan on-disk bitmaps

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>> Hmm.  That particular case will work, but the planner believes that only
>>> consecutive columns in the index are usable --- that is, if you have
>>> quals for a and c but not for b, it will think that the condition for c
>>> isn't usable with the index.
>>
>> We do have a TODO for this:
>> 
>> * Use index to restrict rows returned by multi-key index when used with
>> non-consecutive keys to reduce heap accesses

> That TODO is something else. 

No, I think Bruce is right --- it's essentially the same thing.  It
certainly would be a good idea to change btrees to work like that,
if we are going to modify the planner to relax the restriction for
other index types.

I think it would be easy to change the planner and btree to handle
this (where "easy" means "I remember where all the skeletons are
buried").  But I don't know the gist code hardly at all.  Can anyone
offer an informed opinion on whether gist can handle this now, and
if not what it would take to handle it?

(hash and rtree are not at issue since they don't support multi-key
indexes.)
        regards, tom lane


Re: Best way to scan on-disk bitmaps

From
Oleg Bartunov
Date:
On Sun, 15 May 2005, Tom Lane wrote:

> Greg Stark <gsstark@mit.edu> writes:
>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>>> Hmm.  That particular case will work, but the planner believes that only
>>>> consecutive columns in the index are usable --- that is, if you have
>>>> quals for a and c but not for b, it will think that the condition for c
>>>> isn't usable with the index.
>>>
>>> We do have a TODO for this:
>>>
>>> * Use index to restrict rows returned by multi-key index when used with
>>> non-consecutive keys to reduce heap accesses
>
>> That TODO is something else.
>
> No, I think Bruce is right --- it's essentially the same thing.  It
> certainly would be a good idea to change btrees to work like that,
> if we are going to modify the planner to relax the restriction for
> other index types.
>
> I think it would be easy to change the planner and btree to handle
> this (where "easy" means "I remember where all the skeletons are
> buried").  But I don't know the gist code hardly at all.  Can anyone
> offer an informed opinion on whether gist can handle this now, and
> if not what it would take to handle it?

I think that handling this in GiST is depends solely on how users consistent 
function designed to handle NULLs in keys. Other words, it should works as 
soon as users consistent function "know" what to do with NULLs in internal keys.

Teodor will comment multi-key GiST tomorrow.

We used Paul Aoki paper "Generalizing ''Search'' in Generalized Search Trees",
(available from http://www.sai.msu.su/~megera/postgres/gist/papers/csd-97-950.pdf )
for implementation of multi-key GiST index support. It's true, that first
key is used for splitting, but elements with duplicated first key could
be shuffled to get better clustering on second key.

>
> (hash and rtree are not at issue since they don't support multi-key
> indexes.)
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Best way to scan on-disk bitmaps

From
Greg Stark
Date:

Tom Lane <tgl@sss.pgh.pa.us> writes:

> I think it would be easy to change the planner and btree to handle
> this (where "easy" means "I remember where all the skeletons are
> buried").  But I don't know the gist code hardly at all.  Can anyone
> offer an informed opinion on whether gist can handle this now, and
> if not what it would take to handle it?

Currently gist indexes only use the first column for page splits, making
multi-key gist indexes basically useless. The problem is it's hard to imagine
an API for a pickSplit function that could handle multi-key indexes with
disparate data types and operator classes. I had an idea of a new way to deal
with gist indexes that simplified the API and side-stepped the whole issue but
you raised concerns that it might be too limiting.

Unfortunately the mailing list archive seems to have missed this discussion.
I've attached three messages from the discussion at the time.

Oleg & Teodor,

If I understand the code correctly, GiST will only pass the first
attribute of each index tuple to the user-defined PickSplit method when
it wants to split a node. (see circa line 1269 of gist.c)

Is this a wise design decision? Granted, in many situations the first
attribute in the index is sufficient to make a reasonable decision about
how to divide the node into two halves, but I don't think that is
universally true. For example, consider a two column index whose first
attribute has a small number of distinct values. It could well be that
all the first attribute values in a node-to-be-split would be the same.
Only passing the first attribute to PickSplit would result in an
essentially random distribution of tuples among the split nodes, rather
than allowing the GiST extension to make use of the second attribution
to partition the nodes. That's an extreme example, but it is easy to
construct more realistic scenarios (basically, any situation in which
the cardinality of the first index attribute is low -- a reasonably
common occurrence with a multi-column index, I believe).

I'm not sure whether this would be a problem in practice. Speculation:
repeated invocations of PickSplit are one of the main factors in
deriving the ultimate shape of the GiST tree. Poor distribution of keys
resulting from PickSplit would eventually result in unnecessarily loose
bounding predicates in internal nodes, which would hurt performance.

Comments welcome,

Neil

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings


Greg Stark <gsstark@MIT.EDU> writes:

> I'm not sure that GiST indexes behave the same way as btree indexes for the
> multi-column case.
>
> In a btree index the second column is entirely subordinate to the first
> column. In a GiST index the data is multi-dimensional, and all dimensions are
> equally important.

In fact on further consideration I do have a proposal.

If you look at the GiST implementations for various index types you'll see
that many (all?) take the same approach for PickSplit. In fact they pretty
much all have the same code copy/pasted to handle it.

The approach they take is to have a function which calculates an abstract
"distance" between any two entries. There's an algorithm that they use to pick
the split based on this distance function.

If you abandoned "PickSplit" and instead exposed this distance function as the
external API then the behaviour for multi-column indexes is clear. You
calculate the distance along all the axes and calculate the diagonal distance.

I think abandoning PickSplit for the distance function might also mean you
don't need a separate function for Penalty either, but I'm not sure on that.

--
greg



Greg Stark <gsstark@MIT.EDU> writes:
> The approach they take is to have a function which calculates an
> abstract "distance" between any two entries. There's an algorithm that
> they use to pick the split based on this distance function.

> If you abandoned "PickSplit" and instead exposed this distance
> function as the external API then the behaviour for multi-column
> indexes is clear. You calculate the distance along all the axes and
> calculate the diagonal distance.

Hmm ... the problem with that is the assumption that different opclasses
will compute similarly-scaled distances.  If opclass A generates
distances in the range (0,1e6) while B generates in the range (0,1),
combining them with Euclidean distance won't work well at all.  OTOH you
can't blindly normalize, because in some cases maybe the data is such
that a massive difference in distances is truly appropriate.

I'm also a bit leery of the assumption that every GiST application can
reduce its PickSplit logic to Euclidean distances.

            regards, tom lane





> (hash and rtree are not at issue since they don't support multi-key
> indexes.)

It seems like it would be trivial to make hash indexes do something useful.
Maybe it's not clear which useful something to do though.

a) hash each column and xor the results. This requires all columns be present
   for an index lookup. Again this requires making the planner logic more
   flexible and specific to each index method.

b) Partition take only a few bits from each column's hash. This would actually
   let the hash index use any subset of columns though it seems like it would
   be prohibitively expensive at first blush. And it requires more than just
   planner hacking to take advantage of.



--
greg

Re: Best way to scan on-disk bitmaps

From
Teodor Sigaev
Date:
About page splitting algorithm in GiST in multikey case. For the beginning, page 
is splitted by calling pickSplit method of key of first column (pickSplit method 
is defined for opclass and it is a user function), then it try to find equal 
values  of first column in left and right pages ( gist.c lines 1264-1901 ). If 
it is, then GiST core will try to resort tuples with first equal keys between 
left and right pages using penalty method for second and higher column's key. If 
it's not, it leave pages untouched. But unions for parent page of second and 
higher column's keys will be formed.

So, if index is defined as 'using gist (a,b,c)' then, in principle, GiST index 
can speed up queries like 'a=V1 and c=V2'.  But it will not usable for queries 
( b=V3 and c=V2 ). By the way, instead of '=' operation may be used other 
operations. Number of supported operations by GiST is indefinite unlike, for 
example, btree which supported only five: <, <=, =, =>, >.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Best way to scan on-disk bitmaps

From
Bruce Momjian
Date:
If people have GIST TODOs, please post them.

---------------------------------------------------------------------------

Teodor Sigaev wrote:
> About page splitting algorithm in GiST in multikey case. For the beginning, page 
> is splitted by calling pickSplit method of key of first column (pickSplit method 
> is defined for opclass and it is a user function), then it try to find equal 
> values  of first column in left and right pages ( gist.c lines 1264-1901 ). If 
> it is, then GiST core will try to resort tuples with first equal keys between 
> left and right pages using penalty method for second and higher column's key. If 
> it's not, it leave pages untouched. But unions for parent page of second and 
> higher column's keys will be formed.
> 
> So, if index is defined as 'using gist (a,b,c)' then, in principle, GiST index 
> can speed up queries like 'a=V1 and c=V2'.  But it will not usable for queries 
> ( b=V3 and c=V2 ). By the way, instead of '=' operation may be used other 
> operations. Number of supported operations by GiST is indefinite unlike, for 
> example, btree which supported only five: <, <=, =, =>, >.
> 
> 
> 
> -- 
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>                                                     WWW: http://www.sigaev.ru/
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Best way to scan on-disk bitmaps

From
Christopher Kings-Lynne
Date:

Bruce Momjian wrote:
> If people have GIST TODOs, please post them.

Concurrency :)


Re: Best way to scan on-disk bitmaps

From
Alvaro Herrera
Date:
On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote:
> 
> 
> Bruce Momjian wrote:
> >If people have GIST TODOs, please post them.
> 
> Concurrency :)

And WAL support.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"No necesitamos banderasNo reconocemos fronteras"                  (Jorge González)


Re: Best way to scan on-disk bitmaps

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> On Tue, May 17, 2005 at 09:39:20AM +0800, Christopher Kings-Lynne wrote:
> > 
> > 
> > Bruce Momjian wrote:
> > >If people have GIST TODOs, please post them.
> > 
> > Concurrency :)
> 
> And WAL support.

Already there:
* Add WAL index reliability improvement to non-btree indexes

and this too:
* Add concurrency to GIST

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Best way to scan on-disk bitmaps

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> wrote:
> ... So, if index is defined as 'using gist (a,b,c)' then, in
> principle, GiST index can speed up queries like 'a=V1 and c=V2'.  But
> it will not usable for queries ( b=V3 and c=V2 ). By the way, instead
> of '=' operation may be used other operations. Number of supported
> operations by GiST is indefinite unlike, for example, btree which
> supported only five: <, <=, =, =>, >.

I have committed changes to the planner to arrange that a GiST indexscan
must supply at least one restriction clause for the first index column,
and can supply restriction clauses for any, all, or none of the
remaining columns; the old left-to-right heuristic is gone.

As far as I can tell, this doesn't require any changes to the GiST code,
but please take another look if you aren't too sure about it.
        regards, tom lane


Re: Best way to scan on-disk bitmaps

From
Oleg Bartunov
Date:
On Mon, 13 Jun 2005, Tom Lane wrote:

> Teodor Sigaev <teodor@sigaev.ru> wrote:
>> ... So, if index is defined as 'using gist (a,b,c)' then, in
>> principle, GiST index can speed up queries like 'a=V1 and c=V2'.  But
>> it will not usable for queries ( b=V3 and c=V2 ). By the way, instead
>> of '=' operation may be used other operations. Number of supported
>> operations by GiST is indefinite unlike, for example, btree which
>> supported only five: <, <=, =, =>, >.
>
> I have committed changes to the planner to arrange that a GiST indexscan
> must supply at least one restriction clause for the first index column,
> and can supply restriction clauses for any, all, or none of the
> remaining columns; the old left-to-right heuristic is gone.
>
> As far as I can tell, this doesn't require any changes to the GiST code,
> but please take another look if you aren't too sure about it.

I did quick test and found no problem with gist(a,b,c) and index does used for 
(a,*,c) case

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83