Thread: Block B-Tree concept

Block B-Tree concept

From
Heikki Linnakangas
Date:
I've been experimenting with the idea of a so-called Block B-Tree. The 
basic idea is that instead of storing an index tuple for each heap 
tuple, we store an index tuple for each heap block. This dramatically 
reduces the size of an index, leading to savings on I/O. This idea was 
briefly discussed in January: 
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00565.php

To make it actually work, the semantics of the B-Tree has been modified 
so that every index tuple represents 1 or more heap tuples that fall 
within some range of values, and are on the same heap page. The range 
that an index tuple represents is from X, inclusive, to Y, exclusive, 
where X is the key of the index tuple and Y is the key of the *next* 
index tuple in the index. If the heap is in index order (as after 
CLUSTER), we get a very compact index this way, effectively eliminating 
the leaf level of the B-tree.

To locate the actual matching items on the heap page, we have to scan 
the heap page because we don't have the item ids, so this is a tradeoff 
between CPU and I/O. However, we could have a hybrid where we initially 
store heap tuple pointers like we do now, and when there's more than X 
consecutive pointers to the same page, we collapse them to just one 
pointer to the whole page. This would potentially give us the best of 
both worlds.

This design is more flexible and less invasive than true 
index-organized-tables, because it doesn't require perfect ordering of 
the heap or moving heap tuples around. You can also define than one 
Block B-Tree on a table, though you wouldn't get much benefit from using 
Block B-Tree instead of normal B-Tree if there's no correlation between 
the index order and the heap order.

It also fits in nicely with the bitmap scans, since there's already 
support for "lossy" bitmap pages. Doing normal ordered index scans 
requires some coding, but can be done.

Thoughts? I'm thinking of getting this in early in the 8.3 cycle. We'll 
have to take a look at the indexam API to support both bitmap indexes 
and block B-trees nicely.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I've been experimenting with the idea of a so-called Block B-Tree. The 
> basic idea is that instead of storing an index tuple for each heap 
> tuple, we store an index tuple for each heap block. This dramatically 
> reduces the size of an index, leading to savings on I/O.

VACUUM?
        regards, tom lane


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>   
>> I've been experimenting with the idea of a so-called Block B-Tree. The 
>> basic idea is that instead of storing an index tuple for each heap 
>> tuple, we store an index tuple for each heap block. This dramatically 
>> reduces the size of an index, leading to savings on I/O.
>>     
>
> VACUUM?
>   
There's a few options that I've thought of this far:

1. Whenever a tuple is found dead on page X, vacuum of the index will 
have to go to that page again to see if there's any matching tuples left.

2. Have a reference counter on index tuple that's increased on insert 
and decreased by vacuum.

3. Do nothing. Let index scans mark the index tuple as dead when it's 
convenient. There's no correctness problem with just leaving dead index 
tuples there, because you have to check the index quals on each heap 
tuple anyway when you scan.

3. probably isn't an option in itself, but we might want to do some kind 
of a mixture of 1 and 3.

I'm thinking that Block B-Tree is not a candidate for non-MVCC system 
catalogs, but I think that's OK.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com



Re: Block B-Tree concept

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> >   
> >> I've been experimenting with the idea of a so-called Block B-Tree. The 
> >> basic idea is that instead of storing an index tuple for each heap 
> >> tuple, we store an index tuple for each heap block. This dramatically 
> >> reduces the size of an index, leading to savings on I/O.
> >>     
> >
> > VACUUM?
> >   
> There's a few options that I've thought of this far:
> 
> 1. Whenever a tuple is found dead on page X, vacuum of the index will 
> have to go to that page again to see if there's any matching tuples left.

Right now, if an index entry points to a dead tuple, we set a bit in 
the index so future lookups do not access the heap.  We could set a bit 
for block index entries that point to a page that has no live rows, and
have vacuum remove the index entry later.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block B-Tree concept

From
Teodor Sigaev
Date:
> Right now, if an index entry points to a dead tuple, we set a bit in 
> the index so future lookups do not access the heap.  We could set a bit 
> for block index entries that point to a page that has no live rows, and
> have vacuum remove the index entry later.

GIN don't support this feature...
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Block B-Tree concept

From
Bruce Momjian
Date:
Teodor Sigaev wrote:
> > Right now, if an index entry points to a dead tuple, we set a bit in 
> > the index so future lookups do not access the heap.  We could set a bit 
> > for block index entries that point to a page that has no live rows, and
> > have vacuum remove the index entry later.
> 
> GIN don't support this feature...

I think block indexes would only be for btrees.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Teodor Sigaev wrote:
>> Right now, if an index entry points to a dead tuple, we set a bit in 
>> the index so future lookups do not access the heap.  We could set a 
>> bit for block index entries that point to a page that has no live 
>> rows, and
>> have vacuum remove the index entry later. 
>
> GIN don't support this feature... 
I'm only talking about B-trees at this stage. ISTM that you could do the 
same thing with hash indexes, but I haven't given it much thought.

Anyway, I think you'd usually want to use bitmap scans with a Block 
B-tree, unless you need sorted output. And bitmap scans don't set the 
LP_DELETE flag either. We might want to do something about that.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> VACUUM?
>> 
> There's a few options that I've thought of this far:

> 1. Whenever a tuple is found dead on page X, vacuum of the index will 
> have to go to that page again to see if there's any matching tuples left.

Anything that involves having VACUUM re-evaluate index expressions is a
nonstarter ... or have you already forgotten the optimizations we put
into 8.2 that assume, eg, no sub-transactions within a VACUUM?

> 2. Have a reference counter on index tuple that's increased on insert 
> and decreased by vacuum.

The "increase on insert" part I understand, the "decrease by vacuum"
part seems to have the same problem as #1.  How do you tell which index
entries should be changed?

> 3. Do nothing. Let index scans mark the index tuple as dead when it's 
> convenient. There's no correctness problem with just leaving dead index 
> tuples there, because you have to check the index quals on each heap 
> tuple anyway when you scan.

And we're back to routine REINDEX I guess :-(.  This doesn't seem like a
satisfactory answer.
        regards, tom lane


Re: Block B-Tree concept

From
Csaba Nagy
Date:
> And we're back to routine REINDEX I guess :-(.  This doesn't seem like a
> satisfactory answer.

If the reindex works online, it could be a satisfactory solution.

Cheers,
Csaba.




Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Anything that involves having VACUUM re-evaluate index expressions is a
> nonstarter ... or have you already forgotten the optimizations we put
> into 8.2 that assume, eg, no sub-transactions within a VACUUM?

Umm, I'm afraid I have. Could you give me a clue?

>> 3. Do nothing. Let index scans mark the index tuple as dead when it's
>> convenient. There's no correctness problem with just leaving dead index
>> tuples there, because you have to check the index quals on each heap
>> tuple anyway when you scan.
>
> And we're back to routine REINDEX I guess :-(. This doesn't seem like a
> satisfactory answer.

In general, it isn't.

Though there are interesting use cases where it would be fine. For 
example, if you remove old data by dropping a partition, there's never 
really need to vacuum. Or if all of the data is accessed during normal 
operation, the index scans set the LP_DELETE flags and no additional 
vacuum is really needed.

Also, now that we have concurrent CREATE INDEX, we could implement 
concurrent REINDEX as well, I believe.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
> >Anything that involves having VACUUM re-evaluate index expressions is a
> >nonstarter ... or have you already forgotten the optimizations we put
> >into 8.2 that assume, eg, no sub-transactions within a VACUUM?
> 
> Umm, I'm afraid I have. Could you give me a clue?

The patch by Hannu Krossing that allows a lazy vacuum to ignore other
lazy vacuums, when computing the OldestXmin.

The point is you can only ignore other lazy vacuums if they are not
going to visit the heap.  Otherwise you'd risk removing a tuple that the
other vacuum wants to see.

You could here shout that surely no two vacuums could be trying to clean
the same table at the same time, but it turns out that you can have
functional indexes that scan other tables.  Sure, it's a bad idea, but ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Anything that involves having VACUUM re-evaluate index expressions is a
>> nonstarter ... or have you already forgotten the optimizations we put
>> into 8.2 that assume, eg, no sub-transactions within a VACUUM?
>
> Umm, I'm afraid I have. Could you give me a clue?
I think I found it. Is this what you're talking about (in 
commands/vacuum.c):
       /*        * During a lazy VACUUM we do not run any user-supplied functions,        * and so it should be safe to
notcreate a transaction snapshot.        *        * We can furthermore set the inVacuum flag, which lets other        *
concurrentVACUUMs know that they can ignore this one while        * determining their OldestXmin.  (The reason we don't
setinVacuum        * during a full VACUUM is exactly that we may have to run user-        * defined functions for
functionalindexes, and we want to make        * sure that if they use the snapshot set above, any tuples it        *
requirescan't get removed from other tables.  An index function        * that depends on the contents of other tables
isarguably broken,        * but we won't break it here by violating transaction semantics.)        *        * Note: the
inVacuumflag remains set until CommitTransaction or        * AbortTransaction.  We don't want to clear it until we
reset       * MyProc->xid/xmin, else OldestXmin might appear to go backwards,        * which is probably Not Good.
 */       MyProc->inVacuum = true;
 

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Anything that involves having VACUUM re-evaluate index expressions is a
>>> nonstarter ... or have you already forgotten the optimizations we put
>>> into 8.2 that assume, eg, no sub-transactions within a VACUUM?

> I think I found it. Is this what you're talking about (in 
> commands/vacuum.c):

That's part of it, but I seem to recall other things too --- in
particular, the point about subtransactions troubles me.  Whatever you
think about an index function looking at other tables, it is perfectly
legitimate to have an exception block in an index function, and that
requires subtransactions (at least as plpgsql is now implemented).
        regards, tom lane


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Also, now that we have concurrent CREATE INDEX, we could implement 
> concurrent REINDEX as well, I believe.

That's probably more easily said than done --- in particular, I don't
understand what the committed state after the first transaction would
look like.  CREATE INDEX can get away with it because nothing need be
depending on the new index, but you can't say that for an existing index
(esp. if it's UNIQUE).
        regards, tom lane


Re: Block B-Tree concept

From
"Jim C. Nasby"
Date:
On Tue, Sep 26, 2006 at 11:16:54AM +0100, Heikki Linnakangas wrote:
> To locate the actual matching items on the heap page, we have to scan 
> the heap page because we don't have the item ids, so this is a tradeoff 
> between CPU and I/O. However, we could have a hybrid where we initially 
> store heap tuple pointers like we do now, and when there's more than X 
> consecutive pointers to the same page, we collapse them to just one 
> pointer to the whole page. This would potentially give us the best of 
> both worlds.
> 
> This design is more flexible and less invasive than true 
> index-organized-tables, because it doesn't require perfect ordering of 
> the heap or moving heap tuples around. You can also define than one 
> Block B-Tree on a table, though you wouldn't get much benefit from using 
> Block B-Tree instead of normal B-Tree if there's no correlation between 
> the index order and the heap order.

No, but I think there's scenarios where you may not have extremely high
correlation but you'd still benefit, especially with the hybrid
approach. If you have a field with rather skewed histogram, for example,
where you're likely to have a lot of tuples with one value on any given
page. Of course, you probably would want to exclude that value from the
index entirely if it's on *every* page, but anything where you see
paterns of data wouldn't be like that.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Block B-Tree concept

From
"Jim C. Nasby"
Date:
On Tue, Sep 26, 2006 at 05:27:56PM +0100, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> >Tom Lane wrote:
> >>Anything that involves having VACUUM re-evaluate index expressions is a
> >>nonstarter ... or have you already forgotten the optimizations we put
> >>into 8.2 that assume, eg, no sub-transactions within a VACUUM?
> >
> >Umm, I'm afraid I have. Could you give me a clue?
> I think I found it. Is this what you're talking about (in 
> commands/vacuum.c):
> 
>        /*
>         * During a lazy VACUUM we do not run any user-supplied functions,
>         * and so it should be safe to not create a transaction snapshot.
>         *
>         * We can furthermore set the inVacuum flag, which lets other
>         * concurrent VACUUMs know that they can ignore this one while
>         * determining their OldestXmin.  (The reason we don't set inVacuum
>         * during a full VACUUM is exactly that we may have to run user-
>         * defined functions for functional indexes, and we want to make
>         * sure that if they use the snapshot set above, any tuples it
>         * requires can't get removed from other tables.  An index function
>         * that depends on the contents of other tables is arguably broken,
>         * but we won't break it here by violating transaction semantics.)
>         *
>         * Note: the inVacuum flag remains set until CommitTransaction or
>         * AbortTransaction.  We don't want to clear it until we reset
>         * MyProc->xid/xmin, else OldestXmin might appear to go backwards,
>         * which is probably Not Good.
>         */
>        MyProc->inVacuum = true;

Do I understand that to mean that you can no longer lazy vacuum a
functional index?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Block B-Tree concept

From
"Jim C. Nasby"
Date:
On Tue, Sep 26, 2006 at 08:51:10AM -0400, Tom Lane wrote:
> > 3. Do nothing. Let index scans mark the index tuple as dead when it's 
> > convenient. There's no correctness problem with just leaving dead index 
> > tuples there, because you have to check the index quals on each heap 
> > tuple anyway when you scan.
> 
> And we're back to routine REINDEX I guess :-(.  This doesn't seem like a
> satisfactory answer.

Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
that anyway right now?

Granted, you'd want to periodically ensure that you scan the entire
index, but that shouldn't be horribly hard to set up.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> Do I understand that to mean that you can no longer lazy vacuum a
> functional index?

With the normal B-trees we have now, there's no problem vacuuming a 
functional index. But it would be a problem with the Block B-tree, 
because the proposed way of vacuuming a Block B-tree involves 
re-evaluating index expressions.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Jim C. Nasby wrote:
> Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> that anyway right now?

You mean _index_ tuples marked dead? Sure, no problem there.

> Granted, you'd want to periodically ensure that you scan the entire
> index, but that shouldn't be horribly hard to set up.

Well, it seems to be.  A vacuum can't evaluate index expressions because 
it's not in a real transaction.

The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" 
etc. with enable_seqscan=false? That would work, but we can't depend on 
an additional administrative task like. And we might as well just 
disable the optimization that's causing us problems.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Bruce Momjian
Date:
Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> > that anyway right now?
> 
> You mean _index_ tuples marked dead? Sure, no problem there.
> 
> > Granted, you'd want to periodically ensure that you scan the entire
> > index, but that shouldn't be horribly hard to set up.
> 
> Well, it seems to be.  A vacuum can't evaluate index expressions because 
> it's not in a real transaction.
> 
> The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" 
> etc. with enable_seqscan=false? That would work, but we can't depend on 
> an additional administrative task like. And we might as well just 
> disable the optimization that's causing us problems.

Why can't the C code just do a full index scan that touches the heap,
sets those expired bits, and then do a vacuum?  My point is that the
bits can be set outside the normal vacuum process, so you don't have to
be doing heap lookups from the index inside vacuum.

Assuming the heap is mostly in index order, the full index scan
shouldn't take much longer than a heap scan, and if the heap doesn't
match index order, a block index shouldn't be used anyway.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Jim C. Nasby wrote:
>>> Granted, you'd want to periodically ensure that you scan the entire
>>> index, but that shouldn't be horribly hard to set up.
>> Well, it seems to be. A vacuum can't evaluate index expressions because
>> it's not in a real transaction.
>>
>> The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0"
>> etc. with enable_seqscan=false? That would work, but we can't depend on
>> an additional administrative task like. And we might as well just
>> disable the optimization that's causing us problems.
>
> Why can't the C code just do a full index scan that touches the heap,
> sets those expired bits, and then do a vacuum? My point is that the
> bits can be set outside the normal vacuum process, so you don't have to
> be doing heap lookups from the index inside vacuum.

The point of the optimization that's causing problems was to reduce the 
effect of long-running vacuum transactions. If we're going to have 
another long running transaction instead, we're back to square one.

AFAICS, we could disable the optimization and use a full-blown 
transaction when vacuuming a table with a functional block index. 
Granted, that's annoying, but not a show-stopper I think.

> Assuming the heap is mostly in index order, the full index scan
> shouldn't take much longer than a heap scan, and if the heap doesn't
> match index order, a block index shouldn't be used anyway.

It introduces one more full heap scan for each block index on a table. 
That's expensive.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> AFAICS, we could disable the optimization and use a full-blown 
> transaction when vacuuming a table with a functional block index. 

No, we couldn't, or at least it's not merely a matter of reversing a
recent optimization.

The fundamental issue with all these proposals is the assumption that
you can re-compute the index entries at all.  VACUUM has never, ever,
depended on the assumption that it can re-evaluate index entries and
get the same answers as the original insertion did.  Now obviously
it should theoretically be able to do that, in a theoretical bug-free
world, but given that we allow user-defined functions in index
expressions that is a very hazardous assumption: you might get a
different answer.  Or an error.  The current VACUUM procedure is able
to clean up index entries correctly without any recalculation of the
index values, just matching of TIDs.  I think we'd be taking a serious
robustness hit if we abandon that property.

This is basically the same objection that I have to the occasional
proposals for "retail VACUUM".

BTW, it's not merely a problem for functional indexes: the
datatype-specific functions invoked while searching an index are also
hazards.
        regards, tom lane


Re: Block B-Tree concept

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Also, now that we have concurrent CREATE INDEX, we could implement 
>> concurrent REINDEX as well, I believe.
>
> That's probably more easily said than done --- in particular, I don't
> understand what the committed state after the first transaction would
> look like.  CREATE INDEX can get away with it because nothing need be
> depending on the new index, but you can't say that for an existing index
> (esp. if it's UNIQUE).

I think you build a whole new index named something like ".temp-reindex" and
then as the last step of the second transaction delete the old idnex and
rename the new index.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Block B-Tree concept

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> That's probably more easily said than done --- in particular, I don't
>> understand what the committed state after the first transaction would
>> look like.

> I think you build a whole new index named something like ".temp-reindex" and
> then as the last step of the second transaction delete the old idnex and
> rename the new index.

That would require getting exclusive lock on the table.
        regards, tom lane


Re: Block B-Tree concept

From
Csaba Nagy
Date:
> > I think you build a whole new index named something like ".temp-reindex" and
> > then as the last step of the second transaction delete the old idnex and
> > rename the new index.
> 
> That would require getting exclusive lock on the table.

Just out of curiosity, creating a new index concurrently (or online,
whatever you call it) doesn't require to set an exclusive lock on the
table ? I thought it would, at least swiftly at the end of the
operation, after all it's modifying the table...

Cheers,
Csaba.




Re: Block B-Tree concept

From
"Jim C. Nasby"
Date:
On Wed, Sep 27, 2006 at 05:38:38AM -0400, Bruce Momjian wrote:
> Heikki Linnakangas wrote:
> > Jim C. Nasby wrote:
> > > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do
> > > that anyway right now?
> > 
> > You mean _index_ tuples marked dead? Sure, no problem there.
> > 
> > > Granted, you'd want to periodically ensure that you scan the entire
> > > index, but that shouldn't be horribly hard to set up.
> > 
> > Well, it seems to be.  A vacuum can't evaluate index expressions because 
> > it's not in a real transaction.
> > 
> > The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" 
> > etc. with enable_seqscan=false? That would work, but we can't depend on 
> > an additional administrative task like. And we might as well just 
> > disable the optimization that's causing us problems.
> 
> Why can't the C code just do a full index scan that touches the heap,
> sets those expired bits, and then do a vacuum?  My point is that the
> bits can be set outside the normal vacuum process, so you don't have to
> be doing heap lookups from the index inside vacuum.
> 
> Assuming the heap is mostly in index order, the full index scan
> shouldn't take much longer than a heap scan, and if the heap doesn't
> match index order, a block index shouldn't be used anyway.

Well, my thought was to have a backend process that would periodically
scan a small section of the index, so that you wouldn't have a
long-running transaction. That could then be followed by a vacuum of
that same section of the index, which would nuke the dead tuple entries.

Though, maybe we wouldn't even need the vacuum step since 8.2 will now
reclaim tuples marked as dead?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> AFAICS, we could disable the optimization and use a full-blown
>> transaction when vacuuming a table with a functional block index.
>
> No, we couldn't, or at least it's not merely a matter of reversing a
> recent optimization.
>
> The fundamental issue with all these proposals is the assumption that
> you can re-compute the index entries at all. VACUUM has never, ever,
> depended on the assumption that it can re-evaluate index entries and
> get the same answers as the original insertion did. Now obviously
> it should theoretically be able to do that, in a theoretical bug-free
> world, but given that we allow user-defined functions in index
> expressions that is a very hazardous assumption: you might get a
> different answer. Or an error. The current VACUUM procedure is able
> to clean up index entries correctly without any recalculation of the
> index values, just matching of TIDs. I think we'd be taking a serious
> robustness hit if we abandon that property.

I'm not worried about getting different results. If a used-defined 
function behaves badly, you're queries are screwed anyway. But throwing 
an error would be bad, because that would abort the whole vacuum.

If we want to keep the property that VACUUM doesn't re-evaluate index 
entries, any proposal that doesn't keep track of every heap tuple isn't 
going to work. I'll try to modify the design I had in mind so that it 
does keep track of every heap tuple in some form.

> This is basically the same objection that I have to the occasional
> proposals for "retail VACUUM".

Yeah. :-(

> BTW, it's not merely a problem for functional indexes: the
> datatype-specific functions invoked while searching an index are also
> hazards.

Good point. I didn't realize that before.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> If we want to keep the property that VACUUM doesn't re-evaluate index 
> entries, any proposal that doesn't keep track of every heap tuple 
> isn't going to work. I'll try to modify the design I had in mind so 
> that it does keep track of every heap tuple in some form.
After some thought:

Imagine a normal B-tree just like what we have now. But when there is 
more than one tuple on the same heap page with consecutive index keys, 
we represent all of them in a single index tuple that contains the key 
of the first one of them, and a (run-length encoded) bitmap of the 
OffsetNumbers. We should get most of the space and I/O savings as with 
the original proposal, but we can vacuum it without re-evaluating index 
expressions.

It does change the format of an index tuple, unlike the original 
proposal. That adds some complexity. but it's doable.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Martijn van Oosterhout
Date:
On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote:
> After some thought:
>
> Imagine a normal B-tree just like what we have now. But when there is
> more than one tuple on the same heap page with consecutive index keys,
> we represent all of them in a single index tuple that contains the key
> of the first one of them, and a (run-length encoded) bitmap of the
> OffsetNumbers. We should get most of the space and I/O savings as with
> the original proposal, but we can vacuum it without re-evaluating index
> expressions.

I think something like this has been discussed before. Where an index
tuple currently has the key values followed by a ctid, simply change
that so that it can be a list of ctid's, in order.

This works on having the same key, but doesn't care if the tuples are
on the same page. It used to be not possible because of how to handle
forward and backward scanning within an index entry during updates. I
think with the new "page at a time" index scanning, this got a lot
easier.

One issue is that a single index page could hold more than 1000 index
entries, which might cause problems for callers.

> It does change the format of an index tuple, unlike the original
> proposal. That adds some complexity. but it's doable.

This way doesn't change the current index format much.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote:
>> After some thought:
>>
>> Imagine a normal B-tree just like what we have now. But when there is
>> more than one tuple on the same heap page with consecutive index keys,
>> we represent all of them in a single index tuple that contains the key
>> of the first one of them, and a (run-length encoded) bitmap of the
>> OffsetNumbers. We should get most of the space and I/O savings as with
>> the original proposal, but we can vacuum it without re-evaluating index
>> expressions.
>
> I think something like this has been discussed before. Where an index
> tuple currently has the key values followed by a ctid, simply change
> that so that it can be a list of ctid's, in order.

Actually it's t_tid followed by t_info (which is size + flags) followed 
by key values.

> This works on having the same key, but doesn't care if the tuples are
> on the same page. It used to be not possible because of how to handle
> forward and backward scanning within an index entry during updates. I
> think with the new "page at a time" index scanning, this got a lot
> easier.

I'm not very interested in the case where you have a lot of equal keys, 
I think the bitmap index am is more suitable for that. The Block B-tree 
could be used whenever you have a clustered table (including unique 
indexes). Some DBMSs have index-organized-tables for the same use case.

When I tested a prototype of the original idea with TPC-C (DBT-2) data, 
a block index on the order_line table was under 2% of the size of a 
normal B-tree. That's very close to a best-case scenario; narrow rows 
and a completely clustered table. I'm aiming at that order of magnitude 
reduction in storage size.

> One issue is that a single index page could hold more than 1000 index
> entries, which might cause problems for callers.

I don't think that's a problem.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


Re: Block B-Tree concept

From
Simon Riggs
Date:
On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > If we want to keep the property that VACUUM doesn't re-evaluate index 
> > entries, any proposal that doesn't keep track of every heap tuple 
> > isn't going to work. I'll try to modify the design I had in mind so 
> > that it does keep track of every heap tuple in some form.

The ideal situation is that we have one index pointer per block, so we
should look for that and optimize accordingly. We mark the heap block to
indicate how many block index pointers there are to it. If we have only
a single pointer, then VACUUM will only have to touch the index pointer
when the whole heap block is removed. In that case we have no
re-evaluation of the index AFAICS.

> After some thought:
> 
> Imagine a normal B-tree just like what we have now. But when there is 
> more than one tuple on the same heap page with consecutive index keys, 
> we represent all of them in a single index tuple that contains the key 
> of the first one of them, and a (run-length encoded) bitmap of the 
> OffsetNumbers. We should get most of the space and I/O savings as with 
> the original proposal, but we can vacuum it without re-evaluating index 
> expressions.

The benefit we're seeking with a block index is that most INSERTs don't
write to the index. With that scheme we'd need to continually update the
index tuple so that it exactly represented the heap after each inserted
tuple, which is going to cause a hot block problem.

Much of that can go away if we have a bulk_heap_insert() which puts the
index entries in at the end of the block, though that needs some heavy
thought also.

Can we have this scheme enabled *only* for functional block indexes?
Normal case is likely to be monotonically ascending keys for real world
objects like Orders, Calls, Transactions etc.. It sounds like the
original proposal is still valid for those cases.

The bitmap would allow us to access heap rows faster in some
circumstances, I suppose.

Multi-block bitmaps do make this too much like bitmap indexes and the
use-case is very different. [Is there some kind of hybrid solution of
block & bitmap indexes?]

> It does change the format of an index tuple, unlike the original 
> proposal. That adds some complexity. but it's doable.

Can we use an info bit to have two index tuple formats?
- single tuple (as now)
- multiple tuple block bitmap (as you suggest)

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



Re: Block B-Tree concept

From
Gregory Stark
Date:
Csaba Nagy <nagy@ecircle-ag.com> writes:

> > > I think you build a whole new index named something like ".temp-reindex" and
> > > then as the last step of the second transaction delete the old idnex and
> > > rename the new index.
> > 
> > That would require getting exclusive lock on the table.
> 
> Just out of curiosity, creating a new index concurrently (or online,
> whatever you call it) doesn't require to set an exclusive lock on the
> table ? I thought it would, at least swiftly at the end of the
> operation, after all it's modifying the table...

Nope.

As I understand it the step that fundamentally requires a table lock is
actually dropping the old index. You have to be sure nobody is actually using
it before you do anything that causes people to stop maintaining it.

We could do something like how the online index build creates the index but in
reverse. We mark the index invalid and then wait out any transactions that
could be using it. Then we can drop it safely.

But I think even that has some practical problems. Transactions that have that
index in their relcache structure for the table will try to maintain it and
get confused if it's gone.

It seems to me that taking a brief lock on the table at the end of the reindex
isn't actually much of a problem. It only needs to be held briefly and it can
be done in a separate transaction so there isn't a deadlock risk.

-- 
greg



Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> If we want to keep the property that VACUUM doesn't re-evaluate index
>>> entries, any proposal that doesn't keep track of every heap tuple
>>> isn't going to work. I'll try to modify the design I had in mind so
>>> that it does keep track of every heap tuple in some form.
>
> The ideal situation is that we have one index pointer per block, so we
> should look for that and optimize accordingly. We mark the heap block to
> indicate how many block index pointers there are to it. If we have only
> a single pointer, then VACUUM will only have to touch the index pointer
> when the whole heap block is removed. In that case we have no
> re-evaluation of the index AFAICS.

I don't see how that would work. It sounds similar to the reference 
counting option I proposed, which had the same re-evaluation problem.

And in addition, it requires adding index-specific information to the 
heap page format, which troubles me from a modularization viewpoint.
>> After some thought:
>>
>> Imagine a normal B-tree just like what we have now. But when there is
>> more than one tuple on the same heap page with consecutive index keys,
>> we represent all of them in a single index tuple that contains the key
>> of the first one of them, and a (run-length encoded) bitmap of the
>> OffsetNumbers. We should get most of the space and I/O savings as with
>> the original proposal, but we can vacuum it without re-evaluating index
>> expressions.
>
> The benefit we're seeking with a block index is that most INSERTs don't
> write to the index. With that scheme we'd need to continually update the
> index tuple so that it exactly represented the heap after each inserted
> tuple, which is going to cause a hot block problem.

That's just one of the benefits. I think the main benefit is dramatic 
reduction in index size which means that more of the index is cached.

An INSERT will have to find the corresponding leaf page anyway. Having 
to dirty it isn't a big deal assuming that the hot blocks stay in cache.

The hot block problem worries me a bit too. Any indexing scheme that 
packs more items on a block is going to suffer from that. Testing will 
show if that becomes a problem.

> Can we have this scheme enabled *only* for functional block indexes?

No. As Tom pointed out, data type specific functions have potentially 
the same problem.

And having both versions seems like a lot of code and complexity.

> The bitmap would allow us to access heap rows faster in some
> circumstances, I suppose.

Yes, you wouldn't have to re-evaluate index quals on every tuple, when 
the whole range represented by the index tuple falls within the range 
you're searching for. And when there's only few tuples with consecutive 
keys on a heap page (which is not a good use case for block B-trees), 
you don't need to scan the whole page to find those matches.

> Multi-block bitmaps do make this too much like bitmap indexes and the
> use-case is very different. [Is there some kind of hybrid solution of
> block & bitmap indexes?]

Not that I know of, though there is different kind of bitmap indexes. 
The one that didn't make it to 8.2 uses equality encoding, where you 
have a bitmap for every distinct value. You can also have 
range-encoding, where you have a bitmap for ranges of values, for 
example one bitmap for 1-10, another for 10-15 etc. If you choose the 
ranges dynamically so that you have one range for each heap page (when 
it's clustered), you get something similar to the proposed Block B-tree.

The current bitmap encoding scheme is optimized for large bitmaps, 
though, so the performance wouldn't be as good.

>> It does change the format of an index tuple, unlike the original 
>> proposal. That adds some complexity. but it's doable.
>
> Can we use an info bit to have two index tuple formats?
> - single tuple (as now)
> - multiple tuple block bitmap (as you suggest)

Yes. There's one bit free in the index tuple header.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com



Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Imagine a normal B-tree just like what we have now. But when there is 
> more than one tuple on the same heap page with consecutive index keys, 
> we represent all of them in a single index tuple that contains the key 
> of the first one of them, and a (run-length encoded) bitmap of the 
> OffsetNumbers.

At first I thought that was a typo, and instead of "consecutive" you
meant to write "equal".  I gather from the later statement

> I'm not very interested in the case where you have a lot of equal keys, 
> I think the bitmap index am is more suitable for that.

that indeed you meant to write "consecutive", and I've got a problem
with that: define "consecutive".  In a datatype independent fashion,
please.  I also wonder how you are going to implement splitting and
merging of runs, which will certainly be necessary if this isn't to be
a constantly-requires-REINDEX thing.
        regards, tom lane


Re: Block B-Tree concept

From
Simon Riggs
Date:
On Fri, 2006-09-29 at 14:54 +0100, Heikki Linnakangas wrote:

> > The benefit we're seeking with a block index is that most INSERTs don't
> > write to the index. With that scheme we'd need to continually update the
> > index tuple so that it exactly represented the heap after each inserted
> > tuple, which is going to cause a hot block problem.
> 
> That's just one of the benefits. I think the main benefit is dramatic 
> reduction in index size which means that more of the index is cached.
> 
> An INSERT will have to find the corresponding leaf page anyway. Having 
> to dirty it isn't a big deal assuming that the hot blocks stay in cache.

The index tuple would potentially grow in length while we update it, so
that means we'd need exclusive access to write, rather than shared
access to just read the index.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>   
>> I'm not very interested in the case where you have a lot of equal keys, 
>> I think the bitmap index am is more suitable for that.
>>     
>
> that indeed you meant to write "consecutive", and I've got a problem
> with that: define "consecutive".  In a datatype independent fashion,
> please.  I also wonder how you are going to implement splitting and
> merging of runs, which will certainly be necessary if this isn't to be
> a constantly-requires-REINDEX thing.
>   

I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in 
"A and B are consecutive in the index, if there's no value X in the 
index so that A < X < B". Maybe there's a better word for that.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com



Re: Block B-Tree concept

From
Jan de Visser
Date:
On Friday 29 September 2006 10:55, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> >> I'm not very interested in the case where you have a lot of equal keys,
> >> I think the bitmap index am is more suitable for that.
> >
> > that indeed you meant to write "consecutive", and I've got a problem
> > with that: define "consecutive".  In a datatype independent fashion,
> > please.  I also wonder how you are going to implement splitting and
> > merging of runs, which will certainly be necessary if this isn't to be
> > a constantly-requires-REINDEX thing.
>
> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in
> "A and B are consecutive in the index, if there's no value X in the
> index so that A < X < B". Maybe there's a better word for that.

http://en.wikipedia.org/wiki/Monotonic

jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser@digitalfairway.com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------


Re: Block B-Tree concept

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> define "consecutive".

> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in 
> "A and B are consecutive in the index, if there's no value X in the 
> index so that A < X < B". Maybe there's a better word for that.

Um, but how are you going to make that work without storing the keys for
both A and B?  You won't be able to tell whether an incoming key C
that's just a bit bigger than A should go before or after B.
        regards, tom lane


Re: Block B-Tree concept

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>   
>> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in 
>> "A and B are consecutive in the index, if there's no value X in the 
>> index so that A < X < B". Maybe there's a better word for that.
>>     
>
> Um, but how are you going to make that work without storing the keys for
> both A and B?  You won't be able to tell whether an incoming key C
> that's just a bit bigger than A should go before or after B.
>   

Let me describe the insertion algorithm:

1. To insert a tuple with key X, we find the position in the index where 
the new tuple would go, just like with a normal B-tree. Let's call the 
index tuple right before the position A and the next tuple B. So 
according to normal B-tree rules, X should go between A and B.

2. If A points to the same heap page as X, we set the bit representing 
the offset of the new tuple in the index tuple A (this might require 
enlarging the index tuple and event splitting the page), and we're done. 
If it points to a different page, we need split the range A-B to A-X-B, 
proceed to step 3.

3. To split the range A-B, scan the heap page to see which of the tuples 
pointed to by A are >= X and which are < X . If there's no tuples >= X, 
insert a new index tuple for X, and we're done. Otherwise, let Y be the 
smallest tuple >= X. Insert a new index tuple for Y, containing all the 
offsets with keys >= X, and remove those offsets from A. We have now 
split A-B to A-Y-B so that A < X < Y < B.

4. Insert the new index tuple for X.

(I'm not sure I got the above description correct for cases with equal 
keys.)

Note that we don't keep track of the ordering of tuples that are clumped 
into a single index tuple. That's not important, I/O wise, because if 
you're going to fetch a heap page into memory, you might as well scan 
all the tuples on it and sort them if necessary. That's where the space 
and I/O savings comes from.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com