Thread: Block B-Tree concept
I've been experimenting with the idea of a so-called Block B-Tree. The basic idea is that instead of storing an index tuple for each heap tuple, we store an index tuple for each heap block. This dramatically reduces the size of an index, leading to savings on I/O. This idea was briefly discussed in January: http://archives.postgresql.org/pgsql-hackers/2006-01/msg00565.php To make it actually work, the semantics of the B-Tree has been modified so that every index tuple represents 1 or more heap tuples that fall within some range of values, and are on the same heap page. The range that an index tuple represents is from X, inclusive, to Y, exclusive, where X is the key of the index tuple and Y is the key of the *next* index tuple in the index. If the heap is in index order (as after CLUSTER), we get a very compact index this way, effectively eliminating the leaf level of the B-tree. To locate the actual matching items on the heap page, we have to scan the heap page because we don't have the item ids, so this is a tradeoff between CPU and I/O. However, we could have a hybrid where we initially store heap tuple pointers like we do now, and when there's more than X consecutive pointers to the same page, we collapse them to just one pointer to the whole page. This would potentially give us the best of both worlds. This design is more flexible and less invasive than true index-organized-tables, because it doesn't require perfect ordering of the heap or moving heap tuples around. You can also define than one Block B-Tree on a table, though you wouldn't get much benefit from using Block B-Tree instead of normal B-Tree if there's no correlation between the index order and the heap order. It also fits in nicely with the bitmap scans, since there's already support for "lossy" bitmap pages. Doing normal ordered index scans requires some coding, but can be done. Thoughts? I'm thinking of getting this in early in the 8.3 cycle. We'll have to take a look at the indexam API to support both bitmap indexes and block B-trees nicely. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I've been experimenting with the idea of a so-called Block B-Tree. The > basic idea is that instead of storing an index tuple for each heap > tuple, we store an index tuple for each heap block. This dramatically > reduces the size of an index, leading to savings on I/O. VACUUM? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> I've been experimenting with the idea of a so-called Block B-Tree. The >> basic idea is that instead of storing an index tuple for each heap >> tuple, we store an index tuple for each heap block. This dramatically >> reduces the size of an index, leading to savings on I/O. >> > > VACUUM? > There's a few options that I've thought of this far: 1. Whenever a tuple is found dead on page X, vacuum of the index will have to go to that page again to see if there's any matching tuples left. 2. Have a reference counter on index tuple that's increased on insert and decreased by vacuum. 3. Do nothing. Let index scans mark the index tuple as dead when it's convenient. There's no correctness problem with just leaving dead index tuples there, because you have to check the index quals on each heap tuple anyway when you scan. 3. probably isn't an option in itself, but we might want to do some kind of a mixture of 1 and 3. I'm thinking that Block B-Tree is not a candidate for non-MVCC system catalogs, but I think that's OK. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Tom Lane wrote: > > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > > >> I've been experimenting with the idea of a so-called Block B-Tree. The > >> basic idea is that instead of storing an index tuple for each heap > >> tuple, we store an index tuple for each heap block. This dramatically > >> reduces the size of an index, leading to savings on I/O. > >> > > > > VACUUM? > > > There's a few options that I've thought of this far: > > 1. Whenever a tuple is found dead on page X, vacuum of the index will > have to go to that page again to see if there's any matching tuples left. Right now, if an index entry points to a dead tuple, we set a bit in the index so future lookups do not access the heap. We could set a bit for block index entries that point to a page that has no live rows, and have vacuum remove the index entry later. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
> Right now, if an index entry points to a dead tuple, we set a bit in > the index so future lookups do not access the heap. We could set a bit > for block index entries that point to a page that has no live rows, and > have vacuum remove the index entry later. GIN don't support this feature... -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev wrote: > > Right now, if an index entry points to a dead tuple, we set a bit in > > the index so future lookups do not access the heap. We could set a bit > > for block index entries that point to a page that has no live rows, and > > have vacuum remove the index entry later. > > GIN don't support this feature... I think block indexes would only be for btrees. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Teodor Sigaev wrote: >> Right now, if an index entry points to a dead tuple, we set a bit in >> the index so future lookups do not access the heap. We could set a >> bit for block index entries that point to a page that has no live >> rows, and >> have vacuum remove the index entry later. > > GIN don't support this feature... I'm only talking about B-trees at this stage. ISTM that you could do the same thing with hash indexes, but I haven't given it much thought. Anyway, I think you'd usually want to use bitmap scans with a Block B-tree, unless you need sorted output. And bitmap scans don't set the LP_DELETE flag either. We might want to do something about that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> VACUUM? >> > There's a few options that I've thought of this far: > 1. Whenever a tuple is found dead on page X, vacuum of the index will > have to go to that page again to see if there's any matching tuples left. Anything that involves having VACUUM re-evaluate index expressions is a nonstarter ... or have you already forgotten the optimizations we put into 8.2 that assume, eg, no sub-transactions within a VACUUM? > 2. Have a reference counter on index tuple that's increased on insert > and decreased by vacuum. The "increase on insert" part I understand, the "decrease by vacuum" part seems to have the same problem as #1. How do you tell which index entries should be changed? > 3. Do nothing. Let index scans mark the index tuple as dead when it's > convenient. There's no correctness problem with just leaving dead index > tuples there, because you have to check the index quals on each heap > tuple anyway when you scan. And we're back to routine REINDEX I guess :-(. This doesn't seem like a satisfactory answer. regards, tom lane
> And we're back to routine REINDEX I guess :-(. This doesn't seem like a > satisfactory answer. If the reindex works online, it could be a satisfactory solution. Cheers, Csaba.
Tom Lane wrote: > Anything that involves having VACUUM re-evaluate index expressions is a > nonstarter ... or have you already forgotten the optimizations we put > into 8.2 that assume, eg, no sub-transactions within a VACUUM? Umm, I'm afraid I have. Could you give me a clue? >> 3. Do nothing. Let index scans mark the index tuple as dead when it's >> convenient. There's no correctness problem with just leaving dead index >> tuples there, because you have to check the index quals on each heap >> tuple anyway when you scan. > > And we're back to routine REINDEX I guess :-(. This doesn't seem like a > satisfactory answer. In general, it isn't. Though there are interesting use cases where it would be fine. For example, if you remove old data by dropping a partition, there's never really need to vacuum. Or if all of the data is accessed during normal operation, the index scans set the LP_DELETE flags and no additional vacuum is really needed. Also, now that we have concurrent CREATE INDEX, we could implement concurrent REINDEX as well, I believe. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Tom Lane wrote: > >Anything that involves having VACUUM re-evaluate index expressions is a > >nonstarter ... or have you already forgotten the optimizations we put > >into 8.2 that assume, eg, no sub-transactions within a VACUUM? > > Umm, I'm afraid I have. Could you give me a clue? The patch by Hannu Krossing that allows a lazy vacuum to ignore other lazy vacuums, when computing the OldestXmin. The point is you can only ignore other lazy vacuums if they are not going to visit the heap. Otherwise you'd risk removing a tuple that the other vacuum wants to see. You could here shout that surely no two vacuums could be trying to clean the same table at the same time, but it turns out that you can have functional indexes that scan other tables. Sure, it's a bad idea, but ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Heikki Linnakangas wrote: > Tom Lane wrote: >> Anything that involves having VACUUM re-evaluate index expressions is a >> nonstarter ... or have you already forgotten the optimizations we put >> into 8.2 that assume, eg, no sub-transactions within a VACUUM? > > Umm, I'm afraid I have. Could you give me a clue? I think I found it. Is this what you're talking about (in commands/vacuum.c): /* * During a lazy VACUUM we do not run any user-supplied functions, * and so it should be safe to notcreate a transaction snapshot. * * We can furthermore set the inVacuum flag, which lets other * concurrentVACUUMs know that they can ignore this one while * determining their OldestXmin. (The reason we don't setinVacuum * during a full VACUUM is exactly that we may have to run user- * defined functions for functionalindexes, and we want to make * sure that if they use the snapshot set above, any tuples it * requirescan't get removed from other tables. An index function * that depends on the contents of other tables isarguably broken, * but we won't break it here by violating transaction semantics.) * * Note: the inVacuumflag remains set until CommitTransaction or * AbortTransaction. We don't want to clear it until we reset * MyProc->xid/xmin, else OldestXmin might appear to go backwards, * which is probably Not Good. */ MyProc->inVacuum = true; -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Heikki Linnakangas wrote: >> Tom Lane wrote: >>> Anything that involves having VACUUM re-evaluate index expressions is a >>> nonstarter ... or have you already forgotten the optimizations we put >>> into 8.2 that assume, eg, no sub-transactions within a VACUUM? > I think I found it. Is this what you're talking about (in > commands/vacuum.c): That's part of it, but I seem to recall other things too --- in particular, the point about subtransactions troubles me. Whatever you think about an index function looking at other tables, it is perfectly legitimate to have an exception block in an index function, and that requires subtransactions (at least as plpgsql is now implemented). regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Also, now that we have concurrent CREATE INDEX, we could implement > concurrent REINDEX as well, I believe. That's probably more easily said than done --- in particular, I don't understand what the committed state after the first transaction would look like. CREATE INDEX can get away with it because nothing need be depending on the new index, but you can't say that for an existing index (esp. if it's UNIQUE). regards, tom lane
On Tue, Sep 26, 2006 at 11:16:54AM +0100, Heikki Linnakangas wrote: > To locate the actual matching items on the heap page, we have to scan > the heap page because we don't have the item ids, so this is a tradeoff > between CPU and I/O. However, we could have a hybrid where we initially > store heap tuple pointers like we do now, and when there's more than X > consecutive pointers to the same page, we collapse them to just one > pointer to the whole page. This would potentially give us the best of > both worlds. > > This design is more flexible and less invasive than true > index-organized-tables, because it doesn't require perfect ordering of > the heap or moving heap tuples around. You can also define than one > Block B-Tree on a table, though you wouldn't get much benefit from using > Block B-Tree instead of normal B-Tree if there's no correlation between > the index order and the heap order. No, but I think there's scenarios where you may not have extremely high correlation but you'd still benefit, especially with the hybrid approach. If you have a field with rather skewed histogram, for example, where you're likely to have a lot of tuples with one value on any given page. Of course, you probably would want to exclude that value from the index entirely if it's on *every* page, but anything where you see paterns of data wouldn't be like that. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Sep 26, 2006 at 05:27:56PM +0100, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > >Tom Lane wrote: > >>Anything that involves having VACUUM re-evaluate index expressions is a > >>nonstarter ... or have you already forgotten the optimizations we put > >>into 8.2 that assume, eg, no sub-transactions within a VACUUM? > > > >Umm, I'm afraid I have. Could you give me a clue? > I think I found it. Is this what you're talking about (in > commands/vacuum.c): > > /* > * During a lazy VACUUM we do not run any user-supplied functions, > * and so it should be safe to not create a transaction snapshot. > * > * We can furthermore set the inVacuum flag, which lets other > * concurrent VACUUMs know that they can ignore this one while > * determining their OldestXmin. (The reason we don't set inVacuum > * during a full VACUUM is exactly that we may have to run user- > * defined functions for functional indexes, and we want to make > * sure that if they use the snapshot set above, any tuples it > * requires can't get removed from other tables. An index function > * that depends on the contents of other tables is arguably broken, > * but we won't break it here by violating transaction semantics.) > * > * Note: the inVacuum flag remains set until CommitTransaction or > * AbortTransaction. We don't want to clear it until we reset > * MyProc->xid/xmin, else OldestXmin might appear to go backwards, > * which is probably Not Good. > */ > MyProc->inVacuum = true; Do I understand that to mean that you can no longer lazy vacuum a functional index? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Tue, Sep 26, 2006 at 08:51:10AM -0400, Tom Lane wrote: > > 3. Do nothing. Let index scans mark the index tuple as dead when it's > > convenient. There's no correctness problem with just leaving dead index > > tuples there, because you have to check the index quals on each heap > > tuple anyway when you scan. > > And we're back to routine REINDEX I guess :-(. This doesn't seem like a > satisfactory answer. Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do that anyway right now? Granted, you'd want to periodically ensure that you scan the entire index, but that shouldn't be horribly hard to set up. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > Do I understand that to mean that you can no longer lazy vacuum a > functional index? With the normal B-trees we have now, there's no problem vacuuming a functional index. But it would be a problem with the Block B-tree, because the proposed way of vacuuming a Block B-tree involves re-evaluating index expressions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Jim C. Nasby wrote: > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do > that anyway right now? You mean _index_ tuples marked dead? Sure, no problem there. > Granted, you'd want to periodically ensure that you scan the entire > index, but that shouldn't be horribly hard to set up. Well, it seems to be. A vacuum can't evaluate index expressions because it's not in a real transaction. The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" etc. with enable_seqscan=false? That would work, but we can't depend on an additional administrative task like. And we might as well just disable the optimization that's causing us problems. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Jim C. Nasby wrote: > > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do > > that anyway right now? > > You mean _index_ tuples marked dead? Sure, no problem there. > > > Granted, you'd want to periodically ensure that you scan the entire > > index, but that shouldn't be horribly hard to set up. > > Well, it seems to be. A vacuum can't evaluate index expressions because > it's not in a real transaction. > > The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" > etc. with enable_seqscan=false? That would work, but we can't depend on > an additional administrative task like. And we might as well just > disable the optimization that's causing us problems. Why can't the C code just do a full index scan that touches the heap, sets those expired bits, and then do a vacuum? My point is that the bits can be set outside the normal vacuum process, so you don't have to be doing heap lookups from the index inside vacuum. Assuming the heap is mostly in index order, the full index scan shouldn't take much longer than a heap scan, and if the heap doesn't match index order, a block index shouldn't be used anyway. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Heikki Linnakangas wrote: >> Jim C. Nasby wrote: >>> Granted, you'd want to periodically ensure that you scan the entire >>> index, but that shouldn't be horribly hard to set up. >> Well, it seems to be. A vacuum can't evaluate index expressions because >> it's not in a real transaction. >> >> The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" >> etc. with enable_seqscan=false? That would work, but we can't depend on >> an additional administrative task like. And we might as well just >> disable the optimization that's causing us problems. > > Why can't the C code just do a full index scan that touches the heap, > sets those expired bits, and then do a vacuum? My point is that the > bits can be set outside the normal vacuum process, so you don't have to > be doing heap lookups from the index inside vacuum. The point of the optimization that's causing problems was to reduce the effect of long-running vacuum transactions. If we're going to have another long running transaction instead, we're back to square one. AFAICS, we could disable the optimization and use a full-blown transaction when vacuuming a table with a functional block index. Granted, that's annoying, but not a show-stopper I think. > Assuming the heap is mostly in index order, the full index scan > shouldn't take much longer than a heap scan, and if the heap doesn't > match index order, a block index shouldn't be used anyway. It introduces one more full heap scan for each block index on a table. That's expensive. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > AFAICS, we could disable the optimization and use a full-blown > transaction when vacuuming a table with a functional block index. No, we couldn't, or at least it's not merely a matter of reversing a recent optimization. The fundamental issue with all these proposals is the assumption that you can re-compute the index entries at all. VACUUM has never, ever, depended on the assumption that it can re-evaluate index entries and get the same answers as the original insertion did. Now obviously it should theoretically be able to do that, in a theoretical bug-free world, but given that we allow user-defined functions in index expressions that is a very hazardous assumption: you might get a different answer. Or an error. The current VACUUM procedure is able to clean up index entries correctly without any recalculation of the index values, just matching of TIDs. I think we'd be taking a serious robustness hit if we abandon that property. This is basically the same objection that I have to the occasional proposals for "retail VACUUM". BTW, it's not merely a problem for functional indexes: the datatype-specific functions invoked while searching an index are also hazards. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Also, now that we have concurrent CREATE INDEX, we could implement >> concurrent REINDEX as well, I believe. > > That's probably more easily said than done --- in particular, I don't > understand what the committed state after the first transaction would > look like. CREATE INDEX can get away with it because nothing need be > depending on the new index, but you can't say that for an existing index > (esp. if it's UNIQUE). I think you build a whole new index named something like ".temp-reindex" and then as the last step of the second transaction delete the old idnex and rename the new index. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> That's probably more easily said than done --- in particular, I don't >> understand what the committed state after the first transaction would >> look like. > I think you build a whole new index named something like ".temp-reindex" and > then as the last step of the second transaction delete the old idnex and > rename the new index. That would require getting exclusive lock on the table. regards, tom lane
> > I think you build a whole new index named something like ".temp-reindex" and > > then as the last step of the second transaction delete the old idnex and > > rename the new index. > > That would require getting exclusive lock on the table. Just out of curiosity, creating a new index concurrently (or online, whatever you call it) doesn't require to set an exclusive lock on the table ? I thought it would, at least swiftly at the end of the operation, after all it's modifying the table... Cheers, Csaba.
On Wed, Sep 27, 2006 at 05:38:38AM -0400, Bruce Momjian wrote: > Heikki Linnakangas wrote: > > Jim C. Nasby wrote: > > > Couldn't vacuum just eliminate tuples marked dead? Heck, don't we do > > > that anyway right now? > > > > You mean _index_ tuples marked dead? Sure, no problem there. > > > > > Granted, you'd want to periodically ensure that you scan the entire > > > index, but that shouldn't be horribly hard to set up. > > > > Well, it seems to be. A vacuum can't evaluate index expressions because > > it's not in a real transaction. > > > > The DBA could set up a cron job to do "SELECT * FROM foo WHERE bar > 0" > > etc. with enable_seqscan=false? That would work, but we can't depend on > > an additional administrative task like. And we might as well just > > disable the optimization that's causing us problems. > > Why can't the C code just do a full index scan that touches the heap, > sets those expired bits, and then do a vacuum? My point is that the > bits can be set outside the normal vacuum process, so you don't have to > be doing heap lookups from the index inside vacuum. > > Assuming the heap is mostly in index order, the full index scan > shouldn't take much longer than a heap scan, and if the heap doesn't > match index order, a block index shouldn't be used anyway. Well, my thought was to have a backend process that would periodically scan a small section of the index, so that you wouldn't have a long-running transaction. That could then be followed by a vacuum of that same section of the index, which would nuke the dead tuple entries. Though, maybe we wouldn't even need the vacuum step since 8.2 will now reclaim tuples marked as dead? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> AFAICS, we could disable the optimization and use a full-blown >> transaction when vacuuming a table with a functional block index. > > No, we couldn't, or at least it's not merely a matter of reversing a > recent optimization. > > The fundamental issue with all these proposals is the assumption that > you can re-compute the index entries at all. VACUUM has never, ever, > depended on the assumption that it can re-evaluate index entries and > get the same answers as the original insertion did. Now obviously > it should theoretically be able to do that, in a theoretical bug-free > world, but given that we allow user-defined functions in index > expressions that is a very hazardous assumption: you might get a > different answer. Or an error. The current VACUUM procedure is able > to clean up index entries correctly without any recalculation of the > index values, just matching of TIDs. I think we'd be taking a serious > robustness hit if we abandon that property. I'm not worried about getting different results. If a used-defined function behaves badly, you're queries are screwed anyway. But throwing an error would be bad, because that would abort the whole vacuum. If we want to keep the property that VACUUM doesn't re-evaluate index entries, any proposal that doesn't keep track of every heap tuple isn't going to work. I'll try to modify the design I had in mind so that it does keep track of every heap tuple in some form. > This is basically the same objection that I have to the occasional > proposals for "retail VACUUM". Yeah. :-( > BTW, it's not merely a problem for functional indexes: the > datatype-specific functions invoked while searching an index are also > hazards. Good point. I didn't realize that before. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > If we want to keep the property that VACUUM doesn't re-evaluate index > entries, any proposal that doesn't keep track of every heap tuple > isn't going to work. I'll try to modify the design I had in mind so > that it does keep track of every heap tuple in some form. After some thought: Imagine a normal B-tree just like what we have now. But when there is more than one tuple on the same heap page with consecutive index keys, we represent all of them in a single index tuple that contains the key of the first one of them, and a (run-length encoded) bitmap of the OffsetNumbers. We should get most of the space and I/O savings as with the original proposal, but we can vacuum it without re-evaluating index expressions. It does change the format of an index tuple, unlike the original proposal. That adds some complexity. but it's doable. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote: > After some thought: > > Imagine a normal B-tree just like what we have now. But when there is > more than one tuple on the same heap page with consecutive index keys, > we represent all of them in a single index tuple that contains the key > of the first one of them, and a (run-length encoded) bitmap of the > OffsetNumbers. We should get most of the space and I/O savings as with > the original proposal, but we can vacuum it without re-evaluating index > expressions. I think something like this has been discussed before. Where an index tuple currently has the key values followed by a ctid, simply change that so that it can be a list of ctid's, in order. This works on having the same key, but doesn't care if the tuples are on the same page. It used to be not possible because of how to handle forward and backward scanning within an index entry during updates. I think with the new "page at a time" index scanning, this got a lot easier. One issue is that a single index page could hold more than 1000 index entries, which might cause problems for callers. > It does change the format of an index tuple, unlike the original > proposal. That adds some complexity. but it's doable. This way doesn't change the current index format much. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Fri, Sep 29, 2006 at 10:51:32AM +0100, Heikki Linnakangas wrote: >> After some thought: >> >> Imagine a normal B-tree just like what we have now. But when there is >> more than one tuple on the same heap page with consecutive index keys, >> we represent all of them in a single index tuple that contains the key >> of the first one of them, and a (run-length encoded) bitmap of the >> OffsetNumbers. We should get most of the space and I/O savings as with >> the original proposal, but we can vacuum it without re-evaluating index >> expressions. > > I think something like this has been discussed before. Where an index > tuple currently has the key values followed by a ctid, simply change > that so that it can be a list of ctid's, in order. Actually it's t_tid followed by t_info (which is size + flags) followed by key values. > This works on having the same key, but doesn't care if the tuples are > on the same page. It used to be not possible because of how to handle > forward and backward scanning within an index entry during updates. I > think with the new "page at a time" index scanning, this got a lot > easier. I'm not very interested in the case where you have a lot of equal keys, I think the bitmap index am is more suitable for that. The Block B-tree could be used whenever you have a clustered table (including unique indexes). Some DBMSs have index-organized-tables for the same use case. When I tested a prototype of the original idea with TPC-C (DBT-2) data, a block index on the order_line table was under 2% of the size of a normal B-tree. That's very close to a best-case scenario; narrow rows and a completely clustered table. I'm aiming at that order of magnitude reduction in storage size. > One issue is that a single index page could hold more than 1000 index > entries, which might cause problems for callers. I don't think that's a problem. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > If we want to keep the property that VACUUM doesn't re-evaluate index > > entries, any proposal that doesn't keep track of every heap tuple > > isn't going to work. I'll try to modify the design I had in mind so > > that it does keep track of every heap tuple in some form. The ideal situation is that we have one index pointer per block, so we should look for that and optimize accordingly. We mark the heap block to indicate how many block index pointers there are to it. If we have only a single pointer, then VACUUM will only have to touch the index pointer when the whole heap block is removed. In that case we have no re-evaluation of the index AFAICS. > After some thought: > > Imagine a normal B-tree just like what we have now. But when there is > more than one tuple on the same heap page with consecutive index keys, > we represent all of them in a single index tuple that contains the key > of the first one of them, and a (run-length encoded) bitmap of the > OffsetNumbers. We should get most of the space and I/O savings as with > the original proposal, but we can vacuum it without re-evaluating index > expressions. The benefit we're seeking with a block index is that most INSERTs don't write to the index. With that scheme we'd need to continually update the index tuple so that it exactly represented the heap after each inserted tuple, which is going to cause a hot block problem. Much of that can go away if we have a bulk_heap_insert() which puts the index entries in at the end of the block, though that needs some heavy thought also. Can we have this scheme enabled *only* for functional block indexes? Normal case is likely to be monotonically ascending keys for real world objects like Orders, Calls, Transactions etc.. It sounds like the original proposal is still valid for those cases. The bitmap would allow us to access heap rows faster in some circumstances, I suppose. Multi-block bitmaps do make this too much like bitmap indexes and the use-case is very different. [Is there some kind of hybrid solution of block & bitmap indexes?] > It does change the format of an index tuple, unlike the original > proposal. That adds some complexity. but it's doable. Can we use an info bit to have two index tuple formats? - single tuple (as now) - multiple tuple block bitmap (as you suggest) -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Csaba Nagy <nagy@ecircle-ag.com> writes: > > > I think you build a whole new index named something like ".temp-reindex" and > > > then as the last step of the second transaction delete the old idnex and > > > rename the new index. > > > > That would require getting exclusive lock on the table. > > Just out of curiosity, creating a new index concurrently (or online, > whatever you call it) doesn't require to set an exclusive lock on the > table ? I thought it would, at least swiftly at the end of the > operation, after all it's modifying the table... Nope. As I understand it the step that fundamentally requires a table lock is actually dropping the old index. You have to be sure nobody is actually using it before you do anything that causes people to stop maintaining it. We could do something like how the online index build creates the index but in reverse. We mark the index invalid and then wait out any transactions that could be using it. Then we can drop it safely. But I think even that has some practical problems. Transactions that have that index in their relcache structure for the table will try to maintain it and get confused if it's gone. It seems to me that taking a brief lock on the table at the end of the reindex isn't actually much of a problem. It only needs to be held briefly and it can be done in a separate transaction so there isn't a deadlock risk. -- greg
Simon Riggs wrote: > On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote: >> Heikki Linnakangas wrote: >>> If we want to keep the property that VACUUM doesn't re-evaluate index >>> entries, any proposal that doesn't keep track of every heap tuple >>> isn't going to work. I'll try to modify the design I had in mind so >>> that it does keep track of every heap tuple in some form. > > The ideal situation is that we have one index pointer per block, so we > should look for that and optimize accordingly. We mark the heap block to > indicate how many block index pointers there are to it. If we have only > a single pointer, then VACUUM will only have to touch the index pointer > when the whole heap block is removed. In that case we have no > re-evaluation of the index AFAICS. I don't see how that would work. It sounds similar to the reference counting option I proposed, which had the same re-evaluation problem. And in addition, it requires adding index-specific information to the heap page format, which troubles me from a modularization viewpoint. >> After some thought: >> >> Imagine a normal B-tree just like what we have now. But when there is >> more than one tuple on the same heap page with consecutive index keys, >> we represent all of them in a single index tuple that contains the key >> of the first one of them, and a (run-length encoded) bitmap of the >> OffsetNumbers. We should get most of the space and I/O savings as with >> the original proposal, but we can vacuum it without re-evaluating index >> expressions. > > The benefit we're seeking with a block index is that most INSERTs don't > write to the index. With that scheme we'd need to continually update the > index tuple so that it exactly represented the heap after each inserted > tuple, which is going to cause a hot block problem. That's just one of the benefits. I think the main benefit is dramatic reduction in index size which means that more of the index is cached. An INSERT will have to find the corresponding leaf page anyway. Having to dirty it isn't a big deal assuming that the hot blocks stay in cache. The hot block problem worries me a bit too. Any indexing scheme that packs more items on a block is going to suffer from that. Testing will show if that becomes a problem. > Can we have this scheme enabled *only* for functional block indexes? No. As Tom pointed out, data type specific functions have potentially the same problem. And having both versions seems like a lot of code and complexity. > The bitmap would allow us to access heap rows faster in some > circumstances, I suppose. Yes, you wouldn't have to re-evaluate index quals on every tuple, when the whole range represented by the index tuple falls within the range you're searching for. And when there's only few tuples with consecutive keys on a heap page (which is not a good use case for block B-trees), you don't need to scan the whole page to find those matches. > Multi-block bitmaps do make this too much like bitmap indexes and the > use-case is very different. [Is there some kind of hybrid solution of > block & bitmap indexes?] Not that I know of, though there is different kind of bitmap indexes. The one that didn't make it to 8.2 uses equality encoding, where you have a bitmap for every distinct value. You can also have range-encoding, where you have a bitmap for ranges of values, for example one bitmap for 1-10, another for 10-15 etc. If you choose the ranges dynamically so that you have one range for each heap page (when it's clustered), you get something similar to the proposed Block B-tree. The current bitmap encoding scheme is optimized for large bitmaps, though, so the performance wouldn't be as good. >> It does change the format of an index tuple, unlike the original >> proposal. That adds some complexity. but it's doable. > > Can we use an info bit to have two index tuple formats? > - single tuple (as now) > - multiple tuple block bitmap (as you suggest) Yes. There's one bit free in the index tuple header. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Imagine a normal B-tree just like what we have now. But when there is > more than one tuple on the same heap page with consecutive index keys, > we represent all of them in a single index tuple that contains the key > of the first one of them, and a (run-length encoded) bitmap of the > OffsetNumbers. At first I thought that was a typo, and instead of "consecutive" you meant to write "equal". I gather from the later statement > I'm not very interested in the case where you have a lot of equal keys, > I think the bitmap index am is more suitable for that. that indeed you meant to write "consecutive", and I've got a problem with that: define "consecutive". In a datatype independent fashion, please. I also wonder how you are going to implement splitting and merging of runs, which will certainly be necessary if this isn't to be a constantly-requires-REINDEX thing. regards, tom lane
On Fri, 2006-09-29 at 14:54 +0100, Heikki Linnakangas wrote: > > The benefit we're seeking with a block index is that most INSERTs don't > > write to the index. With that scheme we'd need to continually update the > > index tuple so that it exactly represented the heap after each inserted > > tuple, which is going to cause a hot block problem. > > That's just one of the benefits. I think the main benefit is dramatic > reduction in index size which means that more of the index is cached. > > An INSERT will have to find the corresponding leaf page anyway. Having > to dirty it isn't a big deal assuming that the hot blocks stay in cache. The index tuple would potentially grow in length while we update it, so that means we'd need exclusive access to write, rather than shared access to just read the index. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> I'm not very interested in the case where you have a lot of equal keys, >> I think the bitmap index am is more suitable for that. >> > > that indeed you meant to write "consecutive", and I've got a problem > with that: define "consecutive". In a datatype independent fashion, > please. I also wonder how you are going to implement splitting and > merging of runs, which will certainly be necessary if this isn't to be > a constantly-requires-REINDEX thing. > I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in "A and B are consecutive in the index, if there's no value X in the index so that A < X < B". Maybe there's a better word for that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Friday 29 September 2006 10:55, Heikki Linnakangas wrote: > Tom Lane wrote: > > Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> I'm not very interested in the case where you have a lot of equal keys, > >> I think the bitmap index am is more suitable for that. > > > > that indeed you meant to write "consecutive", and I've got a problem > > with that: define "consecutive". In a datatype independent fashion, > > please. I also wonder how you are going to implement splitting and > > merging of runs, which will certainly be necessary if this isn't to be > > a constantly-requires-REINDEX thing. > > I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in > "A and B are consecutive in the index, if there's no value X in the > index so that A < X < B". Maybe there's a better word for that. http://en.wikipedia.org/wiki/Monotonic jan -- -------------------------------------------------------------- Jan de Visser jdevisser@digitalfairway.com Baruk Khazad! Khazad ai-menu! --------------------------------------------------------------
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> define "consecutive". > I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in > "A and B are consecutive in the index, if there's no value X in the > index so that A < X < B". Maybe there's a better word for that. Um, but how are you going to make that work without storing the keys for both A and B? You won't be able to tell whether an incoming key C that's just a bit bigger than A should go before or after B. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > >> I don't mean consecutive as in "1, 2, 3, 4, ... without gaps" but as in >> "A and B are consecutive in the index, if there's no value X in the >> index so that A < X < B". Maybe there's a better word for that. >> > > Um, but how are you going to make that work without storing the keys for > both A and B? You won't be able to tell whether an incoming key C > that's just a bit bigger than A should go before or after B. > Let me describe the insertion algorithm: 1. To insert a tuple with key X, we find the position in the index where the new tuple would go, just like with a normal B-tree. Let's call the index tuple right before the position A and the next tuple B. So according to normal B-tree rules, X should go between A and B. 2. If A points to the same heap page as X, we set the bit representing the offset of the new tuple in the index tuple A (this might require enlarging the index tuple and event splitting the page), and we're done. If it points to a different page, we need split the range A-B to A-X-B, proceed to step 3. 3. To split the range A-B, scan the heap page to see which of the tuples pointed to by A are >= X and which are < X . If there's no tuples >= X, insert a new index tuple for X, and we're done. Otherwise, let Y be the smallest tuple >= X. Insert a new index tuple for Y, containing all the offsets with keys >= X, and remove those offsets from A. We have now split A-B to A-Y-B so that A < X < Y < B. 4. Insert the new index tuple for X. (I'm not sure I got the above description correct for cases with equal keys.) Note that we don't keep track of the ordering of tuples that are clumped into a single index tuple. That's not important, I/O wise, because if you're going to fetch a heap page into memory, you might as well scan all the tuples on it and sort them if necessary. That's where the space and I/O savings comes from. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com