Thread: Brain dump: btree collapsing
I've been thinking hard for the last few days about how to do space reclamation in b-tree indexes, i.e., recycle pages that are in no-longer-useful portions of the tree structure. We know we need this to solve the "index bloat" problem that occurs when the distribution of keys changes over time. I feel that it's critical that the reclamation be doable by plain VACUUM, ie, without acquiring exclusive lock on the index as a whole. This discussion therefore assumes that page deletion must be able to operate in parallel with insertions and searches.

Issues
------

We need to get rid of parent links in btree pages; otherwise removal of a non-leaf page implies we must find and update all parent links that lead to it. This is messy enough that it would be better to do without. The only thing the parent link is really needed for is to find the new parent level after a root split, and we can handle that (very infrequent) case by re-descending from the new root.

Instead of storing parent links, label all pages with level (counting levels up from zero = leaf, so that a page's level doesn't change in a root split). Then, if we find ourselves needing to re-descend, we can be sure of finding the correct parent level, one above where we were, even if there's been multiple root splits meanwhile. The level will also make for a useful cross-check on link validity: we will always know the level of the page we expect to arrive at when following a link, so we can check that the page has the right level.

Unfortunately this means tossing out most of the FixBtree code Vadim wrote 2 years ago, because it seems critically dependent on having parent links. But I don't really see why we need it if we rely on WAL to maintain btree consistency. That will require some improvements in WAL-logging for btrees, however. (Details below.)
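As a toy illustration of the level cross-check, here is a Python sketch (the Page structure and helper names are hypothetical, nothing like the actual on-disk layout):

```python
# Toy sketch: each page carries a "level" field, counting up from zero at
# the leaf, so a link can be validated on arrival.  Levels never change,
# even across root splits.

class Page:
    def __init__(self, level, items=None):
        self.level = level            # 0 = leaf; fixed for the page's lifetime
        self.items = items or []      # (key, downlink) pairs for upper pages

def follow_link(page, expected_level):
    """Follow a link and cross-check the level field on arrival."""
    if page.level != expected_level:
        raise RuntimeError(
            f"corrupt link: expected level {expected_level}, found {page.level}")
    return page

def find_parent_level(root, child_level):
    """Re-descend from the (possibly new) root to the level just above
    child_level.  Works no matter how many root splits happened meanwhile,
    because the parent of a level-n page is always found at level n+1."""
    page = root
    while page.level > child_level + 1:
        page = follow_link(page.items[0][1], page.level - 1)  # leftmost downlink
    return page
```

The point is simply that a link can be validated on arrival, and that after a root split the correct parent level is recoverable without stored parent pointers.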
When a page is deleted, it can't actually be recycled until there are no more potentially in-flight references to it (ie, readers that intend to visit the page but have not yet acquired a lock on it). Those readers must be able to find the page, realize it's dead, and follow the correct sidelink from it. [Lanin&Shasha86] describe the "drain technique", which they define as "delay freeing the empty page until the termination of all processes whose locate phase began when pointers to the page still existed". We can conveniently implement this by reference to transactions: after removing all links to the page, label the now-dead page with the current contents of the next-transaction-ID counter. VACUUM can recycle the page when this is older than the oldest open transaction.

Instead of an actively maintained freelist on disk as per Alvaro Herrera's patch, I plan to use the FSM to remember where recyclable pages are, much as we do for tables. The FSM space requirements would be small, since we'd not be needing to enter any data about partially-full pages; only truly empty, recyclable pages would need to be stored. (Is it worth having an alternate representation in the FSM for indexes, so that we only store page numbers and not the useless amount-free statistic?) Without a freelist on disk, VACUUM would need to scan indexes linearly to find dead pages, but that seems okay; I'm thinking of doing that anyway to look for empty pages to turn into dead ones.

Restructuring the tree during page deletion
-------------------------------------------

We will delete only completely-empty pages. If we were to merge nearly-empty pages by moving data items from one page to an adjacent one, this would imply changing the parent's idea of the bounding key between them --- which is okay if we are just deleting an internal key in the parent, but what if the pages have different parent pages? We'd have to adjust the parents' own bounding key, meaning the parents' parent changes, perhaps all the way up to the root. (Not to mention that with variable-size keys, there's no guarantee we can make such changes without splitting the upper-level pages.) And, since we support both forward and backward index scans, we can't move leaf items in either direction without risking having a concurrent scan miss them. This is way too messy, especially for something that has only minimal return according to the literature [Johnson89]. So, no merging.

Deletion of an empty page only requires removal of the parent's item linking to it (plus fixing side pointers, which is pretty trivial). We also remove the next higher key in the parent, which is the parent's upper bound for data that would have belonged on the target page. Therefore, the page's right sibling becomes responsible for storing the key range that used to belong on the deleted page.

What if there is no next-higher key, you ask? Well, I'm going to punt. It is critical that the key space associated with a parent page match the key space associated with its children (eg, the high key of the rightmost child must match the parent's high key). There is no way to atomically modify a parent's key space --- this would mean simultaneously changing keys in upper levels, perhaps all the way up to the root. To avoid needing to do that, we must put a couple of restrictions on deletion:

1. The rightmost page in any tree level is never deleted, period. (This rule also simplifies traversal algorithms, as we'll see below.)

2. The rightmost child of a non-rightmost parent page can't be deleted, either, unless it is the last child of that parent. If it is the last child then the parent is *immediately* marked as half-dead, meaning it can never acquire any new children; its key space implicitly transfers to its right sibling. (The parent can then be deleted in a later, separate atomic action.
Note that if the parent is itself a rightmost child of a non-rightmost parent, it will have to stay in the half-dead state until it becomes the only child; then it can be deleted and its parent becomes half-dead.) (Note that while leaf pages can be empty and still alive, upper pages can't: they must have children to delegate their key range to.)

With these restrictions, there is no need to alter the "high key" fields of either the parent or the siblings of a page being deleted. The key space of the page itself transfers to its right sibling, and the key space of the parent does not change (except in the case where the parent loses all its key space to its right sibling and becomes half-dead).

Restriction 1 is not a significant one in terms of space wastage. Restriction 2 is more annoying, but it should not increase overhead drastically. The parent would not be deleted or merged anyway as long as it has other children, so the worst-case overhead from this restriction is one extra page per parent page --- and parent levels are normally much smaller than child levels.

The notion of a half-dead page means that the key space relationship between the half-dead page's level and its parent's level may be a little out of whack: key space that appears to belong to the half-dead page's parent on the parent level may really belong to its right sibling. We can tolerate this, however, because insertions and deletions on upper tree levels are always done by reference to child page numbers, not keys. The only cost is that searches may sometimes descend to the half-dead page and then have to move right, rather than going directly to the sibling page.

Page deletion procedure
-----------------------

Assume we know the target page, but hold no locks. The locks must be gotten in this order to avoid deadlock against inserters:

1. Obtain W lock on left sibling of target page (this may require searching right if the left sibling splits concurrently; same as for the backwards scan case). Skip this if target is leftmost.

2. Obtain super-exclusive lock on target page; check it is still empty, else abandon deletion.

3. Obtain W lock on right sibling (there must be one, else we can't delete).

4. Find and W-lock the target's current parent page. Check that target is not the rightmost child of a parent with other children, else abandon deletion.

5. Update all four pages at once, write a single WAL record describing all updates. (I believe we can extend WAL to allow 4 page references in a single record; if not, it may be okay to cheat and not store all of the adjacent pages. We certainly need not record the contents of the target page itself, so 3 is enough.) Note the parent is marked half-dead here if this was its last child. Release locks.

6. The update to the target page makes it empty and marked dead, but preserves its sibling links. It can't actually be recycled until later (see above).

The deletion procedure could be triggered immediately upon removal of the last item in a page, or when the next VACUUM scan finds an empty page. Not sure yet which way is better. Having to exclusive-lock four pages is annoying, but I think we must do this procedure as a single atomic operation to ensure consistency.

Search/scan rules in presence of deletion
-----------------------------------------

Since page deletion takes a super-exclusive lock, a stopped scan will never find the page it's on deleted when it resumes. What we do have to worry about is arriving at a dead page after following a link. There are several cases:

During initial search for a key, if we arrive at a dead or half-dead page, chain right. (The key space must have moved to the right.)

During a forward scan: likewise chain right. (Note: because scans operate only on the leaf level, they should never see half-dead pages.)

During a backwards scan: chain right until a live page is found, and step left from there in the usual way.
We cannot use the dead page's left-link because its left neighbor may have split since the page was deleted. (There is certain to be at least one live page to the right, since we never delete the rightmost page of a level.) Also, "step left in the usual way" used to mean this:

A backwards scan has one additional bit of complexity: after following the left-link we must account for the possibility that the left sibling page got split before we could read it. So, we have to move right until we find a page whose right-link matches the page we came from.

But if the "page we came from" is now dead, perhaps there is no page with a matching right-link (if an updater got to it before we did)! Probably the best policy, if we fail to find a match, is to return to the page we were previously on, verify that it's now dead (else error), then chain right and left as in the case where we've linked to a dead page (see above). This means a backwards scan must always remember the last live page it was on. Again, this is greatly simplified by the fact that there must be a live page somewhere to the right.

Insertion changes
-----------------

The basic insertion algorithm doesn't change (but see the above notes about the search part).

bt_getstackbuf is a little trickier in the presence of deletion. The first issue is that removal of an earlier downlink in the returned-to parent page may cause the target item to move left by some number of slots (though not into a previous page). We can deal with this by searching left after we fail to find the item by searching right, before we move on to the next page. Next is that the whole parent page may be dead or half-dead --- but it can't be physically deleted, so we can recover by following its rightlink. The nasty part is that the target item could perhaps not be there; this would imply that someone deleted the page we are trying to split, or hasn't fully finished inserting it. But L&Y already require holding a lock on that page until we have re-located and locked the parent item, so this is not possible. (Note this implies that we must search for exactly the downlink to the page we are splitting, but that's true anyway.)

If we need to split on a level that was the root when we descended, but is no longer the root, then our stack doesn't tell us how to find the next level up. As discussed above, we handle this case by re-descending and checking level fields, rather than depending on parent links as before. We will simply descend on the left edge of the tree and scan the whole parent level to find where to insert --- it's highly unlikely that the parent level has yet grown large enough for this to be slow, so there's not much value in trying to use a keyed search.

WAL issues
----------

A page split operation can't really be atomic. We can handle the actual split ("half-split") operation as an atomic update:

1. We have W lock on the page that needs to be split.

2. Find a free page (from FSM, or make one by extending the rel), W-lock it.

3. W-lock the page's right sibling (if any), so that we can fix its left-link.

4. Update all three pages, make a WAL log entry describing same, write the pages.

5. We can now release the W-lock on the right sibling page, but we must keep the W-locks on the two split pages until we have made the parent entry; else it's conceivable for someone else to try to split or delete these pages and get confused because the parent link isn't in place yet. (See L&Y.)

(Note that since this is driven by an insert, the half-split WAL log entry may as well include the new key insertion step; so this entry substitutes for the plain insertion WAL entry we'd otherwise make.) (So far this is the same as the existing code.)

But now we must find the parent page and enter the new key and downlink (possibly causing a split at that level). If a crash occurs before we can do so, how to recover?
The existing code has a lot of ad-hoc logic that tries to reconstruct the tree on-the-fly when inconsistency is detected. But it would be a lot better if we could expect WAL playback to fix things. The WAL entry describing the half-split includes enough info to re-execute the parent key insertion. While replaying the log, we can remember this insertion as pending. If we see a matching insertion event in the log, discard the remembered pending insertion. At end of log, execute all still-pending insertions. (There should not be very many pending insertions at any one time, so the resources for this are not onerous.) Note however that this implies making more log entries to describe these additional updates; it might complicate WAL, but I see no fundamental difficulty.

Also, if we are splitting a root page, it seems best to merge creation of the new root page and updating of the metapage into the same atomic action that does the split. This won't make the WAL entry materially larger, and it will avoid problems with concurrent root splits. Need to look at deadlock issues for trying to update the metapage --- is it okay to grab a W lock on the meta page while holding it on the old root? (Sure, why not? Nothing will go the other way.) We do need to update the metapage as part of the atomic root-split operation, else other updaters trying to find the new root may fail because the link isn't there yet.

VACUUM mechanics
----------------

We will institute an explicit "vacuuming an index" phase in VACUUM (this is distinct from deleting index entries during vacuuming of a table, and will be done after we've finished vacuuming the associated table). In this phase, we can scan the index linearly (ie, in storage order, not index order) to take advantage of sequential I/O optimizations. We scan looking for pages that are empty (if leaf pages) or half-dead (if parent pages), as well as pages that are already dead and have been so long enough to be recycled. The latter are simply reported to the FSM. Empty and half-dead pages are deleted per the previously-described mechanism. (It seems best to gather the page IDs of target pages in memory, and do the deletions only after we've finished the seqscan, so as not to confuse the kernel about whether it should read ahead.)

One step that's not explicit above is how to find the parent of a page we intend to delete. It seems most efficient to perform a search for the page's high key (it must have one, since we don't delete rightmost pages), stopping when we reach the level above the page itself. This will give us a good pointer to the parent. We should do this before starting to acquire exclusive locks for the deletion.

In theory, if we find recyclable page(s) at the physical end of the index, we could truncate the file (ie, give the space back to the filesystem) instead of reporting these pages to FSM. I am not sure if this is worth doing --- in most cases it's likely that little space can be released this way, and there may be some tricky locking issues.

Tree depth
----------

Because we never relabel pages' levels, the tree depth cannot be reduced (we'd have to do that by removing the current root page, which seems impractical without an exclusive lock on the whole index). So after massive shrinkage we could end up with a "thin" tree in which there are levels below the root with only one page apiece. The space wastage doesn't seem like a big issue, but the extra time to traverse these useless levels could be annoying. This could be ignored in a first implementation (there's always REINDEX). Later, possibly handle it via Lanin&Shasha's notion of a critic (think VACUUM) that sets a fast pointer to the current effective root level. (Actually I think we wouldn't need a separate critic process; split and delete steps could be programmed to update the fast pointer for themselves, in a separate atomic action, when they split a one-page level or delete the next-to-last page of a level.)
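For what the fast-pointer idea might look like, a toy Python sketch (Meta, Page, and update_fast_root are all hypothetical names --- PostgreSQL's metapage looks nothing like this):

```python
# Toy sketch of Lanin & Shasha's "fast root": keep a pointer to the
# topmost page whose child level has more than one page, and start
# searches there, skipping useless single-page upper levels.

class Page:
    def __init__(self, level, downlinks=()):
        self.level = level
        self.downlinks = list(downlinks)

class Meta:
    def __init__(self, true_root):
        self.true_root = true_root
        self.fast_root = true_root    # where searches actually begin

def descend_start(meta):
    """Searches begin at the fast root, not the true root."""
    return meta.fast_root

def update_fast_root(meta, pages_per_level):
    """Re-aim the fast pointer after a split or delete changes level widths
    (done as its own small atomic action in the proposal)."""
    page = meta.true_root
    # keep descending while the level below holds only a single page
    while page.level > 0 and pages_per_level[page.level - 1] == 1:
        page = page.downlinks[0]      # the lone child
    meta.fast_root = page
```

The true root never moves, so tree depth is unchanged on disk; only the starting point of searches improves.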
Any comments before I plunge into implementing this?

			regards, tom lane
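(As a toy model of the pending-insertion bookkeeping sketched under "WAL issues" above --- the record format here is hypothetical, not the actual WAL layout:)

```python
# Toy WAL replay: a half-split record carries enough info to re-execute the
# parent downlink insertion.  If the matching insertion shows up later in
# the log, the pending entry is discarded; whatever is still pending at end
# of log must be executed to finish incomplete splits.

def replay(log):
    pending = {}  # split_id -> parent-insertion description
    for rec in log:
        if rec["type"] == "half_split":
            pending[rec["split_id"]] = rec["parent_insert"]
        elif rec["type"] == "parent_insert":
            pending.pop(rec["split_id"], None)  # the insertion made it to disk
        # ... all other record types would be replayed normally ...
    return list(pending.values())  # still-pending insertions to execute now
```

Since a crash can interrupt at most a handful of concurrent splits, the pending set stays small, matching the resource claim in the proposal.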
Tom Lane wrote:
<snip>
> The deletion procedure could be triggered immediately upon removal of the
> last item in a page, or when the next VACUUM scan finds an empty page.
> Not sure yet which way is better.

Having it triggered immediately upon removal of the last item in a page would make for a more "self maintaining" system, wouldn't it? That sounds nice. :)

<snip>
> In theory, if we find recyclable page(s) at the physical end of the index,
> we could truncate the file (ie, give the space back to the filesystem)
> instead of reporting these pages to FSM. I am not sure if this is worth
> doing --- in most cases it's likely that little space can be released this
> way, and there may be some tricky locking issues.

Sounds like this would be beneficial for environments with high update/delete transaction volumes, perhaps on smaller amounts of live/valid data.

<snip>
> This could be ignored in first implementation (there's always REINDEX).
> Later, possibly handle it via Lanin&Shasha's notion of a critic (think
> VACUUM) that sets a fast pointer to the current effective root level.
> (Actually I think we wouldn't need a separate critic process; split and
> delete steps could be programmed to update the fast pointer for
> themselves, in a separate atomic action, when they split a one-page level
> or delete the next-to-last page of a level.)

This really sounds like good initial thoughts too. :-)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
Justin Clift <justin@postgresql.org> writes:
> Tom Lane wrote:
>> The deletion procedure could be triggered immediately upon removal of the
>> last item in a page, or when the next VACUUM scan finds an empty page.
>> Not sure yet which way is better.

> Having it triggered immediately upon removal of the last item in a page
> would make for a more "self maintaining" system wouldn't it? That
> sounds nice. :)

Maybe. This isn't about getting rid of VACUUM --- there's still a need for routine maintenance vacuums. So the question really comes down to whether it's more efficient to do it "in bulk" during routine maintenance sweeps, or "retail". I'm not sold yet, but am leaning to the bulk side.

>> In theory, if we find recyclable page(s) at the physical end of the index,
>> we could truncate the file (ie, give the space back to the filesystem)
>> instead of reporting these pages to FSM. I am not sure if this is worth
>> doing --- in most cases it's likely that little space can be released this
>> way, and there may be some tricky locking issues.

> Sounds like this would be beneficial for environments with high
> update/delete transaction volumes, perhaps on smaller amounts of
> live/valid data.

It would only really be worth doing if you made a substantial reduction in the number of rows in a table, and had no near-term intention of loading the table back up again. Seems like it might be sufficient to tell people to REINDEX if they want the index size to drop in that scenario. I will look at physically truncating the index during VACUUM, but I don't think it's worth getting real tense about...

			regards, tom lane
I think having VACUUM record free index pages just like free heap pages makes perfect sense, and is consistent.

This brings up one item it would be nice to address at the same time. It would be nice if VACUUM FULL were able to compress the actual index file and return unused space to the operating system. REINDEX does this, but I was thinking of something a little lighter that could be done automatically as part of VACUUM FULL. If we can do that, it would make for consistent behavior between vacuum and heap/index files.

---------------------------------------------------------------------------

Tom Lane wrote:
> Justin Clift <justin@postgresql.org> writes:
> > Tom Lane wrote:
> >> The deletion procedure could be triggered immediately upon removal of the
> >> last item in a page, or when the next VACUUM scan finds an empty page.
> >> Not sure yet which way is better.
>
> > Having it triggered immediately upon removal of the last item in a page
> > would make for a more "self maintaining" system wouldn't it? That
> > sounds nice. :)
>
> Maybe. This isn't about getting rid of VACUUM --- there's still a need
> for routine maintenance vacuums. So the question really comes down to
> whether it's more efficient to do it "in bulk" during routine
> maintenance sweeps, or "retail". I'm not sold yet, but am leaning to
> the bulk side.
>
> >> In theory, if we find recyclable page(s) at the physical end of the index,
> >> we could truncate the file (ie, give the space back to the filesystem)
> >> instead of reporting these pages to FSM. I am not sure if this is worth
> >> doing --- in most cases it's likely that little space can be released this
> >> way, and there may be some tricky locking issues.
>
> > Sounds like this would be beneficial for environments with high
> > update/delete transaction volumes, perhaps on smaller amounts of
> > live/valid data.
>
> It would only really be worth doing if you made a substantial reduction
> in the number of rows in a table, and had no near-term intention of
> loading the table back up again. Seems like it might be sufficient to
> tell people to REINDEX if they want the index size to drop in that
> scenario. I will look at physically truncating the index during VACUUM,
> but I don't think it's worth getting real tense about...
>
> 			regards, tom lane

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It would be nice if VACUUM FULL would be able to compress the actual
> index file and return unused space to the operating system. REINDEX
> does this, but I was thinking of something a little lighter that could
> be done automatically as part of VACUUM FULL.

But indexes tend to be very dependent on physical layout. You can't just shove stuff around without thinking about the consequences. Tables (heaps) are *much* more forgiving about that.

My feeling is that what we need to fix now is index bloat during normal operation. If you want the indexes to actually *shrink*, that's a job for REINDEX. Perhaps someday we can improve on that --- but let's not blur our focus on the immediate problem.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It would be nice if VACUUM FULL would be able to compress the actual
> > index file and return unused space to the operating system. REINDEX
> > does this, but I was thinking of something a little lighter that could
> > be done automatically as part of VACUUM FULL.
>
> But indexes tend to be very dependent on physical layout. You can't
> just shove stuff around without thinking about the consequences.
> Tables (heaps) are *much* more forgiving about that.
>
> My feeling is that what we need to fix now is index bloat during normal
> operation. If you want the indexes to actually *shrink*, that's a job
> for REINDEX. Perhaps someday we can improve on that --- but let's not
> blur our focus on the immediate problem.

My point is only that while we need VACUUM and VACUUM FULL to match all heap needs, we need a VACUUM FULL capability for indexes too. REINDEX may be that capability, but it would be nice if we could compress out some index space during VACUUM FULL.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Tom,

Sounds excellent. Index growth has been something that has always bothered me (not the disk space usage, but the slow searches).

I believe it's best to have pages marked dead at the time the last key contained in the page is deleted (you didn't discuss how efficient this is), because this will somehow improve the tree depth. The same functionality should be available in VACUUM (just in case). Thus we should 'free' the index pages with one VACUUM run, instead of two.

In the spirit of my ramblings about automatic statistics/suggestions by PostgreSQL for optimizations, could you also implement a NOTICE when the index becomes too 'thin'? I believe this will help avoid severe performance degradation if the process of removing the dead tuples becomes automatic. It also occurs to me that if such statistics are available, PostgreSQL might run VACUUM automatically on specific tables/indexes --- all this controlled by a GUC variable.

Daniel
>>> Justin Clift said:
> <snip>
> > In theory, if we find recyclable page(s) at the physical end of the index,
> > we could truncate the file (ie, give the space back to the filesystem)
> > instead of reporting these pages to FSM. I am not sure if this is worth
> > doing --- in most cases it's likely that little space can be released this
> > way, and there may be some tricky locking issues.
>
> Sounds like this would be beneficial for environments with high
> update/delete transaction volumes, perhaps on smaller amounts of
> live/valid data.

But if dead pages are removed (returned to FSM?) as soon as the last item is removed from the page, the page usage will remain small. Or perhaps not? That is, it's unlikely to collect a large number of dead/free pages at the end of the physical storage, except if doing all this in a single VACUUM session.

Daniel
>>> Bruce Momjian said:
> This brings up one item it would be nice to address at the same time.
> It would be nice if VACUUM FULL would be able to compress the actual
> index file and return unused space to the operating system. REINDEX
> does this, but I was thinking of something a little lighter that could
> be done automatically as part of VACUUM FULL. If we can do that, it
> would make consistent behavior for vacuum and heap/index files.

Since lazy VACUUM exists, I have always wondered why VACUUM FULL doesn't run REINDEX on a table where a significant number of tuples were deleted. VACUUM knows those numbers --- I always run REINDEX on larger tables that had a huge number of index entries deleted during VACUUM...

Daniel
tom lane initially wrote:
> Restructuring the tree during page deletion
> -------------------------------------------
>
> We will delete only completely-empty pages. If we were to
> merge nearly-empty pages by moving data items from one page
> to an adjacent one, this would imply changing the parent's
> idea of the bounding key between them --- which is okay if we
> are just deleting an internal key in the parent, but what if
> the pages have different parent pages?

and a bit later wrote:
> My feeling is that what we need to fix now is index bloat during
> normal operation.

How about doing deletion of partial pages with reorganization among sibling pages only (where the parent pages are the same)? This avoids the "messiness" of propagating the deletes to differing parent pages but gets most of the value of reorganization.

ISTM that a VACUUM that only reclaims empty pages will be helpful in certain cases but won't help much at all in many other common "normal operation" cases which would be helped by partial reorganization.

- Curtis
Daniel Kalchev <daniel@digsys.bg> writes:
> ... Thus we should 'free' the index
> pages with one VACUUM run, instead of two.

Bear in mind that the only thing that deletes index entries is ... VACUUM. Thus what we're really discussing is whether VACUUM should delete an index page at the same time it deletes the last entry thereon, or in a separate pass over the index. This is a minor implementation detail, not something the user will really notice in terms of having to run VACUUM multiple times to get the same effect.

Also, it will definitely take two VACUUM runs (minimum) before a formerly-live page can enter the FSM. Once the page is marked dead (in one run) we still need to wait for the drain interval before we can give it to the FSM (in a later run). This is more or less analogous to the fact that VACUUM can't remove recently-deleted heap tuples, even though they are committed dead. You have to wait till there's no one left that might want to access that tuple (or index page).

			regards, tom lane
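(The recycling rule being discussed reduces to a two-field check; a toy sketch, where dict-based pages and plain integer XID comparison are simplifications --- real transaction-ID comparison would also have to cope with XID wraparound:)

```python
# Toy sketch of the "drain" rule: a deleted page is stamped with the
# next-transaction-ID counter at deletion time, and may be recycled only
# once that XID is older than the oldest transaction still running --- at
# that point no in-flight reader can still hold a pointer to the page.

def mark_dead(page, next_xid):
    page["dead"] = True
    page["death_xid"] = next_xid

def recyclable(page, oldest_running_xid):
    return bool(page.get("dead", False) and
                page["death_xid"] < oldest_running_xid)
```

One VACUUM run stamps the page via mark_dead; only a later run, once oldest_running_xid has advanced past the stamp, sees recyclable() return true and hands the page to the FSM --- hence the two-run minimum.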
"Curtis Faith" <curtis@galtcapital.com> writes:
> ISTM, that a VACUUM that only reclaims empty pages will be helpful in
> certain cases but won't help much at all in many other common "normal
> operation" cases which would be helped by partial reorganization.

Got any evidence to back that up? I was relying on

[Johnson89] Johnson, T. and Shasha, D. Utilization of B-trees with Inserts, Deletes and Modifies. ACM Symp. on PODS, 235-246, 1989.

which provides a ton of math and simulations leading up to the conclusion that collapsing btree pages before they are fully empty doesn't really gain anything meaningful in terms of storage utilization, in most scenarios.

			regards, tom lane
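(The free-at-empty versus merge-at-half trade-off can be illustrated with a toy single-level simulation --- this is not a real B-tree: pages are just fill counts, splits and merges are idealized, and the workload is a random balanced insert/delete mix:)

```python
import random

PAGE_CAP = 40  # abstract items per page

def simulate(policy, ops=20000, seed=1):
    """Return (utilization, fill counts) after a balanced insert/delete mix."""
    rng = random.Random(seed)
    pages = [PAGE_CAP // 2]                      # one half-full page to start
    for _ in range(ops):
        if rng.random() < 0.5:                   # insert into a random page
            i = rng.randrange(len(pages))
            if pages[i] == PAGE_CAP:             # full: split into two halves
                pages[i] = PAGE_CAP // 2
                pages.append(PAGE_CAP - PAGE_CAP // 2)
            pages[i] += 1
        else:                                    # delete a random existing item
            total = sum(pages)
            if total == 0:
                continue
            r = rng.randrange(total)             # pick page weighted by fill
            i = 0
            while r >= pages[i]:
                r -= pages[i]
                i += 1
            pages[i] -= 1
            if policy == "free-at-empty":
                if pages[i] == 0 and len(pages) > 1:
                    pages.pop(i)                 # reclaim only when empty
            else:                                # merge-at-half
                if pages[i] < PAGE_CAP // 2 and len(pages) > 1:
                    j = rng.choice([k for k in range(len(pages)) if k != i])
                    combined = pages[i] + pages[j]
                    for k in sorted((i, j), reverse=True):
                        pages.pop(k)
                    if combined <= PAGE_CAP:
                        pages.append(combined)   # merge into one page
                    else:                        # redistribute into two halves
                        pages.append(combined // 2)
                        pages.append(combined - combined // 2)
    return sum(pages) / (len(pages) * PAGE_CAP), pages
```

By construction the merge-at-half variant keeps every page at least half full (whenever more than one page exists), while free-at-empty lets utilization drift lower but restructures far less often --- which is the concurrency argument the [JS93] quote below turns on.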
tom lane wrote:
> Got any evidence to back that up? I was relying on
>
> [Johnson89] Johnson, T. and Shasha, D. Utilization of
> B-trees with Inserts, Deletes and Modifies. ACM Symp. on
> PODS, 235-246, 1989.
>
> which provides a ton of math and simulations leading up to
> the conclusion that collapsing btree pages before they are
> fully empty doesn't really gain anything meaningful in terms
> of storage utilization, in most scenarios.

Well, I don't have that specific paper handy, but I do have:

[JS93] Theodore Johnson, Dennis Shasha, "B-trees with inserts and deletes: why free-at-empty is better than merge-at-half"

which appears to be their later thinking on the same subject. Note the following:

"A merge-at-half B-tree will always have a space utilization of at least 50%. When all operations are modify operations, or when the number of insert operations is the same as the number of delete operations, then the utilization will be about 60%. In contrast, a free-at-empty B-tree has a 0% lower bound on its space utilization, and will have about 39% utilization on a pure-modify instruction mix. However, the space utilization of a free-at-empty B-tree remains high if there are just a few more insert operations than delete operations. Thus, merge-at-half usually buys little in terms of space utilization.

In Figure 6, we showed that the restructuring rate of a merge-at-half B-tree is significantly larger than the restructuring rate of a free-at-empty B-tree for all values of q * :1. For many concurrent B-tree algorithms used in practice [4, 13], restructuring causes a serialization bottleneck. Thus, one simple but important way to increase concurrency in B-tree operations is to reduce the probability of restructuring. Since merge-at-half buys little space utilization but is expensive in terms of restructuring, we recommend that B-trees (especially concurrently accessed ones) use free-at-empty."
I don't dispute their conclusions in that context and under the circumstances they outline of random distribution of deletion and insertion values for the index keys. However, as [Jannink95] ("Implementing Deletion in B+-trees", SIGMOD RECORD, v.24, n.1, p.33-38, 1995) points out, that assumption doesn't hold under other possibly common circumstances, specifically circumstances where the deletes are taking place in significant sections of the index at a much higher rate than inserts.

Consider from [Jannink95]:

"There has been some research on the acceptability of relaxing the constraint of minimum node size to reduce the number of so-called unsafe tree operations, i.e., those which contain node splitting and merging [ZH89]. The debate has culminated in analysis of a weaker form of the deletion algorithm which we call lazy deletion, that imposes no constraints on the number of entries left in the nodes, allowing them to empty completely before simply removing them. According to [GR93], most database system implementations of B+-trees have adopted this approach. Its most effective use is when it is vital to allow concurrent access to the tree [JS93b], and excessive splitting and merging of nodes would restrict concurrency. [JS89] derives some analytic solutions calculating memory utilization for B+-trees under a mix of insertions and lazy deletions, based on previous research which considered insertions only [BY89]. The simulations in [JS89] support its analysis to show that in typical situations, where deletions don't outnumber insertions in the mix of operations, the tree nodes will contain acceptable percentages of entries. One of the work's assumptions [JS93a] is that the keys and tree operations are chosen uniformly from a random distribution. This assumption is unreasonable in certain realistic situations such as one described below.
Allowing interior nodes with only a single pointer to exist in a B+-tree creates the possibility for arbitrarily unbalanced trees of any height, which are virtually empty, and in which access times have degenerated from the logarithmic bound B+-trees are meant to guarantee to a worst case unbounded access time. Since nodes are not removed until they are completely empty, the lazy deletion algorithm does not regulate tree height effectively."

Jannink then illustrates an example where an index is created based on a timestamp, where the basic assumption of Johnson and Shasha does not hold since it is not a random distribution but a monotonically increasing value. His example is an extreme one, but I believe there are many instances where a timestamp, sequence or some other monotonically increasing value is used in an index and where deletes are taking place much more frequently for largely older values. Since sequences are often used as foreign keys, a significant number of indexes fit into this category.

Consider a web site that tracked users and that deleted inactive accounts. There are many real-world scenarios where the number of inactive accounts is very high as a percentage of all accounts. Consider an example where 50% of the accounts are inactive and deleted after 90 days of disuse and 90% are inactive after 180 days. If the Account is implemented with an ID based on a sequence, any tables that referenced the inactive account's sequence-based ID via foreign keys would have index entries that would be sparsely populated for the older entries but might not have a high percentage of completely empty index pages. The older, but still active, accounts would keep the index pages from being completely empty in many cases.

Further, Johnson and Shasha make the claim that "free-at-empty is better" in the specific context of restructuring causing a serialization bottleneck.
I don't think this applies to the specific case I recommended where the parent is the same for both leaves, especially during a VACUUM process where presumably we can optimize for concurrency rather than the speed of VACUUM itself. - Curtis
"Curtis Faith" <curtis@galtcapital.com> writes: > I don't dispute their conclusions in that context and under the > circumstances they outline of random distribution of deletion and > insertion values for the index keys. [But the random-distribution > assumption doesn't always hold.] That's a fair point. Nonetheless, we have little choice: we cannot move keys around during concurrent operations. If we push keys to the right, we may cause an indexscan moving left to miss them, and vice versa. So we can only eliminate empty pages. We could possibly allow VACUUM FULL to collapse partly-full pages, since in that case we know there are no concurrent scans. regards, tom lane
tom lane wrote:
> "Curtis Faith" <curtis@galtcapital.com> writes:
> > I don't dispute their conclusions in that context and under the
> > circumstances they outline of random distribution of deletion and
> > insertion values for the index keys. [But the random-distribution
> > assumption doesn't always hold.]
>
> That's a fair point. Nonetheless, we have little choice: we
> cannot move keys around during concurrent operations. If we
> push keys to the right, we may cause an indexscan moving left
> to miss them, and vice versa. So we can only eliminate empty pages.
>
> We could possibly allow VACUUM FULL to collapse partly-full
> pages, since in that case we know there are no concurrent scans.

Couldn't we do an exclusive lock call on both leaf pages and the parent using a new LockBuffer type function, named something like LockBufferNoWait, that uses LWLockConditionalAcquire instead of LWLockAcquire? In the event that all three exclusive locks cannot be obtained, release all three locks, sleep, and try again for N retries. (Actually, this would probably be four locks since the sibling pointer of one of the siblings would have to be updated to point to the new merged page instead of the to-be-deleted page.)

Having exclusive locks on all three pages prior to a merge would ensure that no scans were interrupted during that merge. Releasing all the exclusive locks in the event of failure to obtain any of the locks will eliminate the possibility of creating deadlocks. After N retries, VACUUM could abort, leaving the merge to be picked up in another future VACUUM run.

I would think that this would be fairly easy to implement and that except for very heavily scanned indexes, most of the time the locks could be acquired and the merge would take place without causing any concurrency problems.

- Curtis
Tom Lane kirjutas N, 13.02.2003 kell 20:10:
> "Curtis Faith" <curtis@galtcapital.com> writes:
> > I don't dispute their conclusions in that context and under the
> > circumstances they outline of random distribution of deletion and
> > insertion values for the index keys. [But the random-distribution
> > assumption doesn't always hold.]
>
> That's a fair point. Nonetheless, we have little choice: we cannot
> move keys around during concurrent operations. If we push keys to
> the right, we may cause an indexscan moving left to miss them, and
> vice versa. So we can only eliminate empty pages.

But if we would allow the scans to find the same keys twice without ill effects (as was suggested earlier, for using btrees to index arrays), then we could possibly 1) copy the key to the right, 2) wait for all right-to-left scans that have fallen between old and new values to pass, and only then 3) delete the "old left" key. This could solve the concurrency issue as well.

> We could possibly allow VACUUM FULL to collapse partly-full pages,
> since in that case we know there are no concurrent scans.
Hannu Krosing <hannu@tm.ee> writes: > But if we would allow the scans to find the same keys twice without ill > effects (as was suggested earlier, for using btrees to index arrays), How is returning the same data twice not an "ill effect"? > then we could possibly 1) copy the key to the right 2) wait for all > right-to-left scans that have fallen between old and new values to pass > and only then 3) delete the "old left" key. How will you wait for scans that you know nothing of to go past? Especially when they are going to be blocked by your own write lock on the left page? regards, tom lane
Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > But if we would allow the scans to find the same keys twice without ill
> > effects (as was suggested earlier, for using btrees to index arrays),
>
> How is returning the same data twice not an "ill effect"?
>
> > then we could possibly 1) copy the key to the right 2) wait for all
> > right-to-left scans that have fallen between old and new values to pass
> > and only then 3) delete the "old left" key.
>
> How will you wait for scans that you know nothing of to go past?
> Especially when they are going to be blocked by your own write lock
> on the left page?

I think we should skip any pages where we can't get a lock.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane kirjutas R, 14.02.2003 kell 01:13:
> Hannu Krosing <hannu@tm.ee> writes:
> > But if we would allow the scans to find the same keys twice without ill
> > effects (as was suggested earlier, for using btrees to index arrays),
>
> How is returning the same data twice not an "ill effect"?

From earlier discussions I understood that there had been some work done on using btrees for indexing arrays by storing each separate element in a leaf node. Surely that work must deal with not returning the same tuple twice.

> > then we could possibly 1) copy the key to the right 2) wait for all
> > right-to-left scans that have fallen between old and new values to pass
> > and only then 3) delete the "old left" key.
>
> How will you wait for scans that you know nothing of to go past?
> Especially when they are going to be blocked by your own write lock
> on the left page?

could we just not lock (for more than just to ensure atomic writes) the page but instead increment a "page version" on each write to detect changes?

------------
Hannu
Hannu Krosing <hannu@tm.ee> wrote: > could we just not lock (for more than just to ensure atomic writes) the > page but instead increment a "page version" on each write to detect > changes? Sounds like Index MVCC..., very nice. ;-) (Of course I have no clue about feasibility, just liked the idea) Best Regards, Michael Paesold
Hannu Krosing <hannu@tm.ee> writes:
> Tom Lane kirjutas R, 14.02.2003 kell 01:13:
>> How is returning the same data twice not an "ill effect"?
> From earlier discussions I understood that there had been some work done
> on using btrees for indexing arrays by storing each separate element in
> a leaf node. Surely that work must deal with not returning the same
> tuple twice.

The only mechanism that exists for that is to discard tuples that meet the qualification tests of previous indexscans. This cannot prevent returning the same tuple twice in one scan, if the index is so ill-behaved as to return the same pointer twice. I don't know what Vadim had in mind to support multiple index entries per tuple, but it's certainly not in the code yet.

>> How will you wait for scans that you know nothing of to go past?
>> Especially when they are going to be blocked by your own write lock
>> on the left page?
> could we just not lock (for more than just to ensure atomic writes) the
> page but instead increment a "page version" on each write to detect
> changes?

How does that help? The left-moving indexscan still has no way to recover. It can't go back to the page it was on before and try to determine which entries you added there, because it has no reliable reference point to do so. The entry it previously returned might not be there anymore, and in a non-unique index key comparisons won't help. And even more to the point, how would it know you've changed the left page? It has no idea what the previous "page version" on the left page was, because it was never there before.

			regards, tom lane
tom lane wrote: > How does that help? The left-moving indexscan still has no > way to recover. It can't go back to the page it was on > before and try to determine which entries you added there, > because it has no reliable reference point to do so. The > entry it previously returned might not be there anymore, and > in a non-unique index key comparisons won't help. And even > more to the point, how would it know you've changed the left > page? It has no idea what the previous "page version" on the > left page was, because it was never there before. Another way this could be handled is by not merging onto one of the existing pages but to a brand new page, a kind of special case shadow index page. That way the sibling pointers, and leaf page pointer in the parent could all be updated atomically to point to the new page. In-process index scans would still reference the merged pages which would not be deleted but marked as dead using a mechanism like you proposed for marking empty pages dead with the next-transaction-ID counter. Merging could be done after a VACUUM pass that performs deletion of empty pages in order to provide a pool of empty pages to use for the merge. This would keep the index from temporarily growing during the merge process. A similar approach could be used to reorder the index pages in the background. An index that was reordered to fairly closely reflect the tree as a breadth first traversal would provide much faster index scans if the file is not heavily fragmented. - Curtis
On Wed, 12 Feb 2003 17:42:44 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Instead of an actively maintained freelist on disk as per Alvaro Herrera's
> patch, I plan to use the FSM to remember where recyclable pages are, much
> as we do for tables.

Given that we have a mostly empty metapage per index, and the metapage is in memory most of the time, using it for the freelist looks almost like a free lunch. I've picked up Alvaro's patch and played around a bit. Reviewing my changes it turns out that most of them deal with the freelist. If you are interested I can send my version of the patch to you or to the list.

Servus
Manfred
"Curtis Faith" <curtis@galtcapital.com> writes: > Another way this could be handled is by not merging onto one of the > existing pages but to a brand new page, a kind of special case shadow > index page. Hmmm ... that might be made to work, but it would complicate inserts. By the time an insert navigates to the page it should insert on, it might find the page is dead, and then it would have no easy way to get to the replacement page (unless you want to dedicate another link field in every index page for that one use). I suppose it could recover by starting the search over again. Another problem is that the two dead pages are now outside of the btree structure and so their left and right links won't get updated in the face of subsequent splits and merges of the nearby pages. I spent quite a bit of time sweating over the recovery navigation details in my original proposal; I can assure you it's not easy to get right. You can chain right from a dead page with some reliability, but left is another matter. There's no good way to know, when following a left link, whether you've arrived at the proper place or need to chain right to make up for a subsequent split. The recovery procedure I proposed works for removing single pages, but it won't work with substituted pages. regards, tom lane
Manfred Koizar <mkoi-pg@aon.at> writes: > Given that we have a mostly empty metapage per index, and the metapage > is in memory most of the time, using it for the freelist looks almost > like a free lunch. No, because of locking. Every time you write-lock the metapage to add or remove freelist entries, you are denying all other processes the ability to start an index scan. Check the btree literature --- exclusive locks near the root of the tree are death for concurrent performance, and must be avoided as much as possible. If I were planning to use a freelist I would keep it in a different page so as not to need to lock the metapage for freelist manipulations. But I don't see the value of having one at all. It just adds that much more disk traffic (and WAL traffic) for each page split or merge. There are also atomicity concerns --- is addition/removal of a freelist entry an atomic part of the page merge or split operation, or is it a separate atomic operation with its own WAL record? If the former, you have deadlocking problems; if the latter, post-crash-consistency problems. regards, tom lane
tom lane wrote:
> How does that help? The left-moving indexscan still has no
> way to recover. It can't go back to the page it was on
> before and try to determine which entries you added there,
> because it has no reliable reference point to do so. The
> entry it previously returned might not be there anymore, and
> in a non-unique index key comparisons won't help. And even
> more to the point, how would it know you've changed the left
> page? It has no idea what the previous "page version" on the
> left page was, because it was never there before.

Revisiting the idea I proposed in a previous email after more research, I believe it is definitely possible to implement a mutex based scheme that will prevent scans from being polluted by merges of index pages and that does not result in the mutex being held for any significant duration.

1) Current scan code does this:

bool
_bt_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
    ... definitions go here ...

    if (ScanDirectionIsForward(dir))
    {
        if (!PageIsEmpty(page) && offnum < maxoff)
            offnum = OffsetNumberNext(offnum);
        else
        {
            /* walk right to the next page with data */
            for (;;)
            {
                /* if we're at end of scan, release the buffer and return */
                ... skip code here ...

                /* step right one page */
                blkno = opaque->btpo_next;
                _bt_relbuf(rel, *bufP);
                *bufP = _bt_getbuf(rel, blkno, BT_READ);

    ... skip rest of code ...

3) Note that the calls

    _bt_relbuf(rel, *bufP);
    *bufP = _bt_getbuf(rel, blkno, BT_READ);

a) appear adjacent to each other

b) relbuf calls:

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

c) getbuf only calls:

    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, access);

in the case of an existing buffer, rather than a new one.
4) This could easily be reordered into:

    buf = ReadBuffer(rel, blkno);            /* pin next page */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);     /* release lock on current page */
    LockBuffer(buf, BT_READ);                /* lock next page */
    ReleaseBuffer(buf);                      /* now release pin on previously current page */

without affecting the logic of the code or causing any deadlock problems since the release still occurs before the lock of the next page.

5) A mutex/spinlock that was stored in the index could be acquired by the scan code like this:

    buf = ReadBuffer(rel, blkno);            /* pin next page */

    SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);     /* release lock on current page */
    LockBuffer(buf, BT_READ);                /* lock next page */

    SpinLockRelease( indexSpecificMutex );   /* unlock the index reorg mutex */

    ReleaseBuffer(buf);                      /* now release pin on previously current page */

6) The same index specific mutex/spinlock could be used for the merge code surrounding only the acquisition of the four page locks. This would obviate any problems with scans and page merges, since the lock acquisition for the merge could never occur while a scan was between pages.

Further, with the reordering, the spinlock for the scan code doesn't seem particularly onerous since it would be surrounding only two LWLock calls. To reduce the overhead to an absolute minimum for the scan case these could be pushed down into a new LW call (probably necessary since the LockBuffer/ReleaseBuffer code does some error checking and such that one wouldn't want in code guarded by a mutex).

- Curtis
tom lane wrote: > Hmmm ... that might be made to work, but it would complicate > inserts. By the time an insert navigates to the page it > should insert on, it might find the page is dead, and then it > would have no easy way to get to the replacement page (unless > you want to dedicate another link field in every index page > for that one use). I suppose it could recover by starting > the search over again. Inserts could reread just the parent page if they encountered a dead leaf since the parent would have been correctly updated. > Another problem is that the two dead pages are now outside of > the btree structure and so their left and right links won't > get updated in the face of subsequent splits and merges of > the nearby pages. That seems like a "show stopper" that just defers the problem. A split of the left sibling would still screw up a scan that was using the old left leaf page and wanted to move left. Oh, well, the idea seemed worth exploring. - Curtis
I previously wrote:
> 5) A mutex/spinlock that was stored in the index could be
> acquired by the scan code like this:
>
>     buf = ReadBuffer(rel, blkno);            /* pin next page */
>
>     SpinLockAcquire( indexSpecificMutex );   /* lock the index reorg mutex */
>
>     LockBuffer(buf, BUFFER_LOCK_UNLOCK);     /* release lock on current page */
>     LockBuffer(buf, BT_READ);                /* lock next page */
>
>     SpinLockRelease( indexSpecificMutex );   /* unlock the index reorg mutex */
>
>     ReleaseBuffer(buf);                      /* now release pin on previously current page */
>
> 6) The same index specific mutex/spinlock could be used for
> the merge code surrounding only the acquisition of the four
> page locks. This would obviate any problems with scans and
> page merges, since the lock acquisition for the merge could
> never occur while a scan was between pages.
>
> Further, with the reordering, the spinlock for the scan code
> doesn't seem particularly onerous since it would be
> surrounding only two LWLock calls. To reduce the overhead to
> an absolute minimum for the scan case these could be pushed
> down into a new LW call (probably necessary since the
> LockBuffer, ReleaseBuffer code does some error checking and
> such that one wouldn't want in code guarded by a mutex.

I forgot to mention that the mutex would have to be released, in the event the next page lock could not be immediately acquired, just after the addition of the scan process to the lock waiters list, to avoid blocking all scans and probably causing severe deadlock problems.

- Curtis
"Curtis Faith" <curtis@galtcapital.com> writes:
> 4) This could easily be reordered into:
>
>     buf = ReadBuffer(rel, blkno);            /* pin next page */
>     LockBuffer(buf, BUFFER_LOCK_UNLOCK);     /* release lock on current page */
>     LockBuffer(buf, BT_READ);                /* lock next page */
>     ReleaseBuffer(buf);                      /* now release pin on previously current page */
>
> without affecting the logic of the code or causing any deadlock
> problems since the release still occurs before the lock of the next
> page.

Sorry, that *does* create deadlocks. Remember the deletion process is going to need superexclusive lock (not only a BT_WRITE buffer lock, but no concurrent pins) in order to be sure there are no scans stopped on the page it wants to delete. (In the above pseudocode, the fact that you still hold a pin on the previously-current page makes you look exactly like someone who's in the middle of scanning that page, rather than trying to leave it.) The same would be true of both pages if it's trying to merge.

> 5) A mutex/spinlock that was stored in the index could be acquired by
> the scan code like this:

"Stored in the index"? And how will you do that portably? But it doesn't matter, because the approach deadlocks.

			regards, tom lane
tom lane wrote: > Sorry, that *does* create deadlocks. Remember the deletion > process is going to need superexclusive lock (not only a > BT_WRITE buffer lock, but no concurrent pins) in order to be > sure there are no scans stopped on the page it wants to > delete. (In the above pseudocode, the fact that you still > hold a pin on the previously-current page makes you look > exactly like someone who's in the middle of scanning that > page, rather than trying to leave it.) The same would be > true of both pages if it's trying to merge. First, recall that under my very first proposal, the VACUUM process would try to acquire locks but NOT WAIT. Only in the event that superexclusive locks could be obtained on all pages would the merge proceed, otherwise it would DROP all the locks, sleep and retry. This would prevent the VACUUM merge from participating in deadlocks since it would never wait while holding any lock. I was assuming that here as well but did not explicitly restate this, sorry. One also needs to drop the mutex in the event you could not get the lock after placing the process in the waiter list for the next page. This entry will prevent VACUUM that wants to merge from gaining the superexclusive lock until after the scan has finished since the scans waiting lock request will block it, and as you point out, so will the pin. The mutex only needs to guard the crossing of the pages, so the pin existing outside the mutex won't cause a problem. > "Stored in the index"? And how will you do that portably? Sorry for the lack of rigorous language. I meant that there would be one mutex per index stored in the header or internal data structures associated with each index somewhere. Probably in the same structure the root node reference for each btree is stored. - Curtis
"Curtis Faith" <curtis@galtcapital.com> writes: >> "Stored in the index"? And how will you do that portably? > Sorry for the lack of rigorous language. I meant that there would be one > mutex per index stored in the header or internal data structures > associated with each index somewhere. Probably in the same structure the > root node reference for each btree is stored. Hm. A single lock that must be grabbed for operations anywhere in the index is a concurrency bottleneck. But maybe we could avoid that. regards, tom lane
> tom lane wrote:
> > Hm. A single lock that must be grabbed for operations anywhere in
> > the index is a concurrency bottleneck.

I replied a bit later:
> I don't think it would be that bad. It's not a lock but a
> mutex that would only be held for relatively brief duration.
> It looks like tens or a few hundred instructions at most
> covered by the mutex and only during scans that cross page boundaries.
>
> Unless you were talking about a 16 to 64 way SMP box, I don't
> think there would be much actual contention since there would
> have to be contention for the same index at the same time and
> depending on the number of keys per page, very little of the
> run time is spent on the lock transition for traversal across
> pages as compared to all the other stuff that would be going on.

tom also presciently wrote:
> But maybe we could avoid that

After sleeping on it a bit, it occurs to me that you are again correct; the mutex is not required to do what I suggested. It is simply a matter of assuring that during the page traversal the lock is either obtained or the process added to the waiters queue before the current page lock is released. This will ensure that any process trying to obtain an exclusive or superexclusive lock will wait or fail an immediate lock acquisition.

traversal pseudo code:

    if the next page lock can be obtained immediately
        grab it
        release the current page lock
    else    /* we would have to wait */
        add current process to waiters queue
        release current page lock
        sleep on the lock
    end if

end traversal pseudo code

Like the current release-before-next-page-acquisition method, this won't cause deadlocks with other scanners going in the opposite direction or with concurrently inserting processes since the current page lock is released before sleeping.
If this change is made then my original suggestion, that the merge process attempt a non-deadlocking acquisition of all the locks with a drop of all locks and a retry if any of the attempts would have caused a wait, should work fine. If the traversal code is changed as per above, a merging VACUUM process is guaranteed to either run into the lock held by the scanning process on the current page or the next page. This has the potential side benefit of permitting the simplification of the scan code in the case where the lock is obtained immediately. Since the scanning process would know that no inserting process has added pages to the tree because of a split, it wouldn't have to check for newly inserted pages to the right of the next page. I don't know how much work is involved in this check but if it is significant it could be eliminated in this case, one which is the most likely case in many scenarios. What do you think? - Curtis
On Wed, Feb 12, 2003 at 05:42:44PM -0500, Tom Lane wrote:
> I've been thinking hard for the last few days about how to do space
> reclamation in b-tree indexes, i.e., recycle pages that are in
> no-longer-useful portions of the tree structure.

Hi Tom, I've seen your applied changes. It looks real nice; I could probably have spent a year trying to do that. You addressed a couple of things that only occurred to me some time after I left for vacation, and then several that I never thought about. Putting the freelist in the FSM rather than on the metapage still strikes me as kind of strange; I remember you said the metapage was not enough space for all the possible candidate pages, but the FSM is even more limited. Oh well.

Two things I regret: one is being unable to see the changes as patches the way you applied them, to get a sense of how the code evolved. Unfortunately the interface to CVS via web does not allow me to see it, or I don't know how to use it. It's not that important, however, because I was already somewhat familiar with the code.

The other is that now I am left without a graduate project :-( I will continue looking at the TODO list for something that's of an appropriate size for me. (Professors here don't have any but the slightest idea of Postgres, so they aren't of any help.) I'm still thinking on implementing a replacement for the regex engine based on the Shift-or algorithm, if only to have a speed comparison with traditional engines.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
> Two things I regret: one is being unable to see the changes as patches > the way you applied them, to get a sense of how the code evolved. > Unfortunately the interface to CVS via web does not allow me to see it, > or I don't know how to use it. It's not that important, however, > because I was already somewhat familiar with the code. You can do cvs diff commands between versions... > The other is that now I am left without a graduate project :-( I will > continue looking at the TODO list for something that's of an appropiate > size for me. (Professors here don't have any but the slightest idea of > Postgres, so they aren't of any help.) I'm still thinking on > implementing a replacement for the regex engine based on the Shift-or > algorithm, if only to have a speed comparison with traditional engines. How about working on hash indexes? They have several problems at the moment that make them unusable: 1. They are really slow to build 2. They have poor concurrency 3. They are not any faster than btree indexes for '=' operators (which is lame!) 4. They are not WAL logged or recoverable 5. They cannot be marked UNIQUE (IIRC) Since you already have experience in the postgres indexing arena, you could transfer your skills to this new index method easily? Apparently berkelydb has good hash index code that could be looked at... Other things to look at perhaps would be that GiST indexes have poor concurrency and aren't WAL logged. Also, how about looking into the fact that you cannot store constants in functional indexes? Or what about bitmap indexes? Cheers, Chris
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Putting the freelist in the FSM rather than on the metapage still strikes
> me as kind of strange; I remember you said the metapage was not enough
> space for all the possible candidate pages, but the FSM is even more
> limited.

Well, on modern machines there should be no problem with making an FSM with multiple-million page slots, of which an individual index might claim many thousand slots at need. But you're right that a metapage plus overflow pages could remember more free pages. The problem with that approach is that (a) you incur more I/O --- not only index traffic, but WAL traffic, if you want the storage to be reliable; and (b) you incur more locking overhead. Write locks on the metapage are death for concurrency because they'll block all incoming btree operations.

We might have been able to avoid (b) by putting the freelist head somewhere other than the primary metapage --- it was only a false notion of storage economy that led us to think the wasted space on the metapage was an appropriate place for the freelist. But there's really no way around (a). Given that we're not keeping records of heap free space on disk, it seemed to me to make sense not to keep records of index free space on disk either. Better to have one mechanism and concentrate on making it good than have two so-so mechanisms.

> The other is that now I am left without a graduate project :-(

Oops :-(. Well, there's plenty of stuff on the TODO list, not to mention useful projects that haven't made it to the list.

			regards, tom lane
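[The design Tom defends here --- record only truly-empty, recyclable index pages, as bare page numbers with no amount-free statistic, in a bounded shared structure --- amounts to something like the following toy. This is purely illustrative: the class and method names are invented, and the real FSM is a shared-memory structure with its own eviction policy.]

```python
class IndexFreePageMap:
    """Toy, lossy map of recyclable index pages.

    Only completely-empty pages are entered, so we store bare block
    numbers and no free-space statistic. The map is bounded, like the
    shared-memory FSM: when full, new entries are simply dropped, which
    is safe because the next linear VACUUM scan of the index will
    rediscover any forgotten empty pages.
    """

    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self.pages = set()

    def record_free_page(self, blkno: int) -> bool:
        if len(self.pages) >= self.max_pages:
            return False   # lossy: forgotten, not leaked
        self.pages.add(blkno)
        return True

    def get_free_page(self):
        # Returns a recyclable block, or None (caller extends the relation).
        return self.pages.pop() if self.pages else None
```

This illustrates the trade Tom describes: losing an entry costs nothing but a later rescan, which is why an in-memory map can replace a WAL-logged on-disk freelist.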
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
>> The other is that now I am left without a graduate project :-(

> [ various good suggestions ]

FWIW, I would much rather see effort put into GiST than hash indexes. I think that btree and GiST will be our two primary index types in the long run. Hash and rtree don't have enough to recommend them to justify directing maintenance effort that way rather than to the big guys.

Just MHO of course...

			regards, tom lane
Just a few questions: Is it vital that the index shrink? There are three conceptual phases of a database, right? It is growing, it is shrinking, and it is staying the same. In the case that it is growing, this is not an issue. In the case that it is staying the same, surely it must come to some sort of equilibrium. In the case that the DB is shrinking, is this really an issue? I would say that disk space utilization is far secondary to performance.

Tom Lane wrote:

>I've been thinking hard for the last few days about how to do space reclamation in b-tree indexes, i.e., recycle pages that are in no-longer-useful portions of the tree structure. We know we need this to solve the "index bloat" problem that occurs when the distribution of keys changes over time. I feel that it's critical that the reclamation be doable by plain VACUUM, ie, without acquiring exclusive lock on the index as a whole. This discussion therefore assumes that page deletion must be able to operate in parallel with insertions and searches.
>
>Issues
>------
>
>We need to get rid of parent links in btree pages; otherwise removal of a non-leaf page implies we must find and update all parent links that lead to it. This is messy enough that it would be better to do without. The only thing the parent link is really needed for is to find the new parent level after a root split, and we can handle that (very infrequent) case by re-descending from the new root.
>
>Instead of storing parent links, label all pages with level (counting levels up from zero = leaf, so that a page's level doesn't change in a root split). Then, if we find ourselves needing to re-descend, we can be sure of finding the correct parent level, one above where we were, even if there's been multiple root splits meanwhile. The level will also make for a useful cross-check on link validity: we will always know the level of the page we expect to arrive at when following a link, so we can check that the page has the right level.
>
>Unfortunately this means tossing out most of the FixBtree code Vadim wrote 2 years ago, because it seems critically dependent on having parent links. But I don't really see why we need it if we rely on WAL to maintain btree consistency. That will require some improvements in WAL-logging for btrees, however. (Details below.)
>
>When a page is deleted, it can't actually be recycled until there are no more potentially in-flight references to it (ie, readers that intend to visit the page but have not yet acquired a lock on it). Those readers must be able to find the page, realize it's dead, and follow the correct sidelink from it. [Lanin&Shasha86] describe the "drain technique", which they define as "delay freeing the empty page until the termination of all processes whose locate phase began when pointers to the page still existed". We can conveniently implement this by reference to transactions: after removing all links to the page, label the now-dead page with the current contents of the next-transaction-ID counter. VACUUM can recycle the page when this is older than the oldest open transaction.
>
>Instead of an actively maintained freelist on disk as per Alvaro Herrera's patch, I plan to use the FSM to remember where recyclable pages are, much as we do for tables. The FSM space requirements would be small, since we'd not be needing to enter any data about partially-full pages; only truly empty, recyclable pages would need to be stored. (Is it worth having an alternate representation in the FSM for indexes, so that we only store page numbers and not the useless amount-free statistic?)
>
>Without a freelist on disk, VACUUM would need to scan indexes linearly to find dead pages, but that seems okay; I'm thinking of doing that anyway to look for empty pages to turn into dead ones.
>
>Restructuring the tree during page deletion
>-------------------------------------------
>
>We will delete only completely-empty pages. If we were to merge nearly-empty pages by moving data items from one page to an adjacent one, this would imply changing the parent's idea of the bounding key between them --- which is okay if we are just deleting an internal key in the parent, but what if the pages have different parent pages? We'd have to adjust the parents' own bounding key, meaning the parents' parent changes, perhaps all the way to the root. (Not to mention that with variable-size keys, there's no guarantee we can make such changes without splitting the upper-level pages.) And, since we support both forward and backward index scans, we can't move leaf items in either direction without risking having a concurrent scan miss them. This is way too messy, especially for something that has only minimal return according to the literature [Johnson89]. So, no merging.
>
>Deletion of an empty page only requires removal of the parent's item linking to it (plus fixing side pointers, which is pretty trivial). We also remove the next higher key in the parent, which is the parent's upper bound for data that would have belonged on the target page. Therefore, the page's right sibling becomes responsible for storing the key range that used to belong on the deleted page.
>
>What if there is no next-higher key, you ask? Well, I'm going to punt. It is critical that the key space associated with a parent page match the key space associated with its children (eg, the high key of the rightmost child must match the parent's high key). There is no way to atomically modify a parent's key space --- this would mean simultaneously changing keys in upper levels, perhaps all the way up to the root. To avoid needing to do that, we must put a couple of restrictions on deletion:
>
>1. The rightmost page in any tree level is never deleted, period. (This rule also simplifies traversal algorithms, as we'll see below.)
>
>2. The rightmost child of a non-rightmost parent page can't be deleted, either, unless it is the last child of that parent. If it is the last child then the parent is *immediately* marked as half-dead, meaning it can never acquire any new children; its key space implicitly transfers to its right sibling. (The parent can then be deleted in a later, separate atomic action. Note that if the parent is itself a rightmost child of a non-rightmost parent, it will have to stay in the half-dead state until it becomes the only child; then it can be deleted and its parent becomes half-dead.)
>
>(Note that while leaf pages can be empty and still alive, upper pages can't: they must have children to delegate their key range to.)
>
>With these restrictions, there is no need to alter the "high key" fields of either the parent or the siblings of a page being deleted. The key space of the page itself transfers to its right sibling, and the key space of the parent does not change (except in the case where the parent loses all its key space to its right sibling and becomes half-dead).
>
>Restriction 1 is not a significant one in terms of space wastage. Restriction 2 is more annoying, but it should not increase overhead drastically. The parent would not be deleted or merged anyway as long as it has other children, so the worst-case overhead from this restriction is one extra page per parent page --- and parent levels are normally much smaller than child levels.
>
>The notion of a half-dead page means that the key space relationship between the half-dead page's level and its parent's level may be a little out of whack: key space that appears to belong to the half-dead page's parent on the parent level may really belong to its right sibling. We can tolerate this, however, because insertions and deletions on upper tree levels are always done by reference to child page numbers, not keys. The only cost is that searches may sometimes descend to the half-dead page and then have to move right, rather than going directly to the sibling page.
>
>Page deletion procedure
>-----------------------
>
>Assume we know the target page, but hold no locks. The locks must be gotten in this order to avoid deadlock against inserters:
>
>1. Obtain W lock on left sibling of target page (this may require searching right if left sibling splits concurrently; same as for backwards scan case). Skip this if target is leftmost.
>2. Obtain super-exclusive lock on target page; check it is still empty, else abandon deletion.
>3. Obtain W lock on right sibling (there must be one, else we can't delete).
>4. Find and W-lock the target's current parent page. Check that target is not rightmost child of a parent with other children, else abandon deletion.
>5. Update all four pages at once, write single WAL record describing all updates. (I believe we can extend WAL to allow 4 page references in a single record, if not it may be okay to cheat and not store all of the adjacent pages. Certainly need not record contents of the target page itself, so 3 is enough.) Note parent is marked half-dead here if this was its last child. Release locks.
>6. The update to the target page makes it empty and marked dead, but preserves its sibling links. It can't actually be recycled until later (see above).
>
>The deletion procedure could be triggered immediately upon removal of the last item in a page, or when the next VACUUM scan finds an empty page. Not sure yet which way is better.
>
>Having to exclusive-lock four pages is annoying, but I think we must do this procedure as a single atomic operation to ensure consistency.
>
>Search/scan rules in presence of deletion
>-----------------------------------------
>
>Since page deletion takes a superexclusive lock, a stopped scan will never find the page it's on deleted when it resumes. What we do have to worry about is arriving at a dead page after following a link. There are several cases:
>
>During initial search for a key, if arrive at a dead or half-dead page, chain right. (The key space must have moved to the right.)
>
>During forward scan: likewise chain right. (Note: because scans operate only on the leaf level, they should never see half-dead pages.)
>
>During backwards scan: chain right until a live page is found, and step left from there in the usual way. We cannot use the dead page's left-link because its left neighbor may have split since the page was deleted. (There is certain to be at least one live page to the right, since we never delete the rightmost page of a level.)
>
>Also, "step left in the usual way" used to mean this:
>	A backwards scan has one additional bit of complexity: after
>	following the left-link we must account for the possibility that the
>	left sibling page got split before we could read it. So, we have to
>	move right until we find a page whose right-link matches the page we
>	came from.
>But if the "page we came from" is now dead, perhaps there is no page with a matching right-link (if updater got to it before we did)! Probably the best policy, if we fail to find a match, is to return to the page we were previously on, verify that it's now dead (else error), then chain right and left as in the case where we've linked to a dead page (see above). This means a backwards scan must always remember the last live page it was on. Again, this is greatly simplified by the fact that there must be a live page somewhere to the right.
>
>Insertion changes
>-----------------
>
>The basic insertion algorithm doesn't change (but see above notes about the search part).
>
>bt_getstackbuf is a little trickier in the presence of deletion. First issue is that removal of an earlier downlink in the returned-to parent page may cause the target item to move left by some number of slots (though not into a previous page). We can deal with this by searching left after we fail to find the item by searching right, before we move on to the next page. Next is that the whole parent page may be dead or half-dead --- but it can't be physically deleted, so we can recover by following its rightlink. The nasty part is that the target item could perhaps not be there; this would imply that someone deleted the page we are trying to split, or hasn't fully finished inserting it. But L&Y already require holding a lock on that page until we have re-located and locked the parent item, so this is not possible. (Note this implies that we must search for exactly the downlink to the page we are splitting, but that's true anyway.)
>
>If we need to split on a level that was the root when we descended, but is no longer the root, then our stack doesn't tell us how to find the next level up. As discussed above, handle this case by re-descending and checking level fields, rather than depending on parent links as before. We will simply descend on the left edge of the tree and scan the whole parent level to find where to insert --- it's highly unlikely that the parent level has yet grown large enough for this to be slow, so there's not much value in trying to use a keyed search.
>
>WAL issues
>----------
>
>A page split operation can't really be atomic. We can handle the actual split ("half-split") operation as an atomic update:
>
>1. We have W lock on page that needs to be split.
>2. Find a free page (from FSM, or make it by extending rel), W-lock it.
>3. W-lock page's right sibling (if any), so that we can fix its left-link.
>4. Update all three pages, make WAL log entry describing same, write pages.
>5. We can now release the W-lock on the right sibling page, but we must keep the W-locks on the two split pages until we have made the parent entry; else it's conceivable for someone else to try to split or delete these pages and get confused because the parent link isn't in place yet. (See L&Y.)
>
>(Note that since this is driven by an insert, the half-split WAL log entry may as well include the new key insertion step; so this entry substitutes for the plain insertion WAL entry we'd otherwise make.)
>
>(So far this is the same as the existing code.)
>
>But now we must find the parent page and enter the new key and downlink (possibly causing a split at that level). If crash occurs before we can do so, how to recover? The existing code has a lot of ad-hoc logic that tries to reconstruct the tree on-the-fly when inconsistency is detected. But it would be a lot better if we could expect WAL playback to fix things.
>
>The WAL entry describing the half-split includes enough info to re-execute the parent key insertion. While replaying log, we can remember this insertion as pending. If we see a matching insertion event in the log, discard the remembered pending insertion. At end of log, execute all still-pending insertions. (There should not be very many pending insertions at any one time, so the resources for this are not onerous.) Note however that this implies making more log entries to describe these additional updates; might complicate WAL, but I see no fundamental difficulty.
>
>Also, if we are splitting a root page, it seems best to merge creation of the new root page and updating of the metapage into the same atomic action that does the split. This won't make the WAL entry materially larger, and it will avoid problems with concurrent root splits. Need to look at deadlock issues for trying to update metapage --- is it okay to grab W lock on meta while holding it on old root? (Sure, why not? Nothing will go the other way.) We do need to update metapage as part of the atomic root-split operation, else other updaters trying to find the new root may fail because link not there yet.
>
>VACUUM mechanics
>----------------
>
>We will institute an explicit "vacuuming an index" phase in VACUUM (this is distinct from deleting index entries during vacuuming of a table, and will be done after we've finished vacuuming the associated table). In this phase, we can scan the index linearly (ie, in storage order not index order) to take advantage of sequential I/O optimizations. We scan looking for pages that are empty (if leaf pages) or half-dead (if parent pages), as well as pages that are already dead and have been so long enough to be recycled. The latter are simply reported to the FSM. Empty and half-dead pages are deleted per the previously-described mechanism. (It seems best to gather the page IDs of target pages in memory, and do the deletions only after we've finished the seqscan, so as not to confuse the kernel about whether it should read-ahead.)
>
>One step that's not explicit above is how to find the parent of a page we intend to delete. It seems most efficient to perform a search for the page's high key (it must have one, since we don't delete rightmost pages) stopping when we reach the level above the page itself. This will give us a good pointer to the parent. We should do this before starting to acquire exclusive locks for the deletion.
>
>In theory, if we find recyclable page(s) at the physical end of the index, we could truncate the file (ie, give the space back to the filesystem) instead of reporting these pages to FSM. I am not sure if this is worth doing --- in most cases it's likely that little space can be released this way, and there may be some tricky locking issues.
>
>Tree depth
>----------
>
>Because we never relabel pages' levels, the tree depth cannot be reduced (we'd have to do that by removing the current root page, which seems impractical without an exclusive lock on the whole index). So after massive shrinkage we could end up with a "thin" tree in which there are levels below the root with only one page apiece. The space wastage doesn't seem like a big issue, but the extra time to traverse these useless levels could be annoying.
>
>This could be ignored in first implementation (there's always REINDEX). Later, possibly handle it via Lanin&Shasha's notion of a critic (think VACUUM) that sets a fast pointer to the current effective root level. (Actually I think we wouldn't need a separate critic process; split and delete steps could be programmed to update the fast pointer for themselves, in a separate atomic action, when they split a one-page level or delete the next-to-last page of a level.)
>
>Any comments before I plunge into implementing this?
>
>	regards, tom lane
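[The drain-technique recycling rule quoted above --- stamp a dead page with the next-transaction-ID counter, recycle it once that stamp is older than the oldest open transaction --- can be sketched as a toy model. This is illustrative only: the names are invented, and real XIDs wrap around, which a full version would handle with circular comparison.]

```python
class DrainTracker:
    """Toy model of Lanin & Shasha's drain technique via transaction IDs.

    A deleted page is stamped with the current next-transaction-ID
    counter; any reader that might still hold a pointer to it began
    earlier and so has an XID below the stamp. The page becomes
    recyclable once every such transaction has ended.
    """

    def __init__(self):
        self.next_xid = 1       # the next-transaction-ID counter
        self.active = set()     # XIDs of currently open transactions
        self.dead_pages = {}    # block number -> XID stamp at deletion

    def begin_xact(self) -> int:
        xid = self.next_xid
        self.next_xid += 1
        self.active.add(xid)
        return xid

    def end_xact(self, xid: int) -> None:
        self.active.discard(xid)

    def mark_page_dead(self, blkno: int) -> None:
        # Stamp with the *current* counter value, not an allocated XID.
        self.dead_pages[blkno] = self.next_xid

    def recyclable_pages(self) -> list:
        # Safe once all transactions older than the stamp have ended.
        horizon = min(self.active) if self.active else self.next_xid
        return [b for b, stamp in self.dead_pages.items()
                if stamp <= horizon]
```

The key property is that no page-level locking is needed to decide recyclability; VACUUM only compares the stamp against the oldest running transaction.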
> FWIW, I would much rather see effort put into GiST than hash indexes.

What would be a good place to start if I wanted to put some work into improving GiST?

-- 
Itai Zukerman  <http://www.math-hat.com/~zukerman/>
Alvaro, you're welcome to put your hands on GiST. Teodor and I hoped to add concurrency last year but we were too busy. I consider the concurrency issue the most important. There are also some thoughts about improvement of the GiST interface, though.

On Thu, 27 Feb 2003, Tom Lane wrote:

> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> >> The other is that now I am left without a graduate project :-(
>
> > [ various good suggestions ]
>
> FWIW, I would much rather see effort put into GiST than hash indexes.
> I think that btree and GiST will be our two primary index types in the
> long run. Hash and rtree don't have enough to recommend them to justify
> directing maintenance effort that way rather than to the big guys.
>
> Just MHO of course...
>
> 			regards, tom lane

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Fri, 28 Feb 2003, Christopher Kings-Lynne wrote:

> How about working on hash indexes? They have several problems at the
> moment that make them unusable:
> [SNIP]
> Since you already have experience in the postgres indexing arena, you
> could transfer your skills to this new index method easily?
> [SNIP]
> Also, how about looking at the fact that you cannot store constants in
> functional indexes?

Both of these would be most useful. I'd vote for hash indexes first, but not being able to use constants in functional indexes is a giant pain as well.
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> not being able to use constants in functional indexes is a giant pain as
> well.

ISTM the appropriate way to attack this is to replace "functional indexes" with "expressional indexes", wherein the value indexed is anything that can be computed by a simple scalar expression (no aggregates, no sub-selects, only immutable functions/operators). This'd be a direct superset of what we can do now. I haven't studied it in detail, but I think this would not be a terribly large project.

			regards, tom lane
Christopher Kings-Lynne wrote:
> > Two things I regret: one is being unable to see the changes as patches
> > the way you applied them, to get a sense of how the code evolved.
> > Unfortunately the interface to CVS via web does not allow me to see it,
> > or I don't know how to use it. It's not that important, however,
> > because I was already somewhat familiar with the code.
>
> You can do cvs diff commands between versions...

The problem I have is that if multiple files are touched, they have different versions. What I do is find one file that has the change, then do a cvs diff between two dates a minute before and after that single file's commit time. Big hack, but I haven't found a better way.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> The problem I have is that if multiple files are touched, they have
> different versions. What I do is to find one file that has the change,
> then do a cvs diff with two dates separated a minute before and after
> the single file commit time I see. Big hack, but I haven't found a
> better way.

I grep through the actual cvsroot for the commit message on my own projects...

Chris