Thread: Nasty btree deletion bug
I've been analyzing Ed L's recent report of index corruption:
http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php
(thanks to Ed for letting me have the actual index to study).
I think I've figured out what happened to it.

nbtree/README says

: The notion of a half-dead page means that the key space relationship between
: the half-dead page's level and its parent's level may be a little out of
: whack: key space that appears to belong to the half-dead page's parent on the
: parent level may really belong to its right sibling.  We can tolerate this,
: however, because insertions and deletions on upper tree levels are always
: done by reference to child page numbers, not keys.  The only cost is that
: searches may sometimes descend to the half-dead page and then have to move
: right, rather than going directly to the sibling page.

but unfortunately this analysis is too simplistic.  In the situation where a half-dead page is its parent's rightmost child (which is the only case where we expect half-deadness to persist long), that page's former key space now belongs to its right sibling, which means that some keys that are less than the parent's "high key" now belong to the keyspace of pages below the parent's right sibling.

This is OK as far as search behavior goes --- but suppose that we get a continuing stream of insertions of new keys in that key range.  This will result in page splits that cause keys in that key range to bubble up into the upper levels.  If that keeps happening long enough, eventually we will split the parent's right sibling at a key value less than the parent's high key, and then we will insert that key into the grandparent level just to the right of the parent's right sibling.

Now we have an index that's actually corrupt, because we have out-of-order index keys in the grandparent level, which can cause searches for keys in their range to fail (a search may descend too far to the right to find the entries it should have found).

Since only internal pages can be half-dead, this failure requires at least a three-level index, and it requires enough deletions within a small range for a level-1 page to become empty (hence half-dead) followed by a large number of insertions in that same range.  Ed's index was probably more prone to this than average because he was indexing very wide values (~500 byte text), leading to low btree fanout and a relatively narrow value range for a level-1 page.  Still I'm a bit surprised that we've not figured it out before, because the bug is presumably present all the way back to 7.4 when the btree deletion code was added.

I haven't thought of a suitable fix yet --- clearly we're going to have to change the concept of half-deadness to some extent.  But I have to leave for a dentist appointment, so I figured I'd post what I know.

			regards, tom lane
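To make the corrupt end state concrete, here is a minimal standalone sketch in plain C. The single-character keys and the first_misordered() helper are invented for this illustration and have nothing to do with the real nbtree page format; the sketch only models the ordering invariant that the bubbled-up split keys end up violating at the grandparent level.

    /*
     * Toy model of the grandparent-level corruption: separator keys across an
     * internal page must be nondecreasing left to right, but after the splits
     * described above, the F bound for the half-dead subtree is followed by
     * smaller, bubbled-up D keys.
     */
    #include <stdio.h>

    /* return index of first out-of-order separator, or -1 if the page is sane */
    static int
    first_misordered(const char *seps, int nseps)
    {
        int     i;

        for (i = 1; i < nseps; i++)
            if (seps[i] < seps[i - 1])
                return i;
        return -1;
    }

    int
    main(void)
    {
        const char before[] = {'D', 'F'};           /* sane: F is the rightmost bound */
        const char after[] = {'D', 'F', 'D', 'D'};  /* F's subtree is half-dead; its old
                                                     * keyspace has split back up as D
                                                     * keys to the right of F */

        printf("before: first bad separator at %d\n", first_misordered(before, 2));
        printf("after:  first bad separator at %d\n", first_misordered(after, 4));
        return 0;
    }

The "after" page is exactly the shape that can make a descent land too far to the right, as analyzed later in the thread.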
I wrote:
> I've been analyzing Ed L's recent report of index corruption:
> http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php

After some thought, it still seems to me to be impractical to delete the rightmost child of a btree page while other children remain.  Doing this would require either moving the child's keyspace assignment left to its left sibling, or moving that keyspace right to the first child of the parent's right sibling, and either of these choices means having to adjust page "high keys" in a way that's not certain to work.  (For instance, moving the keyspace left means assigning the victim page's high key to its left sibling, and there might not be room in the left sibling page if the desired key is longer than what's there.  In the other case the same problem of having to replace a key with a potentially longer one exists, but it manifests at the grandparent level.)

So I think the rule needs to be "don't delete the rightmost child unless it's the only child, in which case you can delete the parent too --- but the same restriction must be observed at the next level up".  I believe we can handle this by doing a precheck before starting a delete of a level-zero page: scan up the stack to check that the condition is satisfied for each level that needs to be deleted.

The tricky part is that we don't want to exclusive-lock all those pages at once (we could do it without deadlock risk, but the concurrency hit seems bad).  I think though that we can assume that if the condition is checked it will remain true until we finish the delete:

1. A page that isn't a rightmost child can't become so while we're not looking, because that would require a delete to occur, and only VACUUM does index page deletes, and we already disallow two concurrent VACUUMs on the same index.

2. A page that is an only child could acquire a sibling only if it's split, but that would imply an insert (in fact multiple inserts) into the original level-zero page.  We'll recheck emptiness of the level-zero page after we acquire write lock on it to begin the actual deletion, at which point it's still safe to abandon the delete attempt.

The recent patch to allow ordinary non-vacuum processes to delete index entries retail makes #2 a little trickier than meets the eye.  One could imagine a scenario where, between the times VACUUM leaves the level-zero page and reacquires lock on it, enough entries were inserted to split the page and then they all got deleted again by that patch.  However that patch in its present form cannot leave a page in a completely empty state, because it's only invoked as part of an insertion attempt.  (If it did manage to delete all the existing entries, then the same process would insert a new entry onto the same page before unlocking it.)  So I think it's OK, but we'll need to be wary about any proposals to remove index entries in other contexts.

The concept of a half-dead page would remain, but it'd be a transient state that would normally only persist for a moment between atomic page-delete actions.  If we crash between two such actions, the half-dead page would remain present, but would be found and cleaned up by the next VACUUM.  In the meantime it wouldn't cause any problem, because the keyspace it gives up will belong to a sibling of the same parent at whatever level the delete is ultimately supposed to stop at, and so inserts and even splits in that keyspace won't create an inconsistency.
Alternatively, we could have WAL crash recovery complete the multilevel deletion using the same sort of remember-pending-actions logic we use now to handle splits.

Comments?  Have I missed anything?

			regards, tom lane
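A rough sketch of the precheck described above, over a simplified parent-stack model. The ParentLevel struct and its fields are invented for illustration; the real code would derive this information from the pages themselves (via the BTStack, sibling links, and item counts). It encodes only the rule "a rightmost child may be deleted only if it is also an only child, in which case the same test applies one level up":

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct ParentLevel
    {
        bool    is_rightmost_child; /* is our downlink the last one on the parent? */
        bool    is_only_child;      /* is our downlink the only one on the parent? */
        struct ParentLevel *up;     /* next level toward the root, NULL at the top */
    } ParentLevel;

    /*
     * May we start deleting the (empty) level-zero page whose ancestry this is?
     * Walk upward until we find a level where the deletion can stop cleanly,
     * i.e. a page that is not its parent's rightmost child.
     */
    static bool
    deletion_precheck(const ParentLevel *lvl)
    {
        for (; lvl != NULL; lvl = lvl->up)
        {
            if (!lvl->is_rightmost_child)
                return true;        /* deletion stops at this level: OK */
            if (!lvl->is_only_child)
                return false;       /* rightmost but not only child: don't delete */
            /* rightmost *and* only child: the parent must go too, so loop upward */
        }
        return false;               /* would consume the whole upper tree: refuse */
    }

The locking subtlety Tom raises is that this walk would be done with the upper pages unlocked, relying on points #1 and #2 above to argue that the answer cannot change before the delete finishes.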
Tom Lane wrote:
> I wrote:
>> I've been analyzing Ed L's recent report of index corruption:
>> http://archives.postgresql.org/pgsql-general/2006-10/msg01183.php

Ouch.  That's nasty indeed.

> So I think the rule needs to be "don't delete the rightmost child unless
> it's the only child, in which case you can delete the parent too --- but
> the same restriction must be observed at the next level up".
> ....
> The concept of a half-dead page would remain, but it'd be a transient
> state that would normally only persist for a moment between atomic
> page-delete actions.  If we crash between two such actions, the
> half-dead page would remain present, but would be found and cleaned up
> by the next VACUUM.  In the meantime it wouldn't cause any problem
> because the keyspace it gives up will belong to a sibling of the same
> parent at whatever level the delete is ultimately supposed to stop at,
> and so inserts and even splits in that keyspace won't create an
> inconsistency.

I don't understand how this "in the meantime" thing works.  I tried to work out a step-by-step example, could you take a look at it?  See http://users.tkk.fi/~hlinnaka/pgsql/btree-deletion-bug/

> ...
>
> Comments?  Have I missed anything?

It took me a lot of time with pen and paper to understand the issue, and I'm not sure I've fully understood it even now.  The logic is very complex, which is bad for maintainability in itself :(.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I don't understand how this "in the meantime" thing works.  I tried to
> work out a step-by-step example, could you take a look at it?  See
> http://users.tkk.fi/~hlinnaka/pgsql/btree-deletion-bug/

[ looks at that for a bit... ]  Yeah, you're right.  Once the deletion is completed, the F lower-bound key will disappear from the grandparent, which would restore consistency --- but we could have already delivered wrong search answers, so that won't do.

In theory, given a slow-enough-moving VACUUM process, this could happen even without a crash.  So I think that means we have to go over to the other plan of locking everything all the way up to the top of the deletion before we start doing it --- and also, we'll need crash recovery logic to complete the incomplete deletion, same as for incomplete inserts.  Yech --- more code than I was hoping to have to write, and more concurrency hit too.

OTOH we might be able to get rid of the notion of "half dead", which would allow some simplifications.

Thanks for taking a close look!

			regards, tom lane
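For reference, a sketch of what "crash recovery logic to complete the incomplete deletion" could look like, patterned on the way incomplete splits are remembered during WAL replay. Every name and type here is a hypothetical stand-in (BlockNumber is redefined locally), not the actual btree rmgr code:

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int BlockNumber;   /* local stand-in for the PostgreSQL typedef */

    typedef struct IncompleteDeletion
    {
        BlockNumber child;              /* page whose deletion has started */
        BlockNumber parent;             /* parent still holding its downlink */
        struct IncompleteDeletion *next;
    } IncompleteDeletion;

    static IncompleteDeletion *pending = NULL;

    /* redo of "child page marked deleted": the parent-level step is still owed */
    static void
    remember_incomplete_deletion(BlockNumber child, BlockNumber parent)
    {
        IncompleteDeletion *d = malloc(sizeof(IncompleteDeletion));

        d->child = child;
        d->parent = parent;
        d->next = pending;
        pending = d;
    }

    /* redo of "downlink removed from parent": that deletion finished after all */
    static void
    forget_matching_deletion(BlockNumber child)
    {
        IncompleteDeletion **p;

        for (p = &pending; *p != NULL; p = &(*p)->next)
        {
            if ((*p)->child == child)
            {
                IncompleteDeletion *d = *p;

                *p = d->next;
                free(d);
                return;
            }
        }
    }

    /* at end of recovery: finish whatever never got its parent-level record */
    static void
    cleanup_incomplete_deletions(void)
    {
        IncompleteDeletion *d;

        for (d = pending; d != NULL; d = d->next)
            printf("would finish deleting %u by removing its downlink from %u\n",
                   d->child, d->parent);
    }

This is only the bookkeeping pattern; the actual "finish" step would have to perform the remaining page-delete actions under the usual locking rules.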
Tom Lane wrote:
> In theory, given a slow-enough-moving VACUUM process, this could happen
> even without a crash.  So I think that means we have to go over to the
> other plan of locking everything all the way up to the top of the
> deletion before we start doing it --- and also, we'll need crash
> recovery logic to complete the incomplete deletion, same as for
> incomplete inserts.  Yech --- more code than I was hoping to have to
> write, and more concurrency hit too.

ISTM we don't necessarily need the "complete the incomplete deletion" logic.  Since we're holding all the pages up to the parent of the highest deleted page locked, we can atomically issue one big WAL record covering all the modifications.  We can't do that with page splits, because we want to release the lock on the split pages before we move up to the parent to insert the child pointer.  (Whether or not it's worth it for page splits either is another issue.)

The concurrency hit probably isn't that bad.  There shouldn't be much activity on empty pages.

What does the original research paper by Lanin & Shasha say about this?  I don't have it at hand.  I do have the Lehman & Yao paper, but that one doesn't discuss removal of non-leaf pages at all.

> OTOH we might be able to get rid of the notion of "half dead", which
> would allow some simplifications.

Yep.

> Thanks for taking a close look!

No problem, I've been staring at the b-tree code lately anyway.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
I wrote:
> [ looks at that for a bit... ]  Yeah, you're right.  Once the deletion
> is completed, the F lower-bound key will disappear from the grandparent,
> which would restore consistency --- but we could have already delivered
> wrong search answers, so that won't do.

On further reflection, I think I understand why we've not realized the existence of this bug before: in fact, it *doesn't* lead to wrong search answers.  I think the only visible consequence is exactly the "failed to re-find parent key" VACUUM error that Ed saw.  The reason is that the key misordering in the grandparent level is nearly harmless.  Using your example of

	-  F  D  D

* if we happen to come across the F key first during a binary search of the grandparent page, and we are looking for something <= F, we will descend to its left, which is at worst a little bit inefficient: _bt_moveright will still ensure that we find what we seek.

* if we happen to visit one of the D key(s) first, and we are looking for something > D, we will descend to the right of that key.  Well, that's not incorrect for the live data.  In fact, the *only* key in the tree that we will fail to find this way is the F bounding key for the half-dead page itself (or one of its also-deletable parents).

So that's exactly why VACUUM can fail while trying to clean up the half-dead page, and it's why we're not seeing reports of wrong query answers.

So that reduces the priority of the bug quite a lot in my estimation, and makes me not want to incur a lot of additional code and locking to fix it.  I'm wondering whether we can simply adopt a modified strategy for searching for a half-dead page's parent during _bt_pagedel.

			regards, tom lane
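To see the "nearly harmless" claim in action, here is a standalone toy (not nbtree code) that descends through the misordered page by binary-searching for the last separator <= the search key, roughly how a real descent chooses a downlink. The page mirrors the "-  F  D  D" example above; which downlink a given probe sequence lands on depends on the binary search details, so this shows one concrete possibility:

    /*
     * Toy descent through a misordered internal page.  Separators are a
     * minus-infinity item, the F bound, then the bubbled-up D keys.
     */
    #include <stdio.h>

    #define NKEYS 4
    #define MINUS_INF '\0'          /* stands in for the leftmost "minus infinity" item */

    static const char seps[NKEYS] = {MINUS_INF, 'F', 'D', 'D'};

    /* index of the last separator <= key, i.e. which downlink we follow */
    static int
    choose_child(char key)
    {
        int     lo = 0,
                hi = NKEYS;

        while (lo < hi)
        {
            int     mid = (lo + hi) / 2;

            if (seps[mid] <= key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo - 1;
    }

    int
    main(void)
    {
        /* ordinary live keys land on children that still cover their keyspace ... */
        printf("C -> downlink %d\n", choose_child('C'));    /* 0 */
        printf("E -> downlink %d\n", choose_child('E'));    /* 3: not wrong, that keyspace moved right */
        printf("G -> downlink %d\n", choose_child('G'));    /* 3 */

        /*
         * ... but a search for the F bound itself -- essentially what VACUUM's
         * hunt for the half-dead page's parent amounts to -- is carried to
         * downlink 3, past the F item at position 1 that it needed to find.
         */
        printf("F -> downlink %d\n", choose_child('F'));    /* 3, not 1 */
        return 0;
    }

With a different page size or key the binary search could just as well probe F first and descend to its left, which is the merely-inefficient case handled by _bt_moveright.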
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > What does the original research paper by Lanin & Shasha say about this? Nothing very useful. The connection of our code to that paper is actually a bit tenuous: their approach to deletion is to make the target page's key space move left not right. That doesn't work real nicely in my opinion (for one thing, you have to replace the left sibling's high key, which is problematic for variable-size keys). regards, tom lane
Tom Lane wrote:
> On further reflection, I think I understand why we've not realized the
> existence of this bug before: in fact, it *doesn't* lead to wrong search
> answers.  I think the only visible consequence is exactly the "failed to
> re-find parent key" VACUUM error that Ed saw.  The reason is that the
> key misordering in the grandparent level is nearly harmless.  Using your
> example of

Yep.  It's pretty harmless.

But now that I look at the original post by Ed, I don't see how the "failed to re-find parent key" error could result from the issue we've been talking about.  The error message is printed when _bt_getstackbuf is unable to re-find an item in the parent of a deleted page, but _bt_getstackbuf doesn't look at or compare the keys at all.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > But now that I look at the original post by Ed, I don't see how the > "failed to re-find parent key" error could result from the issue we've > been talking about. The error message is printed when _bt_getstackbuf is > unable to re-find an item in the parent of a deleted page, but > _bt_getstackbuf doesn't look at or compare the keys at all. Right, but _bt_getstackbuf is working from a search stack created by a standard search for the victim page's high key. If that search descended through a page to the right of the victim page's actual parent, _bt_getstackbuf isn't able to recover. regards, tom lane
I wrote:
> Right, but _bt_getstackbuf is working from a search stack created by
> a standard search for the victim page's high key.  If that search
> descended through a page to the right of the victim page's actual
> parent, _bt_getstackbuf isn't able to recover.

What I'm tempted to do, at least in the back branches, is simply adjust _bt_pagedel to be able to recover from _bt_getstackbuf failure in this scenario.  It could use the same method that _bt_insert_parent does in the concurrent-root-split case, ie (untested):

 	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
 	if (pbuf == InvalidBuffer)
+	{
+		/* Find the leftmost page at the next level up */
+		pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false);
+		stack->bts_blkno = BufferGetBlockNumber(pbuf);
+		stack->bts_offset = InvalidOffsetNumber;
+		_bt_relbuf(rel, pbuf);
+		/* and repeat search from there */
+		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
+		if (pbuf == InvalidBuffer)
 			elog(ERROR, "failed to re-find parent key in \"%s\"",
 				 RelationGetRelationName(rel));
+	}
 	parent = stack->bts_blkno;
 	poffset = stack->bts_offset;

The question is whether we want a cleaner answer for future development, and if so what that answer ought to look like.  It seems like the alternatives we've been discussing may not end up any simpler/shorter than the current code, and it seems hard to justify giving up some concurrency in the name of a simplification that doesn't simplify much.

Thoughts?

			regards, tom lane