Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Race condition in b-tree page deletion |
Date | |
Msg-id | 52808F34.6020000@vmware.com Whole thread Raw |
In response to | Re: Race condition in b-tree page deletion (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Race condition in b-tree page deletion
|
List | pgsql-hackers |
On 10.11.2013 01:47, Robert Haas wrote: > I think we've tried pretty hard to avoid algorithms where the maximum > number of lwlocks that must be held at one time is not a constant, and > I think we're in for a bad time of it if we start to deviate from that > principal. I'm not sure what to do about this problem, but I think > locking N levels of the tree at once, where N can be as large as the > tree is deep, is probably a bad plan, whether the number of locks > required is N or 3N. I think I found a solution that accomplishes that. It's actually not that complicated: Like we currently do, first climb up the tree to check that it's safe to delete, ie. the downlink in the first non-empty parent is not the rightmost entry. But when we reach the level where the parent is non-empty - I'll call that the "topmost" parent - we keep that page locked. The leaf page is kept locked while we climb. This is enough to plug the race condition. As long as we hold a lock on the topmost parent containing the downlink to the branch of pages we're about to delete, it cannot become the rightmost entry. And as long as we hold a lock on the leaf page, no new insertions can happen on any of the internal pages in the branch, as insertions to internal pages only happen when a child is split. However, the rest of the algorithm needs to be slightly modified, as we cannot re-lock the lower-level pages until we release the lock on the topmost page, to avoid deadlock. So at this point, we hold two locks: the leaf page, and the topmost parent containing the downlink to the branch we're deleting. Next, we remove the downlink from the topmost parent, and mark the leaf page as half-dead in one atomic operation. Also, a pointer to the highest internal page in the branch we're deleting - the one the removed downlink pointed to - is put on the leaf page. We can now release the locks. At this point, searches and inserts work fine. The leaf page has been marked as half-dead, so any insertions to the deleted page's keyspace will go to its right sibling. The downlink is to the top of the branch is gone, so even if the right sibling is split many times, any keys in the transferred keyspace that propagate to the higher levels won't be out-of-order. All that is left is to unlink the all the lingering pages in the branch we're deleting from their left and right siblings. This can be done at any later time, and if we error out or crash for some reason, next vacuum that comes along can finish the job. This is done one level at a time. Lock the leaf page, and the internal page the leaf page points to, and the internal page's left and right siblings (in the right order, not this order). Change the left and right sibling's right- and left-links, mark the internal page as deleted, and update the pointer in the leaf page to point to the child of the deleted internal page. Then recurse to the child, until we reach the leaf level. This has the nice extra property that we don't need the incomplete-action tracking in WAL recovery. I'd like to get rid of that anyway. I'm not sure what to do about stable branches. This could be back-patched, with the caveat that this introduces new WAL record types so standbys need to be upgraded before the master. But given the lack of field reports in the ten years this race condition has existed, I'm not sure it's worth the hassle. - Heikki
pgsql-hackers by date: