Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Race condition in b-tree page deletion
Date
Msg-id 52808F34.6020000@vmware.com
Whole thread Raw
In response to Re: Race condition in b-tree page deletion  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Race condition in b-tree page deletion  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 10.11.2013 01:47, Robert Haas wrote:
> I think we've tried pretty hard to avoid algorithms where the maximum
> number of lwlocks that must be held at one time is not a constant, and
> I think we're in for a bad time of it if we start to deviate from that
> principal.  I'm not sure what to do about this problem, but I think
> locking N levels of the tree at once, where N can be as large as the
> tree is deep, is probably a bad plan, whether the number of locks
> required is N or 3N.

I think I found a solution that accomplishes that. It's actually not 
that complicated:

Like we currently do, first climb up the tree to check that it's safe to 
delete, ie. the downlink in the first non-empty parent is not the 
rightmost entry. But when we reach the level where the parent is 
non-empty - I'll call that the "topmost" parent - we keep that page 
locked. The leaf page is kept locked while we climb.

This is enough to plug the race condition. As long as we hold a lock on 
the topmost parent containing the downlink to the branch of pages we're 
about to delete, it cannot become the rightmost entry. And as long as we 
hold a lock on the leaf page, no new insertions can happen on any of the 
internal pages in the branch, as insertions to internal pages only 
happen when a child is split. However, the rest of the algorithm needs 
to be slightly modified, as we cannot re-lock the lower-level pages 
until we release the lock on the topmost page, to avoid deadlock.

So at this point, we hold two locks: the leaf page, and the topmost 
parent containing the downlink to the branch we're deleting. Next, we 
remove the downlink from the topmost parent, and mark the leaf page as 
half-dead in one atomic operation. Also, a pointer to the highest 
internal page in the branch we're deleting - the one the removed 
downlink pointed to - is put on the leaf page. We can now release the locks.

At this point, searches and inserts work fine. The leaf page has been 
marked as half-dead, so any insertions to the deleted page's keyspace 
will go to its right sibling. The downlink is to the top of the branch 
is gone, so even if the right sibling is split many times, any keys in 
the transferred keyspace that propagate to the higher levels won't be 
out-of-order.

All that is left is to unlink the all the lingering pages in the branch 
we're deleting from their left and right siblings. This can be done at 
any later time, and if we error out or crash for some reason, next 
vacuum that comes along can finish the job. This is done one level at a 
time. Lock the leaf page, and the internal page the leaf page points to, 
and the internal page's left and right siblings (in the right order, not 
this order). Change the left and right sibling's right- and left-links, 
mark the internal page as deleted, and update the pointer in the leaf 
page to point to the child of the deleted internal page. Then recurse to 
the child, until we reach the leaf level.

This has the nice extra property that we don't need the 
incomplete-action tracking in WAL recovery. I'd like to get rid of that 
anyway.

I'm not sure what to do about stable branches. This could be 
back-patched, with the caveat that this introduces new WAL record types 
so standbys need to be upgraded before the master. But given the lack of 
field reports in the ten years this race condition has existed, I'm not 
sure it's worth the hassle.

- Heikki



pgsql-hackers by date:

Previous
From: Craig Ringer
Date:
Subject: Exempting superuser from row-security isn't enough. Run predicates as DEFINER?
Next
From: "Erik Rijkers"
Date:
Subject: Re: Minmax indexes