pgsql: Fix race condition in B-tree page deletion. - Mailing list pgsql-committers

From Heikki Linnakangas
Subject pgsql: Fix race condition in B-tree page deletion.
Date
Msg-id E1WOSpB-0007QM-SA@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
Fix race condition in B-tree page deletion.

In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/efada2b8e920adfdf7418862e939925d2acd1b89

Modified Files
--------------
src/backend/access/nbtree/README      |  107 +++--
src/backend/access/nbtree/nbtpage.c   |  795 ++++++++++++++++++++-------------
src/backend/access/nbtree/nbtree.c    |    2 +-
src/backend/access/nbtree/nbtxlog.c   |  340 +++++++-------
src/backend/access/rmgrdesc/nbtdesc.c |   25 +-
src/include/access/nbtree.h           |   59 ++-
src/include/access/xlog_internal.h    |    2 +-
7 files changed, 793 insertions(+), 537 deletions(-)


pgsql-committers by date:

Previous
From: Tom Lane
Date:
Subject: pgsql: Prevent interrupts while reporting non-ERROR elog messages.
Next
From: Tom Lane
Date:
Subject: pgsql: Add missing #include.