Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Race condition in b-tree page deletion
Date
Msg-id 1390074323.55520.YahooMailNeo@web122304.mail.ne1.yahoo.com
Whole thread Raw
In response to Re: Race condition in b-tree page deletion  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Race condition in b-tree page deletion
Re: Race condition in b-tree page deletion
List pgsql-hackers
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Here's a patch implementing this scheme.

This patch fixes a race condition bug in btree entry deletion.  The
bug seems to rarely be encountered in practice; or at least it is
generally not recognized as the cause of index corruption when
that occurs.

The basic approach is sound.  The papers on which we based our
btree implementation did not discuss entry deletion, and our
current approach introduces new possible states into the on-disk
index structure which are not covered by the papers and are not
always handled correctly by the current code.  Specifically, entry
inserts can add new pages at the lower levels which were not (yet)
in any higher-level pages, but right-sibling pointers allow allow
the correct point in the index to be found.  Our existing deletion
approach turned this on its head by having missing pages at the
bottom, which introduces whole new classes of problems which
otherwise don't exist.  It is in handling these index states which
are not addressed in the academic papers that we have a race
condition.  This patch makes interim states in the deletion process
look almost exactly the same as interim states in the insertion
process, thus dodging the need to handle new states.

The approach to limiting the number of locks to a finite (and
small) number is clever, but seems quite sound.  The locks are
always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the
branch being considered for deletion reaches all the way to the
root.  Although an astute reader will notice that since root page
will always be the only page at its level, it must always be the
rightmost page at its level, and therefore is correctly handled by
the logic dealing with that, I think it merits an explicit mention
in the README.

The general feeling is that this patch is not suitable for
back-patching.  I agree with this, but this is a bug which could
lead to index corruption which exists in all supported branches.
If nobody can suggest an alternative which we can back-patch, we
might want to reconsider this after this has been in production
releases for several months.

The patch still applies with a few offsets.  It compiles with these
warnings (which were also reported by Peter Eisentraut and should
be fixed):

nbtpage.c: In function ‘_bt_pagedel’:
nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ‘metad’ was declared here
nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1413:8: note: ‘metapg’ was declared here

I'm not a fan of the new retry label introduced in _bt_pagedel()
and the goto statement which references it.  I think a loop using
while() or for() would be cleaner.  The whole idea of describing
this logic recursively and then making a big deal about using tail
recursion as an optimization seems pointless and confusing.  It
could just as easily be described as an iterative approach and save
the extra logical step in explaining the code.  I think that would
be clearer and easier to understand.  It would also eliminate the
last parameter, which is always passed as NULL and only set to
something else internally.  I think it would be cleaner to make
that a local variable initialized to NULL as part of eliminating
the fiction of recursion here.

There is a point in this loop where we return 0, but I think that
is wrong; I think we should break so that the pages already deleted
will be counted.  On a related note. the caller doesn't actually
use the count, but assumes that any non-zero value should be
treated as 1.

Similar issues with faux recursion existing the calling function,
btvacuumpage().  In that case the unnecessary parameter which the
caller must get right is orig_blkno, which must be the same as
blkno on an actual call.  That could be a local variable
initialized to blkno within the function.  I think this would be
better described as iteration directly rather than claiming
recursions and whining about how the compiler is too dumb to turn
it into iteration for us.  It's not the job of this patch to fix
that in the caller, but let's not emulate it.

Other than that, I have not found any problems.

Since this is fixing a bug which is a hard-to-hit race condition,
it is hard to test.  Pretty much you need to get into a debugger
and set breakpoints to cause the right timing to be sure of hitting
it.  I'm still playing with that, but I figured I should post these
notes from reviewing the source code while I continue with that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Jeremy Harris
Date:
Subject: Re: PoC: Partial sort
Next
From: Marti Raudsepp
Date:
Subject: Re: PoC: Partial sort