Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers
From: Kevin Grittner
Subject: Re: Race condition in b-tree page deletion
Msg-id: 1390074323.55520.YahooMailNeo@web122304.mail.ne1.yahoo.com
In response to: Re: Race condition in b-tree page deletion (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Race condition in b-tree page deletion; Re: Race condition in b-tree page deletion
List: pgsql-hackers
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> Here's a patch implementing this scheme.

This patch fixes a race condition bug in btree entry deletion. The bug seems to be rarely encountered in practice; or at least it is generally not recognized as the cause of index corruption when that occurs.

The basic approach is sound. The papers on which we based our btree implementation did not discuss entry deletion, and our current approach introduces new possible states into the on-disk index structure which are not covered by the papers and are not always handled correctly by the current code. Specifically, entry inserts can add new pages at the lower levels which are not (yet) in any higher-level pages, but right-sibling pointers allow the correct point in the index to be found. Our existing deletion approach turned this on its head by having missing pages at the bottom, which introduces whole new classes of problems which otherwise don't exist. It is in handling these index states, which are not addressed in the academic papers, that we have a race condition. This patch makes interim states in the deletion process look almost exactly the same as interim states in the insertion process, thus dodging the need to handle new states.

The approach to limiting the number of locks to a finite (and small) number is clever, but seems quite sound. The locks are always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the branch being considered for deletion reaches all the way to the root. Although an astute reader will notice that since the root page will always be the only page at its level, it must always be the rightmost page at its level, and is therefore handled correctly by the logic dealing with that, I think it merits an explicit mention in the README.

> The general feeling is that this patch is not suitable for back-patching.
I agree with this, but this is a bug which could lead to index corruption and which exists in all supported branches. If nobody can suggest an alternative which we can back-patch, we might want to reconsider this after it has been in production releases for several months.

The patch still applies with a few offsets. It compiles with these warnings (which were also reported by Peter Eisentraut and should be fixed):

nbtpage.c: In function ‘_bt_pagedel’:
nbtpage.c:1695:24: warning: ‘metad’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ‘metad’ was declared here
nbtpage.c:1730:4: warning: ‘metapg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1413:8: note: ‘metapg’ was declared here

I'm not a fan of the new retry label introduced in _bt_pagedel() and the goto statement which references it. I think a loop using while() or for() would be cleaner. The whole idea of describing this logic recursively and then making a big deal about using tail recursion as an optimization seems pointless and confusing. It could just as easily be described as an iterative approach, saving the extra logical step in explaining the code. I think that would be clearer and easier to understand. It would also eliminate the last parameter, which is always passed as NULL and only set to something else internally. I think it would be cleaner to make that a local variable initialized to NULL as part of eliminating the fiction of recursion here.

There is a point in this loop where we return 0, but I think that is wrong; I think we should break instead, so that the pages already deleted will be counted. On a related note, the caller doesn't actually use the count, but assumes that any non-zero value should be treated as 1.

Similar issues with faux recursion exist in the calling function, btvacuumpage(). In that case the unnecessary parameter which the caller must get right is orig_blkno, which must be the same as blkno on an actual call.
That could be a local variable initialized to blkno within the function. I think this would be better described directly as iteration, rather than claiming recursion and whining about how the compiler is too dumb to turn it into iteration for us. It's not the job of this patch to fix that in the caller, but let's not emulate it.

Other than that, I have not found any problems. Since this is fixing a hard-to-hit race condition, it is hard to test: pretty much you need to get into a debugger and set breakpoints to cause the right timing to be sure of hitting it. I'm still playing with that, but I figured I should post these notes from reviewing the source code while I continue with it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company