Race condition within _bt_findinsertloc()? (new page split code) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Race condition within _bt_findinsertloc()? (new page split code) |
Date | |
Msg-id | CAM3SWZR77w4Fq26H0t=3n4d=142MWgHrH0EL4+1GpTFZb3HXAg@mail.gmail.com Whole thread Raw |
Responses |
Re: Race condition within _bt_findinsertloc()? (new page split code)
|
List | pgsql-hackers |
While speccing out a new B-Tree verification tool, I had the opportunity to revisit a thought I had during review concerning _bt_findinsertloc(): that the new coding is unsafe in the event of deferred split completion of a leaf page of a unique index. To recap, we now do the following in _bt_findinsertloc(): /** If this page was incompletely split, finish the split now. We* do this while holding a lock on the left sibling, whichis not* good because finishing the split could be a fairly lengthy* operation. But this should happen very seldom.*/ if (P_INCOMPLETE_SPLIT(lpageop)) { _bt_finish_split(rel, rbuf, stack); rbuf = InvalidBuffer; continue; } The "left sibling" referred to here is "the first page this key could be on", an important concept for unique index enforcement. It's the first sibling iff we're on our first iteration of the nested for(;;) loop in _bt_findinsertloc(). So the buffer lock held on this left sibling may constitute what in the past I've called a "value lock"; we've established the right to insert our value into the unique index at this point, and the lock will only be released when we're done (regardless of whether or not that buffer/page value lock is on the buffer/page we'll ultimately insert into, or an earlier one). Anyway, the concern is that there may be problems when we call _bt_finish_split() with that left sibling locked thoughout (i.e. finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and itself has a right sibling from the incomplete split (which is usually the value lock page's right-right sibling)). I'm not concerned about performance, since as the comment says it ought to be an infrequent occurrence. I also believe that there are no deadlock hazards. But consider this scenario: * We insert the value 7 into an int4 unique index. We need to split the leaf page. We run out of memory or something, and ours is an incomplete page split. Our transaction is aborted. For the sake of argument, suppose that there are also already a bunch of dead tuples within the index with values of 7, 8 and 9. * Another inserter of the value 7 comes along. It follows exactly the same downlink as the first, now aborted inserter (suppose the downlink's value is 9). It also locks the same leaf page to establish a "value lock" in precisely the same manner. Finding no room on the first page, it looks further right, maintaining its original "value lock" throughout. It finishes the first inserter's split of the non-value-lock page - a new downlink is inserted into the parent page, with the value 8. It then releases all buffer locks except the first one/original "value lock". A physical insertion has yet to occur. * A third inserter of the value comes along. It gets to a later page than the one locked by the second inserter, preferring the newer downlink with value 8 (the internal-page _bt_binsrch() logic ensures this). It exclusive locks that later page/buffer before the second guy gets a chance to lock it once again. It establishes the right to insert with _bt_check_unique(), undeterred by the second inserter's buffer lock/"value lock". The value lock is effectively skipped over. * Both the second and third inserters have "established the right" to insert the same value, 7, and both do so. The unique index has an MVCC-snapshot-wise spurious duplicate, and so is corrupt. Regardless of whether or not I have these details exactly right (that is, regardless of whether or not this scenario is strictly possible) I suggest as a code-hardening measure that _bt_findinsertloc() release its "value lock", upon realizing it must complete splits, and then complete the split or splits known to be required. It would finally report that it "couldn't find" an insertion location to _bt_doinsert(), which would then retry from the start, just like when _bt_check_unique() finds an inconclusive conflict. The only difference is that we don't have an xact to wait on. We haven't actually done anything extra that makes this later "goto top;" any different to the existing one. This should occur so infrequently that it isn't worth trying harder, or worth differentiating between the UNIQUE_CHECK_NO and !UNIQUE_CHECK_NO cases when retrying. This also removes the more general risk of holding an extra buffer lock during page split completion. It kind of looks like _bt_findinsertloc() doesn't have this bug, because in my scenario _bt_finish_split() is called with both the value lock and its right page locked (so the right page is the left page for _bt_finish_split()'s purposes). But when you take another look, and realize that _bt_finish_split() releases its locks, and so once again only the original value lock will be held when _bt_finish_split() returns, and so the downlink is there to skip the value locked page, you realize that the bug does exist (assuming that I haven't failed to consider some third factor and am not otherwise mistaken). Thoughts? -- Peter Geoghegan
pgsql-hackers by date: