Re: Race condition within _bt_findinsertloc()? (new page split code) - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Race condition within _bt_findinsertloc()? (new page split code) |
Date | |
Msg-id | 53847CD8.6080401@vmware.com Whole thread Raw |
In response to | Race condition within _bt_findinsertloc()? (new page split code) (Peter Geoghegan <pg@heroku.com>) |
Responses |
Re: Race condition within _bt_findinsertloc()? (new page split code)
|
List | pgsql-hackers |
On 05/27/2014 09:17 AM, Peter Geoghegan wrote: > While speccing out a new B-Tree verification tool, I had the > opportunity to revisit a thought I had during review concerning > _bt_findinsertloc(): that the new coding is unsafe in the event of > deferred split completion of a leaf page of a unique index. To recap, > we now do the following in _bt_findinsertloc(): > > /* > * If this page was incompletely split, finish the split now. We > * do this while holding a lock on the left sibling, which is not > * good because finishing the split could be a fairly lengthy > * operation. But this should happen very seldom. > */ > if (P_INCOMPLETE_SPLIT(lpageop)) > { > _bt_finish_split(rel, rbuf, stack); > rbuf = InvalidBuffer; > continue; > } > > The "left sibling" referred to here is "the first page this key could > be on", an important concept for unique index enforcement. No, it's not "the first page this key could be on". _bt_findinsertloc() does *not* hold a lock on the first valid page the key could go to. It merely ensures that when it steps to the next page, it releases the lock on the previous page only after acquiring the lock on the next page. Throughout the operation, it will hold a lock on *some* page that could legally hold the inserted value, and it acquires the locks in left-to-right order. This is sufficient for the uniqueness checks, because _bt_unique_check() scans all the pages, and _bt_unique_check() *does* hold the first page locked while it scans the rest of the pages. But _bt_findinsertlock() does not. Also note that _bt_moveright() also finishes any incomplete splits it encounters (when called for an insertion). So _bt_findinsertloc() never gets called on a page with the incomplete-split flag set. It might encounter one when it moves right, but never the first page. > Anyway, the concern is that there may be problems when we call > _bt_finish_split() with that left sibling locked thoughout (i.e. > finish a split where the right sibling is BTP_INCOMPLETE_SPLIT, and > itself has a right sibling from the incomplete split (which is usually > the value lock page's right-right sibling)). I'm not concerned about > performance, since as the comment says it ought to be an infrequent > occurrence. I also believe that there are no deadlock hazards. But > consider this scenario: > > * We insert the value 7 into an int4 unique index. We need to split > the leaf page. We run out of memory or something, and ours is an > incomplete page split. Our transaction is aborted. For the sake of > argument, suppose that there are also already a bunch of dead tuples > within the index with values of 7, 8 and 9. If I understood correctly, the tree looks like this before the insertion: Parent page: +-------------+ | | | 9 -> A | +-------------+ Leaf A +-------------+ | HI-key: 9 | | | | 7 8 9 | +-------------+ And after insertion and incomplete split: Parent page +-------------+ | | | 9 -> A | +-------------+ Leaf A Leaf B +--------------+ +-------------+ | HI-key: 8 | | HI-key: 9 | | (INCOMPLETE_ | | | | SPLIT) | <-> | | | | | | | 7 7* 8 | | 9 | +--------------+ +-------------+ where 7* is the newly inserted key, with value 7. (you didn't mention at what point the split happens, but in the next paragraph you said the new downlink has value 8, which implies the above split) > * Another inserter of the value 7 comes along. It follows exactly the > same downlink as the first, now aborted inserter (suppose the > downlink's value is 9). It also locks the same leaf page to establish > a "value lock" in precisely the same manner. Finding no room on the > first page, it looks further right, maintaining its original "value > lock" throughout. It finishes the first inserter's split of the > non-value-lock page - a new downlink is inserted into the parent page, > with the value 8. It then releases all buffer locks except the first > one/original "value lock". A physical insertion has yet to occur. Hmm, I think you got confused at this step. When inserting a 7, you cannot "look further right" to find a page with more space, because the HI-key, 8, on the first page stipulates that 7 must go on that page, not some later page. > * A third inserter of the value comes along. It gets to a later page > than the one locked by the second inserter, preferring the newer > downlink with value 8 (the internal-page _bt_binsrch() logic ensures > this). That's a contradiction: the downlink with value 8 points to the first page, not some later page. After the split is finished, the tree looks like this: Parent page +-------------+ | 8 -> A | | 9 -> B | +-------------+ Leaf A Leaf B +-------------+ +-------------+ | HI-key: 8 | | HI-key: 9 | | | <-> | | | 7 7* 8 | | 9 | +-------------+ +-------------+ > Regardless of whether or not I have these details exactly right (that > is, regardless of whether or not this scenario is strictly possible) I > suggest as a code-hardening measure that _bt_findinsertloc() release > its "value lock", upon realizing it must complete splits, and then > complete the split or splits known to be required. It would finally > report that it "couldn't find" an insertion location to > _bt_doinsert(), which would then retry from the start, just like when > _bt_check_unique() finds an inconclusive conflict. The only difference > is that we don't have an xact to wait on. We haven't actually done > anything extra that makes this later "goto top;" any different to the > existing one. > > This should occur so infrequently that it isn't worth trying harder, > or worth differentiating between the UNIQUE_CHECK_NO and > !UNIQUE_CHECK_NO cases when retrying. This also removes the more > general risk of holding an extra buffer lock during page split > completion. Yeah, that would work too. It seems safe enough as it is, though, so I don't see the point. > It kind of looks like _bt_findinsertloc() doesn't have this bug, > because in my scenario _bt_finish_split() is called with both the > value lock and its right page locked (so the right page is the left > page for _bt_finish_split()'s purposes). But when you take another > look, and realize that _bt_finish_split() releases its locks, and so > once again only the original value lock will be held when > _bt_finish_split() returns, and so the downlink is there to skip the > value locked page, you realize that the bug does exist (assuming that > I haven't failed to consider some third factor and am not otherwise > mistaken). When inserting, the scan for the right insert location always begins from the first page where the key can legally go to. Inserting a missing downlink doesn't change what that page is - it just makes it faster to find, by reducing the number of right-links you need to follow. PS. Thanks for looking into this again! These B-tree changes really need thorough review. - Heikki
pgsql-hackers by date: