Re: Failure while inserting parent tuple to B-tree is not fun - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Failure while inserting parent tuple to B-tree is not fun |
Date | |
Msg-id | CAM3SWZR-Btfvp6zAKn_BT88TC_9BxMTF0wWshjtq4r3C==W+hA@mail.gmail.com Whole thread Raw |
In response to | Re: Failure while inserting parent tuple to B-tree is not fun (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Failure while inserting parent tuple to B-tree is not
fun
Re: Failure while inserting parent tuple to B-tree is not fun Re: Failure while inserting parent tuple to B-tree is not fun |
List | pgsql-hackers |
On Thu, Nov 14, 2013 at 9:23 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > Ok, here's a new version of the patch to handle incomplete B-tree splits. I finally got around to taking a look at this. Unlike with the as-yet uncommitted "Race condition in b-tree page deletion" patch that Kevin looked at, which there is a dependency on here, I could not really consider your patch in isolation. I'm really reviewing both patches, but with a particular focus on this recent set of additions (btree-incomplete-splits-2.patch), and the issues addressed by it. However, since the distinction between this patch and the patch that it has a dependency on is fuzzy, expect me to be fuzzy in differentiating the two. This patch is only very loosely an atomic unit. This is not a criticism - I understand that it just turned out that way. The first thing I noticed about this patchset is that it completely expunges btree_xlog_startup(), btree_xlog_cleanup() and btree_safe_restartpoint(). The post-recovery cleanup that previously occurred to address both sets of problems (the problem addressed by this patch, incomplete page splits, and the problem addressed by the dependency patch, incomplete page deletions) now both occur opportunistically/lazily. Notably, this means that there are now exactly zero entries in the resource manager list with a 'restartpoint' callback. I guess we should consider removing it entirely for that reason. As you said in the commit message where gin_safe_restartpoint was similarly retired (commit 631118fe): """ The post-recovery cleanup mechanism was never totally reliable, as insertion to the parent could fail e.g because of running out of memory or disk space, leaving the tree in an inconsistent state. """ I'm of the general impression that post-recovery cleanup is questionable. I'm surprised that you didn't mention this commit previously. You just mentioned the original analogous work on GiST, but this certainly seems to be something you've been thinking about *systematically* eliminating for a while. So while post-recovery callbacks no longer exist for any rmgr-managed-resource, 100% of remaining startup and cleanup callbacks concern the simple management of memory of AM-specific recovery contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't a better abstraction than that, such as a generic recovery memory context, allowing you to retire all 3 callbacks. I mean, StartupXLOG() just calls those callbacks for each resource at exactly the same time anyway, just as it initializes resource managers in precisely the same manner earlier on. Plus if you look at what those AM-local memory management routines do, it all seems very simple. I think you should bump XLOG_PAGE_MAGIC. Perhaps you should increase the elevel here: + elog(DEBUG1, "finishing incomplete split of %u/%u", + BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf)); Since this is a very rare occurrence that involves some subtle interactions, I can't think why you wouldn't want to LOG this for forensic purposes. Why did you remove the local linkage of _bt_walk_left(), given that it is called in exactly the same way as before? I guess that that is just a vestige of some earlier revision. I think I see some bugs in _bt_moveright(). If you examine _bt_finish_split() in detail, you'll see that it doesn't just drop the write buffer lock that the caller will have provided (per its comments) - it also drops the buffer pin. It calls _bt_insert_parent() last, which was previously only called last thing by _bt_insertonpg() (some of the time), and _bt_insertonpg() is indeed documented to drop both the lock and the pin. And if you look at _bt_moveright() again, you'll see that there is a tacit assumption that the buffer pin isn't dropped, or at least that "opaque" (the BTPageOpaque BT special page area pointer) continues to point to the same page held in the same buffer, even though the code doesn't set the "opaque' pointer again and it may not point to a pinned buffer or even the appropriate buffer. Ditto "page". So "opaque" may point to the wrong thing on subsequent iterations - you don't do anything with the return value of _bt_getbuf(), unlike all of the existing call sites. I believe you should store its return value, and get the held page and the special pointer on the page, and assign that to the "opaque" pointer before the next iteration (an iteration in which you very probably really do make progress not limited to completing a page split, the occurrence of which the _bt_moveright() loop gets "stuck on"). You know, do what is done in the non-split-handling case. It may not always be the same buffer as before, obviously. For a moment I thought that you might have accounted for that here: >*************** _bt_insert_parent(Relation rel, >*** 1685,1696 **** > * 05/27/97 > */ > ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY); >- > pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); > >! /* Now we can unlock the children */ > _bt_relbuf(rel, rbuf); >- _bt_relbuf(rel, buf); > > /* Check for error only after writing children */ > if (pbuf == InvalidBuffer) >--- 1767,1779 ---- > * 05/27/97 > */ > ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY); > pbuf = _bt_getstackbuf(rel, stack, BT_WRITE); > >! /* >! * Now we can unlock the right child. The left child will be unlocked >! * by _bt_insertonpg(). >! */ > _bt_relbuf(rel, rbuf); > > /* Check for error only after writing children */ > if (pbuf == InvalidBuffer) But this is unrelated - the buffer + pin are still dropped by a later _bt_insertonpg(), as it says right there. Actually, to see that there is a bug in _bt_moveright() you don't really have to look much further than _bt_moveright(). So I apologize for taking you on that unnecessary detour, but FWIW that was my thought process, and I like to document that for my own benefit. :-) Do you suppose that there are similar problems in other direct callers of _bt_finish_split()? I'm thinking in particular of _bt_findinsertloc(). I'm also not sure about the lock escalation that may occur within _bt_moveright() for callers with BT_READ access - there is nothing to prevent a caller from ending up being left with a write lock where before they only had a read lock if they find that !P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and conclude that there is no split finishing to be done after all. It looks like all callers currently won't care about that, but it seems like that should either be prominently documented, or just not occur. These interactions are complex enough that we ought to be very explicit about where pins are required and dropped, or locks held before or after calling. I suggest you consider refactoring the loop in _bt_moveright() to reflect these subtleties. I think the structure of that routine could use some polishing. I'm not sure that I like the way that that and similar loops get "stuck on" page splits, although no better design occurs to me right now. That's all I have for now. I've written plenty of notes, and will work back through other points of possible concern. I don't suppose you have any testing infrastructure that you could publish? -- Peter Geoghegan
pgsql-hackers by date: