Thread: Bugs in b-tree dead page removal
Whilst looking around for stuff that could be deleted thanks to removing old-style VACUUM FULL, I came across some code in btree that seems rather seriously buggy. For reasons explained in nbtree/README, we can't physically recycle a "deleted" btree index page until all transactions open at the time of deletion are gone --- otherwise we might re-use a page that an existing scan is about to land on, and confuse that scan. (This condition is overly strong, of course, but it's what's designed in at the moment.) The way this is implemented is to label a freshly-deleted page with the current value of ReadNewTransactionId(). Once that value is older than RecentXmin, the page is presumed recyclable.

I think this was all right when it was designed, but isn't it rather badly broken by our subsequent changes to have transactions not take out an XID until/unless they write something? A read-only transaction could easily be much older than RecentXmin, no?

The odds of an actual problem seem not very high, since to be affected a scan would have to be already "in flight" to the problem page when the deletion occurs. By the time RecentXmin advances and we feed the page to the FSM and get it back, the scan's almost surely going to have arrived. And I think the logic is such that this would not happen before the next VACUUM in any case. Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby slave --- it doesn't look like Simon put any interlocking in this area to protect slave queries against having the page disappear from under them. The odds of an actual problem are probably a good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse in certain improbable cases, but I think this is flat out wrong when doing WAL replay --- if the original process did recurse then it will have emitted a WAL record for each deleted page, meaning replay would try to delete twice.

That last problem is easy to fix, but I'm not at all sure what to do about the scan interlock problem. Thoughts?

regards, tom lane
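(For reference: the test in question is _bt_page_recyclable() in src/backend/access/nbtree/nbtpage.c. A minimal sketch of its logic as of this era follows, assuming the btpo.xact field name and the RecentXmin declaration in utils/snapmgr.h; exact header locations and the precedes-vs-precedes-or-equals detail vary across versions.)

#include "postgres.h"

#include "access/nbtree.h"
#include "access/transam.h"
#include "storage/bufpage.h"
#include "utils/snapmgr.h"		/* RecentXmin */

/*
 * Sketch of the recyclability test described above: a deleted page is
 * considered safe to recycle once the XID stamped on it at deletion time
 * (the then-current ReadNewTransactionId() value) is older than RecentXmin.
 */
bool
_bt_page_recyclable(Page page)
{
	BTPageOpaque opaque;

	/* An all-zeroes page (e.g. left behind by a crashed relation
	 * extension) can always be reclaimed. */
	if (PageIsNew(page))
		return true;

	/*
	 * Otherwise recycle only if the page is marked deleted and its
	 * deletion XID precedes RecentXmin.  Whether RecentXmin adequately
	 * accounts for long-running read-only transactions is exactly the
	 * point at issue in this thread.
	 */
	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	if (P_ISDELETED(opaque) &&
		TransactionIdPrecedes(opaque->btpo.xact, RecentXmin))
		return true;
	return false;
}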
I wrote:
> And there's another problem: _bt_pagedel is designed to recurse
> in certain improbable cases, but I think this is flat out wrong
> when doing WAL replay --- if the original process did recurse
> then it will have emitted a WAL record for each deleted page,
> meaning replay would try to delete twice.

No, scratch that, I misread it: _bt_pagedel isn't invoked during WAL replay, but for cleanup of incomplete deletions at termination of WAL replay. So any recursing it has to do also corresponds to actions that weren't in WAL. So that's OK.

I'm still concerned about the interlock against read-only transactions though.

regards, tom lane
Tom Lane wrote:
> Whilst looking around for stuff that could be deleted thanks to removing
> old-style VACUUM FULL, I came across some code in btree that seems
> rather seriously buggy. For reasons explained in nbtree/README, we
> can't physically recycle a "deleted" btree index page until all
> transactions open at the time of deletion are gone --- otherwise we
> might re-use a page that an existing scan is about to land on, and
> confuse that scan. (This condition is overly strong, of course, but
> it's what's designed in at the moment.) The way this is implemented
> is to label a freshly-deleted page with the current value of
> ReadNewTransactionId(). Once that value is older than RecentXmin,
> the page is presumed recyclable.
>
> I think this was all right when it was designed, but isn't it rather
> badly broken by our subsequent changes to have transactions not take
> out an XID until/unless they write something? A read-only transaction
> could easily be much older than RecentXmin, no?

Yeah :-(.

> The odds of an actual problem seem not very high, since to be affected
> a scan would have to be already "in flight" to the problem page when
> the deletion occurs. By the time RecentXmin advances and we feed the
> page to the FSM and get it back, the scan's almost surely going to have
> arrived. And I think the logic is such that this would not happen
> before the next VACUUM in any case. Still, it seems pretty bogus.

One idea is to change the traversal logic slightly, so that whenever you follow a pointer from page A to B, you keep A pinned until you've pinned B. Then we could just take a cleanup lock on the left, right and parent of the empty page, one at a time, to ensure that no-one is in flight to it, and recycle it immediately.

However, forward scans screw that up. When a forward scan reads a page, it saves its right pointer at the same time, so that if the page is subsequently split, it doesn't process tuples on the new right half again. Even if we take a cleanup lock on the left sibling of an empty page, there can be scans further left that have a direct pointer to the empty page. It could be salvaged if we require a forward scan to pin the next page too when it saves the right pointer, but that seems inefficient.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
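(A rough sketch of the pin-coupling idea for the simple step-right case, using the generic buffer-manager calls. The helper name and structure are illustrative only; this is the proposal from the mail above, not existing nbtree code.)

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: step from the current page to its right sibling,
 * keeping the old page pinned until the new page's pin is acquired.
 * Under this discipline, VACUUM could take a cleanup lock (exclusive
 * lock plus waiting for the pin count to drop to one) on a page's
 * neighbours to know that no scan is still "in flight" to it.
 */
static Buffer
step_right_pin_coupled(Relation rel, Buffer curbuf)
{
	/* Caller holds a read lock and a pin on curbuf, which is assumed
	 * not to be the rightmost page. */
	Page		page = BufferGetPage(curbuf);
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	BlockNumber	next = opaque->btpo_next;
	Buffer		nextbuf;

	/* Pin the right sibling while the pin on the current page is
	 * still held, so the two pins overlap. */
	nextbuf = ReadBuffer(rel, next);

	/* Only now is it safe to drop the lock and pin on the old page. */
	LockBuffer(curbuf, BUFFER_LOCK_UNLOCK);
	ReleaseBuffer(curbuf);

	LockBuffer(nextbuf, BT_READ);
	return nextbuf;
}

The expensive part, per the mail, is that a forward scan would additionally have to hold a pin on the saved right-pointer target, not just on the page it is currently reading.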
On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:
> Another issue is that it's not clear what happens in a Hot Standby
> slave --- it doesn't look like Simon put any interlocking in this
> area to protect slave queries against having the page disappear
> from under them. The odds of an actual problem are probably a
> good bit higher in an HS slave.

Seen.

--
Simon Riggs
www.2ndQuadrant.com
On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:
> That last problem is easy to fix, but I'm not at all sure what to do
> about the scan interlock problem. Thoughts?

AFAICS the problem doesn't exist in normal running. _bt_page_recyclable() tests against RecentXmin, which includes the xmins of read-only transactions. So it doesn't matter if a read-only transaction still exists that is earlier than the value of opaque->btpo.xact when it is set. If it is still there later, the page cannot be reused.

A basic interlock approach can be put in place for Hot Standby. We just WAL-log the reuse of a btree page in _bt_getbuf(), just before we _bt_pageinit(), using the transaction id that took that action. We can then conflict on that xid.

For the TODO, I'm thinking about whether there's a way to allow the page to be reused earlier and have it all just work. That would allow us to recycle index blocks faster and avoid index bloat in the presence of long-lived transactions. Otherwise, fixing this for the normal case will accentuate index bloat.

It seems possible that a page can be reused and end up at exactly the same place in the index key space, so that the left link of the new page matches the right link of the page the scan just left. Most likely, though, it would be in a different place entirely, so ignoring the issue could cause scans to stop earlier than they should and give an incomplete answer to a query. So we can't just re-check links to validate the page.

The only thing we actually need to record about the old page is its right link, so perhaps we can store the right-link value in a central place, together with visibility information. Make that info WAL-logged so it is available on the standby also. That would allow us to find out whether we should read the page or use the right-link info to move right. We then store a recycled-by transaction id on the new page we are recycling. When we scan onto a new page, we check whether the page has been recycled by a transaction that we consider still in progress. If so, we consult the page-visibility info to see what the right link of the page was as far as our scan is concerned, then use that to continue our scan.

--
Simon Riggs
www.2ndQuadrant.com
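(A hypothetical sketch of that interlock: _bt_getbuf() would emit a WAL record just before _bt_pageinit() when it pulls a recyclable page from the FSM, and redo on the standby would cancel queries that conflict with the xid carried in the record. The record type, info code, struct layout, and helper below are all illustrative, not existing code.)

#include "postgres.h"

#include "access/nbtree.h"
#include "access/xlog.h"
#include "storage/relfilenode.h"
#include "utils/rel.h"

/* Hypothetical info code for the new record type. */
#define XLOG_BTREE_REUSE_PAGE	0xF0

/* Hypothetical record: enough for the standby to identify the page
 * being reused and the conflict horizon to cancel queries against. */
typedef struct xl_btree_reuse_page
{
	RelFileNode node;			/* index whose page is being reused */
	BlockNumber block;			/* page being reused */
	TransactionId conflictXid;	/* xid standby queries conflict with */
} xl_btree_reuse_page;

/*
 * Illustrative helper, to be called from _bt_getbuf() just before
 * _bt_pageinit() when a page obtained from the FSM proves recyclable.
 */
static void
_bt_log_reuse_page(Relation rel, BlockNumber blkno,
				   TransactionId conflictXid)
{
	xl_btree_reuse_page xlrec;
	XLogRecData rdata;

	xlrec.node = rel->rd_node;
	xlrec.block = blkno;
	xlrec.conflictXid = conflictXid;

	rdata.data = (char *) &xlrec;
	rdata.len = sizeof(xl_btree_reuse_page);
	rdata.buffer = InvalidBuffer;	/* no backup block needed */
	rdata.next = NULL;

	(void) XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE, &rdata);
}

On replay, the startup process would presumably resolve the conflict against conflictXid (in the manner of the existing recovery-conflict handling) before the reuse proceeds, so no standby scan can still be in flight to the page.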