Bugs in b-tree dead page removal

From: Tom Lane
Whilst looking around for stuff that could be deleted thanks to removing
old-style VACUUM FULL, I came across some code in btree that seems
rather seriously buggy.  For reasons explained in nbtree/README, we
can't physically recycle a "deleted" btree index page until all
transactions open at the time of deletion are gone --- otherwise we
might re-use a page that an existing scan is about to land on, and
confuse that scan.  (This condition is overly strong, of course, but
it's what's designed in at the moment.)  The way this is implemented
is to label a freshly-deleted page with the current value of
ReadNewTransactionId().  Once that value is older than RecentXmin,
the page is presumed recyclable.
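
For reference, the test boils down to roughly the following sketch (a
simplified stand-in for the real _bt_page_recyclable(); btpo.xact is the
field carrying the XID stamped at deletion):

    /* Sketch of the recyclability test described above. */
    static bool
    page_is_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /* A never-initialized page can always be reused. */
        if (PageIsNew(page))
            return true;

        /*
         * Recycle only if the page is deleted and the XID stamped on it
         * at deletion (via ReadNewTransactionId()) is older than
         * RecentXmin, i.e. older than any transaction that could still
         * care about the page, assuming every such transaction actually
         * holds an XID.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedes(opaque->btpo.xact, RecentXmin))
            return true;

        return false;
    }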

I think this was all right when it was designed, but isn't it rather
badly broken by our subsequent changes to have transactions not take
out an XID until/unless they write something?  A read-only transaction
could easily be much older than RecentXmin, no?

The odds of an actual problem seem not very high, since to be affected
a scan would have to be already "in flight" to the problem page when
the deletion occurs.  By the time RecentXmin advances and we feed the
page to the FSM and get it back, the scan's almost surely going to have
arrived.  And I think the logic is such that this would not happen
before the next VACUUM in any case.  Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby
slave --- it doesn't look like Simon put any interlocking in this
area to protect slave queries against having the page disappear
from under them.  The odds of an actual problem are probably a
good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse
in certain improbable cases, but I think this is flat out wrong
when doing WAL replay --- if the original process did recurse
then it will have emitted a WAL record for each deleted page,
meaning replay would try to delete twice.

That last problem is easy to fix, but I'm not at all sure what to do
about the scan interlock problem.  Thoughts?
        regards, tom lane


Re: Bugs in b-tree dead page removal

From: Tom Lane
I wrote:
> And there's another problem: _bt_pagedel is designed to recurse
> in certain improbable cases, but I think this is flat out wrong
> when doing WAL replay --- if the original process did recurse
> then it will have emitted a WAL record for each deleted page,
> meaning replay would try to delete twice.

No, scratch that, I misread it: _bt_pagedel isn't invoked during
WAL replay, but for cleanup of incomplete deletions at termination
of WAL replay.  So any recursing it has to do also corresponds to
actions that weren't in WAL.  So that's OK.

I'm still concerned about the interlock against read-only
transactions though.
        regards, tom lane


Re: Bugs in b-tree dead page removal

From: Heikki Linnakangas
Tom Lane wrote:
> Whilst looking around for stuff that could be deleted thanks to removing
> old-style VACUUM FULL, I came across some code in btree that seems
> rather seriously buggy.  For reasons explained in nbtree/README, we
> can't physically recycle a "deleted" btree index page until all
> transactions open at the time of deletion are gone --- otherwise we
> might re-use a page that an existing scan is about to land on, and
> confuse that scan.  (This condition is overly strong, of course, but
> it's what's designed in at the moment.)  The way this is implemented
> is to label a freshly-deleted page with the current value of
> ReadNewTransactionId().  Once that value is older than RecentXmin,
> the page is presumed recyclable.
> 
> I think this was all right when it was designed, but isn't it rather
> badly broken by our subsequent changes to have transactions not take
> out an XID until/unless they write something?  A read-only transaction
> could easily be much older than RecentXmin, no?

Yeah :-(.

> The odds of an actual problem seem not very high, since to be affected
> a scan would have to be already "in flight" to the problem page when
> the deletion occurs.  By the time RecentXmin advances and we feed the
> page to the FSM and get it back, the scan's almost surely going to have
> arrived.  And I think the logic is such that this would not happen
> before the next VACUUM in any case.  Still, it seems pretty bogus.

One idea is to change the traversal logic slightly, so that whenever you
follow a pointer from page A to B, you keep A pinned until you've pinned
B. Then we could just take a cleanup lock on the left, right, and parent
of the empty page, one at a time, to ensure that no one is in flight to
it, and recycle it immediately. However, forward scans screw that up.
When a forward scan reads a page, it saves that page's right pointer at
the same time, so that if the page is subsequently split, it doesn't
process tuples on the new right half again. So even if we take a cleanup
lock on the left sibling of an empty page, there can be scans further to
the left that hold a direct pointer to the empty page. That could be
salvaged by requiring a forward scan to pin the next page too when it
saves the right pointer, but that seems inefficient.
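
In rough code, stepping right would change from "release A, then read B"
to something like this sketch (reusing the existing _bt_getbuf() and
_bt_relbuf() helpers; error handling elided):

    /*
     * Sketch: move from page A (buf) to its right sibling B with no
     * window in which neither page is pinned, so B cannot be recycled
     * while the scan is "in flight" to it.
     */
    BlockNumber next   = opaque->btpo_next;               /* A's right link */
    Buffer      newbuf = _bt_getbuf(rel, next, BT_READ);  /* pin and lock B */

    _bt_relbuf(rel, buf);       /* only now release the pin on A */
    buf = newbuf;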

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Bugs in b-tree dead page removal

From: Simon Riggs
On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:
> Another issue is that it's not clear what happens in a Hot Standby
> slave --- it doesn't look like Simon put any interlocking in this
> area to protect slave queries against having the page disappear
> from under them.  The odds of an actual problem are probably a
> good bit higher in an HS slave.

Seen.

-- Simon Riggs           www.2ndQuadrant.com



Re: Bugs in b-tree dead page removal

From: Simon Riggs
On Sun, 2010-02-07 at 21:33 -0500, Tom Lane wrote:

> That last problem is easy to fix, but I'm not at all sure what to do
> about the scan interlock problem.  Thoughts?

AFAICS the problem doesn't exist in normal running.
_bt_page_recyclable() tests against RecentXmin, which includes the xmins
of read-only transactions. So it doesn't matter if a read-only
transaction older than the value of opaque->btpo.xact still exists when
that value is set; if it is still there later, the page cannot be
reused.

A basic interlock approach can be put in place for Hot Standby: we
WAL-log the reuse of a btree page in _bt_getbuf(), just before we
_bt_pageinit() it, using the transaction id that took that action. We
can then conflict on that xid.
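
As a sketch, the record would need very little (the struct and field
names here are hypothetical):

    /* Hypothetical WAL record for logging the reuse of a btree page. */
    typedef struct xl_btree_reuse_page
    {
        RelFileNode   node;   /* index in which a page is being reused */
        BlockNumber   block;  /* the page being reused */
        TransactionId xid;    /* XID to conflict against on the standby */
    } xl_btree_reuse_page;

Replay would then resolve conflicts against that xid before the page is
reinitialized, much as we already do for btree delete records.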


For the TODO, I'm wondering whether there's a way to allow the page to
be reused earlier and have it all just work. That would let us recycle
index blocks faster and avoid the index bloat that occurs in the
presence of long-lived transactions. Otherwise, fixing this for the
normal case will accentuate index bloat.

It seems possible for a page to be reused and end up at exactly the same
place in the index key space, so that the left link of the new page
matches the right link of the page the scan just left. But most likely
it would be in a different place entirely, so ignoring the issue could
cause scans to stop earlier than they should and give an incomplete
answer to a query. So we can't just re-check links to validate the page.

The only thing we actually need to record about the old page is its
right link, so perhaps we can store the right-link value in a central
place, together with visibility information, and WAL-log that info so it
is available on the standby as well. That would allow us to find out
whether we should read the page or use the right-link info to move
right.

We then store a recycled-by transaction id on the new page we are
recycling. When we scan onto a new page we check to see whether the page
has been recycled by a transaction that we consider still in progress.
If so, we consult the page-visibility info to see what the right link of
the page was as far as our scan is concerned, then use that to continue
our scan.
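
In rough code, the check on landing on a page might look like this
(btpo_recycled_xact and the central-map lookup are both hypothetical):

    /* Hypothetical check on arriving at a page during a scan. */
    if (TransactionIdIsValid(opaque->btpo_recycled_xact) &&
        XidInMVCCSnapshot(opaque->btpo_recycled_xact, scan->xs_snapshot))
    {
        /*
         * The page was recycled by a transaction our scan considers
         * still in progress: fetch the old page's right link from the
         * central map and move right instead of reading this page.
         */
        blkno = LookupRecycledRightLink(rel, blkno);    /* hypothetical */
    }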

-- Simon Riggs           www.2ndQuadrant.com