Bugs in b-tree dead page removal

From Tom Lane
Subject Bugs in b-tree dead page removal
Msg-id 23761.1265596434@sss.pgh.pa.us
Responses Re: Bugs in b-tree dead page removal  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Bugs in b-tree dead page removal  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: Bugs in b-tree dead page removal  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Bugs in b-tree dead page removal  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Whilst looking around for stuff that could be deleted thanks to removing
old-style VACUUM FULL, I came across some code in btree that seems
rather seriously buggy.  For reasons explained in nbtree/README, we
can't physically recycle a "deleted" btree index page until all
transactions open at the time of deletion are gone --- otherwise we
might re-use a page that an existing scan is about to land on, and
confuse that scan.  (This condition is overly strong, of course, but
it's what's designed in at the moment.)  The way this is implemented
is to label a freshly-deleted page with the current value of
ReadNewTransactionId().  Once that value is older than RecentXmin,
the page is presumed recyclable.
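For concreteness, the recycle test boils down to a wraparound-aware
XID comparison against the stamp left on the deleted page.  A minimal
standalone sketch of that logic (simplified stand-in types, not the
actual _bt_page_recyclable code in nbtpage.c):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified stand-in for PostgreSQL's TransactionId. */
    typedef uint32_t TransactionId;

    /* Wraparound-aware "a < b", in the spirit of
     * TransactionIdPrecedes() for two ordinary XIDs. */
    static bool
    xid_precedes(TransactionId a, TransactionId b)
    {
        return (int32_t) (a - b) < 0;
    }

    /* A deleted btree page is stamped with ReadNewTransactionId() at
     * deletion time; it is presumed recyclable once that stamp has
     * fallen behind RecentXmin. */
    static bool
    page_recyclable(TransactionId btpo_xact, TransactionId recent_xmin)
    {
        return xid_precedes(btpo_xact, recent_xmin);
    }

    int
    main(void)
    {
        TransactionId stamp = 1000;     /* next XID at deletion time */

        /* Not yet safe: RecentXmin still behind the deletion stamp. */
        printf("%d\n", page_recyclable(stamp, 990));    /* prints 0 */

        /* Once RecentXmin has advanced past the stamp, the page is
         * handed to the FSM for reuse. */
        printf("%d\n", page_recyclable(stamp, 1001));   /* prints 1 */
        return 0;
    }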

I think this was all right when it was designed, but isn't it rather
badly broken by our subsequent changes to have transactions not take
out an XID until/unless they write something?  A read-only transaction
could easily be much older than RecentXmin, no?
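To make the hazard concrete, here's a toy model (my own construction,
not the real snapshot code): if the oldest-running-XID computation
only sees transactions that actually hold XIDs, a long-lived read-only
transaction simply doesn't hold RecentXmin back:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* Toy backend entry: a read-only transaction has no XID assigned. */
    typedef struct
    {
        TransactionId xid;      /* InvalidTransactionId if read-only */
        const char   *desc;
    } Backend;

    /* Oldest running XID: backends without an XID contribute nothing,
     * which is exactly the problem. */
    static TransactionId
    oldest_running_xid(const Backend *procs, int n, TransactionId next_xid)
    {
        TransactionId xmin = next_xid;

        for (int i = 0; i < n; i++)
            if (procs[i].xid != InvalidTransactionId &&
                (int32_t) (procs[i].xid - xmin) < 0)
                xmin = procs[i].xid;
        return xmin;
    }

    int
    main(void)
    {
        /* The read-only scan started long ago but never took an XID,
         * so it is invisible here. */
        Backend procs[] = {
            { InvalidTransactionId, "old read-only scan" },
            { 1005,                 "recent writer" },
        };
        TransactionId stamp = 1000;     /* page deleted earlier */
        TransactionId recent_xmin = oldest_running_xid(procs, 2, 1010);

        printf("RecentXmin = %u\n", (unsigned) recent_xmin);  /* 1005 */
        /* The page looks recyclable though the old scan lives on. */
        printf("recyclable = %d\n", (int32_t) (stamp - recent_xmin) < 0);
        return 0;
    }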

The odds of an actual problem seem not very high, since to be affected
a scan would have to be already "in flight" to the problem page when
the deletion occurs.  By the time RecentXmin advances and we feed the
page to the FSM and get it back, the scan's almost surely going to have
arrived.  And I think the logic is such that this would not happen
before the next VACUUM in any case.  Still, it seems pretty bogus.

Another issue is that it's not clear what happens in a Hot Standby
slave --- it doesn't look like Simon put any interlocking in this
area to protect slave queries against having the page disappear
from under them.  The odds of an actual problem are probably a
good bit higher in an HS slave.

And there's another problem: _bt_pagedel is designed to recurse
in certain improbable cases, but I think this is flat out wrong
when doing WAL replay --- if the original process did recurse
then it will have emitted a WAL record for each deleted page,
meaning that replay, if it also recursed, would try to delete the
same page twice.
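The shape of that fix would be something like this sketch (hypothetical
simplified names, not the actual nbtree code): the recursion is confined
to the original backend, which emits one record per page it deletes,
while the redo side applies each record as exactly one page deletion:

    #include <stdbool.h>
    #include <stdio.h>

    static void
    emit_wal_record(int blkno)
    {
        printf("WAL: delete page %d\n", blkno);
    }

    /* Original backend: may recurse in the improbable case where
     * deleting this page makes another page deletable, emitting one
     * WAL record per page actually deleted. */
    static int
    pagedel(int blkno, bool another_page_now_deletable)
    {
        int ndeleted = 1;

        emit_wal_record(blkno);
        if (another_page_now_deletable)
            ndeleted += pagedel(blkno + 1, false);
        return ndeleted;
    }

    /* Replay: each record already names exactly one page, so the redo
     * routine must not recurse --- doing so would delete the second
     * page twice. */
    static void
    pagedel_redo(int blkno)
    {
        printf("redo: delete page %d\n", blkno);
    }

    int
    main(void)
    {
        pagedel(7, true);       /* deletes pages 7 and 8, two records */
        pagedel_redo(7);        /* replay applies each record once */
        pagedel_redo(8);
        return 0;
    }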

That last problem is easy to fix, but I'm not at all sure what to do
about the scan interlock problem.  Thoughts?
        regards, tom lane

