Re: B-tree page deletion boundary cases - Mailing list pgsql-hackers

From Nikhil Sontakke
Subject Re: B-tree page deletion boundary cases
Date
Msg-id CANgU5Zc5gq00PU07ZKwfbzX5pJRjO+8FmLcSAbY22JXV5jm9tw@mail.gmail.com
Whole thread Raw
In response to B-tree page deletion boundary cases  (Noah Misch <noah@leadboat.com>)
Responses Re: B-tree page deletion boundary cases  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
<div class="gmail_extra">Hi Noah, <br /><br />Was wondering if there's a similar bug which gets triggered while using
VACUUMFULL. See for instance this thread:<br /><br /><a
href="http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html">http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html</a><br
/><br/>This issue has been reported on-off from time to time and in most cases VACUUM or VACUUM FULL appears to be
involved.We have usually attributed it to hardware issues and reindex has been recommended by default as a
solution/workaround.. <br /><br />Regards,<br />Nikhils<br /><br /><div class="gmail_quote">On Sat, Apr 21, 2012 at
10:22PM, Noah Misch <span dir="ltr"><<a href="mailto:noah@leadboat.com"
target="_blank">noah@leadboat.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px#ccc solid;padding-left:1ex"> For the sake of concurrency, our B-tree implementation has a phased
process<br/> for reusing empty pages.  Excerpting from nbtree/README:<br /><br />        A deleted page cannot be
reclaimedimmediately, since there may be other<br />        processes waiting to reference it (ie, search processes
thatjust left the<br />        parent, or scans moving right or left from one of the siblings).  These<br />      
 processesmust observe that the page is marked dead and recover<br />        accordingly.  Searches and forward scans
simplyfollow the right-link<br />        until they find a non-dead page --- this will be where the deleted page's<br
/>       key-space moved to.<br /><br />        ...<br /><br />        A deleted page can only be reclaimed once there
isno scan or search that<br />        has a reference to it; until then, it must stay in place with its<br />      
 right-linkundisturbed.  We implement this by waiting until all<br />        transactions that were running at the time
ofdeletion are dead; which is<br />        overly strong, but is simple to implement within Postgres.  When marked<br
/>       dead, a deleted page is labeled with the next-transaction counter value.<br />        VACUUM can reclaim the
pagefor re-use when this transaction number is<br />        older than the oldest open transaction.<br /><br /> The
VACUUMthat deletes a page's last tuple calls _bt_pagedel(), which flags<br /> the page BTP_DELETED and stores therein
theresult of ReadNewTransactionId().<br /> When a later VACUUM visits such a page and observes that the stored XID is
now<br/> less than or equal to RecentXmin, it adds the page to the FSM.  An INSERT or<br /> UPDATE will pull the page
fromthe FSM and repurpose it.<br /><br /> As I mentioned[1] peripherally back in November, that algorithm has been<br
/>insufficient since the introduction of non-XID-bearing transactions in<br /> PostgreSQL 8.3.  Such transactions do
notrestrain RecentXmin.  If no running<br /> transaction has an XID, RecentXmin == ReadNewTransactionId() and the
page<br/> incorrectly becomes available for immediate reuse.  This is difficult to<br /> encounter in practice.  VACUUM
acquiresa cleanup lock on every leaf page in<br /> each index.  Consequently, a problem can only arise around a leaf
page<br/> deletion when two VACUUMs visit the page during the narrow window when<br /> _bt_steppage() has released the
pinon its left sibling and not yet acquired a<br /> read lock on the target.  Non-leaf deletions might not require such
narrow<br/> conditions, but they are also exponentially less frequent.  Here is a test<br /> procedure illustrating the
bug:<br/><br /> 1. In session S1, run these commands:<br /> BEGIN;<br /> CREATE TABLE t (x int, filler
character(400));<br/> INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);<br /> CREATE INDEX ON t(x);<br />
DELETEFROM t WHERE x >= 2000 AND x < 4000;<br /> COMMIT;<br /><br /> BEGIN;<br /> SET LOCAL enable_seqscan =
off;<br/> SET LOCAL enable_bitmapscan = off;<br /> DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x <
4510;<br/> FETCH 5 c;<br /><br /> 2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.<br /> 3. In S1, run "FETCH 10
c". This will hit the breakpoint.<br /> 4. In another session S2, run these commands:<br /> VACUUM VERBOSE t; -- mark
somepages BTP_DELETED<br /> VACUUM VERBOSE t; -- update FSM to know about the pages<br /> -- reuse the pages<br />
INSERTINTO t SELECT * FROM generate_series(10001, 12000);<br /><br /> 5. Exit gdb to free up S1.  The FETCH only
returnsfive rows.<br /><br /> (The "filler" column makes each index page correspond to more heap pages.<br /> Without
it,heap page pins prevent removing some of the tuples on the index<br /> page under test.)<br /><br /><br /> The fix is
tocompare the stored XID to RecentGlobalXmin, not RecentXmin.  We<br /> already use RecentGlobalXmin when wal_level =
hot_standby. If no running<br /> transaction has an XID and all running transactions began since the last<br />
transactionthat did bear an XID, RecentGlobalXmin == ReadNewTransactionId().<br /> Therefore, the correct test is
btpo.xact< RecentGlobalXmin, not btpo.xact <=<br /> RecentGlobalXmin as we have today.  This also cleanly removes
theneed for the<br /> bit of code in _bt_getbuf() that decrements btpo.xact before sending it down<br /> for
ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on<br /> an unprincipled basis; it was just
maskingthe off-by-one of using "<=<br /> RecentGlobalXmin" instead of "< RecentGlobalXmin" in
_bt_page_recyclable().<br/><br /> This change makes empty B-tree pages wait through two generations of running<br />
transactionsbefore reuse, so some additional bloat will arise.  Furthermore,<br /> the set of transactions having
snapshotsprecluding reuse of the page will<br /> continue to grow until the next transaction to allocate an XID
commits. The<br /> alternative of occasionally returning wrong query results won't do, though.<br /> While we could
explorefundamentally-different page deletion algorithms for<br /> PostgreSQL 9.3, this is the only fix coming to mind
that'ssuitable for today.<br /><br /> For purposes of log message writing, this patch effectively reverts commits<br />
758bd2a433d64bed00ca084203b3e5ccfdea4499and<br /> e1cd66f74862936d84acf3008118d6094c56ad58.  I've attempted to document
in<br/> comments all the questions raised over on the SP-GiST/hot standby thread[3].<br /><br /> Thanks,<br /> nm<br
/><br/> [1] <a href="http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com"
target="_blank">http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com</a><br/> [2] <a
href="http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com"
target="_blank">http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com</a><br/> [3] <a
href="http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com"
target="_blank">http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com</a><br/><br /><br
/>--<br /> Sent via pgsql-hackers mailing list (<a
href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your
subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers"
target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /></div> 

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: B-tree page deletion boundary cases
Next
From: Jeff Davis
Date:
Subject: Re: RFC: Making TRUNCATE more "MVCC-safe"