Thread: B-tree page deletion boundary cases
For the sake of concurrency, our B-tree implementation has a phased process for reusing empty pages. Excerpting from nbtree/README: A deleted page cannot be reclaimed immediately, since there may be other processes waiting to reference it (ie, search processes that just left the parent, or scans moving right or left from one of the siblings). These processes must observe that the page is marked dead and recover accordingly. Searches and forward scans simply follow the right-link until they find a non-dead page --- this will be where the deleted page's key-space moved to. ... A deleted page can only be reclaimed once there is no scan or search that has a reference to it; until then, it must stay in place with its right-link undisturbed. We implement this by waiting until all transactions that were running at the time of deletion are dead; which is overly strong, but is simple to implement within Postgres. When marked dead, a deleted page is labeled with the next-transaction counter value. VACUUM can reclaim the page for re-use when this transaction number is older than the oldest open transaction. The VACUUM that deletes a page's last tuple calls _bt_pagedel(), which flags the page BTP_DELETED and stores therein the result of ReadNewTransactionId(). When a later VACUUM visits such a page and observes that the stored XID is now less than or equal to RecentXmin, it adds the page to the FSM. An INSERT or UPDATE will pull the page from the FSM and repurpose it. As I mentioned[1] peripherally back in November, that algorithm has been insufficient since the introduction of non-XID-bearing transactions in PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no running transaction has an XID, RecentXmin == ReadNewTransactionId() and the page incorrectly becomes available for immediate reuse. This is difficult to encounter in practice. VACUUM acquires a cleanup lock on every leaf page in each index. Consequently, a problem can only arise around a leaf page deletion when two VACUUMs visit the page during the narrow window when _bt_steppage() has released the pin on its left sibling and not yet acquired a read lock on the target. Non-leaf deletions might not require such narrow conditions, but they are also exponentially less frequent. Here is a test procedure illustrating the bug: 1. In session S1, run these commands: BEGIN; CREATE TABLE t (x int, filler character(400)); INSERT INTO t SELECT *, '' FROM generate_series(1, 10000); CREATE INDEX ON t(x); DELETE FROM t WHERE x >= 2000 AND x < 4000; COMMIT; BEGIN; SET LOCAL enable_seqscan = off; SET LOCAL enable_bitmapscan = off; DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510; FETCH 5 c; 2. Attach gdb to S1 and set a breakpoint on _bt_getbuf. 3. In S1, run "FETCH 10 c". This will hit the breakpoint. 4. In another session S2, run these commands: VACUUM VERBOSE t; -- mark some pages BTP_DELETED VACUUM VERBOSE t; -- update FSM to know about the pages -- reuse the pages INSERT INTO t SELECT * FROM generate_series(10001, 12000); 5. Exit gdb to free up S1. The FETCH only returns five rows. (The "filler" column makes each index page correspond to more heap pages. Without it, heap page pins prevent removing some of the tuples on the index page under test.) The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. We already use RecentGlobalXmin when wal_level = hot_standby. If no running transaction has an XID and all running transactions began since the last transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId(). Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <= RecentGlobalXmin as we have today. This also cleanly removes the need for the bit of code in _bt_getbuf() that decrements btpo.xact before sending it down for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on an unprincipled basis; it was just masking the off-by-one of using "<= RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable(). This change makes empty B-tree pages wait through two generations of running transactions before reuse, so some additional bloat will arise. Furthermore, the set of transactions having snapshots precluding reuse of the page will continue to grow until the next transaction to allocate an XID commits. The alternative of occasionally returning wrong query results won't do, though. While we could explore fundamentally-different page deletion algorithms for PostgreSQL 9.3, this is the only fix coming to mind that's suitable for today. For purposes of log message writing, this patch effectively reverts commits 758bd2a433d64bed00ca084203b3e5ccfdea4499 and e1cd66f74862936d84acf3008118d6094c56ad58. I've attempted to document in comments all the questions raised over on the SP-GiST/hot standby thread[3]. Thanks, nm [1] http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com [2] http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com [3] http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com
Attachment
<div class="gmail_extra">Hi Noah, <br /><br />Was wondering if there's a similar bug which gets triggered while using VACUUMFULL. See for instance this thread:<br /><br /><a href="http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html">http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html</a><br /><br/>This issue has been reported on-off from time to time and in most cases VACUUM or VACUUM FULL appears to be involved.We have usually attributed it to hardware issues and reindex has been recommended by default as a solution/workaround.. <br /><br />Regards,<br />Nikhils<br /><br /><div class="gmail_quote">On Sat, Apr 21, 2012 at 10:22PM, Noah Misch <span dir="ltr"><<a href="mailto:noah@leadboat.com" target="_blank">noah@leadboat.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex"> For the sake of concurrency, our B-tree implementation has a phased process<br/> for reusing empty pages. Excerpting from nbtree/README:<br /><br /> A deleted page cannot be reclaimedimmediately, since there may be other<br /> processes waiting to reference it (ie, search processes thatjust left the<br /> parent, or scans moving right or left from one of the siblings). These<br /> processesmust observe that the page is marked dead and recover<br /> accordingly. Searches and forward scans simplyfollow the right-link<br /> until they find a non-dead page --- this will be where the deleted page's<br /> key-space moved to.<br /><br /> ...<br /><br /> A deleted page can only be reclaimed once there isno scan or search that<br /> has a reference to it; until then, it must stay in place with its<br /> right-linkundisturbed. We implement this by waiting until all<br /> transactions that were running at the time ofdeletion are dead; which is<br /> overly strong, but is simple to implement within Postgres. When marked<br /> dead, a deleted page is labeled with the next-transaction counter value.<br /> VACUUM can reclaim the pagefor re-use when this transaction number is<br /> older than the oldest open transaction.<br /><br /> The VACUUMthat deletes a page's last tuple calls _bt_pagedel(), which flags<br /> the page BTP_DELETED and stores therein theresult of ReadNewTransactionId().<br /> When a later VACUUM visits such a page and observes that the stored XID is now<br/> less than or equal to RecentXmin, it adds the page to the FSM. An INSERT or<br /> UPDATE will pull the page fromthe FSM and repurpose it.<br /><br /> As I mentioned[1] peripherally back in November, that algorithm has been<br />insufficient since the introduction of non-XID-bearing transactions in<br /> PostgreSQL 8.3. Such transactions do notrestrain RecentXmin. If no running<br /> transaction has an XID, RecentXmin == ReadNewTransactionId() and the page<br/> incorrectly becomes available for immediate reuse. This is difficult to<br /> encounter in practice. VACUUM acquiresa cleanup lock on every leaf page in<br /> each index. Consequently, a problem can only arise around a leaf page<br/> deletion when two VACUUMs visit the page during the narrow window when<br /> _bt_steppage() has released the pinon its left sibling and not yet acquired a<br /> read lock on the target. Non-leaf deletions might not require such narrow<br/> conditions, but they are also exponentially less frequent. Here is a test<br /> procedure illustrating the bug:<br/><br /> 1. In session S1, run these commands:<br /> BEGIN;<br /> CREATE TABLE t (x int, filler character(400));<br/> INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);<br /> CREATE INDEX ON t(x);<br /> DELETEFROM t WHERE x >= 2000 AND x < 4000;<br /> COMMIT;<br /><br /> BEGIN;<br /> SET LOCAL enable_seqscan = off;<br/> SET LOCAL enable_bitmapscan = off;<br /> DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510;<br/> FETCH 5 c;<br /><br /> 2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.<br /> 3. In S1, run "FETCH 10 c". This will hit the breakpoint.<br /> 4. In another session S2, run these commands:<br /> VACUUM VERBOSE t; -- mark somepages BTP_DELETED<br /> VACUUM VERBOSE t; -- update FSM to know about the pages<br /> -- reuse the pages<br /> INSERTINTO t SELECT * FROM generate_series(10001, 12000);<br /><br /> 5. Exit gdb to free up S1. The FETCH only returnsfive rows.<br /><br /> (The "filler" column makes each index page correspond to more heap pages.<br /> Without it,heap page pins prevent removing some of the tuples on the index<br /> page under test.)<br /><br /><br /> The fix is tocompare the stored XID to RecentGlobalXmin, not RecentXmin. We<br /> already use RecentGlobalXmin when wal_level = hot_standby. If no running<br /> transaction has an XID and all running transactions began since the last<br /> transactionthat did bear an XID, RecentGlobalXmin == ReadNewTransactionId().<br /> Therefore, the correct test is btpo.xact< RecentGlobalXmin, not btpo.xact <=<br /> RecentGlobalXmin as we have today. This also cleanly removes theneed for the<br /> bit of code in _bt_getbuf() that decrements btpo.xact before sending it down<br /> for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on<br /> an unprincipled basis; it was just maskingthe off-by-one of using "<=<br /> RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().<br/><br /> This change makes empty B-tree pages wait through two generations of running<br /> transactionsbefore reuse, so some additional bloat will arise. Furthermore,<br /> the set of transactions having snapshotsprecluding reuse of the page will<br /> continue to grow until the next transaction to allocate an XID commits. The<br /> alternative of occasionally returning wrong query results won't do, though.<br /> While we could explorefundamentally-different page deletion algorithms for<br /> PostgreSQL 9.3, this is the only fix coming to mind that'ssuitable for today.<br /><br /> For purposes of log message writing, this patch effectively reverts commits<br /> 758bd2a433d64bed00ca084203b3e5ccfdea4499and<br /> e1cd66f74862936d84acf3008118d6094c56ad58. I've attempted to document in<br/> comments all the questions raised over on the SP-GiST/hot standby thread[3].<br /><br /> Thanks,<br /> nm<br /><br/> [1] <a href="http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com" target="_blank">http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com</a><br/> [2] <a href="http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com" target="_blank">http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com</a><br/> [3] <a href="http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com" target="_blank">http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com</a><br/><br /><br />--<br /> Sent via pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br/> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br/><br /></blockquote></div><br /></div>
On Sun, Apr 22, 2012 at 12:13:34AM +0530, Nikhil Sontakke wrote: > Was wondering if there's a similar bug which gets triggered while using > VACUUM FULL. See for instance this thread: > > http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html > > This issue has been reported on-off from time to time and in most cases > VACUUM or VACUUM FULL appears to be involved. We have usually attributed it > to hardware issues and reindex has been recommended by default as a > solution/work around.. I do not perceive much similarity. The bug I've raised can produce wrong query results transiently. It might permit injecting a tuple into the wrong spot in the tree, yielding persistent wrong results. It would not introduce tree-structural anomalies like sibling pointers directed at zeroed pages or internal pages in an 1-level tree. Given the symptoms you reported, I share Robert's suspicion of WAL replay in your scenario.
<div class="gmail_extra"> <br /><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1pxsolid rgb(204,204,204);padding-left:1ex"><div class="im"> > Was wondering if there's a similar bugwhich gets triggered while using<br /> > VACUUM FULL. See for instance this thread:<br /> ><br /> > <a href="http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html" target="_blank">http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html</a><br/> ><br />> This issue has been reported on-off from time to time and in most cases<br /> > VACUUM or VACUUM FULL appears tobe involved. We have usually attributed it<br /> > to hardware issues and reindex has been recommended by default asa<br /> > solution/work around..<br /><br /></div>I do not perceive much similarity. The bug I've raised can producewrong<br /> query results transiently. It might permit injecting a tuple into the wrong<br /> spot in the tree, yieldingpersistent wrong results. It would not introduce<br /> tree-structural anomalies like sibling pointers directedat zeroed pages or<br /> internal pages in an 1-level tree. Given the symptoms you reported, I share<br /> Robert'ssuspicion of WAL replay in your scenario.<br /></blockquote></div><br />Thanks for taking the time to analyze thisNoah.<br /><br />Regards,<br />Nikhils<br /></div>
On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch <noah@leadboat.com> wrote: > As I mentioned[1] peripherally back in November, that algorithm has been > insufficient since the introduction of non-XID-bearing transactions in > PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no running > transaction has an XID, RecentXmin == ReadNewTransactionId() and the page > incorrectly becomes available for immediate reuse. Good observation. > The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. We > already use RecentGlobalXmin when wal_level = hot_standby. If no running > transaction has an XID and all running transactions began since the last > transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId(). > Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <= > RecentGlobalXmin as we have today. This also cleanly removes the need for the > bit of code in _bt_getbuf() that decrements btpo.xact before sending it down > for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on > an unprincipled basis; it was just masking the off-by-one of using "<= > RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable(). Looks like the right fix. I'll apply this to 9.0/9.1/HEAD. > This change makes empty B-tree pages wait through two generations of running > transactions before reuse, so some additional bloat will arise. Could arise, in some circumstances. But that assumes VACUUMs are fairly frequent and that they would be delayed/rendered less effective by this, which I don't think will be the case. I note that we don't take any account of the number of pages that may be reused when we VACUUM, so when HOT avoids a VACUUM we may accumulate pages for a longer period. Looks like there is more work to be done yet in cleaning indexes. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 24, 2012 at 09:08:36AM +0100, Simon Riggs wrote: > On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch <noah@leadboat.com> wrote: > > The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. ?We > > already use RecentGlobalXmin when wal_level = hot_standby. ?If no running > > transaction has an XID and all running transactions began since the last > > transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId(). > > Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <= > > RecentGlobalXmin as we have today. ?This also cleanly removes the need for the > > bit of code in _bt_getbuf() that decrements btpo.xact before sending it down > > for ResolveRecoveryConflictWithSnapshot(). ?I suggested[2] that decrement on > > an unprincipled basis; it was just masking the off-by-one of using "<= > > RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable(). > > Looks like the right fix. I'll apply this to 9.0/9.1/HEAD. Thanks. 8.3 and 8.4 need it, too, albeit simplified due to some of the affected code not existing yet. > > This change makes empty B-tree pages wait through two generations of running > > transactions before reuse, so some additional bloat will arise. > > Could arise, in some circumstances. But that assumes VACUUMs are > fairly frequent and that they would be delayed/rendered less effective > by this, which I don't think will be the case. > > I note that we don't take any account of the number of pages that may > be reused when we VACUUM, so when HOT avoids a VACUUM we may > accumulate pages for a longer period. Looks like there is more work to > be done yet in cleaning indexes. Agreed. Taking one VACUUM visit to mark the page BTP_DELETED and another to add it to the FSM is fairly pessimistic; perhaps we can do better. nm