Thread: On-the-fly index tuple deletion vs. hot_standby
I have a hot_standby system and use it to bear the load of various reporting queries that take 15-60 minutes each. In an effort to avoid long pauses in recovery, I set a vacuum_defer_cleanup_age corresponding to roughly three hours of the master's transactions. Even so, I kept seeing recovery pause for the duration of a long-running query. In each case, the culprit record was an XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The attached test script demonstrates the behavior (on HEAD); the index tuple reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.

Since the inserting transaction aborts, HeapTupleSatisfiesVacuum reports HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age, and we go ahead and remove the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does not consider the inserting transaction's outcome, so btree_redo proceeds to conflict with snapshots that still have visibility over that transaction. Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

Thanks,
nm
Attachment
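A minimal sketch of the check being proposed here, as it might sit inside btree_xlog_delete_get_latestRemovedXid()'s loop over the to-be-deleted index tuples. This is a sketch only, not the eventual patch; the loop-local names (htuphdr, latestRemovedXid) and the placement are assumptions:

    /*
     * Sketch only -- not the committed fix.  Assume htuphdr is the heap
     * tuple header that the current index tuple points to, and
     * latestRemovedXid is the running maximum being computed for the
     * XLOG_BTREE_DELETE record.
     */
    TransactionId xmin = HeapTupleHeaderGetXmin(htuphdr);
    TransactionId xmax = HeapTupleHeaderGetXmax(htuphdr);

    /*
     * A tuple whose inserter aborted, or that was inserted and deleted
     * by one transaction, was never visible to any other snapshot, so
     * its removal cannot conflict with any standby query: skip it.
     */
    if (!TransactionIdDidCommit(xmin) || TransactionIdEquals(xmin, xmax))
        continue;

    if (TransactionIdFollows(xmax, latestRemovedXid))
        latestRemovedXid = xmax;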
On 29.11.2010 08:10, Noah Misch wrote:
> I have a hot_standby system and use it to bear the load of various reporting queries that take 15-60 minutes each. In an effort to avoid long pauses in recovery, I set a vacuum_defer_cleanup_age corresponding to roughly three hours of the master's transactions. Even so, I kept seeing recovery pause for the duration of a long-running query. In each case, the culprit record was an XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple. The attached test script demonstrates the behavior (on HEAD); the index tuple reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
>
> Since the inserting transaction aborts, HeapTupleSatisfiesVacuum reports HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age, and we go ahead and remove the index tuples. On the standby, btree_xlog_delete_get_latestRemovedXid does not consider the inserting transaction's outcome, so btree_redo proceeds to conflict with snapshots that still have visibility over that transaction. Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> On 29.11.2010 08:10, Noah Misch wrote:
> > [...] Could we correctly improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples of aborted transactions and tuples inserted and deleted within one transaction?

@Noah Easily the best bug report submitted in a long time. Thanks.

> Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().

Yes, it applies to other cases also. Thanks for the suggestion.

Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

--
Simon Riggs           http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
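For readers following along, the committed HeapTupleHeaderAdvanceLatestRemovedXid() ends up with approximately the following shape. This is a from-memory paraphrase that omits the HEAP_MOVED/xvac handling; the commit itself is authoritative:

    void
    HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
                                           TransactionId *latestRemovedXid)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
        TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

        /*
         * Advance latestRemovedXid only when the inserter is known to
         * have committed (hint bit already set, or pg_clog says so) and
         * the tuple was not inserted and deleted by the same
         * transaction.  Tuples of aborted inserters, and same-transaction
         * insert+delete, were never visible to any other snapshot, so
         * removing them can conflict with no standby query.
         */
        if ((tuple->t_infomask & HEAP_XMIN_COMMITTED) ||
            (!(tuple->t_infomask & HEAP_XMIN_INVALID) &&
             TransactionIdDidCommit(xmin)))
        {
            if (xmax != xmin &&
                TransactionIdFollows(xmax, *latestRemovedXid))
                *latestRemovedXid = xmax;
        }

        /* *latestRemovedXid may legitimately remain invalid here. */
    }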
On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> > Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() could just call HeapTupleHeaderAdvanceLatestRemovedXid().
>
> Yes, it applies to other cases also. Thanks for the suggestion.
>
> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

Thanks for making that change. For my understanding, why does the xmin == xmax special case in HeapTupleHeaderAdvanceLatestRemovedXid not require !HEAP_UPDATED, as the corresponding case in HeapTupleSatisfiesVacuum does? I can neither think of a recipe for triggering a problem as the code stands, nor come up with a sound explanation for why no such recipe can exist.

nm
On 10.12.2010 19:55, Noah Misch wrote:
> On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
>> Yes, it applies to other cases also. Thanks for the suggestion.
>>
>> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.
>
> Thanks for making that change. For my understanding, why does the xmin == xmax special case in HeapTupleHeaderAdvanceLatestRemovedXid not require !HEAP_UPDATED, as the corresponding case in HeapTupleSatisfiesVacuum does? I can neither think of a recipe for triggering a problem as the code stands, nor come up with a sound explanation for why no such recipe can exist.

The difference is in the purpose of those two functions. HeapTupleSatisfiesVacuum decides if a tuple can be safely vacuumed away. For that purpose, you can't remove a tuple from the middle of an update chain, even if that tuple was never visible to any other transaction, because someone might still want to follow the update chain to find the latest version of the row.

HeapTupleHeaderAdvanceLatestRemovedXid is used to decide if removing a tuple would conflict with in-progress hot standby queries. For that purpose, you don't need to care about breaking update chains, as Hot Standby is only used for read-only queries and read-only queries never follow update chains.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
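To make the distinction concrete, here is the narrower guard that Noah's question contemplates, written as a purely hypothetical variant of the xmin == xmax test (reusing the loop-local names assumed in the earlier sketch). Since standby queries never follow update chains, no such change is needed:

    /*
     * Hypothetical, for illustration only.  If read-only standby
     * queries could follow update chains, a tuple with xmin == xmax
     * could be ignored only when it is not a chain member, i.e. only
     * when HEAP_UPDATED is unset:
     */
    if (TransactionIdEquals(xmin, xmax) &&
        !(tuple->t_infomask & HEAP_UPDATED))
        continue;        /* never visible to anyone, and not mid-chain */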
On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> [...]
> Fix committed. Please double-check my work, committed early since I'm about to jump on a plane.

The installation that inspired my original report recently upgraded from 9.0.1 to 9.0.3, and your fix did significantly decrease its conflict frequency. The last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records. (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've attached a test script demonstrating the behavior. _bt_page_recyclable approves any page deleted no more recently than RecentXmin, because we need only ensure that every ongoing scan has witnessed the page as dead. For the hot standby case, we need to account for possibly-ongoing standby transactions. Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?

Thanks,
nm
Attachment
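For context, the test _bt_page_recyclable applies looks roughly like the following. This is a from-memory paraphrase of the 9.0-era nbtpage.c, not an exact copy; treat the exact comparison operator as approximate:

    bool
    _bt_page_recyclable(Page page)
    {
        BTPageOpaque opaque;

        /*
         * An all-zeroes page (e.g. left behind by a crashed relation
         * extension) can be reused immediately.
         */
        if (PageIsNew(page))
            return true;

        /*
         * Otherwise, recycle the page only if it was deleted long enough
         * ago that no scan running at deletion time can still hold a
         * link to it.  RecentXmin is the master-only horizon under
         * discussion; the proposal above is to use RecentGlobalXmin here
         * when standby transactions must be accounted for.
         */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        if (P_ISDELETED(opaque) &&
            TransactionIdPrecedesOrEquals(opaque->btpo.xact, RecentXmin))
            return true;

        return false;
    }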
On 12.03.2011 12:40, Noah Misch wrote:
> The installation that inspired my original report recently upgraded from 9.0.1 to 9.0.3, and your fix did significantly decrease its conflict frequency. The last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records. (FWIW, the index has generally been pg_attribute_relid_attnam_index.) I've attached a test script demonstrating the behavior. _bt_page_recyclable approves any page deleted no more recently than RecentXmin, because we need only ensure that every ongoing scan has witnessed the page as dead. For the hot standby case, we need to account for possibly-ongoing standby transactions. Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?

Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
> On 12.03.2011 12:40, Noah Misch wrote:
>> [...] Using RecentGlobalXmin covers that, albeit with some pessimism: we really only need LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N) - vacuum_defer_cleanup_age. Not sure the accounting to achieve that would pay off, though. Thoughts?
>
> Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.

That would be nice. Do you have an outline of an implementation in mind?
On Tue, Mar 15, 2011 at 10:22:59PM -0400, Noah Misch wrote:
> On Mon, Mar 14, 2011 at 01:56:22PM +0200, Heikki Linnakangas wrote:
>> Hmm, instead of bloating the master, I wonder if we could detect more accurately, in the standby, whether there are any ongoing scans. For example, you must hold a lock on the index to scan it, so only transactions holding the lock need to be checked for conflict.
>
> That would be nice. Do you have an outline of an implementation in mind?

In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.

I didn't see any way to take advantage of checking for the heavyweight lock that any index scan would need to hold. Such a check becomes obsolete the moment it's done -- nothing stops another locker from arriving between the completion of your check and whatever you do based on its outcome. When a standby is in the picture, the answer needs to be decided at xlog generation time but remain valid at xlog apply time. Therefore, I instead looked at what could be done with buffer locks.

Regarding the underlying requirement, here is the relevant paragraph from src/backend/access/nbtree/README:

    A deleted page can only be reclaimed once there is no scan or search that
    has a reference to it; until then, it must stay in place with its
    right-link undisturbed. We implement this by waiting until all
    transactions that were running at the time of deletion are dead; which is
    overly strong, but is simple to implement within Postgres. When marked
    dead, a deleted page is labeled with the next-transaction counter value.
    VACUUM can reclaim the page for re-use when this transaction number is
    older than the oldest open transaction.

These threads also have useful background as concerns hot standby:

http://archives.postgresql.org/message-id/23761.1265596434@sss.pgh.pa.us
http://archives.postgresql.org/message-id/4B7D3135.3090204@enterprisedb.com
http://archives.postgresql.org/message-id/1268123581.10620.76.camel@ebony

This is implemented in two passes. After updating the adjacent page links, the first pass stores ReadNewTransactionId() in the page. The second pass finishes the reuse if the stored xid is older than RecentXmin. This has the benefits of simplicity and low contention. However, any long-running transaction delays reuse. Also, since RecentXmin's witness of "all transactions that were running at the time of deletion" only includes master transactions, our tools for preventing recovery conflicts (vacuum_defer_cleanup_age, feedback) do not defend against conflicts arising from b-tree page reuse. Fixing that means either choosing different page reuse criteria or letting standby transactions also delay reuse on the master.

If I understand correctly, "has a reference to it" implies a pin on the left, right or parent page. By the time _bt_pagedel is ready to mark the page BTP_DELETED, we already hold exclusive content locks on all of those buffers. Suppose we then check the pin count on each. If they have only single local pins, we have serendipitously acquired cleanup locks: set btpo.xact to FrozenTransactionId, allowing reclamation at any time. Otherwise, continue as the code stands today. (A sketch of this check follows below.) Also change _bt_page_recyclable to use RecentGlobalXmin instead of RecentXmin; this avoids the recovery conflicts. Since, we hope, the vast majority of deleted pages will get FrozenTransactionId, the marginal bloat added by using RecentGlobalXmin will not be significant. This also saves a LW_SHARED acquisition of XidGenLock in the uncontended case -- probably not significant, but it can't hurt.

When the standby replays an XLOG_BTREE_DELETE_PAGE record with btpo_xact == FrozenTransactionId, it must wait for a cleanup lock on all the buffers it updates. We could actually do this anytime between the XLOG_BTREE_DELETE_PAGE and the later XLOG_BTREE_REUSE_PAGE, but we only have the necessary information handy when applying the first record.

Something this intrusive is clearly 9.2 material. It would be nice to have a backpatchable strategy to prop up vacuum_defer_cleanup_age and hot_standby_feedback in 9.0 and 9.1. For that, I haven't come up with anything better than my original pair of proposals.

Comments?

Thanks,
nm
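A minimal sketch of that pin-count check as it might sit in _bt_pagedel(). BufferHasSoleLocalPin() is a hypothetical primitive (no such function exists) standing in for "our pin is the only pin on the buffer, so together with the exclusive content lock we effectively hold a cleanup lock"; lbuf, rbuf and pbuf are assumed to be the already-locked left, right and parent buffers:

    /*
     * Sketch of the proposal, not committed code.  Assumes we already
     * hold exclusive content locks on the target's left sibling (lbuf),
     * right sibling (rbuf) and parent (pbuf), and are about to mark the
     * target BTP_DELETED.  BufferHasSoleLocalPin() is hypothetical.
     */
    if (BufferHasSoleLocalPin(lbuf) &&
        BufferHasSoleLocalPin(rbuf) &&
        BufferHasSoleLocalPin(pbuf))
    {
        /*
         * No concurrent scan can hold a reference to the dead page, so
         * it is reclaimable immediately, on master and standby alike.
         */
        opaque->btpo.xact = FrozenTransactionId;
    }
    else
    {
        /* Fall back to today's rule: wait out currently-running xacts. */
        opaque->btpo.xact = ReadNewTransactionId();
    }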
On Fri, Apr 22, 2011 at 11:10:34AM -0400, Noah Misch wrote:
> In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.

[full proposal at http://archives.postgresql.org/message-id/20110422151034.GA8150@tornado.gateway.2wire.net]

Anyone care to comment? On this system, which has vacuum_defer_cleanup_age set to three peak hours' worth of xid consumption, the problem caps recovery-conflict holdoff at 10-20 minutes. It will have the same effect on standby feedback in 9.1. I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.

Thanks,
nm
On Fri, Apr 22, 2011 at 11:10 AM, Noah Misch <noah@leadboat.com> wrote:
> In an attempt to resuscitate this thread, here's my own shot at that. Apologies in advance if it's just an already-burning straw man.
>
> I didn't see any way to take advantage of checking for the heavyweight lock that any index scan would need to hold.

Have you looked at the logic in ResolveRecoveryConflictWithLock(), and at GetLockConflicts()?

I am a little fuzzy on how the btree stuff works, but it seems to me that you are looking for transactions that both have an xmin before some threshold and also hold an AccessShareLock on some relation. GetLockConflicts() will provide the latter, at least.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

On Sat, Jun 11, 2011 at 08:55:28PM -0400, Robert Haas wrote:
> Have you looked at the logic in ResolveRecoveryConflictWithLock(), and at GetLockConflicts()?
>
> I am a little fuzzy on how the btree stuff works, but it seems to me that you are looking for transactions that both have an xmin before some threshold and also hold an AccessShareLock on some relation. GetLockConflicts() will provide the latter, at least.

Thanks for taking a look. For the purpose of B-tree page reuse, we don't directly care about the xmin of any active snapshot. We need only prove that no active scan is paused adjacent to the page, holding a right-link to it. We currently achieve that wait-free by first marking the page with the next available xid and then reusing it when that mark (btpo.xact) predates the oldest running xid (RecentXmin). (At the moment, I'm failing to work out why this is OK with scans from transactions that haven't allocated an xid, but I vaguely recall convincing myself it was fine at one point.)

It would indeed also be enough to call GetLockConflicts(locktag-of-index, AccessExclusiveLock) and check whether any of the returned transactions have PGPROC.xmin below the mark; see the sketch after this message. That's notably more expensive than just comparing RecentXmin, so I'm not sure how well it would pay off overall. However, it could only help us on the master. (Not strictly true, but any way I see to extend it to the standby has critical flaws.) On the master, we can see a conflicting transaction and put off reusing the page. By the time the record hits the standby, we have to apply it, and we might have a running transaction that will hold a lock on the index for the next, say, 72 hours. At such times, vacuum_defer_cleanup_age or hot_standby_feedback ought to prevent the recovery stall.

This did lead me to realize that what we do in this regard on the standby can be considerably independent from what we do on the master. If fruitful, the standby can prove the absence of a scan holding a right-link in a completely different fashion. So, we *could* take the cleanup-lock approach on the standby without changing very much on the master.

nm
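A sketch of that master-side check, assuming the 9.0-era GetLockConflicts() signature. VirtualXactGetRunningXmin() is a hypothetical lookup (reading the backend's PGPROC.xmin for a virtual xid) that does not exist today, and indexrel/opaque are assumed surrounding variables:

    /*
     * Sketch only.  Returns false if some current lock holder on the
     * index might still hold a right-link to the deleted page.
     * VirtualXactGetRunningXmin() is hypothetical.
     */
    LOCKTAG     tag;
    VirtualTransactionId *vxids;

    SET_LOCKTAG_RELATION(tag, MyDatabaseId, RelationGetRelid(indexrel));

    /* Everything conflicting with AccessExclusiveLock = every holder. */
    vxids = GetLockConflicts(&tag, AccessExclusiveLock);

    for (; VirtualTransactionIdIsValid(*vxids); vxids++)
    {
        TransactionId scan_xmin = VirtualXactGetRunningXmin(*vxids);

        /* A scan whose xmin precedes the mark may predate deletion. */
        if (TransactionIdIsValid(scan_xmin) &&
            TransactionIdPrecedes(scan_xmin, opaque->btpo.xact))
            return false;
    }
    return true;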
On Sat, Jun 11, 2011 at 11:40 PM, Noah Misch <noah@leadboat.com> wrote:
> We currently achieve that wait-free by first marking the page with the next available xid and then reusing it when that mark (btpo.xact) predates the oldest running xid (RecentXmin). [...]
>
> This did lead me to realize that what we do in this regard on the standby can be considerably independent from what we do on the master. If fruitful, the standby can prove the absence of a scan holding a right-link in a completely different fashion. So, we *could* take the cleanup-lock approach on the standby without changing very much on the master.

Well, I'm generally in favor of trying to fix this problem without changing what the master does. It's a weakness of our replication technology that the standby has no better way to cope with a cleanup operation on the master than to start killing queries, but then again it's a weakness of our MVCC technology that we don't reuse space quickly enough and end up with bloat. I hear a lot more complaints about the second weakness than I do about the first.

At any rate, if taking a cleanup lock on the right-linked page on the standby is sufficient to fix the problem, that seems like a far superior solution in any case. Presumably the frequency of someone having a pin on that particular page will be far lower than any matching based on XID or heavyweight locks. And the vast majority of such pins should disappear before the startup process feels obliged to get out its big hammer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 12, 2011 at 12:15:29AM -0400, Robert Haas wrote:
> Well, I'm generally in favor of trying to fix this problem without changing what the master does. It's a weakness of our replication technology that the standby has no better way to cope with a cleanup operation on the master than to start killing queries, but then again it's a weakness of our MVCC technology that we don't reuse space quickly enough and end up with bloat. I hear a lot more complaints about the second weakness than I do about the first.

I fully agree. That said, if this works on the standby, we may as well also use it opportunistically on the master, to throttle bloat.

> At any rate, if taking a cleanup lock on the right-linked page on the standby is sufficient to fix the problem, that seems like a far superior solution in any case. Presumably the frequency of someone having a pin on that particular page will be far lower than any matching based on XID or heavyweight locks. And the vast majority of such pins should disappear before the startup process feels obliged to get out its big hammer.

Yep; looks promising.

Does such a thing have a chance of being backpatchable? I think the chances start slim and fall almost to zero on account of the difficulty of avoiding a WAL format change. Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?

Thanks,
nm
On Sun, Jun 12, 2011 at 3:01 PM, Noah Misch <noah@leadboat.com> wrote:
> I fully agree. That said, if this works on the standby, we may as well also use it opportunistically on the master, to throttle bloat.

As long as the performance cost is de minimis, I agree.

> Does such a thing have a chance of being backpatchable? I think the chances start slim and fall almost to zero on account of the difficulty of avoiding a WAL format change.

I can't see back-patching it. Maybe people would feel it was worth considering if we were getting legions of complaints about this problem, but so far you're the only one. In any case, back-patching a WAL format change is a complete non-starter -- we can't go making minor versions non-interoperable.

> Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?

I'm pretty disinclined to go tinkering with 9.1 at this point, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 13, 2011 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 12, 2011 at 3:01 PM, Noah Misch <noah@leadboat.com> wrote:
>> Assuming that conclusion, I do think it's worth starting with something simple, even if it means additional bloat on the master in the wal_level=hot_standby + vacuum_defer_cleanup_age / hot_standby_feedback case. In choosing those settings, the administrator has taken constructive steps to accept master-side bloat in exchange for delaying recovery conflict. What's your opinion?
>
> I'm pretty disinclined to go tinkering with 9.1 at this point, too.

Not least because a feature already exists in 9.1 to cope with this problem: hot standby feedback.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 9, 2011 at 10:38 PM, Noah Misch <noah@leadboat.com> wrote:
> Anyone care to comment? On this system, which has vacuum_defer_cleanup_age set to three peak hours' worth of xid consumption, the problem caps recovery-conflict holdoff at 10-20 minutes. It will have the same effect on standby feedback in 9.1. I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.

I like your ideas.

(Also, I note that using xids in this way unnecessarily keeps bloat around for a long time, if we have periods of mostly read-only activity. Interesting point.)

I think we would only get away with this approach on leaf pages of the index. It doesn't seem worth trying for the locks if we were higher up.

On the standby side, it's possible this could generate additional buffer pin deadlocks and/or contention. So I would also want to look at some deferral mechanism, so that we can mark the block removed, but not actually do so until some time later, or until we really need to, for example when we write new data to that page.

Got time for a patch in this coming CF?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 13, 2011 at 04:16:06PM +0100, Simon Riggs wrote:
> On Mon, Jun 13, 2011 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm pretty disinclined to go tinkering with 9.1 at this point, too.
>
> Not least because a feature already exists in 9.1 to cope with this problem: hot standby feedback.

A standby's receipt of an XLOG_BTREE_REUSE_PAGE record implies that the accompanying latestRemovedXid preceded or equaled the master's RecentXmin at the time of issue (see _bt_page_recyclable()). Neither hot_standby_feedback nor vacuum_defer_cleanup_age affects RecentXmin. Therefore, neither facility delays conflicts arising directly from B-tree page reuse. See the attached test script, which yields a snapshot conflict despite active hot_standby_feedback.

Thanks,
nm
Attachment
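To see why feedback cannot help here, a from-memory paraphrase of the two halves of the reuse path (the emission side in _bt_getbuf() and the redo side in nbtxlog.c); treat field and variable details as approximate:

    /*
     * Master, in _bt_getbuf(), when recycling a deleted page from the
     * FSM (paraphrase): the conflict horizon is simply the page's
     * btpo.xact, which _bt_page_recyclable() has already compared
     * against RecentXmin -- a horizon that neither hot_standby_feedback
     * nor vacuum_defer_cleanup_age moves.
     */
    if (XLogStandbyInfoActive())
    {
        xl_btree_reuse_page xlrec_reuse;

        xlrec_reuse.node = rel->rd_node;
        xlrec_reuse.block = blkno;
        xlrec_reuse.latestRemovedXid = opaque->btpo.xact;
        /* ... XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE, ...) ... */
    }

    /*
     * Standby, in btree_xlog_reuse_page() (paraphrase): the record
     * exists only as a conflict point; redo cancels any query whose
     * xmin is at or below the transmitted horizon.
     */
    if (InHotStandby)
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
                                            xlrec->node);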
On Mon, Jun 13, 2011 at 04:41:11PM +0100, Simon Riggs wrote:
> On Thu, Jun 9, 2011 at 10:38 PM, Noah Misch <noah@leadboat.com> wrote:
>> Anyone care to comment? [...] I think we should start by using RecentGlobalXmin instead of RecentXmin as the reuse horizon when wal_level = hot_standby, and backpatch that. Then, independently consider for master a bloat-avoidance improvement like the one I outlined most recently; I'm not sure whether that's worth it. In any event, I'm hoping to get some consensus on the way forward.
>
> I like your ideas.
>
> (Also, I note that using xids in this way unnecessarily keeps bloat around for a long time, if we have periods of mostly read-only activity. Interesting point.)

Thanks. Yes, the current approach can mean two long-running-transaction lifetimes before full space reclamation. Would be nice to avoid.

> I think we would only get away with this approach on leaf pages of the index. It doesn't seem worth trying for the locks if we were higher up.

Don't we only delete leaf pages? (And, one might say, pages that have become leaves by having all children deleted.)

> On the standby side, it's possible this could generate additional buffer pin deadlocks and/or contention. So I would also want to look at some deferral mechanism, so that we can mark the block removed, but not actually do so until some time later, or until we really need to, for example when we write new data to that page.

That could be handy. I do wonder what high pin-hold durations arise regularly in the field. Preventing standby buffer pin deadlocks would be a nice win in any case.

> Got time for a patch in this coming CF?

Not for the June CF, unfortunately. Making a suitable test suite will be a large portion of the work. The logic in question closes race conditions that probably only arise under the heaviest field concurrency, so I'll need to systematically visit every concurrency variation to have confidence that it's correct.

nm
On Tue, Jun 14, 2011 at 5:28 AM, Noah Misch <noah@leadboat.com> wrote:
> A standby's receipt of an XLOG_BTREE_REUSE_PAGE record implies that the accompanying latestRemovedXid preceded or equaled the master's RecentXmin at the time of issue (see _bt_page_recyclable()). Neither hot_standby_feedback nor vacuum_defer_cleanup_age affects RecentXmin. Therefore, neither facility delays conflicts arising directly from B-tree page reuse. See the attached test script, which yields a snapshot conflict despite active hot_standby_feedback.

OK, agreed. Bug. Good catch, Noah.

Fix is to use RecentGlobalXmin for the cutoff when in Hot Standby mode, so that it is under user control.

Attached patch will be applied to head and backpatched to 9.1 and 9.0 to fix this. No effect on non-users of Hot Standby. Minimally invasive for HS users.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
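One possible shape of the cutoff change Simon describes, sketched here only as a reading aid. The authoritative version is the attached patch; whether the committed code gates on XLogStandbyInfoActive() or switches horizons unconditionally is not shown in this thread:

    /*
     * Sketch of the described fix inside _bt_page_recyclable(); not the
     * attached patch itself.  With standby info in WAL, honor the
     * horizon that vacuum_defer_cleanup_age / hot_standby_feedback push
     * back, putting the reuse point under user control.
     */
    TransactionId cutoff;

    cutoff = XLogStandbyInfoActive() ? RecentGlobalXmin : RecentXmin;

    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedesOrEquals(opaque->btpo.xact, cutoff))
        return true;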
On Thu, Jun 16, 2011 at 12:02:47AM +0100, Simon Riggs wrote:
> OK, agreed. Bug. Good catch, Noah.
>
> Fix is to use RecentGlobalXmin for the cutoff when in Hot Standby mode, so that it is under user control.
>
> Attached patch will be applied to head and backpatched to 9.1 and 9.0 to fix this.

Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. This happens because the tests at nbtpage.c:704 and procarray.c:1843 both pass when the xid is exactly that of the oldest standby transaction (line numbers as of git cb94db91b). I only know this because the test script from my last message hits this case; it might never get hit in real usage. Still, it seems like a hole not worth leaving. I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid; see the sketch below. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.

nm
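A sketch of that off-by-one fix at the point where _bt_getbuf() fills in the reuse record (field names from the 9.0-era xl_btree_reuse_page; exact placement assumed):

    /* Sketch of the suggested fix, in the reuse-page WAL path. */
    TransactionId latestRemovedXid = opaque->btpo.xact;

    /*
     * btpo.xact is the first xid known SAFE, while latestRemovedXid must
     * name the last xid known UNSAFE, so step back by one.  Note that
     * TransactionIdRetreat() is a macro that modifies its argument in
     * place (wrapping around the permanent xids) and returns nothing.
     */
    TransactionIdRetreat(latestRemovedXid);

    xlrec_reuse.latestRemovedXid = latestRemovedXid;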
On Thu, Jun 16, 2011 at 3:47 PM, Noah Misch <noah@leadboat.com> wrote:
> Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. This happens because the tests at nbtpage.c:704 and procarray.c:1843 both pass when the xid is exactly that of the oldest standby transaction (line numbers as of git cb94db91b). I only know this because the test script from my last message hits this case; it might never get hit in real usage. Still, it seems like a hole not worth leaving. I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.

I think you are pointing out another bug, rather than a problem in my last commit.

The bug was caused by assuming that the xid is a "latestRemovedXid", as is the case in the rest of Hot Standby, which masks the off-by-one error through poor use of terms.

I agree with your suggested fix. Thanks again.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jun 16, 2011 at 4:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I agree with your suggested fix.

Please ignore the previous patch, which was sent in error. Here's the fix. I'll apply this tomorrow morning if we all still agree.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jun 16, 2011 at 04:53:41PM +0100, Simon Riggs wrote:
> On Thu, Jun 16, 2011 at 3:47 PM, Noah Misch <noah@leadboat.com> wrote:
>> Thanks. We still hit a conflict when btpo.xact == RecentGlobalXmin and the standby has a transaction older than any master transaction. [...] I think the most-correct fix is to TransactionIdRetreat the btpo.xact before using it as xl_btree_reuse_page.latestRemovedXid. btpo.xact is the first known-safe xid, but latestRemovedXid is the last known-unsafe xmin.
>
> I think you are pointing out another bug, rather than a problem in my last commit.
>
> The bug was caused by assuming that the xid is a "latestRemovedXid", as is the case in the rest of Hot Standby, which masks the off-by-one error through poor use of terms.

Exactly.

> I agree with your suggested fix.

(Note that TransactionIdRetreat mutates its argument and returns nothing.)

Thanks,
nm