Re: On-the-fly index tuple deletion vs. hot_standby - Mailing list pgsql-hackers

From Noah Misch
Subject Re: On-the-fly index tuple deletion vs. hot_standby
Date
Msg-id 20110312104039.GA21992@tornado.gateway.2wire.net
Whole thread Raw
In response to Re: On-the-fly index tuple deletion vs. hot_standby  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: On-the-fly index tuple deletion vs. hot_standby
List pgsql-hackers
On Thu, Dec 09, 2010 at 09:48:25AM +0000, Simon Riggs wrote:
> On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
> > On 29.11.2010 08:10, Noah Misch wrote:
> > > I have a hot_standby system and use it to bear the load of various reporting
> > > queries that take 15-60 minutes each.  In an effort to avoid long pauses in
> > > recovery, I set a vacuum_defer_cleanup_age constituting roughly three hours of
> > > the master's transactions.  Even so, I kept seeing recovery pause for the
> > > duration of a long-running query.  In each case, the culprit record was an
> > > XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple.  The
> > > attached test script demonstrates the behavior (on HEAD); the index tuple
> > > reclamation conflicts with a concurrent "SELECT pg_sleep(600)" on the standby.
> > >
> > > Since this inserting transaction aborts, HeapTupleSatisfiesVacuum reports
> > > HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age.  We go ahead and remove
> > > the index tuples.  On the standby, btree_xlog_delete_get_latestRemovedXid does
> > > not regard the inserting-transaction outcome, so btree_redo proceeds to conflict
> > > with snapshots having visibility over that transaction.  Could we correctly
> > > improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore tuples
> > > of aborted transactions and tuples inserted and deleted within one transaction?
>
> @Noah Easily the best bug reported submitted in a long time. Thanks.
>
> > Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need
> > similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid()
> > could just call HeapTupleHeaderAdvanceLatestRemoveXid().
>
> Yes, it applies to other cases also. Thanks for the suggestion.
>
> Fix committed. Please double-check my work, committed early since I'm
> about to jump on a plane.

The installation that inspired my original report recently upgraded from 9.0.1
to 9.0.3, and your fix did significantly decrease its conflict frequency.  The
last several conflicts I have captured involve XLOG_BTREE_REUSE_PAGE records.
(FWIW, the index has generally been pg_attribute_relid_attnam_index.)  I've
attached a test script demonstrating the behavior.  _bt_page_recyclable approves
any page deleted no more recently than RecentXmin, because we need only ensure
that every ongoing scan has witnessed the page as dead.  For the hot standby
case, we need to account for possibly-ongoing standby transactions.  Using
RecentGlobalXmin covers that, albeit with some pessimism: we really only need
LEAST(RecentXmin, PGPROC->xmin of walsender_1, .., PGPROC->xmin of walsender_N)
- vacuum_defer_cleanup_age.  Not sure the accounting to achieve that would pay
off, though.  Thoughts?

Thanks,
nm

Attachment

pgsql-hackers by date:

Previous
From: Dave Page
Date:
Subject: Re: template0 database comment
Next
From: Daniel Farina
Date:
Subject: Re: Fwd: index corruption in PG 8.3.13