Re: hot_standby_feedback vs excludeVacuum and snapshots - Mailing list pgsql-hackers

From Noah Misch
Subject Re: hot_standby_feedback vs excludeVacuum and snapshots
Date
Msg-id 20180704202756.GA269861@rfd.leadboat.com
Whole thread Raw
In response to Re: hot_standby_feedback vs excludeVacuum and snapshots  (Andres Freund <andres@anarazel.de>)
Responses Re: hot_standby_feedback vs excludeVacuum and snapshots
List pgsql-hackers
On Fri, Jun 08, 2018 at 11:03:38AM -0700, Andres Freund wrote:
> On 2018-06-08 09:23:02 +0100, Simon Riggs wrote:
> > For context, AEL locks are normally removed by COMMIT or ABORT.
> > StandbyReleaseOldLocks() is just a sweeper to catch anything that
> > didn't send an abort before it died, so it hardly ever activates. The
> > coding of StandbyReleaseOldLocks() is backwards... if it ain't in the
> > running list, then we remove it.
> > 
> > But that wasn't working correctly either, since as of 49bff5300d527 we
> > assigned AELs to subxids. Subxids weren't in the running list and so
> > AELs held by them would have been removed at the wrong time, an extant
> > bug in PG10. It looks to me like they would have been removed either
> > early or late, up to the next runningxact info record. They would be
> > removed, so no leakage, but the late timing wouldn't be noticeable for
> > tests or most usage, since it would look just like lock contention.
> > Early release might give same issue of block access to non-existent
> > block/file.
> 
> Yikes, that's kinda bad. It can also just cause plain crashes, no? The
> on-disk / catalog state isn't necessarily consistent during DDL, which
> is why we hold AE locks. At the very least it can cause corruption of
> in-use relcache entries and such.

Yes.  If I'm reading the commit message right, the committed code also
releases locks too early:

> commit 15378c1
> Author:     Simon Riggs <simon@2ndQuadrant.com>
> AuthorDate: Sat Jun 16 14:03:29 2018 +0100
> Commit:     Simon Riggs <simon@2ndQuadrant.com>
> CommitDate: Sat Jun 16 14:03:29 2018 +0100
> 
>     Remove AELs from subxids correctly on standby
>     
>     Issues relate only to subtransactions that hold AccessExclusiveLocks
>     when replayed on standby.
>     
>     Prior to PG10, aborting subtransactions that held an
>     AccessExclusiveLock failed to release the lock until top level commit or
>     abort. 49bff5300d527 fixed that.
>     
>     However, 49bff5300d527 also introduced a similar bug where subtransaction
>     commit would fail to release an AccessExclusiveLock, leaving the lock to
>     be removed sometimes early and sometimes late. This commit fixes
>     that bug also. Backpatch to PG10 needed.

Subtransaction commit is too early to release an arbitrary
AccessExclusiveLock.  The primary releases every AccessExclusiveLock at
top-level transaction commit, top-level transaction abort, or subtransaction
abort.  CommitSubTransaction() doesn't do that; it transfers locks to the
parent sub(xact).  Standby nodes can't safely remove an arbitrary lock earlier
than the primary would.


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Legacy GiST invalid tuples
Next
From: Alvaro Herrera
Date:
Subject: Re: Legacy GiST invalid tuples