Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts
Date
Msg-id CANP8+jLtrRs951zBgjoOp3r=HHKJpPvL0=WF4KUteec-9NQSjg@mail.gmail.com
Whole thread Raw
In response to [HACKERS] Potential hot-standby bug around xacts committed but inxl_running_xacts  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] Potential hot-standby bug around xacts committed but in xl_running_xacts  (Craig Ringer <craig@2ndquadrant.com>)
[HACKERS] Re: Potential hot-standby bug around xacts committed but inxl_running_xacts  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 1 May 2017 at 22:38, Andres Freund <andres@anarazel.de> wrote:

> The thread below http://archives.postgresql.org/message-id/f37e975c-908f-858e-707f-058d3b1eb214%402ndquadrant.com
> describes an issue in logical decoding that arises because
> xl_running_xacts' contents aren't necessarily coherent with the contents
> of the WAL, because RecordTransactionCommit() / RecordTransactionAbort()
> don't have any interlock against the procarray.  That means
> xl_running_xacts can contain transactions assumed to be running, that
> already have their commit/abort records WAL logged.

That is known/by design.

> I think that's not just problematic in logical decoding, but also
> Hot-Standby.  Consider the following:
>
> ProcArrayApplyRecoveryInfo() gets an xl_running_xacts record that's not
> suboverflowed, and thus will change to STANDBY_SNAPSHOT_READY.

Yes

> In that
> case it'll populate the KnownAssignedXids machinery using
> KnownAssignedXidsAdd().

No, because of this section of code in ProcArrayApplyRecoveryInfo()

/** The running-xacts snapshot can contain xids that were still visible* in the procarray when the snapshot was taken,
butwere already* WAL-logged as completed. They're not running anymore, so ignore* them.*/if
(TransactionIdDidCommit(xid)|| TransactionIdDidAbort(xid))        continue;
 

> Am I missing something that protects against the above scenario?

It does seem so, yes. The locking issues in Hot Standby are complex,
but they do seem to be correct, thanks for reviewing them.

This is documented in multiple locations, including what I thought was
your own comment in LogStandbySnapshot().

What I suggest is that with logical decoding in mind we do this
1. Inject a new record XLOG_SNAPSHOT_START at the start of
LogStandbySnapshot(). We start logical decoding from there.
2. Record any transactions that end
3. Now the full XLOG_RUNNING_XACTS record arrives. We apply all xacts
that are seen as running, minus any ended between 1 and 3

This avoids the problems for the race but without holding locks while
we log XLOG_RUNNING_XACTS, something that was considered painful for
Hot Standby.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: [HACKERS] Transition tables for triggers on foreign tables and views
Next
From: Craig Ringer
Date:
Subject: Re: [HACKERS] CTE inlining