Re: [BUG] standby node can not provide service even it replays all log files - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [BUG] standby node can not provide service even it replays all log files
Date
Msg-id 20191029.135719.784886453123056051.horikyota.ntt@gmail.com
In response to Re: Re: [BUG] standby node can not provide service even it replays all log files  (Thunder <thunder1@126.com>)
Responses Re: [BUG] standby node can not provide service even it replays all log files
List pgsql-hackers
At Thu, 24 Oct 2019 17:37:52 +0800 (CST), Thunder  <thunder1@126.com> wrote in 
> Thanks for the reply. I feel confused about the snapshot.
> 
> At 2019-10-23 11:51:19, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >Hello.
> >
> >At Tue, 22 Oct 2019 20:42:21 +0800 (CST), Thunder  <thunder1@126.com> wrote in 
> >> Update the patch.
> >> 
> >> 1. The STANDBY_SNAPSHOT_PENDING state is set when we replay the first
> >>    XLOG_RUNNING_XACTS and the subtransaction ids have overflowed.
> >> 2. When we log XLOG_RUNNING_XACTS on the master node, can we assume that
> >>    all xact IDs < oldestRunningXid are considered finished?
> >
> >Unfortunately we can't. Standby needs to know that the *standby's*
> >oldest active xid exceeds the pending xmin, not master's. And it is
> >already processed in ProcArrayApplyRecoveryInfo. We cannot assume that
> >the oldest xids are the same on both sides of a replication pair.
> 
> 
> This issue occurs when master does not commit the transaction which has lots
> of sub transactions, while we restart or create a new standby node.
> The standby node can not provide service because of this issue.
> Can the standby have any active xid while it can not provide service?

The problem is not the xid but the snapshot, that is, the information
on which xids are not yet committed on the master. The standby cannot
determine which rows should be visible without that information. The
xid list is maintained using incoming commit records and vanishes on
restart. So the restarted standby needs a non-subxid-overflown
XLOG_RUNNING_XACTS to make sure the xid list is complete.
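
To illustrate why the standby stays stuck, here is a rough toy model in
C of the state transition described above. It is only a sketch, not the
PostgreSQL source: the type and function names (Standby, RunningXacts,
apply_running_xacts) are made up for illustration, xid wraparound is
ignored, and the real logic in ProcArrayApplyRecoveryInfo does more
than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the PostgreSQL definitions. */
typedef uint32_t Xid;

typedef enum {
    SNAPSHOT_INITIALIZED,   /* no running-xacts record replayed yet */
    SNAPSHOT_PENDING,       /* xid list may be incomplete (subxid overflow) */
    SNAPSHOT_READY          /* xid list complete, queries can be served */
} SnapState;

typedef struct {
    Xid  next_xid;            /* next xid to be assigned on the master */
    Xid  oldest_running_xid;  /* oldest xid still running on the master */
    bool subxid_overflow;     /* some subxids were not included in the record */
} RunningXacts;

typedef struct {
    SnapState state;
    Xid       pending_xmin;   /* xids at or below this may be missing from the list */
} Standby;

/* Replay one running-xacts record on the (toy) standby. */
static void apply_running_xacts(Standby *s, const RunningXacts *r)
{
    if (s->state == SNAPSHOT_READY)
        return;

    if (s->state == SNAPSHOT_PENDING) {
        /*
         * Ready only if this record is complete, or if every xid that
         * could have been missing has finished by now.
         * Assumption: plain integer comparison, no wraparound handling.
         */
        if (!r->subxid_overflow || r->oldest_running_xid > s->pending_xmin)
            s->state = SNAPSHOT_READY;
        return;
    }

    /* First record after (re)start. */
    if (r->subxid_overflow) {
        s->state = SNAPSHOT_PENDING;
        s->pending_xmin = r->next_xid - 1;
    } else {
        s->state = SNAPSHOT_READY;
    }
}
```

In this model, a master transaction with xid 100 that has overflowed
subxids and never commits keeps oldest_running_xid at 100 in every
record, so a standby that started at next_xid 500 never leaves
SNAPSHOT_PENDING, however much WAL it replays.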

> >> 3. If we can assume this, when we replay XLOG_RUNNING_XACTS and change
> >>    standbyState to STANDBY_SNAPSHOT_PENDING, can we record
> >>    oldestRunningXid to a shared variable, like
> >>    procArray->oldest_running_xid?
> >> 4. On the standby node, when GetSnapshotData is called and
> >>    procArray->oldest_running_xid is valid, can we set xmin to be
> >>    procArray->oldest_running_xid?
> >> 
> >> Appreciate any suggestion to this issue.

So, somehow we need to complete the KnownAssignedTransactionIds even
if there are subxid-overflown transactions. As mentioned upthread,
I think we have at least the following choices.

- Send back the complete xid list for START REPLICATION command from
  walreceiver.

- Have the first XLOG_RUNNING_XACTS issued after a standby comes in
  carry the complete xid list while a subxid-overflown transaction
  is alive.

I think the first is better.

Any suggestions?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
