Re: Loss of replication after simple misconfiguration - Mailing list pgsql-bugs

From hubert depesz lubaczewski
Subject Re: Loss of replication after simple misconfiguration
Msg-id 20200410072651.GA16098@depesz.com
In response to Re: Loss of replication after simple misconfiguration  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Loss of replication after simple misconfiguration
List pgsql-bugs
On Fri, Apr 10, 2020 at 01:14:34PM +0900, Michael Paquier wrote:
> Hmm.  We have a gap in tests here as we don't have any tests stressing
> switchovers when it comes to track_commit_timestamps.  Anyway, could
> you confirm that I got the problem right?  Here is the flow I am getting
> from the information of upthread, roughly:
> 1) Primary/standby cluster, both using max_worker_processes = 8, and
> track_commit_timestamp = off.
> 2) In order to begin the switchover, first stop cleanly the primary.
> 3) Update configuration of the standby as follows, promote it and
> restart it:
> track_commit_timestamp = on
> max_worker_processes = 50
> 4) Enable streaming on the old primary to make it a standby, starting
> it fails because of the unmatching setting for max_worker_processes.
> 5) Re-adjust max_worker_processes correctly on the new standby, start
> it.  Then this startup should fail at the lookup of pg_commit_ts/.

Well, no.

In our case it was *at least* this scenario:

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp on
3. slave gets restarted
4. config files get changed on both to include max_worker_processes = 50
5. master gets stopped by "power outage"
6. after master re-starts, replication to slave dies.

but it could also have been a different scenario:

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp on
3. slave gets restarted (or maybe not, we can't be sure)
4. config files get changed on both to include max_worker_processes = 50
5. a set of 2 new slaves (slave2 and slave3) is set up off slave, both
   with max_worker_processes = 50 and track_commit_timestamp = on
6. slave3 is modified to stream off slave2
7. master crash
8. after restarts, one of the slaves (maybe several?) lost its replication
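
In both scenarios the config files change before the nodes are restarted,
so at any given moment a slave can be running with a lower
max_worker_processes than the master it replays from. As far as I
understand the hot standby documentation, a standby needs a handful of
settings to be at least as high as on its primary. Below is a minimal
sketch of such a check. The function name, the dict layout and the
sample values (8 and 50) are only illustrative assumptions, not anything
taken from our actual servers:

```python
# Parameters that a hot standby is expected to set at least as high as
# its primary (my reading of the PostgreSQL hot standby documentation).
STANDBY_MIN_PARAMS = (
    "max_connections",
    "max_prepared_transactions",
    "max_locks_per_transaction",
    "max_wal_senders",
    "max_worker_processes",
)


def find_too_low(primary, standby):
    """Return {name: (primary_value, standby_value)} for each checked
    parameter that is present on both nodes and lower on the standby."""
    problems = {}
    for name in STANDBY_MIN_PARAMS:
        if name in primary and name in standby:
            if standby[name] < primary[name]:
                problems[name] = (primary[name], standby[name])
    return problems


# Hypothetical values: master already restarted with the new setting,
# slave still running with the old one.
master = {"max_worker_processes": 50, "max_connections": 100}
slave = {"max_worker_processes": 8, "max_connections": 100}

print(find_too_low(master, slave))
# prints {'max_worker_processes': (50, 8)}
```

This only compares values that someone has already collected from each
node (for example via SHOW); it does not say anything about the
pg_commit_ts side of the problem.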

Andrew suggested yesterday on IRC that it could be a timing issue, so
testing for it might be complicated - hence my inability to replicate
the problem in a test environment.

I will try to do the tests using extended scenarios with slave2 and
slave3, but I'm not overly optimistic about replicating this particular
case.

Best regards,

depesz



