Thread: Loss of replication after simple misconfiguration

Loss of replication after simple misconfiguration

From

hubert depesz lubaczewski

Date:

09 April 2020, 14:08:36

Hi,

I was trying to get it to replicate, but I couldn't in controlled
environment, so I have no repeatable proof of concept, but we've seen
the problem at least 4 times.

Configuration:

PostgreSQL 9.5.15 on Ubuntu bionic.
Streaming replication from master to slave, working without problem.
max_worker_processes set to 8 on both servers.

Preparation:
At some point in time, to accommodate upcoming (yes, really) upgrade to
pg12, we pushed change that write "max_worker_processes = 50" to all
configs, but didn't restart Pgs.

Fail scenario:
Master died (underlying AWS EC2 instance got hosed, was restarted).
Master got back up, but loaded new configuration from conf file.

Immediately it crashed slave with message:
2020-04-09 15:55:45.653 CEST  1304738   LOG:  started streaming WAL from primary at 3/D000000 on timeline 1
2020-04-09 15:55:45.679 CEST  505710   FATAL:  hot standby is not possible because max_worker_processes = 8 is a lower
settingthan on the master server (its value was 50)
 
2020-04-09 15:55:45.679 CEST  505710   CONTEXT:  xlog redo XLOG/PARAMETER_CHANGE: max_connections=100
max_worker_processes=50max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off
track_commit_timestamp=off
(this message is from my test environment, but the idea is clear).

Given that we have max_worker_processes = 50 in config on slave, we
tried to restart only to be greeted by:
2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG:  restored log file "000000030001779200000061"
fromarchive
 
2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL:  could not access status of transaction
4275781146
2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL:  Could not read from file
"pg_commit_ts/27D4B"at offset 245760: Success.
 
2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT:  xlog redo Transaction/COMMIT: 2020-04-07
02:40:10.065859+00
2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG:  startup process (PID 20491) exited with exit
code1
 
2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG:  terminating any other active server processes

We had similar case (could not access status of transaction) on at least
4 separate pg installations, so it looks to be related to the
restart of master while bad max_worker_processes are set.

We have since rolled back the change to max_worker_processes, and will
be doing it in much saner/safer way, but the fact that slave can't start
after such master restart is troubling.

Unfortunately we don't have any instances broken in this way anymore,
and we don't have logs either - we had to rebuild the Pg setup
(master/slave) ASAP, and thus the evidence is lost.

Is there anything we could have done to fix it once it happened?

Is it a known problem?

Or maybe the max_worker_processes mismatch is redherring and the problem
was simply because master died while there was heavy write traffic with
lots of transactions?

Best regards,

depesz

Re: Loss of replication after simple misconfiguration

From

Andrew Gierth

Date:

09 April 2020, 16:19:09

>>>>> "hubert" == hubert depesz lubaczewski <depesz@depesz.com> writes:

 hubert> PostgreSQL 9.5.15 on Ubuntu bionic.
 [...]
 hubert> tried to restart only to be greeted by:
 hubert> 2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG:  restored log file
"000000030001779200000061"from archive
 
 hubert> 2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL:  could not access status of
transaction4275781146
 
 hubert> 2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL:  Could not read from file
"pg_commit_ts/27D4B"at offset 245760: Success.
 
 hubert> 2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT:  xlog redo Transaction/COMMIT:
2020-04-0702:40:10.065859+00
 
 hubert> 2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG:  startup process (PID 20491) exited
withexit code 1
 
 hubert> 2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG:  terminating any other active server
processes

So I've been assisting hubert with analysis of this on IRC, and what we
have found so far suggests:

1. the max_worker_processes thing is a red herring

2. It is virtually certain that the restart, in addition to changing
max_worker_processes on the master, also changed the master's setting of
track_commit_timestamp from off to on (which is clearly relevant to the
issue)

(We established #2 from the fact that we _do_ have the WAL files from
the failed recovery, and they don't contain any COMMIT_TS_ZEROPAGE
records despite covering many thousands of transactions.)

I've suggested trying to reproduce the issue by changing this parameter
across a crash.

I did notice that 9.5.15 does have a fix for an issue in this area, but
I didn't see any more recent changes - did I miss anything?

-- 
Andrew (irc:RhodiumToad)

Re: Loss of replication after simple misconfiguration

From

Victor Yegorov

Date:

09 April 2020, 16:48:17

чт, 9 апр. 2020 г. в 19:19, Andrew Gierth <andrew@tao11.riddles.org.uk>:

So I've been assisting hubert with analysis of this on IRC, and what we
have found so far suggests:

1. the max_worker_processes thing is a red herring

2. It is virtually certain that the restart, in addition to changing
max_worker_processes on the master, also changed the master's setting of
track_commit_timestamp from off to on (which is clearly relevant to the
issue)

…

I did notice that 9.5.15 does have a fix for an issue in this area, but
I didn't see any more recent changes - did I miss anything?

We've hit similar issue last week, but on 11.5 — we had track_commit_timestamp enabled on master after switchover,
replica failed to start.

It might be, that fix was here: https://git.postgresql.org/pg/commitdiff/180feb8c7e
(For 9.5 branch it is: https://git.postgresql.org/pg/commitdiff/69a5686368)

We're not in the position to test it again, though…

Victor Yegorov

Re: Loss of replication after simple misconfiguration

From

Michael Paquier

Date:

10 April 2020, 04:14:34

On Thu, Apr 09, 2020 at 07:48:17PM +0300, Victor Yegorov wrote:
> We've hit similar issue last week, but on 11.5 — we
> had track_commit_timestamp enabled on master after switchover,
> replica failed to start.
>
> It might be, that fix was here:
> https://git.postgresql.org/pg/commitdiff/180feb8c7e
> (For 9.5 branch it is: https://git.postgresql.org/pg/commitdiff/69a5686368)
>
> We're not in the position to test it again, though…

Hmm.  We have a gap in tests here as we don't have any tests stressing
switchovers when it comes to track_commit_timestamps.  Anyway, could
you confirm that I got the problem right?  Here is the flow I am getting
from the information of upthread, roughly:
1) Primary/standby cluster, both using max_worker_processes = 8, and
track_commit_timestamp = off.
2) In order to begin the switchover, first stop cleanly the primary.
3) Update configuration of the standby as follows, promote it and
restart it:
track_commit_timestamp = on
max_worker_processes = 50
4) Enable streaming on the old primary to make it a standby, starting
it fails because of the unmatching setting for max_worker_processes.
5) Re-adjust max_worker_processes correctly on the new standby, start
it.  Then this startup should fail at the lookup of pg_commit_ts/.

I have been able to write a TAP test to reproduce this exact scenario,
though it succeeds for me (it could be a good idea to add some
coverage for that actually..).  Perhaps I am missing a step though?
For example, perhaps the original setting was track_commit_timestamp =
on, then it got disabled once?
--
Michael

Attachment

Re: Loss of replication after simple misconfiguration

From

Michael Paquier

Date:

10 April 2020, 05:43:15

On Fri, Apr 10, 2020 at 01:14:34PM +0900, Michael Paquier wrote:
> I have been able to write a TAP test to reproduce this exact scenario,
> though it succeeds for me (it could be a good idea to add some
> coverage for that actually..).  Perhaps I am missing a step though?
> For example, perhaps the original setting was track_commit_timestamp =
> on, then it got disabled once?

I have been looking at more scenarios involving switching back and
forth track_commit_timestamps on/off during a switchover with a
standby failing to replay a XLOG_CHANGE_PARAMETER record during
recovery, still I cannot see a failure in scenarios close to what is
discussed here (the test includes transactions replayed with and
without the switch).  Attached is a more advanced test.  Any thoughts?
--
Michael

Attachment

Re: Loss of replication after simple misconfiguration

From

hubert depesz lubaczewski

Date:

10 April 2020, 07:26:51

On Fri, Apr 10, 2020 at 01:14:34PM +0900, Michael Paquier wrote:
> Hmm.  We have a gap in tests here as we don't have any tests stressing
> switchovers when it comes to track_commit_timestamps.  Anyway, could
> you confirm that I got the problem right?  Here is the flow I am getting
> from the information of upthread, roughly:
> 1) Primary/standby cluster, both using max_worker_processes = 8, and
> track_commit_timestamp = off.
> 2) In order to begin the switchover, first stop cleanly the primary.
> 3) Update configuration of the standby as follows, promote it and
> restart it:
> track_commit_timestamp = on
> max_worker_processes = 50
> 4) Enable streaming on the old primary to make it a standby, starting
> it fails because of the unmatching setting for max_worker_processes.
> 5) Re-adjust max_worker_processes correctly on the new standby, start
> it.  Then this startup should fail at the lookup of pg_commit_ts/.

Well, no.

In our case it was *at least* this scenario:

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp on
3. slave gets restarted
4. config files get changed on both to include max_worker_processes = 50
5. master gets stopped by "power outage"
6. after master re-starts, replication to slave dies.

but it could have been also different scenario

1. master and slave both with max_worker_processes and
track_commit_timestamp off.
2. config files get changed on both to include track_commit_timestamp on
3. slave gets restarted (or maybe not, we can't be sure)
4. config files get changed on both to include max_worker_processes = 50
5. set of 2 new slaves (slave2 and slave3) are setup off slave, both
   with max_worker_processes = 50, and track_commit_timestamps = on
6. slave3 is modified to stream off slave2
7. master crash
8. after restars one of slaves (many?) lost its replication

Andrew suggested yesterday on IRC that it could be timing issue, so
testing for it might be complicated - hence my inability to replicate
the problem in test environment.

I will try to do the tests using extended scenarios with slave2 and
slave3, but I'm not overly optimistic about replicating this particular
case.

Best regards,

depesz

Re: Loss of replication after simple misconfiguration

From

Michael Paquier

Date:

10 April 2020, 07:59:32

On Fri, Apr 10, 2020 at 09:26:51AM +0200, hubert depesz lubaczewski wrote:
> In our case it was *at least* this scenario:
>
> 1. master and slave both with max_worker_processes and
> track_commit_timestamp off.
> 2. config files get changed on both to include track_commit_timestamp on
> 3. slave gets restarted
> 4. config files get changed on both to include max_worker_processes = 50
> 5. master gets stopped by "power outage"
> 6. after master re-starts, replication to slave dies.

Without the standby restarted here, its configuration remains at the
former value of max_worker_processes, which is lower than the setting
of the primary, so it would logically stop in this case if not
restarted once it replays the XLOG_PARAMETER_CHANGE record generated
from the primary.

> Andrew suggested yesterday on IRC that it could be timing issue, so
> testing for it might be complicated - hence my inability to replicate
> the problem in test environment.

I am actually wondering if we may not be putting our finger on another
issue related to the minimum consistency LSN here.

> I will try to do the tests using extended scenarios with slave2 and
> slave3, but I'm not overly optimistic about replicating this particular
> case.

Thanks.
--
Michael

Attachment

signature.asc

Re: Loss of replication after simple misconfiguration

From

hubert depesz lubaczewski

Date:

10 April 2020, 08:01:22

On Fri, Apr 10, 2020 at 04:59:32PM +0900, Michael Paquier wrote:
> On Fri, Apr 10, 2020 at 09:26:51AM +0200, hubert depesz lubaczewski wrote:
> > In our case it was *at least* this scenario:
> >
> > 1. master and slave both with max_worker_processes and
> > track_commit_timestamp off.
> > 2. config files get changed on both to include track_commit_timestamp on
> > 3. slave gets restarted
> > 4. config files get changed on both to include max_worker_processes = 50
> > 5. master gets stopped by "power outage"
> > 6. after master re-starts, replication to slave dies.
>
> Without the standby restarted here, its configuration remains at the
> former value of max_worker_processes, which is lower than the setting
> of the primary, so it would logically stop in this case if not
> restarted once it replays the XLOG_PARAMETER_CHANGE record generated
> from the primary.

Yes, I know. And this is *precisely* what has happened in at least one
case.
But I would assume that after restart, standby should be working. Which
didn't happen (in at least one case, we had lots of restarts recently,
and only ~ 4 cases of replication dying).

Best regards,

depesz