Hi,
I was trying to get it to replicate, but I couldn't in controlled
environment, so I have no repeatable proof of concept, but we've seen
the problem at least 4 times.
Configuration:
PostgreSQL 9.5.15 on Ubuntu bionic.
Streaming replication from master to slave, working without problem.
max_worker_processes set to 8 on both servers.
Preparation:
At some point in time, to accommodate upcoming (yes, really) upgrade to
pg12, we pushed change that write "max_worker_processes = 50" to all
configs, but didn't restart Pgs.
Fail scenario:
Master died (underlying AWS EC2 instance got hosed, was restarted).
Master got back up, but loaded new configuration from conf file.
Immediately it crashed slave with message:
2020-04-09 15:55:45.653 CEST 1304738 LOG: started streaming WAL from primary at 3/D000000 on timeline 1
2020-04-09 15:55:45.679 CEST 505710 FATAL: hot standby is not possible because max_worker_processes = 8 is a lower
settingthan on the master server (its value was 50)
2020-04-09 15:55:45.679 CEST 505710 CONTEXT: xlog redo XLOG/PARAMETER_CHANGE: max_connections=100
max_worker_processes=50max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical wal_log_hints=off
track_commit_timestamp=off
(this message is from my test environment, but the idea is clear).
Given that we have max_worker_processes = 50 in config on slave, we
tried to restart only to be greeted by:
2020-04-07T15:13:49.729943+00:00 postgres[20491]: [7-1] db=,user= LOG: restored log file "000000030001779200000061"
fromarchive
2020-04-07T15:13:49.757222+00:00 postgres[20491]: [8-1] db=,user= FATAL: could not access status of transaction
4275781146
2020-04-07T15:13:49.757314+00:00 postgres[20491]: [8-2] db=,user= DETAIL: Could not read from file
"pg_commit_ts/27D4B"at offset 245760: Success.
2020-04-07T15:13:49.757380+00:00 postgres[20491]: [8-3] db=,user= CONTEXT: xlog redo Transaction/COMMIT: 2020-04-07
02:40:10.065859+00
2020-04-07T15:13:49.761239+00:00 postgres[20487]: [2-1] db=,user= LOG: startup process (PID 20491) exited with exit
code1
2020-04-07T15:13:49.761387+00:00 postgres[20487]: [3-1] db=,user= LOG: terminating any other active server processes
We had similar case (could not access status of transaction) on at least
4 separate pg installations, so it looks to be related to the
restart of master while bad max_worker_processes are set.
We have since rolled back the change to max_worker_processes, and will
be doing it in much saner/safer way, but the fact that slave can't start
after such master restart is troubling.
Unfortunately we don't have any instances broken in this way anymore,
and we don't have logs either - we had to rebuild the Pg setup
(master/slave) ASAP, and thus the evidence is lost.
Is there anything we could have done to fix it once it happened?
Is it a known problem?
Or maybe the max_worker_processes mismatch is redherring and the problem
was simply because master died while there was heavy write traffic with
lots of transactions?
Best regards,
depesz