BUG #14966: Related to #14702 / corruption in replication
From: dennis.noordsij@alumni.helsinki.fi
Date:
The following bug has been logged on the website:

Bug reference:      14966
Logged by:          Dennis Noordsij
Email address:      dennis.noordsij@alumni.helsinki.fi
PostgreSQL version: 10.1
Operating system:   FreeBSD
Description:

Searching for information led me to bug #14702, which seems to describe my problem, so here is some more information.

This is asynchronous replication with wal_level = replica, using replication slots, and it had been running uneventfully. At some point I changed max_connections on the master and restarted it, which resulted in the following on the slave:

postgres[665]: [8-1] 2017-12-04 10:04:00.890 CET [665] FATAL: could not send end-of-streaming message to primary: no COPY in progress
postgres[2547]: [6-1] 2017-12-04 10:04:01.057 CET [2547] FATAL: could not connect to the primary server: server closed the connection unexpectedly
postgres[2547]: [6-2]   This probably means the server terminated abnormally
postgres[2547]: [6-3]   before or while processing the request.
postgres[661]: [11-1] 2017-12-04 10:04:07.799 CET [661] FATAL: hot standby is not possible because max_connections = 100 is a lower setting than on the master server (its value was 512)
postgres[661]: [11-2] 2017-12-04 10:04:07.799 CET [661] CONTEXT: WAL redo at 8/B2F4208 for XLOG/PARAMETER_CHANGE: max_connections=512 max_worker_processes=8 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=replica wal_log_hints=off track_commit_timestamp=off

I raised max_connections on the slave and restarted it, and replication continued (a quick way to compare the setting on both sides is sketched at the end of this report).

A few days later I added the following to the master and restarted it:

shared_preload_libraries = 'pg_stat_statements'
track_activities = on
track_counts = on
track_io_timing = on
track_functions = all           # none, pl, all
track_activity_query_size = 4096

I don't have the slave's logs from that point, but judging by the timestamps of the last WAL segment, replication stopped shortly afterwards. Any attempt to restart the slave then gave:

FATAL: invalid memory alloc request size 1466851328

Changing memory settings (work_mem etc.) had no effect. The information in #14702 led me to empty pg_wal on the slave (all segments were still available on the master), after which the slave restarted and replication resumed; rough steps are sketched at the end of this report. Note that statement tracking is (still) not enabled on the slave.

The slave has since caught up, and a pg_dumpall on the master completed without issues. Both master and slave store their data on ZFS pools that show zero issues (scrubbed).

I still have 5 WAL files from both the master's and the slave's pg_wal directories (presumably the correct versions and the corrupted ones, and presumably only the last/active segment is relevant); they all differ in their md5 checksums. Is there a way to decode their contents and compare them, if that would help pinpoint the problem? (A sketch of what I have in mind follows below.)

Thank you!
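On the max_connections requirement: hot standby needs the standby's value to be at least the primary's, as the FATAL message above states. A minimal way to compare the two, assuming local psql access on each host (connection options are illustrative):

# Run on both the master and the slave; the slave's value must be >= the
# master's for hot standby. Add -h/-p/-U as needed for your setup.
psql -c 'SHOW max_connections;'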
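For reference, emptying pg_wal on the slave amounted to roughly the following (a sketch; $PGDATA and the rc script name are illustrative, and the segments were moved aside rather than deleted so they stay available for comparison):

# With the standby stopped, move its WAL segments out of the way so the
# next start streams them again from the master via the replication slot.
service postgresql stop                 # FreeBSD rc script; name may vary
mkdir /tmp/pg_wal_saved
mv $PGDATA/pg_wal/0* /tmp/pg_wal_saved/
service postgresql start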
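As for decoding: pg_waldump, which ships with PostgreSQL 10, renders WAL records as text, so the master's and slave's copies of the same segment can be diffed. A sketch of what I have in mind, with hypothetical paths and segment names:

# Decode each side's copy of the same WAL segment to text, then diff.
# Paths and the segment name are placeholders; substitute the real files.
pg_waldump /copies/master/000000010000000800000003 > master.out
pg_waldump /copies/slave/000000010000000800000003 > slave.out
diff master.out slave.out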