BUG #14966: Related to #14702 / corruption in replication
From: dennis.noordsij@alumni.helsinki.fi
Date:
The following bug has been logged on the website:

Bug reference:      14966
Logged by:          Dennis Noordsij
Email address:      dennis.noordsij@alumni.helsinki.fi
PostgreSQL version: 10.1
Operating system:   FreeBSD
Description:

Searching for information led me to bug #14702, which seems to describe my problem, so here is some more information.

This is asynchronous replication with wal_level = replica, using replication slots, and it had been running uneventfully. At some point I changed max_connections on the master and restarted it, which resulted in the following on the slave:

postgres[665]: [8-1] 2017-12-04 10:04:00.890 CET [665] FATAL: could not send end-of-streaming message to primary: no COPY in progress
postgres[2547]: [6-1] 2017-12-04 10:04:01.057 CET [2547] FATAL: could not connect to the primary server: server closed the connection unexpectedly
postgres[2547]: [6-2]   This probably means the server terminated abnormally
postgres[2547]: [6-3]   before or while processing the request.
postgres[661]: [11-1] 2017-12-04 10:04:07.799 CET [661] FATAL: hot standby is not possible because max_connections = 100 is a lower setting than on the master server (its value was 512)
postgres[661]: [11-2] 2017-12-04 10:04:07.799 CET [661] CONTEXT: WAL redo at 8/B2F4208 for XLOG/PARAMETER_CHANGE: max_connections=512 max_worker_processes=8 max_prepared_xacts=0 max_locks_per_xact=64 wal_level=replica wal_log_hints=off track_commit_timestamp=off

I raised max_connections on the slave and restarted it, and replication continued (a quick way to compare the setting on both sides is sketched at the end of this report).

A few days later I added the following to the master and restarted it:

shared_preload_libraries = 'pg_stat_statements'
track_activities = on
track_counts = on
track_io_timing = on
track_functions = all           # none, pl, all
track_activity_query_size = 4096

I don't have the slave's logs from that point, but judging by the timestamps of the last WAL segment, replication stopped shortly afterwards. Any attempt to restart the slave then gave:

FATAL: invalid memory alloc request size 1466851328

Changing memory settings (work_mem etc.) had no effect. The information in #14702 led me to empty pg_wal on the slave (all segments were still available on the master), after which the slave restarted and replication resumed; rough steps are sketched at the end of this report. Note that statement tracking is (still) not enabled on the slave.

The slave has since caught up, and a pg_dumpall on the master completed without issues. Both master and slave store their data on ZFS pools that show zero issues (scrubbed).

I still have 5 WAL files from both the master's and the slave's pg_wal directories (presumably the correct versions and the corrupted ones, and presumably only the last/active segment is relevant); they all differ in their md5 checksums. Is there a way to decode their contents and compare them, if that would help pinpoint the problem? (A sketch of what I have in mind follows below.)

Thank you!
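On the max_connections requirement: hot standby needs the standby's value to be at least the primary's, as the FATAL message above states. A minimal way to compare the two, assuming local psql access on each host (connection options are illustrative):

# Run on both the master and the slave; the slave's value must be >= the
# master's for hot standby. Add -h/-p/-U as needed for your setup.
psql -c 'SHOW max_connections;'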
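For reference, emptying pg_wal on the slave amounted to roughly the following (a sketch; $PGDATA and the rc script name are illustrative, and the segments were moved aside rather than deleted so they stay available for comparison):

# With the standby stopped, move its WAL segments out of the way so the
# next start streams them again from the master via the replication slot.
service postgresql stop                 # FreeBSD rc script; name may vary
mkdir /tmp/pg_wal_saved
mv $PGDATA/pg_wal/0* /tmp/pg_wal_saved/
service postgresql start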
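As for decoding: pg_waldump, which ships with PostgreSQL 10, renders WAL records as text, so the master's and slave's copies of the same segment can be diffed. A sketch of what I have in mind, with hypothetical paths and segment names:

# Decode each side's copy of the same WAL segment to text, then diff.
# Paths and the segment name are placeholders; substitute the real files.
pg_waldump /copies/master/000000010000000800000003 > master.out
pg_waldump /copies/slave/000000010000000800000003 > slave.out
diff master.out slave.out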