Hello.
At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigwise@comcast.net> wrote in
<1526905137308-0.post@n3.nabble.com>
> Hello.
>
> We are on Postgresql version 9.6.6. We have 2 EC2 instances in different
> Amazon regions and we are doing physical replication via VPN. It all seems
> to work just fine most of the time. I'm noticing in the logs that we have
> recurring errors (maybe 10 or 12 times per day) that look like this:
<following is digested>
> 2018-05-17 06:36:14 UTC 5af0599f.210d LOG: invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC 5afd22de.7ac4 LOG: started streaming WAL from
> 2018-05-17 07:20:17 UTC 5afd22de.7ac4 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
> Or some that also look like this:
>
> 2018-05-17 07:20:17 UTC 5af0599f.210d LOG: record with incorrect prev-link
> 2018-05-17 07:20:18 UTC 5afd2d31.1889 LOG: started streaming WAL from
> 2018-05-17 08:03:28 UTC 5afd2d31.1889 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
> And some like this:
>
> 2018-05-17 23:00:13 UTC 5afd63ec.26fc LOG: invalid magic number 0000 in
> log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC 5afe097d.49aa LOG: started streaming WAL from
> primary at 385/3C000000 on timeline 1
Your replication connection seems quite unstable and gets
disconnected frequently. After a disconnection, you will see
several kinds of "I found a broken record in my WAL file"
messages; they are the cues for the standby to switch back to
streaming. This in itself is normal operation for PostgreSQL,
with one known exception.
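If you want to watch the connection from the standby side, a quick
check is the pg_stat_wal_receiver view (available since 9.6; run it
on the standby):

```sql
-- On the standby: walreceiver status, the last WAL position
-- received, and when the last message arrived from the primary.
SELECT status, received_lsn, last_msg_receipt_time
  FROM pg_stat_wal_receiver;
```

If the row disappears and reappears around the times of those log
messages, that matches the disconnect/reconnect pattern above.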
> Then, like maybe once every couple months or so, we have a crash with logs
> looking like this:
> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING:
> terminating connection because of crash of another server process
I think these lines follow an error message like "FATAL: invalid
memory alloc request size 3075129344". This is a kind of "broken
record", but it is known to cause the standby to crash. It is
discussed here:
> [bug fix] Cascaded standby cannot start after a clean shutdown
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05#0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
> When this last error occurs, the recovery is to go on the replica and remove
> all the WAL logs from the pg_xlog directory and then restart Postgresql.
> Everything seems to recover and come up fine. I've done some tests
> comparing counts between the replica and the primary and everything seems
> synced just fine from all I can tell.
Those are the right recovery steps, as far as I can tell from the
attached log messages.
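To double-check that the standby has caught up after such a restart,
you can compare WAL positions on both sides (the function names
below are the 9.6 ones; they were renamed in later releases):

```sql
-- On the primary:
SELECT pg_current_xlog_location();

-- On the standby:
SELECT pg_last_xlog_receive_location(),
       pg_last_xlog_replay_location();
```

When the standby's receive/replay locations track the primary's
current location, replication has resumed normally.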
> So, a couple of questions. 1) Should I be worried that my replica is
> corrupt in some way or given that everything *seems* ok, is it reasonable to
> believe that things are working correctly in spite of these errors being
> reported. 2) Is there something I should configure differently to avoid
> some of these errors?
It doesn't seem worth worrying about from the viewpoint of data
integrity, but if the walsender/walreceiver timeouts fire too
frequently, you might need to increase them for better stability.
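For example, in postgresql.conf (both settings default to 60s; the
values below are just a starting point, not a recommendation):

```
# On the primary:
wal_sender_timeout = '5min'

# On the standby:
wal_receiver_timeout = '5min'
```

Larger values make the connection more tolerant of a flaky VPN link,
at the cost of taking longer to notice a genuinely dead peer.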
> Thanks in advance for any help.
>
> Greig Wise
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center