I assume it would be related to the following:
LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
since the walreceiver terminates just after this, but I'm unclear on
what precisely this means. Without digging into the code, I would
guess that it's unable to verify the checksum on the segment it just
received from the master. Since there are multiple replicas here, that
would point to an issue on this particular client. That said, it happens
everywhere -- we have ~16 replicas across 3 different clusters (on
different versions) and we see this uniformly across them all at
seemingly random times. Also, just to clarify, this only happens
on a single replica at a time.
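
For anyone who wants to look at the record named in that log line, here
is a rough sketch of how the LSN could be mapped to a WAL segment file
name and a byte offset. It assumes the default 16 MB segment size and
timeline 1 (neither is confirmed anywhere in this thread), and the
function names are made up purely for illustration.

```python
# Sketch only: map an LSN such as "2D6/C259AB90" to a WAL segment file name.
# Assumptions (not confirmed in this thread): default 16 MB segments, timeline 1.
# Function names are hypothetical helpers, not part of any PostgreSQL tool.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size in bytes


def parse_lsn(lsn: str) -> int:
    """Convert 'X/Y' (two hex halves) into a single 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(lsn: str, timeline: int = 1,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-hex-digit name of the segment file containing this LSN."""
    segno = parse_lsn(lsn) // segment_size
    segs_per_id = 0x100000000 // segment_size  # segments per 4 GB "log id"
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)


def segment_offset(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Return the byte offset of this LSN inside its segment file."""
    return parse_lsn(lsn) % segment_size


if __name__ == "__main__":
    lsn = "2D6/C259AB90"
    print(wal_file_name(lsn), hex(segment_offset(lsn)))
```

Under those assumptions the record at 2D6/C259AB90 would sit in segment
00000001000002D6000000C2; the first 8 digits change if the actual timeline
is not 1. Comparing that segment on the master and on the affected
replica, e.g. with pg_waldump (pg_xlogdump before version 10), might show
whether the data arrived damaged or was damaged locally.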
On Thu, Apr 23, 2020 at 2:46 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin King <kingpin867@gmail.com> writes:
> > > We've seen unexpected termination of the WAL receiver process. This
> > > stops streaming replication, but the replica stays available --
> > > restarting the server resumes streaming replication where it left off.
> > > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > > 11.x, 12.x) -- anything omitted is one we haven't used.
> >
> > > I don't have an explanation for the cause, but I was able to set
> > > logging to "debug5" and run an strace of the walreceiver PID when it
> > > eventually happened. It appears as if the SIGTERM is coming from the
> > > "postgres: startup" process.
> >
> > The startup process intentionally SIGTERMs the walreceiver under
> > various circumstances, so I'm not sure that there's any surprise
> > here. Have you checked the postmaster log?
> >
> > regards, tom lane
>
> Yep, I included "debug5" output of the postmaster log in the initial post.