Thread: walreceiver termination

walreceiver termination

From
Justin King
Date:
We've seen unexpected termination of the WAL receiver process.  This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG (9.4, 9.5,
11.x, 12.x) -- anything omitted is one we haven't used.

I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walreceiver PID when it
eventually happened.  It appears as if the SIGTERM is coming from the
"postgres: startup" process.  The only notable thing I've seen
discussed in the past that might be relevant here is that these
replicas all reside on a ZFS filesystem (on Linux).  In fact,
everywhere we've seen the issue, the database sits on a ZFS pool.  We
also have another PG cluster running on non-ZFS (ext4) and have never
experienced this problem.
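
(The strace below ties the signal to its sender via si_pid; for what
it's worth, the same information can be captured in-process with a
SIGTERM handler installed with SA_SIGINFO.  A minimal stand-alone
sketch, nothing PostgreSQL-specific:)

    /* log the PID/UID that sent us SIGTERM */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void on_sigterm(int sig, siginfo_t *info, void *ctx)
    {
        (void) sig;
        (void) ctx;
        /* fprintf isn't async-signal-safe; fine for a throwaway test */
        fprintf(stderr, "SIGTERM from pid %ld (uid %ld)\n",
                (long) info->si_pid, (long) info->si_uid);
    }

    int main(void)
    {
        struct sigaction sa = {0};

        sa.sa_sigaction = on_sigterm;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGTERM, &sa, NULL);

        for (;;)
            pause();            /* wait for a signal */
    }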

Any suggestions?

Logging snippets are below:

postgresql.log
Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A6E0 for
Btree/INSERT_LEAF: off 88
Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A720 for
Heap/HEAP_CONFIRM: off 6
Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG:  record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT:  WAL redo at 2D6/C259A750 for
Heap/INSERT: off 7 flags 0x0C
Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23
09:09:13.663 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23
09:09:13.663 GMT [2299] FATAL:  terminating walreceiver process due to
administrator command
Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1): 1 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1): 6 on_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  proc_exit(1): 2 callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  exit(1)
Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  incorrect resource manager data
checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG:  switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG:  shmem_exit(-1): 0 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23
09:09:13.664 GMT [2229] LOG:  incorrect resource manager data checksum
in record at 2D

process id list:
2224 ?        Ss     0:24 /usr/lib/postgresql/12/bin/postgres -D
/var/lib/postgresql/12/main -c
unix_socket_directories=/var/run/postgresql -c
config_file=/etc/postgresql/12/main/postgresql.conf -c
hba_file=/etc/postgresql/12/main/pg_hba.conf -c
ident_file=/etc/postgresql/12/main/pg_ident.conf -c
external_pid_file=/var/run/postgresql/12-main.pid
 2229 ?        Ss     7:02 postgres: startup
 2230 ?        Ss     6:09 postgres: checkpointer
 2231 ?        Ss     0:08 postgres: background writer
 2298 ?        Ss     1:29 postgres: stats collector
 2299 ?        Ss     6:11 postgres: walreceiver

strace of walreceiver:
kill(2229, SIGUSR1)                     = 0
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  sending write
2D6/C25A0000 flush 2D6/C25A0000 apply 2D6/C25915F0", 156,
MSG_NOSIGNAL, NULL, 0) = 156
sendto(4<socket:[208548975]>,

"\x64\x00\x00\x00\x26\x72\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x59\x15\xf0\x00\x02\x46\xf0\x5f\x04\xbe\x9e\x00",
39, MSG_NOSIGNAL, NULL, 0) = 39
epoll_create1(EPOLL_CLOEXEC)            = 6<anon_inode:[eventpoll]>
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
10<pipe:[208548970]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640776,
u64=94863290307528}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
7<pipe:[208548681]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640800,
u64=94863290307552}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
4<socket:[208548975]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640824,
u64=94863290307576}}) = 0
epoll_wait(6<anon_inode:[eventpoll]>, 0x564714b89410, 1, 100) = -1
EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2229, si_uid=112} ---
write(11<pipe:[208548970]>, "\x00", 1)  = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
close(6<anon_inode:[eventpoll]>)        = 0
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<131>Apr 23 09:09:13 postgres[2299]:
[1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL:  terminating
walreceiver process due to administrator command", 152, MSG_NOSIGNAL,
NULL, 0) = 152
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1):
1 before_shmem_exit callbacks to make", 144, MSG_NOSIGNAL, NULL, 0) =
144
getpid()                                = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG:  shmem_exit(1):
6 on_shmem_exit callbacks to make", 140, MSG_NOSIGNAL, NULL, 0) = 140
sendto(4<socket:[208548975]>, "\x58\x00\x00\x00\x04", 5, MSG_NOSIGNAL,
NULL, 0) = 5


Any suggestions?

Thanks-
Justin



Re: walreceiver termination

From
Tom Lane
Date:
Justin King <kingpin867@gmail.com> writes:
> We've seen unexpected termination of the WAL receiver process.  This
> stops streaming replication, but the replica stays available --
> restarting the server resumes streaming replication where it left off.
> We've seen this across nearly every recent version of PG (9.4, 9.5,
> 11.x, 12.x) -- anything omitted is one we haven't used.

> I don't have an explanation for the cause, but I was able to set
> logging to "debug5" and run an strace of the walreceiver PID when it
> eventually happened.  It appears as if the SIGTERM is coming from the
> "postgres: startup" process.

The startup process intentionally SIGTERMs the walreceiver under
various circumstances, so I'm not sure that there's any surprise
here.  Have you checked the postmaster log?

            regards, tom lane
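
(For reference: the "various circumstances" include exactly the pattern
visible in the debug log above -- on a failure while streaming, the
startup process shuts the walreceiver down with SIGTERM and retries the
record from the archive/pg_wal, then flips back to streaming, which
normally launches a new walreceiver.  That is where the repeated
"switched WAL source ... after failure" lines come from.  A heavily
simplified, stand-alone sketch of that loop; every name below is made
up, the real logic lives in WaitForWALToBecomeAvailable() and
ShutdownWalRcv() in the PostgreSQL sources:)

    #include <signal.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/types.h>

    enum wal_source { SOURCE_ARCHIVE, SOURCE_STREAM };

    static pid_t walreceiver_pid = 0;   /* 2299 in the report above */

    /* stand-in: pretend the record at the failing LSN never validates */
    static bool read_record(enum wal_source src)
    {
        (void) src;
        return false;
    }

    static void shutdown_walreceiver(void)
    {
        if (walreceiver_pid != 0)
            kill(walreceiver_pid, SIGTERM);  /* the SIGTERM in the strace */
    }

    int main(void)
    {
        enum wal_source source = SOURCE_STREAM;

        for (int attempt = 0; attempt < 4; attempt++)
        {
            if (read_record(source))
                break;

            if (source == SOURCE_STREAM)
            {
                /* bad data while streaming: stop the walreceiver, then
                 * retry the same record from the archive / pg_wal */
                shutdown_walreceiver();
                source = SOURCE_ARCHIVE;
                puts("switched WAL source from stream to archive after failure");
            }
            else
            {
                /* the archive couldn't supply it either: go back to
                 * streaming, which normally starts a new walreceiver */
                source = SOURCE_STREAM;
                puts("switched WAL source from archive to stream after failure");
            }
        }
        return 0;
    }

(In the log above the same record keeps failing from both sources,
which is why those two messages alternate.)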



Re: walreceiver termination

From
Justin King
Date:
On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin King <kingpin867@gmail.com> writes:
> > We've seen unexpected termination of the WAL receiver process.  This
> > stops streaming replication, but the replica stays available --
> > restarting the server resumes streaming replication where it left off.
> > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > 11.x, 12.x) -- anything omitted is one we haven't used.
>
> > I don't have an explanation for the cause, but I was able to set
> > logging to "debug5" and run an strace of the walreceiver PID when it
> > eventually happened.  It appears as if the SIGTERM is coming from the
> > "postgres: startup" process.
>
> The startup process intentionally SIGTERMs the walreceiver under
> various circumstances, so I'm not sure that there's any surprise
> here.  Have you checked the postmaster log?
>
>                         regards, tom lane

Yep, I included "debug5" output of the postmaster log in the initial post.



Re: walreceiver termination

From
Justin King
Date:
I assume it would be related to the following:

LOG:  incorrect resource manager data checksum in record at 2D6/C259AB90

since the walreceiver terminates just after this - but I'm unclear
what precisely this means.  Without digging into the code, I would
guess that it's unable to verify the checksum on the segment it just
received from the master; since there are multiple replicas here and
only this one failed, that would point to an issue on this client.
That said, it happens everywhere -- we have ~16 replicas across 3
different clusters (on different versions) and we see this uniformly
across them all at seemingly random times.  Also, just to clarify, it
only ever happens on a single replica at a time.

On Thu, Apr 23, 2020 at 2:46 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin King <kingpin867@gmail.com> writes:
> > > We've seen unexpected termination of the WAL receiver process.  This
> > > stops streaming replication, but the replica stays available --
> > > restarting the server resumes streaming replication where it left off.
> > > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > > 11.x, 12.x) -- anything omitted is one we haven't used.
> >
> > > I don't have an explanation for the cause, but I was able to set
> > > logging to "debug5" and run an strace of the walreceiver PID when it
> > > eventually happened.  It appears as if the SIGTERM is coming from the
> > > "postgres: startup" process.
> >
> > The startup process intentionally SIGTERMs the walreceiver under
> > various circumstances, so I'm not sure that there's any surprise
> > here.  Have you checked the postmaster log?
> >
> >                         regards, tom lane
>
> Yep, I included "debug5" output of the postmaster log in the initial post.



Re: walreceiver termination

From
Tom Lane
Date:
Justin King <kingpin867@gmail.com> writes:
> I assume it would be related to the following:
> LOG:  incorrect resource manager data checksum in record at 2D6/C259AB90
> since the walreceiver terminates just after this - but I'm unclear
> what precisely this means.

What it indicates is corrupt data in the WAL stream.  When reading WAL
after crash recovery, we assume that that indicates end of WAL.  When
pulling live data from a source server, it suggests some actual problem
... but killing the walreceiver and trying to re-fetch the data might
be a reasonable response to that.  I'm not sure offhand what the startup
code thinks it's doing in this context.  It might either be attempting
to retry, or concluding that it's come to the end of WAL and it ought
to promote to being a live server.  If you don't see the walreceiver
auto-restarting then I'd suspect that the latter is happening.

            regards, tom lane
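
(For reference, the message itself comes from the per-record CRC check
in the WAL reader -- ValidXLogRecord() in xlogreader.c, if I'm reading
the sources right: every WAL record carries a CRC-32C computed over its
payload plus most of its header, and the standby recomputes it before
replaying the record.  A stand-alone illustration of the idea; the
struct below is NOT the real XLogRecord layout, and the bitwise CRC
here is only the slow reference form:)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* incremental bitwise CRC-32C (Castagnoli) */
    static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
    {
        const unsigned char *p = buf;

        while (len--)
        {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
        }
        return crc;
    }

    struct fake_record              /* illustrative only */
    {
        uint32_t tot_len;
        uint32_t crc;               /* covers payload + header up to here */
        unsigned char payload[16];
    };

    static uint32_t record_crc(const struct fake_record *rec)
    {
        uint32_t crc = 0xFFFFFFFF;

        crc = crc32c_update(crc, rec->payload, sizeof(rec->payload));
        crc = crc32c_update(crc, rec, offsetof(struct fake_record, crc));
        return ~crc;
    }

    int main(void)
    {
        struct fake_record rec = { .tot_len = sizeof(rec) };

        memcpy(rec.payload, "hello, wal world", sizeof(rec.payload));
        rec.crc = record_crc(&rec);            /* as written by the primary */

        printf("intact record ok: %d\n", record_crc(&rec) == rec.crc);  /* 1 */
        rec.payload[3] ^= 0x01;                /* one flipped bit in transit */
        printf("after bit flip:   %d\n", record_crc(&rec) == rec.crc);  /* 0 */
        return 0;
    }

(When the recomputed value doesn't match the stored one, the reader
reports "incorrect resource manager data checksum in record at %X/%X"
rather than replaying the record.)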



Re: walreceiver termination

From
Justin King
Date:
On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin King <kingpin867@gmail.com> writes:
> > I assume it would be related to the following:
> > LOG:  incorrect resource manager data checksum in record at 2D6/C259AB90
> > since the walreceiver terminates just after this - but I'm unclear
> > what precisely this means.
>
> What it indicates is corrupt data in the WAL stream.  When reading WAL
> after crash recovery, we assume that that indicates end of WAL.  When
> pulling live data from a source server, it suggests some actual problem
> ... but killing the walreceiver and trying to re-fetch the data might
> be a reasonable response to that.  I'm not sure offhand what the startup
> code thinks it's doing in this context.  It might either be attempting
> to retry, or concluding that it's come to the end of WAL and it ought
> to promote to being a live server.  If you don't see the walreceiver
> auto-restarting then I'd suspect that the latter is happening.
>
>                         regards, tom lane

walreceiver is definitely not restarting -- replication stops cold
right at that segment.  I'm a little unclear where to go from here --
is there additional info that would be useful?



Re: walreceiver termination

From
Justin King
Date:
Would there be anyone that might be able to help troubleshoot this
issue -- or at least give me something that would be helpful to look
for?

https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com

https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6

I'm not the first one to report something similar, and all of the
complaints have a copy-on-write filesystem in common -- ZFS in most
cases (or btrfs, in the last link).  Is there anything more we can do
here to help narrow down this issue?  I'm happy to help, but I
honestly wouldn't even know where to begin.

Thanks-

Justin King
flightaware.com

On Thu, Apr 23, 2020 at 4:40 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin King <kingpin867@gmail.com> writes:
> > > I assume it would be related to the following:
> > > LOG:  incorrect resource manager data checksum in record at 2D6/C259AB90
> > > since the walreceiver terminates just after this - but I'm unclear
> > > what precisely this means.
> >
> > What it indicates is corrupt data in the WAL stream.  When reading WAL
> > after crash recovery, we assume that that indicates end of WAL.  When
> > pulling live data from a source server, it suggests some actual problem
> > ... but killing the walreceiver and trying to re-fetch the data might
> > be a reasonable response to that.  I'm not sure offhand what the startup
> > code thinks it's doing in this context.  It might either be attempting
> > to retry, or concluding that it's come to the end of WAL and it ought
> > to promote to being a live server.  If you don't see the walreceiver
> > auto-restarting then I'd suspect that the latter is happening.
> >
> >                         regards, tom lane
>
> walreceiver is definitely not restarting -- replication stops cold
> right at that segment.  I'm a little unclear where to go from here --
> is there additional info that would be useful?



Re: walreceiver termination

From
Kyotaro Horiguchi
Date:
Hello.

At Mon, 4 May 2020 09:09:15 -0500, Justin King <kingpin867@gmail.com> wrote in 
> Would there be anyone that might be able to help troubleshoot this
> issue -- or at least give me something that would be helpful to look
> for?
> 
> https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>
> https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6
> 
> I'm not the first one to report something similar, and all of the
> complaints have a copy-on-write filesystem in common -- ZFS in most
> cases (or btrfs, in the last link).  Is there anything more we can do
> here to help narrow down this issue?  I'm happy to help, but I
> honestly wouldn't even know where to begin.

The sendto() call at the end of your strace output is the "close
connection" request to the WAL sender, and it would normally be
followed by close() and kill().  If that really is the last strace
output, the sendto() is blocking because the socket's send buffer is
full.

My diagnosis of the situation is that your replication connection ran
into trouble and the TCP session broke in a way the WAL receiver
could not detect.  As a result, feedback messages from the WAL
receiver piled up in the TCP send buffer, and the final sendto() was
eventually blocked while sending the close-connection message.

If this happens regularly, routers or firewalls between the primary
and the standby may be discarding sessions inadvertently.

I'm not sure how ZFS could be involved in this, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
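
(A follow-up note on that diagnosis: if the replication session really
is dying silently, the usual knobs for making that surface sooner are
the sender/receiver timeouts and TCP keepalives on the replication
connection.  The parameter names below are the standard GUC/libpq
ones; the values are purely illustrative:)

    # on the standby (postgresql.conf; primary_conninfo lived in
    # recovery.conf before v12)
    wal_receiver_timeout = 30s
    primary_conninfo = 'host=primary user=replicator keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3'

    # on the primary
    wal_sender_timeout = 30s

(None of that explains a bad checksum in the streamed WAL itself, but
it should at least rule the network path in or out.)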