Thread: walreceiver termination
We've seen unexpected termination of the WAL receiver process. This stops streaming replication, but the replica stays available -- restarting the server resumes streaming replication where it left off. We've seen this across nearly every recent version of PG (9.4, 9.5, 11.x, 12.x) -- anything omitted is one we haven't used.

I don't have an explanation for the cause, but I was able to set logging to "debug5" and run an strace of the walreceiver PID when it eventually happened. It appears as if the SIGTERM is coming from the "postgres: startup" process.

The only notable thing I've seen discussed in the past that might be relevant here is that these replicas all reside on a ZFS filesystem (on Linux). In fact, everywhere we've seen the issue, the database sits on a ZFS pool. We also have another PG cluster running on non-ZFS (ext4) and have never experienced this problem.

Any suggestions? Logging snippets are below:

postgresql.log

Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A6E0 for Btree/INSERT_LEAF: off 88
Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A720 for Heap/HEAP_CONFIRM: off 6
Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23 09:09:13.663 GMT [2229] DEBUG: record known xact 8677704 latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23 09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A750 for Heap/INSERT: off 7 flags 0x0C
Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23 09:09:13.663 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL: terminating walreceiver process due to administrator command
Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 6 on_shmem_exit callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: proc_exit(1): 2 callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to stream after failure
Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: exit(1)
Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: incorrect resource manager data checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23 09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to stream after failure
Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23 09:09:13.664 GMT [2299] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23 09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum in record at 2D

process id list:

2224 ?  Ss  0:24 /usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/12/main/postgresql.conf -c hba_file=/etc/postgresql/12/main/pg_hba.conf -c ident_file=/etc/postgresql/12/main/pg_ident.conf -c external_pid_file=/var/run/postgresql/12-main.pid
2229 ?  Ss  7:02 postgres: startup
2230 ?  Ss  6:09 postgres: checkpointer
2231 ?  Ss  0:08 postgres: background writer
2298 ?  Ss  1:29 postgres: stats collector
2299 ?  Ss  6:11 postgres: walreceiver

strace of walreceiver:

kill(2229, SIGUSR1) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]: [1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: sending write 2D6/C25A0000 flush 2D6/C25A0000 apply 2D6/C25915F0", 156, MSG_NOSIGNAL, NULL, 0) = 156
sendto(4<socket:[208548975]>, "\x64\x00\x00\x00\x26\x72\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x59\x15\xf0\x00\x02\x46\xf0\x5f\x04\xbe\x9e\x00", 39, MSG_NOSIGNAL, NULL, 0) = 39
epoll_create1(EPOLL_CLOEXEC) = 6<anon_inode:[eventpoll]>
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 10<pipe:[208548970]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640776, u64=94863290307528}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 7<pipe:[208548681]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640800, u64=94863290307552}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 4<socket:[208548975]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640824, u64=94863290307576}}) = 0
epoll_wait(6<anon_inode:[eventpoll]>, 0x564714b89410, 1, 100) = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2229, si_uid=112} ---
write(11<pipe:[208548970]>, "\x00", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
close(6<anon_inode:[eventpoll]>) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<131>Apr 23 09:09:13 postgres[2299]: [1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL: terminating walreceiver process due to administrator command", 152, MSG_NOSIGNAL, NULL, 0) = 152
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]: [1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make", 144, MSG_NOSIGNAL, NULL, 0) = 144
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]: [1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 6 on_shmem_exit callbacks to make", 140, MSG_NOSIGNAL, NULL, 0) = 140
sendto(4<socket:[208548975]>, "\x58\x00\x00\x00\x04", 5, MSG_NOSIGNAL, NULL, 0) = 5

Any suggestions?

Thanks-
Justin
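P.S. In case anyone wants to reproduce the capture above, this is roughly how it was gathered -- a reconstruction rather than an exact transcript, with paths taken from the Debian-style layout shown in the process list:

    # on the replica: raise log verbosity (revert once done) and reload
    echo "log_min_messages = debug5" | sudo tee -a /etc/postgresql/12/main/postgresql.conf
    sudo -u postgres psql -c "SELECT pg_reload_conf();"

    # attach strace to the walreceiver and leave it running until the
    # process is terminated; the process title matches the ps output above
    WALRCV_PID=$(pgrep -f 'postgres: walreceiver')
    sudo strace -p "$WALRCV_PID" -o /tmp/walreceiver.strace

The SIGTERM line in the trace shows si_pid=2229, which is the "postgres: startup" process in the list above.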
Justin King <kingpin867@gmail.com> writes:
> We've seen unexpected termination of the WAL receiver process. This
> stops streaming replication, but the replica stays available --
> restarting the server resumes streaming replication where it left off.
> We've seen this across nearly every recent version of PG (9.4, 9.5,
> 11.x, 12.x) -- anything omitted is one we haven't used.

> I don't have an explanation for the cause, but I was able to set
> logging to "debug5" and run an strace of the walreceiver PID when it
> eventually happened. It appears as if the SIGTERM is coming from the
> "postgres: startup" process.

The startup process intentionally SIGTERMs the walreceiver under
various circumstances, so I'm not sure that there's any surprise
here. Have you checked the postmaster log?

            regards, tom lane
On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin King <kingpin867@gmail.com> writes:
> > We've seen unexpected termination of the WAL receiver process. This
> > stops streaming replication, but the replica stays available --
> > restarting the server resumes streaming replication where it left off.
> > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > 11.x, 12.x) -- anything omitted is one we haven't used.
>
> > I don't have an explanation for the cause, but I was able to set
> > logging to "debug5" and run an strace of the walreceiver PID when it
> > eventually happened. It appears as if the SIGTERM is coming from the
> > "postgres: startup" process.
>
> The startup process intentionally SIGTERMs the walreceiver under
> various circumstances, so I'm not sure that there's any surprise
> here. Have you checked the postmaster log?
>
>             regards, tom lane

Yep, I included "debug5" output of the postmaster log in the initial post.
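For reference, the snippet in the first mail was pulled out roughly like this -- the syslog file name is a guess for our setup, and 2229/2299 are the startup and walreceiver PIDs from the process list:

    # grab the startup (2229) and walreceiver (2299) lines around the failure
    grep -E 'postgres\[2(229|299)\]' /var/log/syslog | grep '09:09:1[0-9]'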
I assume it would be related to the following:

LOG: incorrect resource manager data checksum in record at 2D6/C259AB90

since the walreceiver terminates just after this -- but I'm unclear what precisely this means. Without digging into the code, I would guess that it's unable to verify the checksum on the segment it just received from the master; since there are multiple replicas here and only this one was affected, that points to an issue on this client. That said, it happens everywhere -- we have ~16 replicas across 3 different clusters (on different versions) and we see this uniformly across all of them, at seemingly random times. Also, just to clarify, it only ever happens on a single replica at a time. (A sketch of how I might check the on-disk WAL is below the quoted text.)

On Thu, Apr 23, 2020 at 2:46 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin King <kingpin867@gmail.com> writes:
> > > We've seen unexpected termination of the WAL receiver process. This
> > > stops streaming replication, but the replica stays available --
> > > restarting the server resumes streaming replication where it left off.
> > > We've seen this across nearly every recent version of PG (9.4, 9.5,
> > > 11.x, 12.x) -- anything omitted is one we haven't used.
> >
> > > I don't have an explanation for the cause, but I was able to set
> > > logging to "debug5" and run an strace of the walreceiver PID when it
> > > eventually happened. It appears as if the SIGTERM is coming from the
> > > "postgres: startup" process.
> >
> > The startup process intentionally SIGTERMs the walreceiver under
> > various circumstances, so I'm not sure that there's any surprise
> > here. Have you checked the postmaster log?
> >
> >             regards, tom lane
>
> Yep, I included "debug5" output of the postmaster log in the initial post.
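A minimal sketch of what I had in mind for checking whether the record at 2D6/C259AB90 is actually bad on disk -- assuming pg_waldump (pg_xlogdump on the 9.x clusters) and the Debian paths from my first mail; I haven't verified this procedure:

    # on the affected replica: try to decode the records around the
    # failing LSN straight from pg_wal
    /usr/lib/postgresql/12/bin/pg_waldump \
        -p /var/lib/postgresql/12/main/pg_wal \
        -s 2D6/C2590000 -e 2D6/C25B0000

    # the same range on the primary (if those segments still exist there)
    # should decode cleanly if the damage happened on, or on the way to,
    # this replica

If pg_waldump chokes at the same LSN, that would at least confirm the bad data made it to disk here rather than being a transient read.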
Justin King <kingpin867@gmail.com> writes:
> I assume it would be related to the following:
> LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
> since the walreceiver terminates just after this -- but I'm unclear
> what precisely this means.

What it indicates is corrupt data in the WAL stream. When reading WAL
after crash recovery, we assume that that indicates end of WAL. When
pulling live data from a source server, it suggests some actual problem
... but killing the walreceiver and trying to re-fetch the data might
be a reasonable response to that. I'm not sure offhand what the startup
code thinks it's doing in this context. It might either be attempting
to retry, or concluding that it's come to the end of WAL and it ought
to promote to being a live server. If you don't see the walreceiver
auto-restarting then I'd suspect that the latter is happening.

            regards, tom lane
On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Justin King <kingpin867@gmail.com> writes:
> > I assume it would be related to the following:
> > LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
> > since the walreceiver terminates just after this -- but I'm unclear
> > what precisely this means.
>
> What it indicates is corrupt data in the WAL stream. When reading WAL
> after crash recovery, we assume that that indicates end of WAL. When
> pulling live data from a source server, it suggests some actual problem
> ... but killing the walreceiver and trying to re-fetch the data might
> be a reasonable response to that. I'm not sure offhand what the startup
> code thinks it's doing in this context. It might either be attempting
> to retry, or concluding that it's come to the end of WAL and it ought
> to promote to being a live server. If you don't see the walreceiver
> auto-restarting then I'd suspect that the latter is happening.
>
>             regards, tom lane

The walreceiver is definitely not restarting -- replication stops cold right at that segment. I'm a little unclear where to go from here -- is there additional info that would be useful?
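On the retry-vs-promote question, this is a sketch of what can be checked on the stuck replica (PG 10+ function names; the 9.x clusters would use pg_last_xlog_receive_location() and pg_last_xlog_replay_location() instead):

    # is the replica still in recovery (i.e. it has not promoted)?
    sudo -u postgres psql -Atc "SELECT pg_is_in_recovery();"

    # is any walreceiver attached at all? (zero rows here while recovery
    # is still true would mean: not promoted, but also not retrying)
    sudo -u postgres psql -c "SELECT status, received_lsn, last_msg_receipt_time FROM pg_stat_wal_receiver;"

    # how far did receive and replay get?
    sudo -u postgres psql -Atc "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

Since a simple restart resumes streaming where it left off, the replica presumably has not promoted; the interesting bit is whether pg_stat_wal_receiver stays empty while it is stuck.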
Would anyone be able to help troubleshoot this issue -- or at least give me something that would be helpful to look for?

https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6

I'm not the first to report something like this, and the complaints all have an unusual filesystem in common -- ZFS in most cases (btrfs in the last link). Is there anything more we can do here to help narrow down this issue? I'm happy to help, but I honestly wouldn't even know where to begin. (A sketch of a ZFS-side check I can run in the meantime is below the quoted text.)

Thanks-
Justin King
flightaware.com

On Thu, Apr 23, 2020 at 4:40 PM Justin King <kingpin867@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Justin King <kingpin867@gmail.com> writes:
> > > I assume it would be related to the following:
> > > LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
> > > since the walreceiver terminates just after this -- but I'm unclear
> > > what precisely this means.
> >
> > What it indicates is corrupt data in the WAL stream. When reading WAL
> > after crash recovery, we assume that that indicates end of WAL. When
> > pulling live data from a source server, it suggests some actual problem
> > ... but killing the walreceiver and trying to re-fetch the data might
> > be a reasonable response to that. I'm not sure offhand what the startup
> > code thinks it's doing in this context. It might either be attempting
> > to retry, or concluding that it's come to the end of WAL and it ought
> > to promote to being a live server. If you don't see the walreceiver
> > auto-restarting then I'd suspect that the latter is happening.
> >
> >             regards, tom lane
>
> The walreceiver is definitely not restarting -- replication stops cold
> right at that segment. I'm a little unclear where to go from here --
> is there additional info that would be useful?
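For what it's worth, a minimal sketch of the ZFS-side sanity check mentioned above, assuming the pool holding the data directory is named "tank" (a placeholder):

    # any checksum errors or known data errors recorded by ZFS?
    zpool status -v tank

    # force a full read/verify pass of the pool; re-run status to watch progress
    zpool scrub tank
    zpool status tank

If ZFS reports zero CKSUM errors while PostgreSQL still sees a bad WAL record, that would at least suggest the corruption is not happening at rest on disk.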
Hello.

At Mon, 4 May 2020 09:09:15 -0500, Justin King <kingpin867@gmail.com> wrote in
> Would anyone be able to help troubleshoot this issue -- or at least
> give me something that would be helpful to look for?
>
> https://www.postgresql.org/message-id/flat/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1%3Dtifg%40mail.gmail.com
> https://www.postgresql.org/message-id/flat/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6
>
> I'm not the first to report something like this, and the complaints
> all have an unusual filesystem in common -- ZFS in most cases (btrfs
> in the last link). Is there anything more we can do here to help
> narrow down this issue? I'm happy to help, but I honestly wouldn't
> even know where to begin.

The sendto() call at the end of your strace output is a "close connection" request to the wal sender, and normally it would be followed by close() and kill(). If that really is the last line of the strace output, the sendto() is blocked because the send buffer is full.

My diagnosis of the situation is that your replication connection ran into trouble and the TCP session broke in a way the wal receiver could not detect. As a result, feedback messages from the wal receiver piled up in the TCP send buffer until the final sendto() -- the close-connection message -- blocked. If this happens regularly, routers or firewalls between the primary and standby may be dropping the sessions inadvertently.

I'm not sure how ZFS could be involved in this trouble, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
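If a silently broken TCP session along those lines is the culprit, these are the usual knobs for making both ends notice a dead peer sooner -- a sketch only, with placeholder values rather than recommendations:

    # standby side, postgresql.conf: give up on a silent primary and let
    # the walreceiver reconnect
    #   wal_receiver_timeout = 30s
    #
    # primary side, postgresql.conf: give up on a silent standby
    #   wal_sender_timeout = 30s
    #
    # keepalives for the standby's outbound replication connection are libpq
    # options inside primary_conninfo (recovery.conf on <= 11, postgresql.conf
    # on 12), for example:
    #   primary_conninfo = 'host=primary ... keepalives_idle=60 keepalives_interval=10 keepalives_count=3'

    # reload after editing the timeouts (a primary_conninfo change still
    # needs a restart on these versions)
    sudo -u postgres psql -c "SELECT pg_reload_conf();"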