BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
Date
Msg-id 17392-ae1e272049dfec87@postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17392
Logged by:          Alexander Ulaev
Email address:      alexander.ulaev@rtlabs.ru
PostgreSQL version: Unsupported/Unknown
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

We have several shards, each a Patroni cluster running PG 9.6 on VMs.
Due to problems on the SAN side, our KVM VMs were stalled for approximately
1 minute because disk I/O was unavailable. Most of the DB shards with
relatively low load saw commits take 40-60 seconds, but survived.

However, the masters of two shards with high application load (TPS 2-3x the
average among the shard DBs) were unexpectedly shut down with the same errors:

2022-02-01 16:12:24 MSK [16959] LOG:  received immediate shutdown request

and

2022-02-01 16:12:25 MSK [16959] LOG:  archiver process (PID 117615) exited
with exit code 2

(I suppose the timestamp of this LOG record is incorrect and that this
record is really what lies behind the "shutdown" record.)

Both appeared among a huge number of messages from user processes saying
"terminating connection because of crash of another server process", like
these:

2022-02-01 16:12:24 MSK [151045] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 HINT:  In a moment you should be able to reconnect to the
database and repeat your command.
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 WARNING:  terminating connection because of crash of another
server process
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 DETAIL:  The postmaster has commanded this server process to
roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
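Because the timestamps have one-second resolution and many processes write to the same log, one way to reason about ordering is to group records by the PID in brackets: records written by one process should appear in the order that process emitted them. The following is a minimal sketch, assuming the prefix layout visible in the excerpts (the actual log_line_prefix setting is not shown in the report). The sample records are copied from above with the mail line-wrapping undone, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Records copied from the report, with mail line-wrapping undone.
LOG_LINES = [
    "2022-02-01 16:12:24 MSK [16959] LOG:  received immediate shutdown request",
    "2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver "
    "queue2@queue2 WARNING:  terminating connection because of crash of "
    "another server process",
    "2022-02-01 16:12:25 MSK [16959] LOG:  archiver process (PID 117615) "
    "exited with exit code 2",
]

# Assumed layout: "<date> <time> <tz> [<pid>] <rest>". This is inferred from
# the excerpts, not taken from the server configuration.
RECORD = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) \[(\d+)\] (.*)$"
)


def group_by_pid(lines):
    """Return {pid: [(timestamp, rest), ...]}, preserving file order."""
    by_pid = defaultdict(list)
    for line in lines:
        match = RECORD.match(line)
        if match is None:
            continue  # continuation line or unrecognized format
        timestamp, _tz, pid, rest = match.groups()
        by_pid[int(pid)].append((timestamp, rest))
    return dict(by_pid)


if __name__ == "__main__":
    for pid, records in group_by_pid(LOG_LINES).items():
        for timestamp, rest in records:
            print(pid, timestamp, rest)
```

In the excerpts, both the "received immediate shutdown request" record and the "archiver process ... exited" record carry PID 16959, so they were written by the same process.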


I can't find anywhere what exit code 2 stands for. As far as I know, when
the ARCHIVER process terminates abnormally it should just be restarted by
the postmaster, with no "entire instance is terminated abnormally" or "all
of the postgres processes halt".
I found in the source code that all functions relating to the archiver are
in pgarch.c, whose initial author is Simon Riggs simon@2ndquadrant.com, but
I can't find any information there related to "exit code 2".
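As general POSIX background (not specific to pgarch.c): a parent process learns from the wait status whether a child ended by calling exit with some status or was killed by a signal, and the wording "exited with exit code 2" suggests the former. The sketch below shows that distinction using Python's subprocess module. The child command is a hypothetical stand-in, not the archiver, and describe_exit is an illustrative helper name.

```python
import subprocess
import sys


def describe_exit(returncode):
    """Describe how a child ended, given subprocess's returncode."""
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        return f"was terminated by signal {-returncode}"
    return f"exited with exit code {returncode}"


if __name__ == "__main__":
    # Hypothetical child that simply exits with status 2.
    child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(2)"])
    print("child process", describe_exit(child.returncode))
```

This does not say why a given process chose status 2; that meaning is defined by the program that exits.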

Later, when the instance was starting and replaying WAL since the last
checkpoint, an "invalid record length" message appeared:

2022-02-01 16:13:17 MSK [153401] LOG:  invalid record length at
1CEC/C3BEBB50: wanted 24, got 0
2022-02-01 16:13:17 MSK [153401] LOG:  consistent recovery state reached at
1CEC/C3BEBB50
2022-02-01 16:13:17 MSK [153397] LOG:  database system is ready to accept
read only connections

but the instance started and Patroni returned it to the master role,
because the sync replica had also been shut down after hitting "invalid
record length" while applying WAL:

2022-02-01 16:12:25 MSK [16563] 127.0.0.1 [unknown] patroni@postgres LOG:
connection authorized: user=patroni database=postgres
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
2022-02-01 16:12:25 MSK [89015] FATAL:  could not receive data from WAL
stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

2022-02-01 16:12:25 MSK [89009] LOG:  invalid record length at
1CEC/C3BEBA90: wanted 24, got 0
2022-02-01 16:12:25 MSK [16564] FATAL:  could not connect to the primary
server: FATAL:  the database system is shutting down
2022-02-01 16:12:26 MSK [89006] LOG:  received fast shutdown request
2022-02-01 16:12:26 MSK [89006] LOG:  aborting any active transactions
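The two "invalid record length" positions can be compared directly. An LSN printed as XXXXXXXX/YYYYYYYY is a 64-bit WAL position written as two hexadecimal 32-bit halves, so the distance between two LSNs is a plain integer subtraction. A minimal sketch, with the two values copied from the log excerpts above and parse_lsn as an illustrative helper name:

```python
def parse_lsn(text):
    """Convert 'XXXXXXXX/YYYYYYYY' (hex high/low halves) to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


PRIMARY_END = "1CEC/C3BEBB50"  # position reported during the master's recovery
REPLICA_END = "1CEC/C3BEBA90"  # position reported on the sync replica

if __name__ == "__main__":
    gap = parse_lsn(PRIMARY_END) - parse_lsn(REPLICA_END)
    print(f"replica position is {gap} bytes behind the master position")
```

For these two values the difference is 0xC0, that is 192 bytes.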

