Thread: BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

The following bug has been logged on the website:

Bug reference:      17392
Logged by:          Alexander Ulaev
Email address:      alexander.ulaev@rtlabs.ru
PostgreSQL version: Unsupported/Unknown
Operating system:   CentOS Linux release 7.9.2009 (Core)
Description:

We run several database shards as Patroni clusters over PostgreSQL 9.6,
installed on KVM VMs. A problem on the SAN side left our VMs without I/O for
roughly one minute. Most of the DB shards, which have relatively low load,
saw commits take 40-60 seconds but survived.

However, the masters of two shards with high application load (TPS 2-3x the
average among the shard DBs) were unexpectedly shut down, with the same errors:

2022-02-01 16:12:24 MSK [16959] LOG:  received immediate shutdown request

and

2022-02-01 16:12:25 MSK [16959] LOG:  archiver process (PID 117615) exited
with exit code 2

(I suppose the timestamp of this LOG line is off, and by meaning the record
really belongs after the "shutdown" record.) Both appeared among a huge
number of per-backend messages like "terminating connection because of
crash of another server process":

2022-02-01 16:12:24 MSK [151045] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 HINT:  In a moment you should be able to reconnect to the
database and repeat your command.
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 WARNING:  terminating connection because of crash of another
server process
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 DETAIL:  The postmaster has commanded this server process to
roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.


I can't find anywhere what this exit code 2 stands for, and as far as I know
the archiver process, on abnormal termination, should simply be restarted by
the postmaster, not cause the entire instance to terminate abnormally ("all
postgres processes halting").
I found in the source code that all the archiver-related functions live in
pgarch.c (initial author: Simon Riggs <simon@2ndquadrant.com>), but I could
not find anything there about "exit code 2".
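As for where the number in "exited with exit code 2" comes from: the
postmaster reports a child's wait status via the standard POSIX macros, so
the value is whatever the exiting process itself passed to exit()/_exit().
A minimal illustration (ordinary POSIX process handling, not PostgreSQL
source code):

```python
# Hedged illustration: the "exited with exit code N" wording in the
# postmaster's log is produced from the child's wait status. The number is
# defined by the exiting process, not by the postmaster reporting it.
import os

pid = os.fork()
if pid == 0:
    os._exit(2)  # child: exit code 2, as in the reported log line

_, status = os.waitpid(pid, 0)
if os.WIFEXITED(status):
    print(f"child process exited with exit code {os.WEXITSTATUS(status)}")
elif os.WIFSIGNALED(status):
    print(f"child process was terminated by signal {os.WTERMSIG(status)}")
```

So the meaning of "2" has to be found in whatever exit() call the archiver
(or its signal handler) made, not in the postmaster's reporting code.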

Later, when the instance was starting and recovering WAL since the last
checkpoint, "invalid record length" appeared:

2022-02-01 16:13:17 MSK [153401] LOG:  invalid record length at
1CEC/C3BEBB50: wanted 24, got 0
2022-02-01 16:13:17 MSK [153401] LOG:  consistent recovery state reached at
1CEC/C3BEBB50
2022-02-01 16:13:17 MSK [153397] LOG:  database system is ready to accept
read only connections

But the instance started, and Patroni returned it to the master role, because
the sync replica had also been shut down ("invalid record length" while
applying WAL):
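For what it's worth, the "wanted 24, got 0" message is recovery reading the
next record's total-length field and finding zeroes; hitting a zeroed region
is how replay normally detects the end of WAL rather than corruption. A rough
sketch of the check, assuming the 9.6 record layout (SizeOfXLogRecord = 24)
and a hypothetical helper name:

```python
# Hedged sketch of the check behind "invalid record length ... wanted 24,
# got 0": replay reads the record header's first field (xl_tot_len); a
# zeroed tail of WAL yields 0, which ends replay normally.
import struct

SIZE_OF_XLOG_RECORD = 24  # minimal record header size in PostgreSQL 9.6

def check_record_header(buf: bytes) -> str:
    # xl_tot_len is the first (uint32) field of the record header
    (xl_tot_len,) = struct.unpack_from("<I", buf, 0)
    if xl_tot_len < SIZE_OF_XLOG_RECORD:
        return f"invalid record length: wanted {SIZE_OF_XLOG_RECORD}, got {xl_tot_len}"
    return "record looks plausible"

print(check_record_header(b"\x00" * 32))  # zeroed WAL tail -> end of replay
```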

2022-02-01 16:12:25 MSK [16563] 127.0.0.1 [unknown] patroni@postgres LOG:
connection authorized: user=patroni database=postgres
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
2022-02-01 16:12:25 MSK [89015] FATAL:  could not receive data from WAL
stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

2022-02-01 16:12:25 MSK [89009] LOG:  invalid record length at
1CEC/C3BEBA90: wanted 24, got 0
2022-02-01 16:12:25 MSK [16564] FATAL:  could not connect to the primary
server: FATAL:  the database system is shutting down
2022-02-01 16:12:26 MSK [89006] LOG:  received fast shutdown request
2022-02-01 16:12:26 MSK [89006] LOG:  aborting any active transactions


RE: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

From: Ulaev Alexander Sergeevich
Date:
Hi, thanks for your help.

2022-02-01 16:11:34,964 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:44,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:54,677 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:11:56,145 ERROR: Request to server http://IP_ETCD1:2379 failed:
MaxRetryError('HTTPConnectionPool(host=\'IP_ETCD1\',port=2379): Max retries exceeded with url:
/v2/keys/postgresql/db/leader(Caused by Read 
TimeoutError("HTTPConnectionPool(host=\'IP_ETCD1\', port=2379): Read timed out. (read
timeout=1.4283105651182788)",))',)
2022-02-01 16:11:56,146 INFO: Reconnection allowed, looking for another server.
2022-02-01 16:11:56,146 INFO: Retrying on http://IP_ETCD2:2379
2022-02-01 16:11:56,262 INFO: Selected new etcd server http://IP_ETCD2:2379
2022-02-01 16:11:56,475 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:04,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:14,960 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:24,669 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:12:24,738 ERROR:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 256, in api_execute
    return self._handle_server_response(response)
  File "/usr/lib/python3.6/site-packages/etcd/client.py", line 987, in _handle_server_response
    etcd.EtcdError.handle(r)
  File "/usr/lib/python3.6/site-packages/etcd/__init__.py", line 306, in handle
    raise exc(msg, payload)
etcd.EtcdKeyNotFound: Key not found : /postgresql/db/leader
2022-02-01 16:12:24,928 ERROR: failed to update leader lock
2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS
2022-02-01 16:12:27,079 INFO: closed patroni connection to the postgresql cluster
2022-02-01 16:12:28,685 INFO: postmaster pid=153397
2022-02-01 16:12:34,670 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:34,671 INFO: not healthy enough for leader race
2022-02-01 16:12:34,738 INFO: starting after demotion in progress
2022-02-01 16:12:44,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:44,672 INFO: not healthy enough for leader race
2022-02-01 16:12:44,741 INFO: starting after demotion in progress
2022-02-01 16:12:54,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:54,672 INFO: not healthy enough for leader race
2022-02-01 16:12:54,739 INFO: starting after demotion in progress
2022-02-01 16:13:04,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:04,672 INFO: not healthy enough for leader race
2022-02-01 16:13:04,746 INFO: starting after demotion in progress
2022-02-01 16:13:14,673 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:14,674 INFO: not healthy enough for leader race
2022-02-01 16:13:14,743 INFO: starting after demotion in progress
2022-02-01 16:13:18,748 INFO: establishing a new patroni connection to the postgres cluster
2022-02-01 16:13:19,961 INFO: Got response from db01.local http://IP_DB01:8008/patroni: {"state": "starting", "role":
"replica","database_system_identifier": "9879384539485923423", "patroni": {"version": "2.1.0", "scope": "db"}} 
2022-02-01 16:13:19,961 WARNING: Request failed to db01.local: GET http://IP_DB01:8008/patroni ('xlog')
2022-02-01 16:13:20,365 INFO: promoted self to leader by acquiring session lock
2022-02-01 16:13:20,380 INFO: cleared rewind state after becoming the leader
2022-02-01 16:13:21,816 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:24,830 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:34,826 INFO: no action. I am (db02.local) the leader with the lock

Best regards,
Ulaev Alexander


-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 11:52 AM
To: PG Bug reporting form <noreply@postgresql.org>
Cc: Улаев Александр Сергеевич <alexander.ulaev@rtlabs.ru>; pgsql-bugs@lists.postgresql.org
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

Hello

> 2022-02-01 16:12:24 MSK [16959] LOG: received immediate shutdown request

Please show, via the Patroni logs, that it was not Patroni that commanded the
immediate shutdown. As far as I remember, there should be a mark about the
request to stop the database immediately.

This seems unrelated to the archiver. Patroni is able to command an immediate
shutdown, and an immediate shutdown is a kind of "terminated abnormally". The
sequence of log messages looks completely correct.

regards, Sergei



Hello
> 2022-02-01 16:12:24,928 ERROR: failed to update leader lock
> 2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS

Between these two messages, an immediate shutdown is called:
https://github.com/zalando/patroni/blob/v2.1.2/patroni/ha.py#L1045
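The logic at that link can be sketched as follows (hypothetical function
names, not Patroni's actual code):

```python
# Simplified sketch of the decision path in patroni/ha.py linked above:
# a leader that cannot refresh its DCS lock must assume another node may
# acquire it, so it stops PostgreSQL with an *immediate* shutdown to avoid
# split-brain. Function and mode names here are illustrative.

def on_leader_loop(update_lock_ok: bool) -> str:
    if update_lock_ok:
        return "no action. I am the leader with the lock"
    # Failure path: demote with mode 'immediate', which sends the postmaster
    # an immediate shutdown request -- the "received immediate shutdown
    # request" line in the PostgreSQL log.
    return demote("immediate")

def demote(mode: str) -> str:
    assert mode in ("immediate", "graceful")
    return f"demoted self because failed to update leader lock in DCS (mode={mode})"
```

This matches the Patroni log sequence quoted above: "failed to update leader
lock" followed by "demoted self because failed to update leader lock in DCS",
with the PostgreSQL immediate shutdown happening in between.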

Regards, Sergei



RE: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

From: Ulaev Alexander Sergeevich
Date:
We found these errors in the syslog on ETCD1:

Feb  1 16:12:01 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:12:02 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in
1.21s]
Feb  1 16:12:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in
1s]
Feb  1 16:12:20 etcd1 etcd: sync duration of 29.69857369s, expected less than 1s
Feb  1 16:26:55 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:27:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:27:17 etcd1 etcd: sync duration of 1m0.745329542s, expected less than 1s

So this problem traces back to the SAN: the I/O freeze hit many VMs,
including the DB and etcd nodes => long etcd I/O delays => Patroni reacted by
demoting its own PostgreSQL instance.
Sergei, thank you very much for your support!


Best regards,
Ulaev Alexander

-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 1:06 PM
To: Ulaev Alexander Sergeevich <alexander.ulaev@rtlabs.ru>
Cc: pgsql-bugs@lists.postgresql.org; PG Bug reporting form <noreply@postgresql.org>
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

Hello
> 2022-02-01 16:12:24,928 ERROR: failed to update leader lock
> 2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS

Between these two messages, an immediate shutdown is called:
https://github.com/zalando/patroni/blob/v2.1.2/patroni/ha.py#L1045

Regards, Sergei