Thread: BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 17392
Logged by: Alexander Ulaev
Email address: alexander.ulaev@rtlabs.ru
PostgreSQL version: Unsupported/Unknown
Operating system: CentOS Linux release 7.9.2009 (Core)

Description:

We run several shards as Patroni clusters over PG 9.6, installed on KVM VMs. A problem on the SAN side left our VMs without I/O for approximately one minute. Most of the shards, which carry relatively low load, saw 40-60 second commits but survived. However, the masters of two shards with high application load (TPS 2-3x the average across the shard DBs) were unexpectedly shut down with the same errors:

2022-02-01 16:12:24 MSK [16959] LOG: received immediate shutdown request
2022-02-01 16:12:25 MSK [16959] LOG: archiver process (PID 117615) exited with exit code 2

(I suspect the timestamp of the second LOG line is misleading, and that this record logically belongs after the "shutdown" record.) These appeared among a huge number of "terminating connection because of crash of another server process" messages for user processes, such as:

2022-02-01 16:12:24 MSK [151045] 127.0.0.1 PostgreSQL JDBC Driver queue2@queue2 HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver queue2@queue2 WARNING: terminating connection because of crash of another server process
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver queue2@queue2 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
I can't find anywhere what this exit code 2 stands for. As far as I know, on abnormal termination the archiver process should simply be restarted by the postmaster, not cause the entire instance to terminate ("all of the postgres processes halt"). I looked in the source code: all functions related to the archiver are in pgarch.c (initial author: Simon Riggs <simon@2ndquadrant.com>), but I couldn't find any information there about "exit code 2".

Later, when the instance was starting and recovering WAL since the last checkpoint, an "invalid record length" message appeared:

2022-02-01 16:13:17 MSK [153401] LOG: invalid record length at 1CEC/C3BEBB50: wanted 24, got 0
2022-02-01 16:13:17 MSK [153401] LOG: consistent recovery state reached at 1CEC/C3BEBB50
2022-02-01 16:13:17 MSK [153397] LOG: database system is ready to accept read only connections

The instance nevertheless started, and Patroni returned it to the master role, because the sync replica had also been shut down with "invalid record length" while applying WAL:

2022-02-01 16:12:25 MSK [16563] 127.0.0.1 [unknown] patroni@postgres LOG: connection authorized: user=patroni database=postgres
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2022-02-01 16:12:25 MSK [89015] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
2022-02-01 16:12:25 MSK [89009] LOG: invalid record length at 1CEC/C3BEBA90: wanted 24, got 0
2022-02-01 16:12:25 MSK [16564] FATAL: could not connect to the primary server: FATAL: the database system is shutting down
2022-02-01 16:12:26 MSK [89006] LOG: received fast shutdown request
2022-02-01 16:12:26 MSK [89006] LOG: aborting any active transactions
RE: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
From
Улаев Александр Сергеевич
Date:
Hi, thanks for your cooperation. Here are the Patroni logs:

2022-02-01 16:11:34,964 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:44,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:54,677 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:11:56,145 ERROR: Request to server http://IP_ETCD1:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'IP_ETCD1\', port=2379): Max retries exceeded with url: /v2/keys/postgresql/db/leader (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'IP_ETCD1\', port=2379): Read timed out. (read timeout=1.4283105651182788)",))',)
2022-02-01 16:11:56,146 INFO: Reconnection allowed, looking for another server.
2022-02-01 16:11:56,146 INFO: Retrying on http://IP_ETCD2:2379
2022-02-01 16:11:56,262 INFO: Selected new etcd server http://IP_ETCD2:2379
2022-02-01 16:11:56,475 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:04,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:14,960 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:24,669 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:12:24,738 ERROR: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
    return retry(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 256, in api_execute
    return self._handle_server_response(response)
  File "/usr/lib/python3.6/site-packages/etcd/client.py", line 987, in _handle_server_response
    etcd.EtcdError.handle(r)
  File "/usr/lib/python3.6/site-packages/etcd/__init__.py", line 306, in handle
    raise exc(msg, payload)
etcd.EtcdKeyNotFound: Key not found : /postgresql/db/leader
2022-02-01 16:12:24,928 ERROR: failed to update leader lock
2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS
2022-02-01 16:12:27,079 INFO: closed patroni connection to the postgresql cluster
2022-02-01 16:12:28,685 INFO: postmaster pid=153397
2022-02-01 16:12:34,670 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:34,671 INFO: not healthy enough for leader race
2022-02-01 16:12:34,738 INFO: starting after demotion in progress
2022-02-01 16:12:44,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:44,672 INFO: not healthy enough for leader race
2022-02-01 16:12:44,741 INFO: starting after demotion in progress
2022-02-01 16:12:54,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:54,672 INFO: not healthy enough for leader race
2022-02-01 16:12:54,739 INFO: starting after demotion in progress
2022-02-01 16:13:04,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:04,672 INFO: not healthy enough for leader race
2022-02-01 16:13:04,746 INFO: starting after demotion in progress
2022-02-01 16:13:14,673 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:14,674 INFO: not healthy enough for leader race
2022-02-01 16:13:14,743 INFO: starting after demotion in progress
2022-02-01 16:13:18,748 INFO: establishing a new patroni connection to the postgres cluster
2022-02-01 16:13:19,961 INFO: Got response from db01.local http://IP_DB01:8008/patroni: {"state": "starting", "role": "replica", "database_system_identifier": "9879384539485923423", "patroni": {"version": "2.1.0", "scope": "db"}}
2022-02-01 16:13:19,961 WARNING: Request failed to db01.local: GET http://IP_DB01:8008/patroni ('xlog')
2022-02-01 16:13:20,365 INFO: promoted self to leader by acquiring session lock
2022-02-01 16:13:20,380 INFO: cleared rewind state after becoming the leader
2022-02-01 16:13:21,816 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:24,830 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:34,826 INFO: no action. I am (db02.local) the leader with the lock

Best regards,
Ulaev Alexander

-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 11:52 AM
To: PG Bug reporting form <noreply@postgresql.org>
Cc: Улаев Александр Сергеевич <alexander.ulaev@rtlabs.ru>; pgsql-bugs@lists.postgresql.org
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

Hello

> 2022-02-01 16:12:24 MSK [16959] LOG: received immediate shutdown request

Please prove by the Patroni logs that it was not Patroni that commanded the immediate shutdown. As far as I remember, there should be a mark about the request to stop the database immediately. This seems unrelated to the archiver: Patroni is able to command immediate shutdown, and immediate shutdown is a kind of "terminated abnormally". The sequence of logs looks completely correct.

regards, Sergei
Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
From
Sergei Kornilov
Date:
Hello

> 2022-02-01 16:12:24,928 ERROR: failed to update leader lock
> 2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS

Between these two messages, an immediate shutdown is called: https://github.com/zalando/patroni/blob/v2.1.2/patroni/ha.py#L1045

Regards, Sergei
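[Editor's note] The control flow Sergei points to can be summarized as: when the leader cannot refresh its lock in the DCS, it must assume another node may soon be promoted, so it stops PostgreSQL immediately to rule out split-brain writes. The sketch below illustrates that logic only; it is not actual Patroni code, and all class and method names are invented for illustration.

```python
# Simplified sketch of Patroni's demote-on-lock-failure behavior.
# NOT actual Patroni code; names are illustrative assumptions.

class DcsError(Exception):
    """Raised when the DCS request fails or the leader key is gone
    (compare: etcd.EtcdKeyNotFound: Key not found : /postgresql/db/leader)."""


class MiniHaLoop:
    def __init__(self, dcs_update_leader, stop_postgres):
        self.dcs_update_leader = dcs_update_leader  # callable refreshing the leader key
        self.stop_postgres = stop_postgres          # callable(mode=...) stopping PostgreSQL
        self.role = 'master'

    def run_cycle(self):
        try:
            self.dcs_update_leader()
            return 'no action. I am the leader with the lock'
        except DcsError:
            # Mirrors "ERROR: failed to update leader lock": the leader key
            # could not be refreshed, so stop PostgreSQL *immediately* --
            # this is what produces "received immediate shutdown request"
            # in the PostgreSQL log -- then continue as a replica.
            self.stop_postgres(mode='immediate')
            self.role = 'replica'
            return 'demoted self because failed to update leader lock in DCS'


def failing_update():
    raise DcsError('Key not found : /postgresql/db/leader')


stops = []
loop = MiniHaLoop(failing_update, lambda mode: stops.append(mode))
outcome = loop.run_cycle()
```

With this flow, the "archiver process exited with exit code 2" line is just the postmaster reporting children it terminated during the immediate shutdown, not the cause of it.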
RE: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
From
Улаев Александр Сергеевич
Date:
We found the following errors in the syslog on ETCD1:

Feb  1 16:12:01 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:12:02 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.21s]
Feb  1 16:12:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1s]
Feb  1 16:12:20 etcd1 etcd: sync duration of 29.69857369s, expected less than 1s
Feb  1 16:26:55 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:27:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb  1 16:27:17 etcd1 etcd: sync duration of 1m0.745329542s, expected less than 1s

So the problem traces back to the SAN: the I/O freeze affected many VMs, including the DB and etcd nodes; the long etcd I/O delays caused the leader key update to fail, and Patroni reacted by demoting its PostgreSQL instance.

Sergei, thank you very much for your support!

Best regards,
Ulaev Alexander

-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 1:06 PM
To: Улаев Александр Сергеевич <alexander.ulaev@rtlabs.ru>
Cc: pgsql-bugs@lists.postgresql.org; PG Bug reporting form <noreply@postgresql.org>
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request

Hello

> 2022-02-01 16:12:24,928 ERROR: failed to update leader lock
> 2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS

Between these two messages, an immediate shutdown is called: https://github.com/zalando/patroni/blob/v2.1.2/patroni/ha.py#L1045

Regards, Sergei
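[Editor's note] A back-of-the-envelope check makes the "why did only this key disappear" part concrete: the leader key in etcd is written with a TTL, and if no refresh lands before the TTL elapses, etcd deletes the key, which is exactly the "Key not found" error above. The numbers below assume Patroni's commonly cited defaults (ttl=30s, loop_wait=10s); the actual cluster configuration may differ.

```python
# Assumed Patroni defaults -- verify against the cluster's own config.
TTL = 30        # seconds the leader key lives in etcd without a refresh
LOOP_WAIT = 10  # seconds between HA-loop cycles that refresh the key


def leader_key_survives(freeze_seconds, ttl=TTL):
    """During a full I/O freeze no refresh can land at all, so the
    leader key survives only if the freeze is shorter than the TTL."""
    return freeze_seconds < ttl


# The SAN freeze lasted roughly a minute, well past a 30s TTL,
# so the leader key expired and Patroni demoted the primary:
full_freeze_expired = not leader_key_survives(60)

# A blip shorter than the TTL (like the 16:11:56 timeout that was
# retried against ETCD2 within a second) is survivable:
short_blip_ok = leader_key_survives(20)
```

This also matches why the low-load shards "survived with 40-60 second commits": slow storage alone stalls commits but does not delete the etcd key, as long as some refresh still gets through within the TTL.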