Thread: received immediate shutdown request caused cluster failover

received immediate shutdown request caused cluster failover

From: Yi Sun
Date:
Hi all,

There are 3 nodes in our production Patroni cluster: vm01 is the leader, vm02 and vm03 are standbys. vm01 received an immediate shutdown request, which caused a failover to vm02; after that, vm03 received a fast shutdown request.

As vm03 was then no longer in the cluster, we had to reinit vm03.

What are the possible root causes of vm01 receiving an immediate shutdown request, please?

And what are the possible root causes of vm03 receiving a fast shutdown request?

Detailed information is below, please check; if any other info is needed, I will provide it, thanks.

PostgreSQL version: 11.4
OS: CentOS 7
HA: Patroni, recently upgraded from 1.6.3 to 2.0.1

--vm01 pg log

[2020-11-16 12:03:56.592 UTC] p=31485:3@ c=pgwatch2@127.0.0.1/eu4_baas_bckp_epmgr:pgwatch2 LOG:  disconnection: session time: 0:00:00.037 user=pgwatch2 database=eu4_baas_bckp_epmgr host=127.0.0.1 port=36916
[2020-11-16 12:03:57.005 UTC] p=32103:14@ c=@/: LOG:  received immediate shutdown request
[2020-11-16 12:03:57.017 UTC] p=31189:3@314/0 c=atp_si_user@10.253.85.203/eu4_baas_software_inventory:[unknown] WARNING:  terminating connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31189:4@314/0 c=atp_si_user@10.253.85.203/eu4_baas_software_inventory:[unknown] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31189:5@314/0 c=atp_si_user@10.253.85.203/eu4_baas_software_inventory:[unknown] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2020-11-16 12:03:57.017 UTC] p=31183:3@313/0 c=atp_pm@10.253.86.62/eu4_baas_atp_pm:[unknown] WARNING:  terminating connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31183:4@313/0 c=atp_pm@10.253.86.62/eu4_baas_atp_pm:[unknown] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31183:5@313/0 c=atp_pm@10.253.86.62/eu4_baas_atp_pm:[unknown] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2020-11-16 12:03:57.017 UTC] p=31182:3@310/281059 c=bckp_epmgr@10.253.85.202/eu4_baas_bckp_epmgr:[unknown] WARNING:  terminating connection because of crash of another server process
[2020-11-16 12:03:57.017 UTC] p=31182:4@310/281059 c=bckp_epmgr@10.253.85.202/eu4_baas_bckp_epmgr:[unknown] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2020-11-16 12:03:57.017 UTC] p=31182:5@310/281059 c=bckp_epmgr@10.253.85.202/eu4_baas_bckp_epmgr:[unknown] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

--vm01 patroni log
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:56,922", "name": "patroni.dcs.consul", "process": 32085, "thread": 140350381303616, "level": "WARNING", "message": "Could not register service: unknown role type promoted"}
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "Lock owner: eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:56 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:56,923", "name": "patroni.ha", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "does not have lock"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,158", "name": "patroni.dcs.consul", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "Deregister service pgcluster11/eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,180", "name": "patroni", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "demoting self because i do not have the lock and i was a leader"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,181", "name": "patroni", "process": 32085, "thread": 140350381303616, "level": "WARNING", "message": "Loop time exceeded, rescheduling immediately."}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,184", "name": "patroni.ha", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "Lock owner: eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm01"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,185", "name": "patroni", "process": 32085, "thread": 140350381303616, "level": "INFO", "message": "starting after demotion in progress"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,186", "name": "patroni.postgresql.connection", "process": 32085, "thread": 140349984724736, "level": "INFO", "message": "closed patroni connection to the postgresql cluster"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: {"time": "2020-11-16 12:03:59,668", "name": "patroni.postgresql.postmaster", "process": 32085, "thread": 140349984724736, "level": "INFO", "message": "postmaster pid=31495"}
Nov 16 12:03:59 eu4-baas-patroni-cluster-vm01 patroni[32085]: localhost:5432 - no response

--vm03 pg log
[2020-11-16 12:03:53.484 UTC] p=24090:1@ c=[unknown]@::1/[unknown]:[unknown] LOG:  connection received: host=::1 port=39442
[2020-11-16 12:03:53.485 UTC] p=24090:2@9/1878954 c=replicator@::1/[unknown]:[unknown] LOG:  replication connection authorized: user=replicator
[2020-11-16 12:03:53.487 UTC] p=24090:3@ c=replicator@::1/[unknown]:[unknown] LOG:  disconnection: session time: 0:00:00.002 user=replicator database= host=::1 port=39442
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[2020-11-16 12:03:57.108 UTC] p=27711:6@ c=@/: FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2020-11-16 12:03:57.144 UTC] p=27701:7@1/0 c=@/: LOG:  invalid record length at E052/93EED170: wanted 24, got 0
[2020-11-16 12:03:57.419 UTC] p=24097:1@ c=@/: FATAL:  could not connect to the primary server: FATAL:  the database system is shutting down

[2020-11-16 12:05:38.581 UTC] p=25380:2@ c=postgres@::1/postgres:[unknown] FATAL:  the database system is starting up
[2020-11-16 12:05:41.010 UTC] p=24113:4@1/0 c=@/: LOG:  consistent recovery state reached at E052/93EED170
[2020-11-16 12:05:41.010 UTC] p=24113:5@1/0 c=@/: LOG:  invalid record length at E052/93EED170: wanted 24, got 0
[2020-11-16 12:05:41.011 UTC] p=24109:8@ c=@/: LOG:  database system is ready to accept read only connections
[2020-11-16 12:05:43.280 UTC] p=25400:1@ c=[unknown]@::1/[unknown]:[unknown] LOG:  connection received: host=::1 port=40922
[2020-11-16 12:05:43.392 UTC] p=25401:1@ c=[unknown]@::1/[unknown]:[unknown] LOG:  connection received: host=::1 port=40924
[2020-11-16 12:05:43.393 UTC] p=25401:2@2/2 c=postgres@::1/postgres:[unknown] LOG:  connection authorized: user=postgres database=postgres
[2020-11-16 12:05:43.580 UTC] p=25402:1@ c=[unknown]@::1/[unknown]:[unknown] LOG:  connection received: host=::1 port=40926
[2020-11-16 12:05:43.628 UTC] p=25402:2@3/1 c=replicator@::1/[unknown]:[unknown] LOG:  replication connection authorized: user=replicator
[2020-11-16 12:05:43.630 UTC] p=25402:3@ c=replicator@::1/[unknown]:[unknown] LOG:  disconnection: session time: 0:00:00.049 user=replicator database= host=::1 port=40926
[2020-11-16 12:05:43.647 UTC] p=25406:1@ c=[unknown]@::1/[unknown]:[unknown] LOG:  connection received: host=::1 port=40932
[2020-11-16 12:05:43.648 UTC] p=25406:2@3/2 c=replicator@::1/[unknown]:[unknown] LOG:  replication connection authorized: user=replicator
[2020-11-16 12:05:43.650 UTC] p=25406:3@ c=replicator@::1/[unknown]:[unknown] LOG:  disconnection: session time: 0:00:00.002 user=replicator database= host=::1 port=40932
[2020-11-16 12:05:44.006 UTC] p=24109:9@ c=@/: LOG:  received fast shutdown request
[2020-11-16 12:05:44.015 UTC] p=24109:10@ c=@/: LOG:  aborting any active transactions
[2020-11-16 12:05:44.015 UTC] p=25401:3@2/0 c=postgres@::1/postgres:Patroni FATAL:  terminating connection due to administrator command
[2020-11-16 12:05:44.015 UTC] p=25401:4@ c=postgres@::1/postgres:Patroni LOG:  disconnection: session time: 0:00:00.622 user=postgres database=postgres host=::1 port=40924
[2020-11-16 12:05:44.509 UTC] p=24120:1@ c=@/: LOG:  shutting down
[2020-11-16 12:05:45.621 UTC] p=24109:11@ c=@/: LOG:  database system is shut down

--vm03 patroni log
Nov 16 12:03:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:03:51,436", "name": "patroni", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "following a different leader because i am not the healthiest node"}

Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: pg_controldata: could not read file "/var/lib/pgsql/11/data/global/pg_control": read 0 of 288
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,016", "name": "patroni.postgresql", "process": 27664, "thread": 140371489281856, "level": "ERROR", "message": "Error when calling pg_controldata"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: Traceback (most recent call last):
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/__init__.py", line 718, in controldata
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: data = subprocess.check_output([self.pgcommand('pg_controldata'), self._data_dir], env=env)
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: **kwargs).stdout
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: File "/usr/lib64/python3.6/subprocess.py", line 438, in run
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: output=stdout, stderr=stderr)
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: subprocess.CalledProcessError: Command '['/usr/pgsql-11/bin/pg_controldata', '/var/lib/pgsql/11/data']' returned non-zero exit status 1.
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,018", "name": "patroni.ha", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "pg_controldata:
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: "}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "Lock owner: eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,022", "name": "patroni.ha", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "Lock owner: eu4-baas-patroni-cluster-vm02; I am eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,031", "name": "patroni.dcs.consul", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "Deregister service pgcluster11/eu4-baas-patroni-cluster-vm03"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,048", "name": "patroni", "process": 27664, "thread": 140371489281856, "level": "INFO", "message": "starting as a secondary"}
Nov 16 12:05:51 eu4-baas-patroni-cluster-vm03 patroni[27664]: {"time": "2020-11-16 12:05:51,049", "name": "patroni.postgresql.connection", "process": 27664, "thread": 140371151345408, "level": "INFO", "message": "closed patroni connection to the postgresql cluster"}


Best regards
Oliver Sun







Re: received immediate shutdown request caused cluster failover

From: Yi Sun
Date:
Hi guys,

Besides a command being run (like pg_ctl), what other reasons can cause "received immediate shutdown request", please?

This is a production DB, and the support colleague said they didn't run it.








Re: received immediate shutdown request caused cluster failover

From: Tom Lane
Date:
Yi Sun <yinan81@gmail.com> writes:
> Besides command run(like pg_ctl) can cause "received immediate shutdown
> request"  any other reason can cause this please?

That message indicates that something sent the postmaster process a
SIGQUIT signal (which is all that "pg_ctl stop -m immediate" does).
There's no speculation to that: a look at postmaster.c will convince
you that there is no other way to reach that message.  So you need
to be looking for things that would be sending SIGQUIT unexpectedly.
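
One concrete way to look for the sender, sketched here on the assumption that auditd is available on these CentOS 7 hosts, is an audit rule on the kill() syscall filtered to signal 3 (SIGQUIT):

# log every kill(pid, sig) call where sig = 3 (SIGQUIT)
auditctl -a always,exit -F arch=b64 -S kill -F a1=3 -k pg_sigquit
# after the next occurrence, show which executable/uid/pid issued it
ausearch -k pg_sigquit -i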

I don't know much about Patroni, but maybe something in that
environment thinks that SIGQUIT'ing random processes is a good
thing to do.

            regards, tom lane



Re: received immediate shutdown request caused cluster failover

From: Yi Sun
Date:
Hello,

Thank you for your reply

The Patroni project replied with this:

"It seems your system is under so much stress that there was no resources for Patroni to execute HA loop for 35 seconds.
This interval exceeds ttl=30s, therefore the leader key expired, Patroni noticed it and demoted Postgres.

You need to figure out what is going on with your system, and what is the reason for cpu/memory pressure. Ideally fix these issues."
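
For reference, those thresholds are Patroni's dynamic DCS settings (ttl, loop_wait, retry_timeout). A sketch of inspecting and raising them for a single cluster with patronictl, assuming the cluster name pgcluster11 seen in the service names above and a config file at /etc/patroni.yml:

# show the dynamic configuration currently stored in Consul for this cluster
patronictl -c /etc/patroni.yml show-config pgcluster11

# illustrative values only; keep ttl comfortably larger than loop_wait plus retry_timeout
patronictl -c /etc/patroni.yml edit-config pgcluster11 --set ttl=60 --set loop_wait=10 --set retry_timeout=20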

As hundreds of clusters in our company are deployed with Ansible using the same parameters, changing parameters for just one cluster is difficult.

I am thinking we could get the top SQL from pg_stat_statements as below, then analyse and tune.

Is this the correct direction? Any suggestions, please? Thanks.

Top 5 SQL by per-call I/O time
select userid::regrole, dbid, query from pg_stat_statements order by (blk_read_time+blk_write_time)/calls desc limit 5;

Top 5 SQL by total I/O time
select userid::regrole, dbid, query from pg_stat_statements order by (blk_read_time+blk_write_time) desc limit 5;

Top 5 SQL by mean execution time
select userid::regrole, dbid, query from pg_stat_statements order by mean_time desc limit 5;

Top 5 SQL by total execution time
select userid::regrole, dbid, query from pg_stat_statements order by total_time desc limit 5;

Top 5 SQL by average execution time (computed)
select calls, total_time/calls AS avg_time, left(query,80) from pg_stat_statements order by 2 desc limit 5;

Top 5 SQL by execution time stddev
select userid::regrole, dbid, query from pg_stat_statements order by stddev_time desc limit 5;

Top 5 SQL by shared blocks hit/dirtied
select userid::regrole, dbid, query from pg_stat_statements order by (shared_blks_hit+shared_blks_dirtied) desc limit 5;

Top 5 SQL by temp blocks written
select userid::regrole, dbid, query from pg_stat_statements order by temp_blks_written desc limit 5;
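
If the goal is to capture what ran around the incident rather than everything since the last reset, the counters can be cleared first and the rankings re-sampled after a representative window; a minimal sketch using the extension's reset function:

-- clear accumulated statistics (by default only superusers may call this)
select pg_stat_statements_reset();
-- ...let the workload run for a while, then re-run the ranking queries above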

