postgres cluster becomes unhealthy after draining one of the worker nodes - Mailing list pgsql-admin

From Oswald Hirschmueller
Subject postgres cluster becomes unhealthy after draining one of the worker nodes
Date
Msg-id 71de8e84252d4becbb76b622787b5251@gmsEXCH1201.gms.ca
Responses Re: postgres cluster becomes unhealthy after draining one of the worker nodes  (Scott Ribe <scott_ribe@elevated-dev.com>)
List pgsql-admin

We are trying to upgrade the Kubernetes worker nodes, but are encountering an issue with the Postgres clusters. As soon as our system administrator drains one of the nodes, the Postgres cluster becomes unhealthy. We expected another node to take over as the master, but this did not occur. Note that the replicas are running in synchronous mode.
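For reference, since synchronous mode seems relevant here: in Patroni this is controlled by the `synchronous_mode` settings in the DCS configuration. The fragment below is only an illustrative sketch based on Patroni's documented option names, not a copy of our actual configuration:

```yaml
# Illustrative Patroni DCS configuration fragment (not our actual config).
# With synchronous_mode enabled, Patroni only considers the current
# synchronous standby eligible for promotion; other replicas report
# "following a different leader because i am not the healthiest node".
synchronous_mode: true
# If strict mode were enabled, Patroni would additionally refuse to
# disable synchronous replication when no synchronous standby is
# available, which can block failover entirely:
synchronous_mode_strict: false
```

We are not certain whether this interaction explains the behavior; we include it in case it helps diagnose why no replica was promoted.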

 

This is the command the system administrator ran. The issue occurs shortly after.

kubectl drain $NODENAME --delete-local-data --ignore-daemonsets

 

Here are snippets of the logs of the master and one of the replicas (the log of the second replica looks the same):

 

Master Log

----------

2021-04-15 20:44:29,986 INFO: no action. i am the leader with the lock

2021-04-15 20:44:39,907 INFO: Lock owner: dbo-dhog-5b7dc865b4-lg7c9; I am dbo-dhog-5b7dc865b4-lg7c9

(the master was on the node that was drained)

/tmp:5432 - no response

2021-04-15 20:47:13.940 UTC [186] LOG: pgaudit extension initialized

2021-04-15 20:47:13.942 UTC [186] LOG: starting PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

2021-04-15 20:47:13.942 UTC [186] LOG: listening on IPv4 address "0.0.0.0", port 5432

2021-04-15 20:47:14.105 UTC [186] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"

2021-04-15 20:47:14.277 UTC [186] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"

2021-04-15 20:47:14.321 UTC [186] LOG: redirecting log output to logging collector process

2021-04-15 20:47:14.321 UTC [186] HINT: Future log output will appear in directory "pg_log".

/tmp:5432 - rejecting connections

/tmp:5432 - rejecting connections

/tmp:5432 - rejecting connections

/tmp:5432 - accepting connections

2021-04-15 20:47:16,996 INFO: establishing a new patroni connection to the postgres cluster

2021-04-15 20:47:17,105 INFO: following a different leader because i am not the healthiest node

Thu Apr 15 20:47:17 UTC 2021 INFO: Node dbo-dhog-5b7dc865b4-d8pv4 fully initialized for cluster dbo and is ready for use

2021-04-15 20:47:27,511 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:47:37,516 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:47:47,511 INFO: following a different leader because i am not the healthiest node

 

Replica Log

-----------

2021-04-15 20:46:16.618 UTC [28158] LOG: listening on IPv4 address "0.0.0.0", port 5432

/tmp:5432 - no response

2021-04-15 20:46:16.655 UTC [28158] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"

2021-04-15 20:46:16.697 UTC [28158] LOG: listening on Unix socket "/crunchyadm/.s.PGSQL.5432"

2021-04-15 20:46:16.764 UTC [28158] LOG: redirecting log output to logging collector process

2021-04-15 20:46:16.764 UTC [28158] HINT: Future log output will appear in directory "pg_log".

2021-04-15 20:46:17,061 INFO: Lock owner: None; I am dbo-ffvg-6c5d7c576c-8tcwt

2021-04-15 20:46:17,061 INFO: not healthy enough for leader race

2021-04-15 20:46:17,061 INFO: changing primary_conninfo and restarting in progress

/tmp:5432 - rejecting connections

/tmp:5432 - rejecting connections

/tmp:5432 - accepting connections

2021-04-15 20:46:18,729 INFO: establishing a new patroni connection to the postgres cluster

2021-04-15 20:46:18,871 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:46:29,245 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:46:39,248 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:46:49,249 INFO: following a different leader because i am not the healthiest node

2021-04-15 20:46:59,247 INFO: following a different leader because i am not the healthiest node

 

What can we do to prevent this situation from occurring? We do not want this to happen when we upgrade our production nodes.

 

Thanks,

Os


 
