Query 2 Node HA test case result - Mailing list pgsql-general

From: Mukesh Tanuku
Subject: Query 2 Node HA test case result
Date:
Msg-id: CAJzgB-F7jEUeiHDWZKXHqMiYvKiwYcQkRoA47=pYQ9rkhRV6+A@mail.gmail.com
List: pgsql-general
Hello everyone,
We are doing a POC of a PostgreSQL HA setup with asynchronous streaming replication, using Pgpool-II for load balancing and connection pooling and repmgr for HA and automatic failover.
As a test case, we isolate the VM1 node completely from the network for more than 2 minutes and then plug it back in, since we want to verify how the system behaves during network glitches and whether there is any chance of split-brain.
Our current setup looks like this:
Two VMs on Azure; each VM runs PostgreSQL along with the Pgpool-II service.
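For context, the backend section of our pgpool.conf looks roughly like the sketch below (simplified; the weights and flags are illustrative rather than our exact values):

backend_hostname0 = 'staging-ha0001'
backend_port0 = 5432
backend_weight0 = 1                      # illustrative
backend_flag0 = 'ALLOW_TO_FAILOVER'      # illustrative
backend_hostname1 = 'staging-ha0002'
backend_port1 = 5432
backend_weight1 = 1                      # illustrative
backend_flag1 = 'ALLOW_TO_FAILOVER'      # illustrative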
[Attached: image.png (diagram of the two-VM Pgpool-II + PostgreSQL setup)]

We enabled the watchdog and assigned a delegate IP.
NOTE: due to some platform limitations, we are using a floating IP as the delegate IP.
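The watchdog/VIP part of the configuration is roughly as follows (a sketch only: the delegate IP is the one that appears in the logs, while the if_up_cmd/if_down_cmd/arping_cmd values are the usual sudo-based sample commands, not necessarily our exact ones):

delegate_ip = '10.127.1.20'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'    # sample-style command (illustrative)
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'               # sample-style command (illustrative)
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'             # sample-style command (illustrative)

As seen in the VM1 log further down, if_down_cmd fails during de-escalation with a sudo password prompt ("sudo: a terminal is required to read the password"), so the pgpool OS user presumably needs passwordless sudo for these commands.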

During the test, here are our observations:
1. Client connections hung from the time VM1 was lost from the network until VM1 came back.
2. Once VM1 was lost, Pgpool-II on VM2 became the LEADER node and the PostgreSQL standby on VM2 was promoted to primary, but client connections still did not reach the new primary. Why is this not happening?
3. Once VM1 was back on the network, there was a split-brain situation: pgpool on VM1 took the lead to become the LEADER node (as pgpool.log shows), and from then on clients connect to the VM1 node via the VIP.
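(For reference, all of the evidence below is from pgpool.log; the backend and watchdog state can also be queried at any point, for example with:

SHOW pool_nodes;  (via psql against the pgpool port 9999)
pcp_watchdog_info -h localhost -p 9898 -U pgpool  (watchdog/leader state; the host, PCP port and user shown are illustrative)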

pgpool.conf 

sr_check_period = 10
health_check_period = 30
health_check_timeout = 20
health_check_max_retries = 3
health_check_retry_delay = 1
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
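
For reference, a sketch of the quorum/consensus-related watchdog parameters (Pgpool-II 4.1+ names; the values shown are the documented defaults, which we believe is what we are running with):

failover_when_quorum_exists = on          # documented default (assumption)
failover_require_consensus = on           # documented default (assumption)
enable_consensus_with_half_votes = off    # documented default (assumption)

These govern whether a failover request is executed or downgraded to a quarantine when the watchdog cluster does not hold the quorum, which is what the logs below show happening.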


Log information:

From VM2:

pgpool.log

14:30:17  Network disconnected

After about 10 seconds the streaming replication check failed with a timeout.

2024-07-03 14:30:26.176: sr_check_worker pid 58187: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

 

Then the pgpool health check failed because it timed out, per health_check_timeout = 20.

2024-07-03 14:30:35.869: health_check0 pid 58188: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

 

health_check and sr_check were retried but timed out again.

 

2024-07-03 14:30:46.187: sr_check_worker pid 58187: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:46.880: health_check0 pid 58188: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

 

Watchdog received a message saying the Leader node is lost.

 

2024-07-03 14:30:47.192: watchdog pid 58151: WARNING:  we have not received a beacon message from leader node "staging-ha0001:9999 Linux staging-ha0001"

2024-07-03 14:30:47.192: watchdog pid 58151: DETAIL:  requesting info message from leader node

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  read from socket failed, remote end closed the connection

2024-07-03 14:30:54.312: watchdog pid 58151: LOG:  client socket of staging-ha0001:9999 Linux staging-ha0001 is closed

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  remote node "staging-ha0001:9999 Linux staging-ha0001" is reporting that it has lost us

2024-07-03 14:30:54.313: watchdog pid 58151: LOG:  we are lost on the leader node "staging-ha0001:9999 Linux staging-ha0001"

 

health_check and sr_check were retried but timed out again.

 

2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:30:57.888: health_check0 pid 58188: LOG:  health check retrying on DB node: 0 (round:3)

2024-07-03 14:31:06.201: sr_check_worker pid 58187: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

 

 

About 10 seconds after the leader node was lost, the watchdog changed the current node's state to LEADER.

2024-07-03 14:31:04.199: watchdog pid 58151: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [LEADER]

 

 

The health check failed on node 0, a degenerate backend request was received for node 0, and the pgpool main process started quarantining staging-ha0001(5432) (shutdown).

 

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  setting the local node "staging-ha0002:9999 Linux staging-ha0002" as watchdog cluster leader

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  signal_user1_to_parent_with_reason(1)

2024-07-03 14:31:08.202: watchdog pid 58151: LOG:  I am the cluster leader node but we do not have enough nodes in cluster

2024-07-03 14:31:08.202: watchdog pid 58151: DETAIL:  waiting for the quorum to start escalation process

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process received SIGUSR1

2024-07-03 14:31:08.202: main pid 58147: LOG:  Pgpool-II parent process received watchdog state change signal from watchdog

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  failed to connect to PostgreSQL server on "staging-ha0001:5432", timed out

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  health check failed on node 0 (timeout:0)

2024-07-03 14:31:08.899: health_check0 pid 58188: LOG:  received degenerate backend request for node_id: 0 from pid [58188]

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  watchdog received the failover command from local pgpool-II on IPC interface

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  failover requires the quorum to hold, which is not present at the moment

2024-07-03 14:31:08.899: watchdog pid 58151: DETAIL:  Rejecting the failover request

2024-07-03 14:31:08.899: watchdog pid 58151: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "staging-ha0002:9999 Linux staging-ha0002" is rejected because the watchdog cluster does not hold the quorum

2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:  degenerate backend request for 1 node(s) from pid [58188], is changed to quarantine node request by watchdog

2024-07-03 14:31:08.900: health_check0 pid 58188: DETAIL:  watchdog does not holds the quorum

2024-07-03 14:31:08.900: health_check0 pid 58188: LOG:  signal_user1_to_parent_with_reason(0)

2024-07-03 14:31:08.900: main pid 58147: LOG:  Pgpool-II parent process received SIGUSR1

2024-07-03 14:31:08.900: main pid 58147: LOG:  Pgpool-II parent process has received failover request

2024-07-03 14:31:08.900: watchdog pid 58151: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:31:08.900: watchdog pid 58151: LOG:  watchdog is informed of failover start by the main process

2024-07-03 14:31:08.900: main pid 58147: LOG:  === Starting quarantine. shutdown host staging-ha0001(5432) ===

2024-07-03 14:31:08.900: main pid 58147: LOG:  Restart all children

2024-07-03 14:31:08.900: main pid 58147: LOG:  failover: set new primary node: -1

2024-07-03 14:31:08.900: main pid 58147: LOG:  failover: set new main node: 1

2024-07-03 14:31:08.906: sr_check_worker pid 58187: ERROR:  Failed to check replication time lag

2024-07-03 14:31:08.906: sr_check_worker pid 58187: DETAIL:  No persistent db connection for the node 0

2024-07-03 14:31:08.906: sr_check_worker pid 58187: HINT:  check sr_check_user and sr_check_password

2024-07-03 14:31:08.906: sr_check_worker pid 58187: CONTEXT:  while checking replication time lag

2024-07-03 14:31:08.906: sr_check_worker pid 58187: LOG:  worker process received restart request

2024-07-03 14:31:08.906: watchdog pid 58151: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:31:08.906: watchdog pid 58151: LOG:  watchdog is informed of failover end by the main process

2024-07-03 14:31:08.906: main pid 58147: LOG:  === Quarantine done. shutdown host staging-ha0001(5432) ===

2024-07-03 14:31:09.906: pcp_main pid 58186: LOG:  restart request received in pcp child process

2024-07-03 14:31:09.907: main pid 58147: LOG:  PCP child 58186 exits with status 0 in failover()

2024-07-03 14:31:09.908: main pid 58147: LOG:  fork a new PCP child pid 58578 in failover()

2024-07-03 14:31:09.908: main pid 58147: LOG:  reaper handler

2024-07-03 14:31:09.908: pcp_main pid 58578: LOG:  PCP process: 58578 started

2024-07-03 14:31:09.909: main pid 58147: LOG:  reaper handler: exiting normally

2024-07-03 14:31:09.909: sr_check_worker pid 58579: LOG:  process started

2024-07-03 14:31:19.915: watchdog pid 58151: LOG:  not able to send messages to remote node "staging-ha0001:9999 Linux staging-ha0001"

2024-07-03 14:31:19.915: watchdog pid 58151: DETAIL:  marking the node as lost

2024-07-03 14:31:19.915: watchdog pid 58151: LOG:  remote node "staging-ha0001:9999 Linux staging-ha0001" is lost

 

 

 

From VM1:

pgpool.log

2024-07-03 14:30:36.444: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:36.444: watchdog pid 8620: DETAIL:  missed beacon reply count:2

2024-07-03 14:30:37.448: sr_check_worker pid 65605: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:46.067: health_check1 pid 8676: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:46.068: health_check1 pid 8676: LOG:  health check retrying on DB node: 1 (round:1)

2024-07-03 14:30:46.455: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:46.455: watchdog pid 8620: DETAIL:  missed beacon reply count:3

2024-07-03 14:30:47.449: sr_check_worker pid 65605: ERROR:  Failed to check replication time lag

2024-07-03 14:30:47.449: sr_check_worker pid 65605: DETAIL:  No persistent db connection for the node 1

2024-07-03 14:30:47.449: sr_check_worker pid 65605: HINT:  check sr_check_user and sr_check_password

2024-07-03 14:30:47.449: sr_check_worker pid 65605: CONTEXT:  while checking replication time lag

2024-07-03 14:30:55.104: child pid 65509: LOG:  failover or failback event detected

2024-07-03 14:30:55.104: child pid 65509: DETAIL:  restarting myself

2024-07-03 14:30:55.104: main pid 8617: LOG:  reaper handler

2024-07-03 14:30:55.105: main pid 8617: LOG:  reaper handler: exiting normally

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is not replying to our beacons

2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL:  missed beacon reply count:4

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is not responding to our beacon messages

2024-07-03 14:30:56.459: watchdog pid 8620: DETAIL:  marking the node as lost

2024-07-03 14:30:56.459: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is lost

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:  removing watchdog node "staging-ha0002:9999 Linux staging-ha0002" from the standby list

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:  We have lost the quorum

2024-07-03 14:30:56.460: watchdog pid 8620: LOG:  signal_user1_to_parent_with_reason(3)

2024-07-03 14:30:56.460: main pid 8617: LOG:  Pgpool-II parent process received SIGUSR1

2024-07-03 14:30:56.460: main pid 8617: LOG:  Pgpool-II parent process received watchdog quorum change signal from watchdog

2024-07-03 14:30:56.461: watchdog_utility pid 66197: LOG:  watchdog: de-escalation started

sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper

2024-07-03 14:30:57.078: health_check1 pid 8676: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:30:57.078: health_check1 pid 8676: LOG:  health check retrying on DB node: 1 (round:2)

2024-07-03 14:30:57.418: life_check pid 8639: LOG:  informing the node status change to watchdog

2024-07-03 14:30:57.418: life_check pid 8639: DETAIL:  node id :1 status = "NODE DEAD" message:"No heartbeat signal from node"

2024-07-03 14:30:57.418: watchdog pid 8620: LOG:  received node status change ipc message

2024-07-03 14:30:57.418: watchdog pid 8620: DETAIL:  No heartbeat signal from node

2024-07-03 14:30:57.418: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is lost

2024-07-03 14:30:57.464: sr_check_worker pid 65605: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

sudo: a password is required

2024-07-03 14:30:59.301: watchdog_utility pid 66197: LOG:  failed to release the delegate IP:"10.127.1.20"

2024-07-03 14:30:59.301: watchdog_utility pid 66197: DETAIL:  'if_down_cmd' failed

2024-07-03 14:30:59.301: watchdog_utility pid 66197: WARNING:  watchdog de-escalation failed to bring down delegate IP

2024-07-03 14:30:59.301: watchdog pid 8620: LOG:  watchdog de-escalation process with pid: 66197 exit with SUCCESS.

 

2024-07-03 14:31:07.465: sr_check_worker pid 65605: ERROR:  Failed to check replication time lag

2024-07-03 14:31:07.465: sr_check_worker pid 65605: DETAIL:  No persistent db connection for the node 1

2024-07-03 14:31:07.465: sr_check_worker pid 65605: HINT:  check sr_check_user and sr_check_password

2024-07-03 14:31:07.465: sr_check_worker pid 65605: CONTEXT:  while checking replication time lag

2024-07-03 14:31:08.089: health_check1 pid 8676: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:08.089: health_check1 pid 8676: LOG:  health check retrying on DB node: 1 (round:3)

2024-07-03 14:31:17.480: sr_check_worker pid 65605: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  health check failed on node 1 (timeout:0)

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  received degenerate backend request for node_id: 1 from pid [8676]

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  watchdog received the failover command from local pgpool-II on IPC interface

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  failover requires the quorum to hold, which is not present at the moment

2024-07-03 14:31:19.097: watchdog pid 8620: DETAIL:  Rejecting the failover request

2024-07-03 14:31:19.097: watchdog pid 8620: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "staging-ha0001:9999 Linux staging-ha0001" is rejected because the watchdog cluster does not hold the quorum

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  degenerate backend request for 1 node(s) from pid [8676], is changed to quarantine node request by watchdog

2024-07-03 14:31:19.097: health_check1 pid 8676: DETAIL:  watchdog does not holds the quorum

2024-07-03 14:31:19.097: health_check1 pid 8676: LOG:  signal_user1_to_parent_with_reason(0)

2024-07-03 14:31:19.097: main pid 8617: LOG:  Pgpool-II parent process received SIGUSR1

2024-07-03 14:31:19.097: main pid 8617: LOG:  Pgpool-II parent process has received failover request

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  watchdog is informed of failover start by the main process

2024-07-03 14:31:19.098: main pid 8617: LOG:  === Starting quarantine. shutdown host staging-ha0002(5432) ===

2024-07-03 14:31:19.098: main pid 8617: LOG:  Do not restart children because we are switching over node id 1 host: staging-ha0002 port: 5432 and we are in streaming replication mode

2024-07-03 14:31:19.098: main pid 8617: LOG:  failover: set new primary node: 0

2024-07-03 14:31:19.098: main pid 8617: LOG:  failover: set new main node: 0

2024-07-03 14:31:19.098: sr_check_worker pid 65605: ERROR:  Failed to check replication time lag

2024-07-03 14:31:19.098: sr_check_worker pid 65605: DETAIL:  No persistent db connection for the node 1

2024-07-03 14:31:19.098: sr_check_worker pid 65605: HINT:  check sr_check_user and sr_check_password

2024-07-03 14:31:19.098: sr_check_worker pid 65605: CONTEXT:  while checking replication time lag

2024-07-03 14:31:19.098: sr_check_worker pid 65605: LOG:  worker process received restart request

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:31:19.098: watchdog pid 8620: LOG:  watchdog is informed of failover end by the main process

2024-07-03 14:31:19.098: main pid 8617: LOG:  === Quarantine done. shutdown host staging-ha0002(5432) ==

 

 

2024-07-03 14:35:59.420: watchdog pid 8620: LOG:  new outbound connection to staging-ha0002:9000

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  "staging-ha0001:9999 Linux staging-ha0001" is the coordinator as per our record but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  cluster is in the split-brain

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  I am the coordinator but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  trying to figure out the best contender for the leader/coordinator node

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  remote node:"staging-ha0002:9999 Linux staging-ha0002" should step down from leader because we are the older leader

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  We are in split brain, and I am the best candidate for leader/coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  asking the remote node "staging-ha0002:9999 Linux staging-ha0002" to step down

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:35:59.423: watchdog pid 8620: LOG:  "staging-ha0001:9999 Linux staging-ha0001" is the coordinator as per our record but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator

2024-07-03 14:35:59.423: watchdog pid 8620: DETAIL:  cluster is in the split-brain

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  I am the coordinator but "staging-ha0002:9999 Linux staging-ha0002" is also announcing as a coordinator

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  trying to figure out the best contender for the leader/coordinator node

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  remote node:"staging-ha0002:9999 Linux staging-ha0002" should step down from leader because we are the older leader

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  We are in split brain, and I am the best candidate for leader/coordinator

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  asking the remote node "staging-ha0002:9999 Linux staging-ha0002" to step down

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.424: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:35:59.424: watchdog pid 8620: LOG:  remote node "staging-ha0002:9999 Linux staging-ha0002" is reporting that it has found us again

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  leader/coordinator node "staging-ha0002:9999 Linux staging-ha0002" decided to resign from leader, probably because of split-brain

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  It was not our coordinator/leader anyway. ignoring the message

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.425: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.425: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  we have received the NODE INFO message from the node:"staging-ha0002:9999 Linux staging-ha0002" that was lost

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  we had lost this node because of "REPORTED BY LIFECHECK"

2024-07-03 14:35:59.427: watchdog pid 8620: LOG:  node:"staging-ha0002:9999 Linux staging-ha0002" was reported lost by the life-check process

2024-07-03 14:35:59.427: watchdog pid 8620: DETAIL:  node will be added to cluster once life-check mark it as reachable again

2024-07-03 14:36:00.213: health_check1 pid 8676: LOG:  failed to connect to PostgreSQL server on "staging-ha0002:5432", timed out

2024-07-03 14:36:00.213: health_check1 pid 8676: LOG:  health check retrying on DB node: 1 (round:3)

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  health check retrying on DB node: 1 succeeded

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  received failback request for node_id: 1 from pid [8676]

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  failback request from pid [8676] is changed to update status request because node_id: 1 was quarantined

2024-07-03 14:36:01.221: health_check1 pid 8676: LOG:  signal_user1_to_parent_with_reason(0)

2024-07-03 14:36:01.221: main pid 8617: LOG:  Pgpool-II parent process received SIGUSR1

2024-07-03 14:36:01.221: main pid 8617: LOG:  Pgpool-II parent process has received failover request

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  watchdog is informed of failover start by the main process

2024-07-03 14:36:01.221: watchdog pid 8620: LOG:  watchdog is informed of failover start by the main process

2024-07-03 14:36:01.222: main pid 8617: LOG:  === Starting fail back. reconnect host staging-ha0002(5432) ===

2024-07-03 14:36:01.222: main pid 8617: LOG:  Node 0 is not down (status: 2)

2024-07-03 14:36:01.222: main pid 8617: LOG:  Do not restart children because we are failing back node id 1 host: staging-ha0002 port: 5432 and we are in streaming replication mode and not all backends were down

2024-07-03 14:36:01.222: main pid 8617: LOG:  failover: set new primary node: 0

2024-07-03 14:36:01.222: main pid 8617: LOG:  failover: set new main node: 0

2024-07-03 14:36:01.222: sr_check_worker pid 66222: LOG:  worker process received restart request

2024-07-03 14:36:01.222: watchdog pid 8620: LOG:  received the failover indication from Pgpool-II on IPC interface

2024-07-03 14:36:01.222: watchdog pid 8620: LOG:  watchdog is informed of failover end by the main process

2024-07-03 14:36:01.222: main pid 8617: LOG:  === Failback done. reconnect host staging-ha0002(5432) ===



Questions: 
1. Regarding point 2 in the observations above, why are connections not going to the new primary?
2. In this kind of setup, can a transaction split happen when there is a network glitch?

If anyone has worked on a similar kind of setup, please share your insights.
Thank you

Regards
Mukesh


 