RE: Exit walsender before confirming remote flush in logical replication - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Exit walsender before confirming remote flush in logical replication
Date
Msg-id TYAPR01MB58663B6F837C36C7DB20A463F5CC9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Exit walsender before confirming remote flush in logical replication  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed the relation with the given commit,
and I thought I could keep my patch. How do you think?

# Abstract

* Some modifications should be needed.
* We cannot rollback the shutdown if walsenders are stuck
* We don't have a good way to detect stuck

# Discussion

Compared to physical replication, it is likely to happen that logical replication is stuck.
I think the risk should be avoided as much as possible by fixing codes.
Then, if it leads another failure, we can document the caution to users.

While shutting down the server, checkpointer sends SIGUSR1 signal to wansenders.
It is done after being exited other processes, so we cannot raise ERROR and rollback
the operation if checkpointer recognize the process stuck at that time.

We don't have any features that postmaster can check  whether this node is
publisher or not. So if we want to add the mechanism that can check the health
of walsenders before shutting down, we must do that at the top of
process_pm_shutdown_request() even if we are not in logical replication.
I think it affects the basis of postgres largely, and in the first place,
PostgreSQL does not have a mechanism to check the health of process.

Therefore, I want to adopt the approach that walsender itself exits immediately when they get signals.

## About patch - Were fixes correct?

In ProcessPendingWrites(), my patch, wansender calls WalSndDone() when it gets
SIGUSR1 signal. I think this should be. From the patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
                /* Try to flush pending output to the client */
                if (pq_flush_if_writable() != 0)
                        WalSndShutdown();
+
+               /* If we got shut down requested, try to exit the process */
+               if (got_STOPPING)
+                       WalSndDone(XLogSendLogical);
        }
 
        /* reactivate latch so WalSndLoop knows to continue */
```


Per my analysis, in case of logical replication, walsenders exit with following
steps. Note that logical walsender does not receive SIGUSR2 signal, set flag by
themselves instead:

1. postmaster sends shutdown requests to checkpointer
2. checkpointer sends SIGUSR1 to walsenders and wait
3. when walsenders accept SIGUSR1, they turn got_SIGUSR1 on.
4. walsenders consume all WALs. @XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical
6. walsenders recognize the flag is on, so call WalSndDone() @ WalSndLoop
7. proc_exit(0)
8. checkpoitner writes shutdown record
...

Type (i) stuck, I reported in -hackers[1], means that processes stop at step 6
and Type (ii) stuck means that processes stop at 4. In step4, got_SIGUSR2 is never set to on, so
we must use got_STOPPING flag.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB58701A47F35FED0A2B399662F5C49@TYCPR01MB5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Improve GetConfigOptionValues function
Next
From: Laurenz Albe
Date:
Subject: Re: Something is wrong with wal_compression