Thread: Exit walsender before confirming remote flush in logical replication
Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,
(I added Amit as CC because we discussed this in another thread)

This is a fork of the time-delayed logical replication thread [1]. While discussing that, we thought that we could relax the condition for walsender shutdown [2][3].

Currently, walsenders delay the shutdown request until confirming that all sent data has been flushed on the remote side. This condition was added in 985bd7 [4] to support clean switchover. Suppose there is a primary-secondary physical replication system, and we do the following steps:

1. Stop the primary server.
2. Promote the secondary to the new primary.
3. Restart the (old) primary as the new secondary.

If any changes arrive during step 2 but the walsender does not confirm the remote flush, the restart in step 3 may fail.

In the case of logical replication, however, we cannot support the use case of switching the roles publisher <-> subscriber. Suppose the same case as above, where additional transactions are committed while doing step 2. To catch up with such changes, the subscriber must receive the WAL related to those transactions, but it cannot do so because a subscriber cannot request WAL from a specific position. In that case, we must truncate all data on the new subscriber once, and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore this condition when shutting down a walsender used for logical replication.

This change may be useful for time-delayed logical replication. The walsender delays shutdown until all changes are applied on the subscriber, even if the apply is delayed, which means the publisher cannot be stopped if a large delay time is specified.

PSA a minimal patch for that. I'm not sure whether WalSndCaughtUp should also be omitted; such a change may affect other parts like WalSndWaitForWal(), but we can investigate that further.
[1]: https://commitfest.postgresql.org/41/3581/
[2]: https://www.postgresql.org/message-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg%40mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Currently, walsenders delay the shutdown request until confirming all sent data
> are flushed on remote side. This condition was added in 985bd7[4], which is for
> supporting clean switchover.
>
> 1. Stops primary server.
> 2. Promotes secondary to new primary.
> 3. Reboot (old)primary as new secondary.
>
> In case of logical replication, however, we cannot support the use-case that
> switches the role publisher <-> subscriber. Suppose same case as above, additional
> transactions are committed while doing step2. To catch up such changes subscriber
> must receive WALs related with trans, but it cannot be done because subscriber
> cannot request WALs from the specific position. In the case, we must truncate all
> data in new subscriber once, and then create new subscription with copy_data
> = true.
>
> Therefore, I think that we can ignore the condition for shutting down the
> walsender in logical replication.
>
> This change may be useful for time-delayed logical replication. The walsender
> waits the shutdown until all changes are applied on subscriber, even if it is
> delayed. This causes that publisher cannot be stopped if large delay-time is
> specified.

I think the current behaviour is an artifact of using the same WAL sender code for both logical and physical replication. I agree with you that the logical WAL sender need not wait for all the WAL to be replayed downstream.

I have not reviewed the patch though.

--
Best Wishes,
Ashutosh Bapat
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in
> On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > In case of logical replication, however, we cannot support the use-case that
> > switches the role publisher <-> subscriber.
..
> > Therefore, I think that we can ignore the condition for shutting down the
> > walsender in logical replication.
...
> > This change may be useful for time-delayed logical replication. The walsender
> > waits the shutdown until all changes are applied on subscriber, even if it is
> > delayed. This causes that publisher cannot be stopped if large delay-time is
> > specified.
>
> I think the current behaviour is an artifact of using the same WAL
> sender code for both logical and physical replication.

Yeah, I don't think we do that for the reason of switchover. On the other hand, I think the behavior was intentionally carried over since it is thought of as sensible on its own. And I'm afraid that many people already rely on that behavior.

> I agree with you that the logical WAL sender need not wait for all the
> WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that behavior altogether because a new feature stumbles on it. I'm fine doing that only in the apply_delay case, though. However, I have another concern: we would be introducing a second exception for XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with apply_delay, since the sender transaction is inevitably affected back by that delay.

Thus, how about the logrep worker sending a kind of crafted feedback before entering an apply_delay, which reports commit_data.end_lsn as flushpos? A little tweak is needed in send_feedback(), but it seems to work..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Dec 23, 2022 at 7:51 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in
> > On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > > In case of logical replication, however, we cannot support the use-case that
> > > switches the role publisher <-> subscriber.
> ..
> > > Therefore, I think that we can ignore the condition for shutting down the
> > > walsender in logical replication.
> ...
> > > This change may be useful for time-delayed logical replication. The walsender
> > > waits the shutdown until all changes are applied on subscriber, even if it is
> > > delayed. This causes that publisher cannot be stopped if large delay-time is
> > > specified.
> >
> > I think the current behaviour is an artifact of using the same WAL
> > sender code for both logical and physical replication.
>
> Yeah, I don't think we do that for the reason of switchover. On the
> other hand, I think the behavior was intentionally carried over since it
> is thought of as sensible on its own.
>

Do you see it was discussed somewhere? If so, can you please point to that discussion?

> And I'm afraid that many people already
> rely on that behavior.
>

But OTOH, it can also be annoying for users to see some wait during the shutdown which is actually not required.

> > I agree with you that the logical WAL sender need not wait for all the
> > WAL to be replayed downstream.
>
> Thus I feel that it might be a bit outrageous to get rid of that
> behavior altogether because a new feature stumbles on it. I'm
> fine doing that only in the apply_delay case, though. However, I have
> another concern: we would be introducing a second exception for
> XLogSendLogical in the common path.
>
> I doubt that anyone wants to use synchronous logical replication with
> apply_delay, since the sender transaction is inevitably affected back
> by that delay.
>
> Thus, how about the logrep worker sending a kind of crafted feedback
> before entering an apply_delay, which reports commit_data.end_lsn as
> flushpos? A little tweak is needed in send_feedback(), but it seems to
> work..
>

How can we send commit_data.end_lsn before actually committing the xact? I think this can lead to a problem because next time (say after a restart of the walsender) the server can skip sending the xact even if it has not been committed by the client.

--
With Regards,
Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san,

> Thus, how about the logrep worker sending a kind of crafted feedback
> before entering an apply_delay, which reports commit_data.end_lsn as
> flushpos? A little tweak is needed in send_feedback(), but it seems to
> work..

Thanks for replying! I tested your suggestion, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case. Apply workers that are delaying the apply send begin_data.final_lsn as recvpos and flushpos in send_feedback(). The following is the feedback message I got, and we can see that recv and flush were overwritten.

```
DEBUG: sending feedback (force 1) to recv 0/1553638, write 0/1553550, flush 0/1553638
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 730, finished at 0/1553638
```

In terms of the walsender, however, sentPtr seemed to be slightly larger than the flushed position on the subscriber.

```
(gdb) p MyWalSnd->sentPtr
$2 = 22361760
(gdb) p MyWalSnd->flush
$3 = 22361656
(gdb) p *MyWalSnd
$4 = {pid = 28807, state = WALSNDSTATE_STREAMING, sentPtr = 22361760,
  needreload = false, write = 22361656, flush = 22361656, apply = 22361424,
  writeLag = 20020343, flushLag = 20020343, applyLag = 20020343,
  sync_standby_priority = 0, mutex = 0 '\000', latch = 0x7ff0350cbb94,
  replyTime = 725113263592095}
```

Therefore I could not shut down the publisher node while the apply was delayed:

```
$ pg_ctl stop -D data_pub/
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Do you have any opinions about this?

[1]: https://www.postgresql.org/message-id/TYCPR01MB83730A3E21E921335F6EFA38EDE89@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san,

> > Thus, how about the logrep worker sending a kind of crafted feedback
> > before entering an apply_delay, which reports commit_data.end_lsn as
> > flushpos? A little tweak is needed in send_feedback(), but it seems to
> > work..
>
> Thanks for replying! I tested your suggestion, but it did not work well...
>
> I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
> Apply workers that are delaying the apply send begin_data.final_lsn as recvpos
> and flushpos in send_feedback().

Maybe I misunderstood what you said... I have also found that sentPtr is not the actual sent position but the next WAL location to send. You can see the comment below.

```
/*
 * How far have we sent WAL already? This is also advertised in
 * MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
 */
static XLogRecPtr sentPtr = InvalidXLogRecPtr;
```

We must use end_lsn for crafting messages to fool the walsender, but that value is included in COMMIT, not in BEGIN, for the non-streaming case. And if workers are delayed in apply_handle_commit(), they will hold locks on database objects for a long time, which causes another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> In case of logical replication, however, we cannot support the use-case that
> switches the role publisher <-> subscriber. Suppose same case as above, additional
> transactions are committed while doing step2. To catch up such changes subscriber
> must receive WALs related with trans, but it cannot be done because subscriber
> cannot request WALs from the specific position. In the case, we must truncate all
> data in new subscriber once, and then create new subscription with copy_data
> = true.
>
> Therefore, I think that we can ignore the condition for shutting down the
> walsender in logical replication.
>

+1 for the idea.

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Dilip,

Thanks for checking my proposal!

> - * Note that if we determine that there's still more data to send, this
> - * function will return control to the caller.
> + * Note that if we determine that there's still more data to send or we are in
> + * the physical replication more, this function will return control to the
> + * caller.
>
> I think in this comment you meant to say
>
> 1. "or we are in physical replication mode and all WALs are not yet replicated"
> 2. Typo /replication more/replication mode

At first I considered 2, but I thought 1 seemed better. PSA the updated patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Thanks for checking my proposal!
>
> > - * Note that if we determine that there's still more data to send, this
> > - * function will return control to the caller.
> > + * Note that if we determine that there's still more data to send or we are in
> > + * the physical replication more, this function will return control to the
> > + * caller.
> >
> > I think in this comment you meant to say
> >
> > 1. "or we are in physical replication mode and all WALs are not yet replicated"
> > 2. Typo /replication more/replication mode
>
> At first I considered 2, but I thought 1 seemed better.
> PSA the updated patch.
>

I think even for logical replication we should check whether there is any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what is the point of sending the done message? Also, the caller of WalSndDone() already has that check, which is another reason why I can't see why you didn't have the same check in WalSndDone().

BTW, even after fixing this, I think logical replication will behave differently when, due to some reason (like time-delayed replication), the send buffer gets full and the walsender is not able to send data. I think this will be less of an issue with physical replication because there is a separate walreceiver process to flush the WAL, which doesn't wait, but the same is not true for logical replication. Do you have any thoughts on this matter?

--
With Regards,
Amit Kapila.
On Tue, Dec 27, 2022 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > At first I considered 2, but I thought 1 seemed better.
> > PSA the updated patch.
>
> I think even for logical replication we should check whether there is
> any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
> is the point of sending the done message? Also, the caller of
> WalSndDone() already has that check, which is another reason why I
> can't see why you didn't have the same check in WalSndDone().
>
> BTW, even after fixing this, I think logical replication will behave
> differently when, due to some reason (like time-delayed replication),
> the send buffer gets full and the walsender is not able to send data. I think
> this will be less of an issue with physical replication because there
> is a separate walreceiver process to flush the WAL, which doesn't wait,
> but the same is not true for logical replication. Do you have any
> thoughts on this matter?
>

In logical replication, it can happen today as well, without time-delayed replication. Basically, say the apply worker is waiting to acquire some lock that is already acquired by some backend; then it will show the same behavior. I have not verified this, so you may want to check it once.

--
With Regards,
Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit,

> > At first I considered 2, but I thought 1 seemed better.
> > PSA the updated patch.
>
> I think even for logical replication we should check whether there is
> any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
> is the point of sending the done message? Also, the caller of
> WalSndDone() already has that check, which is another reason why I
> can't see why you didn't have the same check in WalSndDone().

I do not have a strong opinion here. Fixed.

> BTW, even after fixing this, I think logical replication will behave
> differently when, due to some reason (like time-delayed replication),
> the send buffer gets full and the walsender is not able to send data. I think
> this will be less of an issue with physical replication because there
> is a separate walreceiver process to flush the WAL, which doesn't wait,
> but the same is not true for logical replication. Do you have any
> thoughts on this matter?

Yes, it may happen even after this work is done. And your analysis is correct: the receive buffer is rarely full in physical replication because the walreceiver immediately flushes the WAL.

I think this is an architectural problem. Maybe we have assumed that the decoded WAL is consumed in a short time. I do not have a good idea, but one approach is introducing a new process, a logical walreceiver. It would record the decoded WAL to persistent storage, and workers would consume and then remove it. That may have a huge impact on other features, though, and might not be acceptable...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit,

> In logical replication, it can happen today as well, without
> time-delayed replication. Basically, say the apply worker is waiting to
> acquire some lock that is already acquired by some backend; then it
> will show the same behavior. I have not verified this, so you may want
> to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher, but it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Dec 28, 2022 at 8:19 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Right, I could reproduce the scenario with the following steps.
>
> 1. Construct a pub -> sub logical replication system with streaming = off.
> 2. Define a table on both nodes.
>
> ```
> CREATE TABLE tbl (id int PRIMARY KEY);
> ```
>
> 3. Execute concurrent transactions.
>
> Tx-1 (on subscriber)
> BEGIN;
> INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);
>
> Tx-2 (on publisher)
> INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);
>
> 4. Try to shut down the publisher, but it will fail.
>
> ```
> $ pg_ctl stop -D publisher
> waiting for server to shut down............................................................... failed
> pg_ctl: server does not shut down
> ```

Thanks for the verification. BTW, do you think we should document this either with time-delayed replication or otherwise, unless it is already documented?

Another thing we can investigate here is why we need to ensure that there is no pending send before shutdown.

--
With Regards,
Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit,

> Thanks for the verification. BTW, do you think we should document this
> either with time-delayed replication or otherwise, unless it is
> already documented?

I think this should be documented in the "Shutting Down the Server" section in runtime.sgml or in logical-replication.sgml, but I cannot find such a description. It will be included in the next version.

> Another thing we can investigate here is why we need to ensure that
> there is no pending send before shutdown.

I have not looked into it yet; I will continue next year. It seems that walsenders have been sending all data before shutting down since ea5516, e0b581 and 754baa. There were many threads related to streaming replication, so I could not pin down the specific message referenced in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design notes about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Dec 28, 2022 at 8:18 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit,
>
> > I think even for logical replication we should check whether there is
> > any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
> > is the point of sending the done message? Also, the caller of
> > WalSndDone() already has that check, which is another reason why I
> > can't see why you didn't have the same check in WalSndDone().
>
> I do not have a strong opinion here. Fixed.
>
> > BTW, even after fixing this, I think logical replication will behave
> > differently when, due to some reason (like time-delayed replication),
> > the send buffer gets full and the walsender is not able to send data. I think
> > this will be less of an issue with physical replication because there
> > is a separate walreceiver process to flush the WAL, which doesn't wait,
> > but the same is not true for logical replication. Do you have any
> > thoughts on this matter?
>
> Yes, it may happen even after this work is done. And your analysis is correct:
> the receive buffer is rarely full in physical replication because the walreceiver
> immediately flushes the WAL.
>

Okay, but what happens in the case of physical replication when synchronous_commit = remote_apply? In that case, won't it ensure that apply has also happened? If so, then shouldn't the time delay feature also cause a similar problem for physical replication as well?

--
With Regards,
Amit Kapila.
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
At Wed, 28 Dec 2022 09:15:41 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in
> > Another thing we can investigate here is why we need to ensure that
> > there is no pending send before shutdown.
>
> I have not looked into it yet; I will continue next year.
> It seems that walsenders have been sending all data before shutting down since ea5516,
> e0b581 and 754baa.
> There were many threads related to streaming replication, so I could not pin down
> the specific message referenced in the commit message of ea5516.
>
> I have also checked some wiki pages [1][2], but I could not find any design notes about it.
>
> [1]: https://wiki.postgresql.org/wiki/Streaming_Replication
> [2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

If I'm grasping the discussion here correctly, in my memory it is because physical replication needs all records that have been written on the primary to also be written on the standby for switchover to succeed. It is annoying that a normal shutdown occasionally leads to switchover failure. Thus WalSndDone explicitly waits for the remote flush/write regardless of the setting of synchronous_commit. Hence apply delay doesn't affect shutdown (AFAICS), and that is sufficient since all the records will be applied at the next startup.

In logical replication, apply precedes write and flush, so we have no indication whether a record is "replicated" to the standby other than the apply LSN. On the other hand, logical replication has no business with switchover, so that assurance is useless. Thus I think we can (practically) ignore apply_lsn at shutdown. It seems subtly irregular, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> Okay, but what happens in the case of physical replication when
> synchronous_commit = remote_apply? In that case, won't it ensure that
> apply has also happened? If so, then shouldn't the time delay feature
> also cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor synchronous_commit. In other words, AFAICS the walsender finishes without waiting for remote_apply. The unapplied records will be applied at the next startup.

I haven't confirmed that behavior for myself, though..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san, Amit,

> At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com>
> wrote in
> > Okay, but what happens in the case of physical replication when
> > synchronous_commit = remote_apply? In that case, won't it ensure that
> > apply has also happened? If so, then shouldn't the time delay feature
> > also cause a similar problem for physical replication as well?
>
> As written in another mail, WalSndDone doesn't honor
> synchronous_commit. In other words, AFAICS the walsender finishes without
> waiting for remote_apply. The unapplied records will be applied at the
> next startup.
>
> I haven't confirmed that behavior for myself, though..

If Amit meant the case where sent data is still pending in physical replication, then indeed the walsender cannot stop. But this is not related to synchronous_commit: it happens because the walsender must sweep all pending data before shutting down. We can reproduce the situation with:

1. Build a streaming replication system.
2. kill -STOP $walreceiver
3. Insert data into the primary server.
4. Try to stop the primary server.

If what you said was not related to pending data, the walsender can be stopped even if synchronous_commit = remote_apply. As Horiguchi-san said, no such condition is written in WalSndDone() [1]. I think the parameter synchronous_commit does not really affect the walsender process; it just defines when the backend returns the result to the client. I could check this with the following steps:

1. Built a streaming replication system. PSA the script to follow that.

Primary config:

```
synchronous_commit = 'remote_apply'
synchronous_standby_names = 'secondary'
```

Secondary config:

```
recovery_min_apply_delay = 1d
primary_conninfo = 'user=postgres port=$port_N1 application_name=secondary'
hot_standby = on
```

2. Inserted data into the primary. This waited for the remote apply.

psql -U postgres -p $port_primary -c "INSERT INTO tbl SELECT generate_series(1, 5000)"

3. Stopped the primary server from another terminal. It could be done. The terminal from step 2 said:

```
WARNING: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
connection to server was lost
```

[1]: https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L3121

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san,

> If I'm grasping the discussion here correctly, in my memory it is
> because physical replication needs all records that have been written on
> the primary to also be written on the standby for switchover to succeed.
> It is annoying that a normal shutdown occasionally leads to switchover
> failure. Thus WalSndDone explicitly waits for the remote flush/write
> regardless of the setting of synchronous_commit.

AFAIK the condition (sentPtr == replicatedPtr) seems to have been introduced for that purpose [1]. You meant to say that the condition (!pq_is_send_pending()) has the same motivation, right?

> Hence apply delay
> doesn't affect shutdown (AFAICS), and that is sufficient since all the
> records will be applied at the next startup.

I was not clear on the phrase "next startup", but I agree that we can shut down the walsender in the case of recovery_min_apply_delay > 0 and synchronous_commit = remote_apply. The startup process is not terminated even if the primary crashes, so I think the process will apply the transaction sooner or later.

> In logical replication, apply precedes write and flush, so we have no
> indication whether a record is "replicated" to the standby other than
> the apply LSN. On the other hand, logical replication has no business
> with switchover, so that assurance is useless. Thus I think
> we can (practically) ignore apply_lsn at shutdown. It seems subtly
> irregular, though.

Another consideration is that the condition (!pq_is_send_pending()) ensures that there are no pending messages, including other packets. Currently we force walsenders to clean up all messages before shutting down, even keepalives. I cannot think of any problems caused by this, so I kept the condition in the case of logical replication. I updated the patch accordingly.

Also, I found that the previous version did not work well in the case of streamed transactions. When a streamed transaction is committed on the publisher but the apply is delayed on the subscriber, the process sometimes waits until there is no pending write. This is done in ProcessPendingWrites(). I added another termination path in that function.

[1]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > In logical replication apply precedes write and flush, so we have no indication whether a record has been "replicated" to the standby other than the apply LSN. On the other hand, logical replication has no business with switchover, so that assurance is useless. Thus I think we can (practically) ignore apply_lsn at shutdown. It seems subtly irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures that there are no pending messages, including other packets. Currently we force walsenders to clean up all messages before shutting down, even keepalives. I have not found any problems caused by this, but I will keep the condition in the logical replication case.
>

Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when the walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) the walreceiver/apply process is stuck (not able to receive more WAL) due to locks or some other reason; or (b) a long time delay has been configured for applying the WAL (we don't yet have such a feature for logical replication, but a discussion of it is in progress).

Both reasons mostly apply to logical replication because there is no separate walreceiver process whose only job is to flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying, it can get stuck for a long time waiting for some heavy-weight lock to be released by another long-running transaction in a backend. Similarly, if the user has configured a large apply delay, the network buffer between the walsender and the receive/apply process can fill up.
The condition that makes the shutdown wait for all WAL to be sent has two parts: (a) it confirms that there is no pending WAL to be sent; and (b) it confirms that all the WAL sent has been flushed by the client. As per our understanding, both of these conditions exist to allow a clean switchover/failover, which is useful only for physical replication; logical replication doesn't provide such functionality. The proposed patch eliminates condition (b) for logical replication in the hope that this will allow the shutdown to complete in most cases. No specific reason has been discussed for keeping (a) for logical replication either.

Now, to proceed here we have the following options: (1) Fix (b) as proposed by the patch and document the risks related to (a); (2) Fix both (a) and (b); (3) Do nothing and document that users need to unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Amit, hackers,

> Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).

Thanks for summarizing. While analyzing the hangs, I noticed that there are two types of shutdown failure, which can be distinguished by their back traces (shown at the bottom).

Type i) The walsender reaches WalSndDone() but cannot satisfy its exit condition. All WAL has been sent to the subscriber but has not been flushed there; sentPtr is not the same as replicatedPtr. This hang can happen when the delayed transaction is small or streamed.

Type ii) The walsender never reaches WalSndDone() and is stuck in ProcessPendingWrites(). The send buffer became full while replicating a transaction, so pq_is_send_pending() keeps returning true and the walsender cannot break out of the loop. This hang can happen when the delayed transaction is large but not streamed.

If we choose modification (1), we can only fix type (i), because pending writes still cause the type (ii) failure. IIUC, if we want to shut down walsender processes even in case (ii), we must choose (2), and additional fixes are needed. Based on the above, I prefer modification (2) because it rescues more cases. Thoughts?

PSA the patch for it. It is almost the same as the previous version, but the comments are updated.
Appendix: The backtrace for type i)
```
#0 WalSndDone (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3111
#1 0x000000000087ed1d in WalSndLoop (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2525
#2 0x000000000087d40a in StartLogicalReplication (cmd=0x1f49030) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1320
#3 0x000000000087df29 in exec_replication_command (cmd_string=0x1f15498 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '4', streaming 'on', origin 'none', publication_names'\"pub\"')") at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1830
#4 0x000000000091b032 in PostgresMain (dbname=0x1f4c938 "postgres", username=0x1f4c918 "postgres") at ../../PostgreSQL-Source-Dev/src/backend/tcop/postgres.c:4561
#5 0x000000000085390b in BackendRun (port=0x1f3d0b0) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4437
#6 0x000000000085322c in BackendStartup (port=0x1f3d0b0) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:4165
#7 0x000000000084f7a2 in ServerLoop () at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1762
#8 0x000000000084f0a2 in PostmasterMain (argc=3, argv=0x1f0ff30) at ../../PostgreSQL-Source-Dev/src/backend/postmaster/postmaster.c:1452
#9 0x000000000074a4d6 in main (argc=3, argv=0x1f0ff30) at ../../PostgreSQL-Source-Dev/src/backend/main/main.c:200
```

The backtrace for type ii)
```
#0 ProcessPendingWrites () at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1438
#1 0x000000000087d635 in WalSndWriteData (ctx=0x1429ce8, lsn=22406440, xid=731, last_write=true) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:1405
#2 0x0000000000888420 in OutputPluginWrite (ctx=0x1429ce8, last_write=true) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:669
#3 0x00007f022dfe43a7 in pgoutput_change (ctx=0x1429ce8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8) at ../../PostgreSQL-Source-Dev/src/backend/replication/pgoutput/pgoutput.c:1491
#4 0x0000000000889125 in change_cb_wrapper (cache=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/logical.c:1077
#5 0x000000000089507c in ReorderBufferApplyChange (rb=0x142bcf8, txn=0x1457d40, relation=0x7f0245075268, change=0x1460ef8, streaming=false) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:1969
#6 0x0000000000895866 in ReorderBufferProcessTXN (rb=0x142bcf8, txn=0x1457d40, commit_lsn=23060624, snapshot_now=0x1440150, command_id=0, streaming=false) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2245
#7 0x0000000000896348 in ReorderBufferReplay (txn=0x1457d40, rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672, commit_time=727353664342177, origin_id=0, origin_lsn=0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2675
#8 0x00000000008963d0 in ReorderBufferCommit (rb=0x142bcf8, xid=731, commit_lsn=23060624, end_lsn=23060672, commit_time=727353664342177, origin_id=0, origin_lsn=0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/reorderbuffer.c:2699
#9 0x00000000008842c7 in DecodeCommit (ctx=0x1429ce8, buf=0x7ffcf03731a0, parsed=0x7ffcf0372fa0, xid=731, two_phase=false) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:682
#10 0x0000000000883667 in xact_decode (ctx=0x1429ce8, buf=0x7ffcf03731a0) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:216
#11 0x000000000088338b in LogicalDecodingProcessRecord (ctx=0x1429ce8, record=0x142a080) at ../../PostgreSQL-Source-Dev/src/backend/replication/logical/decode.c:119
#12 0x000000000087f8c7 in XLogSendLogical () at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:3060
#13 0x000000000087ec5a in WalSndLoop (send_data=0x87f825 <XLogSendLogical>) at ../../PostgreSQL-Source-Dev/src/backend/replication/walsender.c:2490
...
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).
>
> Both reasons mostly apply to logical replication because there is no separate walreceiver process whose job is to just flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying it can get stuck for a long time waiting for some heavy-weight lock to be released by some other long-running transaction by the backend.
>

While checking the commits and email discussions in this area, I came across an email [1] from Michael where something similar seems to have been discussed: whether the early shutdown of the walsender can prevent a switchover between publisher and subscriber. That part was never clearly answered in that email chain. It might be worth reading the entire discussion [2]. That discussion finally led to the following commit:

commit c6c333436491a292d56044ed6e167e2bdee015a2
Author: Andres Freund <andres@anarazel.de>
Date: Mon Jun 5 18:53:41 2017 -0700

    Prevent possibility of panics during shutdown checkpoint.

    When the checkpointer writes the shutdown checkpoint, it checks afterwards whether any WAL has been written since it started and throws a PANIC if so. At that point, only walsenders are still active, so one might think this could not happen, but walsenders can also generate WAL, for instance in BASE_BACKUP and logical decoding related commands (e.g. via hint bits). So they can trigger this panic if such a command is run while the shutdown checkpoint is being written.

    To fix this, divide the walsender shutdown into two phases. First, checkpointer, itself triggered by postmaster, sends a PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend is idle or runs an SQL query this causes the backend to shutdown, if logical replication is in progress all existing WAL records are processed followed by a shutdown.
    ...
    ...

Here, as mentioned in the commit, we are trying to ensure that before the checkpointer writes its shutdown WAL record, "if logical replication is in progress all existing WAL records are processed followed by a shutdown". I think even before this commit we tried to send the entire WAL before shutdown, but I am not completely sure. There was no discussion of what happens if the logical walreceiver/apply process is waiting on some heavy-weight lock and the network socket buffer is full, so that the walsender is not able to process its WAL. Is it okay for shutdown to fail in such a case, as it does now, or shall we somehow detect that and shut down the walsender, or should we just allow the logical walsender to always exit immediately as soon as the shutdown signal comes?

Note: I have added some of the people involved in the previous thread's [2] discussion in the hope that they can share their thoughts.

[1] - https://www.postgresql.org/message-id/CAB7nPqR3icaA%3DqMv_FuU8YVYH3KUrNMnq_OmCfkzxCHC4fox8w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHGQGwEsttg9P9LOOavoc9d6VB1zVmYgfBk%3DLjsk-UL9cEf-eA%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Fri, Jan 20, 2023 at 4:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).
> >
> > Both reasons mostly apply to logical replication because there is no separate walreceiver process whose job is to just flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying it can get stuck for a long time waiting for some heavy-weight lock to be released by some other long-running transaction by the backend.
> >
>
> While checking the commits and email discussions in this area, I came across the email [1] from Michael where something similar seems to have been discussed. Basically, whether the early shutdown of walsender can prevent a switchover between publisher and subscriber, but that part was never clearly answered in that email chain. It might be worth reading the entire discussion [2]. That discussion finally led to the following commit:

Right, in that thread the question was raised whether it makes sense for logical replication to send all WAL, but there was no conclusion on that.
But I think this patch is mainly about resolving the PANIC due to extra WAL being generated by the walsender during checkpoint processing. That's the reason the behavior of sending all the WAL is maintained; only the extra WAL generation is stopped (before the shutdown checkpoint can proceed) using this new state.

>
> commit c6c333436491a292d56044ed6e167e2bdee015a2
> Author: Andres Freund <andres@anarazel.de>
> Date: Mon Jun 5 18:53:41 2017 -0700
>
> Prevent possibility of panics during shutdown checkpoint.
>
> When the checkpointer writes the shutdown checkpoint, it checks afterwards whether any WAL has been written since it started and throws a PANIC if so. At that point, only walsenders are still active, so one might think this could not happen, but walsenders can also generate WAL, for instance in BASE_BACKUP and logical decoding related commands (e.g. via hint bits). So they can trigger this panic if such a command is run while the shutdown checkpoint is being written.
>
> To fix this, divide the walsender shutdown into two phases. First, checkpointer, itself triggered by postmaster, sends a PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend is idle or runs an SQL query this causes the backend to shutdown, if logical replication is in progress all existing WAL records are processed followed by a shutdown.
> ...
> ...
>
> Here, as mentioned in the commit, we are trying to ensure that before checkpoint writes its shutdown WAL record, we ensure that "if logical replication is in progress all existing WAL records are processed followed by a shutdown.". I think even before this commit, we try to send the entire WAL before shutdown but not completely sure.

Yes, I think there is no change in that behavior.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed the relation with the given commit, and I think I can keep my patch. What do you think?

# Abstract

* Some modifications are needed.
* We cannot roll back the shutdown if walsenders are stuck.
* We don't have a good way to detect such a stuck process.

# Discussion

Compared to physical replication, logical replication is more likely to get stuck. I think the risk should be avoided as much as possible by fixing the code. Then, if that leads to another failure mode, we can document the caution for users.

While shutting down the server, the checkpointer sends the SIGUSR1 signal to walsenders. This is done after the other processes have exited, so we cannot raise an ERROR and roll back the operation if the checkpointer recognizes at that point that a process is stuck.

We don't have any feature by which the postmaster can check whether this node is a publisher or not. So if we want to add a mechanism that checks the health of walsenders before shutting down, we must do that at the top of process_pm_shutdown_request() even when we are not using logical replication. I think that affects the basis of postgres largely, and in the first place PostgreSQL does not have a mechanism to check the health of a process. Therefore, I want to adopt the approach that the walsender itself exits immediately when it gets the signal.

## About patch

- Are the fixes correct?

In ProcessPendingWrites(), with my patch, the walsender calls WalSndDone() when it gets the SIGUSR1 signal. I think this is needed. From the patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)

 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
 			WalSndShutdown();
+
+		/* If a shutdown is requested, try to exit the process */
+		if (got_STOPPING)
+			WalSndDone(XLogSendLogical);
 	}

 	/* reactivate latch so WalSndLoop knows to continue */
```

Per my analysis, in the case of logical replication, walsenders exit via the following steps.
Note that a logical walsender does not receive the SIGUSR2 signal; it sets the flag by itself instead:

1. postmaster sends shutdown requests to the checkpointer
2. checkpointer sends SIGUSR1 to walsenders and waits
3. when walsenders receive SIGUSR1, they turn got_SIGUSR1 on
4. walsenders consume all WAL @XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical
6. walsenders recognize the flag is on, so they call WalSndDone() @WalSndLoop
7. proc_exit(0)
8. checkpointer writes the shutdown record
...

The type (i) hang I reported on -hackers[1] means that processes stop at step 6, and the type (ii) hang means that they stop at step 4. In step 4, got_SIGUSR2 is never set, so we must use the got_STOPPING flag.

[1]: https://www.postgresql.org/message-id/TYCPR01MB58701A47F35FED0A2B399662F5C49@TYCPR01MB5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).
> >
>
> While checking the commits and email discussions in this area, I came across the email [1] from Michael where something similar seems to have been discussed. Basically, whether the early shutdown of walsender can prevent a switchover between publisher and subscriber, but that part was never clearly answered in that email chain. It might be worth reading the entire discussion [2]. That discussion finally led to the following commit:
>
> commit c6c333436491a292d56044ed6e167e2bdee015a2
> Author: Andres Freund <andres@anarazel.de>
> Date: Mon Jun 5 18:53:41 2017 -0700
>
> Prevent possibility of panics during shutdown checkpoint.
>
> When the checkpointer writes the shutdown checkpoint, it checks afterwards whether any WAL has been written since it started and throws a PANIC if so. At that point, only walsenders are still active, so one might think this could not happen, but walsenders can also generate WAL, for instance in BASE_BACKUP and logical decoding related commands (e.g. via hint bits). So they can trigger this panic if such a command is run while the shutdown checkpoint is being written.
>
> To fix this, divide the walsender shutdown into two phases. First, checkpointer, itself triggered by postmaster, sends a PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend is idle or runs an SQL query this causes the backend to shutdown, if logical replication is in progress all existing WAL records are processed followed by a shutdown.
> ...
> ...
>
> Here, as mentioned in the commit, we are trying to ensure that before checkpoint writes its shutdown WAL record, we ensure that "if logical replication is in progress all existing WAL records are processed followed by a shutdown.". I think even before this commit, we try to send the entire WAL before shutdown but not completely sure. There was no discussion on what happens if the logical walreceiver/walapply process is waiting on some heavy-weight lock and the network socket buffer is full due to which walsender is not able to process its WAL. Is it okay for shutdown to fail in such a case as it is happening now, or shall we somehow detect that and shut down the walsender, or we just allow logical walsender to always exit immediately as soon as the shutdown signal came?

+1 to eliminate condition (b) for logical replication.

Regarding (a), as Amit mentioned before[1], I think we should check that pq_is_send_pending() is false. Otherwise, we will end up terminating the WAL stream without the done message, which will lead to the error message "ERROR: could not receive data from WAL stream: server closed the connection unexpectedly" on the subscriber even at a clean shutdown.
In a case where pq_is_send_pending() doesn't become false for a long time (e.g., the network socket buffer got full because the apply worker is waiting on a lock), I think users should unblock it by themselves. Or it might be practically better to shut down the subscriber first in the logical replication case, unlike the physical replication case. I've not studied the time-delayed logical replication patch yet, though.

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1%2BpD654%2BXnrPugYueh7Oh22EBGTr6dA_fS0%2BgPiHayG9A%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).
> > >
> > > Both reasons mostly apply to logical replication because there is no separate walreceiver process whose job is to just flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying it can get stuck for a long time waiting for some heavy-weight lock to be released by some other long-running transaction by the backend.
> > >
> ...
> ...
>
> +1 to eliminate condition (b) for logical replication.
>
> Regarding (a), as Amit mentioned before[1], I think we should check if pq_is_send_pending() is false.
>

Sorry, but your suggestion is not completely clear to me. Do you mean to say that for logical replication we shouldn't wait for all the WAL to be successfully replicated, but we should ensure that we inform the subscriber that XLOG streaming is done (by ensuring pq_is_send_pending() is false and by calling EndCommand and pq_flush())?

> Otherwise, we will end up terminating the WAL stream without the done message. Which will lead to an error message "ERROR: could not receive data from WAL stream: server closed the connection unexpectedly" on the subscriber even at a clean shutdown.
>

But will that be a problem? As per the docs of shutdown [1] (“Smart” mode disallows new connections, then waits for all existing clients to disconnect. If the server is in hot standby, recovery and streaming replication will be terminated once all clients have disconnected.), there is no such guarantee. I see that it is required for the switchover in physical replication to ensure that all the WAL is sent and replicated, but we don't need that for logical replication.

> In a case where pq_is_send_pending() doesn't become false for a long time (e.g., the network socket buffer got full due to the apply worker waiting on a lock), I think users should unblock it by themselves. Or it might be practically better to shut down the subscriber first in the logical replication case, unlike the physical replication case.
>

Yeah, will users like such a dependency? And what will they gain by doing so?

[1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

--
With Regards,
Amit Kapila.
Re: Exit walsender before confirming remote flush in logical replication
From: Kyotaro Horiguchi
At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Otherwise, we will end up terminating the WAL stream without the done message. Which will lead to an error message "ERROR: could not receive data from WAL stream: server closed the connection unexpectedly" on the subscriber even at a clean shutdown.
> >
>
> But will that be a problem? As per docs of shutdown [1] (“Smart” mode disallows new connections, then waits for all existing clients to disconnect. If the server is in hot standby, recovery and streaming replication will be terminated once all clients have disconnected.), there is no such guarantee. I see that it is required for the switchover in physical replication to ensure that all the WAL is sent and replicated but we don't need that for logical replication.

+1

Since the publisher is not aware of the apply delay (with this patch), the publisher is in fact gone before sending the EOS in that case. The error message correctly describes that situation.

> > In a case where pq_is_send_pending() doesn't become false for a long time, (e.g., the network socket buffer got full due to the apply worker waiting on a lock), I think users should unblock it by themselves. Or it might be practically better to shutdown the subscriber first in the logical replication case, unlike the physical replication case.
> >
>
> Yeah, will users like such a dependency? And what will they gain by doing so?

If PostgreSQL required that kind of special care at shutdown when facing trouble in order to keep replication consistency, that wouldn't be acceptable. The current time-delayed logical replication can be seen as a kind of intentional, continuous, large network lag in this respect, and I think consistency is guaranteed even in such cases.

On the other hand, I don't think most people care about the exact progress when facing such troubles, as long as replication consistency is maintained. IMHO that is also true for the logical-delayed-replication case.

> [1] - https://www.postgresql.org/docs/devel/app-pg-ctl.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jan 20, 2023 at 7:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for same is in progress).
> > > >
> > > > Both reasons mostly apply to logical replication because there is no separate walreceiver process whose job is to just flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying it can get stuck for a long time waiting for some heavy-weight lock to be released by some other long-running transaction by the backend.
> > > >
> ...
> ...
>
> > +1 to eliminate condition (b) for logical replication.
> >
> > Regarding (a), as Amit mentioned before[1], I think we should check if pq_is_send_pending() is false.
> >
>
> Sorry, but your suggestion is not completely clear to me. Do you mean to say that for logical replication, we shouldn't wait for all the WAL to be successfully replicated but we should ensure to inform the subscriber that XLOG streaming is done (by ensuring pq_is_send_pending() is false and by calling EndCommand, pq_flush())?

Yes.
> > > Otherwise, we will end up terminating > > the WAL stream without the done message. Which will lead to an error > > message "ERROR: could not receive data from WAL stream: server closed > > the connection unexpectedly" on the subscriber even at a clean > > shutdown. > > > > But will that be a problem? As per docs of shutdown [1] ( “Smart” mode > disallows new connections, then waits for all existing clients to > disconnect. If the server is in hot standby, recovery and streaming > replication will be terminated once all clients have disconnected.), > there is no such guarantee. In smart shutdown case, the walsender doesn't exit until it can flush the done message, no? > I see that it is required for the > switchover in physical replication to ensure that all the WAL is sent > and replicated but we don't need that for logical replication. It won't be a problem in practice in terms of logical replication. But I'm concerned that this error could confuse users. Is there any case where the client gets such an error at the smart shutdown? > > > In a case where pq_is_send_pending() doesn't become false > > for a long time, (e.g., the network socket buffer got full due to the > > apply worker waiting on a lock), I think users should unblock it by > > themselves. Or it might be practically better to shutdown the > > subscriber first in the logical replication case, unlike the physical > > replication case. > > > > Yeah, will users like such a dependency? And what will they gain by doing so? IIUC there is no difference between smart shutdown and fast shutdown in logical replication walsender, but reading the doc[1], it seems to me that in the smart shutdown mode, the server stops existing sessions normally. For example, If the client is psql that gets stuck for some reason and the network buffer gets full, the smart shutdown waits for a backend process to send all results to the client. 
I think the logical replication walsender should follow this behavior for consistency. One idea is to distinguish smart shutdown and fast shutdown also in the logical replication walsender so that we disconnect even without the done message in fast shutdown mode, but I'm not sure it's worthwhile. [1] https://www.postgresql.org/docs/devel/server-shutdown.html Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Feb 2, 2023 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Feb 1, 2023 at 6:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > In a case where pq_is_send_pending() doesn't become false > > > for a long time, (e.g., the network socket buffer got full due to the > > > apply worker waiting on a lock), I think users should unblock it by > > > themselves. Or it might be practically better to shutdown the > > > subscriber first in the logical replication case, unlike the physical > > > replication case. > > > > > > > Yeah, will users like such a dependency? And what will they gain by doing so? > > IIUC there is no difference between smart shutdown and fast shutdown > in logical replication walsender, but reading the doc[1], it seems to > me that in the smart shutdown mode, the server stops existing sessions > normally. For example, If the client is psql that gets stuck for some > reason and the network buffer gets full, the smart shutdown waits for > a backend process to send all results to the client. I think the > logical replication walsender should follow this behavior for > consistency. One idea is to distinguish smart shutdown and fast > shutdown also in logical replication walsender so that we disconnect > even without the done message in fast shutdown mode, but I'm not sure > it's worthwhile. > The main problem we want to solve here is to avoid shutdown failing in case walreceiver/applyworker is busy waiting for some lock or for some other reason as shown in the email [1]. I haven't tested it but if such a problem doesn't exist in smart shutdown mode then probably we can allow walsender to wait till all the data is sent. We can once investigate what it takes to introduce shutdown mode knowledge for logical walsender. OTOH, the docs for smart shutdown says "If the server is in hot standby, recovery and streaming replication will be terminated once all clients have disconnected." 
which to me indicates that it is okay to terminate logical replication connections even in smart mode. [1] - https://www.postgresql.org/message-id/TYAPR01MB58669CB06F6657ABCEFE6555F5F29%40TYAPR01MB5866.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Thu, Feb 2, 2023 at 10:04 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 1 Feb 2023 14:58:14 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Feb 1, 2023 at 2:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Otherwise, we will end up terminating > > > the WAL stream without the done message. Which will lead to an error > > > message "ERROR: could not receive data from WAL stream: server closed > > > the connection unexpectedly" on the subscriber even at a clean > > > shutdown. > > > > > > > But will that be a problem? As per docs of shutdown [1] ( “Smart” mode > > disallows new connections, then waits for all existing clients to > > disconnect. If the server is in hot standby, recovery and streaming > > replication will be terminated once all clients have disconnected.), > > there is no such guarantee. I see that it is required for the > > switchover in physical replication to ensure that all the WAL is sent > > and replicated but we don't need that for logical replication. > > +1 > > Since publisher is not aware of apply-delay (by this patch), as a > matter of fact publisher seems gone before sending EOS in that > case. The error message is correctly describing that situation. > This can happen even without the apply-delay patch - for example, when the apply process is waiting on some lock. -- With Regards, Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Sawada-san, > > IIUC there is no difference between smart shutdown and fast shutdown > > in logical replication walsender, but reading the doc[1], it seems to > > me that in the smart shutdown mode, the server stops existing sessions > > normally. For example, If the client is psql that gets stuck for some > > reason and the network buffer gets full, the smart shutdown waits for > > a backend process to send all results to the client. I think the > > logical replication walsender should follow this behavior for > > consistency. One idea is to distinguish smart shutdown and fast > > shutdown also in logical replication walsender so that we disconnect > > even without the done message in fast shutdown mode, but I'm not sure > > it's worthwhile. > > > > The main problem we want to solve here is to avoid shutdown failing in > case walreceiver/applyworker is busy waiting for some lock or for some > other reason as shown in the email [1]. I haven't tested it but if > such a problem doesn't exist in smart shutdown mode then probably we > can allow walsender to wait till all the data is sent. Based on the idea, I made a PoC patch that introduces shutdown-mode awareness to walsenders. PSA the 0002 patch; 0001 is unchanged from v5. When a logical walsender gets a shutdown request but its send buffer is full due to the delay, it will: * wait until all pending data has been sent to the subscriber if we are in smart shutdown mode * exit immediately if we are in fast shutdown mode Note that in both cases the walsender does not wait for the remote flush of WAL. To implement that, I added a new attribute to WalSndCtlData that indicates the shutdown status. It is normally zero, but the postmaster changes it when it receives a shutdown request. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Fri, Feb 3, 2023 at 5:38 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Amit, Sawada-san, > > > > IIUC there is no difference between smart shutdown and fast shutdown > > > in logical replication walsender, but reading the doc[1], it seems to > > > me that in the smart shutdown mode, the server stops existing sessions > > > normally. For example, If the client is psql that gets stuck for some > > > reason and the network buffer gets full, the smart shutdown waits for > > > a backend process to send all results to the client. I think the > > > logical replication walsender should follow this behavior for > > > consistency. One idea is to distinguish smart shutdown and fast > > > shutdown also in logical replication walsender so that we disconnect > > > even without the done message in fast shutdown mode, but I'm not sure > > > it's worthwhile. > > > > > > > The main problem we want to solve here is to avoid shutdown failing in > > case walreceiver/applyworker is busy waiting for some lock or for some > > other reason as shown in the email [1]. > > For this problem isn't using -t (timeout) avoid it? So, if there is pending WAL, users can always use the -t option to allow the shutdown to complete. Now, I agree that it is not very clear how much time to specify, but a user has some option to allow the shutdown to complete. I am not saying that teaching walsenders about shutdown modes is a completely bad idea, but it doesn't seem necessary for allowing shutdowns to complete. -- With Regards, Amit Kapila.
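As a concrete illustration of the -t workaround mentioned above (the timeout value here is only an example; pg_ctl's default wait timeout is 60 seconds):

```
# Give walsenders up to 120 seconds to finish sending pending WAL
# before pg_ctl reports the shutdown as failed.
pg_ctl stop -D "$PGDATA" -m fast -t 120
```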
Hi, On 2023-02-02 11:21:54 +0530, Amit Kapila wrote: > The main problem we want to solve here is to avoid shutdown failing in > case walreceiver/applyworker is busy waiting for some lock or for some > other reason as shown in the email [1]. Isn't handling this part of the job of wal_sender_timeout? I don't at all agree that it's ok to just stop replicating changes because we're blocked on network IO. The patch justifies this with: > Currently, at shutdown, walsender processes wait to send all pending data and > ensure the all data is flushed in remote node. This mechanism was added by > 985bd7 for supporting clean switch over, but such use-case cannot be supported > for logical replication. This commit remove the blocking in the case. and at the start of the thread with: > In case of logical replication, however, we cannot support the use-case that > switches the role publisher <-> subscriber. Suppose same case as above, additional > transactions are committed while doing step2. To catch up such changes subscriber > must receive WALs related with trans, but it cannot be done because subscriber > cannot request WALs from the specific position. In the case, we must truncate all > data in new subscriber once, and then create new subscription with copy_data > = true. But that seems too narrow a view to me. Imagine you want to decommission the current primary, and instead start to use the logical standby as the primary. For that you'd obviously want to replicate the last few changes. But with the proposed change, that'd be hard to ever achieve. Note that even disallowing any writes on the logical primary would make it hard to be sure that everything is replicated, because autovacuum, bgwriter, checkpointer all can continue to write WAL. Without being able to check that the last LSN has indeed been sent out, how do you know that you didn't miss something? Greetings, Andres Freund
On Sat, Feb 4, 2023 at 6:31 PM Andres Freund <andres@anarazel.de> wrote: > > On 2023-02-02 11:21:54 +0530, Amit Kapila wrote: > > The main problem we want to solve here is to avoid shutdown failing in > > case walreceiver/applyworker is busy waiting for some lock or for some > > other reason as shown in the email [1]. > > Isn't handling this part of the job of wal_sender_timeout? > In some cases, it is not clear whether we can handle it by wal_sender_timeout. Consider a case of a time-delayed replica where the applyworker will keep sending some response/alive message so that walsender doesn't timeout in that (during delay) period. In that case, because walsender won't timeout, the shutdown will fail (with the failed message) even though it will be complete after the walsender is able to send all the WAL and shutdown. The time-delayed replica patch is still under discussion [1]. Also, for large values of wal_sender_timeout, it will wait till the walsender times out and can return with a failed message. > > I don't at all agree that it's ok to just stop replicating changes > because we're blocked on network IO. The patch justifies this with: > > > Currently, at shutdown, walsender processes wait to send all pending data and > > ensure the all data is flushed in remote node. This mechanism was added by > > 985bd7 for supporting clean switch over, but such use-case cannot be supported > > for logical replication. This commit remove the blocking in the case. > > and at the start of the thread with: > > > In case of logical replication, however, we cannot support the use-case that > > switches the role publisher <-> subscriber. Suppose same case as above, additional > > transactions are committed while doing step2. To catch up such changes subscriber > > must receive WALs related with trans, but it cannot be done because subscriber > > cannot request WALs from the specific position. 
In the case, we must truncate all > > data in new subscriber once, and then create new subscription with copy_data > > = true. > > But that seems a too narrow view to me. Imagine you want to decomission > the current primary, and instead start to use the logical standby as the > primary. For that you'd obviously want to replicate the last few > changes. But with the proposed change, that'd be hard to ever achieve. > I think that can still be achieved with the idea being discussed which is to keep allowing sending the WAL for smart shutdown mode but not for other modes(fast, immediate). I don't know whether it is a good idea or not but Kuroda-San has produced a POC patch for it. We can instead choose to improve our docs related to shutdown to explain a bit more about the shutdown's interaction with (logical and physical) replication. As of now, it says: (“Smart” mode disallows new connections, then waits for all existing clients to disconnect. If the server is in hot standby, recovery and streaming replication will be terminated once all clients have disconnected.)[2]. Here, it is not clear that shutdown will wait for sending and flushing all the WALs. The information for fast and immediate modes is even lesser which makes it more difficult to understand what kind of behavior is expected in those modes. [1] - https://commitfest.postgresql.org/42/3581/ [2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html -- With Regards, Amit Kapila.
Hi, On February 5, 2023 8:29:19 PM PST, Amit Kapila <amit.kapila16@gmail.com> wrote: >On Sat, Feb 4, 2023 at 6:31 PM Andres Freund <andres@anarazel.de> wrote: >> >> On 2023-02-02 11:21:54 +0530, Amit Kapila wrote: >> > The main problem we want to solve here is to avoid shutdown failing in >> > case walreceiver/applyworker is busy waiting for some lock or for some >> > other reason as shown in the email [1]. >> >> Isn't handling this part of the job of wal_sender_timeout? >> > >In some cases, it is not clear whether we can handle it by >wal_sender_timeout. Consider a case of a time-delayed replica where >the applyworker will keep sending some response/alive message so that >walsender doesn't timeout in that (during delay) period. In that case, >because walsender won't timeout, the shutdown will fail (with the >failed message) even though it will be complete after the walsender is >able to send all the WAL and shutdown. The time-delayed replica patch >is still under discussion [1]. Also, for large values of >wal_sender_timeout, it will wait till the walsender times out and can >return with a failed message. > >> >> I don't at all agree that it's ok to just stop replicating changes >> because we're blocked on network IO. The patch justifies this with: >> >> > Currently, at shutdown, walsender processes wait to send all pending data and >> > ensure the all data is flushed in remote node. This mechanism was added by >> > 985bd7 for supporting clean switch over, but such use-case cannot be supported >> > for logical replication. This commit remove the blocking in the case. >> >> and at the start of the thread with: >> >> > In case of logical replication, however, we cannot support the use-case that >> > switches the role publisher <-> subscriber. Suppose same case as above, additional >> > transactions are committed while doing step2. 
To catch up such changes subscriber >> > must receive WALs related with trans, but it cannot be done because subscriber >> > cannot request WALs from the specific position. In the case, we must truncate all >> > data in new subscriber once, and then create new subscription with copy_data >> > = true. >> >> But that seems a too narrow view to me. Imagine you want to decomission >> the current primary, and instead start to use the logical standby as the >> primary. For that you'd obviously want to replicate the last few >> changes. But with the proposed change, that'd be hard to ever achieve. >> > >I think that can still be achieved with the idea being discussed which >is to keep allowing sending the WAL for smart shutdown mode but not >for other modes(fast, immediate). I don't know whether it is a good >idea or not but Kuroda-San has produced a POC patch for it. We can >instead choose to improve our docs related to shutdown to explain a >bit more about the shutdown's interaction with (logical and physical) >replication. As of now, it says: (“Smart” mode disallows new >connections, then waits for all existing clients to disconnect. If the >server is in hot standby, recovery and streaming replication will be >terminated once all clients have disconnected.)[2]. Here, it is not >clear that shutdown will wait for sending and flushing all the WALs. >The information for fast and immediate modes is even lesser which >makes it more difficult to understand what kind of behavior is >expected in those modes. > >[1] - https://commitfest.postgresql.org/42/3581/ >[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html > Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote: > > On February 5, 2023 8:29:19 PM PST, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> But that seems a too narrow view to me. Imagine you want to decomission > >> the current primary, and instead start to use the logical standby as the > >> primary. For that you'd obviously want to replicate the last few > >> changes. But with the proposed change, that'd be hard to ever achieve. > >> > > > >I think that can still be achieved with the idea being discussed which > >is to keep allowing sending the WAL for smart shutdown mode but not > >for other modes(fast, immediate). I don't know whether it is a good > >idea or not but Kuroda-San has produced a POC patch for it. We can > >instead choose to improve our docs related to shutdown to explain a > >bit more about the shutdown's interaction with (logical and physical) > >replication. As of now, it says: (“Smart” mode disallows new > >connections, then waits for all existing clients to disconnect. If the > >server is in hot standby, recovery and streaming replication will be > >terminated once all clients have disconnected.)[2]. Here, it is not > >clear that shutdown will wait for sending and flushing all the WALs. > >The information for fast and immediate modes is even lesser which > >makes it more difficult to understand what kind of behavior is > >expected in those modes. > > > >[1] - https://commitfest.postgresql.org/42/3581/ > >[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html > > > > Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way. > So, we have the following options: (a) do nothing for this; (b) clarify the current behavior in docs. Any suggestions? -- With Regards, Amit Kapila.
Hi, On 2023-02-06 12:23:54 +0530, Amit Kapila wrote: > On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote: > > Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way. > > > > So, we have the following options: (a) do nothing for this; (b) > clarify the current behavior in docs. Any suggestions? b) seems good. I also think it'd make sense to improve this on a code-level. Just not in the wholesale way discussed so far. How about we make it an option in START_REPLICATION? Delayed logical rep can toggle that on by default. Greetings, Andres Freund
On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote: > > On 2023-02-06 12:23:54 +0530, Amit Kapila wrote: > > On Mon, Feb 6, 2023 at 10:33 AM Andres Freund <andres@anarazel.de> wrote: > > > Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way. > > > > > > > So, we have the following options: (a) do nothing for this; (b) > > clarify the current behavior in docs. Any suggestions? > > b) seems good. > > I also think it'd make sense to improve this on a code-level. Just not in the > wholesale way discussed so far. > > How about we make it an option in START_REPLICATION? Delayed logical rep can > toggle that on by default. > Works for me. So, when this option is set in START_REPLICATION message, walsender will set some flag and allow itself to exit at shutdown without waiting for WAL to be sent? -- With Regards, Amit Kapila.
Hi, On 2023-02-07 09:00:13 +0530, Amit Kapila wrote: > On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote: > > How about we make it an option in START_REPLICATION? Delayed logical rep can > > toggle that on by default. > Works for me. So, when this option is set in START_REPLICATION > message, walsender will set some flag and allow itself to exit at > shutdown without waiting for WAL to be sent? Yes. I think that might be useful in other situations as well, but we don't need to make those configurable initially. But I imagine it'd be useful to set things up so that non-HA physical replicas don't delay shutdown, particularly if they're geographically far away. Greetings, Andres Freund
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Andres, Amit, > On 2023-02-07 09:00:13 +0530, Amit Kapila wrote: > > On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote: > > > How about we make it an option in START_REPLICATION? Delayed logical > rep can > > > toggle that on by default. > > > Works for me. So, when this option is set in START_REPLICATION > > message, walsender will set some flag and allow itself to exit at > > shutdown without waiting for WAL to be sent? > > Yes. I think that might be useful in other situations as well, but we don't > need to make those configurable initially. But I imagine it'd be useful to set > things up so that non-HA physical replicas don't delay shutdown, particularly > if they're geographically far away. Based on the discussion, I made a patch that adds a walsender option, exit_before_confirming, to the START_REPLICATION replication command. It can be used for both physical and logical replication. I made the patch extensible - it allows adding further options. Better names are very welcome. For physical replication, the grammar was slightly changed to resemble the logical one: it can now accept options, but currently only one option is allowed, and it is not used in normal streaming replication. For logical replication, the option is combined with the options for the output plugin. Of course, we can restyle the API. The 0001 patch was ported from the time-delayed logical replication thread [1], which uses the added option. When the min_apply_delay option is specified and the publisher is PG16 or later, the apply worker sends a START_REPLICATION query with exit_before_confirming = true. The worker also restarts and sends START_REPLICATION again when min_apply_delay is changed from zero to a non-zero value or vice versa. Note that I reset the version number because the approach has completely changed. 
[1]: https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C600968EDDB9@TYCPR01MB8373.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
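Assuming the option is combined with the output plugin options as described in the mail, the apply worker's replication command might look roughly like this. This is a sketch only: proto_version and publication_names are the usual pgoutput options, the slot and publication names are made up, and the exact placement of the new option is defined by the patch, not shown here authoritatively.

```
START_REPLICATION SLOT "mysub" LOGICAL 0/0
    (exit_before_confirming 'true',
     proto_version '4', publication_names '"mypub"')
```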
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
> Dear Andres, Amit, > > > On 2023-02-07 09:00:13 +0530, Amit Kapila wrote: > > > On Tue, Feb 7, 2023 at 2:04 AM Andres Freund <andres@anarazel.de> wrote: > > > > How about we make it an option in START_REPLICATION? Delayed logical > > rep can > > > > toggle that on by default. > > > > > Works for me. So, when this option is set in START_REPLICATION > > > message, walsender will set some flag and allow itself to exit at > > > shutdown without waiting for WAL to be sent? > > > > Yes. I think that might be useful in other situations as well, but we don't > > need to make those configurable initially. But I imagine it'd be useful to set > > things up so that non-HA physical replicas don't delay shutdown, particularly > > if they're geographically far away. > > Based on the discussion, I made a patch for adding a walsender option > exit_before_confirming to the START_STREAMING replication command. It can > be > used for both physical and logical replication. I made the patch with > extendibility - it allows adding further options. > And better naming are very welcome. > > For physical replication, the grammar was slightly changed like a logical one. > It can now accept options but currently, only one option is allowed. And it is > not used in normal streaming replication. For logical replication, the option is > combined with options for the output plugin. Of course, we can modify the API to > better style. > > 0001 patch was ported from time-delayed logical replication thread[1], which > uses > the added option. When the min_apply_delay option is specified and publisher > seems > to be PG16 or later, the apply worker sends a START_REPLICATION query with > exit_before_confirming = true. And the worker will reboot and send > START_REPLICATION > again when min_apply_delay is changed from zero to a non-zero value or non-zero > to zero. > > Note that I removed version number because the approach is completely changed. 
> > [1]: > https://www.postgresql.org/message-id/TYCPR01MB8373BA483A6D2C924C60 > 0968EDDB9@TYCPR01MB8373.jpnprd01.prod.outlook.com I noticed that the previous ones were rejected by cfbot, even though they passed in my environment... PSA a fixed version. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
> I noticed that previous ones are rejected by cfbot, even if they passed on my > environment... > PSA fixed version. While analyzing further, I found another bug: a missing initialization. PSA a new version that passes the automated tests on my GitHub repository. Sorry for the noise. Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
I agree with the direction, and thanks for the patch. At Tue, 7 Feb 2023 17:08:54 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > I noticed that previous ones are rejected by cfbot, even if they passed on my > > environment... > > PSA fixed version. > > While analyzing more, I found the further bug that forgets initialization. > PSA new version that could be passed automated tests on my github repository. > Sorry for noise. 0002: This patch doesn't seem to offer a means to change the default walsender behavior. We need a subscription option named something like "walsender_exit_mode" to do that. +ConsumeWalsenderOptions(List *options, WalSndData *data) I wonder if it is the right design to put options for different things into a single list. I would rather embed the walsender option in the syntax than need this function. K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate". ====== If we go with the current design, I think it is better to share the option list rule between the logical and physical START_REPLICATION commands. I'm not sure I like the option syntax "exit_before_confirming=<Boolean>". I imagine that other options may come in the future. Thus, how about "walsender_shutdown_mode=<mode>", where the mode is one of "wait_flush"(default) and "immediate"? +typedef struct +{ + bool exit_before_confirming; +} WalSndData; "Data" doesn't seem to describe what the variable holds. Why not WalSndOptions? 
- !equal(newsub->publications, MySubscription->publications)) + !equal(newsub->publications, MySubscription->publications) || + (newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) || + (newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0)) I slightly prefer the following expression (Others may disagree:p): ((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0)) And I think we need a comment for the term. For example, /* minapplydelay affects START_REPLICATION option exit_before_confirming */ + * Reads all entrly of the list and consume if needed. s/entrly/entries/ ? ... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Feb 8, 2023 at 7:57 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > I agree to the direction and thanks for the patch. > > At Tue, 7 Feb 2023 17:08:54 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > > I noticed that previous ones are rejected by cfbot, even if they passed on my > > > environment... > > > PSA fixed version. > > > > While analyzing more, I found the further bug that forgets initialization. > > PSA new version that could be passed automated tests on my github repository. > > Sorry for noise. > > 0002: > > This patch doesn't seem to offer a means to change the default > walsender behavior. We need a subscription option named like > "walsender_exit_mode" to do that. > I don't think at this stage we need a subscription-level option, we can extend it later if this is really useful for users. For now, we can set this new option when min_apply_delay > 0. > > +ConsumeWalsenderOptions(List *options, WalSndData *data) > > I wonder if it is the right design to put options for different things > into a single list. I rather choose to embed the walsender option in > the syntax than needing this function. > > K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline opt_shutdown_mode > > K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR opt_shutdown_mode plugin_options > > where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate". > The other option could have been that we just add it as a plugin_option for logical replication but it doesn't seem to match with the other plugin options. I think it would be better to have it as a separate option something like opt_shutdown_immediate and extend the logical replication syntax for now. We can later extend physical replication syntax when we want to expose such an option via physical replication. -- With Regards, Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Amit, Thanks for the comments! > > > > 0002: > > > > This patch doesn't seem to offer a means to change the default > > walsender behavior. We need a subscription option named like > > "walsender_exit_mode" to do that. > > > > I don't think at this stage we need a subscription-level option, we > can extend it later if this is really useful for users. For now, we > can set this new option when min_apply_delay > 0. Agreed. I want to keep the feature minimal for PG16 and then extend it if needed. > > > > +ConsumeWalsenderOptions(List *options, WalSndData *data) > > > > I wonder if it is the right design to put options for different things > > into a single list. I rather choose to embed the walsender option in > > the syntax than needing this function. > > > > K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline > opt_shutdown_mode > > > > K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR > opt_shutdown_mode plugin_options > > > > where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate". > > > > The other option could have been that we just add it as a > plugin_option for logical replication but it doesn't seem to match > with the other plugin options. I think it would be better to have it > as a separate option something like opt_shutdown_immediate and extend > the logical replication syntax for now. We can later extend physical > replication syntax when we want to expose such an option via physical > replication. Our main intention is to shut down logical walsenders. Therefore, as above, I want to develop the feature for logical replication first and then try to extend it if we want. TBH, adding physical replication support does not seem too hard, but I want to keep the patch small. The new patch will be attached soon in another mail. Best Regards, Hayato Kuroda FUJITSU LIMITED
RE: Exit walsender before confirming remote flush in logical replication
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san, Thank you for checking the patch! PSA a new version. > 0002: > > This patch doesn't seem to offer a means to change the default > walsender behavior. We need a subscription option named like > "walsender_exit_mode" to do that. As I said in another mail [1], I think the feature does not have to be usable on its own for now. > +ConsumeWalsenderOptions(List *options, WalSndData *data) > > I wonder if it is the right design to put options for different things > into a single list. I rather choose to embed the walsender option in > the syntax than needing this function. > > K_START_REPLICATION opt_slot opt_physical RECPTR opt_timeline > opt_shutdown_mode > > K_START_REPLICATION K_SLOTIDENT K_LOGICAL RECPTR > opt_shutdown_mode plugin_options > > where opt_shutdown_mode would be like "SHUTDOWN_MODE immediate". Right, the option handling was quite bad. I added the new opt_shutdown_mode syntax to logical replication and modified much of the code accordingly. Note that, based on the other discussion, I removed the code for supporting physical replication but tried to keep it extensible. > ====== > If we go with the current design, I think it is better to share the > option list rule between the logical and physical START_REPLICATION > commands. > > I'm not sure I like the option syntax > "exit_before_confirming=<Boolean>". I imagine that other options may > come in the future. Thus, how about "walsender_shutdown_mode=<mode>", > where the mode is one of "wait_flush"(default) and "immediate"? Seems better; I changed it from a boolean to an enumeration. > +typedef struct > +{ > + bool exit_before_confirming; > +} WalSndData; > > Data doesn't seem to represent the variable. Why not WalSndOptions? This is inspired by PGOutputData, but I prefer your idea. Fixed. 
> -        !equal(newsub->publications, MySubscription->publications))
> +        !equal(newsub->publications, MySubscription->publications) ||
> +        (newsub->minapplydelay > 0 && MySubscription->minapplydelay == 0) ||
> +        (newsub->minapplydelay == 0 && MySubscription->minapplydelay > 0))
>
> I slightly prefer the following expression (Others may disagree:p):
>
> ((newsub->minapplydelay == 0) != (MySubscription->minapplydelay == 0))

I think conditions for the same parameter should be aligned on one line, so your suggested expression seems better. Fixed.

> And I think we need a comment for the term. For example,
>
> /* minapplydelay affects START_REPLICATION option exit_before_confirming */

Added just above the condition.

> + * Reads all entrly of the list and consume if needed.
> s/entrly/entries/ ?
> ...

This part is no longer needed.

[1]: https://www.postgresql.org/message-id/TYAPR01MB5866D3EC780D251953BDE7FAF5D89%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
> > Dear Horiguchi-san,
> >
> > Thank you for checking the patch! PSA new version.

PSA rebased patch that supports the updated time-delayed patch[1].

[1]: https://www.postgresql.org/message-id/TYAPR01MB5866C11DAF8AB04F3CC181D3F5D89@TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Exit walsender before confirming remote flush in logical replication
From: "Takamichi Osumi (Fujitsu)"
On Wednesday, February 8, 2023 6:47 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> PSA rebased patch that supports updated time-delayed patch[1].

Hi,

Thanks for creating the patch! Minor review comments on v5-0002.

(1)

+ Decides the condition for exiting the walsender process.
+ <literal>'wait_flush'</literal>, which is the default, the walsender
+ will wait for all the sent WALs to be flushed on the subscriber side,
+ before exiting the process. <literal>'immediate'</literal> will exit
+ without confirming the remote flush. This may break the consistency
+ between publisher and subscriber, but it may be useful for a system
+ that has a high-latency network to reduce the amount of time for
+ shutdown.

(1-1)

The first part "exiting the walsender process" can be improved. Probably, you can say "the exiting walsender process" or "Decides the behavior of the walsender process at shutdown" instead.

(1-2)

Also, the next sentence can be improved to something like "If the shutdown mode is wait_flush, which is the default, the walsender waits for all the sent WALs to be flushed on the subscriber side. If it is immediate, the walsender exits without confirming the remote flush".

(1-3)

We don't need to wrap wait_flush and immediate in single quotes within the literal tag.

(2)

+ /* minapplydelay affects SHUTDOWN_MODE option */

I think we can move this comment to just above the 'if' condition and combine it with the existing 'if' condition's comments.

(3) 001_rep_changes.pl

(3-1) Question

In general, do we add this kind of check when we extend the protocol (STREAM_REPLICATION command) or add a new condition for apply worker exit? In case we would like to detect the restart of the walsender process in TAP tests, could you tell me why the new test code matches the purpose of this patch?
(3-2)

+ "Timed out while waiting for apply to restart after changing min_apply_delay to non-zero value";

Probably, we can partly change this sentence as below, because we check the walsender's pid.

FROM: "... while waiting for apply to restart..."
TO: "... while waiting for the walsender to restart..."

Best Regards,
Takamichi Osumi
Hi,

On Wed, Feb 8, 2023 at 6:47 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> >
> > > Dear Horiguchi-san,
> > >
> > > Thank you for checking the patch! PSA new version.
> >
> > PSA rebased patch that supports updated time-delayed patch[1].
>

Thank you for the patch! Here are some comments on the v5 patch:

+/*
+ * Options for controlling the behavior of the walsender. Options can be
+ * specified in the START_STREAMING replication command. Currently only one
+ * option is allowed.
+ */
+typedef struct
+{
+    WalSndShutdownMode shutdown_mode;
+} WalSndOptions;
+
+static WalSndOptions *my_options = NULL;

I'm not sure we need to have it as a struct at this stage since we support only one option. I wonder if we can have one value, say shutdown_mode, and we can make it a struct when we really need it. Even if we use the WalSndOptions struct, I don't think we need to dynamically allocate it. Since a walsender can start logical replication multiple times in principle, my_options is not freed.

---
+/*
+ * Parse given shutdown mode.
+ *
+ * Currently two values are accepted - "wait_flush" and "immediate"
+ */
+static void
+ParseShutdownMode(char *shutdownmode)
+{
+    if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
+        my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
+    else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
+        my_options->shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
+    else
+        ereport(ERROR,
+                errcode(ERRCODE_SYNTAX_ERROR),
+                errmsg("SHUTDOWN_MODE requires \"wait_flush\" or \"immediate\""));
+}

I think we should make the error message consistent with other enum parameters. How about a message like:

ERROR: invalid value shutdown mode: "%s"

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Osumi-san,

Thank you for reviewing! PSA new version.

> (1)
>
> + Decides the condition for exiting the walsender process.
> + <literal>'wait_flush'</literal>, which is the default, the walsender
> + will wait for all the sent WALs to be flushed on the subscriber side,
> + before exiting the process. <literal>'immediate'</literal> will exit
> + without confirming the remote flush. This may break the consistency
> + between publisher and subscriber, but it may be useful for a system
> + that has a high-latency network to reduce the amount of time for
> + shutdown.
>
> (1-1)
>
> The first part "exiting the walsender process" can be improved.
> Probably, you can say "the exiting walsender process" or
> "Decides the behavior of the walsender process at shutdown" instread.

Fixed. The second idea was chosen.

> (1-2)
>
> Also, the next sentence can be improved something like
> "If the shutdown mode is wait_flush, which is the default, the
> walsender waits for all the sent WALs to be flushed on the subscriber side.
> If it is immediate, the walsender exits without confirming the remote flush".

Fixed.

> (1-3)
>
> We don't need to wrap wait_flush and immediate by single quotes
> within the literal tag.

This style was ported from the SNAPSHOT options part, so I decided to keep it.

> (2)
>
> + /* minapplydelay affects SHUTDOWN_MODE option */
>
> I think we can move this comment to just above the 'if' condition
> and combine it with the existing 'if' conditions comments.

Moved and added some comments.

> (3) 001_rep_changes.pl
>
> (3-1) Question
>
> In general, do we add this kind of check when we extend the protocol
> (STREAM_REPLICATION command)
> or add a new condition for apply worker exit ?
> In case when we would like to know the restart of the walsender process in TAP
> tests,
> then could you tell me why the new test code matches the purpose of this patch ?

The replication command is not for normal users, so I think we don't have to test it directly.
The check that waits for the apply worker to restart was added to improve robustness. The test could fail if the apply worker receives a transaction before it checks the new subscription option. Now the failure can be avoided by confirming that pg_subscription is reloaded and the worker restarts.

> (3-2)
>
> + "Timed out while waiting for apply to restart after changing min_apply_delay
> to non-zero value";
>
> Probably, we can partly change this sentence like below, because we check
> walsender's pid.
> FROM: "... while waiting for apply to restart..."
> TO: "... while waiting for the walsender to restart..."

Right, fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Sawada-san,

Thank you for reviewing!

> +/*
> + * Options for controlling the behavior of the walsender. Options can be
> + * specified in the START_STREAMING replication command. Currently only one
> + * option is allowed.
> + */
> +typedef struct
> +{
> +    WalSndShutdownMode shutdown_mode;
> +} WalSndOptions;
> +
> +static WalSndOptions *my_options = NULL;
>
> I'm not sure we need to have it as a struct at this stage since we
> support only one option. I wonder if we can have one value, say
> shutdown_mode, and we can make it a struct when we really need it.
> Even if we use WalSndOptions struct, I don't think we need to
> dynamically allocate it. Since a walsender can start logical
> replication multiple times in principle, my_options is not freed.

+1, removed the structure.

> ---
> +/*
> + * Parse given shutdown mode.
> + *
> + * Currently two values are accepted - "wait_flush" and "immediate"
> + */
> +static void
> +ParseShutdownMode(char *shutdownmode)
> +{
> +    if (pg_strcasecmp(shutdownmode, "wait_flush") == 0)
> +        shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH;
> +    else if (pg_strcasecmp(shutdownmode, "immediate") == 0)
> +        shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE;
> +    else
> +        ereport(ERROR,
> +                errcode(ERRCODE_SYNTAX_ERROR),
> +                errmsg("SHUTDOWN_MODE requires \"wait_flush\" or \"immediate\""));
> +}
>
> I think we should make the error message consistent with other enum
> parameters. How about the message like:
>
> ERROR: invalid value shutdown mode: "%s"

Modified to match other enum parameters, and a hint message was also provided. The new patch is attached in [1].

[1]: https://www.postgresql.org/message-id/TYAPR01MB586683FC450662990E356A0EF5D99%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Here are my review comments for the v6-0002 patch. ====== Commit Message 1. This commit extends START_REPLICATION to accept SHUTDOWN_MODE term. Currently, it works well only for logical replication. ~ 1a. "to accept SHUTDOWN term" --> "to include a SHUTDOWN_MODE clause." ~ 1b. "it works well only for..." --> do you mean "it is currently implemented only for..." ~~~ 2. When 'wait_flush', which is the default, is specified, the walsender will wait for all the sent WALs to be flushed on the subscriber side, before exiting the process. 'immediate' will exit without confirming the remote flush. This may break the consistency between publisher and subscriber, but it may be useful for a system that has a high-latency network to reduce the amount of time for shutdown. This may be useful to shut down the publisher even when the worker is stuck. ~ SUGGESTION The shutdown modes are: 1) 'wait_flush' (the default). In this mode, the walsender will wait for all the sent WALs to be flushed on the subscriber side, before exiting the process. 2) 'immediate'. In this mode, the walsender will exit without confirming the remote flush. This may break the consistency between publisher and subscriber. This mode might be useful for a system that has a high-latency network (to reduce the amount of time for shutdown), or to allow the shutdown of the publisher even when the worker is stuck. ====== doc/src/sgml/protocol.sgml 3. + <varlistentry> + <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' }</literal></term> + <listitem> + <para> + Decides the behavior of the walsender process at shutdown. If the + shutdown mode is <literal>'wait_flush'</literal>, which is the + default, the walsender waits for all the sent WALs to be flushed + on the subscriber side. If it is <literal>'immediate'</literal>, + the walsender exits without confirming the remote flush. 
+ </para>
+ </listitem>
+ </varlistentry>

The synopsis said:
[ SHUTDOWN_MODE shutdown_mode ]

But then the 'shutdown_mode' term was never mentioned again (??). Instead it says:
SHUTDOWN_MODE { 'wait_flush' | 'immediate' }

IMO the detailed explanation should not say SHUTDOWN_MODE again. It should be written more like this:

SUGGESTION
shutdown_mode

Determines the behavior of the walsender process at shutdown. If shutdown_mode is 'wait_flush', the walsender waits for all the sent WALs to be flushed on the subscriber side. This is the default when SHUTDOWN_MODE is not specified.

If shutdown_mode is 'immediate', the walsender exits without confirming the remote flush.

======
.../libpqwalreceiver/libpqwalreceiver.c

4.
+ /* Add SHUTDOWN_MODE option if needed */
+ if (options->shutdown_mode &&
+     PQserverVersion(conn->streamConn) >= 160000)
+     appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'",
+                      options->shutdown_mode);

Maybe you can expand on the meaning of "if needed".

SUGGESTION
Add SHUTDOWN_MODE clause if needed (i.e. if not using the default shutdown_mode)

======
src/backend/replication/logical/worker.c

5. maybe_reread_subscription

+ *
+ * minapplydelay affects SHUTDOWN_MODE option. 'immediate' shutdown mode
+ * will be specified if it is set to non-zero, otherwise default mode will
+ * be set.

Reworded this comment slightly and added a reference to ApplyWorkerMain.

SUGGESTION
Time-delayed logical replication affects the SHUTDOWN_MODE clause. The 'immediate' shutdown mode will be specified if min_apply_delay is non-zero, otherwise the default shutdown mode will be used. See ApplyWorkerMain.

~~~

6. ApplyWorkerMain

+ /*
+  * time-delayed logical replication does not support tablesync
+  * workers, so only the leader apply worker can request walsenders to
+  * exit before confirming remote flush.
+  */

"time-delayed" --> "Time-delayed"

======
src/backend/replication/repl_gram.y

7.
@@ -91,6 +92,7 @@ Node *replication_parse_result; %type <boolval> opt_temporary %type <list> create_slot_options create_slot_legacy_opt_list %type <defelt> create_slot_legacy_opt +%type <str> opt_shutdown_mode The tab alignment seemed not quite right. Not 100% sure. ~~~ 8. @@ -270,20 +272,22 @@ start_replication: cmd->slotname = $2; cmd->startpoint = $4; cmd->timeline = $5; + cmd->shutdownmode = NULL; $$ = (Node *) cmd; } It seemed a bit inconsistent. E.g. the cmd->options member was not set for physical replication, so why then set this member? Alternatively, maybe should set cmd->options = NULL here as well? ====== src/backend/replication/walsender.c 9. +/* Indicator for specifying the shutdown mode */ +typedef enum +{ + WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0, + WALSND_SHUTDOWN_MODE_IMMIDEATE +} WalSndShutdownMode; ~ 9a. "Indicator for specifying" (??). How about just saying: "Shutdown modes" ~ 9b. Typo: WALSND_SHUTDOWN_MODE_IMMIDEATE ==> WALSND_SHUTDOWN_MODE_IMMEDIATE ~ 9c. AFAICT the fact that the first enum value is assigned 0 is not really of importance. If that is correct, then IMO maybe it's better to remove the "= 0" because the explicit assignment made me expect that it had special meaning, and then it was confusing when I could not find a reason. ~~~ 10. ProcessPendingWrites + /* + * In this function, there is a possibility that the walsender is + * stuck. It is caused when the opposite worker is stuck and then the + * send-buffer of the walsender becomes full. Therefore, we must add + * an additional path for shutdown for immediate shutdown mode. + */ + if (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE && + got_STOPPING) + WalSndDone(XLogSendLogical); 10a. Can this comment say something like "receiving worker" instead of "opposite worker"? SUGGESTION This can happen when the receiving worker is stuck, and then the send-buffer of the walsender... ~ 10b. IMO it makes more sense to check this around the other way. E.g. 
we don't care what is the shutdown_mode value unless got_STOPPING is true. SUGGESTION if (got_STOPPING && (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMEDIATE)) ~~~ 11. WalSndDone + * If we are in the immediate shutdown mode, flush location and output + * buffer is not checked. This may break the consistency between nodes, + * but it may be useful for the system that has high-latency network to + * reduce the amount of time for shutdown. Add some quotes for the mode. SUGGESTION 'immediate' shutdown mode ~~~ 12. +/* + * Check options for walsender itself and set flags accordingly. + * + * Currently only one option is accepted. + */ +static void +CheckWalSndOptions(const StartReplicationCmd *cmd) +{ + if (cmd->shutdownmode) + ParseShutdownMode(cmd->shutdownmode); +} + +/* + * Parse given shutdown mode. + * + * Currently two values are accepted - "wait_flush" and "immediate" + */ +static void +ParseShutdownMode(char *shutdownmode) +{ + if (pg_strcasecmp(shutdownmode, "wait_flush") == 0) + shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH; + else if (pg_strcasecmp(shutdownmode, "immediate") == 0) + shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE; + else + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid value for shutdown mode: \"%s\"", shutdownmode), + errhint("Available values: wait_flush, immediate.")); +} IMO the ParseShutdownMode function seems unnecessary because it's not really "parsing" anything and it is only called in one place. I suggest wrapping everything into the CheckWalSndOptions function. 
The end result is still only a simple function: SUGGESTION static void CheckWalSndOptions(const StartReplicationCmd *cmd) { if (cmd->shutdownmode) { char *mode = cmd->shutdownmode; if (pg_strcasecmp(mode, "wait_flush") == 0) shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH; else if (pg_strcasecmp(mode, "immediate") == 0) shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE; else ereport(ERROR, errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid value for shutdown mode: \"%s\"", mode), errhint("Available values: wait_flush, immediate.")); } } ====== src/include/replication/walreceiver.h 13. @@ -170,6 +170,7 @@ typedef struct * false if physical stream. */ char *slotname; /* Name of the replication slot or NULL. */ XLogRecPtr startpoint; /* LSN of starting point. */ + char *shutdown_mode; /* Name of specified shutdown name */ union { ~ 13a. Typo (shutdown name?) SUGGESTION /* The specified shutdown mode string, or NULL. */ ~ 13b. Because they have the same member names I kept confusing this option shutdown_mode with the other enum also called shutdown_mode. I wonder if is it possible to call this one something like 'shutdown_mode_str' to make reading the code easier? ~ 13c. Is this member in the right place? AFAIK this is not even implemented for physical replication. e.g. Why isn't this new member part of the 'logical' sub-structure in the union? ====== src/test/subscription/t/001_rep_changes.pl 14. 
-# Set min_apply_delay parameter to 3 seconds
+# Check restart on changing min_apply_delay to 3 seconds
 my $delay = 3;
 $node_subscriber->safe_psql('postgres',
  "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay = '${delay}s')");
+$node_publisher->poll_query_until('postgres',
+ "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub_renamed' AND state = 'streaming';"
+ )
+ or die
+ "Timed out while waiting for the walsender to restart after changing min_apply_delay to non-zero value";

IIUC this test is for verifying that a new walsender worker was started if the delay was changed from 0 to non-zero. E.g. I think it is testing your new logic in maybe_reread_subscription.

Probably more complete testing also needs to check the other scenarios:
* min_apply_delay from one non-zero value to another non-zero value --> verify a new worker is NOT started.
* change min_apply_delay from non-zero to zero --> verify a new worker IS started

------
Kind Regards,
Peter Smith.
Fujitsu Australia
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Peter, Thank you for reviewing! PSA new version. > ====== > Commit Message > > 1. > This commit extends START_REPLICATION to accept SHUTDOWN_MODE term. > Currently, > it works well only for logical replication. > > ~ > > 1a. > "to accept SHUTDOWN term" --> "to include a SHUTDOWN_MODE clause." Fixed. > 1b. > "it works well only for..." --> do you mean "it is currently > implemented only for..." Fixed. > 2. > When 'wait_flush', which is the default, is specified, the walsender will wait > for all the sent WALs to be flushed on the subscriber side, before exiting the > process. 'immediate' will exit without confirming the remote flush. This may > break the consistency between publisher and subscriber, but it may be useful > for a system that has a high-latency network to reduce the amount of time for > shutdown. This may be useful to shut down the publisher even when the > worker is stuck. > > ~ > > SUGGESTION > The shutdown modes are: > > 1) 'wait_flush' (the default). In this mode, the walsender will wait > for all the sent WALs to be flushed on the subscriber side, before > exiting the process. > > 2) 'immediate'. In this mode, the walsender will exit without > confirming the remote flush. This may break the consistency between > publisher and subscriber. This mode might be useful for a system that > has a high-latency network (to reduce the amount of time for > shutdown), or to allow the shutdown of the publisher even when the > worker is stuck. > > ====== > doc/src/sgml/protocol.sgml > > 3. > + <varlistentry> > + <term><literal>SHUTDOWN_MODE { 'wait_flush' | 'immediate' > }</literal></term> > + <listitem> > + <para> > + Decides the behavior of the walsender process at shutdown. If the > + shutdown mode is <literal>'wait_flush'</literal>, which is the > + default, the walsender waits for all the sent WALs to be flushed > + on the subscriber side. If it is <literal>'immediate'</literal>, > + the walsender exits without confirming the remote flush. 
> + </para> > + </listitem> > + </varlistentry> > > The synopsis said: > [ SHUTDOWN_MODE shutdown_mode ] > > But then the 'shutdown_mode' term was never mentioned again (??). > Instead it says: > SHUTDOWN_MODE { 'wait_flush' | 'immediate' } > > IMO the detailed explanation should not say SHUTDOWN_MODE again. It > should be writtenmore like this: > > SUGGESTION > shutdown_mode > > Determines the behavior of the walsender process at shutdown. If > shutdown_mode is 'wait_flush', the walsender waits for all the sent > WALs to be flushed on the subscriber side. This is the default when > SHUTDOWN_MODE is not specified. > > If shutdown_mode is 'immediate', the walsender exits without > confirming the remote flush. Fixed. > .../libpqwalreceiver/libpqwalreceiver.c > > 4. > + /* Add SHUTDOWN_MODE option if needed */ > + if (options->shutdown_mode && > + PQserverVersion(conn->streamConn) >= 160000) > + appendStringInfo(&cmd, " SHUTDOWN_MODE '%s'", > + options->shutdown_mode); > > Maybe you can expand on the meaning of "if needed". > > SUGGESTION > Add SHUTDOWN_MODE clause if needed (i.e. if not using the default > shutdown_mode) Fixed, but not completely same as your suggestion. > src/backend/replication/logical/worker.c > > 5. maybe_reread_subscription > > + * > + * minapplydelay affects SHUTDOWN_MODE option. 'immediate' shutdown > mode > + * will be specified if it is set to non-zero, otherwise default mode will > + * be set. > > Reworded this comment slightly and give a reference to ApplyWorkerMain. > > SUGGESTION > Time-delayed logical replication affects the SHUTDOWN_MODE clause. The > 'immediate' shutdown mode will be specified if min_apply_delay is > non-zero, otherwise the default shutdown mode will be used. See > ApplyWorkerMain. Fixed. > 6. ApplyWorkerMain > + /* > + * time-delayed logical replication does not support tablesync > + * workers, so only the leader apply worker can request walsenders to > + * exit before confirming remote flush. 
> + */ > > "time-delayed" --> "Time-delayed" Fixed. > src/backend/replication/repl_gram.y > > 7. > @@ -91,6 +92,7 @@ Node *replication_parse_result; > %type <boolval> opt_temporary > %type <list> create_slot_options create_slot_legacy_opt_list > %type <defelt> create_slot_legacy_opt > +%type <str> opt_shutdown_mode > > The tab alignment seemed not quite right. Not 100% sure. Fixed accordingly. > 8. > @@ -270,20 +272,22 @@ start_replication: > cmd->slotname = $2; > cmd->startpoint = $4; > cmd->timeline = $5; > + cmd->shutdownmode = NULL; > $$ = (Node *) cmd; > } > > It seemed a bit inconsistent. E.g. the cmd->options member was not set > for physical replication, so why then set this member? > > Alternatively, maybe should set cmd->options = NULL here as well? Removed. I checked makeNode() macro, found that palloc0fast() is called there. This means that we do not have to initialize unused attributes. > src/backend/replication/walsender.c > > 9. > +/* Indicator for specifying the shutdown mode */ > +typedef enum > +{ > + WALSND_SHUTDOWN_MODE_WAIT_FLUSH = 0, > + WALSND_SHUTDOWN_MODE_IMMIDEATE > +} WalSndShutdownMode; > > ~ > > 9a. > "Indicator for specifying" (??). How about just saying: "Shutdown modes" Fixed. > 9b. > Typo: WALSND_SHUTDOWN_MODE_IMMIDEATE ==> > WALSND_SHUTDOWN_MODE_IMMEDIATE Replaced. > 9c. > AFAICT the fact that the first enum value is assigned 0 is not really > of importance. If that is correct, then IMO maybe it's better to > remove the "= 0" because the explicit assignment made me expect that > it had special meaning, and then it was confusing when I could not > find a reason. Removed. This was added for skipping the initialization for previous version, but no longer needed. > 10. ProcessPendingWrites > > + /* > + * In this function, there is a possibility that the walsender is > + * stuck. It is caused when the opposite worker is stuck and then the > + * send-buffer of the walsender becomes full. 
Therefore, we must add > + * an additional path for shutdown for immediate shutdown mode. > + */ > + if (shutdown_mode == WALSND_SHUTDOWN_MODE_IMMIDEATE && > + got_STOPPING) > + WalSndDone(XLogSendLogical); > > 10a. > Can this comment say something like "receiving worker" instead of > "opposite worker"? > > SUGGESTION > This can happen when the receiving worker is stuck, and then the > send-buffer of the walsender... Changed. > 10b. > IMO it makes more sense to check this around the other way. E.g. we > don't care what is the shutdown_mode value unless got_STOPPING is > true. > > SUGGESTION > if (got_STOPPING && (shutdown_mode == > WALSND_SHUTDOWN_MODE_IMMEDIATE)) Changed. > 11. WalSndDone > > + * If we are in the immediate shutdown mode, flush location and output > + * buffer is not checked. This may break the consistency between nodes, > + * but it may be useful for the system that has high-latency network to > + * reduce the amount of time for shutdown. > > Add some quotes for the mode. > > SUGGESTION > 'immediate' shutdown mode Changed. > 12. > +/* > + * Check options for walsender itself and set flags accordingly. > + * > + * Currently only one option is accepted. > + */ > +static void > +CheckWalSndOptions(const StartReplicationCmd *cmd) > +{ > + if (cmd->shutdownmode) > + ParseShutdownMode(cmd->shutdownmode); > +} > + > +/* > + * Parse given shutdown mode. 
> + * > + * Currently two values are accepted - "wait_flush" and "immediate" > + */ > +static void > +ParseShutdownMode(char *shutdownmode) > +{ > + if (pg_strcasecmp(shutdownmode, "wait_flush") == 0) > + shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH; > + else if (pg_strcasecmp(shutdownmode, "immediate") == 0) > + shutdown_mode = WALSND_SHUTDOWN_MODE_IMMIDEATE; > + else > + ereport(ERROR, > + errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("invalid value for shutdown mode: \"%s\"", shutdownmode), > + errhint("Available values: wait_flush, immediate.")); > +} > > IMO the ParseShutdownMode function seems unnecessary because it's not > really "parsing" anything and it is only called in one place. I > suggest wrapping everything into the CheckWalSndOptions function. The > end result is still only a simple function: > > SUGGESTION > > static void > CheckWalSndOptions(const StartReplicationCmd *cmd) > { > if (cmd->shutdownmode) > { > char *mode = cmd->shutdownmode; > > if (pg_strcasecmp(mode, "wait_flush") == 0) > shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH; > else if (pg_strcasecmp(mode, "immediate") == 0) > shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE; > > else > ereport(ERROR, > errcode(ERRCODE_SYNTAX_ERROR), > errmsg("invalid value for shutdown mode: \"%s\"", mode), > errhint("Available values: wait_flush, immediate.")); > } > } Removed. > ====== > src/include/replication/walreceiver.h > > 13. > @@ -170,6 +170,7 @@ typedef struct > * false if physical stream. */ > char *slotname; /* Name of the replication slot or NULL. */ > XLogRecPtr startpoint; /* LSN of starting point. */ > + char *shutdown_mode; /* Name of specified shutdown name */ > > union > { > ~ > > 13a. > Typo (shutdown name?) > > SUGGESTION > /* The specified shutdown mode string, or NULL. */ Fixed. > 13b. > Because they have the same member names I kept confusing this option > shutdown_mode with the other enum also called shutdown_mode. 
>
> I wonder if is it possible to call this one something like
> 'shutdown_mode_str' to make reading the code easier?

Changed.

> 13c.
> Is this member in the right place? AFAIK this is not even implemented
> for physical replication. e.g. Why isn't this new member part of the
> 'logical' sub-structure in the union?

I retained it for future extensibility, but it seemed not to be needed. Moved.

> ======
> src/test/subscription/t/001_rep_changes.pl
>
> 14.
> -# Set min_apply_delay parameter to 3 seconds
> +# Check restart on changing min_apply_delay to 3 seconds
> my $delay = 3;
> $node_subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION tap_sub_renamed SET (min_apply_delay =
> '${delay}s')");
> +$node_publisher->poll_query_until('postgres',
> + "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub_renamed' AND state = 'streaming';"
> + )
> + or die
> + "Timed out while waiting for the walsender to restart after
> changing min_apply_delay to non-zero value";
>
> IIUC this test is for verifying that a new walsender worker was
> started if the delay was changed from 0 to non-zero. E.g. I think it
> is for it is testing your new logic of the maybe_reread_subscription.
>
> Probably more complete testing also needs to check the other scenarios:
> * min_apply_delay from one non-zero value to another non-zero value
> --> verify a new worker is NOT started.
> * change min_apply_delay from non-zero to zero --> verify a new worker
> IS started

Hmm. These tests do not improve the coverage, so I'm not sure whether we should add them. Moreover, IIUC we do not have a good way to verify that the worker does not restart. Even if the old pid remains in pg_stat_replication, there is a possibility that the walsender exits after that. So currently I added only the case that changes min_apply_delay from non-zero to zero.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Feb 10, 2023 at 5:24 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>

Can't we have this option just as a bool (like shutdown_immediate)? Why do we want to keep multiple modes?

--
With Regards,
Amit Kapila.
RE: Exit walsender before confirming remote flush in logical replication
From: "Hayato Kuroda (Fujitsu)"
Dear Amit,

> Can't we have this option just as a bool (like shutdown_immediate)?
> Why do we want to keep multiple modes?

Of course we can use a boolean instead, but the current style is motivated by the post[1]. This allows adding another option in the future, though I do not have an idea for one now. I want to ask other reviewers which one is better...

[1]: https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Exit walsender before confirming remote flush in logical replication
From: Kyotaro Horiguchi
At Fri, 10 Feb 2023 12:40:43 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > Dear Amit, > > > Can't we have this option just as a bool (like shutdown_immediate)? > > Why do we want to keep multiple modes? > > Of course we can use boolean instead, but current style is motivated by the post[1]. > This allows to add another option in future, whereas I do not have idea now. > > I want to ask other reviewers which one is better... > > [1]: https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com IMHO I vaguely don't like that we lose a means to specify the default behavior here. And I'm not sure we definitely don't need anything other than flush and immediate for both physical and logical replication. If that's not the case, I don't object to making it a Boolean. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 10 Feb 2023 12:40:43 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in > > Dear Amit, > > > > > Can't we have this option just as a bool (like shutdown_immediate)? > > > Why do we want to keep multiple modes? > > > > Of course we can use boolean instead, but current style is motivated by the post[1]. > > This allows to add another option in future, whereas I do not have idea now. > > > > I want to ask other reviewers which one is better... > > > > [1]: https://www.postgresql.org/message-id/20230208.112717.1140830361804418505.horikyota.ntt%40gmail.com > > IMHO I vaguely don't like that we lose a means to specify the default > behavior here. And I'm not sure we definitely don't need other than > flush and immediate for both physical and logical replication. > If we can think of any use case that requires its extension, then it makes sense to make it a non-boolean option, but otherwise let's keep things simple by having a boolean option. > If it's > not the case, I don't object to make it a Boolean. > Thanks. -- With Regards, Amit Kapila.
Here are some comments for patch v7-0002. ====== Commit Message 1. This commit extends START_REPLICATION to accept SHUTDOWN_MODE clause. It is currently implemented only for logical replication. ~ "to accept SHUTDOWN_MODE clause." --> "to accept a SHUTDOWN_MODE clause." ====== doc/src/sgml/protocol.sgml 2. START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE { 'wait_flush' | 'immediate' } ] [ ( option_name [ option_value ] [, ...] ) ] ~ IMO this should say shutdown_mode as it did before: START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ SHUTDOWN_MODE shutdown_mode ] [ ( option_name [ option_value ] [, ...] ) ] ~~~ 3. + <varlistentry> + <term><literal>shutdown_mode</literal></term> + <listitem> + <para> + Determines the behavior of the walsender process at shutdown. If + shutdown_mode is <literal>'wait_flush'</literal>, the walsender waits + for all the sent WALs to be flushed on the subscriber side. This is + the default when SHUTDOWN_MODE is not specified. If shutdown_mode is + <literal>'immediate'</literal>, the walsender exits without + confirming the remote flush. + </para> + </listitem> + </varlistentry> Is the font of the "shutdown_mode" correct? I expected it to be like the others (e.g. slot_name) ====== src/backend/replication/walsender.c 4. +static void +CheckWalSndOptions(const StartReplicationCmd *cmd) +{ + if (cmd->shutdownmode) + { + char *mode = cmd->shutdownmode; + + if (pg_strcasecmp(mode, "wait_flush") == 0) + shutdown_mode = WALSND_SHUTDOWN_MODE_WAIT_FLUSH; + else if (pg_strcasecmp(mode, "immediate") == 0) + shutdown_mode = WALSND_SHUTDOWN_MODE_IMMEDIATE; + else + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid value for shutdown mode: \"%s\"", mode), + errhint("Available values: wait_flush, immediate.")); + } + +} Unnecessary extra whitespace at the end of the function. ====== src/include/nodes/replnodes.h 5. 
@@ -83,6 +83,7 @@ typedef struct StartReplicationCmd char *slotname; TimeLineID timeline; XLogRecPtr startpoint; + char *shutdownmode; List *options; } StartReplicationCmd; IMO those last 2 members should have a comment something like: /* Only for logical replication */ because that will make it clearer why they are sometimes assigned and sometimes not. ====== src/include/replication/walreceiver.h 6. Should the protocol version be bumped (and documented) now that START_REPLICATION supports a new extended syntax? Or is that done only for messages sent by pgoutput? ------ Kind Regards, Peter Smith. Fujitsu Australia.
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
At Mon, 13 Feb 2023 08:27:01 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Mon, Feb 13, 2023 at 7:26 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > IMHO I vaguely don't like that we lose a means to specify the default > > behavior here. And I'm not sure we definitely don't need other than > > flush and immedaite for both physical and logical replication. > > > > If we can think of any use case that requires its extension then it > makes sense to make it a non-boolean option but otherwise, let's keep > things simple by having a boolean option. What do you think about the need for explicitly specifying the default? I'm fine with specifying the default using a single word, such as WAIT_FOR_REMOTE_FLUSH. > > If it's > > not the case, I don't object to make it a Boolean. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote: > What do you think about the need for explicitly specifying the > default? I'm fine with specifying the default using a single word, > such as WAIT_FOR_REMOTE_FLUSH. We obviously shouldn't force the option to be present. Why would we want to break existing clients unnecessarily? Without it the behaviour should be unchanged from today's.
Re: Exit walsender before confirming remote flush in logical replication
From
Kyotaro Horiguchi
Date:
At Mon, 13 Feb 2023 17:13:43 -0800, Andres Freund <andres@anarazel.de> wrote in > On 2023-02-14 10:05:40 +0900, Kyotaro Horiguchi wrote: > > What do you think about the need for explicitly specifying the > > default? I'm fine with specifying the default using a single word, > > such as WAIT_FOR_REMOTE_FLUSH. > > We obviously shouldn't force the option to be present. Why would we want to > break existing clients unnecessarily? Without it the behaviour should be > unchanged from today's. I didn't suggest making the option mandatory. I just suggested providing a way to specify the default value explicitly, like in the recent commit 746915c686. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Exit walsender before confirming remote flush in logical replication
From
Greg Sabino Mullane
Date:
Thanks for everyone's work on this; I am very interested in it getting into a release. What is the status of this patch?
My use case is Patroni - when it needs to do a failover, it shuts down the primary. However, large transactions can cause it to stay in the "shutting down" state for a long time, which means your entire HA system is now non-functional. I like the idea of a new flag. I'll test this out soon if the original authors want to make a rebased patch. This thread is old, so if I don't hear back in a bit, I'll create and test a new one myself. :)
Cheers,
Greg