Thread: walsender.c fileheader comment
Hi, I was reading the walsender.c fileheader comment while studying another thread. I think if there is logical replication in progress then the PROCSIG_WALSND_INIT_STOPPING handler will *always* switch to a "stopping" state: e.g., /* * Handle PROCSIG_WALSND_INIT_STOPPING signal. */ void HandleWalSndInitStopping(void) { Assert(am_walsender); /* * If replication has not yet started, die like with SIGTERM. If * replication is active, only set a flag and wake up the main loop. It * will send any outstanding WAL, wait for it to be replicated to the * standby, and then exit gracefully. */ if (!replication_active) kill(MyProcPid, SIGTERM); else got_STOPPING = true; } ~~~ But the walsender.c fileheader comment seems to be saying something slightly different. IIUC, some minor rewording of the comment is needed so it describes the code better. HEAD ... * shutdown, if logical replication is in progress all existing WAL records * are processed followed by a shutdown. Otherwise this causes the walsender * to switch to the "stopping" state. In this state, the walsender will reject * any further replication commands. The checkpointer begins the shutdown ... SUGGESTION .. shutdown. If logical replication is in progress, the walsender switches to a "stopping" state. In this state, the walsender will reject any further replication commands - but all existing WAL records are processed - followed by a shutdown. ~~~ I attached a patch for the above-suggested change. Thoughts? ====== Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On 6/11/24 04:35, Peter Smith wrote: > Hi, > > I was reading the walsender.c fileheader comment while studying > another thread. I think if there is logical replication in progress > then the PROCSIG_WALSND_INIT_STOPPING handler will *always* switch to > a "stopping" state: e.g., > > /* > * Handle PROCSIG_WALSND_INIT_STOPPING signal. > */ > void > HandleWalSndInitStopping(void) > { > Assert(am_walsender); > > /* > * If replication has not yet started, die like with SIGTERM. If > * replication is active, only set a flag and wake up the main loop. It > * will send any outstanding WAL, wait for it to be replicated to the > * standby, and then exit gracefully. > */ > if (!replication_active) > kill(MyProcPid, SIGTERM); > else > got_STOPPING = true; > } > > ~~~ > > But the walsender.c fileheader comment seems to be saying something > slightly different. IIUC, some minor rewording of the comment is > needed so it describes the code better. > > HEAD > ... > * shutdown, if logical replication is in progress all existing WAL records > * are processed followed by a shutdown. Otherwise this causes the walsender > * to switch to the "stopping" state. In this state, the walsender will reject > * any further replication commands. The checkpointer begins the shutdown > ... > > SUGGESTION > .. shutdown. If logical replication is in progress, the walsender > switches to a "stopping" state. In this state, the walsender will > reject any further replication commands - but all existing WAL records > are processed - followed by a shutdown. > > ~~~ > > I attached a patch for the above-suggested change. > > Thoughts? I did look at this, and while the explanation in the current comment may seem a bit confusing, I'm not sure the suggested changes improve the situation very much. This suggests the two comments somehow disagree, but it does not say in what exactly, so perhaps I just missed it :-( ISTM there's a bit of confusion what is meant by "stopping" state - you seem to be interpreting it as a general concept, where the walsender is requested to stop (through the signal), and starts doing stuff to exit. But the comments actually talk about WalSnd->state, where "stopping" means it needs to be set to WALSNDSTATE_STOPPING. And we only ever switch to that state in two places - in WalSndPhysical and exec_replication_command. And that does not happen in regular logical replication (which is what "logical replication is in progress" refers to) - if you have a walsender just replicating DML, it will never see the WALSNDSTATE_STOPPING state. It will simply do the cleanup while still in WALSNDSTATE_STREAMING state, and then just exit. So from this point of view, the suggestion is actually wrong. To conclude, I think this probably makes the comments more confusing. If we want to make it clearer, I'd probably start by clarifying what the "stopping" state means. Also, it's a bit surprising we may not actually go through the "stopping" state during shutdown. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Thankyou for taking the time to look at this and reply. > > I did look at this, and while the explanation in the current comment may > seem a bit confusing, I'm not sure the suggested changes improve the > situation very much. > > This suggests the two comments somehow disagree, but it does not say in > what exactly, so perhaps I just missed it :-( > > ISTM there's a bit of confusion what is meant by "stopping" state - you > seem to be interpreting it as a general concept, where the walsender is > requested to stop (through the signal), and starts doing stuff to exit. > But the comments actually talk about WalSnd->state, where "stopping" > means it needs to be set to WALSNDSTATE_STOPPING. Yes, I interpreted the "stopping" state meaning as when the boolean flag 'got_STOPPING' is assigned true. > > And we only ever switch to that state in two places - in WalSndPhysical > and exec_replication_command. And that does not happen in regular > logical replication (which is what "logical replication is in progress" > refers to) - if you have a walsender just replicating DML, it will never > see the WALSNDSTATE_STOPPING state. It will simply do the cleanup while > still in WALSNDSTATE_STREAMING state, and then just exit. > > So from this point of view, the suggestion is actually wrong. OK. > > To conclude, I think this probably makes the comments more confusing. If > we want to make it clearer, I'd probably start by clarifying what the > "stopping" state means. Also, it's a bit surprising we may not actually > go through the "stopping" state during shutdown. > I agree. My interpretation of the (ambiguous) "stopping" state led me to believe the comment was quite wrong. So, this thread was only intended as a trivial comment fix in passing but clearly there is more to this than I anticipated. I would be happy if someone with more knowledge about the WALSNDSTATE_STOPPING versus got_STOPPING could disambiguate the file header comment, but that's not me, so I have withdrawn this from the Commitfest. ====== Kind Regards, Peter Smith. Fujitsu Australia
On 7/19/24 07:02, Peter Smith wrote: > ... >> >> To conclude, I think this probably makes the comments more confusing. If >> we want to make it clearer, I'd probably start by clarifying what the >> "stopping" state means. Also, it's a bit surprising we may not actually >> go through the "stopping" state during shutdown. >> > > I agree. My interpretation of the (ambiguous) "stopping" state led me > to believe the comment was quite wrong. So, this thread was only > intended as a trivial comment fix in passing but clearly there is more > to this than I anticipated. I would be happy if someone with more > knowledge about the WALSNDSTATE_STOPPING versus got_STOPPING could > disambiguate the file header comment, but that's not me, so I have > withdrawn this from the Commitfest. > Understood. Thanks for the patch anyway, I appreciate you took the time to try to improve the comments! I agree the state transitions in walsender are not very clear, and the fact that it may shutdown without ever going through STOPPING state is quite confusing. That being said, I personally don't have ambition to improve this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company