Thread: Can a child process detect postmaster death when in pg_usleep?
Hi, In my dev system(Ubuntu) when the postmaster is killed with SIGKILL, SIGPWR is being sent to its child processes (backends/any other bg process). If a child process is waiting with pg_usleep, it looks like it is not detecting the postmaster's death and it doesn't exit immediately but stays forever until it gets killed explicitly. For this experiment, I did 2 things to simulate the scenario i.e. a backend waiting in pg_usleep and killing the postmaster. 1) I wrote a wait function that uses pg_usleep and called it in a backend. This backend doesn't exit on postmaster death. 2) I set PostAuthDelay to 100 seconds and started the postmaster. Then, the "auotvacuum launcher" process still stays around (as it has pg_usleep in its main function), even after postmaster death. Questions: 1) Is it really harmful to use pg_usleep in a postmaster child process as it doesn't let the child process detect postmaster death? 2) Can pg_usleep() always detect signals? I see the caution in the pg_usleep function definition in pgsleep.c, saying the signal handling is platform dependent. We have code blocks like below in the code. Do we actually process interrupts before going to sleep with pg_usleep()? while/for loop { ...... ...... CHECK_FOR_INTERRUPTS(); pg_usleep(); } and if (PostAuthDelay) pg_usleep(); 3) Is it intentional to use pg_usleep in some places in the code? If yes, what are they? At least, I see one place where it's intentional in the wait_pid function which is used while running the regression tests. 4) Are there any places where we need to replace pg_usleep with WaitLatch/equivalent of pg_sleep to detect the postmaster death properly? Correct me if I'm missing something or if my observation/understanding of the pg_usleep() is wrong. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Hi Bharath, On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > 1) Is it really harmful to use pg_usleep in a postmaster child process > as it doesn't let the child process detect postmaster death? Yeah, that's a bad idea. Any long-term waiting (including short waits in a loop) should ideally be done with the latch infrastructure. One interesting and unusual case is recovery: it can run for a very long time without reaching a waiting primitive of any kind (other than LWLock, which doesn't count), because it can be busy applying records for hours at a time. In that case, we take special measures and explicitly check if the postmaster is dead in the redo loop. In theory, you could do the same in any loop containing pg_usleep() (we used to have several loops doing that, especially around replication code), but it'd be better to use the existing wait-event-multiplexing technology we have, and keep improving that. Some people have argued that long running queries should *also* react faster when the PM exits, a bit like recovery ... which leads to the next point... > 2) Can pg_usleep() always detect signals? I see the caution in the > pg_usleep function definition in pgsleep.c, saying the signal handling > is platform dependent. We have code blocks like below in the code. Do > we actually process interrupts before going to sleep with pg_usleep()? > while/for loop > { > ...... > ...... > CHECK_FOR_INTERRUPTS(); > pg_usleep(); > } > and > if (PostAuthDelay) > pg_usleep(); CHECK_FOR_INTERRUPTS() has nothing to do with postmaster death detection, currently, so that'd be for dealing with interrupts, not for that. Also, there would be a race: a signal on its own isn't enough on systems where we have them and where select() is guaranteed to wake up, because the signal might arrive between CFI() and pg_usleep(100 years). latch.c knows how to void such problems. There may be an argument that CFI() *should* be a potential postmaster-death-exit point, instead of having WaitLatch() (or its caller) handle it directly, but it's complicated. At the time the postmaster pipe system was invented we didn't have a signals for this so it wasn't even a candidate for treatment as an "interrupt". On systems that have postmaster death signals today (Linux + FreeBSD, but I suspect we can extend this to every Unix we support, see CF #3066, and a solution for Windows has been mentioned too), clearly the signal handler could set a new interrupt flag PostmasterLost + InterruptPending, and then CHECK_FOR_INTERRUPTS() could see it and exit. The argument against this is that exiting isn't always the right thing! In a couple of places, we do something special, such as printing a special error message (examples: sync rep and the main FEBE client read). Look for WL_POSTMASTER_DEATH (as opposed to WL_EXIT_ON_PM_DEATH). So I guess you'd need to reverse those decisions and standardise on "exit immediately, no message", or invented a way to suppress that behaviour in code regions. > 3) Is it intentional to use pg_usleep in some places in the code? If > yes, what are they? At least, I see one place where it's intentional > in the wait_pid function which is used while running the regression > tests. There are plenty of places that do a short sleep for various reasons, more like a deliberate stall or backoff or auth thing, and it's probably OK if they're shortish and not really a condition polling loop with an obvious latch/CV-based replacement. Note also that LWLock waits are similar. > 4) Are there any places where we need to replace pg_usleep with > WaitLatch/equivalent of pg_sleep to detect the postmaster death > properly? We definitely have replaced a lot of sleeps with latch.c primitives over the past few years, since we got WL_EXIT_ON_PM_DEATH and condition variables. There may be many more to improve... You mentioned autovacuum... yeah, Stephen fixed one of these with commit 4753ef37, but yeah it's not great to have those others in there...
On Thu, Apr 15, 2021 at 5:28 AM Thomas Munro <thomas.munro@gmail.com> wrote: Thanks a lot for the detailed explanation. > On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > 1) Is it really harmful to use pg_usleep in a postmaster child process > > as it doesn't let the child process detect postmaster death? > > Yeah, that's a bad idea. Any long-term waiting (including short waits > in a loop) should ideally be done with the latch infrastructure. Agree. Along with short waits in a loop, I think we also should replace pg_usleep with WaitLatch that has a user configurable parameter like below: pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L); pg_usleep(PostAuthDelay * 1000000L); pg_usleep(CommitDelay); > 4) Are there any places where we need to replace pg_usleep with > > WaitLatch/equivalent of pg_sleep to detect the postmaster death > > properly? > > We definitely have replaced a lot of sleeps with latch.c primitives > over the past few years, since we got WL_EXIT_ON_PM_DEATH and > condition variables. There may be many more to improve... You > mentioned autovacuum... yeah, Stephen fixed one of these with commit > 4753ef37, but yeah it's not great to have those others in there... I have not looked at the commit 4753ef37 previously, but it essentially addresses the problem with pg_usleep for vacuum delay. I'm thinking we can also replace pg_usleep in below places based on the fact that pg_usleep should be avoided in 1) short waits in a loop 2) when wait time is dependent on user configurable parameters. And using WaitLatch may require us to add wait event types to WaitEventTimeout enum, but that's okay. 1) pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L); in lazy_truncate_heap 2) pg_usleep(CommitDelay); in XLogFlush 3) pg_usleep(10000L); in CreateCheckPoint 4) pg_usleep(1000000L); in do_pg_stop_backup 5) pg_usleep(1000L); in read_local_xlog_page 6) pg_usleep(PostAuthDelay * 1000000L); in AutoVacLauncherMain, AutoVacWorkerMain, StartBackgroundWorker, InitPostgres 7) pg_usleep(100000L); in RequestCheckpoint 8) pg_usleep(1000000L); in pgarch_ArchiverCopyLoop 9) pg_usleep(PGSTAT_RETRY_DELAY * 1000L); in backend_read_statsfile 10) pg_usleep(PreAuthDelay * 1000000L); in BackendInitialize 11) pg_usleep(10000L); in WalSndWaitStopping 12) pg_usleep(standbyWait_us); in WaitExceedsMaxStandbyDelay 13) pg_usleep(10000L); in RegisterSyncRequest I'm sure we won't be changing in all of the above places. It will be good to review and correct the above list. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 15, 2021 at 11:48 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > We definitely have replaced a lot of sleeps with latch.c primitives > > over the past few years, since we got WL_EXIT_ON_PM_DEATH and > > condition variables. There may be many more to improve... You > > mentioned autovacuum... yeah, Stephen fixed one of these with commit > > 4753ef37, but yeah it's not great to have those others in there... > > I have not looked at the commit 4753ef37 previously, but it > essentially addresses the problem with pg_usleep for vacuum delay. I'm > thinking we can also replace pg_usleep in below places based on the > fact that pg_usleep should be avoided in 1) short waits in a loop 2) > when wait time is dependent on user configurable parameters. And using > WaitLatch may require us to add wait event types to WaitEventTimeout > enum, but that's okay. I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and Post Auth Delay. Regression tests pass with these patches. Please review them. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Apr 20, 2021 at 7:36 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Apr 15, 2021 at 11:48 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > We definitely have replaced a lot of sleeps with latch.c primitives > > > over the past few years, since we got WL_EXIT_ON_PM_DEATH and > > > condition variables. There may be many more to improve... You > > > mentioned autovacuum... yeah, Stephen fixed one of these with commit > > > 4753ef37, but yeah it's not great to have those others in there... > > > > I have not looked at the commit 4753ef37 previously, but it > > essentially addresses the problem with pg_usleep for vacuum delay. I'm > > thinking we can also replace pg_usleep in below places based on the > > fact that pg_usleep should be avoided in 1) short waits in a loop 2) > > when wait time is dependent on user configurable parameters. And using > > WaitLatch may require us to add wait event types to WaitEventTimeout > > enum, but that's okay. > > I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in > lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and > Post Auth Delay. Regression tests pass with these patches. Please > review them. I made a CF entry [1] so that it may get a chance for review. [1] https://commitfest.postgresql.org/33/3085/ With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed This patch looks fine. Tested on MacOS Catalina; master 09ae3299 The new status of this patch is: Ready for Committer
On Tue, Apr 20, 2021 at 07:36:39AM +0530, Bharath Rupireddy wrote: > I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in > lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and > Post Auth Delay. Regression tests pass with these patches. Please > review them. + if (backup_started_in_recovery) + latch = &XLogCtl->recoveryWakeupLatch; + else + latch = MyLatch; recoveryWakeupLatch is used by the startup process, but it has nothing to do with do_pg_stop_backup(). Why are you doing that? I can get behind the change for the truncation lock when finishing a VACUUM as that helps with monitoring. Now, I am not sure I get the point of changing anything for {post,pre}_auth_delay that are developer options. Please note that at this stage we don't know the backend activity in pg_stat_activity, so the use of wait events is not really interesting. On top of that, not reacting on signals can be interesting to keep as a behavior for developers? -- Michael
Attachment
On Thu, Jun 24, 2021 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Apr 20, 2021 at 07:36:39AM +0530, Bharath Rupireddy wrote: > > I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in > > lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and > > Post Auth Delay. Regression tests pass with these patches. Please > > review them. > > + if (backup_started_in_recovery) > + latch = &XLogCtl->recoveryWakeupLatch; > + else > + latch = MyLatch; > recoveryWakeupLatch is used by the startup process, but it has nothing > to do with do_pg_stop_backup(). Why are you doing that? The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL replay and recovery conflict, respectively. Actually, I was earlier using procLatch/MyLatch, but came across the commit 00f690a23 which says that the two latches are reserved for specific purposes. I'm not quite sure which one to use when do_pg_stop_backup is called by the startup process. Any thoughts? > I can get behind the change for the truncation lock when finishing a > VACUUM as that helps with monitoring. Thanks. Please let me know if there are any comments on v1-0001-Use-a-WaitLatch-for-lock-waiting-in-lazy_truncate.patch. > Now, I am not sure I get the > point of changing anything for {post,pre}_auth_delay that are > developer options. Please note that at this stage we don't know the > backend activity in pg_stat_activity, so the use of wait events is not > really interesting. Hm. I was earlier thinking from the perspective that the processes should be able to detect the postmaster death if the {post,pre}_auth_delay are set to higher values. Now, I agree that the auth delays are happening at the initial stages of the processes and if the developers(not common users) set the higher values for the GUCs, let them deal with the problem of the processes not detecting the postmaster death. > On top of that, not reacting on signals can be > interesting to keep as a behavior for developers? Yeah, it can be useful at times as it enables debugging even when the postmaster dies. With Regards, Bharath Rupireddy.
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes: > On Thu, Jun 24, 2021 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote: >> On top of that, not reacting on signals can be >> interesting to keep as a behavior for developers? > Yeah, it can be useful at times as it enables debugging even when the > postmaster dies. Dunno ... I cannot recall ever having had that as a debugging requirement in a couple of decades worth of PG bug-chasing. If the postmaster is dying, you generally want to deal with that before bothering with child processes. Moreover, child processes that don't go awy when the postmaster does are a very nasty problem, because they could screw up subsequent debugging work. regards, tom lane
On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote: > Dunno ... I cannot recall ever having had that as a debugging requirement > in a couple of decades worth of PG bug-chasing. If the postmaster is > dying, you generally want to deal with that before bothering with child > processes. Moreover, child processes that don't go awy when the > postmaster does are a very nasty problem, because they could screw up > subsequent debugging work. At the same time, nobody has really complained about this being an issue for developer options. I would tend to wait for more opinions before doing anything with the auth_delay GUCs. -- Michael
Attachment
On Mon, Jun 28, 2021 at 08:21:06PM +0530, Bharath Rupireddy wrote: > The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL > replay and recovery conflict, respectively. Actually, I was earlier > using procLatch/MyLatch, but came across the commit 00f690a23 which > says that the two latches are reserved for specific purposes. I'm not > quite sure which one to use when do_pg_stop_backup is called by the > startup process. Any thoughts? Could you explain why you think dp_pg_stop_backup() can be called by the startup process? AFAIK, this code path applies to two categories of sessions: - backend sessions, with the SQL functions calling this routine. - WAL senders, aka anything that connects with replication=1 able to use the BASE_BACKUP with the replication protocol. > Thanks. Please let me know if there are any comments on > v1-0001-Use-a-WaitLatch-for-lock-waiting-in-lazy_truncate.patch. Applied this one as that's clearly a win. The event name has been renamed to VacuumTruncate. -- Michael
Attachment
On Fri, Jul 2, 2021 at 9:53 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Jun 28, 2021 at 08:21:06PM +0530, Bharath Rupireddy wrote: > > The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL > > replay and recovery conflict, respectively. Actually, I was earlier > > using procLatch/MyLatch, but came across the commit 00f690a23 which > > says that the two latches are reserved for specific purposes. I'm not > > quite sure which one to use when do_pg_stop_backup is called by the > > startup process. Any thoughts? > > Could you explain why you think dp_pg_stop_backup() can be called by > the startup process? AFAIK, this code path applies to two categories > of sessions: > - backend sessions, with the SQL functions calling this routine. > - WAL senders, aka anything that connects with replication=1 able to > use the BASE_BACKUP with the replication protocol. My bad. I was talking about the cases when do_pg_stop_backup is called while the server is in recovery mode i.e. backup_started_in_recovery = RecoveryInProgress(); evaluates to true. I'm not sure in these cases whether we should replace pg_usleep with WaitLatch. If yes, whether we should use procLatch/MyLatch or recoveryWakeupLatch as they are currently serving different purposes. Regards, Bharath Rupireddy.
On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote: > My bad. I was talking about the cases when do_pg_stop_backup is called > while the server is in recovery mode i.e. backup_started_in_recovery = > RecoveryInProgress(); evaluates to true. I'm not sure in these cases > whether we should replace pg_usleep with WaitLatch. If yes, whether we > should use procLatch/MyLatch or recoveryWakeupLatch as they are > currently serving different purposes. It seems to me that you should re-read the description of recoveryWakeupLatch at the top of xlog.c and check for which purpose it exists, which is, in this case, to wake up the startup process to accelerate WAL replay. So do_pg_stop_backup() has no business with it. Switching pg_stop_backup() to use a latch rather than pg_usleep() has benefits: - It simplifies the wait event handling. - The process waiting for the last WAL segment to be archived will be more responsive on signals like SIGHUP and on postmaster death. These don't sound bad to me to apply here, so 0002 could be simplified as attached. -- Michael
Attachment
At Fri, 2 Jul 2021 10:27:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote: > > Dunno ... I cannot recall ever having had that as a debugging requirement > > in a couple of decades worth of PG bug-chasing. If the postmaster is > > dying, you generally want to deal with that before bothering with child > > processes. Moreover, child processes that don't go awy when the > > postmaster does are a very nasty problem, because they could screw up > > subsequent debugging work. > > At the same time, nobody has really complained about this being an > issue for developer options. I would tend to wait for more opinions > before doing anything with the auth_delay GUCs. I'm not sure the current behavior is especially useful for debugging, however, I don't think it is especially useful that children immediately respond to postmaster's death while the debug-delays, because anyway children don't respond while debugging (until the control (or code-pointer) reaches to the point of checking postmaster's death), and the delays must be very short even if someone abuses it on production systems. On the other hand, there could be a discussion as a convention that any user-definable sleep requires to respond to signals, maybe as Thomas mentioned. So, I don't object either way we will go. But if we don't change the behavior we instead would need a comment that explains the reason for the pg_usleep. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Greetings, * Michael Paquier (michael@paquier.xyz) wrote: > On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote: > > My bad. I was talking about the cases when do_pg_stop_backup is called > > while the server is in recovery mode i.e. backup_started_in_recovery = > > RecoveryInProgress(); evaluates to true. I'm not sure in these cases > > whether we should replace pg_usleep with WaitLatch. If yes, whether we > > should use procLatch/MyLatch or recoveryWakeupLatch as they are > > currently serving different purposes. > > It seems to me that you should re-read the description of > recoveryWakeupLatch at the top of xlog.c and check for which purpose > it exists, which is, in this case, to wake up the startup process to > accelerate WAL replay. So do_pg_stop_backup() has no business with > it. > > Switching pg_stop_backup() to use a latch rather than pg_usleep() has > benefits: > - It simplifies the wait event handling. > - The process waiting for the last WAL segment to be archived will be > more responsive on signals like SIGHUP and on postmaster death. Yes, agreed. > These don't sound bad to me to apply here, so 0002 could be simplified > as attached. Took a quick look and the patch looks good to me. In general, I agree with Tom's up-thread comment about children hanging around after postmaster death making things more difficult for debugging and just in general, so I'm in favor of trying to eliminate as many cases where that's happening as we reasonably can without impacting performance by checking too often. Thanks, Stephen
Attachment
On Mon, Jul 5, 2021 at 7:33 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote: > > My bad. I was talking about the cases when do_pg_stop_backup is called > > while the server is in recovery mode i.e. backup_started_in_recovery = > > RecoveryInProgress(); evaluates to true. I'm not sure in these cases > > whether we should replace pg_usleep with WaitLatch. If yes, whether we > > should use procLatch/MyLatch or recoveryWakeupLatch as they are > > currently serving different purposes. > > It seems to me that you should re-read the description of > recoveryWakeupLatch at the top of xlog.c and check for which purpose > it exists, which is, in this case, to wake up the startup process to > accelerate WAL replay. So do_pg_stop_backup() has no business with > it. Hm. The shared recoveryWakeupLatch is being owned by the startup process to wait and other backends/processes are using it to wake up the startup process. > Switching pg_stop_backup() to use a latch rather than pg_usleep() has > benefits: > - It simplifies the wait event handling. > - The process waiting for the last WAL segment to be archived will be > more responsive on signals like SIGHUP and on postmaster death. > > These don't sound bad to me to apply here, so 0002 could be simplified > as attached. The attached stop-backup-latch-v2.patch looks good to me. Regards, Bharath Rupireddy.
On Mon, Jul 5, 2021 at 9:25 PM Stephen Frost <sfrost@snowman.net> wrote: > In general, I agree with Tom's up-thread comment about children hanging > around after postmaster death making things more difficult for debugging > and just in general, so I'm in favor of trying to eliminate as many > cases where that's happening as we reasonably can without impacting > performance by checking too often. I agree. I'm attaching the patch that replaces pg_usleep with WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's latest patch stop-backup-latch-v2.patch, just for the sake of cfbot. Regards, Bharath Rupireddy.
Attachment
On Mon, Jul 05, 2021 at 09:42:29PM +0530, Bharath Rupireddy wrote: > I agree. I'm attaching the patch that replaces pg_usleep with > WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's > latest patch stop-backup-latch-v2.patch, just for the sake of cfbot. I don't object to the argument that switching to a latch for this code path could be good for responsiveness, but switching it is less attractive than the others as wait events are not available in pg_stat_activity at authentication startup. That's the case of normal backends and WAL senders, not the case of autovacuum workers using post_auth_delay if I read the code correctly. Anyway, it is worth noting that the patch as proposed breaks post_auth_delay. MyLatch is set when reaching WaitLatch() for post_auth_delay after loading the options, so the use of WL_LATCH_SET is not right. I think that this comes from SwitchToSharedLatch() in InitProcess(). And it does not seem quite right to me to just blindly reset the latch before doing the wait in this code path. Perhaps we could just use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH) to do the job. The one for pg_stop_backup() has been applied, no objections to that. -- Michael
Attachment
On Tue, Jul 6, 2021 at 6:15 AM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Jul 05, 2021 at 09:42:29PM +0530, Bharath Rupireddy wrote: > > I agree. I'm attaching the patch that replaces pg_usleep with > > WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's > > latest patch stop-backup-latch-v2.patch, just for the sake of cfbot. > > I don't object to the argument that switching to a latch for this code > path could be good for responsiveness, but switching it is less > attractive than the others as wait events are not available in > pg_stat_activity at authentication startup. That's the case of normal > backends and WAL senders, not the case of autovacuum workers using > post_auth_delay if I read the code correctly. We may not see anything in the pg_stat_activity for {post, pre}_auth_delay, but the processes can detect the postmaster death with WaitLatch. I think we should focus on that. > Anyway, it is worth noting that the patch as proposed breaks > post_auth_delay. MyLatch is set when reaching WaitLatch() for > post_auth_delay after loading the options, so the use of WL_LATCH_SET > is not right. I think that this comes from SwitchToSharedLatch() in > InitProcess(). And it does not seem quite right to me to just blindly > reset the latch before doing the wait in this code path. Perhaps we > could just use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH) to do the job. I'm sorry to say that I didn't get what was said above. We reset the latch after we come out of WaitLatch but not before going to wait. And the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch is set for that process because of other SetLatch events. Am I missing something here? Regards, Bharath Rupireddy.
On Tue, Jul 06, 2021 at 12:42:21PM +0530, Bharath Rupireddy wrote: > I'm sorry to say that I didn't get what was said above. We reset the > latch after we come out of WaitLatch but not before going to wait. And > the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch > is set for that process because of other SetLatch events. Am I missing > something here? Did you test the patch with post_auth_delay and a backend connection, making sure that the delay gets correctly applied? I did, and that was not working here. -- Michael
Attachment
On Tue, Jul 6, 2021 at 1:38 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 06, 2021 at 12:42:21PM +0530, Bharath Rupireddy wrote: > > I'm sorry to say that I didn't get what was said above. We reset the > > latch after we come out of WaitLatch but not before going to wait. And > > the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch > > is set for that process because of other SetLatch events. Am I missing > > something here? > > Did you test the patch with post_auth_delay and a backend connection, > making sure that the delay gets correctly applied? I did, and that > was not working here. Thanks. You are right. The issue is due to the MyLatch being set by SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH), then the backends will honour the post_auth_delay as well as detect the postmaster death. Since we are not using WL_LATCH_SET, I removed ResetLatch. Also, added some comments around why we are not using WL_LATCH_SET. For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch still points to the local latch(which is not set) in BackendInitialize(). PSA v2 patch. Regards, Bharath Rupireddy.
Attachment
On Tue, Jul 06, 2021 at 03:54:07PM +0530, Bharath Rupireddy wrote: > Thanks. You are right. The issue is due to the MyLatch being set by > SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT | > WL_EXIT_ON_PM_DEATH), then the backends will honour the > post_auth_delay as well as detect the postmaster death. Since we are > not using WL_LATCH_SET, I removed ResetLatch. Also, added some > comments around why we are not using WL_LATCH_SET. > > For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch > still points to the local latch(which is not set) in > BackendInitialize(). FWIW, I think that it could be a good idea to use the same set of flags for all the pre/post_auth_delay paths for consistency. That's useful when grepping for one. Please note that I don't plan to look more at this patch set for this CF as I am not really excited by the updates involving developer options, and I suspect more issues like the one I found upthread so this needs a close lookup. If somebody else wishes to look at it, please feel free, of course. -- Michael
Attachment
On Tue, Jul 6, 2021 at 4:33 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 06, 2021 at 03:54:07PM +0530, Bharath Rupireddy wrote: > > Thanks. You are right. The issue is due to the MyLatch being set by > > SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT | > > WL_EXIT_ON_PM_DEATH), then the backends will honour the > > post_auth_delay as well as detect the postmaster death. Since we are > > not using WL_LATCH_SET, I removed ResetLatch. Also, added some > > comments around why we are not using WL_LATCH_SET. > > > > For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch > > still points to the local latch(which is not set) in > > BackendInitialize(). > > FWIW, I think that it could be a good idea to use the same set of > flags for all the pre/post_auth_delay paths for consistency. That's > useful when grepping for one. Please note that I don't plan to look > more at this patch set for this CF as I am not really excited by the > updates involving developer options, and I suspect more issues like > the one I found upthread so this needs a close lookup. > > If somebody else wishes to look at it, please feel free, of course. Thanks. Anyways, I removed WL_LATCH_SET for PreAuthDelay as well. PSA v4 patch. Regards, Bharath Rupireddy.
Attachment
On Tue, Jul 06, 2021 at 05:07:04PM +0530, Bharath Rupireddy wrote: > Thanks. Anyways, I removed WL_LATCH_SET for PreAuthDelay as > well. PSA v4 patch. For the moment, please note that I have marked the patch as committed in the CF app. It may be better to start a new thread with the remaining bits for a separate evaluation. -- Michael