Thread: Can a child process detect postmaster death when in pg_usleep?

Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
Hi,

In my dev system(Ubuntu) when the postmaster is killed with SIGKILL,
SIGPWR is being sent to its child processes (backends/any other bg
process). If a child process is waiting with pg_usleep, it looks like
it is not detecting the postmaster's death and it doesn't exit
immediately but stays forever until it gets killed explicitly. For
this experiment, I did 2 things to simulate the scenario i.e. a
backend waiting in pg_usleep and killing the postmaster. 1) I wrote a
wait function that uses pg_usleep and called it in a backend. This
backend doesn't exit on postmaster death. 2)  I set PostAuthDelay to
100 seconds and started the postmaster. Then, the "auotvacuum
launcher" process still stays around (as it has pg_usleep in its main
function), even after postmaster death.

Questions:
1) Is it really harmful to use pg_usleep in a postmaster child process
as it doesn't let the child process detect postmaster death?

2) Can pg_usleep() always detect signals? I see the caution in the
pg_usleep function definition in pgsleep.c, saying the signal handling
is platform dependent. We have code blocks like below in the code. Do
we actually process interrupts before going to sleep with pg_usleep()?
while/for loop
{
......
......
 CHECK_FOR_INTERRUPTS();
 pg_usleep();
}
and
if (PostAuthDelay)
   pg_usleep();

3) Is it intentional to use pg_usleep in some places in the code? If
yes, what are they? At least, I see one place where it's intentional
in the wait_pid function which is used while running the regression
tests.

4) Are there any places where we need to replace pg_usleep with
WaitLatch/equivalent of pg_sleep to detect the postmaster death
properly?

Correct me if I'm missing something or if my observation/understanding
of the pg_usleep() is wrong.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Can a child process detect postmaster death when in pg_usleep?

From
Thomas Munro
Date:
Hi Bharath,

On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> 1) Is it really harmful to use pg_usleep in a postmaster child process
> as it doesn't let the child process detect postmaster death?

Yeah, that's a bad idea.  Any long-term waiting (including short waits
in a loop) should ideally be done with the latch infrastructure.

One interesting and unusual case is recovery: it can run for a very
long time without reaching a waiting primitive of any kind (other than
LWLock, which doesn't count), because it can be busy applying records
for hours at a time.  In that case, we take special measures and
explicitly check if the postmaster is dead in the redo loop.  In
theory, you could do the same in any loop containing pg_usleep() (we
used to have several loops doing that, especially around replication
code), but it'd be better to use the existing wait-event-multiplexing
technology we have, and keep improving that.

Some people have argued that long running queries should *also* react
faster when the PM exits, a bit like recovery ... which leads to the
next point...

> 2) Can pg_usleep() always detect signals? I see the caution in the
> pg_usleep function definition in pgsleep.c, saying the signal handling
> is platform dependent. We have code blocks like below in the code. Do
> we actually process interrupts before going to sleep with pg_usleep()?
> while/for loop
> {
> ......
> ......
>  CHECK_FOR_INTERRUPTS();
>  pg_usleep();
> }
> and
> if (PostAuthDelay)
>    pg_usleep();

CHECK_FOR_INTERRUPTS() has nothing to do with postmaster death
detection, currently, so that'd be for dealing with interrupts, not
for that.  Also, there would be a race: a signal on its own isn't
enough on systems where we have them and where select() is guaranteed
to wake up, because  the signal might arrive between CFI() and
pg_usleep(100 years).  latch.c knows how to void such problems.

There may be an argument that CFI() *should* be a potential
postmaster-death-exit point, instead of having WaitLatch() (or its
caller) handle it directly, but it's complicated.  At the time the
postmaster pipe system was invented we didn't have a signals for this
so it wasn't even a candidate for treatment as an "interrupt".  On
systems that have postmaster death signals today (Linux + FreeBSD, but
I suspect we can extend this to every Unix we support, see CF #3066,
and a solution for Windows has been mentioned too), clearly the signal
handler could set a new interrupt flag PostmasterLost +
InterruptPending, and then CHECK_FOR_INTERRUPTS() could see it and
exit.  The argument against this is that exiting isn't always the
right thing!  In a couple of places, we do something special, such as
printing a special error message (examples: sync rep and the main FEBE
client read).  Look for WL_POSTMASTER_DEATH (as opposed to
WL_EXIT_ON_PM_DEATH).  So I guess you'd need to reverse those
decisions and standardise on "exit immediately, no message", or
invented a way to suppress that behaviour in code regions.

> 3) Is it intentional to use pg_usleep in some places in the code? If
> yes, what are they? At least, I see one place where it's intentional
> in the wait_pid function which is used while running the regression
> tests.

There are plenty of places that do a short sleep for various reasons,
more like a deliberate stall or backoff or auth thing, and it's
probably OK if they're shortish and not really a condition polling
loop with an obvious latch/CV-based replacement.  Note also that
LWLock waits are similar.

> 4) Are there any places where we need to replace pg_usleep with
> WaitLatch/equivalent of pg_sleep to detect the postmaster death
> properly?

We definitely have replaced a lot of sleeps with latch.c primitives
over the past few years, since we got WL_EXIT_ON_PM_DEATH and
condition variables.  There may be many more to improve...  You
mentioned autovacuum... yeah, Stephen fixed one of these with commit
4753ef37, but yeah it's not great to have those others in there...



Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Thu, Apr 15, 2021 at 5:28 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks a lot for the detailed explanation.

> On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > 1) Is it really harmful to use pg_usleep in a postmaster child process
> > as it doesn't let the child process detect postmaster death?
>
> Yeah, that's a bad idea.  Any long-term waiting (including short waits
> in a loop) should ideally be done with the latch infrastructure.

Agree. Along with short waits in a loop, I think we also should
replace pg_usleep with WaitLatch that has a user configurable
parameter like below:

pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L);
pg_usleep(PostAuthDelay * 1000000L);
pg_usleep(CommitDelay);

> 4) Are there any places where we need to replace pg_usleep with
> > WaitLatch/equivalent of pg_sleep to detect the postmaster death
> > properly?
>
> We definitely have replaced a lot of sleeps with latch.c primitives
> over the past few years, since we got WL_EXIT_ON_PM_DEATH and
> condition variables.  There may be many more to improve...  You
> mentioned autovacuum... yeah, Stephen fixed one of these with commit
> 4753ef37, but yeah it's not great to have those others in there...

I have not looked at the commit 4753ef37 previously, but it
essentially addresses the problem with pg_usleep for vacuum delay. I'm
thinking we can also replace pg_usleep in below places based on the
fact that pg_usleep should be avoided in 1) short waits in a loop 2)
when wait time is dependent on user configurable parameters. And using
WaitLatch may require us to add wait event types to WaitEventTimeout
enum, but that's okay.

1) pg_usleep(VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL * 1000L); in lazy_truncate_heap
2) pg_usleep(CommitDelay); in XLogFlush
3) pg_usleep(10000L); in CreateCheckPoint
4) pg_usleep(1000000L); in do_pg_stop_backup
5) pg_usleep(1000L); in read_local_xlog_page
6) pg_usleep(PostAuthDelay * 1000000L); in AutoVacLauncherMain,
AutoVacWorkerMain, StartBackgroundWorker, InitPostgres
7) pg_usleep(100000L); in RequestCheckpoint
8) pg_usleep(1000000L); in pgarch_ArchiverCopyLoop
9) pg_usleep(PGSTAT_RETRY_DELAY * 1000L); in backend_read_statsfile
10) pg_usleep(PreAuthDelay * 1000000L); in BackendInitialize
11) pg_usleep(10000L); in WalSndWaitStopping
12) pg_usleep(standbyWait_us); in WaitExceedsMaxStandbyDelay
13) pg_usleep(10000L); in RegisterSyncRequest

I'm sure we won't be changing in all of the above places. It will be
good to review and correct the above list.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Thu, Apr 15, 2021 at 11:48 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > We definitely have replaced a lot of sleeps with latch.c primitives
> > over the past few years, since we got WL_EXIT_ON_PM_DEATH and
> > condition variables.  There may be many more to improve...  You
> > mentioned autovacuum... yeah, Stephen fixed one of these with commit
> > 4753ef37, but yeah it's not great to have those others in there...
>
> I have not looked at the commit 4753ef37 previously, but it
> essentially addresses the problem with pg_usleep for vacuum delay. I'm
> thinking we can also replace pg_usleep in below places based on the
> fact that pg_usleep should be avoided in 1) short waits in a loop 2)
> when wait time is dependent on user configurable parameters. And using
> WaitLatch may require us to add wait event types to WaitEventTimeout
> enum, but that's okay.

I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in
lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and
Post Auth Delay. Regression tests pass with these patches. Please
review them.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Tue, Apr 20, 2021 at 7:36 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Apr 15, 2021 at 11:48 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > We definitely have replaced a lot of sleeps with latch.c primitives
> > > over the past few years, since we got WL_EXIT_ON_PM_DEATH and
> > > condition variables.  There may be many more to improve...  You
> > > mentioned autovacuum... yeah, Stephen fixed one of these with commit
> > > 4753ef37, but yeah it's not great to have those others in there...
> >
> > I have not looked at the commit 4753ef37 previously, but it
> > essentially addresses the problem with pg_usleep for vacuum delay. I'm
> > thinking we can also replace pg_usleep in below places based on the
> > fact that pg_usleep should be avoided in 1) short waits in a loop 2)
> > when wait time is dependent on user configurable parameters. And using
> > WaitLatch may require us to add wait event types to WaitEventTimeout
> > enum, but that's okay.
>
> I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in
> lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and
> Post Auth Delay. Regression tests pass with these patches. Please
> review them.

I made a CF entry [1] so that it may get a chance for review.

[1] https://commitfest.postgresql.org/33/3085/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: Can a child process detect postmaster death when in pg_usleep?

From
Aleksander Alekseev
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

This patch looks fine. Tested on MacOS Catalina; master 09ae3299

The new status of this patch is: Ready for Committer

Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Tue, Apr 20, 2021 at 07:36:39AM +0530, Bharath Rupireddy wrote:
> I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in
> lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and
> Post Auth Delay. Regression tests pass with these patches. Please
> review them.

+       if (backup_started_in_recovery)
+           latch = &XLogCtl->recoveryWakeupLatch;
+       else
+           latch = MyLatch;
recoveryWakeupLatch is used by the startup process, but it has nothing
to do with do_pg_stop_backup().  Why are you doing that?

I can get behind the change for the truncation lock when finishing a
VACUUM as that helps with monitoring.  Now, I am not sure I get the
point of changing anything for {post,pre}_auth_delay that are
developer options.  Please note that at this stage we don't know the
backend activity in pg_stat_activity, so the use of wait events is not
really interesting.  On top of that, not reacting on signals can be
interesting to keep as a behavior for developers?
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Thu, Jun 24, 2021 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Apr 20, 2021 at 07:36:39AM +0530, Bharath Rupireddy wrote:
> > I'm attaching 3 patches that replace pg_usleep with WaitLatch: 0001 in
> > lazy_truncate_heap, 0002 in do_pg_stop_backup and 0003 for Pre and
> > Post Auth Delay. Regression tests pass with these patches. Please
> > review them.
>
> +       if (backup_started_in_recovery)
> +           latch = &XLogCtl->recoveryWakeupLatch;
> +       else
> +           latch = MyLatch;
> recoveryWakeupLatch is used by the startup process, but it has nothing
> to do with do_pg_stop_backup().  Why are you doing that?

The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL
replay and recovery conflict, respectively. Actually, I was earlier
using procLatch/MyLatch, but came across the commit 00f690a23 which
says that the two latches are reserved for specific purposes. I'm not
quite sure which one to use when do_pg_stop_backup is called by the
startup process. Any thoughts?

> I can get behind the change for the truncation lock when finishing a
> VACUUM as that helps with monitoring.

Thanks. Please let me know if there are any comments on
v1-0001-Use-a-WaitLatch-for-lock-waiting-in-lazy_truncate.patch.

> Now, I am not sure I get the
> point of changing anything for {post,pre}_auth_delay that are
> developer options.  Please note that at this stage we don't know the
> backend activity in pg_stat_activity, so the use of wait events is not
> really interesting.

Hm. I was earlier thinking from the perspective that the processes
should be able to detect the postmaster death if the
{post,pre}_auth_delay are set to higher values. Now, I agree that the
auth delays are happening at the initial stages of the processes and
if the developers(not common users) set the higher values for the
GUCs, let them deal with the problem of the processes not detecting
the postmaster death.

> On top of that, not reacting on signals can be
> interesting to keep as a behavior for developers?

Yeah, it can be useful at times as it enables debugging even when the
postmaster dies.

With Regards,
Bharath Rupireddy.



Re: Can a child process detect postmaster death when in pg_usleep?

From
Tom Lane
Date:
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> On Thu, Jun 24, 2021 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote:
>> On top of that, not reacting on signals can be
>> interesting to keep as a behavior for developers?

> Yeah, it can be useful at times as it enables debugging even when the
> postmaster dies.

Dunno ... I cannot recall ever having had that as a debugging requirement
in a couple of decades worth of PG bug-chasing.  If the postmaster is
dying, you generally want to deal with that before bothering with child
processes.  Moreover, child processes that don't go awy when the
postmaster does are a very nasty problem, because they could screw up
subsequent debugging work.

            regards, tom lane



Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote:
> Dunno ... I cannot recall ever having had that as a debugging requirement
> in a couple of decades worth of PG bug-chasing.  If the postmaster is
> dying, you generally want to deal with that before bothering with child
> processes.  Moreover, child processes that don't go awy when the
> postmaster does are a very nasty problem, because they could screw up
> subsequent debugging work.

At the same time, nobody has really complained about this being an
issue for developer options.  I would tend to wait for more opinions
before doing anything with the auth_delay GUCs.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Mon, Jun 28, 2021 at 08:21:06PM +0530, Bharath Rupireddy wrote:
> The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL
> replay and recovery conflict, respectively. Actually, I was earlier
> using procLatch/MyLatch, but came across the commit 00f690a23 which
> says that the two latches are reserved for specific purposes. I'm not
> quite sure which one to use when do_pg_stop_backup is called by the
> startup process. Any thoughts?

Could you explain why you think dp_pg_stop_backup() can be called by
the startup process?  AFAIK, this code path applies to two categories
of sessions:
- backend sessions, with the SQL functions calling this routine.
- WAL senders, aka anything that connects with replication=1 able to
use the BASE_BACKUP with the replication protocol.

> Thanks. Please let me know if there are any comments on
> v1-0001-Use-a-WaitLatch-for-lock-waiting-in-lazy_truncate.patch.

Applied this one as that's clearly a win.  The event name has been
renamed to VacuumTruncate.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Fri, Jul 2, 2021 at 9:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jun 28, 2021 at 08:21:06PM +0530, Bharath Rupireddy wrote:
> > The recoveryWakeupLatch and procLatch/MyLatch are being used for WAL
> > replay and recovery conflict, respectively. Actually, I was earlier
> > using procLatch/MyLatch, but came across the commit 00f690a23 which
> > says that the two latches are reserved for specific purposes. I'm not
> > quite sure which one to use when do_pg_stop_backup is called by the
> > startup process. Any thoughts?
>
> Could you explain why you think dp_pg_stop_backup() can be called by
> the startup process? AFAIK, this code path applies to two categories
> of sessions:
> - backend sessions, with the SQL functions calling this routine.
> - WAL senders, aka anything that connects with replication=1 able to
> use the BASE_BACKUP with the replication protocol.

My bad. I was talking about the cases when do_pg_stop_backup is called
while the server is in recovery mode i.e. backup_started_in_recovery =
RecoveryInProgress(); evaluates to true. I'm not sure in these cases
whether we should replace pg_usleep with WaitLatch. If yes, whether we
should use procLatch/MyLatch or recoveryWakeupLatch as they are
currently serving different purposes.

Regards,
Bharath Rupireddy.



Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> My bad. I was talking about the cases when do_pg_stop_backup is called
> while the server is in recovery mode i.e. backup_started_in_recovery =
> RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> whether we should replace pg_usleep with WaitLatch. If yes, whether we
> should use procLatch/MyLatch or recoveryWakeupLatch as they are
> currently serving different purposes.

It seems to me that you should re-read the description of
recoveryWakeupLatch at the top of xlog.c and check for which purpose
it exists, which is, in this case, to wake up the startup process to
accelerate WAL replay.  So do_pg_stop_backup() has no business with
it.

Switching pg_stop_backup() to use a latch rather than pg_usleep() has
benefits:
- It simplifies the wait event handling.
- The process waiting for the last WAL segment to be archived will be
more responsive on signals like SIGHUP and on postmaster death.

These don't sound bad to me to apply here, so 0002 could be simplified
as attached.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Kyotaro Horiguchi
Date:
At Fri, 2 Jul 2021 10:27:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote:
> > Dunno ... I cannot recall ever having had that as a debugging requirement
> > in a couple of decades worth of PG bug-chasing.  If the postmaster is
> > dying, you generally want to deal with that before bothering with child
> > processes.  Moreover, child processes that don't go awy when the
> > postmaster does are a very nasty problem, because they could screw up
> > subsequent debugging work.
> 
> At the same time, nobody has really complained about this being an
> issue for developer options.  I would tend to wait for more opinions
> before doing anything with the auth_delay GUCs.

I'm not sure the current behavior is especially useful for debugging,
however, I don't think it is especially useful that children
immediately respond to postmaster's death while the debug-delays,
because anyway children don't respond while debugging (until the
control (or code-pointer) reaches to the point of checking
postmaster's death), and the delays must be very short even if someone
abuses it on production systems. On the other hand, there could be a
discussion as a convention that any user-definable sleep requires to
respond to signals, maybe as Thomas mentioned.

So, I don't object either way we will go. But if we don't change the
behavior we instead would need a comment that explains the reason for
the pg_usleep.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Can a child process detect postmaster death when in pg_usleep?

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> > My bad. I was talking about the cases when do_pg_stop_backup is called
> > while the server is in recovery mode i.e. backup_started_in_recovery =
> > RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> > whether we should replace pg_usleep with WaitLatch. If yes, whether we
> > should use procLatch/MyLatch or recoveryWakeupLatch as they are
> > currently serving different purposes.
>
> It seems to me that you should re-read the description of
> recoveryWakeupLatch at the top of xlog.c and check for which purpose
> it exists, which is, in this case, to wake up the startup process to
> accelerate WAL replay.  So do_pg_stop_backup() has no business with
> it.
>
> Switching pg_stop_backup() to use a latch rather than pg_usleep() has
> benefits:
> - It simplifies the wait event handling.
> - The process waiting for the last WAL segment to be archived will be
> more responsive on signals like SIGHUP and on postmaster death.

Yes, agreed.

> These don't sound bad to me to apply here, so 0002 could be simplified
> as attached.

Took a quick look and the patch looks good to me.

In general, I agree with Tom's up-thread comment about children hanging
around after postmaster death making things more difficult for debugging
and just in general, so I'm in favor of trying to eliminate as many
cases where that's happening as we reasonably can without impacting
performance by checking too often.

Thanks,

Stephen

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Mon, Jul 5, 2021 at 7:33 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> > My bad. I was talking about the cases when do_pg_stop_backup is called
> > while the server is in recovery mode i.e. backup_started_in_recovery =
> > RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> > whether we should replace pg_usleep with WaitLatch. If yes, whether we
> > should use procLatch/MyLatch or recoveryWakeupLatch as they are
> > currently serving different purposes.
>
> It seems to me that you should re-read the description of
> recoveryWakeupLatch at the top of xlog.c and check for which purpose
> it exists, which is, in this case, to wake up the startup process to
> accelerate WAL replay.  So do_pg_stop_backup() has no business with
> it.

Hm. The shared recoveryWakeupLatch is being owned by the startup
process to wait and other backends/processes are using it to wake up
the startup process.

> Switching pg_stop_backup() to use a latch rather than pg_usleep() has
> benefits:
> - It simplifies the wait event handling.
> - The process waiting for the last WAL segment to be archived will be
> more responsive on signals like SIGHUP and on postmaster death.
>
> These don't sound bad to me to apply here, so 0002 could be simplified
> as attached.

The attached stop-backup-latch-v2.patch looks good to me.

Regards,
Bharath Rupireddy.



Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Mon, Jul 5, 2021 at 9:25 PM Stephen Frost <sfrost@snowman.net> wrote:
> In general, I agree with Tom's up-thread comment about children hanging
> around after postmaster death making things more difficult for debugging
> and just in general, so I'm in favor of trying to eliminate as many
> cases where that's happening as we reasonably can without impacting
> performance by checking too often.

I agree. I'm attaching the patch that replaces pg_usleep with
WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's
latest patch stop-backup-latch-v2.patch, just for the sake of cfbot.

Regards,
Bharath Rupireddy.

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Mon, Jul 05, 2021 at 09:42:29PM +0530, Bharath Rupireddy wrote:
> I agree. I'm attaching the patch that replaces pg_usleep with
> WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's
> latest patch stop-backup-latch-v2.patch, just for the sake of cfbot.

I don't object to the argument that switching to a latch for this code
path could be good for responsiveness, but switching it is less
attractive than the others as wait events are not available in
pg_stat_activity at authentication startup.  That's the case of normal
backends and WAL senders, not the case of autovacuum workers using
post_auth_delay if I read the code correctly.

Anyway, it is worth noting that the patch as proposed breaks
post_auth_delay.  MyLatch is set when reaching WaitLatch() for
post_auth_delay after loading the options, so the use of WL_LATCH_SET
is not right.  I think that this comes from SwitchToSharedLatch() in
InitProcess().  And it does not seem quite right to me to just blindly
reset the latch before doing the wait in this code path.  Perhaps we
could just use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH) to do the job.

The one for pg_stop_backup() has been applied, no objections to that.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Tue, Jul 6, 2021 at 6:15 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jul 05, 2021 at 09:42:29PM +0530, Bharath Rupireddy wrote:
> > I agree. I'm attaching the patch that replaces pg_usleep with
> > WaitLatch for {pre, post}_auth_delay. I'm also attaching Michael's
> > latest patch stop-backup-latch-v2.patch, just for the sake of cfbot.
>
> I don't object to the argument that switching to a latch for this code
> path could be good for responsiveness, but switching it is less
> attractive than the others as wait events are not available in
> pg_stat_activity at authentication startup.  That's the case of normal
> backends and WAL senders, not the case of autovacuum workers using
> post_auth_delay if I read the code correctly.

We may not see anything in the pg_stat_activity for {post,
pre}_auth_delay, but the processes can detect the postmaster death
with WaitLatch. I think we should focus on that.

> Anyway, it is worth noting that the patch as proposed breaks
> post_auth_delay.  MyLatch is set when reaching WaitLatch() for
> post_auth_delay after loading the options, so the use of WL_LATCH_SET
> is not right.  I think that this comes from SwitchToSharedLatch() in
> InitProcess().  And it does not seem quite right to me to just blindly
> reset the latch before doing the wait in this code path.  Perhaps we
> could just use (WL_TIMEOUT | WL_EXIT_ON_PM_DEATH) to do the job.

I'm sorry to say that I didn't get what was said above. We reset the
latch after we come out of WaitLatch but not before going to wait. And
the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch
is set for that process because of other SetLatch events. Am I missing
something here?

Regards,
Bharath Rupireddy.



Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Tue, Jul 06, 2021 at 12:42:21PM +0530, Bharath Rupireddy wrote:
> I'm sorry to say that I didn't get what was said above. We reset the
> latch after we come out of WaitLatch but not before going to wait. And
> the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch
> is set for that process because of other SetLatch events. Am I missing
> something here?

Did you test the patch with post_auth_delay and a backend connection,
making sure that the delay gets correctly applied?  I did, and that
was not working here.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Tue, Jul 6, 2021 at 1:38 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 06, 2021 at 12:42:21PM +0530, Bharath Rupireddy wrote:
> > I'm sorry to say that I didn't get what was said above. We reset the
> > latch after we come out of WaitLatch but not before going to wait. And
> > the reason to have WL_LATCH_SET, is to exit the wait loop if MyLatch
> > is set for that process because of other SetLatch events. Am I missing
> > something here?
>
> Did you test the patch with post_auth_delay and a backend connection,
> making sure that the delay gets correctly applied?  I did, and that
> was not working here.

Thanks. You are right. The issue is due to the MyLatch being set by
SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT |
WL_EXIT_ON_PM_DEATH), then the backends will honour the
post_auth_delay as well as detect the postmaster death. Since we are
not using WL_LATCH_SET, I removed ResetLatch. Also, added some
comments around why we are not using WL_LATCH_SET.

For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch
still points to the local latch(which is not set) in
BackendInitialize().

PSA v2 patch.

Regards,
Bharath Rupireddy.

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Tue, Jul 06, 2021 at 03:54:07PM +0530, Bharath Rupireddy wrote:
> Thanks. You are right. The issue is due to the MyLatch being set by
> SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT |
> WL_EXIT_ON_PM_DEATH), then the backends will honour the
> post_auth_delay as well as detect the postmaster death. Since we are
> not using WL_LATCH_SET, I removed ResetLatch. Also, added some
> comments around why we are not using WL_LATCH_SET.
>
> For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch
> still points to the local latch(which is not set) in
> BackendInitialize().

FWIW, I think that it could be a good idea to use the same set of
flags for all the pre/post_auth_delay paths for consistency.  That's
useful when grepping for one.  Please note that I don't plan to look
more at this patch set for this CF as I am not really excited by the
updates involving developer options, and I suspect more issues like
the one I found upthread so this needs a close lookup.

If somebody else wishes to look at it, please feel free, of course.
--
Michael

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Bharath Rupireddy
Date:
On Tue, Jul 6, 2021 at 4:33 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 06, 2021 at 03:54:07PM +0530, Bharath Rupireddy wrote:
> > Thanks. You are right. The issue is due to the MyLatch being set by
> > SwitchToSharedLatch before WaitLatch. If we use (WL_TIMEOUT |
> > WL_EXIT_ON_PM_DEATH), then the backends will honour the
> > post_auth_delay as well as detect the postmaster death. Since we are
> > not using WL_LATCH_SET, I removed ResetLatch. Also, added some
> > comments around why we are not using WL_LATCH_SET.
> >
> > For PreAuthDelay, there's no problem to use WL_LATCH_SET as MyLatch
> > still points to the local latch(which is not set) in
> > BackendInitialize().
>
> FWIW, I think that it could be a good idea to use the same set of
> flags for all the pre/post_auth_delay paths for consistency.  That's
> useful when grepping for one.  Please note that I don't plan to look
> more at this patch set for this CF as I am not really excited by the
> updates involving developer options, and I suspect more issues like
> the one I found upthread so this needs a close lookup.
>
> If somebody else wishes to look at it, please feel free, of course.

Thanks. Anyways, I removed WL_LATCH_SET for PreAuthDelay as well. PSA v4 patch.

Regards,
Bharath Rupireddy.

Attachment

Re: Can a child process detect postmaster death when in pg_usleep?

From
Michael Paquier
Date:
On Tue, Jul 06, 2021 at 05:07:04PM +0530, Bharath Rupireddy wrote:
> Thanks. Anyways, I removed WL_LATCH_SET for PreAuthDelay as
> well. PSA v4 patch.

For the moment, please note that I have marked the patch as committed
in the CF app.  It may be better to start a new thread with the
remaining bits for a separate evaluation.
--
Michael

Attachment