Thread: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
Hello, In commits 9f095299 and f98b8476 we improved recovery performance on Linux and FreeBSD but we didn't help other operating systems. David just confirmed for me that commenting out the PostmasterIsAlive() call in the main recovery loop speeds up crash recovery considerably on his Windows system: 93s -> 70s or 1.32x faster. So I think we should do something like what Heikki originally proposed to lower the frequency of checks, on systems where we don't have the ability to skip the check completely. Please see attached.
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Heikki Linnakangas
Date:
On 17/09/2020 12:48, Thomas Munro wrote: > Hello, > > In commits 9f095299 and f98b8476 we improved recovery performance on > Linux and FreeBSD but we didn't help other operating systems. David > just confirmed for me that commenting out the PostmasterIsAlive() call > in the main recovery loop speeds up crash recovery considerably on his > Windows system: 93s -> 70s or 1.32x faster. Nice speedup! > So I think we should do > something like what Heikki originally proposed to lower the frequency > of checks, on systems where we don't have the ability to skip the > check completely. Please see attached. If you put the counter in HandleStartupProcInterrupts(), it could be a long wait if the startup process is e.g. waiting for WAL to arrive in the loop in WaitForWALToBecomeAvailable(), or in recoveryPausesHere(). My original patch only reduced the frequency in the WAL redo loop, when you're actively replaying records. We could probably do better on Windows. Maybe the signal handler thread could wait on the PostmasterHandle at the same time that it waits on the signal pipe, and set postmaster_possibly_dead. But I'm not going to work on that, and it would only help on Windows, so I'm OK with just adding the counter. - Heikki
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Thu, Sep 17, 2020 at 10:19 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 17/09/2020 12:48, Thomas Munro wrote: > > So I think we should do > > something like what Heikki originally proposed to lower the frequency > > of checks, on systems where we don't have the ability to skip the > > check completely. Please see attached. > > If you put the counter in HandleStartupProcInterrupts(), it could be a > long wait if the startup process is e.g. waiting for WAL to arrive in > the loop in WaitForWALToBecomeAvailable(), or in recoveryPausesHere(). > My original patch only reduced the frequency in the WAL redo loop, when > you're actively replaying records. Oh, I checked that WaitForWALToBecomeAvailable() already handled postmaster death via events rather than polling, with WL_EXIT_ON_PM_DEATH, but I hadn't clocked that recoveryPausesHere() uses pg_usleep() and polling. Hmm. Perhaps we should change that instead? The reason I did it that way is that I didn't want to make the new ProcSignalBarrierPending handler less reactive. > We could probably do better on Windows. Maybe the signal handler thread > could wait on the PostmasterHandle at the same time that it waits on the > signal pipe, and set postmaster_possibly_dead. But I'm not going to work > on that, and it would only help on Windows, so I'm OK with just adding > the counter. Yeah, I had the same thought.
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Heikki Linnakangas
Date:
On 17/09/2020 13:31, Thomas Munro wrote: > On Thu, Sep 17, 2020 at 10:19 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> If you put the counter in HandleStartupProcInterrupts(), it could be a >> long wait if the startup process is e.g. waiting for WAL to arrive in >> the loop in WaitForWALToBecomeAvailable(), or in recoveryPausesHere(). >> My original patch only reduced the frequency in the WAL redo loop, when >> you're actively replaying records. > > Oh, I checked that WaitForWALToBecomeAvailable() already handled > postmaster death via events rather than polling, with > WL_EXIT_ON_PM_DEATH, but I hadn't clocked that recoveryPausesHere() > uses pg_usleep() and polling. Hmm. Perhaps we should change that > instead? The reason I did it that way is that I didn't want to make > the new ProcSignalBarrierPending handler less reactive. Hmm, so for speedy response to postmaster death, you're relying on the loops to have other postmaster-death checks besides HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That seems a bit fragile, at the very least it needs a comment in HandleStartupProcInterrupts() to call it out. Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep(). - Heikki
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Hmm, so for speedy response to postmaster death, you're relying on the > loops to have other postmaster-death checks besides > HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That > seems a bit fragile, at the very least it needs a comment in > HandleStartupProcInterrupts() to call it out. Surely that's the direction we want to go in, though, no? Commit cfdf4dc4 was intended to standardise the way we react to postmaster death where waiting is involved. I updated the comment in HandleStartupProcInterrupts() to highlight that the PostmasterIsAlive() check in there is only for the benefit of CPU-bound loops. > Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep(). Updating that one required me to invent a new wait_event for pg_stat_activity, which seems like progress. Unfortunately, while I was doing that I realised that WaitLatch() without WL_SET_LATCH was broken by commit 3347c982bab (in master only), in a way not previously reached by the tests. So 0001 is a patch to fix that.
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Fujii Masao
Date:
On 2020/09/18 9:30, Thomas Munro wrote: > On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Hmm, so for speedy response to postmaster death, you're relying on the >> loops to have other postmaster-death checks besides >> HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That >> seems a bit fragile, at the very least it needs a comment in >> HandleStartupProcInterrupts() to call it out. > > Surely that's the direction we want to go in, though, no? Commit > cfdf4dc4 was intended to standardise the way we react to postmaster > death where waiting is involved. I updated the comment in > HandleStartupProcInterrupts() to highlight that the > PostmasterIsAlive() check in there is only for the benefit of > CPU-bound loops. > >> Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep(). > > Updating that one required me to invent a new wait_event for > pg_stat_activity, which seems like progress. > > Unfortunately, while I was doing that I realised that WaitLatch() > without WL_SET_LATCH was broken by commit 3347c982bab (in master > only), in a way not previously reached by the tests. So 0001 is a > patch to fix that. - pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE); - pg_usleep(1000000L); /* 1000 ms */ - pgstat_report_wait_end(); + WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000, + WAIT_EVENT_RECOVERY_PAUSE); This change may cause at most one second delay against the standby promotion request during WAL replay pause? It's only one second, but I'd like to avoid this (unnecessary) wait to shorten the failover time as much as possible basically. So what about using WL_SET_LATCH here? But when using WL_SET_LATCH, one concern is that walreceiver can wake up the startup process too frequently even during WAL replay pause. This is also problematic? I'm ok with this, but if not, using pg_usleep() might be better as the original code does. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Sat, Sep 19, 2020 at 6:07 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > - pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE); > - pg_usleep(1000000L); /* 1000 ms */ > - pgstat_report_wait_end(); > + WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000, > + WAIT_EVENT_RECOVERY_PAUSE); > > This change may cause at most one second delay against the standby > promotion request during WAL replay pause? It's only one second, > but I'd like to avoid this (unnecessary) wait to shorten the failover time > as much as possible basically. So what about using WL_SET_LATCH here? Right, there is a comment saying that we could do that: * XXX Could also be done with shared latch, avoiding the pg_usleep loop. * Probably not worth the trouble though. This state shouldn't be one that * anyone cares about server power consumption in. > But when using WL_SET_LATCH, one concern is that walreceiver can > wake up the startup process too frequently even during WAL replay pause. > This is also problematic? I'm ok with this, but if not, using pg_usleep() > might be better as the original code does. You're right, at least if we used recoveryWakeupLatch. Although we'd react to pg_wal_replay_resume() faster, which would be nice, we wouldn't be saving energy, we'd be using more energy due to all the other latch wakeups that we'd be ignoring. I believe the correct solution to this problem is to add a ConditionVariable "recoveryPauseChanged" into XLogCtlData, and then broadcast on it in SetRecoveryPause(). This would be a trivial change, except for one small problem: ConditionVariableTimedSleep() contains CHECK_FOR_INTERRUPTS(), but startup.c has its own special interrupt handling rather than using ProcessInterrupts() from postgres.c. Maybe that's OK, I'm not sure, but it requires more thought, and I propose to keep the existing sloppy polling for now and leave precise wakeup improvements for a separate patch. The primary goal of this patch is to switch to the standard treatment of postmaster death in wait loops, so that we're free to reduce the sampling frequency in HandleStartupProcInterrupts(), to fix a horrible performance problem. I have at least tweaked that comment about pg_usleep(), though, because that was out of date; I also used (void) WaitLatch(...) to make it look like other places where we ignore the return value (perhaps some static analyser out there somewhere cares?) By the way, a CV could probably be used for walreceiver state changes too, to improve ShutdownWalRcv(). Although I know from CI that this builds and passes "make check" on Windows, I'm hoping to attract some review of the 0001 patch from a Windows person, and confirmation that it passes "check-world" (or at least src/test/recovery check) there, because I don't have CI scripts for that yet.
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
David Rowley
Date:
On Sun, 20 Sep 2020 at 09:29, Thomas Munro <thomas.munro@gmail.com> wrote: > > Although I know from CI that this builds and passes "make check" on > Windows, I'm hoping to attract some review of the 0001 patch from a > Windows person, and confirmation that it passes "check-world" (or at > least src/test/recovery check) there, because I don't have CI scripts > for that yet. I've gone as far as running the recovery tests on the v3-0001 patch using a Windows machine. They pass: L:\Projects\Postgres\d\src\tools\msvc>vcregress taptest src/test/recovery ... t/001_stream_rep.pl .................. ok t/002_archiving.pl ................... ok t/003_recovery_targets.pl ............ ok t/004_timeline_switch.pl ............. ok t/005_replay_delay.pl ................ ok t/006_logical_decoding.pl ............ ok t/007_sync_rep.pl .................... ok t/008_fsm_truncation.pl .............. ok t/009_twophase.pl .................... ok t/010_logical_decoding_timelines.pl .. ok t/011_crash_recovery.pl .............. skipped: Test fails on Windows perl t/012_subtransactions.pl ............. ok t/013_crash_restart.pl ............... ok t/014_unlogged_reinit.pl ............. ok t/015_promotion_pages.pl ............. ok t/016_min_consistency.pl ............. ok t/017_shm.pl ......................... skipped: SysV shared memory not supported by this platform t/018_wal_optimize.pl ................ ok t/019_replslot_limit.pl .............. ok t/020_archive_status.pl .............. ok All tests successful. Files=20, Tests=222, 397 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU) Result: PASS L:\Projects\Postgres\d\src\tools\msvc>git diff --stat src/backend/storage/ipc/latch.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) David
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Wed, Sep 23, 2020 at 2:27 PM David Rowley <dgrowleyml@gmail.com> wrote: > I've gone as far as running the recovery tests on the v3-0001 patch > using a Windows machine. They pass: Thanks! I pushed that one, because it was effectively a bug fix (WaitLatch() without a latch was supposed to work). I'll wait longer for feedback on the main patch; perhaps someone has a better idea, or wants to take issue with the magic number 1024 (ie limit on how many records we'll replay before we notice the postmaster has exited), or my plan to harmonise those wait loops? It has a CF entry for the next CF.
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Fujii Masao
Date:
On 2020/09/23 12:47, Thomas Munro wrote: > On Wed, Sep 23, 2020 at 2:27 PM David Rowley <dgrowleyml@gmail.com> wrote: >> I've gone as far as running the recovery tests on the v3-0001 patch >> using a Windows machine. They pass: > > Thanks! I pushed that one, because it was effectively a bug fix > (WaitLatch() without a latch was supposed to work). Great! > > I'll wait longer for feedback on the main patch; perhaps someone has a > better idea, or wants to take issue with the magic number 1024 (ie > limit on how many records we'll replay before we notice the postmaster > has exited), or my plan to harmonise those wait loops? It has a CF > entry for the next CF. Does this patch work fine with warm-standby case using pg_standby? IIUC the startup process doesn't call WaitLatch() in that case, so ISTM that, with the patch, it cannot detect the postmaster death immediately. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Thu, Sep 24, 2020 at 2:39 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Does this patch work fine with warm-standby case using pg_standby? > IIUC the startup process doesn't call WaitLatch() in that case, so ISTM that, > with the patch, it cannot detect the postmaster death immediately. Right, RestoreArchivedFile() uses system(), so I guess it can hang around for a long time after unexpected postmaster exit on every OS if the command waits. To respond to various kinds of important interrupts, I suppose that'd ideally use something like OpenPipeStream() and a typical WaitLatch() loop with CFI(). I'm not sure what our policy is or should be for exiting while we have running subprocesses. I guess that is a separate issue. Here's a rebase, no code change.
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Michael Paquier
Date:
On Thu, Sep 24, 2020 at 05:55:17PM +1200, Thomas Munro wrote: > Right, RestoreArchivedFile() uses system(), so I guess it can hang > around for a long time after unexpected postmaster exit on every OS if > the command waits. To respond to various kinds of important > interrupts, I suppose that'd ideally use something like > OpenPipeStream() and a typical WaitLatch() loop with CFI(). I'm not > sure what our policy is or should be for exiting while we have running > subprocesses. I guess that is a separate issue. - if (IsUnderPostmaster && !PostmasterIsAlive()) + if (IsUnderPostmaster && +#ifndef USE_POSTMASTER_DEATH_SIGNAL + count++ % 1024 == 0 && +#endif + !PostmasterIsAlive()) That's pretty hack-ish, still efficient. Could we consider a different approach like something relying on PostmasterIsAliveInternal() with repetitive call handling? This may not be the only place where we care about that, particularly for non-core code. No objections with the two changes from pg_usleep() to WaitLatch() so they could be applied separately first. -- Michael
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Fujii Masao
Date:
On 2021/03/02 10:10, Thomas Munro wrote: > On Tue, Mar 2, 2021 at 12:00 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Mon, Nov 16, 2020 at 8:56 PM Michael Paquier <michael@paquier.xyz> wrote: >>> No objections with the two changes from pg_usleep() to WaitLatch() so >>> they could be applied separately first. >> >> I thought about committing that first part, and got as far as >> splitting the patch into two (see attached), but then I re-read >> Fujii-san's message about the speed of promotion and realised that we >> really should have something like a condition variable for walRcvState >> changes. I'll look into that. > > Here's an experimental attempt at that, though I'm not sure if it's > the right approach. Of course it's not necessary to use condition > variables here: we could use recoveryWakeupLatch, because we're not in > any doubt about who needs to be woken up. But then you could get > constant wakeups while recovery is paused, unless you also suppressed > that somehow. You could use the startup process's procLatch, > advertised in shmem, but that's almost a condition variable. With a > condition variable, you get to name it something like > walRcvStateChanged, and then the programming rule is very clear: if > you change walRcvState, you need to broadcast that fact (and you don't > have to worry about who might be interested). One question I haven't > got to the bottom of: is it a problem for the startup process that CVs > use CHECK_FOR_INTERRUPTS()? I found 0001 patch was committed in de829ddf23, and which added new wait event WalrcvExit. This name seems not consistent with other wait events. I'm thinking it's better to rename it to WalReceiverExit. Thought? Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > I found 0001 patch was committed in de829ddf23, and which added new > wait event WalrcvExit. This name seems not consistent with other wait > events. I'm thinking it's better to rename it to WalReceiverExit. Thought? > Patch attached. Agreed, your names are better. Thanks.
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Fujii Masao
Date:
On 2021/03/23 10:52, Thomas Munro wrote: > On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> I found 0001 patch was committed in de829ddf23, and which added new >> wait event WalrcvExit. This name seems not consistent with other wait >> events. I'm thinking it's better to rename it to WalReceiverExit. Thought? >> Patch attached. > > Agreed, your names are better. Thanks. Thanks! I will commit the patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Fujii Masao
Date:
On 2021/03/23 14:49, Fujii Masao wrote: > > > On 2021/03/23 10:52, Thomas Munro wrote: >> On Tue, Mar 23, 2021 at 2:44 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>> I found 0001 patch was committed in de829ddf23, and which added new >>> wait event WalrcvExit. This name seems not consistent with other wait >>> events. I'm thinking it's better to rename it to WalReceiverExit. Thought? >>> Patch attached. >> >> Agreed, your names are better. Thanks. > > Thanks! I will commit the patch. Pushed. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Fri, Mar 12, 2021 at 7:55 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Mar 11, 2021 at 7:34 PM Michael Paquier <michael@paquier.xyz> wrote: > > Wow. This probably means that we would be able to get rid of > > USE_POSTMASTER_DEATH_SIGNAL? > > We'll still need it, because there'd still be systems with no signal: > NetBSD, OpenBSD, AIX, HPUX, illumos. Erm, and of course macOS. There is actually something we could do here: ioctl(I_SETSIG) for SysV-derived systems and fcntl(O_ASYNC) for BSD-derived systems seems like a promising way to get a SIGIO signal when the postmaster goes away and the pipe becomes readable. Previously I'd discounted this, because it's not in POSIX and I doubted it would work well on other systems. But I was flicking through Stevens' UNIX book while trying to figure out that POLLHUP stuff from a nearby thread (though it's useless for that purpose) and I learned from section 12.6 that SIGIO is fairly ancient, originating in 4.2BSD, adopted by SVR4, so it's likely present in quite a few systems, maybe even all of our support platforms if we're prepared to do it two different ways. Just a thought.
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Wed, Mar 31, 2021 at 7:02 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Mar 12, 2021 at 7:55 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Mar 11, 2021 at 7:34 PM Michael Paquier <michael@paquier.xyz> wrote: > > > Wow. This probably means that we would be able to get rid of > > > USE_POSTMASTER_DEATH_SIGNAL? > > > > We'll still need it, because there'd still be systems with no signal: > > NetBSD, OpenBSD, AIX, HPUX, illumos. > > Erm, and of course macOS. > > There is actually something we could do here: ioctl(I_SETSIG) for > SysV-derived systems and fcntl(O_ASYNC) for BSD-derived systems seems > like a promising way to get a SIGIO signal when the postmaster goes > away and the pipe becomes readable. Previously I'd discounted this, > because it's not in POSIX and I doubted it would work well on other > systems. But I was flicking through Stevens' UNIX book while trying > to figure out that POLLHUP stuff from a nearby thread (though it's > useless for that purpose) and I learned from section 12.6 that SIGIO > is fairly ancient, originating in 4.2BSD, adopted by SVR4, so it's > likely present in quite a few systems, maybe even all of our support > platforms if we're prepared to do it two different ways. Just a > thought. Alright, here's a proof-of-concept patch that does that. Adding to the next CF. This seems to work on Linux, macOS, FreeBSD and OpenBSD (and I assume any other BSD). Can anyone tell me if it works on illumos, AIX or HPUX, and if not, how to fix it or disable the feature gracefully? For now the patch assumes that if you have SIGIO then you can do this; perhaps it should also test for O_ASYNC? Perhaps HPUX has the signal but requires a different incantation with I_SETSIG? Full disclosure: The reason for my interest in this subject is that I have a work-in-progress patch set to make latches, locks and condition variables more efficient using futexes on several OSes, but it needs a signal to wake on postmaster death.
Attachment
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Heikki Linnakangas
Date:
On 09/04/2021 07:01, Thomas Munro wrote: > On Wed, Mar 31, 2021 at 7:02 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Fri, Mar 12, 2021 at 7:55 PM Thomas Munro <thomas.munro@gmail.com> wrote: >>> On Thu, Mar 11, 2021 at 7:34 PM Michael Paquier <michael@paquier.xyz> wrote: >>>> Wow. This probably means that we would be able to get rid of >>>> USE_POSTMASTER_DEATH_SIGNAL? >>> >>> We'll still need it, because there'd still be systems with no signal: >>> NetBSD, OpenBSD, AIX, HPUX, illumos. >> >> Erm, and of course macOS. >> >> There is actually something we could do here: ioctl(I_SETSIG) for >> SysV-derived systems and fcntl(O_ASYNC) for BSD-derived systems seems >> like a promising way to get a SIGIO signal when the postmaster goes >> away and the pipe becomes readable. Previously I'd discounted this, >> because it's not in POSIX and I doubted it would work well on other >> systems. But I was flicking through Stevens' UNIX book while trying >> to figure out that POLLHUP stuff from a nearby thread (though it's >> useless for that purpose) and I learned from section 12.6 that SIGIO >> is fairly ancient, originating in 4.2BSD, adopted by SVR4, so it's >> likely present in quite a few systems, maybe even all of our support >> platforms if we're prepared to do it two different ways. Just a >> thought. > > Alright, here's a proof-of-concept patch that does that. Adding to the next CF. Looks good to me. > This seems to work on Linux, macOS, FreeBSD and OpenBSD (and I assume > any other BSD). Can anyone tell me if it works on illumos, AIX or > HPUX, and if not, how to fix it or disable the feature gracefully? > For now the patch assumes that if you have SIGIO then you can do this; > perhaps it should also test for O_ASYNC? Perhaps HPUX has the signal > but requires a different incantation with I_SETSIG? I think it would be OK to just commit this (after REL_14_STABLE has been created) and see what breaks in the buildfarm. Then we'll at least know if we need more autoconf checks or something to disable this. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 09/04/2021 07:01, Thomas Munro wrote: >> This seems to work on Linux, macOS, FreeBSD and OpenBSD (and I assume >> any other BSD). Can anyone tell me if it works on illumos, AIX or >> HPUX, and if not, how to fix it or disable the feature gracefully? >> For now the patch assumes that if you have SIGIO then you can do this; >> perhaps it should also test for O_ASYNC? Perhaps HPUX has the signal >> but requires a different incantation with I_SETSIG? I took a look on HPUX 10.20 (gaur's host): * SIGIO exists, but signal.h only defines it with -D_INCLUDE_HPUX_SOURCE which we don't use. * I found I_SETSIG, but ... $ grep -r SETSIG /usr/include /usr/include/sys/stropts.h:#define I_SETSIG _IO('S', 9) /* request SIGPOLL signal on events */ stropts.h seems to be for a feature called "streams", which is probably nonstandard enough that we don't want to deal with it. So I think the short answer on this platform is that if you conditionalize on #ifdef SIGIO then it will just not do it, and we should be fine. Can't say about HPUX 11. regards, tom lane
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
From
Thomas Munro
Date:
On Fri, Jun 11, 2021 at 1:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On 09/04/2021 07:01, Thomas Munro wrote: > >> This seems to work on Linux, macOS, FreeBSD and OpenBSD (and I assume > >> any other BSD). Can anyone tell me if it works on illumos, AIX or > >> HPUX, and if not, how to fix it or disable the feature gracefully? > >> For now the patch assumes that if you have SIGIO then you can do this; > >> perhaps it should also test for O_ASYNC? Perhaps HPUX has the signal > >> but requires a different incantation with I_SETSIG? > > I took a look on HPUX 10.20 (gaur's host): Thanks both for looking! Unfortunately I'll have to withdraw this patch in its current form. On closer inspection, only the last backend to start up and do F_SETOWN on the pipe receives the signal. We'd probably have to create a separate pipe for each backend, or something like that, which seems unwarranted so far. > * I found I_SETSIG, but ... > > $ grep -r SETSIG /usr/include > /usr/include/sys/stropts.h:#define I_SETSIG _IO('S', 9) /* request SIGPOLL signal on events */ > > stropts.h seems to be for a feature called "streams", which is > probably nonstandard enough that we don't want to deal with it. Agreed. It is technically POSIX, but optional and marked obsolescent. It's annoying to see that I_SETSIG did allow multiple processes to register to receive signals for the same event on the same underlying stream, unlike F_SETOWN. Oh well.