Thread: Race conditions with checkpointer and shutdown
Hi all,

This is a continuation of the following thread, but I prefer spawning a new thread for clarity:
https://www.postgresql.org/message-id/20190416064512.GJ2673@paquier.xyz

The buildfarm has reported two similar failures when shutting down a node:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

In both cases, the instance cannot shut down because it times out while waiting for the shutdown checkpoint to finish, but I suspect that this checkpoint actually never happens.

The first case involves piculet, which uses --disable-atomics and gcc 6, and the recovery test 016_min_consistency, where we trigger a checkpoint and then issue a fast shutdown on a standby. At this point the test waits forever.

The second case involves dragonet, which has JIT enabled and uses clang. The failure is in test 009_twophase.pl, after the test prepares transaction xact_009_11, where a *standby* gets restarted. Again, the test waits forever for the instance to shut down.

The most recent commits which have touched checkpoints are 0dfe3d0e and c6c9474a, which map roughly to the point where the failures began to happen, so something related to clean shutdowns of standbys may have broken since then.

Thanks,
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
> In both cases, the instance cannot shut down because it times out,
> waiting for the shutdown checkpoint to finish but I suspect that this
> checkpoint actually never happens.

Hmm, I don't think that that is actually where the problem is. In piculet's failure, the test script times out waiting for a "fast" shutdown of the standby server, and what we see in the standby's log is

2019-03-23 22:44:12.181 UTC [9731] LOG:  received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG:  aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL:  terminating walreceiver process due to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG:  received immediate shutdown request

where the last line indicates that the test script lost patience and issued an immediate shutdown. However, in a successful run of the test, the log looks like

2019-03-24 03:33:25.592 UTC [23816] LOG:  received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG:  aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL:  terminating walreceiver process due to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG:  shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG:  database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG:  starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit

where the last line reflects restarting the server for the next test step.

So in the failure case we don't see the "shutting down" message, which means we never got to ShutdownXLOG, so no checkpoint request was made. Even if we had got to ShutdownXLOG, the process is just executing the operation directly, it's not sending a signal asking some other process to do the checkpoint; so it's hard to see how either of the commits you mention could be involved.

I think what we need to look for is reasons why (1) the postmaster never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's main loop doesn't get to noticing shutdown_requested.

A rather scary point for (2) is that said main loop seems to be assuming that MyLatch a/k/a MyProc->procLatch is not used for any other purposes in the checkpointer process. If there were something, like say a condition variable wait, that would reset MyLatch at any time during a checkpoint, then we could very easily go to sleep at the bottom of the loop and not notice that there's a pending shutdown request.

Now, c6c9474aa did not break this, because the latch resets that it added happen in other processes, not the checkpointer. But I'm feeling suspicious that some other change we made recently might've borked it. And in general, it seems like we've managed to load a lot of potentially conflicting roles onto process latches.

Do we need to think harder about establishing rules for multiplexed use of the process latch? I'm imagining some rule like "if you are not the outermost event loop of a process, you do not get to summarily clear MyLatch. Make sure to leave it set after waiting, if there was any possibility that it was set by something other than the specific event you're concerned with".

        regards, tom lane
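To make the hazard concrete, here is a minimal sketch of an inner-level wait that follows the proposed rule. This is not code from the tree: the helper name, its argument, and the wait-event constant are only illustrative, chosen to fit PostgreSQL's latch API as it exists in this era.

    /*
     * Hypothetical inner-level wait (illustrative only).  If we reset
     * MyLatch while waiting for our own event, any wakeup we consumed may
     * really have been aimed at the outer loop (e.g. the checkpointer's
     * shutdown request), so re-set the latch before returning.
     */
    static void
    wait_for_inner_event(volatile sig_atomic_t *inner_event_arrived)
    {
        bool        consumed_wakeup = false;

        while (!*inner_event_arrived)
        {
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1L,
                             WAIT_EVENT_MQ_INTERNAL);   /* wait event name is illustrative */
            ResetLatch(MyLatch);
            consumed_wakeup = true;
        }

        /*
         * Leave the latch set if we ate any wakeups, so that the outer
         * loop's next WaitLatch() returns immediately and it re-checks its
         * own flags, such as shutdown_requested.
         */
        if (consumed_wakeup)
            SetLatch(MyLatch);
    }

The extra SetLatch() costs at most one spurious iteration of the outer loop, which is the cheap side of the trade-off Robert describes below.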
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do we need to think harder about establishing rules for multiplexed
> use of the process latch?  I'm imagining some rule like "if you are
> not the outermost event loop of a process, you do not get to
> summarily clear MyLatch.  Make sure to leave it set after waiting,
> if there was any possibility that it was set by something other than
> the specific event you're concerned with".

Hmm, yeah. If the latch is left set, then the outer loop will just go through an extra and unnecessary iteration, which seems fine. If the latch is left clear, then the outer loop might miss a wakeup intended for it and hang forever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Do we need to think harder about establishing rules for multiplexed
> > use of the process latch?  I'm imagining some rule like "if you are
> > not the outermost event loop of a process, you do not get to
> > summarily clear MyLatch.  Make sure to leave it set after waiting,
> > if there was any possibility that it was set by something other than
> > the specific event you're concerned with".
>
> Hmm, yeah.  If the latch is left set, then the outer loop will just go
> through an extra and unnecessary iteration, which seems fine.  If the
> latch is left clear, then the outer loop might miss a wakeup intended
> for it and hang forever.

Arguably that's a sign that the latch-using code in the outer loop(s) isn't written correctly? If you do it as:

    while (true)
    {
        CHECK_FOR_INTERRUPTS();

        ResetLatch(MyLatch);

        if (work_needed)
        {
            Plenty();
            Code();
            Using(MyLatch);
        }
        else
        {
            WaitLatch(MyLatch);
        }
    }

I think that's not a danger? I think the problem really is that we suggest doing that WaitLatch() unconditionally:

 * The correct pattern to wait for event(s) is:
 *
 * for (;;)
 * {
 *     ResetLatch();
 *     if (work to do)
 *         Do Stuff();
 *     WaitLatch();
 * }
 *
 * It's important to reset the latch *before* checking if there's work to
 * do. Otherwise, if someone sets the latch between the check and the
 * ResetLatch call, you will miss it and Wait will incorrectly block.
 *
 * Another valid coding pattern looks like:
 *
 * for (;;)
 * {
 *     if (work to do)
 *         Do Stuff(); // in particular, exit loop if some condition satisfied
 *     WaitLatch();
 *     ResetLatch();
 * }

Obviously there's the issue that a lot of latch-using code isn't written that way - but I also don't think there's that much outer-loop latch code that then also calls inner code using the same latch. Seems like we could fix that. While it has obvious dangers of not being followed, so does the 'always-set-latch-unless-outermost-loop' approach.

I'm not sure I like the idea of incurring another unnecessary SetLatch() call for most latch-using places.

I guess there's a bit bigger danger of taking longer to notice postmaster death. But I'm not sure I can quite see that being problematic - seems like all we should incur is another cycle through the loop, as the latch shouldn't be set anymore.

Greetings,

Andres Freund
Hi, On 2019-04-16 17:05:36 -0700, Andres Freund wrote: > On 2019-04-16 18:59:37 -0400, Robert Haas wrote: > > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Do we need to think harder about establishing rules for multiplexed > > > use of the process latch? I'm imagining some rule like "if you are > > > not the outermost event loop of a process, you do not get to > > > summarily clear MyLatch. Make sure to leave it set after waiting, > > > if there was any possibility that it was set by something other than > > > the specific event you're concerned with". > > > > Hmm, yeah. If the latch is left set, then the outer loop will just go > > through an extra and unnecessary iteration, which seems fine. If the > > latch is left clear, then the outer loop might miss a wakeup intended > > for it and hang forever. > > Arguably that's a sign that the latch using code in the outer loop(s) isn't > written correctly? If you do it as: > > while (true) > { > CHECK_FOR_INTERRUPTS(); > > ResetLatch(MyLatch); > > if (work_needed) > { > Plenty(); > Code(); > Using(MyLatch); > } > else > { > WaitLatch(MyLatch); > } > } > > I think that's not a danger? I think the problem really is that we > suggest doing that WaitLatch() unconditionally: > > * The correct pattern to wait for event(s) is: > * > * for (;;) > * { > * ResetLatch(); > * if (work to do) > * Do Stuff(); > * WaitLatch(); > * } > * > * It's important to reset the latch *before* checking if there's work to > * do. Otherwise, if someone sets the latch between the check and the > * ResetLatch call, you will miss it and Wait will incorrectly block. > * > * Another valid coding pattern looks like: > * > * for (;;) > * { > * if (work to do) > * Do Stuff(); // in particular, exit loop if some condition satisfied > * WaitLatch(); > * ResetLatch(); > * } > > Obviously there's the issue that a lot of latch using code isn't written > that way - but I also don't think there's that many latch using code > that then also uses latch. Seems like we could fix that. While it has > obviously dangers of not being followed, so does the > 'always-set-latch-unless-outermost-loop' approach. > > I'm not sure I like the idea of incurring another unnecessary SetLatch() > call for most latch using places. > > I guess there's a bit bigger danger of taking longer to notice > postmaster-death. But I'm not sure I can quite see that being > problematic - seems like all we should incur is another cycle through > the loop, as the latch shouldn't be set anymore. I think we should thus change our latch documentation to say: something like: diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h index fc995819d35..dc46dd94c5b 100644 --- a/src/include/storage/latch.h +++ b/src/include/storage/latch.h @@ -44,22 +44,31 @@ * { * ResetLatch(); * if (work to do) - * Do Stuff(); - * WaitLatch(); + * DoStuff(); + * else + * WaitLatch(); * } * * It's important to reset the latch *before* checking if there's work to * do. Otherwise, if someone sets the latch between the check and the * ResetLatch call, you will miss it and Wait will incorrectly block. * + * The reason to only wait on the latch in case there is nothing to do is that + * code inside DoStuff() might use the same latch, and leave it reset, even + * though a SetLatch() aimed for the outer loop arrived. Which again could + * lead to incorrectly blocking in Wait. 
+ * * Another valid coding pattern looks like: * * for (;;) * { * if (work to do) - * Do Stuff(); // in particular, exit loop if some condition satisfied - * WaitLatch(); - * ResetLatch(); + * DoStuff(); // in particular, exit loop if some condition satisfied + * else + * { + * WaitLatch(); + * ResetLatch(); + * } * } * * This is useful to reduce latch traffic if it's expected that the loop's and adapt code to match (at least in the outer loops). Greetings, Andres Freund
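As a reading aid, here is a self-contained version of the loop shape that diff recommends. The names work_to_do and do_stuff are placeholders, and shutdown_requested stands in for whatever flag the process's signal handlers set; none of this is an excerpt from the tree.

    /*
     * Illustrative outer loop following the proposed pattern: reset the
     * latch first, then either do pending work or wait -- never both in
     * the same iteration.  Because we only wait when there is nothing to
     * do, a SetLatch() swallowed by inner code while doing work cannot
     * make us sleep past a pending request such as a shutdown.
     */
    for (;;)
    {
        ResetLatch(MyLatch);

        if (shutdown_requested)     /* flag set by a signal handler */
            break;

        if (work_to_do())           /* placeholder for the real condition */
            do_stuff();             /* may itself wait on and reset MyLatch */
        else
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1L,
                             WAIT_EVENT_CHECKPOINTER_MAIN);
    }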
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what we need to look for is reasons why (1) the postmaster
> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
> main loop doesn't get to noticing shutdown_requested.
>
> A rather scary point for (2) is that said main loop seems to be
> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
> other purposes in the checkpointer process.  If there were something,
> like say a condition variable wait, that would reset MyLatch at any
> time during a checkpoint, then we could very easily go to sleep at the
> bottom of the loop and not notice that there's a pending shutdown request.

Agreed on the non-composability of that coding, but if there actually is anything in that loop that can reach ResetLatch(), it's well hidden...

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Apr 17, 2019 at 10:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think what we need to look for is reasons why (1) the postmaster
>> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
>> main loop doesn't get to noticing shutdown_requested.
>>
>> A rather scary point for (2) is that said main loop seems to be
>> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
>> other purposes in the checkpointer process.  If there were something,
>> like say a condition variable wait, that would reset MyLatch at any
>> time during a checkpoint, then we could very easily go to sleep at the
>> bottom of the loop and not notice that there's a pending shutdown request.

> Agreed on the non-composability of that coding, but if there actually
> is anything in that loop that can reach ResetLatch(), it's well
> hidden...

Well, it's easy to see that there's no other ResetLatch call in checkpointer.c. It's much less obvious that there's no such call anywhere in the code reachable from e.g. CreateCheckPoint(). To try to investigate that, I hacked things up to force an assertion failure if ResetLatch was called from any other place in the checkpointer process (dirty patch attached for amusement's sake). This gets through check-world without any assertions.

That does not really prove that there aren't corner timing cases where a latch wait and reset could happen, but it does put a big dent in my theory. Question is, what other theory has anybody got?

        regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
 		int			elapsed_secs;
 		int			cur_timeout;
 
+		reset_latch_ok = true;
+
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
+		reset_latch_ok = false;
+
 		/*
 		 * Process any requests or signals received recently.
 		 */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };
 
+bool		reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
 	/* Only the owner should reset the latch */
 	Assert(latch->owner_pid == MyProcPid);
 
+	Assert(reset_latch_ok);
+
 	latch->is_set = false;
 
 	/*
Robert Haas <robertmhaas@gmail.com> wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter. I searched the buildfarm database for failure logs including the string "server does not shut down" within the last three years, and got all of the hits attached. Not all of these look like the failure pattern Michael pointed to, but enough of them do to say that the problem has existed since at least mid-2017. To be concrete, we have quite a sample of cases where a standby server has received a "fast shutdown" signal and acknowledged that in its log, but it never gets to the expected "shutting down" message, meaning it never starts the shutdown checkpoint, let alone finishes it. The oldest case that clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

A significant majority of the recent cases look just like the piculet failure Michael pointed to, that is we fail to shut down the "london" server while it's acting as standby in the recovery/t/009_twophase.pl test. But there are very similar failures in other tests.

I also notice that the population of machines showing the problem seems heavily skewed towards, um, weird cases. For instance, in the set that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue in spinlocks or latches or the like, or (b) a timing problem that just doesn't show up on generic Intel-oid platforms. The timing theory is maybe a bit stronger given that one test case shows this more often than others. I've not got any clear ideas beyond that.

Anyway, this is *not* new in v12.
regards, tom lane sysname | snapshot | stage | l ---------------+---------------------+---------------------+------------------------------------- jacana | 2016-07-23 06:15:32 | pg_upgradeCheck | pg_ctl: server does not shut down\r pademelon | 2016-08-14 03:49:36 | ECPG-Check | pg_ctl: server does not shut down mereswine | 2017-02-13 14:24:37 | Check | pg_ctl: server does not shut down arapaima | 2017-03-04 20:06:10 | StopDb-C:4 | pg_ctl: server does not shut down nightjar | 2017-06-02 18:54:29 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-02 19:54:11 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-03 15:54:05 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-03 17:54:18 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-03 21:54:09 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-04 00:54:09 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-04 16:34:32 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-04 17:54:16 | SubscriptionCheck | pg_ctl: server does not shut down hornet | 2017-06-05 16:22:09 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-05 16:54:09 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-05 20:26:24 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-06 03:30:02 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-06 15:54:18 | SubscriptionCheck | pg_ctl: server does not shut down hornet | 2017-06-06 17:10:02 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-06 18:54:27 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 00:54:07 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 02:54:06 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 15:12:15 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 17:54:07 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 18:54:06 | SubscriptionCheck | pg_ctl: server does not shut down sungazer | 2017-06-07 19:46:53 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-07 21:03:43 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-08 01:54:07 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-08 15:54:10 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-08 16:57:03 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-08 17:54:09 | SubscriptionCheck | pg_ctl: server does not shut down nightjar | 2017-06-08 19:54:06 | SubscriptionCheck | pg_ctl: server does not shut down nudibranch | 2017-07-23 08:31:12 | pg_upgradeCheck | pg_ctl: server does not shut down nudibranch | 2017-08-06 08:31:25 | pg_upgradeCheck | pg_ctl: server does not shut down spurfowl | 2017-08-21 18:43:08 | BinInstallCheck | pg_ctl: server does not shut down arapaima | 2017-10-10 22:12:01 | pg_upgradeCheck | pg_ctl: server does not shut down calliphoridae | 2017-10-11 20:32:24 | recoveryCheck | pg_ctl: server does not shut down arapaima | 2017-10-12 16:36:18 | pg_upgradeCheck | pg_ctl: server does not shut down arapaima | 2017-10-12 16:36:18 | pg_upgradeCheck | pg_ctl: server does not shut down arapaima | 2017-10-13 10:26:13 | test-decoding-check | pg_ctl: server does not shut down arapaima | 2017-10-14 07:40:05 | pg_upgradeCheck 
| pg_ctl: server does not shut down arapaima | 2017-10-14 23:22:28 | Check | pg_ctl: server does not shut down arapaima | 2017-10-15 10:32:28 | test-decoding-check | pg_ctl: server does not shut down arapaima | 2017-10-16 08:40:39 | Check | pg_ctl: server does not shut down arapaima | 2017-10-17 20:32:25 | pg_upgradeCheck | pg_ctl: server does not shut down arapaima | 2017-10-18 13:11:05 | pg_upgradeCheck | pg_ctl: server does not shut down arapaima | 2017-10-18 23:47:09 | Check | pg_ctl: server does not shut down arapaima | 2017-10-19 09:41:15 | Check | pg_ctl: server does not shut down spurfowl | 2017-11-15 13:29:06 | BinInstallCheck | pg_ctl: server does not shut down baiji | 2017-11-27 23:00:01 | Check | pg_ctl: server does not shut down\r skink | 2017-11-30 08:47:34 | recoveryCheck | pg_ctl: server does not shut down skink | 2017-12-02 14:30:01 | recoveryCheck | pg_ctl: server does not shut down gharial | 2017-12-15 09:12:45 | Check | pg_ctl: server does not shut down nudibranch | 2017-12-17 09:33:02 | pg_upgradeCheck | pg_ctl: server does not shut down nudibranch | 2018-01-20 10:08:33 | Check | pg_ctl: server does not shut down nudibranch | 2018-01-27 09:44:39 | Check | pg_ctl: server does not shut down nudibranch | 2018-02-01 09:49:26 | test-decoding-check | pg_ctl: server does not shut down mereswine | 2018-03-01 11:05:39 | Check | pg_ctl: server does not shut down topminnow | 2018-03-10 12:45:57 | pg_rewindCheck | pg_ctl: server does not shut down mereswine | 2018-03-14 08:10:08 | Check | pg_ctl: server does not shut down gull | 2018-03-30 02:10:15 | Check | pg_ctl: server does not shut down aholehole | 2018-04-08 11:37:13 | Check | pg_ctl: server does not shut down aholehole | 2018-04-08 21:41:39 | Check | pg_ctl: server does not shut down aholehole | 2018-04-10 23:37:22 | Check | pg_ctl: server does not shut down aholehole | 2018-04-13 16:06:17 | Check | pg_ctl: server does not shut down aholehole | 2018-04-13 23:59:24 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-04-14 20:09:25 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-17 11:59:57 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-19 07:31:31 | Check | pg_ctl: server does not shut down aholehole | 2018-04-20 08:02:00 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-21 04:21:33 | Check | pg_ctl: server does not shut down aholehole | 2018-04-21 23:06:17 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-22 17:03:22 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-22 17:03:22 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-23 13:08:51 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-24 07:52:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-04-25 01:32:16 | Check | pg_ctl: server does not shut down gull | 2018-04-25 02:10:12 | Check | pg_ctl: server does not shut down aholehole | 2018-04-26 17:42:38 | Check | pg_ctl: server does not shut down aholehole | 2018-04-28 05:58:59 | Check | pg_ctl: server does not shut down aholehole | 2018-04-30 13:11:22 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-03 02:36:32 | StopDb-C:4 | pg_ctl: server does not shut down aholehole | 2018-05-04 11:07:37 | Check | pg_ctl: server does not shut down aholehole | 2018-05-04 18:02:59 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-06 02:52:23 | pg_upgradeCheck | pg_ctl: server 
does not shut down aholehole | 2018-05-07 19:16:33 | Check | pg_ctl: server does not shut down aholehole | 2018-05-08 13:03:52 | StopDb-C:4 | pg_ctl: server does not shut down aholehole | 2018-05-09 04:36:06 | Check | pg_ctl: server does not shut down aholehole | 2018-05-09 10:41:59 | Check | pg_ctl: server does not shut down aholehole | 2018-05-09 16:41:32 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-10 14:06:04 | Check | pg_ctl: server does not shut down aholehole | 2018-05-11 09:00:33 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-12 04:07:50 | Check | pg_ctl: server does not shut down aholehole | 2018-05-15 22:11:06 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-16 10:14:21 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-16 23:01:57 | Check | pg_ctl: server does not shut down aholehole | 2018-05-17 07:06:03 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-17 14:22:36 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-18 00:15:19 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-18 11:23:03 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-18 22:36:45 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-19 11:54:21 | Check | pg_ctl: server does not shut down aholehole | 2018-05-19 19:06:08 | StopDb-C:4 | pg_ctl: server does not shut down aholehole | 2018-05-20 09:32:53 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-20 16:47:17 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-21 03:36:24 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-21 14:38:21 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-22 03:02:05 | Check | pg_ctl: server does not shut down aholehole | 2018-05-22 11:06:05 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-23 12:13:37 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-23 21:05:13 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-24 08:29:25 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-05-25 14:08:16 | Check | pg_ctl: server does not shut down aholehole | 2018-05-25 19:11:35 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-27 16:56:10 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-05-29 16:09:14 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-05-30 15:07:44 | Check | pg_ctl: server does not shut down aholehole | 2018-05-31 13:10:53 | Check | pg_ctl: server does not shut down aholehole | 2018-05-31 21:12:12 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-05-31 21:12:12 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-04 17:07:20 | Check | pg_ctl: server does not shut down aholehole | 2018-06-05 00:12:13 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-06 17:06:03 | Check | pg_ctl: server does not shut down aholehole | 2018-06-06 21:49:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-06 21:49:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-07 06:53:06 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-06-07 19:36:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-08 15:47:15 | 
pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-09 07:11:38 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-06-09 21:54:40 | Check | pg_ctl: server does not shut down aholehole | 2018-06-10 08:12:11 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-12 00:12:15 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-13 23:29:02 | StopDb-C:4 | pg_ctl: server does not shut down aholehole | 2018-06-15 04:38:36 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-15 04:38:36 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-15 13:54:08 | Check | pg_ctl: server does not shut down aholehole | 2018-06-15 20:19:02 | Check | pg_ctl: server does not shut down aholehole | 2018-06-16 08:09:36 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-17 04:38:19 | StopDb-C:1 | pg_ctl: server does not shut down aholehole | 2018-06-17 19:47:19 | Check | pg_ctl: server does not shut down aholehole | 2018-06-18 04:06:04 | StopDb-C:4 | pg_ctl: server does not shut down aholehole | 2018-06-18 16:23:17 | Check | pg_ctl: server does not shut down aholehole | 2018-06-19 18:45:19 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-20 09:15:42 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-20 09:15:42 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-21 04:41:45 | Check | pg_ctl: server does not shut down aholehole | 2018-06-21 13:06:30 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-21 23:50:37 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-22 13:11:55 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-23 02:09:06 | Check | pg_ctl: server does not shut down aholehole | 2018-06-23 13:12:49 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-25 01:06:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-25 01:06:02 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-25 08:25:08 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-25 16:10:48 | test-decoding-check | pg_ctl: server does not shut down aholehole | 2018-06-26 03:34:45 | pg_upgradeCheck | pg_ctl: server does not shut down aholehole | 2018-06-26 12:47:21 | StopDb-C:1 | pg_ctl: server does not shut down gull | 2018-08-24 03:27:16 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-08-24 08:10:05 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-08-29 09:26:52 | pg_rewindCheck | pg_ctl: server does not shut down gull | 2018-08-30 02:10:05 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-09-06 02:10:16 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-09-14 02:10:05 | recoveryCheck | pg_ctl: server does not shut down sungazer | 2018-09-18 19:44:53 | subscriptionCheck | pg_ctl: server does not shut down gull | 2018-09-26 02:10:06 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-10-01 08:10:04 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-03 02:10:05 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-07 03:45:39 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-07 05:07:06 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-10-07 08:10:05 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-13 
02:10:06 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-13 05:21:13 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-10-29 03:25:28 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-11-15 11:55:56 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-11-16 09:10:13 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-11-30 10:34:17 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2018-11-30 11:44:54 | recoveryCheck | pg_ctl: server does not shut down gull | 2018-12-08 04:29:34 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2019-01-30 05:18:59 | recoveryCheck | pg_ctl: server does not shut down francolin | 2019-02-07 07:55:41 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2019-02-20 03:10:06 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2019-02-22 03:10:05 | recoveryCheck | pg_ctl: server does not shut down spurfowl | 2019-02-25 11:02:04 | pg_rewindCheck | pg_ctl: server does not shut down sidewinder | 2019-02-25 16:45:00 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2019-03-06 03:10:05 | recoveryCheck | pg_ctl: server does not shut down piculet | 2019-03-23 22:28:59 | recoveryCheck | pg_ctl: server does not shut down gull | 2019-04-11 06:25:20 | recoveryCheck | pg_ctl: server does not shut down mereswine | 2019-04-12 06:31:30 | recoveryCheck | pg_ctl: server does not shut down dragonet | 2019-04-16 06:14:01 | recoveryCheck | pg_ctl: server does not shut down (188 rows)
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Question is, what other theory has anybody got?

I wondered if there might be a way for PostmasterStateMachine() to be reached without signals blocked, in the case where we fork a fresh checkpointer, and then it misses the SIGUSR2 that we immediately send because it hasn't installed its handler yet. But I can't see it.

This is a curious thing from dragonet's log:

2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time

LogCheckpointStart() is the thing that writes "starting: ...", and it prefers to report "shutdown" over "time", but it didn't.

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> This is a curious thing from dragonet's log:
> 2019-04-16 08:23:24.178 CEST [8335] LOG:  received fast shutdown request
> 2019-04-16 08:23:24.178 CEST [8335] LOG:  aborting any active transactions
> 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> process due to administrator command
> 2019-04-16 08:28:23.166 CEST [8337] LOG:  restartpoint starting: time
> LogCheckpointStart() is the thing that writes "starting: ...", and it
> prefers to report "shutdown" over "time", but it didn't.

Yeah, but since we don't see "shutting down", we know that the shutdown checkpoint hasn't begun.

Here's another similar case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54

The relevant fragment of the standby server's log is

2018-11-30 05:09:22.996 PST [4229] LOG:  received fast shutdown request
2018-11-30 05:09:23.628 PST [4229] LOG:  aborting any active transactions
2018-11-30 05:09:23.649 PST [4231] LOG:  checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB, estimate=16563 kB
2018-11-30 05:09:23.679 PST [4229] LOG:  background worker "logical replication launcher" (PID 4276) exited with exit code 1
2018-11-30 05:11:23.757 PST [4288] master LOG:  unexpected EOF on standby connection
2018-11-30 05:11:23.883 PST [4229] LOG:  received immediate shutdown request
2018-11-30 05:11:23.907 PST [4229] LOG:  database system is shut down

To the extent that I've found logs in which the checkpointer prints anything at all during this interval, it seems to be just quietly plodding along with its usual business, without any hint that it's aware of the pending shutdown request. It'd be very easy to believe that the postmaster -> checkpointer SIGUSR2 is simply getting dropped, or never issued.

Hmm ... actually, looking at the postmaster's logic, it won't issue SIGUSR2 to the checkpointer until the walreceiver (if any) is gone. And now that I think about it, several of these logs contain traces showing that the walreceiver is still live. Like the one quoted above: seems like the line from PID 4288 has to be from a walreceiver. Maybe what we should be looking for is "why doesn't the walreceiver shut down"? But the dragonet log you quote above shows the walreceiver exiting, or at least starting to exit. Tis a puzzlement.

I'm a bit tempted to add temporary debug logging to the postmaster so that walreceiver start/stop is recorded at LOG level. We'd have to wait a few weeks to have any clear result from the buildfarm, but I'm not sure how we'll get any hard data without some such measures.

        regards, tom lane
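For illustration, the kind of temporary instrumentation being suggested might look roughly like the following. This is a hypothetical sketch, not an excerpt of postmaster.c: the exact call sites, surrounding control flow, and variable names in the real file may differ.

    /*
     * Sketch of temporary debug logging (placement illustrative): record
     * walreceiver start and exit at LOG level so that buildfarm logs show
     * whether the postmaster ever saw the walreceiver die.
     */

    /* ... where the postmaster launches the walreceiver ... */
    WalReceiverPID = StartChildProcess(WalReceiverProcess);
    ereport(LOG,
            (errmsg("started walreceiver, pid %d", (int) WalReceiverPID)));

    /* ... in the postmaster's SIGCHLD reaper, when the walreceiver exits ... */
    if (pid == WalReceiverPID)
    {
        WalReceiverPID = 0;
        ereport(LOG,
                (errmsg("walreceiver (pid %d) exited with status %d",
                        (int) pid, exitstatus)));
        /* existing state-machine handling would follow here */
    }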
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter.  I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached.  Not all of these look like
> the failure pattern Michael pointed to, but enough of them do to say
> that the problem has existed since at least mid-2017.  To be concrete,
> we have quite a sample of cases where a standby server has received a
> "fast shutdown" signal and acknowledged that in its log, but it never
> gets to the expected "shutting down" message, meaning it never starts
> the shutdown checkpoint let alone finishes it.  The oldest case that
> clearly looks like that is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

Interesting. I was sort of thinking about c6c3334 first, but this failed based on 9fcf670, which does not include the former.

> This leads me to suspect that the problem is (a) some very low-level issue
> in spinlocks or latches or the like, or (b) a timing problem that just
> doesn't show up on generic Intel-oid platforms.  The timing theory is
> maybe a bit stronger given that one test case shows this more often than
> others.  I've not got any clear ideas beyond that.
>
> Anyway, this is *not* new in v12.

Indeed. It seems to me that v12 makes the problem easier to appear though, and I got to wonder if c6c9474 is helping in that, as more cases are popping up since mid-March.
--
Michael
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> > Anyway, this is *not* new in v12.
>
> Indeed.  It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March.

Interesting, but I'm not sure how that could be though. Perhaps, a bit like the other thing that cropped up in the build farm after that commit, removing ~200ms of needless sleeping around an earlier online CHECKPOINT made some other pre-existing race condition more likely to go wrong.

--
Thomas Munro
https://enterprisedb.com
Michael Paquier <michael@paquier.xyz> writes:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.

> Indeed.  It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March.

Yeah. Whether that's due to a server code change, or new or modified test cases, is unknown. But looking at my summary of buildfarm runs that failed like this, there's a really clear breakpoint at

gull | 2018-08-24 03:27:16 | recoveryCheck | pg_ctl: server does not shut down

Since then, most of the failures with this message have been in the recoveryCheck step. Before that, the failures were all over the place, and now that I look closely a big fraction of them were in bursts on particular animals, suggesting it was more about some local problem on that animal than any real code issue.

So it might be worth groveling around in the commit logs from last August...

        regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
> Interesting, but I'm not sure how that could be though.  Perhaps, a
> bit like the other thing that cropped up in the build farm after that
> commit, removing ~200ms of needless sleeping around an earlier online
> CHECKPOINT made some other pre-existing race condition more likely to
> go wrong.

The data that we've got is entirely consistent with the idea that there's a timing-sensitive bug that gets made more or less likely to trigger by "unrelated" changes in test cases or server code.

        regards, tom lane
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > 2019-04-16 08:23:24.178 CEST [8393] FATAL:  terminating walreceiver
> > process due to administrator command

> Maybe what we should be looking for is "why doesn't the walreceiver
> shut down"?  But the dragonet log you quote above shows the walreceiver
> exiting, or at least starting to exit.  Tis a puzzlement.

One thing I noticed about this message: if you receive SIGTERM at a rare time when WalRcvImmediateInterruptOK is true, then that ereport() runs directly in the signal handler context. That's not strictly allowed, and could cause nasal demons. On the other hand, it probably wouldn't have managed to get the FATAL message out if that was the problem here (previously we've seen reports of signal handlers deadlocking while trying to ereport(), but they couldn't get their message out at all, because malloc or some such was already locked in the user context).

Is there some way that the exit code could hang *after* that due to corruption of libc resources (FILE streams, malloc, ...)? It doesn't seem likely to me (we'd hopefully see some more clues) but I thought I'd mention the idea.

--
Thomas Munro
https://enterprisedb.com
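To spell out the hazard and the conventional fix, here is a simplified sketch, essentially the pattern the patch posted later in this thread adopts; it is not an exact excerpt of walreceiver.c, and the latch used in the real handler may differ.

    /*
     * ereport(FATAL) can call palloc/malloc and other code that is not
     * async-signal-safe; if SIGTERM arrives while the user context is
     * already inside malloc, running ereport() from the handler can
     * self-deadlock.  The safer pattern is for the handler to only record
     * the request and set the process latch, letting the mainline code
     * raise the error at the next interrupt check.
     */
    static volatile sig_atomic_t got_SIGTERM = false;

    static void
    WalRcvShutdownHandler(SIGNAL_ARGS)
    {
        int         save_errno = errno;

        got_SIGTERM = true;

        /* Wake up any WaitLatch() so the flag is noticed promptly. */
        SetLatch(MyLatch);

        errno = save_errno;
    }

    /* Called from the main code path whenever the latch is found set. */
    static void
    ProcessWalRcvInterrupts(void)
    {
        if (got_SIGTERM)
            ereport(FATAL,
                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
                     errmsg("terminating walreceiver process due to administrator command")));
    }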
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Apr 19, 2019 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe what we should be looking for is "why doesn't the walreceiver
>> shut down"?  But the dragonet log you quote above shows the walreceiver
>> exiting, or at least starting to exit.  Tis a puzzlement.

> ... Is there some way that the exit code could hang
> *after* that due to corruption of libc resources (FILE streams,
> malloc, ...)?  It doesn't seem likely to me (we'd hopefully see some
> more clues) but I thought I'd mention the idea.

I agree it's not likely ... but that's part of the reason I was thinking about adding some postmaster logging. Whatever we're chasing here is "not likely", per the observed buildfarm failure rate.

        regards, tom lane
>>> Maybe what we should be looking for is "why doesn't the walreceiver
>>> shut down"?  But the dragonet log you quote above shows the walreceiver
>>> exiting, or at least starting to exit.  Tis a puzzlement.

huh ... take a look at this little stanza in PostmasterStateMachine:

    if (pmState == PM_SHUTDOWN_2)
    {
        /*
         * PM_SHUTDOWN_2 state ends when there's no other children than
         * dead_end children left. There shouldn't be any regular backends
         * left by now anyway; what we're really waiting for is walsenders and
         * archiver.
         *
         * Walreceiver should normally be dead by now, but not when a fast
         * shutdown is performed during recovery.
         */
        if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
            WalReceiverPID == 0)
        {
            pmState = PM_WAIT_DEAD_END;
        }
    }

I'm too tired to think through exactly what that last comment might be suggesting, but it sure seems like it might be relevant to our problem. If the walreceiver *isn't* dead yet, what's going to ensure that we can move forward later?

        regards, tom lane
I have spent a fair amount of time trying to replicate these failures locally, with little success. I now think that the most promising theory is Munro's idea in [1] that the walreceiver is hanging up during its unsafe attempt to do ereport(FATAL) from inside a signal handler. It's extremely plausible that that could result in a deadlock inside libc's malloc/free, or some similar place. Moreover, if that's what's causing it, then the windows for trouble are fixed by the length of time that malloc might hold internal locks, which fits with the results I've gotten that inserting delays in various promising-looking places doesn't do a thing towards making this reproducible. Even if that isn't the proximate cause of the current reports, it's clearly trouble waiting to happen, and we should get rid of it. Accordingly, see attached proposed patch. This just flushes the "immediate interrupt" stuff in favor of making sure that libpqwalreceiver.c will take care of any signals received while waiting for input. The existing code does not use PQsetnonblocking, which means that it's theoretically at risk of blocking while pushing out data to the remote server. In practice I think that risk is negligible because (IIUC) we don't send very large amounts of data at one time. So I didn't bother to change that. Note that for the most part, if that happened, the existing code was at risk of slow response to SIGTERM anyway since it didn't have Enable/DisableWalRcvImmediateExit around the places that send data. My thought is to apply this only to HEAD for now; it's kind of a large change to shove into the back branches to handle a failure mode that's not been reported from the field. Maybe we could back-patch after we have more confidence in it. regards, tom lane [1] https://www.postgresql.org/message-id/CA%2BhUKG%2B%3D1G98m61VjNS-qGboJPwdZcF%2BrAPu2eC4XuWRTR3UPw%40mail.gmail.com diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 7123d41..765d58d 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -99,6 +99,7 @@ static WalReceiverFunctionsType PQWalReceiverFunctions = { /* Prototypes for private functions */ static PGresult *libpqrcv_PQexec(PGconn *streamConn, const char *query); +static PGresult *libpqrcv_PQgetResult(PGconn *streamConn); static char *stringlist_to_identifierstr(PGconn *conn, List *strings); /* @@ -196,7 +197,7 @@ libpqrcv_connect(const char *conninfo, bool logical, const char *appname, if (rc & WL_LATCH_SET) { ResetLatch(MyLatch); - CHECK_FOR_INTERRUPTS(); + ProcessWalRcvInterrupts(); } /* If socket is ready, advance the libpq state machine */ @@ -456,6 +457,10 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli) { PGresult *res; + /* + * Send copy-end message. As in libpqrcv_PQexec, this could theoretically + * block, but the risk seems small. + */ if (PQputCopyEnd(conn->streamConn, NULL) <= 0 || PQflush(conn->streamConn)) ereport(ERROR, @@ -472,7 +477,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli) * If we had not yet received CopyDone from the backend, PGRES_COPY_OUT is * also possible in case we aborted the copy in mid-stream. 
*/ - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); if (PQresultStatus(res) == PGRES_TUPLES_OK) { /* @@ -486,7 +491,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli) PQclear(res); /* the result set should be followed by CommandComplete */ - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); } else if (PQresultStatus(res) == PGRES_COPY_OUT) { @@ -499,7 +504,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli) pchomp(PQerrorMessage(conn->streamConn))))); /* CommandComplete should follow */ - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); } if (PQresultStatus(res) != PGRES_COMMAND_OK) @@ -509,7 +514,7 @@ libpqrcv_endstreaming(WalReceiverConn *conn, TimeLineID *next_tli) PQclear(res); /* Verify that there are no more results */ - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); if (res != NULL) ereport(ERROR, (errmsg("unexpected result after CommandComplete: %s", @@ -572,12 +577,11 @@ libpqrcv_readtimelinehistoryfile(WalReceiverConn *conn, * The function is modeled on PQexec() in libpq, but only implements * those parts that are in use in the walreceiver api. * - * Queries are always executed on the connection in streamConn. + * May return NULL, rather than an error result, on failure. */ static PGresult * libpqrcv_PQexec(PGconn *streamConn, const char *query) { - PGresult *result = NULL; PGresult *lastResult = NULL; /* @@ -588,60 +592,26 @@ libpqrcv_PQexec(PGconn *streamConn, const char *query) */ /* - * Submit a query. Since we don't use non-blocking mode, this also can - * block. But its risk is relatively small, so we ignore that for now. + * Submit the query. Since we don't use non-blocking mode, this could + * theoretically block. In practice, since we don't send very long query + * strings, the risk seems negligible. */ if (!PQsendQuery(streamConn, query)) return NULL; for (;;) { - /* - * Receive data until PQgetResult is ready to get the result without - * blocking. - */ - while (PQisBusy(streamConn)) - { - int rc; - - /* - * We don't need to break down the sleep into smaller increments, - * since we'll get interrupted by signals and can either handle - * interrupts here or elog(FATAL) within SIGTERM signal handler if - * the signal arrives in the middle of establishment of - * replication connection. - */ - rc = WaitLatchOrSocket(MyLatch, - WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE | - WL_LATCH_SET, - PQsocket(streamConn), - 0, - WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE); - - /* Interrupted? */ - if (rc & WL_LATCH_SET) - { - ResetLatch(MyLatch); - CHECK_FOR_INTERRUPTS(); - } + /* Wait for, and collect, the next PGresult. */ + PGresult *result; - /* Consume whatever data is available from the socket */ - if (PQconsumeInput(streamConn) == 0) - { - /* trouble; drop whatever we had and return NULL */ - PQclear(lastResult); - return NULL; - } - } + result = libpqrcv_PQgetResult(streamConn); + if (result == NULL) + break; /* query is complete, or failure */ /* * Emulate PQexec()'s behavior of returning the last result when there * are many. We are fine with returning just last error message. */ - result = PQgetResult(streamConn); - if (result == NULL) - break; /* query is complete */ - PQclear(lastResult); lastResult = result; @@ -656,6 +626,51 @@ libpqrcv_PQexec(PGconn *streamConn, const char *query) } /* + * Perform the equivalent of PQgetResult(), but watch for interrupts. 
+ */ +static PGresult * +libpqrcv_PQgetResult(PGconn *streamConn) +{ + /* + * Collect data until PQgetResult is ready to get the result without + * blocking. + */ + while (PQisBusy(streamConn)) + { + int rc; + + /* + * We don't need to break down the sleep into smaller increments, + * since we'll get interrupted by signals and can handle any + * interrupts here. + */ + rc = WaitLatchOrSocket(MyLatch, + WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE | + WL_LATCH_SET, + PQsocket(streamConn), + 0, + WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE); + + /* Interrupted? */ + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + ProcessWalRcvInterrupts(); + } + + /* Consume whatever data is available from the socket */ + if (PQconsumeInput(streamConn) == 0) + { + /* trouble; return NULL */ + return NULL; + } + } + + /* Now we can collect and return the next PGresult */ + return PQgetResult(streamConn); +} + +/* * Disconnect connection to primary, if any. */ static void @@ -716,13 +731,13 @@ libpqrcv_receive(WalReceiverConn *conn, char **buffer, { PGresult *res; - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); if (PQresultStatus(res) == PGRES_COMMAND_OK) { PQclear(res); /* Verify that there are no more results. */ - res = PQgetResult(conn->streamConn); + res = libpqrcv_PQgetResult(conn->streamConn); if (res != NULL) { PQclear(res); @@ -886,7 +901,7 @@ libpqrcv_processTuples(PGresult *pgres, WalRcvExecResult *walres, { char *cstrs[MaxTupleAttributeNumber]; - CHECK_FOR_INTERRUPTS(); + ProcessWalRcvInterrupts(); /* Do the allocations in temporary context. */ oldcontext = MemoryContextSwitchTo(rowcontext); diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index f32cf91..7f4bd56 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -112,28 +112,7 @@ static struct static StringInfoData reply_message; static StringInfoData incoming_message; -/* - * About SIGTERM handling: - * - * We can't just exit(1) within SIGTERM signal handler, because the signal - * might arrive in the middle of some critical operation, like while we're - * holding a spinlock. We also can't just set a flag in signal handler and - * check it in the main loop, because we perform some blocking operations - * like libpqrcv_PQexec(), which can take a long time to finish. - * - * We use a combined approach: When WalRcvImmediateInterruptOK is true, it's - * safe for the signal handler to elog(FATAL) immediately. Otherwise it just - * sets got_SIGTERM flag, which is checked in the main loop when convenient. - * - * This is very much like what regular backends do with ImmediateInterruptOK, - * ProcessInterrupts() etc. - */ -static volatile bool WalRcvImmediateInterruptOK = false; - /* Prototypes for private functions */ -static void ProcessWalRcvInterrupts(void); -static void EnableWalRcvImmediateExit(void); -static void DisableWalRcvImmediateExit(void); static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last); static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI); static void WalRcvDie(int code, Datum arg); @@ -151,7 +130,20 @@ static void WalRcvShutdownHandler(SIGNAL_ARGS); static void WalRcvQuickDieHandler(SIGNAL_ARGS); -static void +/* + * Process any interrupts the walreceiver process may have received. + * This should be called any time the process's latch has become set. + * + * Currently, only SIGTERM is of interest. 
We can't just exit(1) within the + * SIGTERM signal handler, because the signal might arrive in the middle of + * some critical operation, like while we're holding a spinlock. Instead, the + * signal handler sets a flag variable as well as setting the process's latch. + * We must check the flag (by calling ProcessWalRcvInterrupts) anytime the + * latch has become set. Operations that could block for a long time, such as + * reading from a remote server, must pay attention to the latch too; see + * libpqrcv_PQexec for example. + */ +void ProcessWalRcvInterrupts(void) { /* @@ -163,26 +155,12 @@ ProcessWalRcvInterrupts(void) if (got_SIGTERM) { - WalRcvImmediateInterruptOK = false; ereport(FATAL, (errcode(ERRCODE_ADMIN_SHUTDOWN), errmsg("terminating walreceiver process due to administrator command"))); } } -static void -EnableWalRcvImmediateExit(void) -{ - WalRcvImmediateInterruptOK = true; - ProcessWalRcvInterrupts(); -} - -static void -DisableWalRcvImmediateExit(void) -{ - WalRcvImmediateInterruptOK = false; - ProcessWalRcvInterrupts(); -} /* Main entry point for walreceiver process */ void @@ -292,12 +270,10 @@ WalReceiverMain(void) PG_SETMASK(&UnBlockSig); /* Establish the connection to the primary for XLOG streaming */ - EnableWalRcvImmediateExit(); wrconn = walrcv_connect(conninfo, false, cluster_name[0] ? cluster_name : "walreceiver", &err); if (!wrconn) ereport(ERROR, (errmsg("could not connect to the primary server: %s", err))); - DisableWalRcvImmediateExit(); /* * Save user-visible connection string. This clobbers the original @@ -336,7 +312,6 @@ WalReceiverMain(void) * Check that we're connected to a valid server using the * IDENTIFY_SYSTEM replication command. */ - EnableWalRcvImmediateExit(); primary_sysid = walrcv_identify_system(wrconn, &primaryTLI); snprintf(standby_sysid, sizeof(standby_sysid), UINT64_FORMAT, @@ -348,7 +323,6 @@ WalReceiverMain(void) errdetail("The primary's identifier is %s, the standby's identifier is %s.", primary_sysid, standby_sysid))); } - DisableWalRcvImmediateExit(); /* * Confirm that the current timeline of the primary is the same or @@ -509,6 +483,8 @@ WalReceiverMain(void) if (rc & WL_LATCH_SET) { ResetLatch(walrcv->latch); + ProcessWalRcvInterrupts(); + if (walrcv->force_reply) { /* @@ -577,9 +553,7 @@ WalReceiverMain(void) * The backend finished streaming. Exit streaming COPY-mode from * our side, too. 
*/ - EnableWalRcvImmediateExit(); walrcv_endstreaming(wrconn, &primaryTLI); - DisableWalRcvImmediateExit(); /* * If the server had switched to a new timeline that we didn't @@ -726,9 +700,7 @@ WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last) (errmsg("fetching timeline history file for timeline %u from primary server", tli))); - EnableWalRcvImmediateExit(); walrcv_readtimelinehistoryfile(wrconn, tli, &fname, &content, &len); - DisableWalRcvImmediateExit(); /* * Check that the filename on the master matches what we @@ -805,7 +777,7 @@ WalRcvSigUsr1Handler(SIGNAL_ARGS) errno = save_errno; } -/* SIGTERM: set flag for main loop, or shutdown immediately if safe */ +/* SIGTERM: set flag for ProcessWalRcvInterrupts */ static void WalRcvShutdownHandler(SIGNAL_ARGS) { @@ -816,10 +788,6 @@ WalRcvShutdownHandler(SIGNAL_ARGS) if (WalRcv->latch) SetLatch(WalRcv->latch); - /* Don't joggle the elbow of proc_exit */ - if (!proc_exit_inprogress && WalRcvImmediateInterruptOK) - ProcessWalRcvInterrupts(); - errno = save_errno; } diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h index 33e89ca..7f2927c 100644 --- a/src/include/replication/walreceiver.h +++ b/src/include/replication/walreceiver.h @@ -302,6 +302,7 @@ walrcv_clear_result(WalRcvExecResult *walres) /* prototypes for functions in walreceiver.c */ extern void WalReceiverMain(void) pg_attribute_noreturn(); +extern void ProcessWalRcvInterrupts(void); /* prototypes for functions in walreceiverfuncs.c */ extern Size WalRcvShmemSize(void);
On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch. This just flushes the
> "immediate interrupt" stuff in favor of making sure that
> libpqwalreceiver.c will take care of any signals received while
> waiting for input.

+1

I see that we removed the code that this was modelled on back in 2015, and in fact your patch even removes a dangling reference in a comment:

- * This is very much like what regular backends do with ImmediateInterruptOK,

> The existing code does not use PQsetnonblocking, which means that it's
> theoretically at risk of blocking while pushing out data to the remote
> server. In practice I think that risk is negligible because (IIUC) we
> don't send very large amounts of data at one time. So I didn't bother to
> change that. Note that for the most part, if that happened, the existing
> code was at risk of slow response to SIGTERM anyway since it didn't have
> Enable/DisableWalRcvImmediateExit around the places that send data.

Right.

> My thought is to apply this only to HEAD for now; it's kind of a large
> change to shove into the back branches to handle a failure mode that's
> not been reported from the field. Maybe we could back-patch after we
> have more confidence in it.

+1

That reminds me, we should probably also clean up at least the ereport-from-signal-handler hazard identified over in this thread:

https://www.postgresql.org/message-id/CAEepm%3D10MtmKeDc1WxBM0PQM9OgtNy%2BRCeWqz40pZRRS3PNo5Q%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 28, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Even if that isn't the proximate cause of the current reports, it's
>> clearly trouble waiting to happen, and we should get rid of it.

> +1

> That reminds me, we should probably also clean up at least the
> ereport-from-signal-handler hazard identified over in this thread:
> https://www.postgresql.org/message-id/CAEepm%3D10MtmKeDc1WxBM0PQM9OgtNy%2BRCeWqz40pZRRS3PNo5Q%40mail.gmail.com

Yeah, probably. I imagine the reason we've not already seen complaints about that is that not that many custom bgworkers exist.

I do not think we can get away with back-patching a change in that area, though, since it'll move the goalposts about what bgworker code has to do to cope with SIGTERMs. It might already be too late for v12, unless we want to treat that as an Open Item.

			regards, tom lane
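For background-worker authors following along, the arrangement being referred to is the usual flag-plus-latch pattern, along the lines of the worker_spi sample module. The sketch below is illustrative only: the names worker_main, worker_sigterm_handler and got_sigterm are invented, and the 10-second timeout is arbitrary.

#include "postgres.h"

#include <signal.h>

#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"

/* Hypothetical worker showing the flag-plus-latch shutdown pattern. */
static volatile sig_atomic_t got_sigterm = false;

/* SIGTERM handler: only async-signal-safe work happens here. */
static void
worker_sigterm_handler(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_sigterm = true;
    SetLatch(MyLatch);          /* wake the main loop */
    errno = save_errno;
    /* no ereport(), no proc_exit(), nothing that might call malloc/free */
}

void
worker_main(Datum main_arg)
{
    pqsignal(SIGTERM, worker_sigterm_handler);
    BackgroundWorkerUnblockSignals();

    for (;;)
    {
        int         rc;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10 * 1000L,      /* wake at least every 10s */
                       PG_WAIT_EXTENSION);

        if (rc & WL_LATCH_SET)
            ResetLatch(MyLatch);

        /* Service the flag from regular code, where ereport/proc_exit are safe. */
        if (got_sigterm)
            proc_exit(0);

        /* ... do one unit of work here ... */
    }
}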
Hi,

On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
> Even if that isn't the proximate cause of the current reports, it's
> clearly trouble waiting to happen, and we should get rid of it.
> Accordingly, see attached proposed patch. This just flushes the
> "immediate interrupt" stuff in favor of making sure that
> libpqwalreceiver.c will take care of any signals received while
> waiting for input.

Good plan.

> The existing code does not use PQsetnonblocking, which means that it's
> theoretically at risk of blocking while pushing out data to the remote
> server. In practice I think that risk is negligible because (IIUC) we
> don't send very large amounts of data at one time. So I didn't bother to
> change that. Note that for the most part, if that happened, the existing
> code was at risk of slow response to SIGTERM anyway since it didn't have
> Enable/DisableWalRcvImmediateExit around the places that send data.

Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll wait until there's an OS TCP timeout, no? It's bad enough that there were cases of this before. Increasing the surface of cases where we might want to shut down the walreceiver (e.g. because we would rather switch to restore_command, or just shut down the server) but instead get stuck waiting for an hour for a TCP timeout doesn't seem OK.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-27 20:56:51 -0400, Tom Lane wrote:
>> The existing code does not use PQsetnonblocking, which means that it's
>> theoretically at risk of blocking while pushing out data to the remote
>> server. In practice I think that risk is negligible because (IIUC) we
>> don't send very large amounts of data at one time. So I didn't bother to
>> change that. Note that for the most part, if that happened, the existing
>> code was at risk of slow response to SIGTERM anyway since it didn't have
>> Enable/DisableWalRcvImmediateExit around the places that send data.

> Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
> wait until there's an OS TCP timeout, no?

No. send() is only going to block if there's no room in the kernel's buffers, and that would only happen if we send a lot of data in between waits to receive data. Which, AFAIK, the walreceiver never does.

We might possibly need to improve that code in the future, but I don't think there's a need for it today.

			regards, tom lane
Hi,

On 2019-04-29 12:55:31 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hm, I'm not convinced that's OK. What if there's a network hiccup? We'll
> > wait until there's an OS TCP timeout, no?
>
> No. send() is only going to block if there's no room in the kernel's
> buffers, and that would only happen if we send a lot of data in between
> waits to receive data. Which, AFAIK, the walreceiver never does.
> We might possibly need to improve that code in the future, but I don't
> think there's a need for it today.

Ah, right.

- Andres
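For completeness, if the send side ever did need the same treatment, libpq's standard recipe is PQsetnonblocking() plus a PQflush() retry loop driven by the same latch-and-socket wait. The following is a hedged sketch only, not part of the committed patch; the helper name flush_nonblocking() is invented and the wait-event label is arbitrary.

#include "postgres.h"

#include "libpq-fe.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walreceiver.h"
#include "storage/latch.h"

/*
 * Sketch only: push any data queued inside libpq without ever blocking in
 * send().  PQflush() returns 0 once everything has been handed to the
 * kernel, 1 if it could not send it all yet, and -1 on error.
 */
static void
flush_nonblocking(PGconn *conn)
{
    /* Put the connection in non-blocking mode; real code would do this once. */
    if (PQsetnonblocking(conn, 1) != 0)
        ereport(ERROR,
                (errmsg("could not set connection to non-blocking mode: %s",
                        PQerrorMessage(conn))));

    for (;;)
    {
        int         r = PQflush(conn);
        int         rc;

        if (r == 0)
            break;              /* all queued data has been sent */
        if (r < 0)
            ereport(ERROR,
                    (errmsg("could not send data to WAL stream: %s",
                            PQerrorMessage(conn))));

        /* r == 1: the kernel's send buffer is full, wait for the socket */
        rc = WaitLatchOrSocket(MyLatch,
                               WL_LATCH_SET | WL_SOCKET_WRITEABLE |
                               WL_EXIT_ON_PM_DEATH,
                               PQsocket(conn),
                               -1L,
                               WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            ProcessWalRcvInterrupts();  /* react promptly to SIGTERM */
        }
    }
}

As noted above, the walreceiver does not queue enough outgoing data for this to matter today, so the sketch is purely illustrative.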
On Sat, Apr 27, 2019 at 5:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success. I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler. It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place. Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.

For Greenplum (based on 9.4, but current master code looks the same) we recently saw this deadlock hit many times in CI for walreceiver, which I believe confirms the above finding.

#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>, p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
#6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1, listp=0x7f063822b5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00000000008c72c7 in proc_exit ()
#9  0x0000000000a75867 in errfinish ()
#10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
#11 0x000000000089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>, bytes=bytes@entry=16384) at malloc.c:3802
#14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x00000000007754c3 in makeEmptyPGconn ()
#16 0x0000000000779686 in PQconnectStart ()
#17 0x0000000000779b8b in PQconnectdb ()
#18 0x00000000008aae52 in libpqrcv_connect ()
#19 0x000000000089f735 in WalReceiverMain ()
#20 0x00000000005c5eab in AuxiliaryProcessMain ()
#21 0x00000000004cd5f1 in ServerLoop ()
#22 0x000000000086fb18 in PostmasterMain ()
#23 0x00000000004d2e28 in main ()

ImmediateInterruptOK was removed from regular backends, but not from the walreceiver, and the walreceiver performing elog(FATAL) inside a signal handler is dangerous.
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> For Greenplum (based on 9.4, but current master code looks the same) we
> recently saw this deadlock hit many times in CI for walreceiver, which
> I believe confirms the above finding.

> #0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>, p=0x26bb3b0, have_lock=0) at malloc.c:3962
> #2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
> #3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
> #4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
> #5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
> #6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1, listp=0x7f063822b5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
> #7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
> #8  0x00000000008c72c7 in proc_exit ()
> #9  0x0000000000a75867 in errfinish ()
> #10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
> #11 0x000000000089eac5 in WalRcvShutdownHandler ()
> #12 <signal handler called>
> #13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>, bytes=bytes@entry=16384) at malloc.c:3802
> #14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
> #15 0x00000000007754c3 in makeEmptyPGconn ()
> #16 0x0000000000779686 in PQconnectStart ()
> #17 0x0000000000779b8b in PQconnectdb ()
> #18 0x00000000008aae52 in libpqrcv_connect ()
> #19 0x000000000089f735 in WalReceiverMain ()
> #20 0x00000000005c5eab in AuxiliaryProcessMain ()
> #21 0x00000000004cd5f1 in ServerLoop ()
> #22 0x000000000086fb18 in PostmasterMain ()
> #23 0x00000000004d2e28 in main ()

Cool --- that stack trace is *exactly* what you'd expect if this were the problem. Thanks for sending it along!

Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46 to see if it fixes the problem for you?

			regards, tom lane
On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> to see if it fixes the problem for you?

Yes, will give it a try on greenplum and report back the result.

Have we decided if this will be applied to back branches?
Ashwin Agrawal <aagrawal@pivotal.io> writes:
> On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
>> to see if it fixes the problem for you?

> Yes, will give it a try on greenplum and report back the result.

> Have we decided if this will be applied to back branches?

My feeling about it is "maybe eventually, but most definitely not the week before a set of minor releases". Some positive experience with Greenplum would help increase confidence in the patch, for sure.

			regards, tom lane
On 2019-Apr-29, Tom Lane wrote:

> Ashwin Agrawal <aagrawal@pivotal.io> writes:
> > On Mon, Apr 29, 2019 at 10:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Can you try applying a1a789eb5ac894b4ca4b7742f2dc2d9602116e46
> >> to see if it fixes the problem for you?
>
> > Yes, will give it a try on greenplum and report back the result.
>
> > Have we decided if this will be applied to back branches?

Hi Ashwin, did you have the chance to try this out?

> My feeling about it is "maybe eventually, but most definitely not
> the week before a set of minor releases". Some positive experience
> with Greenplum would help increase confidence in the patch, for sure.

I looked at the buildfarm failures for the recoveryCheck stage. It looks like there is only one failure for branch master after this commit, which was chipmunk saying:

# poll_query_until timed out executing this query:
# SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;
# expecting this output:
# standby1|1|sync
# standby2|2|sync
# standby3|2|potential
# standby4|2|potential
# last actual query output:
# standby1|1|sync
# standby2|2|potential
# standby3|2|sync
# standby4|2|potential
# with stderr:
not ok 6 - asterisk comes before another standby name
#   Failed test 'asterisk comes before another standby name'
#   at t/007_sync_rep.pl line 26.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11

AFAICS this is wholly unrelated to the problem at hand.

No other animal failed recoveryCheck test; before the commit, the failure was not terribly frequent, but rarely would 10 days go by without it failing. So I suggest that the bug has indeed been fixed. Maybe now's a good time to get it back-patched?

In branch REL_11_STABLE, it failed as recently as 11 days ago in gull,
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2019-06-01%2004%3A11%3A36

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Apr-29, Tom Lane wrote:
>> Ashwin Agrawal <aagrawal@pivotal.io> writes:
>>> Have we decided if this will be applied to back branches?

>> My feeling about it is "maybe eventually, but most definitely not
>> the week before a set of minor releases". Some positive experience
>> with Greenplum would help increase confidence in the patch, for sure.

> I looked at the buildfarm failures for the recoveryCheck stage. It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
> ...
> AFAICS this is wholly unrelated to the problem at hand.

Yeah, that seems unrelated.

> No other animal failed recoveryCheck test; before the commit, the
> failure was not terribly frequent, but rarely would 10 days go by
> without it failing. So I suggest that the bug has indeed been fixed.

I feel pretty good about it too.

> Maybe now's a good time to get it back-patched?

Should we do that now, or wait till after next week's releases?

			regards, tom lane
On 2019-Jun-12, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Maybe now's a good time to get it back-patched?
>
> Should we do that now, or wait till after next week's releases?

IMO this has been hammered enough in master, and we still have a few days in the back-branches for buildfarm, that it's okay to do it now.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Jun-12, Tom Lane wrote:
>> Should we do that now, or wait till after next week's releases?

> IMO this has been hammered enough in master, and we still have a few
> days in the back-branches for buildfarm, that it's okay to do it now.

Poking at that, I find that a1a789eb5 back-patches reasonably painlessly into v11 and v10, but trying to bring it back to 9.6 encounters a pile of merge failures. Also, looking at the git logs shows that we did a hell of a lot of subtle work on that code (libpqwalreceiver.c in particular) during the v10 cycle. So I've got no confidence that successful buildfarm/beta1 testing of the HEAD patch means much of anything for putting it into pre-v10 branches.

Given that we've seen few if any field reports of this issue, my inclination is to back-patch as far as v10, but not take the risk and effort involved in going further.

			regards, tom lane
On Wed, Jun 12, 2019 at 04:26:23PM -0400, Tom Lane wrote:
> Poking at that, I find that a1a789eb5 back-patches reasonably painlessly
> into v11 and v10, but trying to bring it back to 9.6 encounters a pile of
> merge failures. Also, looking at the git logs shows that we did a hell
> of a lot of subtle work on that code (libpqwalreceiver.c in particular)
> during the v10 cycle. So I've got no confidence that successful
> buildfarm/beta1 testing of the HEAD patch means much of anything for
> putting it into pre-v10 branches.
>
> Given that we've seen few if any field reports of this issue, my
> inclination is to back-patch as far as v10, but not take the risk
> and effort involved in going further.

+1 for only a back-patch to v10 per the invasiveness argument. I think that you have made the right move here.
--
Michael
Attachment
On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage. It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.

Yes, that's unrelated. This failure is interesting, still it has happened only once from what I can see. I think that it points to a rare race condition in test 007_sync_rep.pl, because of this sequence which reorders the standbys:

# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;
$node_standby_2->start;
$node_standby_3->start;

The failure actually indicates that even if standby3 has started after standby2, standby3 has registered back into the WAL sender array of the primary before standby2 had the occasion to do it, but we have no guarantee that the ordering is actually right. So this has messed up the expected set of sync standbys when these are not directly listed. I think that this can happen as well when initializing the standbys at the top of the test, but the window is much, much narrower and basically impossible to reach.

Using a combo of safe_psql('checkpoint') + wait_for_catchup() would make the test safer, but that's much more costly than using just poll_query_until() on pg_stat_replication to make sure that each standby is registered in the primary's WAL sender array. In short, something like the attached should improve the stability of the test. Thoughts?
--
Michael