Thread: [COMMITTERS] pgsql: Add test for postmaster crash restarts.
Add test for postmaster crash restarts. Given that I managed to break this... We probably should extend the tests to also cover other sub-processes dying, but that's something for later. Author: Andres Freund Discussion: https://postgr.es/m/20170917080752.rcmihzfmgbeuqjk2@alap3.anarazel.de Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/a1924a4ea29399111e5155532ca24c9c51d3c82d Modified Files -------------- src/test/recovery/t/013_crash_restart.pl | 192 +++++++++++++++++++++++++++++++ 1 file changed, 192 insertions(+) -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Andres Freund <andres@anarazel.de> writes: > Add test for postmaster crash restarts. Hm, calliphoridae doesn't like this. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On September 18, 2017 8:55:35 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Andres Freund <andres@anarazel.de> writes: >> Add test for postmaster crash restarts. > >Hm, calliphoridae doesn't like this. Yea. Not clear to me why yet. The machine ran a number of instances with nearly the same config successfully. Can't imaginethat copyparse makes a difference here. I suspect it's somehow load related... Ran a good number of iterations locally,didn't reproduce, even under high load. Think I'll add bit more error reporting. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
I discovered that prairiedog has been hung up for many hours in the 013_crash_restart.pl. It looks to me like the explanation is that the test has a race condition, because what I find in the postmaster log is 2017-09-19 00:31:48.194 EDT [27839] [unknown] LOG: connection received: host=[local] 2017-09-19 00:31:48.203 EDT [27839] [unknown] LOG: connection authorized: user=buildfarm database=postgres 2017-09-19 00:31:48.218 EDT [27839] t/013_crash_restart.pl LOG: statement: CREATE TABLE alive(status text); 2017-09-19 00:31:48.266 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$committed-before-sigquit$$); 2017-09-19 00:31:48.271 EDT [27839] t/013_crash_restart.pl LOG: statement: SELECT pg_backend_pid(); 2017-09-19 00:31:48.278 EDT [27839] t/013_crash_restart.pl LOG: statement: BEGIN; 2017-09-19 00:31:48.280 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$in-progress-before-sigquit$$)RETURNING status; 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl WARNING: terminating connection because of crash of another serverprocess 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to rollback the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the databaseand repeat your command. 2017-09-19 00:31:48.299 EDT [27827] LOG: server process (PID 27839) exited with exit code 2 2017-09-19 00:31:48.299 EDT [27827] DETAIL: Failed process was running: INSERT INTO alive VALUES($$in-progress-before-sigquit$$)RETURNING status; 2017-09-19 00:31:48.300 EDT [27827] LOG: terminating any other active server processes 2017-09-19 00:31:48.307 EDT [27832] WARNING: terminating connection because of crash of another server process 2017-09-19 00:31:48.307 EDT [27832] DETAIL: The postmaster has commanded this server process to roll back the current transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 2017-09-19 00:31:48.307 EDT [27832] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2017-09-19 00:31:48.317 EDT [27827] LOG: all server processes terminated; reinitializing 2017-09-19 00:31:48.333 EDT [27840] LOG: database system was interrupted; last known up at 2017-09-19 00:31:47 EDT 2017-09-19 00:31:48.338 EDT [27840] LOG: database system was not properly shut down; automatic recovery in progress 2017-09-19 00:31:48.346 EDT [27840] LOG: redo starts at 0/15A89EC 2017-09-19 00:31:48.361 EDT [27840] LOG: invalid record length at 0/15C6D74: wanted 24, got 0 2017-09-19 00:31:48.362 EDT [27840] LOG: redo done at 0/15C6D50 2017-09-19 00:31:48.362 EDT [27840] LOG: last completed transaction was at log time 2017-09-19 00:31:48.270076-04 2017-09-19 00:31:48.474 EDT [27827] LOG: database system is ready to accept connections 2017-09-19 00:31:48.492 EDT [27847] [unknown] LOG: connection received: host=[local] 2017-09-19 00:31:48.499 EDT [27847] [unknown] LOG: connection authorized: user=buildfarm database=postgres 2017-09-19 00:31:48.578 EDT [27847] t/013_crash_restart.pl LOG: statement: SELECT pg_sleep(3600); IOW, the "$monitor" instance of psql did not complete making its connection until after the crash/restart cycle had occurred. So we're just sitting there waiting for a crash report that won't come. Which is another very serious deficiency in this test: lacking any sort of timeout, it will just freeze indefinitely if anything doesn't happen exactly the way it expects. From a buildfarm owner's standpoint, that's pretty damn unfriendly. It means having to manually unwedge your animals from time to time. I'd like to ask you to revert this test, at least pending making it a whole lot more bulletproof. We don't really need crash recovery testing in the buildfarm IMO --- we hackers crash the system plenty often enough to notice problems there. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Hi, On 2017-09-19 12:13:54 -0400, Tom Lane wrote: > IOW, the "$monitor" instance of psql did not complete making its > connection until after the crash/restart cycle had occurred. That'd be easy enough to fix... Just something like $monitor_stdin .= q[ SELECT $$am-i-up$$; ]; $monitor->pump until $monitor_stdout =~ /am-i-up/; $monitor_stdout = ''; > So we're just sitting there waiting for a crash report that won't > come. Which is another very serious deficiency in this test: > lacking any sort of timeout, it will just freeze indefinitely > if anything doesn't happen exactly the way it expects. From a > buildfarm owner's standpoint, that's pretty damn unfriendly. > It means having to manually unwedge your animals from time to time. Note that I just copied the code for that from another test - this is isn't unique to this test. I agree that it'd be good to add a timeout to those pump calls. > I'd like to ask you to revert this test, at least pending making > it a whole lot more bulletproof. Hm. Ok. That seems like an overreaction to me - the failure rate isn't actualy that high so far. I'm happy to add both timeouts and "earlier startup" of the $monitor, but I'd prefer to do so in-tree - I'd run the test through 100+ iterations locally, without any of this showing up. > We don't really need crash recovery testing in the buildfarm IMO --- > we hackers crash the system plenty often enough to notice problems > there. I for one don't exercise that kind of crash restarts, my development scripts all work with restart_after_crash = false. What I find more concerning however is coverage of EXEC_BACKEND, which has far fewer developers actively running it constantly. Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Andres Freund <andres@anarazel.de> writes: > On 2017-09-19 12:13:54 -0400, Tom Lane wrote: >> IOW, the "$monitor" instance of psql did not complete making its >> connection until after the crash/restart cycle had occurred. > That'd be easy enough to fix... > Just something like > $monitor_stdin .= q[ > SELECT $$am-i-up$$; > ]; > $monitor->pump until $monitor_stdout =~ /am-i-up/; > $monitor_stdout = ''; Hm, do we care whether the sleep call has started? Maybe not. >> I'd like to ask you to revert this test, at least pending making >> it a whole lot more bulletproof. > Hm. Ok. That seems like an overreaction to me - the failure rate isn't > actualy that high so far. It's not that it's high, it's that having to ask buildfarm owners to manually unwedge their critters is not very cool. Maybe prairiedog will be the only one that gets stuck, but I kinda doubt it. > I'm happy to add both timeouts and "earlier > startup" of the $monitor, but I'd prefer to do so in-tree - I'd run the > test through 100+ iterations locally, without any of this showing up. Well, please fix it ASAP, if you don't want to take it out pending the fixes. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On September 19, 2017 9:53:28 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Well, please fix it ASAP, if you don't want to take it out pending >the fixes. Will as soon as I finished my morning coffee. Uncaffeinated, which my phone fittingly autocorrects to unvaccinated, commitsaren't a good idea. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On 2017-09-19 09:58:26 -0700, Andres Freund wrote: > > > On September 19, 2017 9:53:28 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >Well, please fix it ASAP, if you don't want to take it out pending > >the fixes. > > Will as soon as I finished my morning coffee. Uncaffeinated, which my phone fittingly autocorrects to unvaccinated, commitsaren't a good idea. Done. Survived ~100 cycles here locally, while running make -j16 -s check-world in two parallel branches. But I have a fast disk, so it's not comparable. If the buildfarm doesn't complain about the use of IPC::Run's timeout functionality, we should probably patch that into the other use of IPC::Run as well, but especially into the other user of the pump() until ... scheme. Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Andres Freund <andres@anarazel.de> writes: > If the buildfarm doesn't complain about the use of IPC::Run's timeout > functionality, we should probably patch that into the other use of > IPC::Run as well, but especially into the other user of the pump() until > ... scheme. jacana hasn't passed this regression test yet, but at least that proves the timeout stuff works ;-) I think jacana's problem is simply that "kill QUIT" is never gonna work on Windows, there being no such animal there. It looks like "kill KILL" doesn't work either, which surprises me slightly more but not much. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
[re-adding commiters which I inadvertently left off] On 09/30/2017 06:10 PM, Andres Freund wrote: > > >> I was just looking at this. Why aren't we using "pg_ctl kill" to >> terminate the backend? That's supposed to be portable. > Because pg_ctl can't do that for any process but postmaster, no? The > test is supposed to find issues with backend death (and has > defficiencies in error reporting already, and would have caught a bug > I'd introduced previously). > No, I don't think so. That's not what the docs say. That's why you give it a pid argument" "pg_ctl kill signal_name process_id" cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote: > > [re-adding commiters which I inadvertently left off] > > > On 09/30/2017 06:10 PM, Andres Freund wrote: > > > > > >> I was just looking at this. Why aren't we using "pg_ctl kill" to > >> terminate the backend? That's supposed to be portable. > > Because pg_ctl can't do that for any process but postmaster, no? The > > test is supposed to find issues with backend death (and has > > defficiencies in error reporting already, and would have caught a bug > > I'd introduced previously). > No, I don't think so. That's not what the docs say. That's why you give > it a pid argument" "pg_ctl kill signal_name process_id" Oh, cool. Didn't know that one. So the answer is: "Because Andres doesn't know squat.". But even after fixing that, there unfortunately is: static void set_sig(char *signame) { … #if 0/* probably should NOT provide SIGKILL */else if (strcmp(signame, "KILL") == 0) sig = SIGKILL; #endif I'm unclear on what that provision is achieving? If you can kill with pg_ctl you can do other nasty stuff too (like just use kill instead of pg_ctl)? Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On 2017-09-30 15:27:12 -0700, Andres Freund wrote: > On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote: > > > > [re-adding commiters which I inadvertently left off] > > > > > > On 09/30/2017 06:10 PM, Andres Freund wrote: > > > > > > > > >> I was just looking at this. Why aren't we using "pg_ctl kill" to > > >> terminate the backend? That's supposed to be portable. > > > Because pg_ctl can't do that for any process but postmaster, no? The > > > test is supposed to find issues with backend death (and has > > > defficiencies in error reporting already, and would have caught a bug > > > I'd introduced previously). > > > No, I don't think so. That's not what the docs say. That's why you give > > it a pid argument" "pg_ctl kill signal_name process_id" > > Oh, cool. Didn't know that one. So the answer is: > "Because Andres doesn't know squat.". > > But even after fixing that, there unfortunately is: > > static void > set_sig(char *signame) > { > … > #if 0 > /* probably should NOT provide SIGKILL */ > else if (strcmp(signame, "KILL") == 0) > sig = SIGKILL; > #endif > > I'm unclear on what that provision is achieving? If you can kill with > pg_ctl you can do other nasty stuff too (like just use kill instead of > pg_ctl)? Could you perhaps test whether windows likes things after the following patch? I don't think the kill_kill guarantees are really needed here, so we might even be able to allow this on msvc. Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Attachment
On 09/30/2017 06:44 PM, Andres Freund wrote: > On 2017-09-30 15:27:12 -0700, Andres Freund wrote: >> On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote: >>> [re-adding commiters which I inadvertently left off] >>> >>> >>> On 09/30/2017 06:10 PM, Andres Freund wrote: >>>> >>>>> I was just looking at this. Why aren't we using "pg_ctl kill" to >>>>> terminate the backend? That's supposed to be portable. >>>> Because pg_ctl can't do that for any process but postmaster, no? The >>>> test is supposed to find issues with backend death (and has >>>> defficiencies in error reporting already, and would have caught a bug >>>> I'd introduced previously). >>> No, I don't think so. That's not what the docs say. That's why you give >>> it a pid argument" "pg_ctl kill signal_name process_id" >> Oh, cool. Didn't know that one. So the answer is: >> "Because Andres doesn't know squat.". >> >> But even after fixing that, there unfortunately is: >> >> static void >> set_sig(char *signame) >> { >> … >> #if 0 >> /* probably should NOT provide SIGKILL */ >> else if (strcmp(signame, "KILL") == 0) >> sig = SIGKILL; >> #endif >> >> I'm unclear on what that provision is achieving? If you can kill with >> pg_ctl you can do other nasty stuff too (like just use kill instead of >> pg_ctl)? I put it in when we rewrote pg_ctl in C many years ago, possibly out of a superabundance of caution. I agree it's worth revisiting. I think the idea was that there's a difference between an ordinary footgun and an officially sanctioned footgun :-) > Could you perhaps test whether windows likes things after the following > patch? I don't think the kill_kill guarantees are really needed here, > so we might even be able to allow this on msvc. > Haven't tested on MSVC but with this patch it passes on jacana (mingw). cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Hi, On 2017-09-30 22:28:39 -0400, Andrew Dunstan wrote: > >> But even after fixing that, there unfortunately is: > >> > >> static void > >> set_sig(char *signame) > >> { > >> … > >> #if 0 > >> /* probably should NOT provide SIGKILL */ > >> else if (strcmp(signame, "KILL") == 0) > >> sig = SIGKILL; > >> #endif > >> > >> I'm unclear on what that provision is achieving? If you can kill with > >> pg_ctl you can do other nasty stuff too (like just use kill instead of > >> pg_ctl)? > > > I put it in when we rewrote pg_ctl in C many years ago, possibly out of > a superabundance of caution. I agree it's worth revisiting. I think the > idea was that there's a difference between an ordinary footgun and an > officially sanctioned footgun :-) Heh. I'm inclined to take it out. We could add a --use-the-force-luke type parameter, but it doesn't seem worth it. > Haven't tested on MSVC but with this patch it passes on jacana (mingw). Yay! Thanks for testing. Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
On 09/30/2017 10:32 PM, Andres Freund wrote: >> Haven't tested on MSVC but with this patch it passes on jacana (mingw). > Yay! Thanks for testing. > I have now tested on bowerbird (MSVC) and it passes. This suggests that we can run tests there in cases where we can use IPC::Run's finish() instead of kill_kill(). Perhaps someone would like to look at the the two other cases where we do that (recovery/t/011_crash_recovery.pl and recovery/t/006_logical_decoding.pl) and see if they are amenable to this treatment. It woould be nice to be able to run all the tests on all platforms. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers