Thread: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

[COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
Add test for postmaster crash restarts.

Given that I managed to break this...  We probably should extend the
tests to also cover other sub-processes dying, but that's something
for later.

Author: Andres Freund
Discussion: https://postgr.es/m/20170917080752.rcmihzfmgbeuqjk2@alap3.anarazel.de

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/a1924a4ea29399111e5155532ca24c9c51d3c82d

Modified Files
--------------
src/test/recovery/t/013_crash_restart.pl | 192 +++++++++++++++++++++++++++++++
1 file changed, 192 insertions(+)


--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Add test for postmaster crash restarts.

Hm, calliphoridae doesn't like this.
        regards, tom lane


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:

On September 18, 2017 8:55:35 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> Add test for postmaster crash restarts.
>
>Hm, calliphoridae doesn't like this.

Yea. Not clear to me why yet. The machine ran a number of instances with nearly the same config successfully. Can't
imaginethat copyparse makes a difference here.  I suspect it's somehow load related... Ran a good number of iterations
locally,didn't reproduce, even under high load.  Think I'll add bit more error reporting. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Tom Lane
Date:
I discovered that prairiedog has been hung up for many hours in the
013_crash_restart.pl.  It looks to me like the explanation is that
the test has a race condition, because what I find in the postmaster
log is

2017-09-19 00:31:48.194 EDT [27839] [unknown] LOG:  connection received: host=[local]
2017-09-19 00:31:48.203 EDT [27839] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.218 EDT [27839] t/013_crash_restart.pl LOG:  statement: CREATE TABLE alive(status text);
2017-09-19 00:31:48.266 EDT [27839] t/013_crash_restart.pl LOG:  statement: INSERT INTO alive
VALUES($$committed-before-sigquit$$);
2017-09-19 00:31:48.271 EDT [27839] t/013_crash_restart.pl LOG:  statement: SELECT pg_backend_pid();
2017-09-19 00:31:48.278 EDT [27839] t/013_crash_restart.pl LOG:  statement: BEGIN;
2017-09-19 00:31:48.280 EDT [27839] t/013_crash_restart.pl LOG:  statement: INSERT INTO alive
VALUES($$in-progress-before-sigquit$$)RETURNING status; 
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess 
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current transaction and exit, because another server process exited abnormally and possibly corrupted  
shared memory.
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command. 
2017-09-19 00:31:48.299 EDT [27827] LOG:  server process (PID 27839) exited with exit code 2
2017-09-19 00:31:48.299 EDT [27827] DETAIL:  Failed process was running: INSERT INTO alive
VALUES($$in-progress-before-sigquit$$)RETURNING status; 
2017-09-19 00:31:48.300 EDT [27827] LOG:  terminating any other active server processes
2017-09-19 00:31:48.307 EDT [27832] WARNING:  terminating connection because of crash of another server process
2017-09-19 00:31:48.307 EDT [27832] DETAIL:  The postmaster has commanded this server process to roll back the current
transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 
2017-09-19 00:31:48.307 EDT [27832] HINT:  In a moment you should be able to reconnect to the database and repeat your
command.
2017-09-19 00:31:48.317 EDT [27827] LOG:  all server processes terminated; reinitializing
2017-09-19 00:31:48.333 EDT [27840] LOG:  database system was interrupted; last known up at 2017-09-19 00:31:47 EDT
2017-09-19 00:31:48.338 EDT [27840] LOG:  database system was not properly shut down; automatic recovery in progress
2017-09-19 00:31:48.346 EDT [27840] LOG:  redo starts at 0/15A89EC
2017-09-19 00:31:48.361 EDT [27840] LOG:  invalid record length at 0/15C6D74: wanted 24, got 0
2017-09-19 00:31:48.362 EDT [27840] LOG:  redo done at 0/15C6D50
2017-09-19 00:31:48.362 EDT [27840] LOG:  last completed transaction was at log time 2017-09-19 00:31:48.270076-04
2017-09-19 00:31:48.474 EDT [27827] LOG:  database system is ready to accept connections
2017-09-19 00:31:48.492 EDT [27847] [unknown] LOG:  connection received: host=[local]
2017-09-19 00:31:48.499 EDT [27847] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.578 EDT [27847] t/013_crash_restart.pl LOG:  statement: SELECT pg_sleep(3600);

IOW, the "$monitor" instance of psql did not complete making its
connection until after the crash/restart cycle had occurred.
So we're just sitting there waiting for a crash report that won't
come.  Which is another very serious deficiency in this test:
lacking any sort of timeout, it will just freeze indefinitely
if anything doesn't happen exactly the way it expects.  From a
buildfarm owner's standpoint, that's pretty damn unfriendly.
It means having to manually unwedge your animals from time to time.

I'd like to ask you to revert this test, at least pending making
it a whole lot more bulletproof.  We don't really need crash
recovery testing in the buildfarm IMO --- we hackers crash the
system plenty often enough to notice problems there.
        regards, tom lane


--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
Hi,

On 2017-09-19 12:13:54 -0400, Tom Lane wrote:
> IOW, the "$monitor" instance of psql did not complete making its
> connection until after the crash/restart cycle had occurred.

That'd be easy enough to fix...

Just something like

$monitor_stdin .= q[
SELECT $$am-i-up$$;
];
$monitor->pump until $monitor_stdout =~ /am-i-up/;
$monitor_stdout = '';


> So we're just sitting there waiting for a crash report that won't
> come.  Which is another very serious deficiency in this test:
> lacking any sort of timeout, it will just freeze indefinitely
> if anything doesn't happen exactly the way it expects.  From a
> buildfarm owner's standpoint, that's pretty damn unfriendly.
> It means having to manually unwedge your animals from time to time.

Note that I just copied the code for that from another test - this is
isn't unique to this test. I agree that it'd be good to add a timeout to
those pump calls.


> I'd like to ask you to revert this test, at least pending making
> it a whole lot more bulletproof.

Hm. Ok. That seems like an overreaction to me - the failure rate isn't
actualy that high so far.  I'm happy to add both timeouts and "earlier
startup" of the $monitor, but I'd prefer to do so in-tree - I'd run the
test through 100+ iterations locally, without any of this showing up.


> We don't really need crash recovery testing in the buildfarm IMO ---
> we hackers crash the system plenty often enough to notice problems
> there.

I for one don't exercise that kind of crash restarts, my development
scripts all work with restart_after_crash = false. What I find more
concerning however is coverage of EXEC_BACKEND, which has far fewer
developers actively running it constantly.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-09-19 12:13:54 -0400, Tom Lane wrote:
>> IOW, the "$monitor" instance of psql did not complete making its
>> connection until after the crash/restart cycle had occurred.

> That'd be easy enough to fix...

> Just something like

> $monitor_stdin .= q[
> SELECT $$am-i-up$$;
> ];
> $monitor->pump until $monitor_stdout =~ /am-i-up/;
> $monitor_stdout = '';

Hm, do we care whether the sleep call has started?  Maybe not.

>> I'd like to ask you to revert this test, at least pending making
>> it a whole lot more bulletproof.

> Hm. Ok. That seems like an overreaction to me - the failure rate isn't
> actualy that high so far.

It's not that it's high, it's that having to ask buildfarm owners to
manually unwedge their critters is not very cool.  Maybe prairiedog
will be the only one that gets stuck, but I kinda doubt it.

> I'm happy to add both timeouts and "earlier
> startup" of the $monitor, but I'd prefer to do so in-tree - I'd run the
> test through 100+ iterations locally, without any of this showing up.

Well, please fix it ASAP, if you don't want to take it out pending
the fixes.
        regards, tom lane


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:

On September 19, 2017 9:53:28 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Well, please fix it ASAP, if you don't want to take it out pending
>the fixes.

Will as soon as I finished my morning coffee. Uncaffeinated, which my phone fittingly autocorrects to unvaccinated,
commitsaren't a good idea. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
On 2017-09-19 09:58:26 -0700, Andres Freund wrote:
> 
> 
> On September 19, 2017 9:53:28 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Well, please fix it ASAP, if you don't want to take it out pending
> >the fixes.
> 
> Will as soon as I finished my morning coffee. Uncaffeinated, which my phone fittingly autocorrects to unvaccinated,
commitsaren't a good idea.
 

Done. Survived ~100 cycles here locally, while running make -j16 -s
check-world in two parallel branches. But I have a fast disk, so it's
not comparable.

If the buildfarm doesn't complain about the use of IPC::Run's timeout
functionality, we should probably patch that into the other use of
IPC::Run as well, but especially into the other user of the pump() until
... scheme.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> If the buildfarm doesn't complain about the use of IPC::Run's timeout
> functionality, we should probably patch that into the other use of
> IPC::Run as well, but especially into the other user of the pump() until
> ... scheme.

jacana hasn't passed this regression test yet, but at least that proves
the timeout stuff works ;-)

I think jacana's problem is simply that "kill QUIT" is never gonna work
on Windows, there being no such animal there.  It looks like "kill KILL"
doesn't work either, which surprises me slightly more but not much.
        regards, tom lane


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andrew Dunstan
Date:
[re-adding commiters which I inadvertently left off]


On 09/30/2017 06:10 PM, Andres Freund wrote:
>
>
>> I was just looking at this. Why aren't we using "pg_ctl kill" to
>> terminate the backend? That's supposed to be portable.
> Because pg_ctl can't do that for any process but postmaster, no? The
> test is supposed to find issues with backend death (and has
> defficiencies in error reporting already, and would have caught a bug
> I'd introduced previously).
>


No, I don't think so. That's not what the docs say. That's why you give
it a pid argument" "pg_ctl kill signal_name process_id"

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote:
> 
> [re-adding commiters which I inadvertently left off]
> 
> 
> On 09/30/2017 06:10 PM, Andres Freund wrote:
> >
> >
> >> I was just looking at this. Why aren't we using "pg_ctl kill" to
> >> terminate the backend? That's supposed to be portable.
> > Because pg_ctl can't do that for any process but postmaster, no? The
> > test is supposed to find issues with backend death (and has
> > defficiencies in error reporting already, and would have caught a bug
> > I'd introduced previously).

> No, I don't think so. That's not what the docs say. That's why you give
> it a pid argument" "pg_ctl kill signal_name process_id"

Oh, cool. Didn't know that one. So the answer is:
"Because Andres doesn't know squat.".

But even after fixing that, there unfortunately is:

static void
set_sig(char *signame)
{
…
#if 0/* probably should NOT provide SIGKILL */else if (strcmp(signame, "KILL") == 0)    sig = SIGKILL;
#endif

I'm unclear on what that provision is achieving? If you can kill with
pg_ctl you can do other nasty stuff too (like just use kill instead of
pg_ctl)?

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
On 2017-09-30 15:27:12 -0700, Andres Freund wrote:
> On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote:
> > 
> > [re-adding commiters which I inadvertently left off]
> > 
> > 
> > On 09/30/2017 06:10 PM, Andres Freund wrote:
> > >
> > >
> > >> I was just looking at this. Why aren't we using "pg_ctl kill" to
> > >> terminate the backend? That's supposed to be portable.
> > > Because pg_ctl can't do that for any process but postmaster, no? The
> > > test is supposed to find issues with backend death (and has
> > > defficiencies in error reporting already, and would have caught a bug
> > > I'd introduced previously).
> 
> > No, I don't think so. That's not what the docs say. That's why you give
> > it a pid argument" "pg_ctl kill signal_name process_id"
> 
> Oh, cool. Didn't know that one. So the answer is:
> "Because Andres doesn't know squat.".
> 
> But even after fixing that, there unfortunately is:
> 
> static void
> set_sig(char *signame)
> {
> …
> #if 0
>     /* probably should NOT provide SIGKILL */
>     else if (strcmp(signame, "KILL") == 0)
>         sig = SIGKILL;
> #endif
> 
> I'm unclear on what that provision is achieving? If you can kill with
> pg_ctl you can do other nasty stuff too (like just use kill instead of
> pg_ctl)?

Could you perhaps test whether windows likes things after the following
patch?  I don't think the kill_kill guarantees are really needed here,
so we might even be able to allow this on msvc.

Greetings,

Andres Freund

-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Attachment

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andrew Dunstan
Date:

On 09/30/2017 06:44 PM, Andres Freund wrote:
> On 2017-09-30 15:27:12 -0700, Andres Freund wrote:
>> On 2017-09-30 18:21:33 -0400, Andrew Dunstan wrote:
>>> [re-adding commiters which I inadvertently left off]
>>>
>>>
>>> On 09/30/2017 06:10 PM, Andres Freund wrote:
>>>>
>>>>> I was just looking at this. Why aren't we using "pg_ctl kill" to
>>>>> terminate the backend? That's supposed to be portable.
>>>> Because pg_ctl can't do that for any process but postmaster, no? The
>>>> test is supposed to find issues with backend death (and has
>>>> defficiencies in error reporting already, and would have caught a bug
>>>> I'd introduced previously).
>>> No, I don't think so. That's not what the docs say. That's why you give
>>> it a pid argument" "pg_ctl kill signal_name process_id"
>> Oh, cool. Didn't know that one. So the answer is:
>> "Because Andres doesn't know squat.".
>>
>> But even after fixing that, there unfortunately is:
>>
>> static void
>> set_sig(char *signame)
>> {
>> …
>> #if 0
>>     /* probably should NOT provide SIGKILL */
>>     else if (strcmp(signame, "KILL") == 0)
>>         sig = SIGKILL;
>> #endif
>>
>> I'm unclear on what that provision is achieving? If you can kill with
>> pg_ctl you can do other nasty stuff too (like just use kill instead of
>> pg_ctl)?


I put it in when we rewrote pg_ctl in C many years ago, possibly out of
a superabundance of caution. I agree it's worth revisiting. I think the
idea was that there's a difference between an ordinary footgun and an
officially sanctioned footgun :-)


> Could you perhaps test whether windows likes things after the following
> patch?  I don't think the kill_kill guarantees are really needed here,
> so we might even be able to allow this on msvc.
>



Haven't tested on MSVC but with this patch it passes on jacana (mingw).

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andres Freund
Date:
Hi,

On 2017-09-30 22:28:39 -0400, Andrew Dunstan wrote:
> >> But even after fixing that, there unfortunately is:
> >>
> >> static void
> >> set_sig(char *signame)
> >> {
> >> …
> >> #if 0
> >>     /* probably should NOT provide SIGKILL */
> >>     else if (strcmp(signame, "KILL") == 0)
> >>         sig = SIGKILL;
> >> #endif
> >>
> >> I'm unclear on what that provision is achieving? If you can kill with
> >> pg_ctl you can do other nasty stuff too (like just use kill instead of
> >> pg_ctl)?
> 
> 
> I put it in when we rewrote pg_ctl in C many years ago, possibly out of
> a superabundance of caution. I agree it's worth revisiting. I think the
> idea was that there's a difference between an ordinary footgun and an
> officially sanctioned footgun :-)

Heh.  I'm inclined to take it out. We could add a --use-the-force-luke
type parameter, but it doesn't seem worth it.


> Haven't tested on MSVC but with this patch it passes on jacana (mingw).

Yay!  Thanks for testing.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.

From
Andrew Dunstan
Date:

On 09/30/2017 10:32 PM, Andres Freund wrote:
>> Haven't tested on MSVC but with this patch it passes on jacana (mingw).
> Yay!  Thanks for testing.
>


I have now tested on bowerbird (MSVC) and it passes. This suggests that
we can run tests there in cases where we can use IPC::Run's finish()
instead of kill_kill(). Perhaps someone would like to look at the the
two other cases where we do that (recovery/t/011_crash_recovery.pl and
recovery/t/006_logical_decoding.pl) and see if they are amenable to this
treatment. It woould be nice to be able to run all the tests on all
platforms.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers