Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts. - Mailing list pgsql-committers

From Tom Lane
Subject Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.
Date
Msg-id 17084.1505837634@sss.pgh.pa.us
In response to Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.  (Andres Freund <andres@anarazel.de>)
Responses Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.  (Andres Freund <andres@anarazel.de>)
List pgsql-committers
I discovered that prairiedog has been hung up for many hours in the
013_crash_restart.pl test.  It looks to me like the explanation is that
the test has a race condition, because what I find in the postmaster
log is

2017-09-19 00:31:48.194 EDT [27839] [unknown] LOG:  connection received: host=[local]
2017-09-19 00:31:48.203 EDT [27839] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.218 EDT [27839] t/013_crash_restart.pl LOG:  statement: CREATE TABLE alive(status text);
2017-09-19 00:31:48.266 EDT [27839] t/013_crash_restart.pl LOG:  statement: INSERT INTO alive VALUES($$committed-before-sigquit$$);
2017-09-19 00:31:48.271 EDT [27839] t/013_crash_restart.pl LOG:  statement: SELECT pg_backend_pid();
2017-09-19 00:31:48.278 EDT [27839] t/013_crash_restart.pl LOG:  statement: BEGIN;
2017-09-19 00:31:48.280 EDT [27839] t/013_crash_restart.pl LOG:  statement: INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status;
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another server process
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2017-09-19 00:31:48.299 EDT [27827] LOG:  server process (PID 27839) exited with exit code 2
2017-09-19 00:31:48.299 EDT [27827] DETAIL:  Failed process was running: INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status;
2017-09-19 00:31:48.300 EDT [27827] LOG:  terminating any other active server processes
2017-09-19 00:31:48.307 EDT [27832] WARNING:  terminating connection because of crash of another server process
2017-09-19 00:31:48.307 EDT [27832] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-19 00:31:48.307 EDT [27832] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2017-09-19 00:31:48.317 EDT [27827] LOG:  all server processes terminated; reinitializing
2017-09-19 00:31:48.333 EDT [27840] LOG:  database system was interrupted; last known up at 2017-09-19 00:31:47 EDT
2017-09-19 00:31:48.338 EDT [27840] LOG:  database system was not properly shut down; automatic recovery in progress
2017-09-19 00:31:48.346 EDT [27840] LOG:  redo starts at 0/15A89EC
2017-09-19 00:31:48.361 EDT [27840] LOG:  invalid record length at 0/15C6D74: wanted 24, got 0
2017-09-19 00:31:48.362 EDT [27840] LOG:  redo done at 0/15C6D50
2017-09-19 00:31:48.362 EDT [27840] LOG:  last completed transaction was at log time 2017-09-19 00:31:48.270076-04
2017-09-19 00:31:48.474 EDT [27827] LOG:  database system is ready to accept connections
2017-09-19 00:31:48.492 EDT [27847] [unknown] LOG:  connection received: host=[local]
2017-09-19 00:31:48.499 EDT [27847] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
2017-09-19 00:31:48.578 EDT [27847] t/013_crash_restart.pl LOG:  statement: SELECT pg_sleep(3600);

IOW, the "$monitor" instance of psql did not complete making its
connection until after the crash/restart cycle had occurred.
So we're just sitting there waiting for a crash report that won't
come.  Which is another very serious deficiency in this test:
lacking any sort of timeout, it will just freeze indefinitely
if anything doesn't happen exactly the way it expects.  From a
buildfarm owner's standpoint, that's pretty damn unfriendly.
It means having to manually unwedge your animals from time to time.
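
To illustrate the kind of bounded wait that would avoid this, here is a
minimal sketch.  It is written in Python rather than the Perl the test
actually uses, and it is not taken from the test: the helper wait_until,
the exception WaitTimeout, and the FakeMonitor stand-in for the "$monitor"
session are all hypothetical names invented for the example.  It only shows
the two ideas above: confirm the monitor is connected before triggering the
crash, and give up with an error instead of waiting forever.

```python
import time


class WaitTimeout(Exception):
    """Raised when a condition is not met before the deadline."""


def wait_until(condition, timeout=5.0, interval=0.05, what="condition"):
    """Poll condition() until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"timed out after {timeout}s waiting for {what}")
        time.sleep(interval)


class FakeMonitor:
    """Hypothetical stand-in for the monitoring session (an assumption).

    It only counts as connected after `connect_delay` seconds, imitating a
    slow connection on a slow machine.
    """

    def __init__(self, connect_delay):
        self._ready_at = time.monotonic() + connect_delay
        self.output = ""

    def connected(self):
        return time.monotonic() >= self._ready_at


def run_step(monitor, timeout):
    # 1. Do not trigger the crash until the monitor is really connected.
    wait_until(monitor.connected, timeout=timeout, what="monitor connection")
    # 2. Simulated crash: the monitor receives the crash report.
    monitor.output += (
        "WARNING:  terminating connection because of crash of "
        "another server process\n"
    )
    # 3. Wait for the report with a deadline instead of blocking forever.
    wait_until(
        lambda: "terminating connection" in monitor.output,
        timeout=timeout,
        what="crash report",
    )
    return "ok"
```

With a monitor that connects quickly, run_step(FakeMonitor(0.1), timeout=2.0)
completes; with one that never connects in time, it raises WaitTimeout, so
the failure shows up as a test error rather than a wedged run.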

I'd like to ask you to revert this test, at least pending making
it a whole lot more bulletproof.  We don't really need crash
recovery testing in the buildfarm IMO --- we hackers crash the
system plenty often enough to notice problems there.
        regards, tom lane


