Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts. - Mailing list pgsql-committers
From | Tom Lane |
---|---|
Subject | Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts. |
Date | |
Msg-id | 17084.1505837634@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts. (Andres Freund <andres@anarazel.de>) |
Responses |
Re: [COMMITTERS] pgsql: Add test for postmaster crash restarts.
|
List | pgsql-committers |
I discovered that prairiedog has been hung up for many hours in the 013_crash_restart.pl. It looks to me like the explanation is that the test has a race condition, because what I find in the postmaster log is 2017-09-19 00:31:48.194 EDT [27839] [unknown] LOG: connection received: host=[local] 2017-09-19 00:31:48.203 EDT [27839] [unknown] LOG: connection authorized: user=buildfarm database=postgres 2017-09-19 00:31:48.218 EDT [27839] t/013_crash_restart.pl LOG: statement: CREATE TABLE alive(status text); 2017-09-19 00:31:48.266 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$committed-before-sigquit$$); 2017-09-19 00:31:48.271 EDT [27839] t/013_crash_restart.pl LOG: statement: SELECT pg_backend_pid(); 2017-09-19 00:31:48.278 EDT [27839] t/013_crash_restart.pl LOG: statement: BEGIN; 2017-09-19 00:31:48.280 EDT [27839] t/013_crash_restart.pl LOG: statement: INSERT INTO alive VALUES($$in-progress-before-sigquit$$)RETURNING status; 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl WARNING: terminating connection because of crash of another serverprocess 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to rollback the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2017-09-19 00:31:48.292 EDT [27839] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the databaseand repeat your command. 2017-09-19 00:31:48.299 EDT [27827] LOG: server process (PID 27839) exited with exit code 2 2017-09-19 00:31:48.299 EDT [27827] DETAIL: Failed process was running: INSERT INTO alive VALUES($$in-progress-before-sigquit$$)RETURNING status; 2017-09-19 00:31:48.300 EDT [27827] LOG: terminating any other active server processes 2017-09-19 00:31:48.307 EDT [27832] WARNING: terminating connection because of crash of another server process 2017-09-19 00:31:48.307 EDT [27832] DETAIL: The postmaster has commanded this server process to roll back the current transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. 2017-09-19 00:31:48.307 EDT [27832] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2017-09-19 00:31:48.317 EDT [27827] LOG: all server processes terminated; reinitializing 2017-09-19 00:31:48.333 EDT [27840] LOG: database system was interrupted; last known up at 2017-09-19 00:31:47 EDT 2017-09-19 00:31:48.338 EDT [27840] LOG: database system was not properly shut down; automatic recovery in progress 2017-09-19 00:31:48.346 EDT [27840] LOG: redo starts at 0/15A89EC 2017-09-19 00:31:48.361 EDT [27840] LOG: invalid record length at 0/15C6D74: wanted 24, got 0 2017-09-19 00:31:48.362 EDT [27840] LOG: redo done at 0/15C6D50 2017-09-19 00:31:48.362 EDT [27840] LOG: last completed transaction was at log time 2017-09-19 00:31:48.270076-04 2017-09-19 00:31:48.474 EDT [27827] LOG: database system is ready to accept connections 2017-09-19 00:31:48.492 EDT [27847] [unknown] LOG: connection received: host=[local] 2017-09-19 00:31:48.499 EDT [27847] [unknown] LOG: connection authorized: user=buildfarm database=postgres 2017-09-19 00:31:48.578 EDT [27847] t/013_crash_restart.pl LOG: statement: SELECT pg_sleep(3600); IOW, the "$monitor" instance of psql did not complete making its connection until after the crash/restart cycle had occurred. So we're just sitting there waiting for a crash report that won't come. Which is another very serious deficiency in this test: lacking any sort of timeout, it will just freeze indefinitely if anything doesn't happen exactly the way it expects. From a buildfarm owner's standpoint, that's pretty damn unfriendly. It means having to manually unwedge your animals from time to time. I'd like to ask you to revert this test, at least pending making it a whole lot more bulletproof. We don't really need crash recovery testing in the buildfarm IMO --- we hackers crash the system plenty often enough to notice problems there. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
pgsql-committers by date: