Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust. - Mailing list pgsql-committers
From | Andres Freund |
---|---|
Subject | Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust. |
Date | |
Msg-id | 20170920010629.3icid5ukemkrr42k@alap3.anarazel.de |
In response to | Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust. (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.
|
List | pgsql-committers |
On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So this is genuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, including valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > # This probably means the server terminated abnormally
> > # before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
>
> That seems pretty weird. Maybe it's not the same case, but in
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
>
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
>
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG: connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG: connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG: statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
> Have we forgotten an fflush() or something?

After hacking a fix for my previous theory, I started adding strace into
the mix, to verify this.
Takes longer to reproduce, but after filtering to -e trace=network, I
got this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
+++ exited with 2 +++

So indeed, we got a connreset before receiving the proper error message.
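For anyone reading the trace without the wire protocol in their head: the two `sendto` calls starting with `Q` are simple-query messages, i.e. one type byte, a big-endian Int32 length that counts itself and the body but not the type byte, then the NUL-terminated SQL text. Here is a minimal Python sketch that rebuilds those two messages to check that the byte counts in the trace are what one would expect. The helper names `build_query_message` and `parse_message_header` are made up for illustration and are not part of libpq or psql.

```python
import struct
from typing import Tuple


def build_query_message(sql: str) -> bytes:
    """Encode a simple-query ('Q') message: type byte, Int32 length, NUL-terminated SQL."""
    body = sql.encode("utf-8") + b"\x00"
    # The length field counts itself (4 bytes) plus the body, but not the type byte.
    return b"Q" + struct.pack("!I", 4 + len(body)) + body


def parse_message_header(data: bytes) -> Tuple[str, int]:
    """Return (type, declared_length) read from the first five bytes of a message."""
    msg_type, length = struct.unpack("!cI", data[:5])
    return msg_type.decode("ascii"), length


if __name__ == "__main__":
    for sql in ("SELECT $$psql-connected$$;", "SELECT pg_sleep(3600);"):
        msg = build_query_message(sql)
        print(parse_message_header(msg), len(msg))
```

The total sizes come out as 32 and 28 bytes, matching the third argument of the two `sendto` calls, with declared lengths 31 (octal `\37`) and 27 (octal `\33`). So the client side sent two well-formed queries, and the reset arrived while it was waiting for the reply to the second one.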
The corresponding server log (debug3):

2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: proc_exit(-1): 0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG: server process (PID 730) exited with exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL: Failed process was running: SELECT pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG: all server processes terminated; reinitializing

So the server indeed was killed by SIGQUIT, not an escalation to
SIGKILL. And it output stuff to the server log, and didn't complain
about communication to the client... Odd.
Greetings,

Andres Freund