Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust. - Mailing list pgsql-committers

From Andres Freund
Subject Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.
Date
Msg-id 20170920010629.3icid5ukemkrr42k@alap3.anarazel.de
Whole thread Raw
In response to Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.
List pgsql-committers
On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So this is geniuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, incuding valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > #       This probably means the server terminated abnormally
> > #       before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
> 
> That seems pretty weird.  Maybe it's not the same case, but in
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
> 
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
> 
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  terminating connection because of crash of
anotherserver process
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process
toroll back the current transaction and exit, because another server process exited abnormally and possibly corrupted
sharedmemory.
 
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your command.
 
> 
> Have we forgotten an fflush() or something?

After hacking a fix for my previous theory, I started adding strace into
the mix, to verify this. Takes longer to reproduce, but after filtering
to -e trace=network, I got this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
+++ exited with 2 +++

So indeed, we got a connreset before receiving the proper error
message.

The corresponding server log (debug3):
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another
serverprocess
 
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to
rollback the current t
 
ransaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the
databaseand repeat your c
 
ommand.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; reinitializing

So the server indeed was killed by SIGQUIT, not an escalation to
SIGKILL. And it output stuff to the server log, and didn't complain
about communication to the client... Odd.

Greetings,

Andres Freund


-- 
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

pgsql-committers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: [COMMITTERS] pgsql: Set partitioned_rels appropriately when UNIONALL is used.
Next
From: Andres Freund
Date:
Subject: Re: [COMMITTERS] pgsql: Make new crash restart test a bit morerobust.