
From: Andres Freund
Subject: Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.
Date:
Msg-id: 20170920020038.kllxgilo7xzwmtto@alap3.anarazel.de
In response to: Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.  (Andres Freund <andres@anarazel.de>)
Responses: Re: [HACKERS] Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.  (Andres Freund <andres@anarazel.de>)
List: pgsql-committers
On 2017-09-19 18:06:29 -0700, Andres Freund wrote:
> On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> > Have we forgotten an fflush() or something?
> 
> After hacking a fix for my previous theory, I started adding strace into
> the mix, to verify this. Takes longer to reproduce, but after filtering
> to -e trace=network, I got this:
> 
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
> socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
> socket(AF_UNIX, SOCK_STREAM, 0)         = 3
> connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) = 0
> getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
> sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, NULL, 0) = 79
> recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, NULL) = 340
> sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
> recvfrom(3, "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 16384, 0, NULL, NULL) = 79
> sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
> recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
> +++ exited with 2 +++
> 
> So indeed, we got a connreset before receiving the proper error
> message.
> 
> The corresponding server log (debug3):
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
> 2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating connection because of crash of another server process
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The postmaster has commanded this server process to rollback the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
> 2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 0 callbacks to make
> ...
> 2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with exit code 2
> 2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT pg_sleep(3600);
> 2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; reinitializing
> 
> So the server indeed was killed by SIGQUIT, not an escalation to
> SIGKILL. And it output stuff to the server log, and didn't complain
> about communication to the client... Odd.

Hah! The reason for that is that socket_flush tries to avoid doing stuff
recursively:

static int
socket_flush(void)
{
    int         res;

    /* No-op if reentrant call */
    if (PqCommBusy)
        return 0;
...

(detected by putting an elog(COMMERROR) there)
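For illustration, the diagnostic amounted to roughly the following; this is
only a sketch of where such an elog(COMMERROR) fits into the early-return
path, not something I'd propose to commit.  COMMERROR output is only ever
written to the server log, never sent to the client, so it cannot recurse
into this code:

static int
socket_flush(void)
{
    int         res;

    /* No-op if reentrant call */
    if (PqCommBusy)
    {
        /*
         * Sketch only: make the silently discarded flush visible in the
         * server log, so a missing client error message is easier to
         * diagnose.
         */
        elog(COMMERROR, "discarding flush of client output buffer during reentrant call");
        return 0;
    }
    ...
}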

Adding an abort() there shows the following backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f089de4e42a in __GI_abort () at abort.c:89
#2  0x000056473218a3f6 in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1408
#3  0x000056473246e4ec in send_message_to_frontend (edata=0x5647329e34e0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3319
#4  0x000056473246ad02 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1483
#5  0x00005647324680af in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:483
#6  0x00005647322fb340 in quickdie (postgres_signal_arg=3) at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2608
#7  <signal handler called>
#8  0x00007f089fbb65bd in __libc_send (fd=7, buf=0x564734a22c80, n=79, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/send.c:26
#9  0x000056473217c874 in secure_raw_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:310
#10 0x000056473217c6fc in secure_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:256
#11 0x000056473218a47c in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1437
#12 0x000056473218a40c in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1413
#13 0x00005647322f693d in ReadyForQuery (dest=DestRemote) at /home/andres/src/postgresql/src/backend/tcop/dest.c:259
#14 0x00005647322fd078 in PostgresMain (argc=1, argv=0x564734a24db8, dbname=0x564734a24c98 "postgres", username=0x5647349fbda8 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4047
#15 0x000056473225b54e in BackendRun (port=0x564734a1c730) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4364
#16 0x000056473225ac5e in BackendStartup (port=0x564734a1c730) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4036
#17 0x0000564732257007 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1756
#18 0x0000564732256596 in PostmasterMain (argc=3, argv=0x5647349f9b70) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1364
#19 0x0000564732190be8 in main (argc=3, argv=0x5647349f9b70) at /home/andres/src/postgresql/src/backend/main/main.c:228

I think this is actually not the $monitor backend but $killme, but it
shows how this is possible.  I don't think there's an easy way out of
this - making all low-level client communication properly reentrant
sounds hard.  What I would like to do is add a log message somewhere
around send_message_to_frontend() noting that the message to the client
could not be sent, to avoid future confusion like this.
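To be concrete, what I have in mind is only roughly the sketch below,
placed after the final flush in send_message_to_frontend().
pq_unsent_bytes() is a hypothetical accessor invented purely for
illustration (the send-buffer bookkeeping is private to pqcomm.c); the
point is merely that a dropped client message should leave a trace in
the server log:

    pq_flush();

    /*
     * Sketch only: pq_unsent_bytes() is a hypothetical helper returning the
     * amount of data still sitting in the libpq send buffer.  If anything is
     * left after the flush, the client very likely never saw this message.
     */
    if (pq_unsent_bytes() > 0)
        ereport(COMMERROR,
                (errmsg_internal("error message could not be delivered to the client")));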

Does anybody have a better idea?

Given this fact pattern, I'll make the recovery test also accept the case
where no error message is received. Objections?

Greetings,

Andres Freund


