Infinite wait for SyncRep while handling USR1 - Mailing list pgsql-hackers

From Soumyadeep Chakraborty
Subject Infinite wait for SyncRep while handling USR1
Date
Msg-id CAE-ML+8DK3WHj5EskPX6HxfctkYgGfOLeU-bfkqGfE1Viia5Fw@mail.gmail.com
Whole thread Raw
List pgsql-hackers
Hello Hackers,

There is an edge case in 9_5_STABLE (doesn't reproduce 9_6+) we found in how
backends handle the TERM signal while handling a USR1 signal that can cause
them to infinitely wait. A backend can end up waiting forever inside
SyncRepWaitForLSN() at:

rc = WaitLatch(MyLatch, WL_LATCH_SET |
WL_POSTMASTER_DEATH, -1, WAIT_EVENT_SYNC_REP);

Even pg_terminate_backend() would not be able to terminate it after it reaches this point.

Stack trace:

#0  0x00007f39418d73c8 in poll () from /lib64/libc.so.6
#1  0x0000000000753d7b in WaitLatchOrSocket (latch=0x7f3940d371cc, wakeEvents=17, sock=-1, timeout=-1)
    at pg_latch.c:333
#2  0x0000000000753b8c in WaitLatch (latch=0x7f3940d371cc, wakeEvents=17, timeout=-1) at pg_latch.c:197
#3  0x00000000007a0e02 in SyncRepWaitForLSN (XactCommitLSN=167868344) at syncrep.c:231
#4  0x0000000000518b8a in RecordTransactionCommit () at xact.c:1349
#5  0x000000000051966d in CommitTransaction () at xact.c:2057
#6  0x000000000051a1f9 in CommitTransactionCommand () at xact.c:2769
#7  0x000000000054ff5e in RemoveTempRelationsCallback (code=1, arg=0) at namespace.c:3878
#8  0x00000000007be3d1 in shmem_exit (code=1) at ipc.c:228
#9  0x00000000007be2c6 in proc_exit_prepare (code=1) at ipc.c:185
#10 0x00000000007be234 in proc_exit (code=1) at ipc.c:102
#11 0x000000000093152e in errfinish (dummy=0) at elog.c:535
#12 0x00000000007ea492 in ProcessInterrupts () at postgres.c:2913
#13 0x00000000007e9f93 in die (postgres_signal_arg=15) at postgres.c:2682
#14 <signal handler called>
#15 procsignal_sigusr1_handler (postgres_signal_arg=10) at procsignal.c:271
#16 <signal handler called>
#17 SocketBackend (inBuf=0x7ffc8dde0f80) at postgres.c:353
#18 0x00000000007e70ca in ReadCommand (inBuf=0x7ffc8dde0f80) at postgres.c:510
#19 0x00000000007eb990 in PostgresMain (argc=1, argv=0x1708cd0, dbname=0x1708b38 "gpadmin",
    username=0x1708b18 "gpadmin") at postgres.c:4032
#20 0x0000000000769922 in BackendRun (port=0x172e060) at postmaster.c:4309
#21 0x0000000000769086 in BackendStartup (port=0x172e060) at postmaster.c:3983
#22 0x00000000007657dc in ServerLoop () at postmaster.c:1706
#23 0x0000000000764e16 in PostmasterMain (argc=3, argv=0x1707ce0) at postmaster.c:1314
#24 0x00000000006bcdb3 in main (argc=3, argv=0x1707ce0) at main.c:228

Root cause:

The reason why the backend waits forever in WaitLatch is that it expects a USR1
signal to be delivered in order to wake it up from the wait (WaitLatchOrSocket
implementation) using latch_sigusr1_handler. Now since its is already inside a
USR1 handler, USR1 is blocked by default. Thus, it never receives a USR1 and
never wakes up.

Reproduction: [9_5_STABLE, commit: 5a32fcd]

We must align the stars as follows:

1. Create a psql session and create a temp table. [temp table forces a Tx
during backend termination to drop the table => SyncRepWaitForLSN will be
called]

2. Since, we don't have a fault injection framework, we have to rely on the
debugger: Attach to the following processes and install these breakpoints:

i) Attach to the Backend:
b procsignal_sigusr1_handler
b postgres.c:353
at: SocketBackend: ereport(DEBUG1,
(errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
errmsg("unexpected EOF on client connection")));

ii) Attach to the Wal Sender, no need for a specific breakpoint [We need to
make the Wal Sender fall behind in order for the condition lsn <=
WalSndCtl->lsn[mode] to be false inside SyncRepWaitForLSN()]

3. kill -9 the psql process [This will send an EOF to the backend's connection
to the front end, which will cause DoingCommandRead = true and
whereToSendOutput = DestNone (in SocketBackend()). This will later ensure that
ProcessInterrupts() is called from within die: the backend's TERM handler]

4. Continue the backend process in the debugger, it should reach the breakpoint
we set inside SocketBackend.

5. Send a USR1 to the backend. kill -SIGUSR1. Continue the backend. We will hit
the breakpoint b procsignal_sigusr1_handler.

6. Send a TERM or perform pg_terminate_backend() on the backend.

7. Detach from all of the processes.

8. The backend will be waiting for replication forever inside
SyncRepWaitForLSN(). We can confirm this by pausing the backend and checking
the back trace and also by looking at pg_stat_activity. Subsequent attempts to
terminate the backend would be futile.


Working fix:
At the start of SyncRepWaitForLSN() check if we are already in a USR1 handler.
If yes, we return. We use sigprocmask(), sigismember() to perform the
detection.

-----
Ashwin Agrawal, Bhuvnesh Chaudhary, Jesse Zhang and Soumyadeep Chakraborty

pgsql-hackers by date:

Previous
From: Robert Eckhardt
Date:
Subject: Re: Cleanup isolation specs from unused steps
Next
From: Ashwin Agrawal
Date:
Subject: Re: Comment in ginpostinglist.c doesn't match code