From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> [ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
I must add one thing. After some client processes closed the connection
without any hang, their server processes were stuck with a stack trace like
this (I'll look for and show the exact stack trace tomorrow):
WaitLatchOrSocket
SyncRepWaitForLSN
CommitTransaction
CommitTransactionCommand
ProcessCatchupEvent
...
shmem_exit
proc_exit_prepare
proc_exit
PostgresMain
...
The process appears to be hanging during session termination. So, it's not
the problem only during client request wait.
> Another line of thought is that we've been way too uncritical about
> shoving different kinds of events into the SIGUSR1 multiplexor.
> It might be a good idea to separate "high level" interrupts from
> "low level" ones, using say SIGUSR1 for the former and SIGUSR2
> for the latter. However, that doesn't sound very back-patchable,
> even assuming that we can come up with a clean division.
This seems to be one step in the right direction. There are two issues in
the current implementation:
[Issue 1]
[ sinval catchup signal -> ProcessCatchupEvent -> WaitLatch -> deadlock ]
This is (partly) because the latch wakeup and other processing use the same
SIGUSR1 in normal backend, autovacuum launcher/worker, and the background
worker with database access. On the other hand, other background daemon
processes properly use SIGUSR1 only for latch wakeup, and SIGUSR2 for other
tasks.
[Issue 2]
WaitLatch() returns prematurely due to the sinval catchup signal, even
though the target event (e.g. reply from standby) hasn't occurred and called
SetLatch() yet. This is because procsignal_sigusr1_handler() always calls
latch_sigusr1_handler().
This is probably not related to the cause of the hang.
So, as you suggest, I think it would be a good idea to separate
latch_sigusr1_handler() call into a different function solely for it like
other daemon processes, and leave the rest in procsignal_sigusr1_handler()
and rename function to procsignal_sigusr2_handler() or procsignal_handler().
Originally, it's not natural that the current procsignal_sigusr1_handler()
contains latch_sigusr1_handler() call, because latch processing is not based
on the procsignal mechanism (SetLatch() doesn't call SendProcSignal()).
I'll try the fix tomorrow if possible. What kind of problems do you hink of
for back-patching?
Regards
MauMau