pgsql: In walreceiver, don't try to do ereport() in a signal handler. - Mailing list pgsql-committers

From Tom Lane
Subject pgsql: In walreceiver, don't try to do ereport() in a signal handler.
Date
Msg-id E1hL96b-0005SP-Fi@gemulon.postgresql.org
Whole thread Raw
List pgsql-committers
In walreceiver, don't try to do ereport() in a signal handler.

This is quite unsafe, even for the case of ereport(FATAL) where we won't
return control to the interrupted code, and despite this code's use of
a flag to restrict the areas where we'd try to do it.  It's possible
for example that we interrupt malloc or free while that's holding a lock
that's meant to protect against cross-thread interference.  Then, any
attempt to do malloc or free within ereport() will result in a deadlock,
preventing the walreceiver process from exiting in response to SIGTERM.
We hypothesize that this explains some hard-to-reproduce failures seen
in the buildfarm.

Hence, get rid of the immediate-exit code in WalRcvShutdownHandler,
as well as the logic associated with WalRcvImmediateInterruptOK.
Instead, we need to take care that potentially-blocking operations
in the walreceiver's data transmission logic (libpqwalreceiver.c)
will respond reasonably promptly to the process's latch becoming
set and then call ProcessWalRcvInterrupts.  Much of the needed code
for that was already present in libpqwalreceiver.c.  I refactored
things a bit so that all the uses of PQgetResult use latch-aware
waiting, but didn't need to do much more.

These changes should be enough to ensure that libpqwalreceiver.c
will respond promptly to SIGTERM whenever it's waiting to receive
data.  In principle, it could block for a long time while waiting
to send data too, and this patch does nothing to guard against that.
I think that that hazard is mostly theoretical though: such blocking
should occur only if we fill the kernel's data transmission buffers,
and we don't generally send enough data to make that happen without
waiting for input.  If we find out that the hazard isn't just
theoretical, we could fix it by using PQsetnonblocking, but that
would require more ticklish changes than I care to make now.

This is a bug fix, but it seems like too big a change to push into
the back branches without much more testing than there's time for
right now.  Perhaps we'll back-patch once we have more confidence
in the change.

Patch by me; thanks to Thomas Munro for review.

Discussion: https://postgr.es/m/20190416070119.GK2673@paquier.xyz

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/a1a789eb5ac894b4ca4b7742f2dc2d9602116e46

Modified Files
--------------
.../libpqwalreceiver/libpqwalreceiver.c            | 119 ++++++++++++---------
src/backend/replication/walreceiver.c              |  66 +++---------
src/include/replication/walreceiver.h              |   1 +
3 files changed, 85 insertions(+), 101 deletions(-)


pgsql-committers by date:

Previous
From: Michael Paquier
Date:
Subject: pgsql: Fix some typos
Next
From: Peter Geoghegan
Date:
Subject: pgsql: Remove obsolete _bt_insert_parent() comment.