Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
Date
Msg-id CA+hUKGKrLKx7Ky1T_FHk-Y729K0oie-gOXKCbxCXyjbPDJAOOw@mail.gmail.com
Whole thread Raw
In response to Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?
List pgsql-hackers
Here's a new version, but there's something wrong that I haven't
figured out yet (see CI link below).

Here's one thing I got a bit confused about along the way, but it
seems the comment was just wrong:

+                       /*
+                        * If we can abort just the current
subtransaction then we are OK
+                        * to throw an ERROR to resolve the conflict.
Otherwise drop
+                        * through to the FATAL case.
...
+                       if (!IsSubTransaction())
...
+                               ereport(ERROR,

Surely this was meant to say, "If we're not in a subtransaction then
...", right?  Changed.

I thought of a couple more simplifications for the big switch
statement in ProcessRecoveryConflictInterrupt().  The special case for
DoingCommandRead can be changed to fall through, instead of needing a
separate ereport(FATAL).

I also folded the two ereport(FATAL) calls in the CONFLICT_DATABASE
case into one, since they differ only in errcode().

+                                       (errcode(reason ==
PROCSIG_RECOVERY_CONFLICT_DATABASE ?
+
ERRCODE_DATABASE_DROPPED :
+
ERRCODE_T_R_SERIALIZATION_FAILURE),

Now we're down to just one ereport(FATAL), one ereport(ERROR), and a
couple of ways to give up without erroring.  I think this makes the
logic a lot easier to follow?

I'm confused about proc->recoveryConflictPending: the startup process
sets it, and sometimes the interrupt receiver sets it too, and it
causes errdetail() to be clobbered on abort (for any reason), even
though we bothered to set it carefully for the recovery conflict
ereport calls.  Or something like that.  I haven't changed anything
about that in this patch, though.

Problem: I saw 031_recovery_conflict.pl time out while waiting for a
buffer pin conflict, but so far once only, on CI:

https://cirrus-ci.com/task/5956804860444672

timed out waiting for match: (?^:User was holding shared buffer pin
for too long) at t/031_recovery_conflict.pl line 367.

Hrmph.  Still trying to reproduce that, which may be a bug in this
patch, a bug in the test or a pre-existing problem.  Note that
recovery didn't say something like:

2022-06-21 17:05:40.931 NZST [57674] LOG:  recovery still waiting
after 11.197 ms: recovery conflict on buffer pin

(That's what I'd expect to see in

https://api.cirrus-ci.com/v1/artifact/task/5956804860444672/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log
if the startup process had decided to send the signal).

... so it seems like the problem in that run is upstream of the interrupt stuff.

Other things changed in response to feedback (quoting from several
recent messages):

On Thu, Jun 16, 2022 at 5:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 15, 2022 at 1:51 AM Michael Paquier <michael@paquier.xyz> wrote:
> > Handle is more consistent with the other types of interruptions in the
> > SIGUSR1 handler, so the name proposed in the patch in not that
> > confusing to me.  And so does Process, in spirit with
> > ProcessProcSignalBarrier() and ProcessLogMemoryContextInterrupt().
> > While on it, is postgres.c the best home for
> > HandleRecoveryConflictInterrupt()?  That's a very generic file, for
> > starters.  Not related to the actual bug, just asking.
>
> Yeah, there's existing precedent for this kind of split in, for
> example, HandleCatchupInterrupt() and ProcessCatchupInterrupt(). I
> think the idea is that "process" is supposed to sound like the more
> involved part of the operation, whereas "handle" is supposed to sound
> like the initial response to the signal.

Thanks both for looking.  Yeah, I was trying to keep with the existing
convention here (though admittedly we're not 100% consistent on this,
something to tidy up separately perhaps).

On Wed, Jun 15, 2022 at 5:51 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, May 22, 2022 at 05:10:01PM -0700, Andres Freund wrote:
> > It *might* look a tad cleaner to have the loop in a separate function from the
> > existing code. I.e. a +ProcessRecoveryConflictInterrupts() that calls
> > ProcessRecoveryConflictInterrupts().
>
> Agreed that it would be a bit cleaner to keep the internals in a
> different routine.

Alright, I split it into two functions: one with an 's' in the name to
do the looping, and one without 's' to process an individual interrupt
reason.  Makes the patch harder to read because the indentation level
changes...

> Also note that bufmgr.c mentions RecoveryConflictInterrupt() in the
> top comment of HoldingBufferPinThatDelaysRecovery().

Fixed.

> Should the processing of PROCSIG_RECOVERY_CONFLICT_DATABASE mention
> that FATAL is used because we are never going to retry the conflict as
> the database has been dropped?

OK, note added.

> Getting rid of
> RecoveryConflictRetryable makes the code easier to go through.

Yeah, all the communication through global variables was really
confusing, and also subtly wrong (that global reason gets clobbered
with incorrect values), and that retryable variable was hard to
follow.

On Mon, May 23, 2022 at 12:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-05-10 16:39:11 +1200, Thomas Munro wrote:
> > @@ -3146,6 +3192,9 @@ ProcessInterrupts(void)
> >               return;
> >       InterruptPending = false;
> >
> > +     if (RecoveryConflictPending)
> > +             ProcessRecoveryConflictInterrupts();
> > +
> >       if (ProcDiePending)
> >       {
> >               ProcDiePending = false;
>
> Should the ProcessRecoveryConflictInterrupts() call really be before the
> ProcDiePending check?

I don't think it's important which of (say) a statement timeout and a
recovery conflict that arrive around the same time takes priority, but
on reflection it was an ugly place to put it, and it seems tidier to
move it down the function a bit further, where other various special
interrupts are handled after the "main" and original die/cancel ones.

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Next
From: Amit Kapila
Date:
Subject: Re: Replica Identity check of partition table on subscriber