Re: failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers

From Andres Freund
Subject Re: failures in t/031_recovery_conflict.pl on CI
Date
Msg-id 20220413002626.udl7lll7f3o7nre7@alap3.anarazel.de
Whole thread Raw
In response to Re: failures in t/031_recovery_conflict.pl on CI  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: failures in t/031_recovery_conflict.pl on CI  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2022-04-12 15:05:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> >> +1.  This is probably more feasible given the latch infrastructure
> >> than it was when that code was first written.
> 
> > What do you think about just reordering the disable_all_timeouts() to be
> > before the got_standby_deadlock_timeout check in the back branches? I think
> > that should close at least the most obvious hole.  And fix it properly in
> > HEAD?
> 
> I don't have much faith in that, and I don't see why we can't fix it
> properly.

It's not too hard, agreed.

It's somewhat hard to test that it works though. The new recovery conflict
tests test all recovery conflicts except for deadlock ones, because they're
not easy to trigger... But I think I nearly got it working reliably.

It's probably worth backpatching the tests, after stripping them of the stats
checks?

Three questions:

- For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
  replay progress reporting. Is there a reason to keep them in the
  backbranches? Hard to see how an extension or something could rely on it,
  but ...?

- I named the variable set by the signal handler got_standby_delay_timeout,
  because just got_standby_timeout looked weird besides the _deadlock_. Which
  makes me think that we should rename STANDBY_TIMEOUT to
  STANDBY_DELAY_TIMEOUT too?

- There's the following comment in ResolveRecoveryConflictWithBufferPin():

  "We assume that only UnpinBuffer() and the timeout requests established
   above can wake us up here."

  That bogus afaict? There's plenty other things that can cause MyProc->latch
  to be set. Is it worth doing something about this at the same time? Right
  now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
  succession initially.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Improving the "Routine Vacuuming" docs
Next
From: "David G. Johnston"
Date:
Subject: Re: Improving the "Routine Vacuuming" docs