Re: failures in t/031_recovery_conflict.pl on CI - Mailing list pgsql-hackers

From Andres Freund
Subject Re: failures in t/031_recovery_conflict.pl on CI
Date
Msg-id 20220429200809.uuuseakmhb57sppk@alap3.anarazel.de
Whole thread Raw
In response to Re: failures in t/031_recovery_conflict.pl on CI  (Andres Freund <andres@anarazel.de>)
Responses Re: failures in t/031_recovery_conflict.pl on CI  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

Attached are patches for this issue.

It adds a test case for deadlock conflicts to make sure that case isn't
broken. I also tested the recovery conflict tests in the back branches, and
they work there with a reasonably small set of changes.

Questions:
- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
  preceding numbers are unused. It seems way more problematic to use a
  different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the
  backbranches. For now I've just inlined the implementation, but I guess we
  could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
  - older branches don't have allow_in_place_tablespaces - I think just
    skipping tablespace conflicts is fine, they're comparatively
    simple.

    Eventually it might make sense to backpatch allow_in_place_tablespaces,
    our test coverage in the area is quite poor.

  - the stats tests can't easily made reliably in the backbranches - which is
    ok, as the conflict itself is verified via the log

  - some branches don't have log_recovery_conflict_waits, since it's not
    critical to the test, it's ok to just not include it there

  I played with the idea of handling the differences using version comparisons
  in the code, and have the test be identically across branches. Since it's
  something we don't do so far, I'm leaning against it, but ...


> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
>   replay progress reporting. Is there a reason to keep them in the
>   backbranches? Hard to see how an extension or something could rely on it,
>   but ...?

I've left it as is for now, will start a separate thread.


> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
>   "We assume that only UnpinBuffer() and the timeout requests established
>    above can wake us up here."
>
>   That bogus afaict? There's plenty other things that can cause MyProc->latch
>   to be set. Is it worth doing something about this at the same time? Right
>   now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
>   succession initially.

The comment is more recent than I had realized. I raised this separately in
https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de


pgindent uses some crazy formatting nearby:
        SendRecoveryConflictWithBufferPin(
                                          PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one
SendRecoveryConflictWithBufferPin() call instead of two, storing the type of
conflict in a local variable? Doesn't look entirely pretty either, but ...


I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(),
btw. But that'd be a non-backpatchable cleanup, I think:
 * The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
 * (when not InHotStandby) is performed here, for code clarity.


Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: fix cost subqueryscan wrong parallel cost
Next
From: Andres Freund
Date:
Subject: Re: [RFC] building postgres with meson -v8