Re: Refactor recovery conflict signaling a little - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: Refactor recovery conflict signaling a little
Date
Msg-id CABPTF7X0j91LFh_Rk3V01rhGURSRp27M0Ts3YGe2vYPFHaijFg@mail.gmail.com
In response to Re: Refactor recovery conflict signaling a little  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
Hi Alexander,

On Sat, Mar 7, 2026 at 7:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Xuneng and Heikki,
>
> 04.03.2026 07:33, Xuneng Zhou wrote:
>
> 03.03.2026 17:39, Heikki Linnakangas wrote:
>
> On 24/02/2026 10:00, Alexander Lakhin wrote:
>
> The "terminating process ..." message doesn't appear when the test passes
> successfully.
>
> Hmm, right, looks like something wrong in signaling the recovery conflict. I can't tell if the signal is being sent,
> or it's not processed correctly. Looking at the code, I don't see anything wrong.
>
> I was unable to reproduce the issue on an x86_64 Linux machine using
> the provided script. All test runs completed successfully without any
> failures.
>
>
> I've added debug logging (see attached) and saw the following:
> !!!SignalRecoveryConflict[282363]
> !!!ProcArrayEndTransaction| pendingRecoveryConflicts = 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> 2026-03-07 12:21:24.544 EET walreceiver[282421] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> 2026-03-07 12:21:24.645 EET postmaster[282355] LOG:  received immediate shutdown request
> 2026-03-07 12:21:24.647 EET postmaster[282355] LOG:  database system is shut down
>
> While for a successful run, I see:
> 2026-03-07 12:18:17.075 EET startup[285260] DETAIL:  The slot conflicted with xid horizon 677.
> 2026-03-07 12:18:17.075 EET startup[285260] CONTEXT:  WAL redo at 0/04022130 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 677, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 2, nunused: 0, dead: [35, 36]; blkref #0: rel 1663/16384/16418, blk 10
> !!!SignalRecoveryConflict[285260]
> !!!ProcessInterrupts[286071]| MyProc->pendingRecoveryConflicts: 16
> !!!ProcessRecoveryConflictInterrupts[286071]
> !!!ProcessRecoveryConflictInterrupts[286071] pending: 16, reason: 4
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl ERROR:  canceling statement due to conflict with recovery
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl DETAIL:  User was using a logical replication slot that must be invalidated.
>
> (Full logs for this failed run and a good run are attached.)
>

Thanks again for the logs. I think we can frame the next run around
three explicit hypotheses:

1) Self-clear: the target walsender clears its own
pendingRecoveryConflicts during xact cleanup before
ProcessInterrupts() consumes it.
2) Group-clear-on-behalf: another backend clears the target PGPROC via
group clear before consume.
3) Visibility gap: the weak pg_atomic_read_u32() in the interrupt path
observes 0 even though the conflict bit was set.

The bad vs good traces are consistent with a receiver-side handoff failure:
   - bad run: signal path reached, target walsender later sees
pendingRecoveryConflicts = 0, never dispatches conflict;
   - good run: target sees pending = 0x10, dispatches reason 4
(RECOVERY_CONFLICT_LOGICALSLOT), throws expected ERROR, releases slot.

I prepared an updated diagnostics patch to classify hypotheses (1) and
(2) directly:
   - SignalRecoveryConflict: logs target pid/procno and old/new mask
   - clear sites: logs self_pid, target_pid, target_procno, self_is_target
   - ProcessInterrupts: logs handler counter + pending-mask observations

Two caveats:
   - The weak/strong telemetry is supportive but not definitive for
hypothesis (3); it may miss the exact stale-read window.
   - The patch preserves the original dispatch path, but added
diagnostics (especially barriered reads in clear paths) can perturb
timing, so repro rate may shift.

What to look for:
Hypothesis (1) self-clear:
ProcArrayEndTransaction... self_is_target=t ... clearing
pendingRecoveryConflicts=0x10 (or ProcArrayClearTransaction self case)
for target pid/procno before target ProcessInterrupts reports handler
fired but pending=0.

Hypothesis (2) group-clear-on-behalf:
ProcArrayEndTransactionInternal... self_is_target=f ...
target_procno=<walsender procno> ... clearing
pendingRecoveryConflicts=0x10 (or ProcArrayClearTransaction with
self_is_target=f).

Hypothesis (3) visibility gap (suggestive):
no matching clear-site log for target procno, but target still reports
handler fired with pending=0.

If this run is inconclusive, I suggest a direct behavioral A/B as next step:
   - change only the ProcessInterrupts() recovery-conflict check from
pg_atomic_read_u32() to pg_atomic_read_membarrier_u32();
   - if failures disappear, that strongly supports hypothesis (3).

--
Best,
Xuneng

