Re: Refactor recovery conflict signaling a little - Mailing list pgsql-hackers
| From | Xuneng Zhou |
|---|---|
| Subject | Re: Refactor recovery conflict signaling a little |
| Date | |
| Msg-id | CABPTF7X0j91LFh_Rk3V01rhGURSRp27M0Ts3YGe2vYPFHaijFg@mail.gmail.com |
| In response to | Re: Refactor recovery conflict signaling a little (Alexander Lakhin <exclusion@gmail.com>) |
| List | pgsql-hackers |
Hi Alexander,

On Sat, Mar 7, 2026 at 7:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Xuneng and Heikki,
>
> 04.03.2026 07:33, Xuneng Zhou wrote:
> > 03.03.2026 17:39, Heikki Linnakangas wrote:
> > > On 24/02/2026 10:00, Alexander Lakhin wrote:
> > > > The "terminating process ..." message doesn't appear when the test
> > > > passes successfully.
> > >
> > > Hmm, right, looks like something wrong in signaling the recovery
> > > conflict. I can't tell if the signal is being sent, or it's not
> > > processed correctly. Looking at the code, I don't see anything wrong.
> >
> > I was unable to reproduce the issue on an x86_64 Linux machine using
> > the provided script. All test runs completed successfully without any
> > failures.
>
> I've added debug logging (see attached) and saw the following:
> !!!SignalRecoveryConflict[282363]
> !!!ProcArrayEndTransaction| pendingRecoveryConflicts = 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> !!!ProcessInterrupts[283863]| MyProc->pendingRecoveryConflicts: 0
> 2026-03-07 12:21:24.544 EET walreceiver[282421] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> 2026-03-07 12:21:24.645 EET postmaster[282355] LOG: received immediate shutdown request
> 2026-03-07 12:21:24.647 EET postmaster[282355] LOG: database system is shut down
>
> While for a successful run, I see:
> 2026-03-07 12:18:17.075 EET startup[285260] DETAIL: The slot conflicted with xid horizon 677.
> 2026-03-07 12:18:17.075 EET startup[285260] CONTEXT: WAL redo at 0/04022130 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 677, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 2, nunused: 0, dead: [35, 36]; blkref #0: rel 1663/16384/16418, blk 10
> !!!SignalRecoveryConflict[285260]
> !!!ProcessInterrupts[286071]| MyProc->pendingRecoveryConflicts: 16
> !!!ProcessRecoveryConflictInterrupts[286071]
> !!!ProcessRecoveryConflictInterrupts[286071] pending: 16, reason: 4
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl ERROR: canceling statement due to conflict with recovery
> 2026-03-07 12:18:17.075 EET walsender[286071] 035_standby_logical_decoding.pl DETAIL: User was using a logical replication slot that must be invalidated.
>
> (Full logs for this failed run and a good run are attached.)

Thanks again for the logs. I think we might frame the next run around three
explicit hypotheses:

1) Self-clear: the target walsender clears its own pendingRecoveryConflicts
   during xact cleanup before ProcessInterrupts() consumes it.
2) Group-clear-on-behalf: another backend clears the target PGPROC's mask
   via the group-clear path before it is consumed.
3) Visibility gap: the weak pg_atomic_read_u32() in the interrupt path
   observes 0 even though the conflict bit was set.

The bad vs. good traces are consistent with a receiver-side handoff failure:

- bad run: the signal path is reached, but the target walsender later sees
  pendingRecoveryConflicts = 0 and never dispatches the conflict;
- good run: the target sees pending = 0x10, dispatches reason 4
  (RECOVERY_CONFLICT_LOGICALSLOT), throws the expected ERROR, and releases
  the slot.
I prepared an updated diagnostics patch to classify hypotheses (1) and (2)
directly:

- SignalRecoveryConflict: logs target pid/procno and old/new mask
- clear sites: log self_pid, target_pid, target_procno, self_is_target
- ProcessInterrupts: logs handler counter + pending-mask observations

Two caveats:

- The weak/strong telemetry is supportive but not definitive for
  hypothesis (3); it may miss the exact stale-read window.
- The patch preserves the original dispatch path, but the added diagnostics
  (especially barriered reads in the clear paths) can perturb timing, so
  the reproduction rate may shift.

What to look for:

Hypothesis (1), self-clear: ProcArrayEndTransaction ... self_is_target=t ...
clearing pendingRecoveryConflicts=0x10 (or the ProcArrayClearTransaction
self case) for the target pid/procno, before the target's ProcessInterrupts
reports that the handler fired but pending=0.

Hypothesis (2), group-clear-on-behalf: ProcArrayEndTransactionInternal ...
self_is_target=f ... target_procno=<walsender procno> ... clearing
pendingRecoveryConflicts=0x10 (or ProcArrayClearTransaction with
self_is_target=f).

Hypothesis (3), visibility gap (suggestive only): no matching clear-site
log for the target procno, yet the target still reports that the handler
fired with pending=0.

If this run is inconclusive, I suggest a direct behavioral A/B test as the
next step:

- change only the ProcessInterrupts() recovery-conflict check from
  pg_atomic_read_u32() to pg_atomic_read_membarrier_u32();
- if the failures disappear, that strongly supports hypothesis (3).

--
Best,
Xuneng