On 24/02/2026 10:00, Alexander Lakhin wrote:
> Could you please look at a new failure of 035_standby_logical_decoding.pl
> produced at buildfarm [1]?:
> [05:34:51.345](233.134s) # poll_query_until timed out executing this query:
> # SELECT '0/0403F950' <= replay_lsn AND state = 'streaming'
> # FROM pg_catalog.pg_stat_replication
> # WHERE application_name IN ('standby', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
>
> 035_standby_logical_decoding_standby.log contains:
> 2026-02-24 05:30:58.300 CET [1512377][client backend][:0] LOG:
> disconnection: session time: 0:00:00.035 user=bf database=testdb
> host=[local]
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] LOG: invalidating
> obsolete replication slot "row_removal_inactiveslot"
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] DETAIL: The slot
> conflicted with xid horizon 748.
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] CONTEXT: WAL
> redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon:
> 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1,
> dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] LOG: terminating
> process 1512360 to release replication slot "row_removal_activeslot"
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] DETAIL: The slot
> conflicted with xid horizon 748.
> 2026-02-24 05:30:58.425 CET [1507982][startup][32/0:0] CONTEXT: WAL
> redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon:
> 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1,
> dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-02-24 05:34:51.400 CET [1508227][walreceiver][:0] FATAL: could not
> receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> The "terminating process ..." message doesn't appear when the test passes
> successfully.
Hmm, right, it looks like something is wrong in signaling the recovery
conflict. I can't tell whether the signal is not being sent, or whether
it's sent but not processed correctly. Looking at the code, I don't see
anything wrong.
> I've managed to reproduce this with multiple (20) test instances running
> in a loop (it failed within 10 iterations for me); `git bisect` for this
> anomaly pointed at 17f51ea81.
I've been trying to reproduce this locally, but so far no success, even
after thousands of iterations.
If you can still reproduce this, could you try it with the code changes
from the attached recovery-conflict-fail-extra-logging.patch, which adds
some extra logging, and send over the logs, please?
The recovery-conflict-fail-repro-attempt.patch contains very hacky
changes to the test, to run just the failing part 100 times in a loop.
That's just to show what I used to try to reproduce this, but no luck.
- Heikki