Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers
| From | Xuneng Zhou |
|---|---|
| Subject | Re: Implement waiting for wal lsn replay: reloaded |
| Date | |
| Msg-id | CABPTF7UtCZW4EcOaTDnBgMxmdsx9RS_d5Q+LbfroQYLzK2g__A@mail.gmail.com |
| In response to | Re: Implement waiting for wal lsn replay: reloaded (Alexander Korotkov <aekorotkov@gmail.com>) |
| Responses | Re: Implement waiting for wal lsn replay: reloaded; Re: Implement waiting for wal lsn replay: reloaded |
| List | pgsql-hackers |
Hi,

On Tue, Jan 6, 2026 at 7:54 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Jan 6, 2026 at 9:29 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > On Tue, Jan 6, 2026 at 1:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Could this be causing the recent flapping failures on CI/macOS in
> > > recovery/031_recovery_conflict? I didn't have time to dig personally
> > > but f30848cb looks relevant:
> > >
> > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > error running SQL: 'psql:<stdin>:1: ERROR: canceling statement due to
> > > conflict with recovery
> > > DETAIL: User was or might have been using tablespace that must be dropped.'
> > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > --dbname port=25195
> > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > line 2300.
> > >
> > > https://cirrus-ci.com/task/5771274900733952
> > >
> > > The master branch in time-descending order, macOS tasks only:
> > >
> > >      task_id      | substring |  status
> > > ------------------+-----------+-----------
> > >  6460882231754752 | c970bdc0  | FAILED
> > >  5771274900733952 | 6ca8506e  | FAILED
> > >  6217757068361728 | 63ed3bc7  | FAILED
> > >  5980650261446656 | ae283736  | FAILED
> > >  6585898394976256 | 5f13999a  | COMPLETED
> > >  4527474786172928 | 7f9acc9b  | COMPLETED
> > >  4826100842364928 | e8d4e94a  | COMPLETED
> > >  4540563027918848 | b9ee5f2d  | FAILED
> > >  6358528648019968 | c5af141c  | FAILED
> > >  5998005284765696 | e212a0f8  | COMPLETED
> > >  6488580526178304 | b85d5dc0  | FAILED
> > >  5034091344560128 | 7dc95cc3  | ABORTED
> > >  5688692477526016 | bb048e31  | COMPLETED
> > >  5481187977723904 | d351063e  | COMPLETED
> > >  5101831568752640 | f30848cb  | COMPLETED  <-- the change
> > >  6395317408497664 | 3f33b63d  | COMPLETED
> > >  6741325208354816 | 877ae5db  | COMPLETED
> > >  4594007789010944 | de746e0d  | COMPLETED
> > >  6497208998035456 | 461b8cc9  | COMPLETED
> >
> > Thanks for raising this issue. I think it is related to f30848cb after
> > some analysis. I'll prepare a follow-up patch to fix it.
>
> Sorry, I've mistakenly referenced this report from commit [1]. I
> thought it was related, but it appears to be not. [1] is related to
> the report I've got from Ruikai Peng off-list.
>
> Regarding the present failure, could it happen before ExecWaitStmt()
> calls PopActiveSnapshot() and InvalidateCatalogSnapshot()? If so, we
> should do preliminary efforts to release these snapshots.
>
> 1. https://git.postgresql.org/pg/commitdiff/bf308639bfcfa38541e24733e074184153a8ab7f

I agree that moving PopActiveSnapshot() and InvalidateCatalogSnapshot()
to the very beginning of ExecWaitStmt() looks like a sensible
optimization. However, in this particular failure scenario, it may not
address the issue. For tablespace conflicts, recovery conflict
resolution uses GetConflictingVirtualXIDs(InvalidTransactionId,
InvalidOid), which returns all active backends regardless of their
snapshot state. As a result, even if all snapshots were released at the
start of ExecWaitStmt(), the session would still be canceled during
replay of DROP TABLESPACE.
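For reference, the relevant calls in ResolveRecoveryConflictWithTablespace()
(src/backend/storage/ipc/standby.c) look roughly like this on current master
(quoting from memory, so minor details may differ):

    VirtualTransactionId *temp_file_users;

    /*
     * Both arguments are "invalid", so every backend with an active
     * virtual xid matches, whether or not it holds any snapshot.
     */
    temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
                                                InvalidOid);
    ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
                                           PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
                                           WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE,
                                           true);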
Given this, I am considering handling this conflict class explicitly:
if the WAIT FOR statement is terminated and the error indicates a
recovery conflict, we fall back to the existing polling-based approach.

The comment in ResolveRecoveryConflictWithTablespace() says:

     * Ask everybody to cancel their queries immediately so we can ensure no
     * temp files remain and we can remove the tablespace. Nuke the entire
     * site from orbit, it's the only way to be sure.
     *
     * XXX: We could work out the pids of active backends using this
     * tablespace by examining the temp filenames in the directory. We would
     * then convert the pids into VirtualXIDs before attempting to cancel
     * them.

I am also wondering whether this optimization would be helpful; a rough
sketch of what it might look like is below.
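This is untested and only lightly checked against the tree; the
directory layout, the pid-to-vxid conversion, and the function name
CancelTablespaceTempFileUsers() are my assumptions:

    #include "postgres.h"

    #include "common/relpath.h"
    #include "storage/fd.h"

    /*
     * Untested sketch of the XXX idea above: scan the tablespace's temp
     * directory and cancel only the backends whose pids appear in temp
     * file names, instead of nuking every active backend.
     */
    static void
    CancelTablespaceTempFileUsers(Oid tsid)
    {
        char        dirpath[MAXPGPATH];
        DIR        *dir;
        struct dirent *de;

        /* pg_tblspc/<tsid>/<version directory>/pgsql_tmp */
        snprintf(dirpath, sizeof(dirpath), "pg_tblspc/%u/%s/%s",
                 tsid, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);

        dir = AllocateDir(dirpath);
        while ((de = ReadDirExtended(dir, dirpath, LOG)) != NULL)
        {
            int         pid;

            /* temp files are named pgsql_tmp<PID>.<counter> */
            if (sscanf(de->d_name, PG_TEMP_FILE_PREFIX "%d", &pid) != 1)
                continue;

            /*
             * Here we would map pid to a VirtualTransactionId (e.g. via
             * BackendPidGetProc() and GET_VXID_FROM_PGPROC()), collect
             * the vxids into a waitlist, and hand that to
             * ResolveRecoveryConflictWithVirtualXIDs() instead of the
             * result of GetConflictingVirtualXIDs(InvalidTransactionId,
             * InvalidOid).
             */
            elog(LOG, "tablespace %u has temp files from pid %d",
                 tsid, pid);
        }
        FreeDir(dir);
    }

One open question is the race against backends creating new temp files
(or pids being recycled) between the directory scan and the
cancellation.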
--
Best,
Xuneng