Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: Implement waiting for wal lsn replay: reloaded
Date
Msg-id CABPTF7Ub=w7CRxi3sNv8oMGMh4hCqUTohuiTuP9Y1DpxRuFtRQ@mail.gmail.com
In response to Re: Implement waiting for wal lsn replay: reloaded  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: Implement waiting for wal lsn replay: reloaded
List pgsql-hackers
Hi,

On Thu, Jan 8, 2026 at 10:19 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Wed, Jan 7, 2026 at 6:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:
> > > > Could this be causing the recent flapping failures on CI/macOS in
> > > > recovery/031_recovery_conflict?  I didn't have time to dig personally
> > > > but f30848cb looks relevant:
> > > >
> > > > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > > > error running SQL: 'psql:<stdin>:1: ERROR:  canceling statement due to
> > > > conflict with recovery
> > > > DETAIL:  User was or might have been using tablespace that must be dropped.'
> > > > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > > > --dbname port=25195
> > > > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > > > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > > > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > > > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > > > line 2300.
> > > >
> > > > https://cirrus-ci.com/task/5771274900733952
> > > >
> > > > The master branch in time-descending order, macOS tasks only:
> > > >
> > > >      task_id      | substring |  status
> > > > ------------------+-----------+-----------
> > > >  6460882231754752 | c970bdc0  | FAILED
> > > >  5771274900733952 | 6ca8506e  | FAILED
> > > >  6217757068361728 | 63ed3bc7  | FAILED
> > > >  5980650261446656 | ae283736  | FAILED
> > > >  6585898394976256 | 5f13999a  | COMPLETED
> > > >  4527474786172928 | 7f9acc9b  | COMPLETED
> > > >  4826100842364928 | e8d4e94a  | COMPLETED
> > > >  4540563027918848 | b9ee5f2d  | FAILED
> > > >  6358528648019968 | c5af141c  | FAILED
> > > >  5998005284765696 | e212a0f8  | COMPLETED
> > > >  6488580526178304 | b85d5dc0  | FAILED
> > > >  5034091344560128 | 7dc95cc3  | ABORTED
> > > >  5688692477526016 | bb048e31  | COMPLETED
> > > >  5481187977723904 | d351063e  | COMPLETED
> > > >  5101831568752640 | f30848cb  | COMPLETED <-- the change
> > > >  6395317408497664 | 3f33b63d  | COMPLETED
> > > >  6741325208354816 | 877ae5db  | COMPLETED
> > > >  4594007789010944 | de746e0d  | COMPLETED
> > > >  6497208998035456 | 461b8cc9  | COMPLETED
> > >
> > > The failure rates of this are very high - the majority of the CI runs on the
> > > postgres/postgres repos failed since the change went in. Which then also means
> > > cfbot has a very high spurious failure rate. I think we need to revert this
> > > change until the problem has been verified as fixed.
> >
> > This specific failure can be reproduced with this patch v1.
> >
> > I guess the potential race condition is: when
> > wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
> > tablespace conflict fires during that wait, the WAIT FOR LSN session
> > is killed even though it doesn't use the tablespace.
> >
> > In my test, the failure won't occur after applying the v2 patch.
>
> I see, you were right.  This is not related to the MyProc->xmin.
> ResolveRecoveryConflictWithTablespace() calls
> GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid).  That
> would kill WAIT FOR LSN query independently on its xmin.

I think the concern is valid --- conflicts like
PROCSIG_RECOVERY_CONFLICT_SNAPSHOT could occur and terminate the
backend if the timing is unlucky. It's more difficult to reproduce
though. A check for the log containing "conflict with recovery" would
likely catch these conflicts as well.

> I guess your
> patch is the only way to go.  It's clumsy to wrap WAIT FOR LSN call
> with retry loop, but it would still consume less resources than
> polling.
>

Recovery conflicts should be relatively rare in TAP tests. The
exceptions are tests explicitly designed to trigger them, like
031_recovery_conflict, and the narrow timing window in which WAIT FOR
LSN gets invoked while the standby has not yet caught up. Given that,
a simple fallback seems appropriate to me.
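
To make the fallback idea concrete, here is a rough sketch. It is
written in Python rather than the Perl used in Cluster.pm, and it is
not the actual patch. WaitError, wait_for_replay, wait_for_lsn and
poll_caught_up are all hypothetical names introduced for illustration;
the only thing taken from the failure report is the "conflict with
recovery" message text.

```python
import time
from typing import Callable

# Message text taken from the reported failure output.
CONFLICT_MARKER = "conflict with recovery"


class WaitError(Exception):
    """Hypothetical error carrying the stderr text of a failed WAIT FOR LSN."""


def wait_for_replay(
    wait_for_lsn: Callable[[], None],
    poll_caught_up: Callable[[], bool],
    max_polls: int = 1800,
    poll_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try WAIT FOR LSN once; fall back to polling on a recovery conflict.

    Both callables are assumptions: wait_for_lsn stands in for running
    the WAIT FOR LSN statement on the standby, and poll_caught_up stands
    in for a query that checks whether replay has passed the target LSN.
    Returns which path succeeded.
    """
    try:
        wait_for_lsn()
        return "wait_for_lsn"
    except WaitError as err:
        # Only a recovery conflict triggers the fallback; anything else
        # is a real failure and is re-raised.
        if CONFLICT_MARKER not in str(err):
            raise

    # Fallback: the waiting session was cancelled, so poll instead.
    for _ in range(max_polls):
        if poll_caught_up():
            return "polling"
        sleep(poll_interval)
    raise TimeoutError("standby did not catch up before the poll limit")
```

The point of this shape is that the common case still uses the cheap
WAIT FOR LSN path, and polling only happens after the session has
actually been cancelled by a conflict.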

--
Best,
Xuneng


