Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: Implement waiting for wal lsn replay: reloaded
Msg-id CABPTF7W-gaO=FAkhda=_pDQJjLne68ioNHHU8vuB4iEnswR1=w@mail.gmail.com
In response to Re: Implement waiting for wal lsn replay: reloaded  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Hi Alexander,

On Wed, Apr 29, 2026 at 5:01 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Apr 21, 2026 at 7:03 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 2:46 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > The updated patchset is attached.  It includes improved coverage as
> > > suggested by Andres upthread.  And documentation that WAIT FOR LSN is
> > > timeline-blind (per off-list discussion with Xuneng).
> >
> > I revised the test patch 6 to make the new cases check the intended
> > WAIT FOR behavior more directly, and to avoid cases where the test
> > could pass for the wrong reason.
> >
> > The fresh walreceiver restart test now distinguishes what we can
> > observe from what is only covered indirectly.
> > 'pg_last_wal_receive_lsn()' reports 'flushedUpto', not 'writtenUpto',
> > so the test now describes that state accurately and covers
> > 'writtenUpto' through the 'standby_write' result. This seems
> > appropriate to me since the two positions are seeded in the same
> > places and under the same conditions, so the test for the flush LSN
> > should also help verify the write LSN.
> >
> > The fencepost tests were split by the actual frontier being tested.
> > 'standby_replay' uses 'pg_last_wal_replay_lsn()', while
> > 'standby_flush' uses 'pg_last_wal_receive_lsn()'. This avoids treating
> > a replay-derived LSN as if it were also the exact write/flush
> > boundary. I left 'standby_write' out of the exact fencepost helper
> > because its frontier is not SQL-visible once walreceiver is stopped.
> > The async wakeup case now starts the waiter while replay is still
> > paused, so it must actually sleep before replay and walreceiver are
> > allowed to advance.
> >
> > The cascading timeline-switch test now checks the 'WAIT FOR ...
> > NO_THROW' status from background psql stdout. The previous log-marker
> > pattern could pass after an unexpected returned status, including
> > 'timeout', because the following statement would still run. The
> > 'received_tli > 1' check remains, but only as confirmation that the
> > downstream followed the new timeline; the 'success' status proves the
> > wait completed as intended.
> >
> > Please check it.
>
> LGTM, I've added some comments for new functions in 0006.  I propose
> to push this patchset.  Probably something is still missing and we
> will have to go back to this.  But it seems to make a lot of aspects
> much better.

I reviewed the patchset and found a potential issue in the test for
patch 5, similar to the log-checking problem in the cascading
timeline-switch test. I've applied a minor fix to address it. Other
parts LGTM.
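
To illustrate the log-checking problem described for the cascading
timeline-switch test above, here is a minimal toy model in Python. It is
not the TAP test code: run_session(), marker_check() and status_check()
are hypothetical names, and the only behavior assumed is what the thread
states, namely that a NO_THROW wait returns a status such as 'success'
or 'timeout' instead of raising an error, so the statement after it
still runs.

```python
# Toy model only: these helpers are hypothetical stand-ins, not
# PostgreSQL or TAP-test APIs.

def run_session(wait_status):
    """Simulate a background session that runs a non-throwing wait,
    then a follow-up statement that writes a marker to the log."""
    stdout = [wait_status]        # the wait reports its status as output
    log = ["marker: wait done"]   # the next statement runs regardless
    return stdout, log


def marker_check(log):
    # Weak check: only proves that the statement after the wait ran.
    return any("wait done" in line for line in log)


def status_check(stdout):
    # Strict check: the wait itself must have reported 'success'.
    return stdout[0] == "success"


for status in ("success", "timeout"):
    stdout, log = run_session(status)
    print(status, marker_check(log), status_check(stdout))
```

In this model the marker check is true for both statuses, while the
status check is true only for 'success', which is why checking the
returned status is the stronger assertion.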

Best,
Xuneng
Attachment