Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uCCUkweQad2U6n0KRcrjEJe-MzR-Nsw4d6bwQSdf1N8EA@mail.gmail.com
Whole thread
In response to RE: Improve pg_sync_replication_slots() to wait for primary to advance  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses RE: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Mon, Feb 9, 2026 at 11:36 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, February 2, 2026 5:10 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Mon, Feb 2, 2026 at 12:56 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > Here are the updated patches.
> > >
> >
> > Thanks for the patch. Few trivial comments:
> >
> > 1)
> > + * and if the slots are not ready to be synced because of any of the
> > + reasons
> > + * mentioned above, then the SQL function also waits and retries
> > until the slots
> > + * are synchronized to the latest information. Refer to the comments
> > + * in SyncReplicationSlots() for more details.
> >
> > We can make it slightly more clear by mentioning that it waits only for the
> > slots which existed at the start of function:
> >
> > "...then the SQL function also waits and retries until the failover slots that
> > existed on primary at the start of the function call are synchronized."
>
> Improved.
>
> >
> > 2)
> >
> > We have below comment atop SyncReplicationSlots:
> >
> >  * Repeatedly fetches and updates replication slot information from the
> >  * primary until all slots are at least "sync ready".
> >
> > We shall change this too, as now we are not only waiting for them to be sync-
> > ready. Even if they are sync-ready, this function can still wait in subsequent
> > runs for different reasons.
>
> Right, fixed.
>
> >
> > 3)
> >
> > Existing test in test file:
> > ##################################################
> > # Test that pg_sync_replication_slots() on the standby skips and retries # until
> > the slot becomes sync-ready (when the remote slot catches up with # the
> > locally reserved position).
> > # Also verify that slotsync skip statistics are correctly updated when the #
> > slotsync operation is skipped.
> > ##################################################
> >
> > New one added says:
> > +##################################################
> > +# Test that when physical replication lags behind logical replication,
> > +# pg_sync_replication_slots() on the standby skips and retries until
> > +physical # replication catches up. Also verify that slotsync skip
> > +statistics are # correctly updated when the slotsync operation is skipped.
> > +##################################################
> >
> > The "need" of this new test case isn't very clear provided we already have
> > testcase1. Perhaps we could revise the comment to something like:
> >
> > "Test that even for a sync-ready slot, when physical replication lags behind
> > logical replication, pg_sync_replication_slots() on the standby skips........"
>
> Adjusted the comments as suggested.
>
> In addition to addressing the comments, I revisited the recently updated
> slotsync code and noticed opportunities to simplify some parameters, checks, and
> codes. This will also facilitate the improvement in v2-0001 coding.
>
> * Previously, certain function parameters(found_consistent_snapshot,
> remote_slot_precedes of update_local_synced_slot()) were used to store the
> reason for slot synchronization being skipped. However, now that a slot property
> serves this purpose, we can simplify the code by eliminating those redundant
> parameters and using the slot's property to perform the same check.
>
> * The slot synchronization is skipped if the required WAL has not been received
> and flushed. Previously, this check[1] was performed in two separate code paths.
> Such duplication can lead to coding errors if changes are made in one location
> without updating the other, as exemplified by the issue fixed in commit 3df4df5.
> This commit consolidates the check into a single location to eliminate
> redundancies and reduce the potential for future errors.
>
> To address these points, I have created two additional patches: V3-0001 for the
> first point and V3-0002 for the second. V3-0003 contains the current improvement
> being discussed, and it's also simplified thanks to the preceding patches.
>
>
> [1]
>                 /*
>                  * ...
>                  */
>                 if (remote_slot->confirmed_lsn > latestFlushPtr)
>                 {
>                         update_slotsync_skip_stats(SS_SKIP_WAL_NOT_FLUSHED);
>
>                         /*
>                          * Can get here only if GUC 'synchronized_standby_slots' on the
>                          * primary server was not configured correctly.
>                          */
>                         ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
>                                         ...));
>                 }
>
>

I like the idea of both the new patches. Please find a few trivial comments:

patch002:
1)
Earlier at both the places where we were updating
'SS_SKIP_WAL_NOT_FLUSHED', we were returning slot_updated as false,
now, we might end up returning it as true (specially at second
occurrence). Is this intentional?

2)
In update_and_persist_local_synced_slot(), we can reach this even when
wal_not_flushed, so we shall to update comment:
        if (slot->slotsync_skip_reason != SS_SKIP_NONE)
        {
                /*
                 * We reach here when the remote slot didn't catch up to locally
                 * reserved position, or it cannot reach the
consistent point from the
                 * restart_lsn.
....
*/

Patch003:
3)
+ if (slotsync_pending && slot->slotsync_skip_reason != SS_SKIP_NONE)
+ *slotsync_pending = true;

Here shall we ensure by a sanity check that slotsync_skip_reason !=
SS_SKIP_INVALID? And please bring back the comment as well, which was
there in an earlier patch which stated the reason for not including
'SS_SKIP_INVALID' here.

thanks
Shveta



pgsql-hackers by date:

Previous
From: Shinya Kato
Date:
Subject: Re: Use pg_current_xact_id() instead of deprecated txid_current()
Next
From: Michael Paquier
Date:
Subject: Re: Miscellaneous message fixes