Thread: Re: Improve pg_sync_replication_slots() to wait for primary to advance

On Tue, Jun 24, 2025 at 4:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Hello,
>
> Creating this thread for a POC based on discussions in thread [1].
> Hou-san had created this patch, and I just cleaned up some documents,
> did some testing and now sharing the patch here.
>
> In this patch, the pg_sync_replication_slots() API now waits
> indefinitely for the remote slot to catch up. We could later add a
> timeout parameter to control maximum wait time if this approach seems
> acceptable. If there are more ideas on improving this patch, let me
> know.

+1 on the idea.
I believe the timeout option may not be necessary here, since the API
can be manually canceled if needed. Otherwise, the recommended
approach is to let it complete. But I would like to know what others
think here.

A few comments:

1)
While the API is waiting for the primary to advance, the standby fails
to handle promotion requests. Promotion fails:
./pg_ctl -D ../../standbydb/ promote -w
waiting for server to promote.................stopped waiting
pg_ctl: server did not promote in time

See the logs at [1]
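
As a rough illustration of the behaviour I would expect (a standalone
sketch, not code from the patch; wait_for_remote_slot, SimState and the
result codes are made-up names, and max_polls exists only so that the
simulation terminates), the wait loop could check for a pending
promotion on every iteration before it goes back to waiting:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes; not taken from the patch. */
typedef enum
{
    WAIT_CAUGHT_UP,
    WAIT_STOPPED_FOR_PROMOTION,
    WAIT_GAVE_UP
} WaitResult;

/* Hypothetical stand-in for the state a real wait loop would poll. */
typedef struct
{
    int polls_until_caught_up;  /* remote slot catches up at this poll */
    int polls_until_promote;    /* promote request arrives at this poll; -1 = never */
} SimState;

static WaitResult
wait_for_remote_slot(const SimState *st, int max_polls)
{
    assert(max_polls > 0);

    for (int poll = 0; poll < max_polls; poll++)
    {
        /* Check for a pending promotion first so it is never starved. */
        if (st->polls_until_promote >= 0 && poll >= st->polls_until_promote)
            return WAIT_STOPPED_FOR_PROMOTION;

        if (poll >= st->polls_until_caught_up)
            return WAIT_CAUGHT_UP;

        /* A real loop would sleep here; the sketch just iterates. */
    }

    return WAIT_GAVE_UP;
}
```

How the promotion request would actually be detected in the backend is
left open here; the point is only that the loop has an exit path for it.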

2)
Also, when the API waits for a long time, it emits the
'waiting for remote slot ...' LOG message only once. Do you think it
makes sense to log it at a regular interval until the wait is over? See
the logs at [1]: the message was logged only once in about 3 minutes.
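
To make the suggestion concrete, here is a minimal standalone sketch of
interval-based logging (not code from the patch; wait_log_due,
count_wait_logs and the 30-second interval used below are assumptions
for illustration only):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when a "still waiting" message is due and records the
 * time it was emitted. *last_log_ms < 0 means nothing was logged yet.
 */
static bool
wait_log_due(long now_ms, long *last_log_ms, long interval_ms)
{
    assert(interval_ms > 0);

    if (*last_log_ms < 0 || now_ms - *last_log_ms >= interval_ms)
    {
        *last_log_ms = now_ms;
        return true;
    }
    return false;
}

/*
 * Counts how many messages a wait of total_ms would emit when the loop
 * wakes up every poll_ms.
 */
static int
count_wait_logs(long total_ms, long poll_ms, long interval_ms)
{
    long last = -1;
    int  n = 0;

    assert(poll_ms > 0);

    for (long now = 0; now <= total_ms; now += poll_ms)
    {
        if (wait_log_due(now, &last, interval_ms))
            n++;
    }
    return n;
}
```

With these assumed values, count_wait_logs(206000, 1000, 30000) returns
7, so a wait of roughly the length seen in [1] would have produced seven
messages instead of one.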

3)
+ /*
+  * It is possible to get null value for restart_lsn if the slot is
+  * invalidated on the primary server, so handle accordingly.
+  */

+ if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
+ {
+     /*
+      * The slot won't be persisted by the caller; it will be cleaned up
+      * at the end of synchronization.
+      */
+     ereport(WARNING,
+             errmsg("aborting initial sync for slot \"%s\"",
+                    remote_slot->name),
+             errdetail("This slot was invalidated on the primary server."));

Which case are we referring to here, where a null restart_lsn would mean
invalidation? Can you please point me to the code where this happens, or
to a test case that triggers it? I tried a few invalidation cases but
did not hit it.


[1]:
Log file:
2025-07-02 14:38:09.851 IST [153187] LOG:  waiting for remote slot
"failover_slot" LSN (0/3003F60) and catalog xmin (754) to pass local
slot LSN (0/3003F60) and catalog xmin (767)
2025-07-02 14:38:09.851 IST [153187] STATEMENT:  SELECT
pg_sync_replication_slots();
2025-07-02 14:41:36.200 IST [153164] LOG:  received promote request

thanks
Shveta