Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAJpy0uCXM_z-Osdrfh2gTWV=LKXeUvUzOGVjMzjLsNErPopFRQ@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (Ajin Cherian <itsajin@gmail.com>) |
Responses |
Re: Improve pg_sync_replication_slots() to wait for primary to advance
|
List | pgsql-hackers |
On Wed, Jul 16, 2025 at 3:00 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Few comments: > > > > 1) > > When the API is waiting for the primary to advance, standby fails to > > handle promotion requests. Promotion fails: > > ./pg_ctl -D ../../standbydb/ promote -w > > waiting for server to promote.................stopped waiting > > pg_ctl: server did not promote in time > > > > See the logs at [1] > > > > I've modified this to handle promotion request and stop slot > synchronization if standby is promoted. > > > > 2) > > Also when the API is waiting for a long time, it just dumps the > > 'waiting for remote_slot..' LOG only once. Do you think it makes sense > > to log it at a regular interval until the wait is over? See logs at > > [1]. It dumped the log once in 3minutes. > > > > I've modified it to log once every 10 seconds. > > > 3) > > + /* > > + * It is possible to get null value for restart_lsn if the slot is > > + * invalidated on the primary server, so handle accordingly. > > + */ > > > > + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn)) > > + { > > + /* > > + * The slot won't be persisted by the caller; it will be cleaned up > > + * at the end of synchronization. > > + */ > > + ereport(WARNING, > > + errmsg("aborting initial sync for slot \"%s\"", > > + remote_slot->name), > > + errdetail("This slot was invalidated on the primary server.")); > > > > Which case are we referring to here where null restart_lsn would mean > > invalidation? Can you please point me to such code where it happens or > > a test-case which does that. I tried a few invalidation cases, but did > > not hit it. > > > > I've removed all this code as the checks for null restart_lsn and > other parameters are handled in earlier functions and we won't get > here if these had null, I've added asserts to check that it is not > null. > > > On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Please find few more comments: > > > > 1) > > In pg_sync_replication_slots() doc, we have this: > > > > "Note that this function is primarily intended for testing and > > debugging purposes and should be used with caution. Additionally, this > > function cannot be executed if ...." > > > > We can get rid of this info as well and change to: > > > > "Note that this function cannot be executed if...." > > Modified as requested. > > > > > > 2) > > We got rid of NOTE in logicaldecoding.sgml, but now the page does not > > mention pg_sync_replication_slots() at all. We need to bring back the > > change removed by [1] (or something on similar line) which is this: > > > > - <command>CREATE SUBSCRIPTION</command> during slot creation, and > > then calling > > - <link linkend="pg-sync-replication-slots"> > > - <function>pg_sync_replication_slots</function></link> > > - on the standby. By setting <link linkend="guc-sync-replication-slots"> > > + <command>CREATE SUBSCRIPTION</command> during slot creation. > > + Additionally, enabling <link linkend="guc-sync-replication-slots"> > > + <varname>sync_replication_slots</varname></link> on the standby > > + is required. By enabling <link linkend="guc-sync-replication-slots"> > > > > I've added that back. > > > > 3) > > wait_for_primary_slot_catchup(): > > + /* > > + * It is possible to get null values for confirmed_lsn and > > + * catalog_xmin if on the primary server the slot is just created with > > + * a valid restart_lsn and slot-sync worker has fetched the slot > > + * before the primary server could set valid confirmed_lsn and > > + * catalog_xmin. > > + */ > > > > Do we need this special handling? We already have one such handling in > > synchronize_slots(). please see: > > /* > > * If restart_lsn, confirmed_lsn or catalog_xmin is > > invalid but the > > * slot is valid, that means we have fetched the > > remote_slot in its > > * RS_EPHEMERAL state. In such a case, don't sync it; > > we can always > > * sync it in the next sync cycle when the remote_slot > > is persisted > > * and has valid lsn(s) and xmin values. > > */ > > if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) || > > XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) || > > !TransactionIdIsValid(remote_slot->catalog_xmin)) && > > remote_slot->invalidated == RS_INVAL_NONE) > > pfree(remote_slot); > > > > > > Due to the above check in synchronize_slots(), we will not reach > > wait_for_primary_slot_catchup() when any of confirmed_lsn or > > catalog_xmin is not initialized. > > > > Yes, you are correct. I've removed all that logic. > > The modified patch (v2) is attached. > I am not able to apply the patch to the latest head or even to a week back version. Can you please check and rebase? thanks Shveta
pgsql-hackers by date: