Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From shveta malik
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAJpy0uCXM_z-Osdrfh2gTWV=LKXeUvUzOGVjMzjLsNErPopFRQ@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (Ajin Cherian <itsajin@gmail.com>)
Responses Re: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Wed, Jul 16, 2025 at 3:00 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Few comments:
> >
> > 1)
> > When the API is waiting for the primary to advance, standby fails to
> > handle promotion requests. Promotion fails:
> > ./pg_ctl -D ../../standbydb/ promote -w
> > waiting for server to promote.................stopped waiting
> > pg_ctl: server did not promote in time
> >
> > See the logs at [1]
> >
>
> I've modified this to handle promotion request and stop slot
> synchronization if standby is promoted.
>
>
> > 2)
> > Also when the API is waiting for a long time, it just dumps the
> > 'waiting for remote_slot..' LOG only once. Do you think it makes sense
> > to log it at a regular interval until the wait is over? See logs at
> > [1]. It dumped the log once in 3minutes.
> >
>
> I've modified it to log once every 10 seconds.
>
> > 3)
> > + /*
> > + * It is possible to get null value for restart_lsn if the slot is
> > + * invalidated on the primary server, so handle accordingly.
> > + */
> >
> > + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
> > + {
> > + /*
> > + * The slot won't be persisted by the caller; it will be cleaned up
> > + * at the end of synchronization.
> > + */
> > + ereport(WARNING,
> > + errmsg("aborting initial sync for slot \"%s\"",
> > +    remote_slot->name),
> > + errdetail("This slot was invalidated on the primary server."));
> >
> > Which case are we referring to here where null restart_lsn would mean
> > invalidation? Can you please point me to such code where it happens or
> > a test-case which does that. I tried a few invalidation cases, but did
> > not hit it.
> >
>
> I've removed all this code as the checks for null restart_lsn and
> other parameters are handled in earlier functions and we won't get
> here if these had null, I've added asserts to check that it is not
> null.
>
>
> On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Please find few more comments:
> >
> > 1)
> > In  pg_sync_replication_slots() doc, we have this:
> >
> > "Note that this function is primarily intended for testing and
> > debugging purposes and should be used with caution. Additionally, this
> > function cannot be executed if ...."
> >
> > We can get rid of this info as well and change to:
> >
> > "Note that this function cannot be executed if...."
>
> Modified as requested.
>
>
> >
> > 2)
> > We got rid of NOTE in logicaldecoding.sgml, but now the page does not
> > mention pg_sync_replication_slots() at all. We need to bring back the
> > change removed by [1] (or something on similar line) which is this:
> >
> > -     <command>CREATE SUBSCRIPTION</command> during slot creation, and
> > then calling
> > -     <link linkend="pg-sync-replication-slots">
> > -     <function>pg_sync_replication_slots</function></link>
> > -     on the standby. By setting <link linkend="guc-sync-replication-slots">
> > +     <command>CREATE SUBSCRIPTION</command> during slot creation.
> > +     Additionally, enabling <link linkend="guc-sync-replication-slots">
> > +     <varname>sync_replication_slots</varname></link> on the standby
> > +     is required. By enabling <link linkend="guc-sync-replication-slots">
> >
>
> I've added that back.
>
>
> > 3)
> > wait_for_primary_slot_catchup():
> > + /*
> > + * It is possible to get null values for confirmed_lsn and
> > + * catalog_xmin if on the primary server the slot is just created with
> > + * a valid restart_lsn and slot-sync worker has fetched the slot
> > + * before the primary server could set valid confirmed_lsn and
> > + * catalog_xmin.
> > + */
> >
> > Do we need this special handling? We already have one such handling in
> > synchronize_slots(). please see:
> >                 /*
> >                  * If restart_lsn, confirmed_lsn or catalog_xmin is
> > invalid but the
> >                  * slot is valid, that means we have fetched the
> > remote_slot in its
> >                  * RS_EPHEMERAL state. In such a case, don't sync it;
> > we can always
> >                  * sync it in the next sync cycle when the remote_slot
> > is persisted
> >                  * and has valid lsn(s) and xmin values.
> >                  */
> >                 if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
> >                          XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
> >                          !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
> >                         remote_slot->invalidated == RS_INVAL_NONE)
> >                         pfree(remote_slot);
> >
> >
> > Due to the above check in synchronize_slots(), we will not reach
> > wait_for_primary_slot_catchup() when any of confirmed_lsn or
> > catalog_xmin is not initialized.
> >
>
> Yes,  you are correct. I've removed all that logic.
>
> The modified patch (v2) is attached.
>

I am not able to apply the patch to the latest head or even to a week
back version. Can you please check and rebase?

thanks
Shveta



pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Fix lwlock.c and wait_event_names.txt discrepancy
Next
From: Ajin Cherian
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance