Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers

From Ajin Cherian
Subject Re: Improve pg_sync_replication_slots() to wait for primary to advance
Date
Msg-id CAFPTHDZFnvB_ZUv94WjXe_2_KRQKDo-nfhFa2bsq9NRBH3KVvw@mail.gmail.com
Whole thread Raw
In response to Re: Improve pg_sync_replication_slots() to wait for primary to advance  (shveta malik <shveta.malik@gmail.com>)
Responses Re: Improve pg_sync_replication_slots() to wait for primary to advance
List pgsql-hackers
On Wed, Jul 2, 2025 at 7:56 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> Few comments:
>
> 1)
> When the API is waiting for the primary to advance, standby fails to
> handle promotion requests. Promotion fails:
> ./pg_ctl -D ../../standbydb/ promote -w
> waiting for server to promote.................stopped waiting
> pg_ctl: server did not promote in time
>
> See the logs at [1]
>

I've modified this to handle promotion request and stop slot
synchronization if standby is promoted.


> 2)
> Also when the API is waiting for a long time, it just dumps the
> 'waiting for remote_slot..' LOG only once. Do you think it makes sense
> to log it at a regular interval until the wait is over? See logs at
> [1]. It dumped the log once in 3minutes.
>

I've modified it to log once every 10 seconds.

> 3)
> + /*
> + * It is possible to get null value for restart_lsn if the slot is
> + * invalidated on the primary server, so handle accordingly.
> + */
>
> + if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
> + {
> + /*
> + * The slot won't be persisted by the caller; it will be cleaned up
> + * at the end of synchronization.
> + */
> + ereport(WARNING,
> + errmsg("aborting initial sync for slot \"%s\"",
> +    remote_slot->name),
> + errdetail("This slot was invalidated on the primary server."));
>
> Which case are we referring to here where null restart_lsn would mean
> invalidation? Can you please point me to such code where it happens or
> a test-case which does that. I tried a few invalidation cases, but did
> not hit it.
>

I've removed all this code as the checks for null restart_lsn and
other parameters are handled in earlier functions and we won't get
here if these had null, I've added asserts to check that it is not
null.


On Wed, Jul 9, 2025 at 2:53 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> Please find few more comments:
>
> 1)
> In  pg_sync_replication_slots() doc, we have this:
>
> "Note that this function is primarily intended for testing and
> debugging purposes and should be used with caution. Additionally, this
> function cannot be executed if ...."
>
> We can get rid of this info as well and change to:
>
> "Note that this function cannot be executed if...."

Modified as requested.


>
> 2)
> We got rid of NOTE in logicaldecoding.sgml, but now the page does not
> mention pg_sync_replication_slots() at all. We need to bring back the
> change removed by [1] (or something on similar line) which is this:
>
> -     <command>CREATE SUBSCRIPTION</command> during slot creation, and
> then calling
> -     <link linkend="pg-sync-replication-slots">
> -     <function>pg_sync_replication_slots</function></link>
> -     on the standby. By setting <link linkend="guc-sync-replication-slots">
> +     <command>CREATE SUBSCRIPTION</command> during slot creation.
> +     Additionally, enabling <link linkend="guc-sync-replication-slots">
> +     <varname>sync_replication_slots</varname></link> on the standby
> +     is required. By enabling <link linkend="guc-sync-replication-slots">
>

I've added that back.


> 3)
> wait_for_primary_slot_catchup():
> + /*
> + * It is possible to get null values for confirmed_lsn and
> + * catalog_xmin if on the primary server the slot is just created with
> + * a valid restart_lsn and slot-sync worker has fetched the slot
> + * before the primary server could set valid confirmed_lsn and
> + * catalog_xmin.
> + */
>
> Do we need this special handling? We already have one such handling in
> synchronize_slots(). please see:
>                 /*
>                  * If restart_lsn, confirmed_lsn or catalog_xmin is
> invalid but the
>                  * slot is valid, that means we have fetched the
> remote_slot in its
>                  * RS_EPHEMERAL state. In such a case, don't sync it;
> we can always
>                  * sync it in the next sync cycle when the remote_slot
> is persisted
>                  * and has valid lsn(s) and xmin values.
>                  */
>                 if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) ||
>                          XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) ||
>                          !TransactionIdIsValid(remote_slot->catalog_xmin)) &&
>                         remote_slot->invalidated == RS_INVAL_NONE)
>                         pfree(remote_slot);
>
>
> Due to the above check in synchronize_slots(), we will not reach
> wait_for_primary_slot_catchup() when any of confirmed_lsn or
> catalog_xmin is not initialized.
>

Yes,  you are correct. I've removed all that logic.

The modified patch (v2) is attached.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: patch: Use pg_assume in jsonb_util.c to fix GCC 15 warnings
Next
From: Bertrand Drouvot
Date:
Subject: Fix lwlock.c and wait_event_names.txt discrepancy