Thread: Re: Replication slot is not able to sync up

Re: Replication slot is not able to sync up

From

Amit Kapila

Date:

23 May, 07:55:15

On Fri, May 23, 2025 at 9:57 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:

Hi,

I noticed the behaviour below, where a replication slot cannot be synced if any catalog changes happened after its creation.
I get the LOG below when trying to sync replication slots using the pg_sync_replication_slots() function.
The newly created slot does not appear on the standby after this LOG -

2025-05-23 07:57:12.453 IST [4178805] LOG: could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL: The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT: SELECT pg_sync_replication_slots();

Below is the test case tried on latest master branch -
=========
- Create the Primary and start the server.
wal_level = logical

- Create the physical slot on Primary.
SELECT pg_create_physical_replication_slot('slot1');

- Setup the standby using pg_basebackup.
bin/pg_basebackup -D data1 -p 5418 -d "dbname=postgres" -R

primary_slot_name = 'slot1'
hot_standby_feedback = on
port = 5419

-- Start the standby.

-- Connect to Primary and create a logical replication slot.
SELECT pg_create_logical_replication_slot('failover_slot', 'pgoutput', false, false, true);

postgres@4177929=#select xmin,* from pg_replication_slots ;
xmin | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | two_phase_at | inactive_since | conflicting | invalidation_reason | failover | synced
------+---------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+--------------+----------------------------------+-------------+---------------------+----------+--------
765 | slot1 | | physical | | | f | t | 4177898 | 765 | | 0/B018B00 | | reserved | | f | | | | | f | f
| failover_slot | pgoutput | logical | 5 | postgres | f | f | | | 764 | 0/B000060 | 0/B000098 | reserved | | f | | 2025-05-23 07:55:31.277584+05:30 | f | | t | f
(2 rows)

-- Perform some catalog changes. e.g.:
create table abc(id int);
postgres@4179034=#select xmin from pg_class where relname='abc';
xmin
------
764
(1 row)

-- Connect to the standby and try to sync the replication slots.
SELECT pg_sync_replication_slots();

In the log file, I can see the LOG below -
2025-05-23 07:57:12.453 IST [4178805] LOG: could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL: The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT: SELECT pg_sync_replication_slots();

select xmin,* from pg_replication_slots ;
no rows

Primary -
postgres@4179034=#select xmin,* from pg_replication_slots ;
xmin | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | two_phase_at | inactive_since | conflicting | invalidation_reason | failover | synced
------+---------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------+--------------+----------------------------------+-------------+---------------------+----------+--------
765 | slot1 | | physical | | | f | t | 4177898 | 765 | | 0/B018C08 | | reserved | | f | | | | | f | f
| failover_slot | pgoutput | logical | 5 | postgres | f | f | | | 764 | 0/B000060 | 0/B000098 | reserved | | f | | 2025-05-23 07:55:31.277584+05:30 | f | | t | f
(2 rows)
=========

Is there any way to sync the replication slot once catalog changes have been made after its creation?

The remote_slot (slot on primary) should be advanced before you invoke sync_slot. Can you do pg_logical_slot_get_changes() API before performing sync? You can check the xmin of the logical slot after get_changes to ensure that xmin has moved to 765 in your case.

With Regards,

Amit Kapila.

Re: Replication slot is not able to sync up

From

Robert Haas

Date:

23 May, 20:55:17

On Fri, May 23, 2025 at 12:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The remote_slot (slot on primary) should be advanced before you invoke sync_slot. Can you do
> pg_logical_slot_get_changes() API before performing sync? You can check the xmin of the logical slot after get_changes
> to ensure that xmin has moved to 765 in your case.

I'm fairly dismayed by this example. I hope I'm misunderstanding
something, because otherwise I have difficulty understanding how we
thought it was OK to ship this feature in this condition.

At the moment that pg_sync_replication_slots() is executed, a slot
named failover_slot exists on only one of the two servers. How can you
justify emitting an error message complaining that "remote slot
precedes local slot"? There's only one slot! I understand that, under
the hood, we probably created an additional slot on the standby and
then tried to fast-forward it, and this error occurred in the second
step. But a user shouldn't have to understand those kinds of internal
implementation details to make sense of the error message. If the
problem is that we're not able to create a slot on the standby at an
old enough LSN or XID position to permit its use with the
corresponding slot on the master, it should be reported that way.

It also seems like having to execute a manual step like
pg_logical_slot_get_changes() in order for things to work is really
missing the point of the feature. I mean, it seems like the intention
of the feature was that someone can just periodically call
pg_sync_replication_slots() on each standby and the right things will
happen -- creating slots or fast-forwarding them or dropping them, as
required. But if that sometimes requires manual fiddling like having
to consume changes from a slot then basically the feature just doesn't
work, because now the user will have to somehow understand when that
is required and what they need to do to fix it. This doesn't even seem
like a particularly obscure case.

To be honest, even after spending quite a bit of time on this, I still
don't really understand what's happening with the xmins here. Just
after creating the logical slot on the primary, it has xmin 764 on one
slot and xmin 765 on the other, and I don't understand why that's the
case, nor why the extra DDL command is needed to trigger the problem.
But I also can't shake the feeling that I shouldn't *need* to
understand that stuff to use the feature. Isn't that the whole point?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Replication slot is not able to sync up

From

Masahiko Sawada

Date:

27 May, 21:08:23

On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> In the case presented here, the logical slot is expected to keep
> forwarding, and in the consecutive sync cycle, the sync should be
> successful. Users using logical decoding APIs should also be aware
> that if, for some reason, the logical slot is not moving forward,
> the master/publisher node will start accumulating dead rows and WAL,
> which can create bigger problems.

I've tried this case and am concerned that the slot synchronization
using pg_sync_replication_slots() would never succeed while the
primary keeps getting write transactions. Even if the user manually
consumes changes on the primary, the primary server keeps advancing
its XID in the meantime. On the standby, we ensure that
TransamVariables->nextXid is beyond the XID of the WAL record that it's
going to apply, so the xmin horizon calculated by
GetOldestSafeDecodingTransactionId() always ends up being higher than
the slot's catalog_xmin on the primary. We get the log message "could
not synchronize replication slot "s" because remote slot precedes
local slot" and clean up the slot on the standby at the end of
pg_sync_replication_slots().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

RE: Replication slot is not able to sync up

From

"Zhijie Hou (Fujitsu)"

Date:

28 May, 07:15:49

On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> 
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if, for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
> 
> I've tried this case and am concerned that the slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on the
> primary, the primary server keeps advancing its XID in the meantime. On the
> standby, we ensure that
> TransamVariables->nextXid is beyond the XID of the WAL record that it's
> going to apply, so the xmin horizon calculated by
> GetOldestSafeDecodingTransactionId() always ends up being higher than the
> slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot" and
> clean up the slot on the standby at the end of pg_sync_replication_slots().

I think the issue occurs because unlike the slotsync worker, the SQL API
removes temporary slots when the function ends, so it cannot hold back the
standby's catalog_xmin. If transactions on the primary keep advancing xids, the
source slot's catalog_xmin on the primary fails to catch up with the standby's
nextXid, causing sync failure.

We chose this behavior because we could not predict when (or if) the SQL
function might be executed again, and the creating session might persist after
promotion. Without automatic cleanup, this could lead to temporary slots being
retained for a longer time.

This only affects the initial sync when creating a new slot on the standby.
Once the slot exists, the standby's catalog_xmin stabilizes, preventing the
issue in subsequent syncs.
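
The difference between the two code paths can be illustrated with a deliberately simplified toy model. It is not PostgreSQL code: the function name, the fixed per-cycle XID consumption, and the fixed lag of the primary slot behind nextXid are all assumptions made for illustration, and XID wraparound is ignored.

```python
def first_sync_cycle(keep_temp_slot, cycles=10, xids_per_cycle=5, lag=3,
                     start_xid=765):
    """Toy model: return the 1-based cycle in which the initial sync
    succeeds, or None if it never does within `cycles` attempts.

    keep_temp_slot=True  models the slotsync worker (temporary slot kept
                         between attempts, so its catalog_xmin is fixed).
    keep_temp_slot=False models the SQL function (temporary slot dropped
                         at the end of every call).
    """
    next_xid = start_xid      # assumed: standby has replayed up to here
    local_xmin = None         # catalog_xmin of the standby's temp slot
    for cycle in range(1, cycles + 1):
        # Assumption: the primary slot trails nextXid by a constant lag.
        remote_xmin = next_xid - lag
        if local_xmin is None or not keep_temp_slot:
            # A freshly created temp slot starts at the standby's
            # current safe xmin, modeled here as nextXid.
            local_xmin = next_xid
        if remote_xmin >= local_xmin:
            return cycle      # remote no longer precedes local
        next_xid += xids_per_cycle   # primary keeps consuming XIDs
    return None


if __name__ == "__main__":
    print("SQL function:", first_sync_cycle(keep_temp_slot=False))
    print("slotsync worker:", first_sync_cycle(keep_temp_slot=True))
```

In this model the dropped-and-recreated slot never catches a busy primary, while the retained slot succeeds as soon as the primary slot moves past the xmin recorded on the first attempt. With lag=0 (a quiet primary) both variants succeed immediately.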

I think the SQL API was mainly intended for testing and debugging purposes
where controlled sync operations are useful. For production use, the slotsync
worker (with sync_replication_slots=on) is recommended because it automatically
handles this problem and requires minimal manual intervention. But to avoid
confusion, I think we should clearly document this distinction.

Best Regards,
Hou zj

Re: Replication slot is not able to sync up

From

shveta malik

Date:

29 May, 06:09:25

On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> I didn't know it was intended for testing and debugging purposes so
> clarifying it in the documentation would be a good idea.

I have added the suggested docs in v3.

thanks
Shveta

Attachment

v3-0001-Improve-log-messages-and-docs-for-slotsync.patch

Re: Replication slot is not able to sync up

From

Robert Haas

Date:

29 May, 15:30:53

On Wed, May 28, 2025 at 12:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
> I think the SQL API was mainly intended for testing and debugging purposes
> where controlled sync operations are useful. For production use, the slotsync
> worker (with sync_replication_slots=on) is recommended because it automatically
> handles this problem and requires minimal manual intervention. But to avoid
> confusion, I think we should clearly document this distinction.

If this analysis is correct, this should never have been committed, at
least not in this form. When we ship something, it needs to work.
Testing and debugging facilities are best placed in src/test/modules
or in contrib; if for some reason they really need to be in
src/backend, then they had better be clearly documented as such.

What really annoys me about this is that the function gives every
superficial impression of being something you could actually use. Why
wouldn't a user believe that if they periodically connect and run
pg_sync_replication_slots(), things will be OK? I can certainly
imagine a user *wanting* that to work. I'd like that to work. But it
seems like either it's impossible for some reason that isn't clear to
me, and we just went ahead and shipped it in a non-working state
anyway, or it is possible to make it work and we didn't do the
necessary engineering before something got committed. Either way,
that's really disappointing.

> I think the issue occurs because unlike the slotsync worker, the SQL API
> removes temporary slots when the function ends, so it cannot hold back the
> standby's catalog_xmin. If transactions on the primary keep advancing xids, the
> source slot's catalog_xmin on the primary fails to catch up with the standby's
> nextXid, causing sync failure.

I still don't understand how this problem arises in the first place.
It seems like you're describing a situation where we need to prevent
the standby from getting ahead of the primary, but that should be
impossible by definition.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Replication slot is not able to sync up

From

Amit Kapila

Date:

30 May, 11:49:50

On Thu, May 29, 2025 at 6:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 12:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> > I think the SQL API was mainly intended for testing and debugging purposes
> > where controlled sync operations are useful. For production use, the slotsync
> > worker (with sync_replication_slots=on) is recommended because it automatically
> > handles this problem and requires minimal manual intervention. But to avoid
> > confusion, I think we should clearly document this distinction.
>
> If this analysis is correct, this should never have been committed, at
> least not in this form. When we ship something, it needs to work.
> Testing and debugging facilities are best placed in src/test/modules
> or in contrib; if for some reason they really need to be in
> src/backend, then they had better be clearly documented as such.
>
> What really annoys me about this is that the function gives every
> superficial impression of being something you could actually use. Why
> wouldn't a user believe that if they periodically connect and run
> pg_sync_replication_slots(), things will be OK? I can certainly
> imagine a user *wanting* that to work. I'd like that to work. But it
> seems like either it's impossible for some reason that isn't clear to
> me, and we just went ahead and shipped it in a non-working state
> anyway, or it is possible to make it work and we didn't do the
> necessary engineering before something got committed. Either way,
> that's really disappointing.
>
> > I think the issue occurs because unlike the slotsync worker, the SQL API
> > removes temporary slots when the function ends, so it cannot hold back the
> > standby's catalog_xmin. If transactions on the primary keep advancing xids, the
> > source slot's catalog_xmin on the primary fails to catch up with the standby's
> > nextXid, causing sync failure.
>
> I still don't understand how this problem arises in the first place.
> It seems like you're describing a situation where we need to prevent
> the standby from getting ahead of the primary, but that should be
> impossible by definition.
>

The reason is that we do not allow creating a synced slot if the
required WAL or catalog rows for this slot have been removed or are at
risk of removal. The way we achieve it is that during the first
sync_slot call, either via slotsync worker or API, we create a
temporary slot on the standby with xmin pointed to the safest possible
xmin (catalog_xmin) on standby computed by
GetOldestSafeDecodingTransactionId() and WAL (restart_lsn) pointed to
by the oldest WAL present on standby. Now, if the source slot's (slot
on primary) corresponding location/xmin are prior to the location/xmin
on the standby then we can't sync the slot immediately because there
is no guarantee that required resources (WAL/catalog_rows) will be
available when we try to use the synced slot after promotion. The
slotsync worker will keep retrying to sync the slot and will
eventually succeed once the source slot's values are safe to be synced
to the standby. With the API, we didn't implement this retry logic,
which is why we see the behaviour currently reported. Note that once
the first sync is successful, subsequent syncs, even via the API,
should work like the worker.
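
As a rough sketch of the check described above, the comparison behind the "remote slot precedes local slot" message can be modeled as follows. This is a Python approximation under stated assumptions, not the server's C code: the helper names and the dict layout are hypothetical, and the XID comparison mimics the usual modulo-2^32 rule for normal transaction IDs.

```python
FIRST_NORMAL_XID = 3  # XIDs below this are special and compared plainly


def parse_lsn(text):
    """Convert an LSN such as '0/B000060' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def xid_precedes(a, b):
    """True if XID a is logically older than XID b (wraparound-aware)."""
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a < b
    # Equivalent to checking that the signed 32-bit difference is negative.
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


def remote_precedes_local(remote, local):
    """True if the primary's slot still needs WAL or catalog rows older
    than what the standby's temporary slot is able to protect."""
    return (parse_lsn(remote["restart_lsn"]) < parse_lsn(local["restart_lsn"])
            or xid_precedes(remote["catalog_xmin"], local["catalog_xmin"]))


if __name__ == "__main__":
    # Values taken from the log message reported in this thread.
    remote = {"restart_lsn": "0/B000060", "catalog_xmin": 764}
    local = {"restart_lsn": "0/B000060", "catalog_xmin": 765}
    print(remote_precedes_local(remote, local))
```

With the values from the reported log, the LSNs are equal but catalog xmin 764 precedes 765, so the slot cannot be persisted until the primary slot advances.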

I agree that the current use of API is limited, such that one can use
it in a controlled environment (e.g., the first time sync happens
before other operations on primary), or to debug this functionality,
or to write tests. It is not clear to me why someone would not use the
built-in functionality to sync slots and prefer this API. But going
forward (as we see people would like to use this API to sync slots),
it is not that difficult to improve this API to match its behaviour
with the built-in worker for initial/first sync.

I see that we separately document functions [1] used for
development/debug, and this API could be documented in that way.

[1]: https://www.postgresql.org/docs/current/functions-textsearch.html#TEXTSEARCH-FUNCTIONS-DEBUG-TABLE

--
With Regards,
Amit Kapila.

RE: Replication slot is not able to sync up

From

"Zhijie Hou (Fujitsu)"

Date:

30 May, 13:07:42

On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> 
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if, for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
> 
> I've tried this case and am concerned that the slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on the
> primary, the primary server keeps advancing its XID in the meantime. On the
> standby, we ensure that
> TransamVariables->nextXid is beyond the XID of the WAL record that it's
> going to apply, so the xmin horizon calculated by
> GetOldestSafeDecodingTransactionId() always ends up being higher than the
> slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot" and
> clean up the slot on the standby at the end of pg_sync_replication_slots().

To improve this workload scenario, we can modify pg_sync_replication_slots() to
wait for the primary slot to advance to a suitable position before completing
synchronization and removing the temporary slot. This would allow the sync to
complete as soon as the primary slot advances, whether through
pg_logical_xx_get_changes() or other ways.

I've created a POC (attached) that currently waits indefinitely for the remote
slot to catch up. We could later add a timeout parameter to control maximum
wait time if this approach seems acceptable.
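
The waiting behaviour, including the possible timeout, might look roughly like the sketch below. It is not taken from the attached patch: the function name, the polling interval, and the callback that returns the primary slot's catalog_xmin are hypothetical, and plain integer comparison is used in place of wraparound-aware XID comparison.

```python
import time


def wait_for_remote_slot(fetch_remote_xmin, local_xmin, timeout=None,
                         poll_interval=0.01):
    """Poll until the primary slot's catalog_xmin reaches local_xmin.

    fetch_remote_xmin: hypothetical callable returning the primary
                       slot's current catalog_xmin.
    timeout:           seconds to wait; None waits indefinitely, as the
                       POC is described as doing.
    Returns True once the slot can be synced, False on timeout.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if fetch_remote_xmin() >= local_xmin:
            return True       # the temporary slot can now be persisted
        if deadline is not None and time.monotonic() >= deadline:
            return False      # caller would drop the temporary slot
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Simulated primary slot that advances on the third poll.
    polls = iter([764, 764, 766])
    print(wait_for_remote_slot(lambda: next(polls), 765, timeout=5.0))
```

Keeping the temporary slot alive for the duration of the wait is what pins the standby's catalog_xmin, so the target no longer moves while the primary slot catches up.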

I tested that, when pgbench TPC-B is running on the primary, calling
pg_sync_replication_slots() on the standby correctly blocks until I advance the
primary slot position by calling pg_logical_xx_get_changes().

If the basic idea sounds reasonable, I can start a separate
thread to extend this API. Thoughts?

Best Regards,
Hou zj

Attachment

0001-POC-Improve-initial-slot-synchronization-in-pg_sync_repl.patch

Re: Replication slot is not able to sync up

From

Amit Kapila

Date:

30 May, 14:02:28

On Fri, May 30, 2025 at 4:05 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Quick question -- due to my limited understanding of this area: why
> can't we perform an action similar to pg_logical_slot_get_changes()
> implicitly from pg_sync_replication_slots()? Would there be any
> implications of doing so?
>

Yes, there would be implications if we did it that way. It would mean
that the consumer of the slot may not process those changes (for which
the sync_slot API has done the get_changes) and send them to the client.
Consider a publisher-subscriber and physical standby setup. In this
setup, the subscriber creates a logical slot corresponding to the
subscription on the publisher. Now, the publisher processes changes and
sends them to the subscriber; then the slot is advanced (both its xmin
and WAL locations) once the corresponding changes are sent to the
client.

If we allow pg_sync_replication_slots() to do
pg_logical_slot_get_changes or equivalent in some way, then we may end
up advancing the slot without sending the changes to the subscriber,
which would be considered data loss for the subscriber.
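
A toy model makes the hazard concrete. The class below is purely illustrative (it is not how slots are implemented); it only captures the property that get_changes consumes and advances, whereas peek_changes does not.

```python
class ToySlot:
    """Illustrative stand-in for a logical slot: a cursor over changes."""

    def __init__(self, changes):
        self._changes = list(changes)
        self._pos = 0  # stands in for the slot's confirmed position

    def peek_changes(self):
        """Return pending changes without advancing the slot."""
        return self._changes[self._pos:]

    def get_changes(self):
        """Return pending changes and advance the slot past them."""
        pending = self._changes[self._pos:]
        self._pos = len(self._changes)
        return pending


if __name__ == "__main__":
    slot = ToySlot(["INSERT 1", "INSERT 2"])
    # Hypothetical: the sync function consumes changes just to advance.
    discarded = slot.get_changes()
    # The real consumer connects afterwards and finds nothing left.
    received = slot.get_changes()
    print(discarded, received)
```

Whoever calls get_changes first receives the changes; a later consumer of the same slot sees none of them, which is the data loss described above.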

I have explained in terms of built-in logical replication, but the
external plugins using these APIs (pg_logical_*) should be doing
something similar to process the changes and advance the slot.

Does this answer your question and make sense to you?

--
With Regards,
Amit Kapila.

Re: Replication slot is not able to sync up

From

Robert Haas

Date:

03 June, 19:47:45

On Fri, May 30, 2025 at 6:08 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
> To improve this workload scenario, we can modify pg_sync_replication_slots() to
> wait for the primary slot to advance to a suitable position before completing
> synchronization and removing the temporary slot. This would allow the sync to
> complete as soon as the primary slot advances, whether through
> pg_logical_xx_get_changes() or other ways.

My understanding of this area is limited, but this sounds potentially
promising to me. The current approach seems very timing-dependent.
Depending on the state of the primary vs. the state of the standby, a
call to pg_sync_replication_slots() may either create a slot or fail
to do so. A call at a slightly earlier or later time might have had a
different result. IIUC, this proposal would make different results due
to minor timing variations less probable.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Replication slot is not able to sync up

From

Amit Kapila

Date:

10 June, 10:15:06

On Thu, May 29, 2025 at 8:39 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I didn't know it was intended for testing and debugging purposes so
> > clarifying it in the documentation would be a good idea.
>
> I have added the suggested docs in v3.
>

- errmsg("could not synchronize replication slot \"%s\"", remote_slot->name),
- errdetail("Logical decoding could not find consistent point from local slot's LSN %X/%X.",
+ errmsg("could not synchronize replication slot \"%s\" to prevent data loss", remote_slot->name),
+ errdetail("Standby does not have enough data to decode WALs at LSN %X/%X.",
    LSN_FORMAT_ARGS(slot->data.restart_lsn)));

I find the errdetail is not clear about the current state, which is
that we can't yet build a consistent snapshot on the standby to allow
decoding. Would it be better to have errdetail like: "Standby could
not build a consistent snapshot to decode WALs at LSN %X/%X."?

--
With Regards,
Amit Kapila.

RE: Replication slot is not able to sync up

From

"Zhijie Hou (Fujitsu)"

Date:

10 June, 12:50:29

On Thu, May 29, 2025 at 11:09 AM shveta malik wrote:
> 
> On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada 
> <sawada.mshk@gmail.com> wrote:
> >
> >
> > I didn't know it was intended for testing and debugging purposes so 
> > clarifying it in the documentation would be a good idea.
> 
> I have added the suggested docs in v3.

Thanks for updating the patch.

I have a few suggestions for the document from a user's perspective.

1.
>     ... , one
>     condition must be met. The logical replication slot on primary must be advanced
>     to such a catalog change position (catalog_xmin) and WAL's LSN (restart_lsn) for
>     which sufficient data is retained on the corresponding standby server.

The term "catalog change position" might not be very easy for some readers
to grasp. Would it be clearer to phrase it as follows?

"The logical replication slot on the primary must reach a state where the WALs
and system catalog rows retained by the slot are also present on the
corresponding standby server. "

2.
>     If the primary slot is still lagging behind and synchronization is attempted
>     for the first time, then to prevent the data loss as explained, persistence
>     and synchronization of newly created slot will be skipped, and the following
>     log message may appear on standby.

The phrase "lagging behind" typically refers to the standby, which can be a bit
confusing. I understand that users can rely on the surrounding context to work it
out, but would it be easier to understand with a more detailed description like
the one below?

"If the WALs and system catalog rows retained by the slot on the primary have
already been purged from the standby server, ..."

3.
<programlisting>
     LOG: could not synchronize replication slot "failover_slot" to prevent data loss
     DETAIL:  The remote slot needs WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has LSN 0/3003F28 and catalog xmin 766.

</programlisting>

It seems that one space is missing between "LOG:" and the message.

Best Regards,
Hou zj

Re: Replication slot is not able to sync up

From

shveta malik

Date:

11 June, 04:49:26

On Tue, Jun 10, 2025 at 3:20 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Thanks for updating the patch.
>
> I have a few suggestions for the document from a user's perspective.
>

Thanks Hou-San, I agree with your suggestions. Addressed in v4.

Also addressed Amit's suggestion at [1] to improve errdetail.

[1]: https://www.postgresql.org/message-id/CAA4eK1JKXCMDqfFgNtemVZ9ge4KrQtwSQG1OwMLNHRBDfnH9rA%40mail.gmail.com

thanks
Shveta

Attachment

v4-0001-Improve-log-messages-and-docs-for-slotsync.patch

Re: Replication slot is not able to sync up

From

Peter Smith

Date:

12 June, 01:43:10

On Wed, Jun 11, 2025 at 8:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 7:19 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Jun 10, 2025 at 3:20 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > Thanks for updating the patch.
> > >
> > > I have a few suggestions for the document from a user's perspective.
> > >
> >
> > Thanks Hou-San, I agree with your suggestions. Addressed in v4.
> >
> > Also addressed Amit's suggestion at [1] to improve errdetail.
> >
>
> So, the overall direction we are taking here is that we want to
> improve the existing LOG/DEBUG messages and docs for HEAD and back
> branches. Then we will improve the API behavior based on Hou-San's
> patch for PG19. Let me know if you or others think otherwise.
>
> +    <para>
> +     Apart from enabling <link linkend="guc-sync-replication-slots">
> +     <varname>sync_replication_slots</varname></link> to synchronize slots
> +     periodically, failover slots can be manually synchronized by invoking
> +     <link linkend="pg-sync-replication-slots">
> +     <function>pg_sync_replication_slots</function></link> on the standby.
> +     However, this function is primarily intended for testing and debugging
> +     purposes and should be used with caution. The recommended approach to
> +     synchronize slots is by enabling <link linkend="guc-sync-replication-slots">
> +     <varname>sync_replication_slots</varname></link> on the standby, as it
> +     ensures continuous and automatic synchronization of replication slots,
> +     facilitating seamless failover and high availability.
> +    </para>
> +
> +    <para>
> +     When slot-synchronization setup is done as recommended, and
> +     slot-synchronization is performed the very first time either automatically
> +     or by <link linkend="pg-sync-replication-slots">
> +     <function>pg_sync_replication_slots</function></link>,
> +     then for the synchronized slot to be created and persisted on the standby,
> +     one condition must be met. The logical replication slot on the primary
> +     must reach a state where the WALs and system catalog rows retained by
> +     the slot are also present on the corresponding standby server. This is
> +     needed to prevent any data loss and to allow logical replication to continue
> +     seamlessly through the synchronized slot if needed after promotion.
> +     If the WALs and system catalog rows retained by the slot on the primary have
> +     already been purged from the standby server, and synchronization is attempted
> +     for the first time, then to prevent the data loss as explained, persistence
> +     and synchronization of newly created slot will be skipped, and the following
> +     log message may appear on standby.
> +<programlisting>
> +     LOG:  could not synchronize replication slot "failover_slot"
> +     DETAIL:  Synchronization could lead to data loss as the remote slot needs WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has LSN 0/3003F28 and catalog xmin 756
> +</programlisting>
> +     If the logical replication slot is actively consumed by a consumer, no further
> +     manual action is needed by the user, as the slot on primary will be advanced
> +     automatically, and synchronization will proceed in the next cycle. However,
> +     if no logical replication consumer is set up yet, to advance the slot, it
> +     is recommended to manually run the <link linkend="pg-logical-slot-get-changes">
> +     <function>pg_logical_slot_get_changes</function></link> or
> +     <link linkend="pg-logical-slot-get-binary-changes">
> +     <function>pg_logical_slot_get_binary_changes</function></link> on the primary
> +     slot and allow synchronization to proceed.
> +    </para>
> +
>
> I have reworded the above as follows:
> To enable periodic synchronization of replication slots, it is
> recommended to activate sync_replication_slots on the standby server.
> While manual synchronization is possible using
> pg_sync_replication_slots, this function is primarily intended for
> testing and debugging and should be used with caution. Automatic
> synchronization via sync_replication_slots ensures continuous slot
> updates, supporting seamless failover and maintaining high
> availability. When slot synchronization is configured as recommended,
> and the initial synchronization is performed either automatically or
> manually via pg_sync_replication_slots, the standby can persist the
> synchronized slot only if the following condition is met: The logical
> replication slot on the primary must retain WALs and system catalog
> rows that are still available on the standby. This ensures data
> integrity and allows logical replication to continue smoothly after
> promotion.
> If the required WALs or catalog rows have already been purged from the
> standby, the slot will not be persisted to avoid data loss. In such
> cases, the following log message may appear:
>
> LOG: could not synchronize replication slot "failover_slot"
> DETAIL: Synchronization could lead to data loss as the remote slot
> needs WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has
> LSN 0/3003F28 and catalog xmin 756
>
> If the logical replication slot is actively used by a consumer, no
> manual intervention is needed; the slot will advance automatically,
> and synchronization will resume in the next cycle. However, if no
> consumer is configured, it is advisable to manually advance the slot
> on the primary using pg_logical_slot_get_changes or
> pg_logical_slot_get_binary_changes, allowing synchronization to
> proceed.
>
> Let me know what you think of above?
>

Phrases like "... it is recommended..." and "... intended for testing
and debugging .. " and "... should be used with caution." and "... it
is advisable to..." seem like indicators that parts of the above
description should be using SGML markup such as <caution> or <warning>
or <note> instead of just plain text.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Replication slot is not able to sync up

From

shveta malik

Date:

12 June, 08:14:30

On Thu, Jun 12, 2025 at 4:13 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Phrases like "... it is recommended..." and "... intended for testing
> and debugging .. " and "... should be used with caution." and "... it
> is advisable to..." seem like indicators that parts of the above
> description should be using SGML markup such as <caution> or <warning>
> or <note> instead of just plain text.
>

I feel WARNING and CAUTION markups could be a little strong for the
concerned case. Such markups are generally used when there is a
side-effect involved with the usage. But in our case, there is no such
side-effect with the API. At most, it may fail without harming the
system and will succeed in the next invocation. But I also feel that
such sections catch user attention. Thus if needed, we can have a NOTE
section to convey the recommended way of slot synchronization.
Thoughts?

Similar to our case, I see some other docs using caution words without
a CAUTION markup. Please search for 'caution' in [1],[2],[3]

[1]: https://www.postgresql.org/docs/current/continuous-archiving.html
[2]: https://www.postgresql.org/docs/current/sql-altertable.html
[3]: https://www.postgresql.org/docs/18/oauth-validator-design.html#OAUTH-VALIDATOR-DESIGN-USERMAP-DELEGATION

thanks
Shveta