Thread: Re: Replication slot is not able to sync up
On Fri, May 23, 2025 at 9:57 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
Hi,
Noticed the below behaviour where a replication slot is not able to sync up if any catalog changes happened after its creation. Getting the below LOG when trying to sync replication slots using the pg_sync_replication_slots() function. The newly created slot does not appear on the standby after this LOG -
2025-05-23 07:57:12.453 IST [4178805] LOG: could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL: The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT: SELECT pg_sync_replication_slots();
Below is the test case tried on latest master branch -
=========
-- Create the primary and start the server with the below setting:
wal_level = logical
-- Create a physical slot on the primary:
SELECT pg_create_physical_replication_slot('slot1');
-- Set up the standby using pg_basebackup:
bin/pg_basebackup -D data1 -p 5418 -d "dbname=postgres" -R
-- Configure the standby:
primary_slot_name = 'slot1'
hot_standby_feedback = on
port = 5419
-- Start the standby.
-- Connect to the primary and create a logical replication slot:
SELECT pg_create_logical_replication_slot('failover_slot', 'pgoutput', false, false, true);

postgres@4177929=# select xmin,* from pg_replication_slots ;
-[ RECORD 1 ]-------+---------------------------------
xmin                | 765
slot_name           | slot1
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 4177898
xmin                | 765
catalog_xmin        |
restart_lsn         | 0/B018B00
confirmed_flush_lsn |
wal_status          | reserved
safe_wal_size       |
two_phase           | f
two_phase_at        |
inactive_since      |
conflicting         |
invalidation_reason |
failover            | f
synced              | f
-[ RECORD 2 ]-------+---------------------------------
xmin                |
slot_name           | failover_slot
plugin              | pgoutput
slot_type           | logical
datoid              | 5
database            | postgres
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 764
restart_lsn         | 0/B000060
confirmed_flush_lsn | 0/B000098
wal_status          | reserved
safe_wal_size       |
two_phase           | f
two_phase_at        |
inactive_since      | 2025-05-23 07:55:31.277584+05:30
conflicting         | f
invalidation_reason |
failover            | t
synced              | f
-- Perform some catalog changes, e.g.:
create table abc(id int);
postgres@4179034=# select xmin from pg_class where relname='abc';
xmin
------
764
(1 row)
-- Connect to the standby and try to sync the replication slots.
SELECT pg_sync_replication_slots();
In the logfile, we can see the below LOG -
2025-05-23 07:57:12.453 IST [4178805] LOG: could not synchronize replication slot "failover_slot" because remote slot precedes local slot
2025-05-23 07:57:12.453 IST [4178805] DETAIL: The remote slot has LSN 0/B000060 and catalog xmin 764, but the local slot has LSN 0/B000060 and catalog xmin 765.
2025-05-23 07:57:12.453 IST [4178805] STATEMENT: SELECT pg_sync_replication_slots();
-- On the standby:
select xmin,* from pg_replication_slots ;
(0 rows)

-- On the primary:
postgres@4179034=# select xmin,* from pg_replication_slots ;
-[ RECORD 1 ]-------+---------------------------------
xmin                | 765
slot_name           | slot1
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 4177898
xmin                | 765
catalog_xmin        |
restart_lsn         | 0/B018C08
confirmed_flush_lsn |
wal_status          | reserved
safe_wal_size       |
two_phase           | f
two_phase_at        |
inactive_since      |
conflicting         |
invalidation_reason |
failover            | f
synced              | f
-[ RECORD 2 ]-------+---------------------------------
xmin                |
slot_name           | failover_slot
plugin              | pgoutput
slot_type           | logical
datoid              | 5
database            | postgres
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 764
restart_lsn         | 0/B000060
confirmed_flush_lsn | 0/B000098
wal_status          | reserved
safe_wal_size       |
two_phase           | f
two_phase_at        |
inactive_since      | 2025-05-23 07:55:31.277584+05:30
conflicting         | f
invalidation_reason |
failover            | t
synced              | f
=========
Is there any way to sync up the replication slot after catalog changes have been made following its creation?
The remote_slot (the slot on the primary) should be advanced before you invoke sync_slot. Can you run the pg_logical_slot_get_changes() API before performing the sync? You can check the xmin of the logical slot after get_changes to ensure that xmin has moved to 765 in your case.
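For example, something like the below (untested sketch; since your slot uses pgoutput, the binary variant with its required options is needed, and 'mypub' stands for a publication you would need to create first):

-- On the primary: consume the pending changes so that the slot advances.
-- 'mypub' is a hypothetical publication; create one before running this.
SELECT count(*)
  FROM pg_logical_slot_get_binary_changes('failover_slot', NULL, NULL,
       'proto_version', '1', 'publication_names', 'mypub');

-- Verify that catalog_xmin has moved (to 765 in your case).
SELECT slot_name, catalog_xmin, restart_lsn
  FROM pg_replication_slots WHERE slot_name = 'failover_slot';

-- Then, on the standby, retry the sync:
SELECT pg_sync_replication_slots();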
--
With Regards,
Amit Kapila.
On Fri, May 23, 2025 at 12:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The remote_slot (the slot on the primary) should be advanced before you
> invoke sync_slot. Can you run the pg_logical_slot_get_changes() API before
> performing the sync? You can check the xmin of the logical slot after
> get_changes to ensure that xmin has moved to 765 in your case.

I'm fairly dismayed by this example. I hope I'm misunderstanding something, because otherwise I have difficulty understanding how we thought it was OK to ship this feature in this condition.

At the moment that pg_sync_replication_slots() is executed, a slot named failover_slot exists on only one of the two servers. How can you justify emitting an error message complaining that "remote slot precedes local slot"? There's only one slot! I understand that, under the hood, we probably created an additional slot on the standby and then tried to fast-forward it, and this error occurred in the second step. But a user shouldn't have to understand those kinds of internal implementation details to make sense of the error message. If the problem is that we're not able to create a slot on the standby at an old enough LSN or XID position to permit its use with the corresponding slot on the master, it should be reported that way.

It also seems like having to execute a manual step like pg_logical_slot_get_changes() in order for things to work is really missing the point of the feature. I mean, it seems like the intention of the feature was that someone can just periodically call pg_sync_replication_slots() on each standby and the right things will happen -- creating slots or fast-forwarding them or dropping them, as required. But if that sometimes requires manual fiddling like having to consume changes from a slot, then basically the feature just doesn't work, because now the user will have to somehow understand when that is required and what they need to do to fix it. This doesn't even seem like a particularly obscure case.

To be honest, even after spending quite a bit of time on this, I still don't really understand what's happening with the xmins here. Just after creating the logical slot on the primary, it has xmin 764 on one slot and xmin 765 on the other, and I don't understand why that's the case, nor why the extra DDL command is needed to trigger the problem. But I also can't shake the feeling that I shouldn't *need* to understand that stuff to use the feature. Isn't that the whole point?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> In the case presented here, the logical slot is expected to keep
> forwarding, and in the consecutive sync cycle, the sync should be
> successful. Users using logical decoding APIs should also be aware
> that if, for some reason, the logical slot is not moving forward,
> the master/publisher node will start accumulating dead rows and WAL,
> which can create bigger problems.

I've tried this case and am concerned that slot synchronization using pg_sync_replication_slots() would never succeed while the primary keeps getting write transactions. Even if the user manually consumes changes on the primary, the primary server keeps advancing its XID in the meanwhile. On the standby, we ensure that TransamVariables->nextXid is beyond the XID of any WAL record that it's going to apply, so the xmin horizon calculated by GetOldestSafeDecodingTransactionId() ends up always being higher than the slot's catalog_xmin on the primary. We get the log message "could not synchronize replication slot "s" because remote slot precedes local slot", and the slot on the standby is cleaned up at the end of pg_sync_replication_slots().
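Roughly, the mismatch can be observed as below (pg_control_checkpoint() reflects values only as of the last restartpoint, so it lags the in-memory nextXid, but the trend is visible; the slot name follows the earlier test):

-- On the primary: the catalog_xmin that the sync must honor.
SELECT slot_name, catalog_xmin
  FROM pg_replication_slots WHERE slot_name = 'failover_slot';

-- On the standby: next_xid keeps moving forward as WAL is replayed.
SELECT next_xid FROM pg_control_checkpoint();

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com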
On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
>
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if, for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
>
> I've tried this case and am concerned that slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on
> the primary, the primary server keeps advancing its XID in the meanwhile.
> On the standby, we ensure that TransamVariables->nextXid is beyond the XID
> of any WAL record that it's going to apply, so the xmin horizon calculated
> by GetOldestSafeDecodingTransactionId() ends up always being higher than
> the slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot",
> and the slot on the standby is cleaned up at the end of
> pg_sync_replication_slots().

I think the issue occurs because, unlike the slotsync worker, the SQL API removes temporary slots when the function ends, so it cannot hold back the standby's catalog_xmin. If transactions on the primary keep advancing xids, the source slot's catalog_xmin on the primary fails to catch up with the standby's nextXid, causing sync failure.

We chose this behavior because we could not predict when (or if) the SQL function might be executed again, and the creating session might persist after promotion. Without automatic cleanup, this could lead to temporary slots being retained for a longer time. This only affects the initial sync when creating a new slot on the standby. Once the slot exists, the standby's catalog_xmin stabilizes, preventing the issue in subsequent syncs.

I think the SQL API was mainly intended for testing and debugging purposes where controlled sync operations are useful. For production use, the slotsync worker (with sync_replication_slots=on) is recommended because it automatically handles this problem and requires minimal manual intervention. But to avoid confusion, I think we should clearly document this distinction.
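For reference, a minimal sketch of the recommended production setup on the standby could look like below (values are illustrative; primary_conninfo must include a dbname for the slotsync worker to connect):

-- On the standby; all of these GUCs can be changed with a reload.
ALTER SYSTEM SET sync_replication_slots = on;
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET primary_slot_name = 'slot1';
ALTER SYSTEM SET primary_conninfo = 'host=primary port=5418 dbname=postgres';
SELECT pg_reload_conf();

Best Regards,
Hou zj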
On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> I didn't know it was intended for testing and debugging purposes, so
> clarifying it in the documentation would be a good idea.

I have added the suggested docs in v3.

thanks
Shveta
On Wed, May 28, 2025 at 12:15 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> I think the SQL API was mainly intended for testing and debugging purposes
> where controlled sync operations are useful. For production use, the slotsync
> worker (with sync_replication_slots=on) is recommended because it automatically
> handles this problem and requires minimal manual intervention. But to avoid
> confusion, I think we should clearly document this distinction.

If this analysis is correct, this should never have been committed, at least not in this form. When we ship something, it needs to work. Testing and debugging facilities are best placed in src/test/modules or in contrib; if for some reason they really need to be in src/backend, then they had better be clearly documented as such.

What really annoys me about this is that the function gives every superficial impression of being something you could actually use. Why wouldn't a user believe that if they periodically connect and run pg_sync_replication_slots(), things will be OK? I can certainly imagine a user *wanting* that to work. I'd like that to work. But it seems like either it's impossible for some reason that isn't clear to me, and we just went ahead and shipped it in a non-working state anyway, or it is possible to make it work and we didn't do the necessary engineering before something got committed. Either way, that's really disappointing.

> I think the issue occurs because unlike the slotsync worker, the SQL API
> removes temporary slots when the function ends, so it cannot hold back the
> standby's catalog_xmin. If transactions on the primary keep advancing xids,
> the source slot's catalog_xmin on the primary fails to catch up with the
> standby's nextXid, causing sync failure.

I still don't understand how this problem arises in the first place. It seems like you're describing a situation where we need to prevent the standby from getting ahead of the primary, but that should be impossible by definition.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 29, 2025 at 6:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 12:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> > I think the SQL API was mainly intended for testing and debugging purposes
> > where controlled sync operations are useful. For production use, the slotsync
> > worker (with sync_replication_slots=on) is recommended because it automatically
> > handles this problem and requires minimal manual intervention. But to avoid
> > confusion, I think we should clearly document this distinction.
>
> If this analysis is correct, this should never have been committed, at
> least not in this form. When we ship something, it needs to work.
> Testing and debugging facilities are best placed in src/test/modules
> or in contrib; if for some reason they really need to be in
> src/backend, then they had better be clearly documented as such.
>
> What really annoys me about this is that the function gives every
> superficial impression of being something you could actually use. Why
> wouldn't a user believe that if they periodically connect and run
> pg_sync_replication_slots(), things will be OK? I can certainly
> imagine a user *wanting* that to work. I'd like that to work. But it
> seems like either it's impossible for some reason that isn't clear to
> me, and we just went ahead and shipped it in a non-working state
> anyway, or it is possible to make it work and we didn't do the
> necessary engineering before something got committed. Either way,
> that's really disappointing.
>
> > I think the issue occurs because unlike the slotsync worker, the SQL API
> > removes temporary slots when the function ends, so it cannot hold back the
> > standby's catalog_xmin. If transactions on the primary keep advancing xids,
> > the source slot's catalog_xmin on the primary fails to catch up with the
> > standby's nextXid, causing sync failure.
>
> I still don't understand how this problem arises in the first place.
> It seems like you're describing a situation where we need to prevent
> the standby from getting ahead of the primary, but that should be
> impossible by definition.
>

The reason is that we do not allow creating a synced slot if the required WAL or catalog rows for this slot have been removed or are at risk of removal. The way we achieve this is that during the first sync_slot call, either via the slotsync worker or the API, we create a temporary slot on the standby with xmin pointed to the safest possible xmin (catalog_xmin) on the standby, computed by GetOldestSafeDecodingTransactionId(), and WAL (restart_lsn) pointed to the oldest WAL present on the standby. Now, if the source slot's (the slot on the primary) corresponding location/xmin are prior to the location/xmin on the standby, then we can't sync the slot immediately because there is no guarantee that the required resources (WAL/catalog rows) will be available when we try to use the synced slot after promotion.

The slotsync worker will keep retrying to sync the slot and will eventually succeed once the source slot's values are safe to be synced to the standby. Now, with the API, we didn't implement this retry logic, due to which we see the behaviour currently reported. Note that once the first sync is successful, on consecutive attempts even the API should work similarly to the worker.

I agree that the current use of the API is limited, such that one can use it in a controlled environment (e.g., the first sync happens before other operations on the primary), or to debug this functionality, or to write tests. It is not clear to me why someone would not use the built-in functionality to sync slots and prefer this API. But going forward (as we see people would like to use this API to sync slots), it is not that difficult to improve this API to match its behaviour with the built-in worker for the initial/first sync. I see that we separately document functions [1] used for development/debug purposes, and this API could be documented in that way.
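For completeness, once the first sync has succeeded, the result can be confirmed on the standby like below:

-- On the standby: a successfully synced slot is persistent (temporary = f)
-- and marked as synced = t.
SELECT slot_name, slot_type, temporary, synced, catalog_xmin, restart_lsn
  FROM pg_replication_slots WHERE slot_name = 'failover_slot';

[1]: https://www.postgresql.org/docs/current/functions-textsearch.html#TEXTSEARCH-FUNCTIONS-DEBUG-TABLE

--
With Regards,
Amit Kapila.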
On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
>
> On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > In the case presented here, the logical slot is expected to keep
> > forwarding, and in the consecutive sync cycle, the sync should be
> > successful. Users using logical decoding APIs should also be aware
> > that if, for some reason, the logical slot is not moving forward,
> > the master/publisher node will start accumulating dead rows and WAL,
> > which can create bigger problems.
>
> I've tried this case and am concerned that slot synchronization using
> pg_sync_replication_slots() would never succeed while the primary keeps
> getting write transactions. Even if the user manually consumes changes on
> the primary, the primary server keeps advancing its XID in the meanwhile.
> On the standby, we ensure that TransamVariables->nextXid is beyond the XID
> of any WAL record that it's going to apply, so the xmin horizon calculated
> by GetOldestSafeDecodingTransactionId() ends up always being higher than
> the slot's catalog_xmin on the primary. We get the log message "could not
> synchronize replication slot "s" because remote slot precedes local slot",
> and the slot on the standby is cleaned up at the end of
> pg_sync_replication_slots().

To improve this workload scenario, we can modify pg_sync_replication_slots() to wait for the primary slot to advance to a suitable position before completing synchronization and removing the temporary slot. This would allow the sync to complete as soon as the primary slot advances, whether through pg_logical_xx_get_changes() or other ways.

I've created a POC (attached) that currently waits indefinitely for the remote slot to catch up. We could later add a timeout parameter to control the maximum wait time if this approach seems acceptable. I tested that, when pgbench TPC-B is running on the primary, calling pg_sync_replication_slots() on the standby correctly blocks until I advance the primary slot position by calling pg_logical_xx_get_changes().

If the basic idea sounds reasonable, then I can start a separate thread to extend this API. Thoughts?
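To illustrate the intended usage with the POC applied (slot and publication names follow the earlier test and are illustrative):

-- Session 1, on the standby: with the POC, this blocks until the
-- remote slot has advanced far enough to be synced safely.
SELECT pg_sync_replication_slots();

-- Session 2, on the primary: advance the slot, e.g., by consuming changes;
-- once it has advanced, session 1 completes and persists the synced slot.
SELECT count(*)
  FROM pg_logical_slot_get_binary_changes('failover_slot', NULL, NULL,
       'proto_version', '1', 'publication_names', 'mypub');

Best Regards,
Hou zj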
On Fri, May 30, 2025 at 4:05 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Quick question -- due to my limited understanding of this area: why
> can't we perform an action similar to pg_logical_slot_get_changes()
> implicitly from pg_sync_replication_slots()? Would there be any
> implications of doing so?
>

Yes, there would be implications if we did it that way. It would mean that the consumer of the slot may not process those changes (for which the sync_slot API has done the get_changes) and send them to the client.

Consider a publisher-subscriber and physical standby setup. In this setup, the subscriber creates a logical slot corresponding to the subscription on the publisher. Now, the publisher processes changes and sends them to the subscriber; the slot is then advanced (both its xmin and WAL locations) once the corresponding changes are sent to the client. If we allow pg_sync_replication_slots() to do pg_logical_slot_get_changes or the equivalent in some way, then we may end up advancing the slot without sending the changes to the subscriber, which would be considered a data loss for the subscriber.

I have explained this in terms of built-in logical replication, but the external plugins using these APIs (pg_logical_*) should be doing something similar to process the changes and advance the slot. Does this answer your question and make sense to you?
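To illustrate the hazard concretely in plain SQL (a hypothetical test_decoding slot is used here for readability; the principle is the same for pgoutput):

-- On the primary: a throwaway logical slot and a change it should see.
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');
INSERT INTO abc VALUES (1);

-- Advancing the slot without delivering the change to any consumer
-- permanently skips it for this slot:
SELECT pg_replication_slot_advance('demo_slot', pg_current_wal_lsn());
SELECT data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);
-- returns no rows for the skipped INSERT

--
With Regards,
Amit Kapila.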
On Fri, May 30, 2025 at 6:08 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> To improve this workload scenario, we can modify pg_sync_replication_slots() to
> wait for the primary slot to advance to a suitable position before completing
> synchronization and removing the temporary slot. This would allow the sync to
> complete as soon as the primary slot advances, whether through
> pg_logical_xx_get_changes() or other ways.

My understanding of this area is limited, but this sounds potentially promising to me. The current approach seems very timing-dependent. Depending on the state of the primary vs. the state of the standby, a call to pg_sync_replication_slots() may either create a slot or fail to do so. A call at a slightly earlier or later time might have had a different result. IIUC, this proposal would make different results due to minor timing variations less probable.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 29, 2025 at 8:39 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I didn't know it was intended for testing and debugging purposes, so
> > clarifying it in the documentation would be a good idea.
>
> I have added the suggested docs in v3.
>

- errmsg("could not synchronize replication slot \"%s\"", remote_slot->name),
- errdetail("Logical decoding could not find consistent point from local slot's LSN %X/%X.",
+ errmsg("could not synchronize replication slot \"%s\" to prevent data loss", remote_slot->name),
+ errdetail("Standby does not have enough data to decode WALs at LSN %X/%X.",
             LSN_FORMAT_ARGS(slot->data.restart_lsn)));

I find the errdetail is not clear about the current state, which is that we can't yet build a consistent snapshot on the standby to allow decoding. Would it be better to have an errdetail like: "Standby could not build a consistent snapshot to decode WALs at LSN %X/%X."?

--
With Regards,
Amit Kapila.
On Thu, May 29, 2025 at 11:09 AM shveta malik wrote:
>
> On Wed, May 28, 2025 at 11:56 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > I didn't know it was intended for testing and debugging purposes, so
> > clarifying it in the documentation would be a good idea.
>
> I have added the suggested docs in v3.

Thanks for updating the patch. I have a few suggestions for the document from a user's perspective.

1.
> ... , one
> condition must be met. The logical replication slot on primary must be advanced
> to such a catalog change position (catalog_xmin) and WAL's LSN (restart_lsn) for
> which sufficient data is retained on the corresponding standby server.

The term "catalog change position" might not be easy for some readers to grasp. Would it be clearer to phrase it as follows?

"The logical replication slot on the primary must reach a state where the WALs and system catalog rows retained by the slot are also present on the corresponding standby server."

2.
> If the primary slot is still lagging behind and synchronization is attempted
> for the first time, then to prevent the data loss as explained, persistence
> and synchronization of newly created slot will be skipped, and the following
> log message may appear on standby.

The phrase "lagging behind" typically refers to the standby, which can be a bit confusing. I understand that users can work it out from the surrounding context, but would it be easier to understand with a more detailed description like below?

"If the WALs and system catalog rows retained by the slot on the primary have already been purged from the standby server, ..."

3.
<programlisting>
LOG: could not synchronize replication slot "failover_slot" to prevent data loss
DETAIL: The remote slot needs WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has LSN 0/3003F28 and catalog xmin 766.
</programlisting>

It seems that it lacks one space between "LOG:" and the message.

Best Regards,
Hou zj
On Tue, Jun 10, 2025 at 3:20 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> Thanks for updating the patch.
>
> I have a few suggestions for the document from a user's perspective.
>

Thanks Hou-San, I agree with your suggestions. Addressed in v4.

Also addressed Amit's suggestion at [1] to improve errdetail.

[1]: https://www.postgresql.org/message-id/CAA4eK1JKXCMDqfFgNtemVZ9ge4KrQtwSQG1OwMLNHRBDfnH9rA%40mail.gmail.com

thanks
Shveta
On Wed, Jun 11, 2025 at 8:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 11, 2025 at 7:19 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Jun 10, 2025 at 3:20 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Thanks for updating the patch.
> > >
> > > I have a few suggestions for the document from a user's perspective.
> > >
> > Thanks Hou-San, I agree with your suggestions. Addressed in v4.
> >
> > Also addressed Amit's suggestion at [1] to improve errdetail.
> >
> So, the overall direction we are taking here is that we want to
> improve the existing LOG/DEBUG messages and docs for HEAD and back
> branches. Then we will improve the API behavior based on Hou-San's
> patch for PG19. Let me know if you or others think otherwise.
>
> + <para>
> +  Apart from enabling <link linkend="guc-sync-replication-slots">
> +  <varname>sync_replication_slots</varname></link> to synchronize slots
> +  periodically, failover slots can be manually synchronized by invoking
> +  <link linkend="pg-sync-replication-slots">
> +  <function>pg_sync_replication_slots</function></link> on the standby.
> +  However, this function is primarily intended for testing and debugging
> +  purposes and should be used with caution. The recommended approach to
> +  synchronize slots is by enabling <link linkend="guc-sync-replication-slots">
> +  <varname>sync_replication_slots</varname></link> on the standby, as it
> +  ensures continuous and automatic synchronization of replication slots,
> +  facilitating seamless failover and high availability.
> + </para>
> +
> + <para>
> +  When slot-synchronization setup is done as recommended, and
> +  slot-synchronization is performed the very first time either automatically
> +  or by <link linkend="pg-sync-replication-slots">
> +  <function>pg_sync_replication_slots</function></link>,
> +  then for the synchronized slot to be created and persisted on the standby,
> +  one condition must be met. The logical replication slot on the primary
> +  must reach a state where the WALs and system catalog rows retained by
> +  the slot are also present on the corresponding standby server. This is
> +  needed to prevent any data loss and to allow logical replication to continue
> +  seamlessly through the synchronized slot if needed after promotion.
> +  If the WALs and system catalog rows retained by the slot on the primary have
> +  already been purged from the standby server, and synchronization is attempted
> +  for the first time, then to prevent the data loss as explained, persistence
> +  and synchronization of newly created slot will be skipped, and the following
> +  log message may appear on standby.
> +<programlisting>
> + LOG: could not synchronize replication slot "failover_slot"
> + DETAIL: Synchronization could lead to data loss as the remote slot needs
> + WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has LSN
> + 0/3003F28 and catalog xmin 756
> +</programlisting>
> +  If the logical replication slot is actively consumed by a consumer, no
> +  further manual action is needed by the user, as the slot on primary will
> +  be advanced automatically, and synchronization will proceed in the next
> +  cycle. However, if no logical replication consumer is set up yet, to
> +  advance the slot, it is recommended to manually run the
> +  <link linkend="pg-logical-slot-get-changes">
> +  <function>pg_logical_slot_get_changes</function></link> or
> +  <link linkend="pg-logical-slot-get-binary-changes">
> +  <function>pg_logical_slot_get_binary_changes</function></link> on the
> +  primary slot and allow synchronization to proceed.
> + </para>
> +
>
> I have reworded the above as follows:
>
> To enable periodic synchronization of replication slots, it is
> recommended to activate sync_replication_slots on the standby server.
> While manual synchronization is possible using
> pg_sync_replication_slots, this function is primarily intended for
> testing and debugging and should be used with caution. Automatic
> synchronization via sync_replication_slots ensures continuous slot
> updates, supporting seamless failover and maintaining high
> availability. When slot synchronization is configured as recommended,
> and the initial synchronization is performed either automatically or
> manually via pg_sync_replication_slots, the standby can persist the
> synchronized slot only if the following condition is met: The logical
> replication slot on the primary must retain WALs and system catalog
> rows that are still available on the standby. This ensures data
> integrity and allows logical replication to continue smoothly after
> promotion.
>
> If the required WALs or catalog rows have already been purged from the
> standby, the slot will not be persisted to avoid data loss. In such
> cases, the following log message may appear:
>
> LOG: could not synchronize replication slot "failover_slot"
> DETAIL: Synchronization could lead to data loss as the remote slot
> needs WAL at LSN 0/3003F28 and catalog xmin 754, but the standby has
> LSN 0/3003F28 and catalog xmin 756
>
> If the logical replication slot is actively used by a consumer, no
> manual intervention is needed; the slot will advance automatically,
> and synchronization will resume in the next cycle. However, if no
> consumer is configured, it is advisable to manually advance the slot
> on the primary using pg_logical_slot_get_changes or
> pg_logical_slot_get_binary_changes, allowing synchronization to
> proceed.
>
> Let me know what you think of the above?
>

Phrases like "... it is recommended..." and "... intended for testing and debugging ..." and "... should be used with caution." and "... it is advisable to..." seem like indicators that parts of the above description should be using SGML markup such as <caution> or <warning> or <note> instead of just plain text.

======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Thu, Jun 12, 2025 at 4:13 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Phrases like "... it is recommended..." and "... intended for testing
> and debugging ..." and "... should be used with caution." and "... it
> is advisable to..." seem like indicators that parts of the above
> description should be using SGML markup such as <caution> or <warning>
> or <note> instead of just plain text.
>

I feel WARNING and CAUTION markups could be a little strong for the concerned case. Such markups are generally used when there is a side-effect involved with the usage. But in our case, there is no such side-effect with the API. At most, it may fail without harming the system and will succeed in the next invocation. But I also feel that such sections catch the user's attention. Thus, if needed, we can have a NOTE section to convey the recommended way of slot synchronization. Thoughts?

Similar to our case, I see some other docs using caution words without a CAUTION markup. Please search for 'caution' in [1], [2], [3].

[1]: https://www.postgresql.org/docs/current/continuous-archiving.html
[2]: https://www.postgresql.org/docs/current/sql-altertable.html
[3]: https://www.postgresql.org/docs/18/oauth-validator-design.html#OAUTH-VALIDATOR-DESIGN-USERMAP-DELEGATION
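Something along these lines, if we agree on a NOTE (the wording is just a draft, and the link target follows the earlier patch):

<note>
 <para>
  The recommended way to synchronize replication slots is to enable
  <link linkend="guc-sync-replication-slots">
  <varname>sync_replication_slots</varname></link> on the standby.
  <function>pg_sync_replication_slots</function> is primarily intended
  for testing and debugging and should be used with caution.
 </para>
</note>

thanks
Shveta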