Re: Replication slot is not able to sync up - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Replication slot is not able to sync up |
Date | |
Msg-id | CAD21AoChZmhH70vikmiXH+MXt173PcCvioxtHA_MD1A_Apaq_Q@mail.gmail.com |
In response to | RE: Replication slot is not able to sync up ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |

On Tue, May 27, 2025 at 9:15 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> >
> > On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > In the case presented here, the logical slot is expected to keep
> > > forwarding, and in the consecutive sync cycle, the sync should be
> > > successful. Users using logical decoding APIs should also be aware
> > > that if due for some reason, the logical slot is not moving forward,
> > > the master/publisher node will start accumulating dead rows and WAL,
> > > which can create bigger problems.
> >
> > I've tried this case and am concerned that the slot synchronization using
> > pg_sync_replication_slots() would never succeed while the primary keeps
> > getting write transactions. Even if the user manually consumes changes on the
> > primary, the primary server keeps advancing its XID in the meanwhile. On the
> > standby, we ensure that the
> > TransamVariables->nextXid is beyond the XID of WAL record that it's
> > going to apply so the xmin horizon calculated by
> > GetOldestSafeDecodingTransactionId() ends up always being higher than the
> > slot's catalog_xmin on the primary. We get the log message "could not
> > synchronize replication slot "s" because remote slot precedes local slot" and
> > cleanup the slot on the standby at the end of pg_sync_replication_slots().
>
> I think the issue occurs because unlike the slotsync worker, the SQL API
> removes temporary slots when the function ends, so it cannot hold back the
> standby's catalog_xmin. If transactions on the primary keep advancing xids, the
> source slot's catalog_xmin on the primary fails to catch up with the standby's
> nextXid, causing sync failure.

Agreed with this analysis.

> This only affects the initial sync when creating a new slot on the standby.
> Once the slot exists, the standby's catalog_xmin stabilizes, preventing the
> issue in subsequent syncs.

Right. I think this is an area where we can improve, if there is a real
use case.

> I think the SQL API was mainly intended for testing and debugging purposes
> where controlled sync operations are useful. For production use, the slotsync
> worker (with sync_replication_slots=on) is recommended because it automatically
> handles this problem and requires minimal manual intervention. But to avoid
> confusion, I think we should clearly document this distinction.

I didn't know it was intended for testing and debugging purposes, so
clarifying it in the documentation would be a good idea. Also, I agree
that using the slotsync worker is the primary usage of this feature.

I'm interested in whether there is a use case where the SQL API is
preferable. If there is, we can improve the SQL API part, especially the
first synchronization part, for v19 or later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
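
The initial-sync behavior discussed in this message can be illustrated with a toy model. The sketch below is not PostgreSQL code and is not part of the original message: the function name `simulate`, its parameters, and the XID values are hypothetical, and XID wraparound is ignored. It assumes that in each round the consumer on the primary first catches up (so the remote slot's catalog_xmin equals the primary's current next XID), then more write transactions run and the standby replays them before the sync attempt. A newly created local slot takes its catalog_xmin from the standby's horizon at that moment; the only difference between the two modes is whether that slot survives a failed attempt.

```python
def simulate(retain_slot: bool, rounds: int = 5, writes_per_round: int = 10) -> list[bool]:
    """Return the outcome of each sync attempt (True = slot synced).

    retain_slot=False mimics the SQL API as described in the thread: the
    temporary slot is dropped when the function ends.
    retain_slot=True mimics the slotsync worker: the temporary slot is
    kept and holds back the standby's catalog_xmin between attempts.
    """
    next_xid = 1000        # hypothetical next XID on the primary
    local_xmin = None      # catalog_xmin of the standby's slot, if one exists
    results = []

    for _ in range(rounds):
        # The consumer catches up, so the remote slot's catalog_xmin
        # reaches the primary's current next XID.
        remote_xmin = next_xid

        # The primary keeps writing; the standby replays this WAL, so its
        # own next XID (and hence its safe decoding horizon) moves ahead.
        next_xid += writes_per_round

        # A newly created local slot starts at the standby's current horizon.
        if local_xmin is None:
            local_xmin = next_xid

        # Sync fails while the remote slot precedes the local slot.
        synced = remote_xmin >= local_xmin
        results.append(synced)
        if synced:
            break

        if not retain_slot:
            # Slot dropped at the end of the call; the next attempt
            # starts over from a newer horizon.
            local_xmin = None

    return results


if __name__ == "__main__":
    print("slot dropped after each attempt:", simulate(retain_slot=False))
    print("slot retained between attempts: ", simulate(retain_slot=True))
```

Under these assumptions, the mode that drops the slot never succeeds while writes continue, because every attempt compares the remote catalog_xmin against a fresh, newer horizon. The mode that retains the slot succeeds as soon as the remote slot advances past the horizon captured on the first attempt.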