RE: Replication slot is not able to sync up - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Replication slot is not able to sync up |
Date | |
Msg-id | OS0PR01MB5716F14E904A5CB06053AD6D9470A@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In response to | Re: Replication slot is not able to sync up (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Sat, Jun 14, 2025 at 11:37 PM Dilip Kumar wrote:
>
> On Fri, May 30, 2025 at 3:38 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> > >
> > > On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > In the case presented here, the logical slot is expected to keep
> > > > forwarding, and in the consecutive sync cycle, the sync should be
> > > > successful. Users using logical decoding APIs should also be aware
> > > > that if, for some reason, the logical slot is not moving
> > > > forward, the master/publisher node will start accumulating dead
> > > > rows and WAL, which can create bigger problems.
> > >
> > > I've tried this case and am concerned that the slot synchronization
> > > using pg_sync_replication_slots() would never succeed while the primary
> > > keeps getting write transactions. Even if the user manually consumes
> > > changes on the primary, the primary server keeps advancing its XID
> > > in the meanwhile. On the standby, we ensure that the
> > > TransamVariables->nextXid is beyond the XID of WAL record that it's
> > > going to apply so the xmin horizon calculated by
> > > GetOldestSafeDecodingTransactionId() ends up always being higher
> > > than the slot's catalog_xmin on the primary. We get the log message
> > > "could not synchronize replication slot "s" because remote slot
> > > precedes local slot" and clean up the slot on the standby at the end of
> > > pg_sync_replication_slots().
> >
> > To improve this workload scenario, we can modify
> > pg_sync_replication_slots() to wait for the primary slot to advance to
> > a suitable position before completing synchronization and removing the
> > temporary slot. This would allow the sync to complete as soon as the
> > primary slot advances, whether through
> > pg_logical_xx_get_changes() or other ways.
> >
> > I've created a POC (attached) that currently waits indefinitely for
> > the remote slot to catch up. We could later add a timeout parameter to
> > control maximum wait time if this approach seems acceptable.
> >
> > I tested that, when pgbench TPC-B is running on the primary, calling
> > pg_sync_replication_slots() on the standby correctly blocks until I
> > advance the primary slot position by calling pg_logical_xx_get_changes().
> >
> > If the basic idea sounds reasonable then I can start a separate thread
> > to extend this API. Thoughts?
>
> IMHO, this idea has merit. Have you started a thread for reviewing this patch?

Thank you for looking at it. I plan to start a new thread soon for the upcoming
commit fest, after some additional testing and documentation cleanup.

Best Regards,
Hou zj
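
For readers following the thread, here is a minimal Python sketch of the waiting behaviour discussed above. It is not the attached POC and not PostgreSQL code: `SlotPosition`, `remote_precedes_local`, `wait_for_remote_slot`, and the `fetch_remote` callback are hypothetical names, and the XID comparison is a plain integer comparison rather than PostgreSQL's wraparound-aware one. It only illustrates the idea of polling the primary's slot until it no longer precedes the standby's local slot, with the optional timeout mentioned as a possible later addition.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SlotPosition:
    """Hypothetical, simplified view of a slot's position."""
    restart_lsn: int    # WAL position as a plain integer (assumption)
    catalog_xmin: int   # XID as a plain integer; real XIDs wrap around


def remote_precedes_local(remote: SlotPosition, local: SlotPosition) -> bool:
    """True while the primary's slot is still behind the standby's local slot,
    i.e. the situation reported as "remote slot precedes local slot"."""
    return (remote.restart_lsn < local.restart_lsn
            or remote.catalog_xmin < local.catalog_xmin)


def wait_for_remote_slot(
    fetch_remote: Callable[[], SlotPosition],
    local: SlotPosition,
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the primary until its slot has caught up with the local slot.

    Returns True once the sync could proceed, or False if `timeout` seconds
    elapse first. With timeout=None this waits indefinitely.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        remote = fetch_remote()
        if not remote_precedes_local(remote, local):
            return True
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval)
```

In a real setup `fetch_remote` would query the primary for the slot's current position; here it is left as a callback so the loop can be exercised without a server.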