Re: Improve pg_sync_replication_slots() to wait for primary to advance - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAJpy0uD_cGn_VAnHCRnauio3Bo3vgwTC0fxQD04disHFr6aSQA@mail.gmail.com Whole thread Raw |
In response to | Re: Improve pg_sync_replication_slots() to wait for primary to advance (shveta malik <shveta.malik@gmail.com>) |
List | pgsql-hackers |
On Fri, Aug 1, 2025 at 12:02 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Jul 31, 2025 at 3:11 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > Patch v3 attached. > > > > Thanks for the patch. I tested it, please find a few comments: > > > 1) > it hits an assert > (slotsync_reread_config()-->Assert(sync_replication_slots)) when API > is trying to sync and is in wait loop while in another session, I > enable sync_replication_slots using: > > ALTER SYSTEM SET sync_replication_slots = 'on'; > SELECT pg_reload_conf(); > > Assert: > 025-08-01 10:55:43.637 IST [118576] STATEMENT: SELECT > pg_sync_replication_slots(); > 2025-08-01 10:55:51.730 IST [118563] LOG: received SIGHUP, reloading > configuration files > 2025-08-01 10:55:51.731 IST [118563] LOG: parameter > "sync_replication_slots" changed to "on" > TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c", > Line: 1334, PID: 118576 > postgres: shveta postgres [local] > SELECT(ExceptionalCondition+0xbb)[0x61df0160e090] > postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc] > 2025-08-01 10:55:51.739 IST [118666] ERROR: cannot synchronize > replication slots concurrently > postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2] > postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664] > postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8] > postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea] > postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df] > postgres: shveta postgres [local] > SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60] > postgres: shveta postgres [local] > SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52] > > 2) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot synchronize replication slots when" > + " standby promotion is ongoing"))); > > I think better error message will be: > "exiting from slot synchronization as promotion is triggered" > > This will be better suited in log file as well after below wait statements: > LOG: continuing to wait for remote slot "failover_slot" LSN > (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060) > and catalog xmin (757) > STATEMENT: SELECT pg_sync_replication_slots(); > > 3) > API dumps this when it is waiting for primary: > > ---- > LOG: could not synchronize replication slot "failover_slot2" > DETAIL: Synchronization could lead to data loss, because the remote > slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby > has LSN 0/03066E70 and catalog xmin 770. > STATEMENT: SELECT pg_sync_replication_slots(); > LOG: waiting for remote slot "failover_slot2" LSN (0/3066E70) and > catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin > (770) > STATEMENT: SELECT pg_sync_replication_slots(); > LOG: continuing to wait for remote slot "failover_slot2" LSN > (0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70) > and catalog xmin (770) > STATEMENT: SELECT pg_sync_replication_slots(); > ---- > > Unsure if we shall still dump 'could not synchronize..' when it is > going to retry until it succeeds? The concerned log gives a feeling > that we are done trying and could not synchronize it. What do you > think? > A few more comments: 4) + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) + { + ereport(WARNING, + errmsg("aborting initial sync for slot \"%s\"", + remote_slot->name), + errdetail("This slot was not found on the primary server.")); + + pfree(cmd.data); + walrcv_clear_result(res); + + return false; + } We may have 'aborting sync for slot', can remove 'initial'. 5) I tried a test where there were 4 slots on the publisher, where one was getting used while the others were not. Initiated pg_sync_replication_slots on standby. Forced unused slots to be invalidated by setting idle_replication_slot_timeout=60 on primary, due to which API finished but gave a warning: postgres=# SELECT pg_sync_replication_slots(); WARNING: aborting initial sync for slot "failover_slot" DETAIL: This slot was invalidated on the primary server. WARNING: aborting initial sync for slot "failover_slot2" DETAIL: This slot was invalidated on the primary server. WARNING: aborting initial sync for slot "failover_slot3" DETAIL: This slot was invalidated on the primary server. pg_sync_replication_slots --------------------------- (1 row) Do we need these warnings here? I think we can have it as a LOG rather than having it on console. Thoughts? If we inclined towards WARNING here, will ti be better to have it as a single line: WARNING: aborting sync for slot "failover_slot" as the slot was invalidated on primary WARNING: aborting sync for slot "failover_slot1" as the slot was invalidated on primary WARNING: aborting sync for slot "failover_slot2" as the slot was invalidated on primary 6) - * We do not drop the slot because the restart_lsn can be ahead of the - * current location when recreating the slot in the next cycle. It may - * take more time to create such a slot. Therefore, we keep this slot - * and attempt the synchronization in the next cycle. + * If we're in the slotsync worker, we retain the slot and retry in the + * next cycle. The restart_lsn might advance by then, allowing the slot + * to be created successfully later. */ I like the previous command better as it was conveying the side-effect of dropping the slot herer. Can we try to incorporate the previous comment here and make it specific to slotsync workers? thanks Shveta
pgsql-hackers by date: