Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
From | Shlok Kyal |
---|---|
Subject | Re: How can end users know the cause of LR slot sync delays? |
Date | |
Msg-id | CANhcyEWoXbzRL6xTcahBtuV1vzRH8NiDu0fNjby1XDJ1jW4Ckw@mail.gmail.com Whole thread Raw |
In response to | Re: How can end users know the cause of LR slot sync delays? (Ashutosh Sharma <ashu.coek88@gmail.com>) |
Responses |
Re: How can end users know the cause of LR slot sync delays?
|
List | pgsql-hackers |
On Thu, 18 Sept 2025 at 13:17, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hi Amit, > > On Thu, Sep 18, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu) > > > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > > > Dear Shlok, > > > > > > > > > > Thanks for creating the patch. Personally I prefer approach2; approach1 cannot > > > > > indicate the current status of synchronization, it just shows the history. > > > > > I feel approach2 has more information than approach1. > > > > > > > > > > > > > I also think so but Ashutosh thought that it would be hacky. Ashutosh, > > > > did you have an opinion on this matter after seeing the patches? > > > > > > > > > > Yes, I’ve looked into both the patches. Approach 1 seems quite > > > straightforward. In approach 2, we need to pass some additional > > > arguments to update_local_sync_slot and > > > update_and_persist_local_synced_slot, which makes it feel a little > > > less clean compared to approach 1, where we simply add a new function > > > and call it directly. > > > > > > > This is because the approach-1 doesn't show the latest value of > > sync_status. I mean in the latest cycle if the sync is successful, it > > won't update the stats which I am not sure is correct because users > > may want to know the recent status of sync cycle. Otherwise, the patch > > should be almost the same. > > This should be manageable, no? If we add an additional call to the > stats report function immediately after ReplicationSlotPersist(), > wouldn’t that address the issue? Please correct me if I’m overlooking > something. > > @@ -600,6 +600,8 @@ update_and_persist_local_synced_slot(RemoteSlot > *remote_slot, Oid remote_dbid) > > ReplicationSlotPersist(); > > + pgstat_report_replslot_sync_skip(slot, SLOT_SYNC_SKIP_NONE); > + > ereport(LOG, > errmsg("newly created replication slot \"%s\" > is sync-ready now", > remote_slot->name)); > Currently, in this code change, the skip_reason is updated to 'none' only when the slot state changes from temporary to persistent. However, this logic does not handle cases where the slot is already a persistent sync slot. I believe we should also sync skip can happen for persistent slots. That means, on a successful slot sync, we should update the skip_reason to 'none' even for slots that are already persistent. > In addition to this, should anyone really need to query the skip > reason if pg_replication_slots already shows that the slot is synced > and not temporary? Ideally, users should check the slot status in > pg_replication_slots, and if it indicates the slot is persisted, there > seems little value in enquiring pg_stat_replication_slots for the skip > reason. That said, it’s important to ensure the information in both > views remains consistent. > I have a doubt. Why don't we want to report the sync skip reason once the slots are persisted? for the case: latestFlushPtr = GetStandbyFlushRecPtr(NULL); if (remote_slot->confirmed_lsn > latestFlushPtr) { /* * Can get here only if GUC 'synchronized_standby_slots' on the * primary server was not configured correctly. */ ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("skipping slot synchronization because the received slot sync" " LSN %X/%08X for slot \"%s\" is ahead of the standby position %X/%08X", LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), remote_slot->name, LSN_FORMAT_ARGS(latestFlushPtr))); return false; } Slot sync skip can happen even for persistent slots. So why should we avoid displaying the skip reason in such cases? I checked that if the synchronized_standby_slots GUC is not set properly, we can hit this condition even for persistent slots. I think we should still display the skip reason if the user does not configure this GUC as expected. > > I think we can even try to write a patch > > for approach-2 without an additional out parameter in some of the > > functions. > > We can aim for this, if possible. Thanks, Shlok Kyal
pgsql-hackers by date: