Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Shlok Kyal
Subject Re: How can end users know the cause of LR slot sync delays?
Date
Msg-id CANhcyEWoXbzRL6xTcahBtuV1vzRH8NiDu0fNjby1XDJ1jW4Ckw@mail.gmail.com
Whole thread Raw
In response to Re: How can end users know the cause of LR slot sync delays?  (Ashutosh Sharma <ashu.coek88@gmail.com>)
Responses Re: How can end users know the cause of LR slot sync delays?
List pgsql-hackers
On Thu, 18 Sept 2025 at 13:17, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Amit,
>
> On Thu, Sep 18, 2025 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Sep 17, 2025 at 8:19 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 17, 2025 at 4:24 PM Hayato Kuroda (Fujitsu)
> > > > <kuroda.hayato@fujitsu.com> wrote:
> > > > >
> > > > > Dear Shlok,
> > > > >
> > > > > Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
> > > > > indicate the current status of synchronization, it just shows the history.
> > > > > I feel approach2 has more information than approach1.
> > > > >
> > > >
> > > > I also think so but Ashutosh thought that it would be hacky. Ashutosh,
> > > > did you have an opinion on this matter after seeing the patches?
> > > >
> > >
> > > Yes, I’ve looked into both the patches. Approach 1 seems quite
> > > straightforward. In approach 2, we need to pass some additional
> > > arguments to update_local_sync_slot and
> > > update_and_persist_local_synced_slot, which makes it feel a little
> > > less clean compared to approach 1, where we simply add a new function
> > > and call it directly.
> > >
> >
> > This is because the approach-1 doesn't show the latest value of
> > sync_status. I mean in the latest cycle if the sync is successful, it
> > won't update the stats which I am not sure is correct because users
> > may want to know the recent status of sync cycle. Otherwise, the patch
> > should be almost the same.
>
> This should be manageable, no? If we add an additional call to the
> stats report function immediately after ReplicationSlotPersist(),
> wouldn’t that address the issue? Please correct me if I’m overlooking
> something.
>
> @@ -600,6 +600,8 @@ update_and_persist_local_synced_slot(RemoteSlot
> *remote_slot, Oid remote_dbid)
>
>         ReplicationSlotPersist();
>
> +       pgstat_report_replslot_sync_skip(slot, SLOT_SYNC_SKIP_NONE);
> +
>         ereport(LOG,
>                         errmsg("newly created replication slot \"%s\"
> is sync-ready now",
>                                    remote_slot->name));
>
Currently, in this code change, the skip_reason is updated to 'none'
only when the slot state changes from temporary to persistent.
However, this logic does not handle cases where the slot is already a
persistent sync slot.
I believe we should also sync skip can happen for persistent slots.
That means, on a successful slot sync, we should update the
skip_reason to 'none' even for slots that are already persistent.

> In addition to this, should anyone really need to query the skip
> reason if pg_replication_slots already shows that the slot is synced
> and not temporary? Ideally, users should check the slot status in
> pg_replication_slots, and if it indicates the slot is persisted, there
> seems little value in enquiring pg_stat_replication_slots for the skip
> reason. That said, it’s important to ensure the information in both
> views remains consistent.
>
I have a doubt. Why don't we want to report the sync skip reason once
the slots are persisted?
for the case:

latestFlushPtr = GetStandbyFlushRecPtr(NULL);
    if (remote_slot->confirmed_lsn > latestFlushPtr)
    {
        /*
         * Can get here only if GUC 'synchronized_standby_slots' on the
         * primary server was not configured correctly.
         */
        ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("skipping slot synchronization because the
received slot sync"
                       " LSN %X/%08X for slot \"%s\" is ahead of the
standby position %X/%08X",
                       LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
                       remote_slot->name,
                       LSN_FORMAT_ARGS(latestFlushPtr)));

        return false;
    }

Slot sync skip can happen even for persistent slots. So why should we
avoid displaying the skip reason in such cases?
I checked that if the synchronized_standby_slots GUC is not set
properly, we can hit this condition even for persistent slots.
I think we should still display the skip reason if the user does not
configure this GUC  as expected.

> > I think we can even try to write a patch
> > for approach-2 without an additional out parameter in some of the
> > functions.
>
> We can aim for this, if possible.

Thanks,
Shlok Kyal



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: [PATCH] Introduce unified support for composite GUC options
Next
From: Ajin Cherian
Date:
Subject: Re: Improve pg_sync_replication_slots() to wait for primary to advance