Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Shlok Kyal
Subject Re: How can end users know the cause of LR slot sync delays?
Date
Msg-id CANhcyEWbY4W3hCDbMbKH4ME-euw60vvg9eLLMMjbG0v2NJGAVg@mail.gmail.com
Whole thread Raw
In response to Re: How can end users know the cause of LR slot sync delays?  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
List pgsql-hackers
On Wed, 24 Sept 2025 at 16:39, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> On Wed, 17 Sept 2025 at 16:24, Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Shlok,
> >
> > Thanks for creating the patch. Personally I prefer approach2; approach1 cannot
> > indicate the current status of synchronization, it just shows the history.
> > I feel approach2 has more information than approach1.
> >
> I agree with your point. I think the preferred behaviour of
> slot_sync_skip_reason is to indicate the current status of
> synchronization. And taking account the current behaviour of columns
> of views pg_replication_slots and pg_stat_replication_slots, I think
> slot_sync_skip_reason is suited for pg_replication_slots view and
> I am proceeding ahead with approach2
>
> > Below are my comments for your patch.
> >
> > 01.
> > ```
> > +        Number of times the slot sync is skipped.
> > ```
> >
> > "slot sync" should be "slot synchronization". Same thing can be saied for other
> > attributes.
> >
> > 02.
> > ```
> > +        Reason of the last slot sync skip.
> > ```
> >
> > Possible values must be clarified.
> >
> > 03.
> > ```
> > +            s.slot_sync_skip_count,
> > +            s.last_slot_sync_skip,
> > +                       s.slot_sync_skip_reason,
> > ```
> >
> > Last line has tab-blank but others have space-blank.
> >
> > 04.
> > ```
> > +typedef enum SlotSyncSkipReason
> > +{
> > +       SLOT_SYNC_SKIP_NONE,            /* No skip */
> > +       SLOT_SYNC_SKIP_STANDBY_BEHIND,  /* Standby is behind the remote slot */
> > +       SLOT_SYNC_SKIP_REMOTE_BEHIND,   /* Remote slot is behind the local slot */
> > +       SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT   /* Standby could not reach a
> > +                                                                                        * consistent snapshot */
> > +}                      SlotSyncSkipReason
> > ```
> >
> > a.
> > Can we add comment atop the enum?
> >
> > b.
> > SLOT_SYNC_SKIP_STANDBY_BEHIND is misleading; it indicates the standby server has
> > not received WALs required by slots, but enum does not clarify it.
> > How about SLOT_SYNC_SKIP_MISSING_WAL_RECORDS or something? Better naming are very
> > welcome.
> >
> > c
> > s/reach/build/.
> >
> > 05.
> > ```
> > @@ -646,11 +652,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> >                                            remote_slot->name,
> >                                            LSN_FORMAT_ARGS(latestFlushPtr)));
> >
> > -               return false;
> > +               /* If the slot is not present on the local */
> > +               if (!(slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> > +                       return false;
> >         }
> > ```
> >
> > Looks like if users try to sync slots via SQL interfaces, the statistics cannot
> > be updated.
> >
> > 06. update_slot_sync_skip_stats
> > ```
> > +       /* Update the slot sync reason */
> > +       SpinLockAcquire(&slot->mutex);
> > +       slot->slot_sync_skip_reason = skip_reason;
> > +       SpinLockRelease(&slot->mutex);
> > ```
> >
> > I feel no need to update the reason if the slot->slot_sync_skip_reason
> > and skip_reason are the same.
> >
> > 07. synchronize_one_slot
> > ```
> > +               /*
> > +                * If standby is behind remote slot and the synced slot is present on
> > +                * local.
> > +                */
> > +               if (remote_slot->confirmed_lsn > latestFlushPtr)
> > +               {
> > +                       if (synced)
> > +                               update_slot_sync_skip_stats(slot, skip_reason);
> > +                       return false;
> > +               }
> > ```
> >
> > This condition exist in the same function; can we combine?
> >
> > 08. GetSlotSyncSkipReason()
> >
> > Do we have to do pstrdup() here? I found a similar function get_snapbuild_state_desc(),
> > and it does not use.
> >
> I check similar functions and I think we can remove pstrup. Removed in
> the latest patch.
>
> > 09.
> >
> > Can you consider a test for added codes?
> Added test in 0002 patch
>
> I have also addressed remaining comments and attached the latest
> version of patch.
>
The CF Bot was failing as meson.build was not formatted appropriately.
I have fixed it and made some cosmetic changes in the test file.
I ran the CF tests on my local repository and it is passing all tests.

I have attached the updated version.

Thanks,
Shlok Kyal

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [BUG] temporary file usage report with extended protocol and unnamed portals
Next
From: Kirill Reshke
Date:
Subject: Re: Remove custom redundant full page write description from GIN