Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Shlok Kyal
Subject Re: How can end users know the cause of LR slot sync delays?
Date
Msg-id CANhcyEXfAcmmfrYde8cTfiAM0EGs6pOqx=Gb8KAiA1PfQUvFjw@mail.gmail.com
Whole thread Raw
In response to RE: How can end users know the cause of LR slot sync delays?  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
On Wed, 1 Oct 2025 at 08:26, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Shlok,
>
> > Thanks for updating the patch. Here are my comments.
>
> I found one more comment.
>
> ```
> +       /*
> +        * If found_consistent_snapshot is not NULL and a consistent snapshot is
> +        * found set the slot sync skip reason to none. Else, if consistent
> +        * snapshot is not found the stats will be updated in the function
> +        * update_and_persist_local_synced_slot
> +        */
> +       if (!found_consistent_snapshot || *found_consistent_snapshot)
> +               update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
> ```
>
> I think the condition is confusing; in code level there is a path that
> found_consistent_snapshot is NULL but synchronization happened. (Not sure it is
> possible though).
>
> I think it is better to put update_slot_sync_skip_stats() near the sync part.
> If the snapshot exists from the beginning, it can be done unconditionally,
> otherwise we can check again. Attached .diffs file implements it.
>

I agree that the condition can be confusing. I checked your patch and
have a concern.
For the change:
```
  LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
  found_consistent_snapshot);

+ /* Update the slot sync skip reason if snapshot could be created */
+ if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
+ {
+ Assert(!found_consistent_snapshot ||
+    *found_consistent_snapshot);
+ update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
+ }
+
```

I debugged and noticed that, on a successful sync slot run, in some
cases we do not enter the if condition here which can lead to
misleading slot_sync_skip_reason.
I noticed that even if the synced slot is advanced and
'found_consistent_snapshot' set as true,
SnapBuildSnapshotExists(remote_slot->restart_lsn) can return false.

As far as I understand, LogicalSlotAdvanceAndCheckSnapState will
advance the slot but will not serialize the snapshot and hence
SnapBuildSnapshotExists can return false.
I have also added a test case in 049_slot_skip_stats named
"slot_sync_skip_reason is reset after successful sync"  which can
reproduce it.

I think we can update the stats when "slot->data.confirmed_flush ==
remote_slot->confirmed_lsn" instead of checking
'SnapBuildSnapshotExists'. Thoughts?

Thanks,
Shlok Kyal



pgsql-hackers by date:

Previous
From: Shlok Kyal
Date:
Subject: Re: How can end users know the cause of LR slot sync delays?
Next
From: Amit Kapila
Date:
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart