Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
| From | Shlok Kyal |
|---|---|
| Subject | Re: How can end users know the cause of LR slot sync delays? |
| Date | |
| Msg-id | CANhcyEXEyUt8TycOqkgdWT+NeJASM=1acU1dK64UNsJeKMQFQA@mail.gmail.com Whole thread Raw |
| In response to | Re: How can end users know the cause of LR slot sync delays? (shveta malik <shveta.malik@gmail.com>) |
| Responses |
RE: How can end users know the cause of LR slot sync delays?
|
| List | pgsql-hackers |
On Wed, 5 Nov 2025 at 11:49, shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Nov 4, 2025 at 3:17 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Mon, Nov 3, 2025 at 3:14 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > I have attached the latest patch. > > > > > > > > > > Thanks. I have started going through it. > > > > > > I’m not sure if this has already been discussed; I couldn’t find any > > > mention of it in the thread. Why don’t we persist > > > 'slot_sync_skip_reason' (it is outside of > > > ReplicationSlotPersistentData)? If a slot wasn’t synced during the > > > last cycle and the server restarts, it would be helpful to know the > > > reason it wasn’t synced prior to the node restart. > > > Actually I did not think in this direction. I think it will be useful to persist 'slot_sync_skip_reason'. I have made the change for the same in the latest patch. > > > > Please find a few more comments: > > > > 1) > > last_slot_sync_skip > > > > Will 'last_slotsync_skip_at' be a better name? > > > > Since we refer worker as slotsync in docs, I feel slotsync seems a > > more natural choice than slot_sync. Also '_at' gives clarity that it > > is about time rather than a boolean (which currently it seems like). > > > > Same goes for slot_sync_skip_count and slot_sync_skip_reason. Shall > > these be slotsync_skip_count and slotsync_skip_reason. > > Fixed it. > > 2) > > +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason > > skip_reason, > > + bool acquire_slot) > > > > It looks like there is only one caller that passes acquire_slot as > > true, while all others pass false. Instead of keeping the acquire_slot > > parameter, will it be better if we remove it and add an > > Assert(MyReplication) to ensure the slot is already acquired? We can > > add a comment stating that this function expects the slot to be > > acquired by the caller. The one caller that currently passes > > acquire_slot = true can acquire the slot explicitly before invoking > > this function. Thoughts? This idea looks good to me. I have updated the patch accordingly. > > > > 3) > > In update_and_persist_local_synced_slot(), we get both > > 'found_consistent_snapshot' and 'remote_slot_precedes' from > > update_local_synced_slot(). But skipsync-reason for > > 'remote_slot_precedes' is updated inside update_local_synced_slot() > > while skipsync-reason for '!found_consistent_snapshot' is updated in > > caller update_and_persist_local_synced_slot. Is there a reason for > > that? > > update_and_persist_local_synced_slot is called when the synced slot is in temporary state and we are calling function 'update_local_synced_slot' directly for permanent slots.Slot sync skip when "remote_slot_precedes" is true can happen for both permanent and temporary slot. So I think we need to update the stats in "update_local_synced_slot" Whereas we are skipping slot sync only for temporary slots when a consistent snapshot is not found. So I added this in the function "update_and_persist_local_synced_slot". > > 4) > > What about the case where the slot is invalidated and sync is skipped? > > I do not see any stats for that. See 'Skip the sync of an invalidated > > slot' in synchronize_one_slot(). If it is already discussed and > > concluded, please add a comment. > > It was not discussed earlier. The pg_replication_slots already have a column name 'invalidation_reason'. And when the remote slot is invalidated the local slot's is also invalidated. So, should we be required to maintain this in 'slotsync_skip_reason' as well? I think it would be kind of redundant? Thoughts? > > Few more on 001: > > 5) > The name SS_SKIP_MISSING_WAL_RECORD doesn’t seem appropriate. It > sounds more like some WAL issue, rather than indicating that the WAL > hasn’t been flushed. A better name could be SS_SKIP_WAL_NOT_FLUSHED. > I think that the suggested name is better. I have updated the patch accordingly. > 6) > Instead of calling 'update_slot_sync_skip_stats' at multiple places, > how about we just update the skip_reason everywhere and make a call to > 'update_slot_sync_skip_stats' only in synchronize_one_slot(). IMO, > that will look cleaner. Thoughts? > I tried this approach in [1] (see v2_approach2). This approach would require passing extra parameters to the functions. Here Amit suggested that we should try an approach where this can be avoided. So, I came up with the current approach. See [2]. I have addressed the comments and attached the updated v7 patch. [1]: https://www.postgresql.org/message-id/CANhcyEXHcdoRRo0N0uib-t7mfkbotv%3DaYjAWAekDAbHCRe%2BBng%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1KZLPxv7VBZf%3DBp9%3D-pzKNfvNmFDqFYYzwkowE4FpRs1A%40mail.gmail.com Thanks, Shlok Kyal
Attachment
pgsql-hackers by date: