Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
From | Shlok Kyal |
---|---|
Subject | Re: How can end users know the cause of LR slot sync delays? |
Date | |
Msg-id | CANhcyEWbY4W3hCDbMbKH4ME-euw60vvg9eLLMMjbG0v2NJGAVg@mail.gmail.com Whole thread Raw |
In response to | Re: How can end users know the cause of LR slot sync delays? (Shlok Kyal <shlok.kyal.oss@gmail.com>) |
List | pgsql-hackers |
On Wed, 24 Sept 2025 at 16:39, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Wed, 17 Sept 2025 at 16:24, Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Shlok, > > > > Thanks for creating the patch. Personally I prefer approach2; approach1 cannot > > indicate the current status of synchronization, it just shows the history. > > I feel approach2 has more information than approach1. > > > I agree with your point. I think the preferred behaviour of > slot_sync_skip_reason is to indicate the current status of > synchronization. And taking account the current behaviour of columns > of views pg_replication_slots and pg_stat_replication_slots, I think > slot_sync_skip_reason is suited for pg_replication_slots view and > I am proceeding ahead with approach2 > > > Below are my comments for your patch. > > > > 01. > > ``` > > + Number of times the slot sync is skipped. > > ``` > > > > "slot sync" should be "slot synchronization". Same thing can be saied for other > > attributes. > > > > 02. > > ``` > > + Reason of the last slot sync skip. > > ``` > > > > Possible values must be clarified. > > > > 03. > > ``` > > + s.slot_sync_skip_count, > > + s.last_slot_sync_skip, > > + s.slot_sync_skip_reason, > > ``` > > > > Last line has tab-blank but others have space-blank. > > > > 04. > > ``` > > +typedef enum SlotSyncSkipReason > > +{ > > + SLOT_SYNC_SKIP_NONE, /* No skip */ > > + SLOT_SYNC_SKIP_STANDBY_BEHIND, /* Standby is behind the remote slot */ > > + SLOT_SYNC_SKIP_REMOTE_BEHIND, /* Remote slot is behind the local slot */ > > + SLOT_SYNC_SKIP_NO_CONSISTENT_SNAPSHOT /* Standby could not reach a > > + * consistent snapshot */ > > +} SlotSyncSkipReason > > ``` > > > > a. > > Can we add comment atop the enum? > > > > b. > > SLOT_SYNC_SKIP_STANDBY_BEHIND is misleading; it indicates the standby server has > > not received WALs required by slots, but enum does not clarify it. > > How about SLOT_SYNC_SKIP_MISSING_WAL_RECORDS or something? Better naming are very > > welcome. > > > > c > > s/reach/build/. > > > > 05. > > ``` > > @@ -646,11 +652,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid) > > remote_slot->name, > > LSN_FORMAT_ARGS(latestFlushPtr))); > > > > - return false; > > + /* If the slot is not present on the local */ > > + if (!(slot = SearchNamedReplicationSlot(remote_slot->name, true))) > > + return false; > > } > > ``` > > > > Looks like if users try to sync slots via SQL interfaces, the statistics cannot > > be updated. > > > > 06. update_slot_sync_skip_stats > > ``` > > + /* Update the slot sync reason */ > > + SpinLockAcquire(&slot->mutex); > > + slot->slot_sync_skip_reason = skip_reason; > > + SpinLockRelease(&slot->mutex); > > ``` > > > > I feel no need to update the reason if the slot->slot_sync_skip_reason > > and skip_reason are the same. > > > > 07. synchronize_one_slot > > ``` > > + /* > > + * If standby is behind remote slot and the synced slot is present on > > + * local. > > + */ > > + if (remote_slot->confirmed_lsn > latestFlushPtr) > > + { > > + if (synced) > > + update_slot_sync_skip_stats(slot, skip_reason); > > + return false; > > + } > > ``` > > > > This condition exist in the same function; can we combine? > > > > 08. GetSlotSyncSkipReason() > > > > Do we have to do pstrdup() here? I found a similar function get_snapbuild_state_desc(), > > and it does not use. > > > I check similar functions and I think we can remove pstrup. Removed in > the latest patch. > > > 09. > > > > Can you consider a test for added codes? > Added test in 0002 patch > > I have also addressed remaining comments and attached the latest > version of patch. > The CF Bot was failing as meson.build was not formatted appropriately. I have fixed it and made some cosmetic changes in the test file. I ran the CF tests on my local repository and it is passing all tests. I have attached the updated version. Thanks, Shlok Kyal
Attachment
pgsql-hackers by date: