Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Nisha Moond |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | CABdArM4w-rBmSMqPvsWLdj+OwLZzOZ-MX_596qhHmc+XvP9dVg@mail.gmail.com Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Peter Smith <smithpb2250@gmail.com>) |
List | pgsql-hackers |
On Thu, Nov 14, 2024 at 5:29 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Nisha. > > Thanks for the recent patch updates. Here are my review comments for > the latest patch v48-0001. > Thank you for the review. Comments are addressed in v49 version. Below is my response to comments that may require further discussion. > ====== > doc/src/sgml/system-views.sgml > > 2. > <para> > The time since the slot has become inactive. > - <literal>NULL</literal> if the slot is currently being used. > - Note that for slots on the standby that are being synced from a > + <literal>NULL</literal> if the slot is currently being used. Once the > + slot is invalidated, this value will remain unchanged until we shutdown > + the server. Note that for slots on the standby that are being > synced from a > primary server (whose <structfield>synced</structfield> field is > <literal>true</literal>), the > > Is this change related to the new inactivity timeout feature or are > you just clarifying the existing behaviour of the 'active_since' > field. > Yes, this patch introduces inactive_timeout invalidation and prevents updates to inactive_since for invalid slots. Only a node restart can modify it, so, I believe we should retain these lines in this patch. > Note there is already another thread [1] created to patch/clarify this > same field. So if you are just clarifying existing behavior then IMO > it would be better if you can to try and get your desired changes > included there quickly before that other patch gets pushed. > Thanks for the reference, I have posted my suggestion on the thread. > > ReplicationSlotAcquire: > > 5. > + * > + * An error is raised if error_if_invalid is true and the slot has been > + * invalidated previously. > */ > void > -ReplicationSlotAcquire(const char *name, bool nowait) > +ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid) > > This function comment makes it seem like "invalidated previously" > might mean *any* kind of invalidation, but later in the body of the > function we find the logic is really only used for inactive timeout. > > + /* > + * An error is raised if error_if_invalid is true and the slot has been > + * previously invalidated due to inactive timeout. > + */ > > So, I think a better name for that parameter might be > 'error_if_inactive_timeout' > > OTOH, if it really is supposed to erro for *any* kind of invalidation > then there needs to be more ereports. > +1 to the idea. I have created a separate patch v49-0001 adding more ereports for all kinds of invalidations. > ~~~ > SlotInactiveTimeoutCheckAllowed: > > 8. > +/* > + * Is this replication slot allowed for inactive timeout invalidation check? > + * > + * Inactive timeout invalidation is allowed only when: > + * > + * 1. Inactive timeout is set > + * 2. Slot is inactive > + * 3. Server is in recovery and slot is not being synced from the primary > + * > + * Note that the inactive timeout invalidation mechanism is not > + * applicable for slots on the standby server that are being synced > + * from the primary server (i.e., standby slots having 'synced' field 'true'). > + * Synced slots are always considered to be inactive because they don't > + * perform logical decoding to produce changes. > + */ > > 8a. > Somehow that first sentence seems strange. Would it be better to write it like: > > SUGGESTION > Can this replication slot timeout due to inactivity? > I feel the suggestion is not very clear on the purpose of the function, This function doesn't check inactivity or decide slot timeout invalidation. It only pre-checks if the slot qualifies for an inactivity check, which the caller will perform. As I have changed function name too as per commnet#9, I used the following - "Is inactive timeout invalidation possible for this replication slot?" Thoughts? > ~ > 8c. > Similarly, I think something about that "Note that the inactive > timeout invalidation mechanism is not applicable..." paragraph needs > tweaking because IMO that should also now be saying something about > 'RecoveryInProgress'. > 'RecoveryInProgress' check indicates that the server is a standby, and the mentioned paragraph uses the term "standby" to describe the condition. It seems unnecessary to mention RecoveryInProgress separately. > ~~~ > > InvalidatePossiblyObsoleteSlot: > > 10. > break; > + case RS_INVAL_INACTIVE_TIMEOUT: > + > + /* > + * Check if the slot needs to be invalidated due to > + * replication_slot_inactive_timeout GUC. > + */ > > Since there are no other blank lines anywhere in this switch, the > introduction of this one in v48 looks out of place to me. pgindent automatically added this blank line after 'case RS_INVAL_INACTIVE_TIMEOUT'. > IMO it would > be more readable if a blank line followed each/every of the breaks, > but then that is not a necessary change for this patch so... > Since it's not directly related to the patch, I feel it might be best to leave it as is for now. > ~~~ > > 11. > + /* > + * Invalidation due to inactive timeout implies that > + * no one is using the slot. > + */ > + Assert(s->active_pid == 0); > > Given this assertion, does it mean that "(s->active_pid == 0)" should > have been another condition done up-front in the function > 'SlotInactiveTimeoutCheckAllowed'? > I don't think it's a good idea to check (s->active_pid == 0) upfront, before the timeout-invalidation check. AFAIU, this assertion is meant to ensure active_pid = 0 only if the slot is going to be invalidated, i.e., when the following condition is true: TimestampDifferenceExceeds(s->inactive_since, now, replication_slot_inactive_timeout_sec * 1000) Thoughts? Open to others' opinions too. > ~~~ > > 12. > /* > - * If the slot can be acquired, do so and mark it invalidated > - * immediately. Otherwise we'll signal the owning process, below, and > - * retry. > + * If the slot can be acquired, do so and mark it as invalidated. If > + * the slot is already ours, mark it as invalidated. Otherwise, we'll > + * signal the owning process below and retry. > */ > - if (active_pid == 0) > + if (active_pid == 0 || > + (MyReplicationSlot == s && > + active_pid == MyProcPid)) > > I wasn't sure how this change belongs to this patch, because the logic > of the previous review comment said for the case of invalidation due > to inactivity that active_id must be 0. e.g. Assert(s->active_pid == > 0); > I don't fully understand the purpose of this change yet. I'll look into it further and get back. > ~~~ > > RestoreSlotFromDisk: > > 13. > - slot->inactive_since = GetCurrentTimestamp(); > + slot->inactive_since = now; > > In v47 this assignment used to call the function > 'ReplicationSlotSetInactiveSince'. I recognise there is a very subtle > difference between direct assignment and the function, because the > function will skip assignment if the slot is already invalidated. > Anyway, if you are *deliberately* not wanting to call > ReplicationSlotSetInactiveSince here then I think this assignment > should be commented to explain the reason why not, otherwise someone > in the future might be tempted to think it was just an oversight and > add the call back in that you don't want. > Added comment saying avoid using ReplicationSlotSetInactiveSince() here as it will skip the invalid slots. ~~~~ -- Thanks, Nisha
pgsql-hackers by date: