RE: persist logical slots to disk during shutdown checkpoint - Mailing list pgsql-hackers
From: Zhijie Hou (Fujitsu)
Subject: RE: persist logical slots to disk during shutdown checkpoint
Date:
Msg-id: OS0PR01MB57167ADBAA5F87D78FD6B87394E8A@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Re: persist logical slots to disk during shutdown checkpoint (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: persist logical slots to disk during shutdown checkpoint
List: pgsql-hackers
On Tuesday, September 5, 2023 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> On Tue, Sep 5, 2023 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 5, 2023 at 7:54 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Monday, September 4, 2023 6:15 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Mon, 4 Sept 2023 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Sep 1, 2023 at 10:50 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 1 Sept 2023 at 10:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > > I think we should also ensure that slots are not invalidated
> > > > > > > (slot.data.invalidated != RS_INVAL_NONE) before marking them
> > > > > > > dirty because we don't allow decoding from such slots, so we
> > > > > > > shouldn't include those.
> > > > > >
> > > > > > Added this check.
> > > > > >
> > > > > > Apart from this I have also fixed the following issues that
> > > > > > were agreed on: a) Setting slots to dirty in
> > > > > > CheckPointReplicationSlots instead of setting it in
> > > > > > SaveSlotToPath
> > > > >
> > > > > + if (is_shutdown && SlotIsLogical(s))
> > > > > + {
> > > > > +     SpinLockAcquire(&s->mutex);
> > > > > +     if (s->data.invalidated == RS_INVAL_NONE &&
> > > > > +         s->data.confirmed_flush != s->last_saved_confirmed_flush)
> > > > > +         s->dirty = true;
> > > > >
> > > > > I think it is better to use ReplicationSlotMarkDirty() as that
> > > > > would be consistent with all other usages.
> > > >
> > > > ReplicationSlotMarkDirty works only on MyReplicationSlot whereas
> > > > CheckpointReplicationSlots loops through all the slots and marks
> > > > the appropriate slot as dirty, we might have to change
> > > > ReplicationSlotMarkDirty to take the slot as input parameter and
> > > > all callers should pass MyReplicationSlot.
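[Editor's note: a standalone sketch of the shutdown-time check quoted above. The field names follow the quoted patch, but `MockSlot`, `MarkDirtyAtShutdown`, and the spinlock macros are invented stand-ins for PostgreSQL's `ReplicationSlot` and `s_lock.h` primitives, not the actual server code.]

```c
/* Sketch of the CheckPointReplicationSlots() check from the quoted patch.
 * MockSlot and MarkDirtyAtShutdown are illustrative names only. */
#include <stdbool.h>
#include <stdint.h>

typedef int slock_t;                      /* mock spinlock */
#define SpinLockAcquire(lock) (*(lock) = 1)
#define SpinLockRelease(lock) (*(lock) = 0)

#define RS_INVAL_NONE 0

typedef struct MockSlot
{
    slock_t  mutex;
    int      invalidated;                 /* stands in for s->data.invalidated */
    uint64_t confirmed_flush;             /* stands in for s->data.confirmed_flush */
    uint64_t last_saved_confirmed_flush;
    bool     dirty;
    bool     is_logical;
} MockSlot;

/* At a shutdown checkpoint, mark a logical slot dirty under its spinlock
 * only if it is not invalidated and its confirmed_flush LSN has advanced
 * since the last time the slot was saved to disk. */
void
MarkDirtyAtShutdown(MockSlot *s, bool is_shutdown)
{
    if (is_shutdown && s->is_logical)
    {
        SpinLockAcquire(&s->mutex);
        if (s->invalidated == RS_INVAL_NONE &&
            s->confirmed_flush != s->last_saved_confirmed_flush)
            s->dirty = true;
        SpinLockRelease(&s->mutex);
    }
}
```

A slot whose confirmed_flush LSN has not moved since the last save stays clean, which is exactly what lets the shutdown checkpoint skip needless slot writes.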
> > > Personally, I feel if we want to centralize the code of marking
> > > dirty into a function, we can introduce a new static function
> > > MarkSlotDirty(slot) to mark passed slot dirty and let
> > > ReplicationSlotMarkDirty and CheckpointReplicationSlots call it. Like:
> > >
> > > void
> > > ReplicationSlotMarkDirty(void)
> > > {
> > >     MarkSlotDirty(MyReplicationSlot);
> > > }
> > >
> > > +static void
> > > +MarkSlotDirty(ReplicationSlot *slot)
> > > +{
> > > +    Assert(slot != NULL);
> > > +
> > > +    SpinLockAcquire(&slot->mutex);
> > > +    slot->just_dirtied = true;
> > > +    slot->dirty = true;
> > > +    SpinLockRelease(&slot->mutex);
> > > +}
> > >
> > > This is somewhat similar to the relation between ReplicationSlotSave
> > > (serialize my backend's replication slot) and SaveSlotToPath (save the
> > > passed slot).
> > >
> > > > Another thing is we have already taken spin lock to access
> > > > last_confirmed_flush_lsn from CheckpointReplicationSlots, we could
> > > > set dirty flag here itself, else we will have to release the lock
> > > > and call ReplicationSlotMarkDirty which will take lock again.
> > >
> > > Yes, this is unavoidable, but maybe it's not a big problem as we
> > > only do it at shutdown.
> >
> > True but still it doesn't look elegant. I also thought about having a
> > probably inline function that marks both just_dirty and dirty fields.
> > However, that requires us to assert that the caller has already
> > acquired a spinlock. I see a macro SpinLockFree() that might help but
> > it didn't seem to be used anywhere in the code so not sure if we can
> > rely on it.
>
> Can't we just have code like this? I mean we will have to make
> ReplicationSlotMarkDirty take slot as an argument or have another version
> which takes slot as an argument and that would be called by us as well as by
> ReplicationSlotMarkDirty().
> I mean why do we need these checks
> (s->data.invalidated == RS_INVAL_NONE &&
> s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> mutex? Walsender is shutdown so confirmed flush LSN can not move
> concurrently and slot can not be invalidated as well because that is
> done by checkpointer and we are in checkpointer?

I agree with your analysis that the lock may be unnecessary for now and the
code will work, but I personally feel we'd better take the spinlock.

Firstly, considering our discussion on the potential extension of persisting
the slot for online checkpoints in the future, we will need the lock at that
time anyway, so taking it here avoids overlooking the need to add it later.
And the lock won't cause any performance or concurrency issue.

Additionally, if we don't take the lock, we rely on the assumption that the
walsender will exit before the shutdown checkpoint. Currently, that's true
for a logical walsender, but a physical walsender can exit later than the
checkpointer. So, I am slightly worried that if we change the logical
walsender's exit timing in the future, the assumption may no longer hold.

Besides, for non-built-in logical replication, if someone creates their own
walsender or another process to send the changes and that process doesn't
exit before the shutdown checkpoint, it may also be a problem. Although I
don't have existing examples of such extensions, I feel taking the lock
would make the code more robust.

Best Regards,
Hou zj