Re: persist logical slots to disk during shutdown checkpoint - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: persist logical slots to disk during shutdown checkpoint
Date
Msg-id CAA4eK1JZ7GL+xOf-OpNBpR9WuZeeS_wWrw-t7uUFTzSopUTdGg@mail.gmail.com
Whole thread Raw
In response to Re: persist logical slots to disk during shutdown checkpoint  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: persist logical slots to disk during shutdown checkpoint
List pgsql-hackers
On Tue, Sep 5, 2023 at 6:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
>
> > > Can't we just have code like this?  I mean we will have to make
> > > ReplicationSlotMarkDirty take slot as an argument or have another version
> > > which takes slot as an argument and that would be called by us as well as by
> > > ReplicationSlotMarkDirty().  I mean why do we need these checks
> > > (s-(data.invalidated == RS_INVAL_NONE &&
> > > s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> > > mutex?  Walsender is shutdown so confirmed flush LSN can not move
> > > concurrently and slot can not be invalidated as well because that is done by
> > > checkpointer and we are in checkpointer?
> >
>
...
> > Additionally, if we don't take the lock, we rely on the assumption that the
> > walsender will exit before the shutdown checkpoint, currently, that's true for
> > logical walsender, but physical walsender can exit later than checkpointer. So,
> > I am slight woirred that if we change the logical walsender's exit timing in
> > the future, the assumption may not hold.
> >
> > Besides, for non-built-in logical replication, if someone creates their own
> > walsender or other processes to send the changes and the process doesn't exit
> > before the shutdown checkpoint, it may also be a problem. Although I don't have
> > exsiting examples about these extensions, but I feel taking the lock would make
> > it more robust.
>
> I think our all logic is based on that the walsender is existed
> already.  If not then even if you check under the mutex that the
> confirmed flush LSN is not changed then it can changed right after you
> release the lock and then we will not be flushing the latest update of
> the confirmed flush lsn to the disk and our logic of comparing
> checkpoint.redo with the confirmed flush lsn might not work?
>

Right, it can change and in that case, the check related to
confirm_flush LSN will fail during the upgrade. However, the point is
that if we don't take spinlock, we need to properly write comments on
why unlike in other places it is safe here to check these values
without spinlock. We can do that but I feel we have to be careful for
all future usages of these variables, so, having spinlock makes them
follow the normal coding pattern which I feel makes it more robust.
Yes, marking dirty via common function also has merits but personally,
I find it better to follow the normal coding practice of checking the
required fields under spinlock. The other possibility is to first
check if we need to mark the slot dirty under spinlock, then release
the spinlock, and then call the common MarkDirty function, but again
that will be under the assumption that these flags won't change.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: document the need to analyze partitioned tables
Next
From: Dilip Kumar
Date:
Subject: Re: persist logical slots to disk during shutdown checkpoint