Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

From Amul Sul
Subject Re: [Patch] ALTER SYSTEM READ ONLY
Date
Msg-id CAAJ_b96Fia+XQHvkpDcLGvhJ5iK0dZZnPZFoi69REWb2w6ga+w@mail.gmail.com
Whole thread Raw
In response to Re: [Patch] ALTER SYSTEM READ ONLY  (Amul Sul <sulamul@gmail.com>)
Responses Re: [Patch] ALTER SYSTEM READ ONLY
List pgsql-hackers
On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > That's not good. On a typical busy system, a system is going to be in
> > > the middle of a checkpoint most of the time, and the checkpoint will
> > > take a long time to finish - maybe minutes.
> >
> > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > reduce the frequency of that cost...
> >
> >
> > > We want this feature to respond within milliseconds or a few seconds,
> > > not minutes. So we need something better here.
> >
> > Indeed.
> >
> >
> > > I'm inclined to think
> > > that we should try to CompleteWALProhibitChange() at the same places
> > > we AbsorbSyncRequests(). We know from experience that bad things
> > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > we always do that work in a timely fashion. So, if we do this work in
> > > the same place, then it will also be done in a timely fashion.
> >
> > Sounds sane, without having looked in detail.
> >
>
> Understood & agreed that we need to change the system state as soon as possible.
>
> I can see AbsorbSyncRequests() is called from 4 routing as
> CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> CheckpointerMain().  Out of 4, the first three executes with an interrupt is on
> hod which will cause a problem when we do emit barrier and wait for those
> barriers absorption by all the process including itself and will cause an
> infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(),
> do not wait on self to absorb barrier.  Let that get absorbed at a later point
> in time when the interrupt is resumed.  I assumed that we cannot do barrier
> processing right away since there could be other barriers (maybe in the future)
> including ours that should not process while the interrupt is on hold.
>

CreateCheckPoint() holds CheckpointLock LW at start and releases at the end
which puts interrupt on hold.  This kinda surprising that we were holding this
lock and putting interrupt on hots for a long time.  We do need that
CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do
something easy to ensure that instead of the lock? Probably holding off
interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions?

Regards,
Amul



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: Insert Documentation - Returning Clause and Order
Next
From: Fujii Masao
Date:
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit