Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [Patch] ALTER SYSTEM READ ONLY
Date
Msg-id CA+Tgmob56Pk1-5aTJdVPCWFHon7me4M96ENpGe9n_R4JUjjhZA@mail.gmail.com
Whole thread Raw
In response to Re: [Patch] ALTER SYSTEM READ ONLY  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [Patch] ALTER SYSTEM READ ONLY
List pgsql-hackers
On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't mind a loop, but that one looks broken. We have to clear the
> bit before we call the function that processes that type of barrier.
> Otherwise, if we succeed in absorbing the barrier but a new instance
> of the same barrier arrives meanwhile, we'll fail to realize that we
> need to absorb the new one.

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to just refuse the barrier until after the
backend exits the critical section of code will work out better.

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Recent failures on buildfarm member hornet
Next
From: vignesh C
Date:
Subject: Re: Parallel copy