Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

From Amul Sul
Subject Re: [Patch] ALTER SYSTEM READ ONLY
Date
Msg-id CAAJ_b95Xun35mpp7Ypn=gb9urcn5MDDhY8LDeDb8F2W-wJUE1w@mail.gmail.com
Whole thread Raw
In response to Re: [Patch] ALTER SYSTEM READ ONLY  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [Patch] ALTER SYSTEM READ ONLY
List pgsql-hackers
On Wed, Jan 20, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:
> > To move development, testing, and the review forward, I have commented out the
> > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> > added the changes for the checkpointer so that system read-write state change
> > request can be processed as soon as possible, as suggested by Robert[1].
>
> I am extremely doubtful about SetWALProhibitState()'s claim that "The
> final state can only be requested by the checkpointer or by the
> single-user so that there will be no chance that the server is already
> in the desired final state." It seems like there is an obvious race
> condition: CompleteWALProhibitChange() is called with a cur_state_gen
> argument which embeds the last state we saw, but there's nothing to
> keep it from changing between the time we saw it and the time that
> function calls SetWALProhibitState(), is there? We aren't holding any
> lock. It seems to me that SetWALProhibitState() needs to be rewritten
> to avoid this assumption.
>

It is not like that, let me explain. When a user backend requests to alter WAL
prohibit state by using ASRO/ASRW DDL with the previous patch or calling
pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
set to the transition state i.e. going-read-only or going-read-write if it is
not already.  If another backend trying to request the same alteration to the
wal prohibit state then nothing going to be changed in shared memory but that
backend needs to wait until the transition to the final wal prohibited state
completes.  If a backend tries to request for the opposite state than the
previous which is in progress then it will see an error as "system state
transition to read only/write is already in progress".  At a time only one
transition state can be set.

For the case where transition state changes to the complete states i.e.
read-only/read-write that can only be changed by the checkpointer or standalone
backend, there won't be any concurrency to change transition state to complete
state.

> On a related note, SetWALProhibitState() has only two callers. One
> passes is_final_state as true, and the other as false: it's never a
> variable. The two cases are handled mostly differently. This doesn't
> seem good. A lot of the logic in this function should probably be
> moved to the calling sites, especially because it's almost certainly
> wrong for this function to be basing what it does on the *current* WAL
> prohibit state rather than the WAL prohibit state that was in effect
> at the time we made the decision to call this function in the first
> place. As I mentioned in the previous paragraph, that's a built-in
> race condition. To put that another way, this function should NOT feel
> free to call GetWALProhibitStateGen().
>

Understood. I have removed SetWALProhibitState() and moved the respective code
to the caller in the attached version.

> I don't really see why we should have both an SQL callable function
> pg_alter_wal_prohibit_state() and also a DDL command for this. If
> we're going to go with a functional interface, and I guess the idea of
> that is to make it so GRANT EXECUTE works, then why not just get rid
> of the DDL?
>

Ok, dropped the patch of the DDL command. If in the future we want it back, I
can add that again.

Now, I am a little bit concerned about the current function name. How about
pg_set_wal_prohibit_state(bool) name or have two functions as
pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
other suggestions?

> RequestWALProhibitChange() doesn't look very nice. It seems like it's
> basically the second half of pg_alter_wal_prohibit_state(), not being
> called from anywhere else. It doesn't seem to add anything to separate
> it out like this; the interface between the two is not especially
> clean.
>

Ok, moved that code in pg_alter_wal_prohibit_state() in the attached version.

> It seems odd that ProcessWALProhibitStateChangeRequest() returns
> without doing anything if !AmCheckpointerProcess(), rather than having
> that be an Assert(). Why is it like that?
>

Like AbsorbSyncRequests().

> I think WALProhibitStateShmemInit() would probably look more similar
> to other functions if it did if (found) { stuff; } rather than if
> (!found) return; stuff; -- but I might be wrong about the existing
> precedent.
>

Ok, did the same in the attached version.

> The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
> seems confusingly-named, because we have other reasons for skipping a
> checkpoint that are not what we're talking about here. I think this is
> talking about whether we've performed a checkpoint after recovery, and
> the naming should reflect that. But I think there's something else
> wrong with the design, too: why is this protected by a spinlock? I
> have questions in both directions. On the one hand, I wonder why we
> need any kind of lock at all. On the other hand, if we do need a lock,
> I wonder why a spinlock that protects only the setting and clearing of
> the flag and nothing else is sufficient. There are zero comments
> explaining what the idea behind this locking regime is, and I can't
> understand why it should be correct.
>

Renamed those functions to SetRecoveryCheckpointSkippedFlag() and
RecoveryCheckpointIsSkipped() respectively and remove the lock which is not
needed. Updated comment for lastRecoveryCheckpointSkipped variable for the lock
requirement.

> In fact, I think this area needs a broader rethink. Like, the way you
> integrated that stuff into StartupXLog(), it sure looks to me like we
> might skip the checkpoint but still try to write other WAL records.
> Before we reach the offending segment of code, we call
> UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
> Both of those are going to potentially write WAL. I guess you could
> argue that's OK, on the grounds that neither function is necessarily
> going to log anything, but I don't think I believe that. If I make my
> server read only, take the OS down, change some GUCs, and then start
> it again, I don't expect it to PANIC.
>

If you think that there will be panic when UpdateFullPageWrites() and/or
XLogReportParameters() tries to write WAL since the shared memory state for WAL
prohibited is set then it is not like that.  For those functions, WAL write is
explicitly enabled by calling LocalSetXLogInsertAllowed().

I was under the impression that there won't be any problem if we allow the
writing WAL to UpdateFullPageWrites() and XLogReportParameters().  It can be
considered as an exception since it is fine that this WAL record is not streamed
to standby while graceful failover, I may be wrong though.

> Also, I doubt that it's OK to skip the checkpoint as this code does
> and then go ahead and execute recovery_end_command and update the
> control file anyway. It sure looks like the existing code is written
> with the assumption that the checkpoint happens before those other
> things.

Hmm, here we could go wrong. I need to look at this part carefully.

> One idea I just had was: suppose that, if the system is READ
> ONLY, we don't actually exit recovery right away, and the startup
> process doesn't exit. Instead we just sit there and wait for the
> system to be made read-write again before doing anything else. But
> then if hot_standby=false, there's no way for someone to execute a
> ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
> seems bad. So perhaps we need to let in regular connections *as if*
> the system were read-write while postponing not just the
> end-of-recovery checkpoint but also the other associated things like
> UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command,
> control file update, etc. until the end of recovery. Or maybe that's
> not the right idea either, but regardless of what we do here it needs
> clear comments justifying it. The current version of the patch does
> not have any.
>

Will get back to you on this.  Let me think more on this and the previous
point.

> I think that you've mis-positioned the check in autovacuum.c. Note
> that the comment right afterwards says: "a worker finished, or
> postmaster signaled failure to start a worker". Those are things we
> should still check for even when the system is R/O. What we don't want
> to do in that case is start new workers. I would suggest revising the
> comment that starts with "There are some conditions that..." to
> mention three conditions. The new one would be that the system is in a
> read-only state. I'd mention that first, making the existing ones #2
> and #3, and then add the code to "continue;" in that case right after
> that comment, before setting current_time.
>

Done.

> SendsSignalToCheckpointer() has multiple problems. As far as the name,
> it should at least be "Send" rather than "Sends" but the corresponding

"Sends" is unacceptable, it is a typo.

> functions elsewhere have names like SendPostmasterSignal() not
> SendSignalToPostmaster(). Also, why is it OK for it to use elog()
> rather than ereport()? Also, why is it an error if the checkpointer's
> not running, rather than just having the next checkpointer do it when
> it's relaunched?

Ok, now the function only returns true or false. It's up to the caller what to
do with that. In our case, the caller will issue a warning only. If you want
this could be a NOTICE as well.

> Also, why pass SIGINT as an argument if there's only
> one caller?

I thoughts, anybody can also reuse it to send some other signal to the
checkpointer process in the future.

> A related thing that's also odd is that sending SIGINT
> calls ReqCheckpointHandler() not anything specific to prohibiting WAL.
> That is probably OK because that function now just sets the latch. But
> then we could stop sending SIGINT to the checkpointer at all and just
> send SIGUSR1, which would also set the latch, without using up a
> signal. I wonder if we should make that change as a separate
> preparatory patch. It seems like that would clear things up; it would
> remove the oddity that this patch is invoking a handler called
> ReqCheckpointerHandler() with no intention of requesting a checkpoint,
> because ReqCheckpointerHandler() would be gone. That problem could
> also be fixed by renaming ReqCheckpointerHandler() to something
> clearer, but that seems inferior.
>

I am not clear on this part. In the attached version I am sending SIGUSR1
instead of SIGINT, which works for me.

> This is probably not a complete list of problems. Review from others
> would be appreciated.
>

Thanks a lot.

The attached version does not address all your comments, I'll continue my work
on that.

Regards,
Amul

Attachment

pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: Stronger safeguard for archive recovery not to miss data
Next
From: Matthias van de Meent
Date:
Subject: postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)