Re: [Patch] ALTER SYSTEM READ ONLY - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [Patch] ALTER SYSTEM READ ONLY
Msg-id CAA4eK1+5BDNS08XKXR7UPkq4tDaV66wyZia4kDMvRUm=a_A=Gg@mail.gmail.com
In response to [Patch] ALTER SYSTEM READ ONLY  (amul sul <sulamul@gmail.com>)
List pgsql-hackers
On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:
>
> Hi,
>
> The attached patch proposes the $Subject feature, which forces the system into read-only
> mode, where inserting write-ahead log records is prohibited until ALTER SYSTEM READ
> WRITE is executed.
>
> The high-level goal is to make the availability/scale-out situation better.  The feature
> will help HA setups where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting to write WAL at the end, in case
> of a network outage on the master or replication connection failures.
>
> For example, this feature allows for a controlled switchover without needing to shut
> down the master. You can instead make the master read-only, wait until the standby
> catches up, and then promote the standby. The master remains available for read
> queries throughout, and also for WAL streaming, but without the possibility of any
> new write transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind. (Eventually, it
> would be nice to be able to make the read-only master into a standby without having
> to restart it, but that is a problem for another patch.)
>
> This might also help in failover scenarios. For example, if you detect that the master
> has lost network connectivity to the standby, you might make it read-only after 30 s,
> and promote the standby after 60 s, so that you never have two writable masters at
> the same time. In this case, there's still some split-brain, but it's still better than what
> we have now.
>
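
To make the timing argument concrete, here is a minimal sketch of the 30 s / 60 s example. It is a toy model, not part of the patch: the function names are hypothetical, and it assumes both nodes start their clocks at the same moment of connectivity loss and that the master really does go read-only on schedule.

```c
#include <stdbool.h>

/* Thresholds taken from the example above, in seconds since connectivity loss. */
#define MASTER_READ_ONLY_AFTER_S 30
#define STANDBY_PROMOTE_AFTER_S  60

/* Hypothetical model: the master accepts writes until it is made read-only. */
static bool master_writable(int elapsed_s)
{
    return elapsed_s < MASTER_READ_ONLY_AFTER_S;
}

/* Hypothetical model: the standby accepts writes once it has been promoted. */
static bool standby_writable(int elapsed_s)
{
    return elapsed_s >= STANDBY_PROMOTE_AFTER_S;
}

/* Returns true if at no second in [0, horizon_s] both nodes are writable. */
static bool never_two_writers(int horizon_s)
{
    for (int t = 0; t <= horizon_s; t++)
        if (master_writable(t) && standby_writable(t))
            return false;
    return true;
}
```

The gap between the two thresholds is what absorbs clock skew and detection delay; if the nodes can disagree by more than that gap, the invariant no longer holds in this model.
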
> Design:
> ----------
> The proposed feature is built atop the super barrier mechanism commit[1] to coordinate
> global state changes across all active backends.  A backend that executes the
> ALTER SYSTEM READ { ONLY | WRITE } command places a request with the checkpointer
> process to change to the requested WAL read/write state, aka the WAL prohibited and WAL
> permitted states respectively.  When the checkpointer process sees the WAL prohibit
> state change request, it emits a global barrier and waits until all backends that
> participate in ProcSignal absorb it. Once that is done, the WAL read/write state in
> shared memory and the control file will be updated so that XLogInsertAllowed() returns
> accordingly.
>

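For reference, this is the sequence as I read it from the description, written as a single-threaded toy model. None of this is the patch's code: the types and function names are hypothetical, and the real mechanism is asynchronous (ProcSignal barriers), which the loops below only imitate.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the described handshake; not PostgreSQL code. */
typedef enum { WAL_PERMITTED, WAL_PROHIBITED } WalState;

typedef struct {
    bool     alive;       /* session is still connected */
    bool     has_xid;     /* open transaction has acquired an XID */
    WalState local_view;  /* state this backend has absorbed */
} Backend;

typedef struct {
    WalState shared_state;  /* stands in for shared memory */
    WalState control_file;  /* stands in for the persisted control file */
} Cluster;

/*
 * Checkpointer side of a state change request.
 * Returns the number of sessions that were terminated.
 */
static int change_wal_state(Cluster *c, Backend *backends, size_t n,
                            WalState target)
{
    int killed = 0;

    /* Step 1: sessions holding an XID are killed (modeled FATAL) first. */
    if (target == WAL_PROHIBITED) {
        for (size_t i = 0; i < n; i++) {
            if (backends[i].alive && backends[i].has_xid) {
                backends[i].alive = false;
                backends[i].has_xid = false;
                killed++;
            }
        }
    }

    /* Step 2: every surviving backend absorbs the barrier. */
    for (size_t i = 0; i < n; i++)
        if (backends[i].alive)
            backends[i].local_view = target;

    /* Step 3: only then are shared memory and the control file updated. */
    c->shared_state = target;
    c->control_file = target;
    return killed;
}

/* Models what XLogInsertAllowed() would report after the transition. */
static bool xlog_insert_allowed(const Cluster *c)
{
    return c->shared_state == WAL_PERMITTED;
}
```

Note that the model has no step where the checkpointer itself flushes dirty pages or writes a checkpoint record, which is exactly the part the description leaves open.
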
Do we also prohibit the checkpointer from writing dirty pages and a
checkpoint record?  If so, will the checkpointer process first write
the current dirty pages and a checkpoint record, or do we skip that
as well?

> If there are open transactions that have acquired an XID, the sessions are killed
> before the barrier is absorbed.
>

What about prepared transactions?

> They can't commit without writing WAL, and they
> can't abort without writing WAL, either, so we must at least abort the transaction. We
> don't necessarily need to kill the session, but it's hard to avoid in all cases because
> (1) if there are subtransactions active, we need to force the top-level abort record to
> be written immediately, but we can't really do that while keeping the subtransactions
> on the transaction stack, and (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.
>
> Open transactions that don't have an XID are not killed, but will get an ERROR if they
> try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
>

What if the vacuum is on an unlogged relation?  Do we allow writes via
vacuum to an unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section that will write
> WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
> AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
> that inserting WAL here must be OK, either because we have an XID (we would have
> been killed by a change to read-only if one had occurred) or for some other reason.
>
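
A minimal sketch of the coding rule as I understand it. The helpers below are hypothetical stand-ins, not the patch's CheckWALPermitted() or PostgreSQL's critical-section macros; the point is only the ordering. In PostgreSQL an ERROR raised inside a critical section is promoted to PANIC, so the permission check has to fail before the section is entered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for shared state and counters; not PostgreSQL code. */
static bool wal_prohibited = false;
static int  critical_section_depth = 0;
static int  wal_records_written = 0;

/* Stand-in for the check: returns false where the real one would raise ERROR. */
static bool check_wal_permitted(void)
{
    return !wal_prohibited;
}

static void start_crit_section(void) { critical_section_depth++; }
static void end_crit_section(void)   { critical_section_depth--; }

/* Stand-in for a WAL insert: must run inside a critical section and must
 * never see the prohibited state, because failing here would be a PANIC. */
static void xlog_insert(void)
{
    assert(critical_section_depth > 0);
    assert(!wal_prohibited);
    wal_records_written++;
}

/* The rule: check first, then enter the critical section and write WAL. */
static bool update_page_with_wal(void)
{
    if (!check_wal_permitted())
        return false;          /* error out while it is still safe to do so */

    start_crit_section();
    /* ... modify the buffer here ... */
    xlog_insert();
    end_crit_section();
    return true;
}
```

The assertion inside xlog_insert() plays the role I assume the AssertWALPermitted() variants play: it documents that the caller has already established that WAL is allowed.
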
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major transition, so we emit a
> log message for each successful execution of an ALTER SYSTEM READ {ONLY | WRITE}
> command. Also, we have added a new GUC system_is_read_only which returns "on"
> when the system is in WAL prohibited state or recovery.
>
> Another part of the patch that is quite uneasy and needs discussion is that when we
> shut down in the read-only state we skip the shutdown checkpoint, and at restart, first
> startup recovery will be performed and later the read-only state will be restored to
> prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> concern here is that if this startup recovery checkpoint wasn't ok, then it will never happen
> even if the system is later put back into read-write mode.
>

I am not able to understand this problem.  What do you mean by
"recovery checkpoint succeed or not"?  Do you add a try..catch and skip
any error while performing the recovery checkpoint?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


