Thread: [Patch] ALTER SYSTEM READ ONLY

[Patch] ALTER SYSTEM READ ONLY

From
amul sul
Date:
Hi,

Attached patch proposes $Subject feature which forces the system into read-only
mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
WRITE executed.

The high-level goal is to make the availability/scale-out situation better.  The feature
will help HA setup where the master server needs to stop accepting WAL writes
immediately and kick out any transaction expecting WAL writes at the end, in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without needing to shut
down the master. You can instead make the master read-only, wait until the standby
catches up, and then promote the standby. The master remains available for read
queries throughout, and also for WAL streaming, but without the possibility of any
new write transactions. After switchover is complete, the master can be shut down
and brought back up as a standby without needing to use pg_rewind. (Eventually, it
would be nice to be able to make the read-only master into a standby without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that the master
has lost network connectivity to the standby, you might make it read-only after 30 s,
and promote the standby after 60 s, so that you never have two writable masters at
the same time. In this case, there's still some split-brain, but it's still better than what
we have now.

Design:
----------
The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
global state changes to all active backends.  Backends which executed
ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
process to change the requested WAL read/write state aka WAL prohibited and WAL
permitted state respectively.  When the checkpointer process sees the WAL prohibit
state change request, it emits a global barrier and waits until all backends that
participate in the ProcSignal absorbs it. Once it has done the WAL read/write state in
share memory and control file will be updated so that XLogInsertAllowed() returns
accordingly.

If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed. They can't commit without writing WAL, and they
can't abort without writing WAL, either, so we must at least abort the transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all cases because
(1) if there are subtransactions active, we need to force the top-level abort record to
be written immediately, but we can't really do that while keeping the subtransactions
on the transaction stack, and (2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.

Open transactions that don't have an XID are not killed, but will get an ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
To make that happen, the patch adds a new coding rule: a critical section that will write
WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
that inserting WAL here must be OK, either because we have an XID (we would have
been killed by a change to read-only if one had occurred) or for some other reason.

The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the control
file, so that changes survive a restart.

The transition between read-write and read-only is a pretty major transition, so we emit
log message for each successful execution of a ALTER SYSTEM READ {ONLY | WRITE}
command. Also, we have added a new GUC system_is_read_only which returns "on"
when the system is in WAL prohibited state or recovery.

Another part of the patch that quite uneasy and need a discussion is that when the
shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
startup recovery will be performed and latter the read-only state will be restored to
prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
even if it's later put back into read-write mode. Thoughts?

Quick demo:
----------------
We have few active sessions, section 1 has performed some writes and stayed in the
idle state for some time, in between in session 2 where superuser successfully changed
system state in read-only via  ALTER SYSTEM READ ONLY command which kills
session 1.  Any other backend who is trying to run write transactions thereafter will see
a read-only system error.

------------- SESSION 1  -------------
session_1=# BEGIN;
BEGIN
session_1=*# CREATE TABLE foo AS SELECT i FROM generate_series(1,5) i;
SELECT 5

------------- SESSION 2  -------------
session_2=# ALTER SYSTEM READ ONLY;
ALTER SYSTEM

------------- SESSION 1  -------------
session_1=*# COMMIT;
FATAL:  system is now read only
HINT:  Cannot continue a transaction if it has performed writes while system is read only.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

------------- SESSION 3  -------------
session_3=# CREATE TABLE foo_bar (i int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

------------- SESSION 4  -------------
session_4=# CHECKPOINT;
ERROR:  system is now read only

System can put back to read-write mode by "ALTER SYSTEM READ WRITE" :

------------- SESSION 2  -------------
session_2=# ALTER SYSTEM READ WRITE;
ALTER SYSTEM

------------- SESSION 3  -------------
session_3=# CREATE TABLE foo_bar (i int);
CREATE TABLE

------------- SESSION 4  -------------
session_4=# CHECKPOINT;
CHECKPOINT


TODOs:
-----------
1. Documentation.

Attachments summary:
------------------------------
I tried to split the changes so that it can be easy to read and see the
incremental implementation.

0001: Patch by Robert, to add ability support error in global barrier absorption.
0002: Patch implement ALTER SYSTEM { READ | WRITE} syntax and psql tab
          completion support for it.
0003: A basic implementation where the system can accept $Subject command
         and change system to read-only by an emitting barrier.
0004: Patch does the enhancing where the backed execute $Subject command
          only and places a request to the checkpointer which is responsible to change
          the state by the emitting barrier. Also, store the state into the control file to
          make It persists across the server restarts.
0005: Patch tightens the check to prevent error in the critical section.
0006: Documentation - WIP

Credit:
-------
The feature is one of the part of Andres Frued's high-level design ideas for inbuilt
graceful failover for PostgreSQL. Feature implementation design by Robert Haas.
Initial patch by Amit Khandekar further works and improvement by me under Robert's
guidance includes this mail writeup as well.

Ref:
----
1] Global barrier commit # 16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b

Thank you !

Regards,
Amul Sul
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amit Kapila
Date:
On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:
>
> Hi,
>
> Attached patch proposes $Subject feature which forces the system into read-only
> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
> WRITE executed.
>
> The high-level goal is to make the availability/scale-out situation better.  The feature
> will help HA setup where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting WAL writes at the end, in case
> of network down on master or replication connections failures.
>
> For example, this feature allows for a controlled switchover without needing to shut
> down the master. You can instead make the master read-only, wait until the standby
> catches up, and then promote the standby. The master remains available for read
> queries throughout, and also for WAL streaming, but without the possibility of any
> new write transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind. (Eventually, it
> would be nice to be able to make the read-only master into a standby without having
> to restart it, but that is a problem for another patch.)
>
> This might also help in failover scenarios. For example, if you detect that the master
> has lost network connectivity to the standby, you might make it read-only after 30 s,
> and promote the standby after 60 s, so that you never have two writable masters at
> the same time. In this case, there's still some split-brain, but it's still better than what
> we have now.
>
> Design:
> ----------
> The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
> global state changes to all active backends.  Backends which executed
> ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> process to change the requested WAL read/write state aka WAL prohibited and WAL
> permitted state respectively.  When the checkpointer process sees the WAL prohibit
> state change request, it emits a global barrier and waits until all backends that
> participate in the ProcSignal absorbs it. Once it has done the WAL read/write state in
> share memory and control file will be updated so that XLogInsertAllowed() returns
> accordingly.
>

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well?  If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

> If there are open transactions that have acquired an XID, the sessions are killed
> before the barrier is absorbed.
>

What about prepared transactions?

> They can't commit without writing WAL, and they
> can't abort without writing WAL, either, so we must at least abort the transaction. We
> don't necessarily need to kill the session, but it's hard to avoid in all cases because
> (1) if there are subtransactions active, we need to force the top-level abort record to
> be written immediately, but we can't really do that while keeping the subtransactions
> on the transaction stack, and (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.
>
> Open transactions that don't have an XID are not killed, but will get an ERROR if they
> try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
>

What if vacuum is on an unlogged relation?  Do we allow writes via
vacuum to unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section that will write
> WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
> AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
> that inserting WAL here must be OK, either because we have an XID (we would have
> been killed by a change to read-only if one had occurred) or for some other reason.
>
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major transition, so we emit
> log message for each successful execution of a ALTER SYSTEM READ {ONLY | WRITE}
> command. Also, we have added a new GUC system_is_read_only which returns "on"
> when the system is in WAL prohibited state or recovery.
>
> Another part of the patch that quite uneasy and need a discussion is that when the
> shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> startup recovery will be performed and latter the read-only state will be restored to
> prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> even if it's later put back into read-write mode.
>

I am not able to understand this problem.  What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
tushar
Date:
On 6/16/20 7:25 PM, amul sul wrote:
> Attached patch proposes $Subject feature which forces the system into 
> read-only
> mode where insert write-ahead log will be prohibited until ALTER 
> SYSTEM READ
> WRITE executed.

Thanks Amul.

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter  system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be 
back to  normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

so that ALTER SYSTEM command syntax should be same for all.

postgres=# \h alter system
Command:     ALTER SYSTEM
Description: change a server configuration parameter
Syntax:
ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' | 
DEFAULT }

ALTER SYSTEM RESET configuration_parameter
ALTER SYSTEM RESET ALL

How we are going to justify this in help command of ALTER SYSTEM ?

2)When i connected to postgres in a single user mode , i was not able to 
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres


PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR:  checkpointer is not running

backend>

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well?  If so, will the checkpointer process
> writes the current dirty pages and writes a checkpoint record or we
> skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries.
But there's no reason for the checkpointer to do it: it shouldn't try
to checkpoint, and therefore it shouldn't write dirty pages either.
(I'm not sure if this is how the patch currently works; I'm describing
how I think it should work.)

> > If there are open transactions that have acquired an XID, the sessions are killed
> > before the barrier is absorbed.
>
> What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to
write either a commit record or an abort record. But a prepared
transaction doesn't have that problem. You can't COMMIT PREPARED or
ROLLBACK PREPARED while the system is read-only, as I suppose anybody
would expect, but their mere existence isn't a problem.

> What if vacuum is on an unlogged relation?  Do we allow writes via
> vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that
acquires an XID.

> > Another part of the patch that quite uneasy and need a discussion is that when the
> > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> > startup recovery will be performed and latter the read-only state will be restored to
> > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> > even if it's later put back into read-write mode.
>
> I am not able to understand this problem.  What do you mean by
> "recovery checkpoint succeed or not", do you add a try..catch and skip
> any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint
should be skipped, and then if the system is put back into read-write
mode later we should do it then. But I think right now the patch
performs the end-of-recovery checkpoint before restoring the read-only
state, which seems 100% wrong to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:
> 1) ALTER SYSTEM
>
> postgres=# alter system read only;
> ALTER SYSTEM
> postgres=# alter  system reset all;
> ALTER SYSTEM
> postgres=# create table t1(n int);
> ERROR:  cannot execute CREATE TABLE in a read-only transaction
>
> Initially i thought after firing 'Alter system reset all' , it will be
> back to  normal.
>
> can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I
think. It's a different kind of state change. It doesn't, and can't,
just edit postgresql.auto.conf.

> 2)When i connected to postgres in a single user mode , i was not able to
> set the system in read only
>
> [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres
>
> PostgreSQL stand-alone backend 14devel
> backend> alter system read only;
> ERROR:  checkpointer is not running
>
> backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:
>> Attached patch proposes $Subject feature which forces the system into read-only
>> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
>> WRITE executed.

> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well?

I think this is a really bad idea and should simply be rejected.

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page).  It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance.  It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems.  Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first.  Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on.  I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

            regards, tom lane



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Aside from the points you mention, such a switch would break autovacuum.
> It would break the ability for scans to do HOT-chain cleanup, which would
> likely lead to some odd behaviors (if, eg, somebody flips the switch
> between where that's supposed to happen and where an update needs to
> happen on the same page).  It would break the ability for indexscans to do
> killed-tuple marking, which is critical for performance in some scenarios.
> It would break the ability to set tuple hint bits, which is even more
> critical for performance.  It'd possibly break, or at least complicate,
> logic in index AMs to deal with index format updates --- I'm fairly sure
> there are places that will try to update out-of-date data structures
> rather than cope with the old structure, even in nominally read-only
> searches.

This seems like pretty dubious hand-waving. Of course, things that
write WAL are going to be broken by a switch that prevents writing
WAL; but if they were not, there would be no purpose in having such a
switch, so that's not really an argument. But you seem to have mixed
in some things that don't require writing WAL, and claimed without
evidence that those would somehow also be broken. I don't think that's
the case, but even if it were, so what? We live with all of these
restrictions on standbys anyway.

> I also think that putting such a thing into ALTER SYSTEM has got big
> logical problems.  Someday we will probably want to have ALTER SYSTEM
> write WAL so that standby servers can absorb the settings changes.
> But if writing WAL is disabled, how can you ever turn the thing off again?

I mean, the syntax that we use for a feature like this is arbitrary. I
picked this one, so I like it, but it can easily be changed if other
people want something else. The rest of this argument doesn't seem to
me to make very much sense. The existing ALTER SYSTEM functionality to
modify a text configuration file isn't replicated today and I'm not
sure why we should make it so, considering that replication generally
only considers things that are guaranteed to be the same on the master
and the standby, which this is not. But even if we did, that has
nothing to do with whether some functionality that changes the system
state without changing a text file ought to also be replicated. This
is a piece of cluster management functionality and it makes no sense
to replicate it. And no right-thinking person would ever propose to
change a feature that renders the system read-only in such a way that
it was impossible to deactivate it. That would be nuts.

> Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> involves just killing the primary server, not expecting that you can
> leisurely issue some commands to it first.

Yeah, that's exactly the problem I want to fix. If you kill the master
server, then you have interrupted service, even for read-only queries.
That sucks. Also, even if you don't care about interrupting service on
the master, it's actually sorta hard to guarantee a clean switchover.
The walsenders are supposed to send all the WAL from the master before
exiting, but if the connection is broken for some reason, then the
master is down and the standbys can't stream the rest of the WAL. You
can start it up again, but then you might generate more WAL. You can
try to copy the WAL around manually from one pg_wal directory to
another, but that's not a very nice thing for users to need to do
manually, and seems buggy and error-prone.

And how do you figure out where the WAL ends on the master and make
sure that the standby replayed it all? If the master is up, it's easy:
you just use the same queries you use all the time. If the master is
down, you have to use some different technique that involves manually
examining files or scrutinizing pg_controldata output. It's actually
very difficult to get this right.

> Commands that involve a whole
> bunch of subtle interlocking --- and, therefore, aren't going to work if
> anything has gone wrong already anywhere in the server --- seem like a
> particularly poor thing to be hanging your HA strategy on.

It's important not to conflate controlled switchover with failover.
When there's a failover, you have to accept some risk of data loss or
service interruption; but a controlled switchover does not need to
carry the same risks and there are plenty of systems out there where
it doesn't.

> I also wonder
> what this accomplishes that couldn't be done much more simply by killing
> the walsenders.

Killing the walsenders does nothing ... the clients immediately reconnect.

> In short, I see a huge amount of complexity here, an ongoing source of
> hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

I do think this is complex and the risk of bugs that are hard to
identify or hard to fix certainly needs to be considered. I
strenuously disagree with the idea that there is not very much real
usefulness. Getting failover set up in a way that actually works
robustly is, in my experience, one of the two or three most serious
challenges my employer's customers face today. The core server support
we provide for that is breathtakingly primitive, and it's urgent that
we do better. Cloud providers are moving users from PostgreSQL to
their own forks of PostgreSQL in vast numbers in large part because
users don't want to deal with this crap, and the cloud providers have
made it so they don't have to. People running PostgreSQL themselves
need complex third-party tools and even then the experience isn't as
good as what a major cloud provider would offer. This patch is not
going to fix that, but I think it's a step in the right direction, and
I hope others will agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> This seems like pretty dubious hand-waving. Of course, things that
> write WAL are going to be broken by a switch that prevents writing
> WAL; but if they were not, there would be no purpose in having such a
> switch, so that's not really an argument. But you seem to have mixed
> in some things that don't require writing WAL, and claimed without
> evidence that those would somehow also be broken.

Which of the things I mentioned don't require writing WAL?

You're right that these are the same things that we already forbid on a
standby, for the same reason, so maybe it won't be as hard to identify
them as I feared.  I wonder whether we should envision this as "demote
primary to standby" rather than an independent feature.

>> I also think that putting such a thing into ALTER SYSTEM has got big
>> logical problems.

> ... no right-thinking person would ever propose to
> change a feature that renders the system read-only in such a way that
> it was impossible to deactivate it. That would be nuts.

My point was that putting this in ALTER SYSTEM paints us into a corner
as to what we can do with ALTER SYSTEM in the future: we won't ever be
able to make that do anything that would require writing WAL.  And I
don't entirely believe your argument that that will never be something
we'd want to do.

            regards, tom lane



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Which of the things I mentioned don't require writing WAL?

Writing hint bits and marking index tuples as killed do not write WAL
unless checksums are enabled.

> You're right that these are the same things that we already forbid on a
> standby, for the same reason, so maybe it won't be as hard to identify
> them as I feared.  I wonder whether we should envision this as "demote
> primary to standby" rather than an independent feature.

See my comments on the nearby pg_demote thread. I think we want both.

> >> I also think that putting such a thing into ALTER SYSTEM has got big
> >> logical problems.
>
> > ... no right-thinking person would ever propose to
> > change a feature that renders the system read-only in such a way that
> > it was impossible to deactivate it. That would be nuts.
>
> My point was that putting this in ALTER SYSTEM paints us into a corner
> as to what we can do with ALTER SYSTEM in the future: we won't ever be
> able to make that do anything that would require writing WAL.  And I
> don't entirely believe your argument that that will never be something
> we'd want to do.

I think that depends a lot on how you view ALTER SYSTEM. I believe it
would be reasonable to view ALTER SYSTEM as a catch-all for commands
that make system-wide state changes, even if those changes are not all
of the same kind as each other; some might be machine-local, and
others cluster-wide; some WAL-logged, and others not. I don't think
it's smart to view ALTER SYSTEM through a lens that boxes it into only
editing postgresql.auto.conf; if that were so, we ought to have called
it ALTER CONFIGURATION FILE or something rather than ALTER SYSTEM. For
that reason, I do not see the choice of syntax as painting us into a
corner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Which of the things I mentioned don't require writing WAL?

> Writing hint bits and marking index tuples as killed do not write WAL
> unless checksums are enabled.

And your point is?  I thought enabling checksums was considered
good practice these days.

>> You're right that these are the same things that we already forbid on a
>> standby, for the same reason, so maybe it won't be as hard to identify
>> them as I feared.  I wonder whether we should envision this as "demote
>> primary to standby" rather than an independent feature.

> See my comments on the nearby pg_demote thread. I think we want both.

Well, if pg_demote can be done for X amount of effort, and largely
gets the job done, while this requires 10X or 100X the effort and
introduces 10X or 100X as many bugs, I'm not especially convinced
that we want both. 

            regards, tom lane



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Writing hint bits and marking index tuples as killed do not write WAL
> > unless checksums are enabled.
>
> And your point is?  I thought enabling checksums was considered
> good practice these days.

I don't want to have an argument about what typical or best practices
are; I wasn't trying to make any point about that one way or the
other. I'm just saying that the operations you listed don't
necessarily all write WAL. In an event, even if they did, the larger
point is that standbys work like that, too, so it's not unprecedented
or illogical to think of such things.

> >> You're right that these are the same things that we already forbid on a
> >> standby, for the same reason, so maybe it won't be as hard to identify
> >> them as I feared.  I wonder whether we should envision this as "demote
> >> primary to standby" rather than an independent feature.
>
> > See my comments on the nearby pg_demote thread. I think we want both.
>
> Well, if pg_demote can be done for X amount of effort, and largely
> gets the job done, while this requires 10X or 100X the effort and
> introduces 10X or 100X as many bugs, I'm not especially convinced
> that we want both.

Sure: if two features duplicate each other, and one of them is way
more work and way more buggy, then it's silly to have both, and we
should just accept the easy, bug-free one. However, as I said in the
other email to which I referred you, I currently believe that these
two features actually don't duplicate each other and that using them
both together would be quite beneficial. Also, even if they did, I
don't know where you are getting the idea that this feature will be
10X or 100X more work and more buggy than the other one. I have looked
at this code prior to it being posted, but I haven't looked at the
other code at all; I am guessing that you have looked at neither. I
would be happy if you did, because it is often the case that
architectural issues that escape other people are apparent to you upon
examination, and it's always nice to know about those earlier rather
than later so that one can decide to (a) give up or (b) fix them. But
I see no point in speculating in the abstract that such issues may
exist and that they may be more severe in one case than the other. My
own guess is that, properly implemented, they are within 2-3X of each
in one direction or the other, not 10-100X. It is almost unbelievable
to me that the pg_demote patch could be 100X simpler than this one; if
it were, I'd be practically certain it was a 5-minute hack job
unworthy of any serious consideration.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

On 2020-06-17 12:07:22 -0400, Robert Haas wrote:
> On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I also think that putting such a thing into ALTER SYSTEM has got big
> > logical problems.  Someday we will probably want to have ALTER SYSTEM
> > write WAL so that standby servers can absorb the settings changes.
> > But if writing WAL is disabled, how can you ever turn the thing off again?
> 
> I mean, the syntax that we use for a feature like this is arbitrary. I
> picked this one, so I like it, but it can easily be changed if other
> people want something else. The rest of this argument doesn't seem to
> me to make very much sense. The existing ALTER SYSTEM functionality to
> modify a text configuration file isn't replicated today and I'm not
> sure why we should make it so, considering that replication generally
> only considers things that are guaranteed to be the same on the master
> and the standby, which this is not. But even if we did, that has
> nothing to do with whether some functionality that changes the system
> state without changing a text file ought to also be replicated. This
> is a piece of cluster management functionality and it makes no sense
> to replicate it. And no right-thinking person would ever propose to
> change a feature that renders the system read-only in such a way that
> it was impossible to deactivate it. That would be nuts.

I agree that the concrete syntax here doesn't seem to matter much. If
this worked by actually putting a GUC into the config file, it would
perhaps matter a bit more, but it doesn't afaict.  It seems good to
avoid new top-level statements, and ALTER SYSTEM seems to fit well.


I wonder if there's an argument about wanting to be able to execute this
command over a physical replication connection? I think this feature
fairly obviously is a building block for "gracefully failover to this
standby", and it seems like it'd be nicer if that didn't potentially
require two pg_hba.conf entries for the to-be-promoted primary on the
current/old primary?


> > Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> > involves just killing the primary server, not expecting that you can
> > leisurely issue some commands to it first.
> 
> Yeah, that's exactly the problem I want to fix. If you kill the master
> server, then you have interrupted service, even for read-only queries.
> That sucks. Also, even if you don't care about interrupting service on
> the master, it's actually sorta hard to guarantee a clean switchover.
> The walsenders are supposed to send all the WAL from the master before
> exiting, but if the connection is broken for some reason, then the
> master is down and the standbys can't stream the rest of the WAL. You
> can start it up again, but then you might generate more WAL. You can
> try to copy the WAL around manually from one pg_wal directory to
> another, but that's not a very nice thing for users to need to do
> manually, and seems buggy and error-prone.

Also (I'm sure you're aware) if you just non-gracefully shut down the
old primary, you're going to have to rewind the old primary to be able
to use it as a standby. And if you non-gracefully stop you're gonna
incur checkpoint overhead, which is *massive* on non-toy
databases. There's a huge practical difference between a minor version
upgrade causing 10s of unavailability and causing 5min-30min.


> And how do you figure out where the WAL ends on the master and make
> sure that the standby replayed it all? If the master is up, it's easy:
> you just use the same queries you use all the time. If the master is
> down, you have to use some different technique that involves manually
> examining files or scrutinizing pg_controldata output. It's actually
> very difficult to get this right.

Yea, it's absurdly hard. I think it's really kind of ridiculous that we
expect others to get this right if we, the developers of this stuff,
can't really get it right because it's so complicated. Which imo makes
this:

> > Commands that involve a whole
> > bunch of subtle interlocking --- and, therefore, aren't going to work if
> > anything has gone wrong already anywhere in the server --- seem like a
> > particularly poor thing to be hanging your HA strategy on.

more of an argument for having this type of stuff builtin.


> It's important not to conflate controlled switchover with failover.
> When there's a failover, you have to accept some risk of data loss or
> service interruption; but a controlled switchover does not need to
> carry the same risks and there are plenty of systems out there where
> it doesn't.

Yup.

Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
amul sul
Date:
On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do we prohibit the checkpointer to write dirty pages and write a
> > checkpoint record as well?  If so, will the checkpointer process
> > writes the current dirty pages and writes a checkpoint record or we
> > skip that as well?
>
> I think the definition of this feature should be that you can't write
> WAL. So, it's OK to write dirty pages in general, for example to allow
> for buffer replacement so we can continue to run read-only queries.
> But there's no reason for the checkpointer to do it: it shouldn't try
> to checkpoint, and therefore it shouldn't write dirty pages either.
> (I'm not sure if this is how the patch currently works; I'm describing
> how I think it should work.)
>
You are correct -- writing dirty pages is not restricted.

> > > If there are open transactions that have acquired an XID, the sessions are killed
> > > before the barrier is absorbed.
> >
> > What about prepared transactions?
>
> They don't matter. The problem with a running transaction that has an
> XID is that somebody might end the session, and then we'd have to
> write either a commit record or an abort record. But a prepared
> transaction doesn't have that problem. You can't COMMIT PREPARED or
> ROLLBACK PREPARED while the system is read-only, as I suppose anybody
> would expect, but their mere existence isn't a problem.
>
> > What if vacuum is on an unlogged relation?  Do we allow writes via
> > vacuum to unlogged relation?
>
> Interesting question. I was thinking that we should probably teach the
> autovacuum launcher to stop launching workers while the system is in a
> READ ONLY state, but what about existing workers? Anything that
> generates invalidation messages, acquires an XID, or writes WAL has to
> be blocked in a read-only state; but I'm not sure to what extent the
> first two of those things would be a problem for vacuuming an unlogged
> table. I think you couldn't truncate it, at least, because that
> acquires an XID.
>
> > > Another part of the patch that quite uneasy and need a discussion is that when the
> > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> > > startup recovery will be performed and latter the read-only state will be restored to
> > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> > > even if it's later put back into read-write mode.
> >
> > I am not able to understand this problem.  What do you mean by
> > "recovery checkpoint succeed or not", do you add a try..catch and skip
> > any error while performing recovery checkpoint?
>
> What I think should happen is that the end-of-recovery checkpoint
> should be skipped, and then if the system is put back into read-write
> mode later we should do it then. But I think right now the patch
> performs the end-of-recovery checkpoint before restoring the read-only
> state, which seems 100% wrong to me.
>
Yeah, we need more thought on how to proceed further.  I am kind of agree that
the current behavior is not right with Robert since writing end-of-recovery
checkpoint violates the no-wal-write rule.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
amul sul
Date:
On Wed, Jun 17, 2020 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:
> > 1) ALTER SYSTEM
> >
> > postgres=# alter system read only;
> > ALTER SYSTEM
> > postgres=# alter  system reset all;
> > ALTER SYSTEM
> > postgres=# create table t1(n int);
> > ERROR:  cannot execute CREATE TABLE in a read-only transaction
> >
> > Initially i thought after firing 'Alter system reset all' , it will be
> > back to  normal.
> >
> > can't we have a syntax like - "Alter system set read_only='True' ; "
>
> No, this needs to be separate from the GUC-modification syntax, I
> think. It's a different kind of state change. It doesn't, and can't,
> just edit postgresql.auto.conf.
>
> > 2)When i connected to postgres in a single user mode , i was not able to
> > set the system in read only
> >
> > [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres
> >
> > PostgreSQL stand-alone backend 14devel
> > backend> alter system read only;
> > ERROR:  checkpointer is not running
> >
> > backend>
>
> Hmm, that's an interesting finding. I wonder what happens if you make
> the system read only, shut it down, and then restart it in single-user
> mode. Given what you see here, I bet you can't put it back into a
> read-write state from single user mode either, which seems like a
> problem. Either single-user mode should allow changing between R/O and
> R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
> ONLY and always allow writes anyway.
>
Ok, will try to enable changing between R/O and R/W in the next version.

Thanks Tushar for the testing.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amit Kapila
Date:
On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Do we prohibit the checkpointer to write dirty pages and write a
> > checkpoint record as well?  If so, will the checkpointer process
> > writes the current dirty pages and writes a checkpoint record or we
> > skip that as well?
>
> I think the definition of this feature should be that you can't write
> WAL. So, it's OK to write dirty pages in general, for example to allow
> for buffer replacement so we can continue to run read-only queries.
>

For buffer replacement, many-a-times we have to also perform
XLogFlush, what do we do for that?  We can't proceed without doing
that and erroring out from there means stopping read-only query from
the user perspective.

> But there's no reason for the checkpointer to do it: it shouldn't try
> to checkpoint, and therefore it shouldn't write dirty pages either.
>

What is the harm in doing the checkpoint before we put the system into
READ ONLY state?  The advantage is that we can at least reduce the
recovery time if we allow writing checkpoint record.

>
> > What if vacuum is on an unlogged relation?  Do we allow writes via
> > vacuum to unlogged relation?
>
> Interesting question. I was thinking that we should probably teach the
> autovacuum launcher to stop launching workers while the system is in a
> READ ONLY state, but what about existing workers? Anything that
> generates invalidation messages, acquires an XID, or writes WAL has to
> be blocked in a read-only state; but I'm not sure to what extent the
> first two of those things would be a problem for vacuuming an unlogged
> table. I think you couldn't truncate it, at least, because that
> acquires an XID.
>

If the truncate operation errors out, then won't the system will again
trigger a new autovacuum worker for the same relation as we update
stats at the end?  Also, in general for regular tables, if there is an
error while it tries to WAL, it could again trigger the autovacuum
worker for the same relation.  If this is true then unnecessarily it
will generate a lot of dirty pages and don't think it will be good for
the system to behave that way?

> > > Another part of the patch that quite uneasy and need a discussion is that when the
> > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> > > startup recovery will be performed and latter the read-only state will be restored to
> > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> > > even if it's later put back into read-write mode.
> >
> > I am not able to understand this problem.  What do you mean by
> > "recovery checkpoint succeed or not", do you add a try..catch and skip
> > any error while performing recovery checkpoint?
>
> What I think should happen is that the end-of-recovery checkpoint
> should be skipped, and then if the system is put back into read-write
> mode later we should do it then.
>

But then if we have to perform recovery again, it will start from the
previous checkpoint.  I think we have to live with it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 17 Jun 2020 12:07:22 -0400
Robert Haas <robertmhaas@gmail.com> wrote:
[...]

> > Commands that involve a whole
> > bunch of subtle interlocking --- and, therefore, aren't going to work if
> > anything has gone wrong already anywhere in the server --- seem like a
> > particularly poor thing to be hanging your HA strategy on.  
> 
> It's important not to conflate controlled switchover with failover.
> When there's a failover, you have to accept some risk of data loss or
> service interruption; but a controlled switchover does not need to
> carry the same risks and there are plenty of systems out there where
> it doesn't.

Yes. Maybe we should make sure the wording we are using is the same for
everyone. I already hear/read "failover", "controlled failover", "switchover" or
"controlled switchover", this is confusing. My definition of switchover is:

  swapping primary and secondary status between two replicating instances. With
  no data loss. This is a controlled procedure where all steps must succeed to
  complete.
  If a step fails, the procedure fail back to the original primary with no data
  loss.

However, Wikipedia has a broader definition, including situations where the
switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover

Regards,



Re: [Patch] ALTER SYSTEM READ ONLY

From
Simon Riggs
Date:
On Tue, 16 Jun 2020 at 14:56, amul sul <sulamul@gmail.com> wrote:
 
The high-level goal is to make the availability/scale-out situation better.  The feature
will help HA setup where the master server needs to stop accepting WAL writes
immediately and kick out any transaction expecting WAL writes at the end, in case
of network down on master or replication connections failures.

For example, this feature allows for a controlled switchover without needing to shut
down the master. You can instead make the master read-only, wait until the standby
catches up, and then promote the standby. The master remains available for read
queries throughout, and also for WAL streaming, but without the possibility of any
new write transactions. After switchover is complete, the master can be shut down
and brought back up as a standby without needing to use pg_rewind. (Eventually, it
would be nice to be able to make the read-only master into a standby without having
to restart it, but that is a problem for another patch.)

This might also help in failover scenarios. For example, if you detect that the master
has lost network connectivity to the standby, you might make it read-only after 30 s,
and promote the standby after 60 s, so that you never have two writable masters at
the same time. In this case, there's still some split-brain, but it's still better than what
we have now.
 
If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed.
 
inbuilt graceful failover for PostgreSQL

That doesn't appear to be very graceful. Perhaps objections could be assuaged by having a smoother transition and perhaps not even a full barrier, initially.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amit Kapila
Date:
On Wed, Jun 17, 2020 at 9:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Lastly, the arguments in favor seem pretty bogus.  HA switchover normally
> > involves just killing the primary server, not expecting that you can
> > leisurely issue some commands to it first.
>
> Yeah, that's exactly the problem I want to fix. If you kill the master
> server, then you have interrupted service, even for read-only queries.
>

Yeah, but if there is a synchronuos_standby (standby that provide sync
replication), user can always route the connections to it
(automatically if there is some middleware which can detect and route
the connection to standby)

> That sucks. Also, even if you don't care about interrupting service on
> the master, it's actually sorta hard to guarantee a clean switchover.
>

Fair enough.  However, it is not described in the initial email
(unless I have missed it; there is a mention that this patch is one
part of that bigger feature but no further explanation of that bigger
feature) how this feature will allow a clean switchover.  I think
before we put the system into READ ONLY state, there could be some WAL
which we haven't sent to standby, what we do we do for that.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
amul sul
Date:
On Thu, Jun 18, 2020 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Do we prohibit the checkpointer to write dirty pages and write a
> > > checkpoint record as well?  If so, will the checkpointer process
> > > writes the current dirty pages and writes a checkpoint record or we
> > > skip that as well?
> >
> > I think the definition of this feature should be that you can't write
> > WAL. So, it's OK to write dirty pages in general, for example to allow
> > for buffer replacement so we can continue to run read-only queries.
> >
>
> For buffer replacement, many-a-times we have to also perform
> XLogFlush, what do we do for that?  We can't proceed without doing
> that and erroring out from there means stopping read-only query from
> the user perspective.
>
Read-only does not restrict XLogFlush().

> > But there's no reason for the checkpointer to do it: it shouldn't try
> > to checkpoint, and therefore it shouldn't write dirty pages either.
> >
>
> What is the harm in doing the checkpoint before we put the system into
> READ ONLY state?  The advantage is that we can at least reduce the
> recovery time if we allow writing checkpoint record.
>
The checkpoint could take longer, intending to quickly switch to the read-only
state.

> >
> > > What if vacuum is on an unlogged relation?  Do we allow writes via
> > > vacuum to unlogged relation?
> >
> > Interesting question. I was thinking that we should probably teach the
> > autovacuum launcher to stop launching workers while the system is in a
> > READ ONLY state, but what about existing workers? Anything that
> > generates invalidation messages, acquires an XID, or writes WAL has to
> > be blocked in a read-only state; but I'm not sure to what extent the
> > first two of those things would be a problem for vacuuming an unlogged
> > table. I think you couldn't truncate it, at least, because that
> > acquires an XID.
> >
>
> If the truncate operation errors out, then won't the system will again
> trigger a new autovacuum worker for the same relation as we update
> stats at the end?  Also, in general for regular tables, if there is an
> error while it tries to WAL, it could again trigger the autovacuum
> worker for the same relation.  If this is true then unnecessarily it
> will generate a lot of dirty pages and don't think it will be good for
> the system to behave that way?
>
No new autovacuum worker will be forked in the read-only state and existing will
have an error if they try to write WAL after barrier absorption.

> > > > Another part of the patch that quite uneasy and need a discussion is that when the
> > > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> > > > startup recovery will be performed and latter the read-only state will be restored to
> > > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> > > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> > > > even if it's later put back into read-write mode.
> > >
> > > I am not able to understand this problem.  What do you mean by
> > > "recovery checkpoint succeed or not", do you add a try..catch and skip
> > > any error while performing recovery checkpoint?
> >
> > What I think should happen is that the end-of-recovery checkpoint
> > should be skipped, and then if the system is put back into read-write
> > mode later we should do it then.
> >
>
> But then if we have to perform recovery again, it will start from the
> previous checkpoint.  I think we have to live with it.
>
Let me explain the case, if we do skip the end-of-recovery checkpoint while
starting the system in read-only mode and then later changing the state to
read-write and do a few write operations and online checkpoints, that will be
fine? I am yet to explore those things.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> For buffer replacement, many-a-times we have to also perform
> XLogFlush, what do we do for that?  We can't proceed without doing
> that and erroring out from there means stopping read-only query from
> the user perspective.

I think we should stop WAL writes, then XLogFlush() once, then declare
the system R/O. After that there might be more XLogFlush() calls but
there won't be any new WAL, so they won't do anything.

> > But there's no reason for the checkpointer to do it: it shouldn't try
> > to checkpoint, and therefore it shouldn't write dirty pages either.
>
> What is the harm in doing the checkpoint before we put the system into
> READ ONLY state?  The advantage is that we can at least reduce the
> recovery time if we allow writing checkpoint record.

Well, as Andres says in
http://postgr.es/m/20200617180546.yucxtiupvxghxss6@alap3.anarazel.de
it can take a really long time.

> > Interesting question. I was thinking that we should probably teach the
> > autovacuum launcher to stop launching workers while the system is in a
> > READ ONLY state, but what about existing workers? Anything that
> > generates invalidation messages, acquires an XID, or writes WAL has to
> > be blocked in a read-only state; but I'm not sure to what extent the
> > first two of those things would be a problem for vacuuming an unlogged
> > table. I think you couldn't truncate it, at least, because that
> > acquires an XID.
> >
>
> If the truncate operation errors out, then won't the system will again
> trigger a new autovacuum worker for the same relation as we update
> stats at the end?

Not if we do what I said in that paragraph. If we're not launching new
workers we can't again trigger a worker for the same relation.

> Also, in general for regular tables, if there is an
> error while it tries to WAL, it could again trigger the autovacuum
> worker for the same relation.  If this is true then unnecessarily it
> will generate a lot of dirty pages and don't think it will be good for
> the system to behave that way?

I don't see how this would happen. VACUUM can't really dirty pages
without writing WAL, can it? And, anyway, if there's an error, we're
not going to try again for the same relation unless we launch new
workers.

> > What I think should happen is that the end-of-recovery checkpoint
> > should be skipped, and then if the system is put back into read-write
> > mode later we should do it then.
>
> But then if we have to perform recovery again, it will start from the
> previous checkpoint.  I think we have to live with it.

Yeah. I don't think it's that bad. The case where you shut down the
system while it's read-only should be a somewhat unusual one. Normally
you would mark it read only and then promote a standby and shut the
old master down (or demote it). But what you want is that if it does
happen to go down for some reason before all the WAL is streamed, you
can bring it back up and finish streaming the WAL without generating
any new WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 7:19 AM amul sul <sulamul@gmail.com> wrote:
> Let me explain the case, if we do skip the end-of-recovery checkpoint while
> starting the system in read-only mode and then later changing the state to
> read-write and do a few write operations and online checkpoints, that will be
> fine? I am yet to explore those things.

I think we'd want the FIRST write operation to be the end-of-recovery
checkpoint, before the system is fully read-write. And then after that
completes you could do other things.

It would be good if we can get an opinion from Andres about this,
since I think he has thought about this stuff quite a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 6:39 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> That doesn't appear to be very graceful. Perhaps objections could be assuaged by having a smoother transition and
perhapsnot even a full barrier, initially.
 

Yeah, it's not ideal, though still better than what we have now. What
do you mean by "a smoother transition and perhaps not even a full
barrier"? I think if you want to switch the primary to another machine
and make the old primary into a standby, you really need to arrest WAL
writes completely. It would be better to make existing write
transactions ERROR rather than FATAL, but there are some very
difficult cases there, so I would like to leave that as a possible
later improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Jehan-Guillaume de Rorthais
Date:
On Thu, 18 Jun 2020 10:52:49 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

[...]
> But what you want is that if it does happen to go down for some reason before
> all the WAL is streamed, you can bring it back up and finish streaming the
> WAL without generating any new WAL.

Thanks to cascading replication, it could be very possible without this READ
ONLY mode, just in recovery mode, isn't it?

Regards,



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 11:08 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> Thanks to cascading replication, it could be very possible without this READ
> ONLY mode, just in recovery mode, isn't it?

Yeah, perhaps. I just wrote an email about that over on the demote
thread, so I won't repeat it here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
amul sul
Date:
On Thu, Jun 18, 2020 at 8:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > For buffer replacement, many-a-times we have to also perform
> > XLogFlush, what do we do for that?  We can't proceed without doing
> > that and erroring out from there means stopping read-only query from
> > the user perspective.
>
> I think we should stop WAL writes, then XLogFlush() once, then declare
> the system R/O. After that there might be more XLogFlush() calls but
> there won't be any new WAL, so they won't do anything.
>
Yeah, the proposed v1 patch does the same.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Hi All,

Attaching a new set of patches rebased atop the latest master head and includes
the following changes:

1. Enabling ALTER SYSTEM READ { ONLY | WRITE } support for the single-user,
discussed here [1]

2. Now skipping the startup checkpoint if the system is read-only mode, as
discussed [2].

3. While changing the system state to READ-WRITE, a new checkpoint request will
be made.

All these changes are part of the v2-0004 patch and the rest of the patches will
be the same as the v1.

Regards,
Amul

1] https://postgr.es/m/CAAJ_b96WPPt-=vyjpPUy8pG0vAvLgpjLukCZONUkvdR1_exrKA@mail.gmail.com
2] https://postgr.es/m/CAAJ_b95hddJrgciCfri2NkTLdEUSz6zdMSjoDuWPFPBFvJy+Kg@mail.gmail.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
tushar
Date:
On 6/22/20 11:59 AM, Amul Sul wrote:
> 2. Now skipping the startup checkpoint if the system is read-only mode, as
> discussed [2].

I am not able to perform pg_checksums o/p after shutting down my server 
in read only  mode .

Steps -

1.initdb (./initdb -k -D data)
2.start the server(./pg_ctl -D data start)
3.connect to psql (./psql postgres)
4.Fire query (alter system read only;)
5.shutdown the server(./pg_ctl -D data stop)
6.pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when server is not in read only)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned:  916
Blocks scanned: 2976
Bad checksums:  0
Data checksum version: 1

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Jun 24, 2020 at 1:54 PM tushar <tushar.ahuja@enterprisedb.com> wrote:
>
> On 6/22/20 11:59 AM, Amul Sul wrote:
> > 2. Now skipping the startup checkpoint if the system is read-only mode, as
> > discussed [2].
>
> I am not able to perform pg_checksums o/p after shutting down my server
> in read only  mode .
>
> Steps -
>
> 1.initdb (./initdb -k -D data)
> 2.start the server(./pg_ctl -D data start)
> 3.connect to psql (./psql postgres)
> 4.Fire query (alter system read only;)
> 5.shutdown the server(./pg_ctl -D data stop)
> 6.pg_checksums
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> pg_checksums: error: cluster must be shut down
> [edb@tushar-ldap-docker bin]$
>
> Result - (when server is not in read only)
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> Checksum operation completed
> Files scanned:  916
> Blocks scanned: 2976
> Bad checksums:  0
> Data checksum version: 1
>
I think that's expected since the server isn't clean shutdown, similar error can
be seen with any server which has been shutdown in immediate mode
(pg_clt -D data_dir -m i).

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Michael Banck
Date:
Hi,

On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote:
> On 6/22/20 11:59 AM, Amul Sul wrote:
> > 2. Now skipping the startup checkpoint if the system is read-only mode, as
> > discussed [2].
> 
> I am not able to perform pg_checksums o/p after shutting down my server in
> read only  mode .
> 
> Steps -
> 
> 1.initdb (./initdb -k -D data)
> 2.start the server(./pg_ctl -D data start)
> 3.connect to psql (./psql postgres)
> 4.Fire query (alter system read only;)
> 5.shutdown the server(./pg_ctl -D data stop)
> 6.pg_checksums
> 
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> pg_checksums: error: cluster must be shut down
> [edb@tushar-ldap-docker bin]$

What's the 'Database cluster state' from pg_controldata at this point?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: [Patch] ALTER SYSTEM READ ONLY

From
Michael Paquier
Date:
On Fri, Jun 26, 2020 at 10:11:41AM +0530, Amul Sul wrote:
> I think that's expected since the server isn't clean shutdown, similar error can
> be seen with any server which has been shutdown in immediate mode
> (pg_clt -D data_dir -m i).

Any operation working on on-disk relation blocks needs to have a
consistent state, and a clean shutdown gives this guarantee thanks to
the shutdown checkpoint (see also pg_rewind).  There are two states in
the control file, shutdown for a primary and shutdown while in
recovery to cover that.  So if you stop the server cleanly but fail to
see a proper state with pg_checksums, it seems to me that the proposed
patch does not handle correctly the state of the cluster in the
control file at shutdown.  That's not good.
--
Michael

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Jun 26, 2020 at 12:15 PM Michael Banck
<michael.banck@credativ.de> wrote:
>
> Hi,
>
> On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote:
> > On 6/22/20 11:59 AM, Amul Sul wrote:
> > > 2. Now skipping the startup checkpoint if the system is read-only mode, as
> > > discussed [2].
> >
> > I am not able to perform pg_checksums o/p after shutting down my server in
> > read only  mode .
> >
> > Steps -
> >
> > 1.initdb (./initdb -k -D data)
> > 2.start the server(./pg_ctl -D data start)
> > 3.connect to psql (./psql postgres)
> > 4.Fire query (alter system read only;)
> > 5.shutdown the server(./pg_ctl -D data stop)
> > 6.pg_checksums
> >
> > [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> > pg_checksums: error: cluster must be shut down
> > [edb@tushar-ldap-docker bin]$
>
> What's the 'Database cluster state' from pg_controldata at this point?
>
"in production"

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Fri, Jun 26, 2020 at 5:59 AM Michael Paquier <michael@paquier.xyz> wrote:
> Any operation working on on-disk relation blocks needs to have a
> consistent state, and a clean shutdown gives this guarantee thanks to
> the shutdown checkpoint (see also pg_rewind).  There are two states in
> the control file, shutdown for a primary and shutdown while in
> recovery to cover that.  So if you stop the server cleanly but fail to
> see a proper state with pg_checksums, it seems to me that the proposed
> patch does not handle correctly the state of the cluster in the
> control file at shutdown.  That's not good.

I think it is actually very good. If a feature that supposedly
prevents writing WAL permitted a shutdown checkpoint to be written, it
would be failing to accomplish its design goal. There is not much of a
use case for a feature that stops WAL from being written except when
it doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Prabhat Sahu
Date:
Hi All,
I was testing the feature on top of v3 patch and found the "pg_upgrade" failure after keeping "alter system read only;" as below:

-- Steps:
./initdb -D data
./pg_ctl -D data -l logs start -c
./psql postgres
alter system read only;
\q
./pg_ctl -D data -l logs stop -c

./initdb -D data2
./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520


[edb@localhost bin]$ ./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The source cluster was not shut down cleanly.
Failure, exiting

--Below is the logs
2021-07-16 11:04:20.305 IST [105788] LOG:  starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-07-16 11:04:20.309 IST [105788] LOG:  listening on IPv6 address "::1", port 5432
2020-07-16 11:04:20.309 IST [105788] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2020-07-16 11:04:20.321 IST [105788] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-07-16 11:04:20.347 IST [105789] LOG:  database system was shut down at 2020-07-16 11:04:20 IST
2020-07-16 11:04:20.352 IST [105788] LOG:  database system is ready to accept connections
2020-07-16 11:04:20.534 IST [105790] LOG:  system is now read only
2020-07-16 11:04:20.542 IST [105788] LOG:  received fast shutdown request
2020-07-16 11:04:20.543 IST [105788] LOG:  aborting any active transactions
2020-07-16 11:04:20.544 IST [105788] LOG:  background worker "logical replication launcher" (PID 105795) exited with exit code 1
2020-07-16 11:04:20.544 IST [105790] LOG:  shutting down
2020-07-16 11:04:20.544 IST [105790] LOG:  skipping shutdown checkpoint because the system is read only
2020-07-16 11:04:20.551 IST [105788] LOG:  database system is shut down


On Tue, Jul 14, 2020 at 12:08 PM Amul Sul <sulamul@gmail.com> wrote:
Attached is a rebased version for the latest master head[1].

Regards,
Amul

1] Commit # 101f903e51f52bf595cd8177d2e0bc6fe9000762


--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jul 16, 2020 at 2:12 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
Hi All,
I was testing the feature on top of v3 patch and found the "pg_upgrade" failure after keeping "alter system read only;" as below:

That's expected. You can't perform a clean shutdown without writing WAL. 

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may succeed the system being put into a read-only state), which
does make it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Regards,

Soumyadeep (VMware)



Re: [Patch] ALTER SYSTEM READ ONLY

From
SATYANARAYANA NARLAPURAM
Date:
+1 to this feature and I have been thinking about it for sometime. There are several use cases with marking database read only (no transaction log generation). Some of the examples in a hosted service scenario are 1/ when customer runs out of storage space, 2/ Upgrading the server to a different major version (current server can be set to read only, new one can be built and then switch DNS), 3/ If user wants to force a database to read only and not accept writes, may be for import / export a database.

Thanks,
Satya

On Wed, Jul 22, 2020 at 3:04 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
Hello,

I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.

The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may succeed the system being put into a read-only state), which
does make it confusing to a degree.

Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?

So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:

ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;

If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:

Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.

Why not have the best of both worlds? Consider:

ALTER SYSTEM SET read_only to {off, on, wal};

-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off

Of course, there can probably be better syntax for the above.

Regards,

Soumyadeep (VMware)


Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
Hi Amul,

On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote:
> The proposed feature is built atop of super barrier mechanism commit[1] to
> coordinate
> global state changes to all active backends.  Backends which executed
> ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> process to change the requested WAL read/write state aka WAL prohibited and
> WAL
> permitted state respectively.  When the checkpointer process sees the WAL
> prohibit
> state change request, it emits a global barrier and waits until all
> backends that
> participate in the ProcSignal absorbs it.

Why should the checkpointer have the responsibility of setting the state
of the system to read-only? Maybe this should be the postmaster's
responsibility - the checkpointer should just handle requests to
checkpoint. I think the backend requesting the read-only transition
should signal the postmaster, which in turn, will take on the aforesaid
responsibilities. The postmaster, could also additionally request a
checkpoint, using RequestCheckpoint() (if we want to support the
read-onlyness discussed in [1]). checkpointer.c should not be touched by
this feature.

Following on, any condition variable used by the backend to wait for the
ALTER SYSTEM command to finish (the patch uses
CheckpointerShmem->readonly_cv), could be housed in ProcGlobal.

Regards,
Soumyadeep (VMware)

[1] https://www.postgresql.org/message-id/CAE-ML%2B-zdWODAyWNs_Eu-siPxp_3PGbPkiSg%3DtoLeW9iS_eioA%40mail.gmail.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Jul 23, 2020 at 3:33 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
>
> Hello,
>
> I think we should really term this feature, as it stands, as a means to
> solely stop WAL writes from happening.
>

True.

> The feature doesn't truly make the system read-only (e.g. dirty buffer
> flushes may succeed the system being put into a read-only state), which
> does make it confusing to a degree.
>
> Ideally, if we were to have a read-only system, we should be able to run
> pg_checksums on it, or take file-system snapshots etc, without the need
> to shut down the cluster. It would also enable an interesting use case:
> we should also be able to do a live upgrade on any running cluster and
> entertain read-only queries at the same time, given that all the
> cluster's files will be immutable?
>

Read-only is for the queries.

The aim of this feature is preventing new WAL records from being generated, not
preventing them from being flushed to disk, or streamed to standbys, or anything
else. The rest should happen as normal.

If you can't flush WAL, then you might not be able to evict some number of
buffers, which in the worst case could be large. That's because you can't evict
a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise,
you wouldn't be following the WAL-before-data rule). And having a potentially
large number of unevictable buffers around sounds terrible, not only for
performance, but also for having the system keep working at all.

> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;
>
> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.
>

The intention is to change the system to read-only ASAP; the checkpoint will
make it much slower.

I don't think we can skip control file updates that need to make read-only
state persistent across the restart.

> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.
>

Sure, thanks for the suggestions. Syntax change is not a harder part; we can
choose the better one later.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Jul 23, 2020 at 4:34 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> +1 to this feature and I have been thinking about it for sometime. There are several use cases with marking database
readonly (no transaction log generation). Some of the examples in a hosted service scenario are 1/ when customer runs
outof storage space, 2/ Upgrading the server to a different major version (current server can be set to read only, new
onecan be built and then switch DNS), 3/ If user wants to force a database to read only and not accept writes, may be
forimport / export a database. 
>
Thanks for voting & listing the realistic use cases.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Jul 23, 2020 at 6:08 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
>
> Hi Amul,
>

Thanks, Soumyadeep for looking and putting your thoughts on the patch.

> On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote:
> > The proposed feature is built atop of super barrier mechanism commit[1] to
> > coordinate
> > global state changes to all active backends.  Backends which executed
> > ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> > process to change the requested WAL read/write state aka WAL prohibited and
> > WAL
> > permitted state respectively.  When the checkpointer process sees the WAL
> > prohibit
> > state change request, it emits a global barrier and waits until all
> > backends that
> > participate in the ProcSignal absorbs it.
>
> Why should the checkpointer have the responsibility of setting the state
> of the system to read-only? Maybe this should be the postmaster's
> responsibility - the checkpointer should just handle requests to
> checkpoint.

Well, once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. Maybe we would have put some
background process, but choose the checkpointer in charge of making the state
change and to avoid the new background process to keep the first version patch
simple.  The checkpointer isn't likely to get killed, but if it does, it will
be relaunched and the new one can clean things up. On the other hand, I agree
making the checkpointer responsible for more than one thing might not
be a good idea
but I don't think the postmaster should do the work that any
background process can
do.

>I think the backend requesting the read-only transition
> should signal the postmaster, which in turn, will take on the aforesaid
> responsibilities. The postmaster, could also additionally request a
> checkpoint, using RequestCheckpoint() (if we want to support the
> read-onlyness discussed in [1]). checkpointer.c should not be touched by
> this feature.
>
> Following on, any condition variable used by the backend to wait for the
> ALTER SYSTEM command to finish (the patch uses
> CheckpointerShmem->readonly_cv), could be housed in ProcGlobal.
>

Relevant only if we don't want to use the checkpointer process.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Thu, Jul 23, 2020 at 3:42 AM Amul Sul <sulamul@gmail.com> wrote:
> The aim of this feature is preventing new WAL records from being generated, not
> preventing them from being flushed to disk, or streamed to standbys, or anything
> else. The rest should happen as normal.
>
> If you can't flush WAL, then you might not be able to evict some number of
> buffers, which in the worst case could be large. That's because you can't evict
> a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise,
> you wouldn't be following the WAL-before-data rule). And having a potentially
> large number of unevictable buffers around sounds terrible, not only for
> performance, but also for having the system keep working at all.

In the read-only level I was suggesting, I wasn't suggesting that we
stop WAL flushes, in fact we should flush the WAL before we mark the
system as read-only. Once the system declares itself as read-only, it
will not perform any more on-disk changes; It may perform all the
flushes it needs as a part of the read-only request handling.

WAL should still stream to the secondary of course, even after you mark
the primary as read-only.

> Read-only is for the queries.

What I am saying is it doesn't have to be just the queries. I think we
can cater to all the other use cases simply by forcing a checkpoint
before marking the system as read-only.

> The intention is to change the system to read-only ASAP; the checkpoint will
> make it much slower.

I agree - if one needs that speed, then they can do the equivalent of:
ALTER SYSTEM SET read_only to 'wal';
and the expensive checkpoint you mentioned can be avoided.

> I don't think we can skip control file updates that need to make read-only
> state persistent across the restart.

I was referring to control file updates post the read-only state change.
Any updates done as a part of the state change is totally cool.


Regards,
Soumyadeep (VMware)



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Thu, Jul 23, 2020 at 3:57 AM Amul Sul <sulamul@gmail.com> wrote:

> Well, once we've initiated the change to a read-only state, we probably want to
> always either finish that change or go back to read-write, even if the process
> that initiated the change is interrupted. Leaving the system in a
> half-way-in-between state long term seems bad. Maybe we would have put some
> background process, but choose the checkpointer in charge of making the state
> change and to avoid the new background process to keep the first version patch
> simple.  The checkpointer isn't likely to get killed, but if it does, it will
> be relaunched and the new one can clean things up. On the other hand, I agree
> making the checkpointer responsible for more than one thing might not
> be a good idea
> but I don't think the postmaster should do the work that any
> background process can
> do.

+1 for doing it in a background process rather than in the backend
itself (as we can't risk doing it in a backend as it can crash and won't
restart and clean up as a background process would).

As my co-worker pointed out to me, doing the work in the postmaster is a
very bad idea as we don't want delays in serving connection requests on
account of the barrier that comes with this patch.

I would like to see this responsibility in a separate auxiliary process
but I guess having it in the checkpointer isn't the end of the world.

Regards,
Soumyadeep (VMware)



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think we'd want the FIRST write operation to be the end-of-recovery
> checkpoint, before the system is fully read-write. And then after that
> completes you could do other things.

I can't see why this is necessary from a correctness or performance
point of view. Maybe I'm missing something.

In case it is necessary, the patch set does not wait for the checkpoint to
complete before marking the system as read-write. Refer:

/* Set final state by clearing in-progress flag bit */
if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
{
  if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
    ereport(LOG, (errmsg("system is now read only")));
  else
  {
    /* Request checkpoint */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE);
    ereport(LOG, (errmsg("system is now read write")));
  }
}

We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
we SetWALProhibitState() and do the ereport(), if we have a read-write
state change request.

Also, we currently request this checkpoint even if there was no startup
recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
the read-write request does follow a startup recovery.
So it should really be:
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
CHECKPOINT_END_OF_RECOVERY);
We would need to convey that an end-of-recovery-checkpoint is pending in
shmem somehow (and only if one such checkpoint is pending, should we do
it as a part of the read-write request handling).
Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
/*
 * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
 */
and then check for that.

Some minor comments about the code (some of them probably doesn't
warrant immediate attention, but for the record...):

1. There are some places where we can use a local variable to store the
result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
brin_doupdate()

2. Similarly, we can also capture the calls to GetWALProhibitState() in
a local variable where applicable. E.g. inside WALProhibitRequest().

3. Some of the functions that were added such as GetWALProhibitState(),
IsWALProhibited() etc could be declared static inline.

4. IsWALProhibited(): Shouldn't it really be:
bool
IsWALProhibited(void)
{
  uint32 walProhibitState = GetWALProhibitState();
  return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
    && (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}

5. I think the comments:
/* Must be performing an INSERT or UPDATE, so we'll have an XID */
and
/* Can reach here from VACUUM, so need not have an XID */
can be internalized in the function/macro comment header.

6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
advances */
We need to update the comment here.

Regards,
Soumyadeep (VMware)



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

> From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 27 Mar 2020 05:05:38 -0400
> Subject: [PATCH v3 2/6] Add alter system read only/write syntax
> 
> Note that syntax doesn't have any implementation.
> ---
>  src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
>  src/backend/nodes/equalfuncs.c   |  9 +++++++++
>  src/backend/parser/gram.y        | 13 +++++++++++++
>  src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
>  src/bin/psql/tab-complete.c      |  6 ++++--
>  src/include/nodes/nodes.h        |  1 +
>  src/include/nodes/parsenodes.h   | 10 ++++++++++
>  src/tools/pgindent/typedefs.list |  1 +
>  8 files changed, 70 insertions(+), 2 deletions(-)

Shouldn't there be at outfuncs support as well? Perhaps we even need
readfuncs, not immediately sure.


> From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.
> 
> Implementation:
> 
>  1. When a user tried to change server state to WAL-Prohibited using
>     ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will emit
>     PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
>     barrier has been absorbed by all the backends.
> 
>  2. When a backend receives the WAL-Prohibited barrier, at that moment if
>     it is already in a transaction and the transaction already assigned XID,
>     then the backend will be killed by throwing FATAL(XXX: need more discussion
>     on this)

I think we should consider introducing XACTFATAL or such, guaranteeing
the transaction gets aborted, without requiring a FATAL. This has been
needed for enough cases that it's worthwhile.


There are several cases where we WAL log without having an xid
assigned. E.g. when HOT pruning during syscache lookups or such. Are
there any cases where the check for being in recovery is followed by a
CHECK_FOR_INTERRUPTS, before the WAL logging is done?


>  3. Otherwise, if that backend running transaction which yet to get XID
>     assigned we don't need to do anything special, simply call
>     ResetLocalXLogInsertAllowed() so that any future WAL insert in will check
>     XLogInsertAllowed() first which set ready only state appropriately.
> 
>  4. A new transaction (from existing or new backend) starts as a read-only
>     transaction.

Why do we need 4)? And doesn't that have the potential to be
unnecessarily problematic if a the server is subsequently brought out of
the readonly state again?


>  5. Auxiliary processes like autovacuum launcher, background writer,
>     checkpointer and  walwriter will don't do anything in WAL-Prohibited
>     server state until someone wakes us up. E.g. a backend might later on
>     request us to put the system back to read-write.

Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
anything in this state. bgwriter for example is even running entirely
normally in a hot standby node?


>  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
>     and xlog rotation. Starting up again will perform crash recovery(XXX:
>     need some discussion on this as well)
> 
>  7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
> 
>  8. Only super user can toggle WAL-Prohibit state.
> 
>  9. Add system_is_read_only GUC show the system state -- will true when system
>     is wal prohibited or in recovery.



> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> +    if (!superuser())
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                 errmsg("must be superuser to execute ALTER SYSTEM command")));

ISTM we should rather do this in a GRANTable manner. We've worked
substantially towards that in the last few years.


>  
> +    /*
> +     * WALProhibited indicates if we have stopped allowing WAL writes.
> +     * Protected by info_lck.
> +     */
> +    bool        WALProhibited;
> +
>      /*
>       * SharedHotStandbyActive indicates if we allow hot standby queries to be
>       * run.  Protected by info_lck.
> @@ -7962,6 +7969,25 @@ StartupXLOG(void)
>          RequestCheckpoint(CHECKPOINT_FORCE);
>  }
>  
> +void
> +MakeReadOnlyXLOG(void)
> +{
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    XLogCtl->WALProhibited = true;
> +    SpinLockRelease(&XLogCtl->info_lck);
> +}
> +
> +/*
> + * Is the system still in WAL prohibited state?
> + */
> +bool
> +IsWALProhibited(void)
> +{
> +    volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +    return xlogctl->WALProhibited;
> +}

What does this kind of locking achieving? It doesn't protect against
concurrent ALTER SYSTEM SET READ ONLY or such?


> +        /*
> +         * If the server is in WAL-Prohibited state then don't do anything until
> +         * someone wakes us up. E.g. a backend might later on request us to put
> +         * the system back to read-write.
> +         */
> +        if (IsWALProhibited())
> +        {
> +            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> +                             WAIT_EVENT_CHECKPOINTER_MAIN);
> +            continue;
> +        }
> +
>          /*
>           * Detect a pending checkpoint request by checking whether the flags
>           * word in shared memory is nonzero.  We shouldn't need to acquire the

So if the ASRO happens while a checkpoint, potentially with a
checkpoint_timeout = 60d, it'll not take effect until the checkpoint has
finished.

But uh, as far as I can tell, the code would simply continue an
in-progress checkpoint, despite having absorbed the barrier. And then
we'd PANIC when doing the XLogInsert()?



> diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
> new file mode 100644
> index 00000000000..619c33cd780
> --- /dev/null
> +++ b/src/include/access/walprohibit.h

Not sure I like the mix of xlog/wal prefix for pretty closely related
files... I'm not convinced it's worth having a separate file for this,
fwiw.



> From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 15 May 2020 06:39:43 -0400
> Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
>  READ-WRITE
> 
> Till the previous commit, the backend used to do this, but now the backend
> requests checkpointer to do it. Checkpointer, noticing that the current state
> is has WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request,
> and then acknowledges back to the backend who requested the state change.
> 
> Note that this commit also enables ALTER SYSTEM READ WRITE support and make WAL
> prohibited state persistent across the system restarts.

The split between the previous commit and this commit seems more
confusing than useful to me.

> +/*
> + * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
> + * read-only.
> + */
> +void
> +WALProhibitRequest(void)
> +{
> +    /* Must not be called from checkpointer */
> +    Assert(!AmCheckpointerProcess());
> +    Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> +    /*
> +     * If in a standalone backend, just do it ourselves.
> +     */
> +    if (!IsPostmasterEnvironment)
> +    {
> +        performWALProhibitStateChange(GetWALProhibitState());
> +        return;
> +    }
> +
> +    if (CheckpointerShmem->checkpointer_pid == 0)
> +        elog(ERROR, "checkpointer is not running");
> +
> +    if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
> +        elog(ERROR, "could not signal checkpointer: %m");
> +
> +    /* Wait for the state to change to read-only */
> +    ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
> +    for (;;)
> +    {
> +        /*  We'll be done once in-progress flag bit is cleared */
> +        if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> +            break;
> +
> +        elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
> +        ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
> +                               WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
> +    }
> +    ConditionVariableCancelSleep();
> +    elog(DEBUG1, "Done WALProhibitRequest");
> +}

Isn't it possible that the system could have been changed back to be
read-write by the time the wakeup is being processed?



> From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Tue, 14 Jul 2020 02:10:55 -0400
> Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
>  write

Isn't that the wrong order? This needs to come before the feature is
enabled, no?



> @@ -758,6 +759,9 @@ brinbuildempty(Relation index)
>          ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
>      LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
>  
> +    /* Building indexes will have an XID */
> +    AssertWALPermitted_HaveXID();
> +

Ugh, that's a pretty ugly naming scheme mix.



> @@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
>      if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
>          brin_can_do_samepage_update(oldbuf, origsz, newsz))
>      {
> +        /* Can reach here from VACUUM, so need not have an XID */
> +        if (RelationNeedsWAL(idxrel))
> +            CheckWALPermitted();
> +

Hm. Maybe I am confused, but why is that dependent on
RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if
no WAL is emitted?


>  #include "access/genam.h"
>  #include "access/gist_private.h"
>  #include "access/transam.h"
> +#include "access/walprohibit.h"
>  #include "commands/vacuum.h"
>  #include "lib/integerset.h"
>  #include "miscadmin.h"

The number of places that now need this new header - pretty much the
same set of files that do XLogInsert, already requiring an xlog* header
to be included - drives me further towards the conclusion that it's not
a good idea to have it separate.

>  extern void ProcessInterrupts(void);
>  
> +#ifdef USE_ASSERT_CHECKING
> +typedef enum
> +{
> +    WALPERMIT_UNCHECKED,
> +    WALPERMIT_CHECKED,
> +    WALPERMIT_CHECKED_AND_USED
> +} WALPermitCheckState;
> +
> +/* in access/walprohibit.c */
> +extern WALPermitCheckState walpermit_checked_state;
> +
> +/*
> + * Reset walpermit_checked flag when no longer in the critical section.
> + * Otherwise, marked checked and used.
> + */
> +#define RESET_WALPERMIT_CHECKED_STATE() \
> +do { \
> +    walpermit_checked_state = CritSectionCount ? \
> +    WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
> +} while(0)
> +#else
> +#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
> +#endif
> +

Why are these in headers? And why is this tied to CritSectionCount?


Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think we'd want the FIRST write operation to be the end-of-recovery
> > checkpoint, before the system is fully read-write. And then after that
> > completes you could do other things.
>
> I can't see why this is necessary from a correctness or performance
> point of view. Maybe I'm missing something.
>
> In case it is necessary, the patch set does not wait for the checkpoint to
> complete before marking the system as read-write. Refer:
>
> /* Set final state by clearing in-progress flag bit */
> if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> {
>   if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
>     ereport(LOG, (errmsg("system is now read only")));
>   else
>   {
>     /* Request checkpoint */
>     RequestCheckpoint(CHECKPOINT_IMMEDIATE);
>     ereport(LOG, (errmsg("system is now read write")));
>   }
> }
>
> We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> we SetWALProhibitState() and do the ereport(), if we have a read-write
> state change request.
>
+1, I too have the same question.

FWIW, I don't we can request CHECKPOINT_WAIT for this place, otherwise, it think
it will be deadlock case -- checkpointer process waiting for itself.

> Also, we currently request this checkpoint even if there was no startup
> recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
> the read-write request does follow a startup recovery.
> So it should really be:
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
> CHECKPOINT_END_OF_RECOVERY);
> We would need to convey that an end-of-recovery-checkpoint is pending in
> shmem somehow (and only if one such checkpoint is pending, should we do
> it as a part of the read-write request handling).
> Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
> /*
>  * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
>  */
> and then check for that.
>
Yep, we need some indication that end-of-recovery was skipped at the startup,
but I haven't added that since I wasn't sure do we really need
CHECKPOINT_END_OF_RECOVERY as part of the previous concern?

> Some minor comments about the code (some of them probably doesn't
> warrant immediate attention, but for the record...):
>
> 1. There are some places where we can use a local variable to store the
> result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
> brin_doupdate()
>
Ok.

> 2. Similarly, we can also capture the calls to GetWALProhibitState() in
> a local variable where applicable. E.g. inside WALProhibitRequest().
>
I don't think so.

> 3. Some of the functions that were added such as GetWALProhibitState(),
> IsWALProhibited() etc could be declared static inline.
>
IsWALProhibited() can be static but not GetWALProhibitState() since it needed to
be accessible from other files.

> 4. IsWALProhibited(): Shouldn't it really be:
> bool
> IsWALProhibited(void)
> {
>   uint32 walProhibitState = GetWALProhibitState();
>   return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
>     && (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
> }
>
I think the current one is better, this allows read-write transactions from
existing backend which has absorbed barrier or from new backend while we
changing stated to read-write in the assumption that we never fallback.

> 5. I think the comments:
> /* Must be performing an INSERT or UPDATE, so we'll have an XID */
> and
> /* Can reach here from VACUUM, so need not have an XID */
> can be internalized in the function/macro comment header.
>
Ok.

> 6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
> advances */
> We need to update the comment here.
>
Ok.

Will try to address all the above review comments in the next version along with
Andres' concern/suggestion. Thanks again for your time.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Jul 24, 2020 at 7:34 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

Thanks for looking at the patch.

>
> > From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 27 Mar 2020 05:05:38 -0400
> > Subject: [PATCH v3 2/6] Add alter system read only/write syntax
> >
> > Note that syntax doesn't have any implementation.
> > ---
> >  src/backend/nodes/copyfuncs.c    | 12 ++++++++++++
> >  src/backend/nodes/equalfuncs.c   |  9 +++++++++
> >  src/backend/parser/gram.y        | 13 +++++++++++++
> >  src/backend/tcop/utility.c       | 20 ++++++++++++++++++++
> >  src/bin/psql/tab-complete.c      |  6 ++++--
> >  src/include/nodes/nodes.h        |  1 +
> >  src/include/nodes/parsenodes.h   | 10 ++++++++++
> >  src/tools/pgindent/typedefs.list |  1 +
> >  8 files changed, 70 insertions(+), 2 deletions(-)
>
> Shouldn't there be at outfuncs support as well? Perhaps we even need
> readfuncs, not immediately sure.

Ok, can add that as well.

>
>
>
> > From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.
> >
> > Implementation:
> >
> >  1. When a user tried to change server state to WAL-Prohibited using
> >     ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will emit
> >     PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
> >     barrier has been absorbed by all the backends.
> >
> >  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> >     it is already in a transaction and the transaction already assigned XID,
> >     then the backend will be killed by throwing FATAL(XXX: need more discussion
> >     on this)
>
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.
>

As I am aware of, the existing code PostgresMain() uses FATAL to terminate
the connection when protocol synchronization was lost.  Currently, in
a proposal, this and another one is "Terminate the idle sessions"[1] is using
FATAL, afaik.

>
> There are several cases where we WAL log without having an xid
> assigned. E.g. when HOT pruning during syscache lookups or such. Are
> there any cases where the check for being in recovery is followed by a
> CHECK_FOR_INTERRUPTS, before the WAL logging is done?
>

In case of operation without xid, an error will be raised just before the point
where the wal record is expected. The places you are asking about, I haven't
found in a glance, will try to search for that, but I am sure current
implementation is not missing those places where it is supposed to check the
prohibited state and complaint.

Quick question, is it possible that pruning will happen with the SELECT query?
It would be helpful if you or someone else could point me to the place where WAL
can be generated even in the case of read-only queries.

>
>
> >  3. Otherwise, if that backend running transaction which yet to get XID
> >     assigned we don't need to do anything special, simply call
> >     ResetLocalXLogInsertAllowed() so that any future WAL insert in will check
> >     XLogInsertAllowed() first which set ready only state appropriately.
> >
> >  4. A new transaction (from existing or new backend) starts as a read-only
> >     transaction.
>
> Why do we need 4)? And doesn't that have the potential to be
> unnecessarily problematic if a the server is subsequently brought out of
> the readonly state again?

The transaction that was started in the read-only system state will be read-only
until the end.  I think that shouldn't be too problematic.

>
>
> >  5. Auxiliary processes like autovacuum launcher, background writer,
> >     checkpointer and  walwriter will don't do anything in WAL-Prohibited
> >     server state until someone wakes us up. E.g. a backend might later on
> >     request us to put the system back to read-write.
>
> Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
> anything in this state. bgwriter for example is even running entirely
> normally in a hot standby node?

I think I missed to update the description when I reverted the
walwriter changes. The current version doesn't have any changes to
the walwriter.  And bgwriter too behaves the same as it on the recovery
system. Will update this, sorry for the confusion.

>
>
> >  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> >     and xlog rotation. Starting up again will perform crash recovery(XXX:
> >     need some discussion on this as well)
> >
> >  7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
> >
> >  8. Only super user can toggle WAL-Prohibit state.
> >
> >  9. Add system_is_read_only GUC show the system state -- will true when system
> >     is wal prohibited or in recovery.
>
>
>
> > +/*
> > + * AlterSystemSetWALProhibitState
> > + *
> > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > + */
> > +void
> > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > +{
> > +     if (!superuser())
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > +                              errmsg("must be superuser to execute ALTER SYSTEM command")));
>
> ISTM we should rather do this in a GRANTable manner. We've worked
> substantially towards that in the last few years.
>

I added this to be inlined with AlterSystemSetConfigFile(), if we want a
GRANTable manner, will try that.

>
>
> >
> > +     /*
> > +      * WALProhibited indicates if we have stopped allowing WAL writes.
> > +      * Protected by info_lck.
> > +      */
> > +     bool            WALProhibited;
> > +
> >       /*
> >        * SharedHotStandbyActive indicates if we allow hot standby queries to be
> >        * run.  Protected by info_lck.
> > @@ -7962,6 +7969,25 @@ StartupXLOG(void)
> >               RequestCheckpoint(CHECKPOINT_FORCE);
> >  }
> >
> > +void
> > +MakeReadOnlyXLOG(void)
> > +{
> > +     SpinLockAcquire(&XLogCtl->info_lck);
> > +     XLogCtl->WALProhibited = true;
> > +     SpinLockRelease(&XLogCtl->info_lck);
> > +}
> > +
> > +/*
> > + * Is the system still in WAL prohibited state?
> > + */
> > +bool
> > +IsWALProhibited(void)
> > +{
> > +     volatile XLogCtlData *xlogctl = XLogCtl;
> > +
> > +     return xlogctl->WALProhibited;
> > +}
>
> What does this kind of locking achieving? It doesn't protect against
> concurrent ALTER SYSTEM SET READ ONLY or such?
>

The 0004 patch improves that.

>
>
> > +             /*
> > +              * If the server is in WAL-Prohibited state then don't do anything until
> > +              * someone wakes us up. E.g. a backend might later on request us to put
> > +              * the system back to read-write.
> > +              */
> > +             if (IsWALProhibited())
> > +             {
> > +                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> > +                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
> > +                     continue;
> > +             }
> > +
> >               /*
> >                * Detect a pending checkpoint request by checking whether the flags
> >                * word in shared memory is nonzero.  We shouldn't need to acquire the
>
> So if the ASRO happens while a checkpoint, potentially with a
> checkpoint_timeout = 60d, it'll not take effect until the checkpoint has
> finished.
>
> But uh, as far as I can tell, the code would simply continue an
> in-progress checkpoint, despite having absorbed the barrier. And then
> we'd PANIC when doing the XLogInsert()?

I think this might not be the case with the next checkpointer changes in the
0004 patch.

>
> > diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
> > new file mode 100644
> > index 00000000000..619c33cd780
> > --- /dev/null
> > +++ b/src/include/access/walprohibit.h
>
> Not sure I like the mix of xlog/wal prefix for pretty closely related
> files... I'm not convinced it's worth having a separate file for this,
> fwiw.

I see.

>
>
>
> > From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 15 May 2020 06:39:43 -0400
> > Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
> >  READ-WRITE
> >
> > Till the previous commit, the backend used to do this, but now the backend
> > requests checkpointer to do it. Checkpointer, noticing that the current state
> > is has WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request,
> > and then acknowledges back to the backend who requested the state change.
> >
> > Note that this commit also enables ALTER SYSTEM READ WRITE support and make WAL
> > prohibited state persistent across the system restarts.
>
> The split between the previous commit and this commit seems more
> confusing than useful to me.

By looking at the previous two review comments I agree with you.  My
intention to make things easier for the reviewer. Will merge this patch
with the previous one.

>
> > +/*
> > + * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
> > + * read-only.
> > + */
> > +void
> > +WALProhibitRequest(void)
> > +{
> > +     /* Must not be called from checkpointer */
> > +     Assert(!AmCheckpointerProcess());
> > +     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > +
> > +     /*
> > +      * If in a standalone backend, just do it ourselves.
> > +      */
> > +     if (!IsPostmasterEnvironment)
> > +     {
> > +             performWALProhibitStateChange(GetWALProhibitState());
> > +             return;
> > +     }
> > +
> > +     if (CheckpointerShmem->checkpointer_pid == 0)
> > +             elog(ERROR, "checkpointer is not running");
> > +
> > +     if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
> > +             elog(ERROR, "could not signal checkpointer: %m");
> > +
> > +     /* Wait for the state to change to read-only */
> > +     ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
> > +     for (;;)
> > +     {
> > +             /*  We'll be done once in-progress flag bit is cleared */
> > +             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > +                     break;
> > +
> > +             elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
> > +             ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
> > +                                                        WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
> > +     }
> > +     ConditionVariableCancelSleep();
> > +     elog(DEBUG1, "Done WALProhibitRequest");
> > +}
>
> Isn't it possible that the system could have been changed back to be
> read-write by the time the wakeup is being processed?

You have a point, the second backend will see the ASRW executed successfully
despite any changes by this.  I think it better to have an error for the second
backend instead of silent.  Will do the same.

>
> > From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Tue, 14 Jul 2020 02:10:55 -0400
> > Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
> >  write
>
> Isn't that the wrong order? This needs to come before the feature is
> enabled, no?
>

Agreed but, IMHO,  let it be, my intention behind the split is to make code read
easy and I don't think they are going to be check-in separately except 0001.

>
>
> > @@ -758,6 +759,9 @@ brinbuildempty(Relation index)
> >               ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
> >       LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
> >
> > +     /* Building indexes will have an XID */
> > +     AssertWALPermitted_HaveXID();
> > +
>
> Ugh, that's a pretty ugly naming scheme mix.
>

Ok.

>
>
>
> > @@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
> >       if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
> >               brin_can_do_samepage_update(oldbuf, origsz, newsz))
> >       {
> > +             /* Can reach here from VACUUM, so need not have an XID */
> > +             if (RelationNeedsWAL(idxrel))
> > +                     CheckWALPermitted();
> > +
>
> Hm. Maybe I am confused, but why is that dependent on
> RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if
> no WAL is emitted?
>

To avoid the unnecessary error for the case where the wal record will not be
generated.

>
> >  #include "access/genam.h"
> >  #include "access/gist_private.h"
> >  #include "access/transam.h"
> > +#include "access/walprohibit.h"
> >  #include "commands/vacuum.h"
> >  #include "lib/integerset.h"
> >  #include "miscadmin.h"
>
> The number of places that now need this new header - pretty much the
> same set of files that do XLogInsert, already requiring an xlog* header
> to be included - drives me further towards the conclusion that it's not
> a good idea to have it separate.
>

Noted.

>
> >  extern void ProcessInterrupts(void);
> >
> > +#ifdef USE_ASSERT_CHECKING
> > +typedef enum
> > +{
> > +     WALPERMIT_UNCHECKED,
> > +     WALPERMIT_CHECKED,
> > +     WALPERMIT_CHECKED_AND_USED
> > +} WALPermitCheckState;
> > +
> > +/* in access/walprohibit.c */
> > +extern WALPermitCheckState walpermit_checked_state;
> > +
> > +/*
> > + * Reset walpermit_checked flag when no longer in the critical section.
> > + * Otherwise, marked checked and used.
> > + */
> > +#define RESET_WALPERMIT_CHECKED_STATE() \
> > +do { \
> > +     walpermit_checked_state = CritSectionCount ? \
> > +     WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
> > +} while(0)
> > +#else
> > +#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
> > +#endif
> > +
>
> Why are these in headers? And why is this tied to CritSectionCount?
>

If it is too bad we could think to move that. In the critical section, we don't
want the walpermit_checked_state flag to be reset by XLogResetInsertion()
otherwise following XLogBeginInsert() will have an assertion.  The idea is that
anything that checks the flag changes it from UNCHECKED to CHECKED.
XLogResetInsertion() sets it to CHECKED_AND_USED if in a critical section and to
UNCHECKED otherwise (i.e. when CritSectionCount == 0).

Regards,
Amul

1] https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;

This doesn't really work because of the considerations mentioned in
http://postgr.es/m/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.

It's basically impossible to create a system for fast failover that
involves a checkpoint.  See my comments at
http://postgr.es/m/CA+TgmoYe8uCgtYFGfnv3vWpZTygsdkSu2F4MNiqhkar_UKbWfQ@mail.gmail.com
- you can't achieve five nines or even four nines of availability if
you have to wait for a checkpoint that might take twenty minutes. I
have nothing against a feature that does what you're describing, but
this feature is designed to make fast failover easier to accomplish,
and it's not going to succeed if it involves a checkpoint.

> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.

There are a few things you can can imagine doing here:

1. Freeze WAL writes but allow dirty buffers to be flushed afterward.
This is the most useful thing for fast failover, I would argue,
because it's quick and the fact that some dirty buffers may not be
written doesn't matter.

2. Freeze WAL writes except a final checkpoint which will flush dirty
buffers along the way. This is like shutting the system down cleanly
and bringing it back up as a standby, except without performing a
shutdown.

3. Freeze WAL writes and write out all dirty buffers without actually
checkpointing. This is sort of a hybrid of #1 and #2. It's probably
not much faster than #2 but it avoids generating any more WAL.

4. Freeze WAL writes and just keep all the dirty buffers cached,
without writing them out. This seems like a bad idea for the reasons
mentioned in Amul's reply. The system might not be able to respond
even to read-only queries any more if shared_buffers is full of
unevictable dirty buffers.

Either #2 or #3 is sufficient to take a filesystem level snapshot of
the cluster while it's running, but I'm not sure why that's
interesting. You can already do that sort of thing by using
pg_basebackup or by running pg_start_backup() and pg_stop_backup() and
copying the directory in the middle, and you can do all of that while
the cluster is accepting writes, which seems like it will usually be
more convenient. If you do want this, you have several options, like
running a checkpoint immediately followed by ALTER SYSTEM READ ONLY
(so that the amount of WAL generated during the backup is small but
maybe not none); or shutting down the system cleanly and restarting it
as a standby; or maybe using the proposed pg_ctl demote feature
mentioned on a separate thread.

Contrary to what you write, I don't think either #2 or #3 is
sufficient to enable checksums, at least not without some more
engineering, because the server would cache the state from the control
file, and a bunch of blocks from the database. I guess it would work
if you did a server restart afterward, but I think there are better
ways of supporting online checksum enabling that don't require
shutting down the server, or even making it read-only; and there's
been significant work done on those already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> In the read-only level I was suggesting, I wasn't suggesting that we
> stop WAL flushes, in fact we should flush the WAL before we mark the
> system as read-only. Once the system declares itself as read-only, it
> will not perform any more on-disk changes; It may perform all the
> flushes it needs as a part of the read-only request handling.

I think that's already how the patch works, or at least how it should
work. You stop new writes, flush any existing WAL, and then declare
the system read-only. That can all be done quickly.

> What I am saying is it doesn't have to be just the queries. I think we
> can cater to all the other use cases simply by forcing a checkpoint
> before marking the system as read-only.

But that part can't, which means that if we did that, it would break
the feature for the originally intended use case. I'm not on board
with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jul 23, 2020 at 10:04 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.

Seems like that would need a separate discussion, apart from this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> >   if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> >     ereport(LOG, (errmsg("system is now read only")));
> >   else
> >   {
> >     /* Request checkpoint */
> >     RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> >     ereport(LOG, (errmsg("system is now read write")));
> >   }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
>
>
> FWIW, I don't we can request CHECKPOINT_WAIT for this place, otherwise, it
> think
> it will be deadlock case -- checkpointer process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().

> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState() since it
> needed to
> be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.

Regards,
Soumyadeep



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Fri, Jul 24, 2020 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > So if we are not going to address those cases, we should change the
> > syntax and remove the notion of read-only. It could be:
> >
> > ALTER SYSTEM SET wal_writes TO off|on;
> > or
> > ALTER SYSTEM SET prohibit_wal TO off|on;
>
> This doesn't really work because of the considerations mentioned in
> http://postgr.es/m/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy dirty buffer flushes)
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may think of static on-disk
files immediately.

> Contrary to what you write, I don't think either #2 or #3 is
> sufficient to enable checksums, at least not without some more
> engineering, because the server would cache the state from the control
> file, and a bunch of blocks from the database. I guess it would work
> if you did a server restart afterward, but I think there are better
> ways of supporting online checksum enabling that don't require
> shutting down the server, or even making it read-only; and there's
> been significant work done on those already.

Agreed. As you mentioned, if we did do #2 or #3, we would be able to do
pg_checksums on a server that was shut down or that had crashed while it
was in a read-only state, which is what Michael was asking for in [1]. I
think it's just cleaner if we allow for this.

I don't have enough context to enumerate use cases for the advantages or
opportunities that would come with an assurance that the cluster's files
are frozen (and not covered by any existing utilities), but surely there
are some? Like the possibility of pg_upgrade on a running server while
it can entertain read-only queries? Surely, that's a nice one!

Of course, some or all of these utilities would need to be taught about
read-only mode.

Regards,
Soumyadeep

[1] http://postgr.es/m/20200626095921.GF1504@paquier.xyz



Re: [Patch] ALTER SYSTEM READ ONLY

From
Soumyadeep Chakraborty
Date:
On Fri, Jul 24, 2020 at 7:34 AM Robert Haas <robertmhaas@gmail.com> wrote:

>
> On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > In the read-only level I was suggesting, I wasn't suggesting that we
> > stop WAL flushes, in fact we should flush the WAL before we mark the
> > system as read-only. Once the system declares itself as read-only, it
> > will not perform any more on-disk changes; It may perform all the
> > flushes it needs as a part of the read-only request handling.
>
> I think that's already how the patch works, or at least how it should
> work. You stop new writes, flush any existing WAL, and then declare
> the system read-only. That can all be done quickly.
>

True, except for the fact that it allows dirty buffers to be flushed
after the ALTER command returns.

> > What I am saying is it doesn't have to be just the queries. I think we
> > can cater to all the other use cases simply by forcing a checkpoint
> > before marking the system as read-only.
>
> But that part can't, which means that if we did that, it would break
> the feature for the originally intended use case. I'm not on board
> with that.
>

Referring to the options you presented in [1]:
I am saying that we should allow for both: with a checkpoint (#2) (can
also be #3) and without a checkpoint (#1) before having the ALTER
command return, by having different levels of read-onlyness.

We should have syntax variants for these. The syntax should not be an
ALTER SYSTEM SET as you have pointed out before. Perhaps:

ALTER SYSTEM READ ONLY; -- #2 or #3
ALTER SYSTEM READ ONLY WAL; -- #1
ALTER SYSTEM READ WRITE;

or even:

ALTER SYSTEM FREEZE; -- #2 or #3
ALTER SYSTEM FREEZE WAL; -- #1
ALTER SYSTEM UNFREEZE;

Regards,
Soumyadeep (VMware)

[1] http://postgr.es/m/CA+TgmoZ-c3Dz9QwHwmm4bc36N4u0XZ2OyENewMf+BwokbYdK9Q@mail.gmail.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Hi,

The attached version is updated w.r.t. some of the review comments
from Soumyadeep and Andres.

Two thing from Andres' review comment are not addressed are:

1. Only superuser allowed to execute AlterSystemSetWALProhibitState(). As per
Andres instead we should do this in a GRANTable manner. I tried that but
got a little confused with the roles that we could use for ASRO and didn't see
any much appropriate one. pg_signal_backend could have been suited for ASRO
where we terminate some of the backends but a user granted this role is not
supposed to terminate the superuser backend. If we used that we need to check a
superuser backend and raise an error or warning. Other roles are
pg_write_server_files or pg_execute_server_program but I am not sure we should
use either of this, seems a bit confusing to me. Any suggestion or am I missing
something here?

2. About walprohibit.c/.h file, Andres' concern on file name is that WAL
related file names are started with xlog. I think renaming to xlog* will not be
the correct and will be more confusing since function/variable/macros inside
walprohibit.c/.h files contain the walprohibit keyword.  And another concern is due to
separate file we have to include it to many places but I think that will be
one time pain and worth it to keep code modularised.

Andres, Robert, do let me know your opinion on this if you think we should merge
walprohibit.c/.h file into xlog.c/.h, will do that in the next version.


Changes in the attached version are:

1. Renamed readonly_cv to walprohibit_cv.
2. Removed repetitive comments for CheckWALPermitted() &
AssertWALPermitted_HaveXID().
3. Renamed AssertWALPermitted_HaveXID() to AssertWALPermittedHaveXID().
4. Changes to avoid repeated RelationNeedsWAL() calls.
5. IsWALProhibited() made static inline function.
6. Added outfuncs and readfuncs functions.
7. Added error when read-only state transition is in progress and other backends
trying to make the system read-write or vice versa. Previously 2nd backend seeing
command that was executed successfully but it wasn't.
8. Merged checkpointer code changes patch to 0002.

Regards,
Amul
Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Jul 24, 2020 at 10:40 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> >   if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> >     ereport(LOG, (errmsg("system is now read only")));
> >   else
> >   {
> >     /* Request checkpoint */
> >     RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> >     ereport(LOG, (errmsg("system is now read write")));
> >   }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
>
>
> FWIW, I don't we can request CHECKPOINT_WAIT for this place, otherwise, it
> think
> it will be deadlock case -- checkpointer process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().


The only setting flag would have been enough for now, the next loop of
CheckpointerMain() will anyway be going to call CreateCheckPoint() without
waiting.  I used RequestCheckpoint() to avoid duplicate flag setting code.
Also, I think RequestCheckpoint() will be better so that we don't need to deal
will the standalone backend, the only imperfection is it will unnecessary signal
itself, that would be fine I guess.

> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState() since it
> needed to
> be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.

Well, the current patch set also has few inline functions in the header file. 
But, I don't think we can do the same for GetWALProhibitState() without changing
the XLogCtl structure scope which is local to xlog.c file and the changing XLogCtl
scope would be a bad idea.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Fri, Jul 24, 2020 at 3:12 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
> think we should say "READ ONLY" if we still allow on-disk file changes
> after the ALTER SYSTEM command returns (courtesy dirty buffer flushes)
> because it does introduce confusion, especially to an audience not privy
> to this thread. When people hear "read-only" they may think of static on-disk
> files immediately.

They might think of a variety of things that are not a correct
interpretation of what the feature does, but I think the way to handle
that is to document it properly. I don't think making WAL a grammar
keyword just for this is a good idea. I'm not totally stuck on this
particular syntax if there's consensus on something else, but I
seriously doubt that there will be consensus around adding parser
keywords for this.

> I don't have enough context to enumerate use cases for the advantages or
> opportunities that would come with an assurance that the cluster's files
> are frozen (and not covered by any existing utilities), but surely there
> are some? Like the possibility of pg_upgrade on a running server while
> it can entertain read-only queries? Surely, that's a nice one!

I think that this feature is plenty complicated enough already, and we
shouldn't make it more complicated to cater to additional use cases,
especially when those use cases are somewhat uncertain and would
probably require additional work in other parts of the system.

For instance, I think it would be great to have an option to start the
postmaster in a strictly "don't write ANYTHING" mode where regardless
of the cluster state it won't write any data files or any WAL or even
the control file. It would be useful for poking around on damaged
clusters without making things worse. And it's somewhat related to the
topic of this thread, but it's not THAT closely related. It's better
to add features one at a time; you can always add more later, but if
you make the individual ones too big and hard they don't get done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is a rebased on top of the latest master head (# 3e98c0bafb2).

Does anyone, especially anyone named Andres Freund, have comments on
0001? That work is somewhat independent of the rest of this patch set
from a theoretical point of view, and it seems like if nobody sees a
problem with the line of attack there, it would make sense to go ahead
and commit that part. Considering that this global barrier stuff is
new and that I'm not sure how well we really understand the problems
yet, there's a possibility that we might end up revising these details
again. I understand that most people, including me, are somewhat
reluctant to see experimental code get committed, in this case that
ship has basically sailed already, since neither of the patches that
we thought would use the barrier mechanism end up making it into v13.
I don't think it's really making things any worse to try to improve
the mechanism.

0002 isn't separately committable, but I don't see anything wrong with it.

Regarding 0003:

I don't understand why ProcessBarrierWALProhibit() can safely assert
that the WALPROHIBIT_STATE_READ_ONLY is set.

+                                errhint("Cannot continue a
transaction if it has performed writes while system is read only.")));

This sentence is bad because it makes it sound like the current
transaction successfully performed a write after the system had
already become read-only. I think something like errdetail("Sessions
with open write transactions must be terminated.") would be better.

I think SetWALProhibitState() could be in walprohibit.c rather than
xlog.c. Also, this function appears to have obvious race conditions.
It fetches the current state, then thinks things over while holding no
lock, and then unconditionally updates the current state. What happens
if somebody else has changed the state in the meantime? I had sort of
imagined that we'd use something like pg_atomic_uint32 for this and
manipulate it using compare-and-swap operations. Using some kind of
lock is probably fine, too, but you have to hold it long enough that
the variable can't change under you while you're still deciding
whether it's OK to modify it, or else recheck after reacquiring the
lock that the value doesn't differ from what you expect.

I think the choice to use info_lck to synchronize
SharedWALProhibitState is very strange -- what is the justification
for that? I thought the idea might be that we frequently need to check
SharedWALProhibitState at times when we'd be holding info_lck anyway,
but it looks to me like you always do separate acquisitions of
info_lck just for this, in which case I don't see why we should use it
here instead of a separate lock. For that matter, why does this need
to be part of XLogCtlData rather than a separate shared memory area
that is private to walprohibit.c?

-       else
+       /*
+        * Can't perform checkpoint or xlog rotation without writing WAL.
+        */
+       else if (XLogInsertAllowed())

Not project style.

+               case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:

Can we drop the word SYSTEM here to make this shorter, or would that
break some convention?

+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+ static const char *
+ show_system_is_read_only(void)
+{

I'm not sure the comment is appropriate here, but I'm very sure the
extra spaces before "static" and "show" are not per style.

+               /*  We'll be done once in-progress flag bit is cleared */

Another whitespace mistake.

+               elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+       elog(DEBUG1, "Done WALProhibitRequest");

I think these should be removed.

Can WALProhibitRequest() and performWALProhibitStateChange() be moved
to walprohibit.c, just to bring more of the code for this feature
together in one place? Maybe we could also rename them to
RequestWALProhibitChange() and CompleteWALProhibitChange()?

-        * think it should leave the child state in place.
+        * think it should leave the child state in place.  Note that the upper
+        * transaction will be a force to ready-only irrespective of
its previous
+        * status if the server state is WAL prohibited.
         */
-       XactReadOnly = s->prevXactReadOnly;
+       XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();

Both instances of this pattern seem sketchy to me. You don't expect
that reverting the state to a previous state will instead change to a
different state that doesn't match up with what you had before. What
is the bad thing that would happen if we did not make this change?

-        * Else, must check to see if we're still in recovery.
+        * Else, must check to see if we're still in recovery

Spurious change.

+                       /* Request checkpoint */
+                       RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+                       ereport(LOG, (errmsg("system is now read write")));

This does not seem right. Perhaps the intention here was that the
system should perform a checkpoint when it switches to read-write
state after having skipped the startup checkpoint. But why would we do
this unconditionally in all cases where we just went to a read-write
state?

There's probably quite a bit more to say about 0003 but I think I'm
running too low on mental energy to say more now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Sat, Aug 29, 2020 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is a rebased on top of the latest master head (# 3e98c0bafb2).
>
> Does anyone, especially anyone named Andres Freund, have comments on
> 0001? That work is somewhat independent of the rest of this patch set
> from a theoretical point of view, and it seems like if nobody sees a
> problem with the line of attack there, it would make sense to go ahead
> and commit that part. Considering that this global barrier stuff is
> new and that I'm not sure how well we really understand the problems
> yet, there's a possibility that we might end up revising these details
> again. I understand that most people, including me, are somewhat
> reluctant to see experimental code get committed, in this case that
> ship has basically sailed already, since neither of the patches that
> we thought would use the barrier mechanism end up making it into v13.
> I don't think it's really making things any worse to try to improve
> the mechanism.
>
> 0002 isn't separately committable, but I don't see anything wrong with it.
>
> Regarding 0003:
>
> I don't understand why ProcessBarrierWALProhibit() can safely assert
> that the WALPROHIBIT_STATE_READ_ONLY is set.
>

IF blocks entered to kill a transaction have valid XID & this happens only in
case of system state changing to READ_ONLY.

> +                                errhint("Cannot continue a
> transaction if it has performed writes while system is read only.")));
>
> This sentence is bad because it makes it sound like the current
> transaction successfully performed a write after the system had
> already become read-only. I think something like errdetail("Sessions
> with open write transactions must be terminated.") would be better.
>

Ok, changed as suggested in the attached version.

> I think SetWALProhibitState() could be in walprohibit.c rather than
> xlog.c. Also, this function appears to have obvious race conditions.
> It fetches the current state, then thinks things over while holding no
> lock, and then unconditionally updates the current state. What happens
> if somebody else has changed the state in the meantime? I had sort of
> imagined that we'd use something like pg_atomic_uint32 for this and
> manipulate it using compare-and-swap operations. Using some kind of
> lock is probably fine, too, but you have to hold it long enough that
> the variable can't change under you while you're still deciding
> whether it's OK to modify it, or else recheck after reacquiring the
> lock that the value doesn't differ from what you expect.
>
> I think the choice to use info_lck to synchronize
> SharedWALProhibitState is very strange -- what is the justification
> for that? I thought the idea might be that we frequently need to check
> SharedWALProhibitState at times when we'd be holding info_lck anyway,
> but it looks to me like you always do separate acquisitions of
> info_lck just for this, in which case I don't see why we should use it
> here instead of a separate lock. For that matter, why does this need
> to be part of XLogCtlData rather than a separate shared memory area
> that is private to walprohibit.c?
>

In the attached patch I added a separate shared memory structure for WAL
prohibit state. SharedWALProhibitState is now pg_atomic_uint32 and part of that
structure instead of XLogCtlData. The shared state will be changed using a
compare-and-swap operation.

I hope that should be enough to avoid said race conditions.

> -       else
> +       /*
> +        * Can't perform checkpoint or xlog rotation without writing WAL.
> +        */
> +       else if (XLogInsertAllowed())
>
> Not project style.
>

Corrected.

> +               case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:
>
> Can we drop the word SYSTEM here to make this shorter, or would that
> break some convention?
>

No issue, removed SYSTEM.

> +/*
> + * NB: The return string should be the same as the _ShowOption() for boolean
> + * type.
> + */
> + static const char *
> + show_system_is_read_only(void)
> +{
>

Fixed.

> I'm not sure the comment is appropriate here, but I'm very sure the
> extra spaces before "static" and "show" are not per style.
>
> +               /*  We'll be done once in-progress flag bit is cleared */
>
> Another whitespace mistake.
>

Fixed.

> +               elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
> +       elog(DEBUG1, "Done WALProhibitRequest");
>
> I think these should be removed.
>

Removed.

> Can WALProhibitRequest() and performWALProhibitStateChange() be moved
> to walprohibit.c, just to bring more of the code for this feature
> together in one place? Maybe we could also rename them to
> RequestWALProhibitChange() and CompleteWALProhibitChange()?
>

Yes, I have moved these functions to walprohibit.c and renamed as suggested.
For this, I needed to add few helper functions to send a signal to checkpointer
and update Control File, as send_signal_to_checkpointer &
SetControlFileWALProhibitFlag() respectively, since checkpointer_pid
or ControlFile are not directly accessible from walprohibit.c

> -        * think it should leave the child state in place.
> +        * think it should leave the child state in place.  Note that the upper
> +        * transaction will be a force to ready-only irrespective of
> its previous
> +        * status if the server state is WAL prohibited.
>          */
> -       XactReadOnly = s->prevXactReadOnly;
> +       XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();
>
> Both instances of this pattern seem sketchy to me. You don't expect
> that reverting the state to a previous state will instead change to a
> different state that doesn't match up with what you had before. What
> is the bad thing that would happen if we did not make this change?
>

We can drop these changes now since we are simply terminating sessions for those
who have performed or expected to perform write operations.

> -        * Else, must check to see if we're still in recovery.
> +        * Else, must check to see if we're still in recovery
>
> Spurious change.
>

Fixed.

> +                       /* Request checkpoint */
> +                       RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> +                       ereport(LOG, (errmsg("system is now read write")));
>
> This does not seem right. Perhaps the intention here was that the
> system should perform a checkpoint when it switches to read-write
> state after having skipped the startup checkpoint. But why would we do
> this unconditionally in all cases where we just went to a read-write
> state?
>

You are correct since this could be expensive if the system changes to read-only
for a shorter period. For the initial version, I did this unconditionally to
avoid additional shared-memory variables in XLogCtlData but now WAL prohibits
state got its own shared-memory structure so that I have added the required
variable to it.  Now, doing this checkpoint conditionally with
 CHECKPOINT_END_OF_RECOVERY & CHECKPOINT_IMMEDIATE flag what we do in the
startup process. Note that to mark end-of-recovery checkpoint has been skipped
from the startup process I have added helper function as
MarkCheckPointSkippedInWalProhibitState(), I am not sure the name that I have
chosen is the best fit.

> There's probably quite a bit more to say about 0003 but I think I'm
> running too low on mental energy to say more now.
>

Thanks for your time and suggestions.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

On 2020-08-28 15:53:29 -0400, Robert Haas wrote:
> On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is a rebased on top of the latest master head (# 3e98c0bafb2).
>
> Does anyone, especially anyone named Andres Freund, have comments on
> 0001? That work is somewhat independent of the rest of this patch set
> from a theoretical point of view, and it seems like if nobody sees a
> problem with the line of attack there, it would make sense to go ahead
> and commit that part.

It'd be easier to review the proposed commit if it included reasoning
about the change...

In particular, it looks to me like the commit actually implements two
different changes:
1) Allow a barrier function to "reject" a set barrier, because it can't
   be set in that moment
2) Allow barrier functions to raise errors

and there's not much of an explanation as to why (probably somewhere
upthread, but ...)



 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
     flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

     /*
-     * Process each type of barrier. It's important that nothing we call from
-     * here throws an error, because pss_barrierCheckMask has already been
-     * cleared. If we jumped out of here before processing all barrier types,
-     * then we'd forget about the need to do so later.
-     *
-     * NB: It ought to be OK to call the barrier-processing functions
-     * unconditionally, but it's more efficient to call only the ones that
-     * might need us to do something based on the flags.
+     * If there are no flags set, then we can skip doing any real work.
+     * Otherwise, establish a PG_TRY block, so that we don't lose track of
+     * which types of barrier processing are needed if an ERROR occurs.
      */
-    if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-        ProcessBarrierPlaceholder();
+    if (flags != 0)
+    {
+        PG_TRY();
+        {
+            /*
+             * Process each type of barrier. The barrier-processing functions
+             * should normally return true, but may return false if the barrier
+             * can't be absorbed at the current time. This should be rare,
+             * because it's pretty expensive.  Every single
+             * CHECK_FOR_INTERRUPTS() will return here until we manage to
+             * absorb the barrier, and that cost will add up in a hurry.
+             *
+             * NB: It ought to be OK to call the barrier-processing functions
+             * unconditionally, but it's more efficient to call only the ones
+             * that might need us to do something based on the flags.
+             */
+            if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+                && ProcessBarrierPlaceholder())
+                BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);

This pattern seems like it'll get unwieldy with more than one barrier
type. And won't flag "unhandled" barrier types either (already the case,
I know). We could go for something like:

    while (flags != 0)
    {
        barrier_bit = pg_rightmost_one_pos32(flags);
        barrier_type = 1 >> barrier_bit;

        switch (barrier_type)
        {
                case PROCSIGNAL_BARRIER_PLACEHOLDER:
                    processed = ProcessBarrierPlaceholder();
        }

        if (processed)
            BARRIER_CLEAR_BIT(flags, barrier_type);
    }

But perhaps that's too complicated?

+        }
+        PG_CATCH();
+        {
+            /*
+             * If an ERROR occurred, add any flags that weren't yet handled
+             * back into pss_barrierCheckMask, and reset the global variables
+             * so that we try again the next time we check for interrupts.
+             */
+            pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+                                   flags);

For this to be correct, wouldn't flags need to be volatile? Otherwise
this might use a register value for flags, which might not contain the
correct value at this point.

Perhaps a comment explaining why we have to clear bits first would be
good?

+            ProcSignalBarrierPending = true;
+            InterruptPending = true;
+
+            PG_RE_THROW();
+        }
+        PG_END_TRY();


+        /*
+         * If some barrier was not successfully absorbed, we will have to try
+         * again later.
+         */
+        if (flags != 0)
+        {
+            pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+                                   flags);
+            ProcSignalBarrierPending = true;
+            InterruptPending = true;
+            return;
+        }
+    }

I wish there were a way we could combine the PG_CATCH and this instance
of the same code. I'd probably just move into a helper.


It might be good to add a warning to WaitForProcSignalBarrier() or by
pss_barrierCheckMask indicating that it's *not* OK to look at
pss_barrierCheckMask when checking whether barriers have been processed.


> Considering that this global barrier stuff is
> new and that I'm not sure how well we really understand the problems
> yet, there's a possibility that we might end up revising these details
> again. I understand that most people, including me, are somewhat
> reluctant to see experimental code get committed, in this case that
> ship has basically sailed already, since neither of the patches that
> we thought would use the barrier mechanism end up making it into v13.
> I don't think it's really making things any worse to try to improve
> the mechanism.

Yea, I have no problem with this.


Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...


On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> index 42050ab7195..0ac826d3c2f 100644
> --- a/src/backend/nodes/readfuncs.c
> +++ b/src/backend/nodes/readfuncs.c
> @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
>      READ_DONE();
>  }
>  
> +/*
> + * _readAlterSystemWALProhibitState
> + */
> +static AlterSystemWALProhibitState *
> +_readAlterSystemWALProhibitState(void)
> +{
> +    READ_LOCALS(AlterSystemWALProhibitState);
> +
> +    READ_BOOL_FIELD(WALProhibited);
> +
> +    READ_DONE();
> +}
> +

Why do we need readfuncs support for this?

> +
> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +static void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> +    /* some code */
> +    elog(INFO, "AlterSystemSetWALProhibitState() called");
> +}

As long as it's not implemented it seems better to return an ERROR.

> @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
>      VariableSetStmt *setstmt;    /* SET subcommand */
>  } AlterSystemStmt;
>  
> +/* ----------------------
> + *        Alter System Read Statement
> + * ----------------------
> + */
> +typedef struct AlterSystemWALProhibitState
> +{
> +    NodeTag        type;
> +    bool        WALProhibited;
> +} AlterSystemWALProhibitState;
> +

All the nearby fields use under_score_style names.



> From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> 
> Implementation:
> 
>  1. When a user tried to change server state to WAL-Prohibited using
>     ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
>     raises request to checkpointer by marking current state to inprogress in
>     shared memory.  Checkpointer, noticing that the current state is has

"is has"

>     WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
>     then acknowledges back to the backend who requested the state change once
>     the transition has been completed.  Final state will be updated in control
>     file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?


>  2. When a backend receives the WAL-Prohibited barrier, at that moment if
>     it is already in a transaction and the transaction already assigned XID,
>     then the backend will be killed by throwing FATAL(XXX: need more discussion
>     on this)


>  3. Otherwise, if that backend running transaction which yet to get XID
>     assigned we don't need to do anything special

Somewhat garbled sentence...


>  4. A new transaction (from existing or new backend) starts as a read-only
>     transaction.

Maybe "(in an existing or in a new backend)"?


>  5. Autovacuum launcher as well as checkpointer will don't do anything in
>     WAL-Prohibited server state until someone wakes us up.  E.g. a backend
>     might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"


>  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
>     and xlog rotation. Starting up again will perform crash recovery(XXX:
>     need some discussion on this as well) but the end of recovery checkpoint
>     will be skipped and it will be performed when the system changed to
>     WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.


>  8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;


>  9. Add system_is_read_only GUC show the system state -- will true when system
>     is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay.  ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.


> --- /dev/null
> +++ b/src/backend/access/transam/walprohibit.c
> @@ -0,0 +1,321 @@
> +/*-------------------------------------------------------------------------
> + *
> + * walprohibit.c
> + *         PostgreSQL write-ahead log prohibit states
> + *
> + *
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> + *
> + * src/backend/access/transam/walprohibit.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/walprohibit.h"
> +#include "pgstat.h"
> +#include "port/atomics.h"
> +#include "postmaster/bgwriter.h"
> +#include "storage/condition_variable.h"
> +#include "storage/procsignal.h"
> +#include "storage/shmem.h"
> +
> +/*
> + * Shared-memory WAL prohibit state
> + */
> +typedef struct WALProhibitStateData
> +{
> +    /* Indicates current WAL prohibit state */
> +    pg_atomic_uint32 SharedWALProhibitState;
> +
> +    /* Startup checkpoint pending */
> +    bool        checkpointPending;
> +
> +    /* Signaled when requested WAL prohibit state changes */
> +    ConditionVariable walprohibit_cv;

You're using three different naming styles for as many members.



> +/*
> + * ProcessBarrierWALProhibit()
> + *
> + * Handle WAL prohibit state change request.
> + */
> +bool
> +ProcessBarrierWALProhibit(void)
> +{
> +    /*
> +     * Kill off any transactions that have an XID *before* allowing the system
> +     * to go WAL prohibit state.
> +     */
> +    if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))

Hm. I wonder if this check is good enough. If you look at
RecordTransactionCommit() we also WAL log in some cases where no xid was
assigned.  This is particularly true of (auto-)vacuum, but also for HOT
pruning.

I think it'd be good to put the logic of this check into xlog.c and
mirror the logic in RecordTransactionCommit(). And add cross-referencing
comments to RecordTransactionCommit and the new function, reminding our
futures selves that both places need to be modified.


> +    {
> +        /* Should be here only for the WAL prohibit state. */
> +        Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);

There are no races where an ASRO READ ONLY is quickly followed by ASRO
READ WRITE where this could be reached?


> +/*
> + * AlterSystemSetWALProhibitState()
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> +    uint32        state;
> +
> +    if (!superuser())
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                 errmsg("must be superuser to execute ALTER SYSTEM command")));

See comments about this above.


> +    /* Alter WAL prohibit state not allowed during recovery */
> +    PreventCommandDuringRecovery("ALTER SYSTEM");
> +
> +    /* Requested state */
> +    state = stmt->WALProhibited ?
> +        WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
> +
> +    /*
> +     * Since we yet to convey this WAL prohibit state to all backend mark it
> +     * in-progress.
> +     */
> +    state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
> +
> +    if (!SetWALProhibitState(state))
> +        return;                    /* server is already in the desired state */
> +

This use of bitmasks seems unnecessary to me. I'd rather have one param
for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
for WALPROHIBIT_TRANSITION_IN_PROGRESS



> +/*
> + * RequestWALProhibitChange()
> + *
> + * Request checkpointer to make the WALProhibitState to read-only.
> + */
> +static void
> +RequestWALProhibitChange(void)
> +{
> +    /* Must not be called from checkpointer */
> +    Assert(!AmCheckpointerProcess());
> +    Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> +    /*
> +     * If in a standalone backend, just do it ourselves.
> +     */
> +    if (!IsPostmasterEnvironment)
> +    {
> +        CompleteWALProhibitChange(GetWALProhibitState());
> +        return;
> +    }
> +
> +    send_signal_to_checkpointer(SIGINT);
> +
> +    /* Wait for the state to change to read-only */
> +    ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
> +    for (;;)
> +    {
> +        /* We'll be done once in-progress flag bit is cleared */
> +        if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> +            break;
> +
> +        ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
> +                               WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
> +    }
> +    ConditionVariableCancelSleep();

What if somebody concurrently changes the state back to READ WRITE?
Won't we unnecessarily wait here?

That's probably fine, because we would just wait until that transition
is complete too. But at least a comment about that would be
good. Alternatively a "ASRO transitions completed counter" or such might
be a better idea?


> +/*
> + * CompleteWALProhibitChange()
> + *
> + * Checkpointer will call this to complete the requested WAL prohibit state
> + * transition.
> + */
> +void
> +CompleteWALProhibitChange(uint32 wal_state)
> +{
> +    uint64        barrierGeneration;
> +
> +    /*
> +     * Must be called from checkpointer. Otherwise, it must be single-user
> +     * backend.
> +     */
> +    Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
> +    Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> +    /*
> +     * WAL prohibit state change is initiated. We need to complete the state
> +     * transition by setting requested WAL prohibit state in all backends.
> +     */
> +    elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
> +
> +    /* Emit global barrier */
> +    barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
> +    WaitForProcSignalBarrier(barrierGeneration);
> +
> +    /* And flush all writes. */
> +    XLogFlush(GetXLogWriteRecPtr());

Hm, maybe I'm missing something, but why is the write pointer the right
thing to flush? That won't include records that haven't been written to
disk yet... We also need to trigger writing out all WAL that is as of
yet unwritten, no?  Without having thought a lot about it, it seems that
GetXLogInsertRecPtr() would be the right thing to flush?


> +    /* Set final state by clearing in-progress flag bit */
> +    if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> +    {
> +        bool        wal_prohibited;
> +
> +        wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
> +
> +        /* Update the control file to make state persistent */
> +        SetControlFileWALProhibitFlag(wal_prohibited);

Hm. Is there an issue with not WAL logging the control file change? Is
there a scenario where we a crash + recovery would end up overwriting
this?


> +        if (wal_prohibited)
> +            ereport(LOG, (errmsg("system is now read only")));
> +        else
> +        {
> +            /*
> +             * Request checkpoint if the end-of-recovery checkpoint has been
> +             * skipped previously.
> +             */
> +            if (WALProhibitState->checkpointPending)
> +            {
> +                RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
> +                                  CHECKPOINT_IMMEDIATE);
> +                WALProhibitState->checkpointPending = false;
> +            }
> +            ereport(LOG, (errmsg("system is now read write")));
> +        }
> +    }
> +
> +    /* Wake up the backend who requested the state change */
> +    ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);

Could be multiple backends, right?


> +}
> +
> +/*
> + * GetWALProhibitState()
> + *
> + * Atomically return the current server WAL prohibited state
> + */
> +uint32
> +GetWALProhibitState(void)
> +{
> +    return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
> +}

Is there an issue with needing memory barriers here?


> +/*
> + * SetWALProhibitState()
> + *
> + * Change current WAL prohibit state to the input state.
> + *
> + * If the server is already completely moved to the requested WAL prohibit
> + * state, or if the desired state is same as the current state, return false,
> + * indicating that the server state did not change. Else return true.
> + */
> +bool
> +SetWALProhibitState(uint32 new_state)
> +{
> +    bool        state_updated = false;
> +    uint32        cur_state;
> +
> +    cur_state = GetWALProhibitState();
> +
> +    /* Server is already in requested state */
> +    if (new_state == cur_state ||
> +        new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
> +        return false;
> +
> +    /* Prevent concurrent contrary in progress transition state setting */
> +    if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
> +        (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> +    {
> +        if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                     errmsg("system state transition to read only is already in progress"),
> +                     errhint("Try after sometime again.")));
> +        else
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                     errmsg("system state transition to read write is already in progress"),
> +                     errhint("Try after sometime again.")));
> +    }
> +
> +    /* Update new state in share memory */
> +    state_updated =
> +        pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
> +                                       &cur_state, new_state);
> +
> +    if (!state_updated)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("system read write state concurrently changed"),
> +                 errhint("Try after sometime again.")));
> +

I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
of a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where than can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.


> +/
> + * MarkCheckPointSkippedInWalProhibitState()
> + *
> + * Sets checkpoint pending flag so that it can be performed next time while
> + * changing system state to WAL permitted.
> + */
> +void
> +MarkCheckPointSkippedInWalProhibitState(void)
> +{
> +    WALProhibitState->checkpointPending = true;
> +}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).
There's state in ControlFile *and* here for essentially the same thing.



> +     * If it is not currently possible to insert write-ahead log records,
> +     * either because we are still in recovery or because ALTER SYSTEM READ
> +     * ONLY has been executed, force this to be a read-only transaction.
> +     * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> +     * us from modifying data during recovery when !XLogInsertAllowed(), but
> +     * this gives the normal indication to the user that the transaction is
> +     * read-only.
> +     *
> +     * On the other hand, we only need to set the startedInRecovery flag when
> +     * the transaction started during recovery, and not when WAL is otherwise
> +     * prohibited. This information is used by RelationGetIndexScan() to
> +     * decide whether to permit (1) relying on existing killed-tuple markings
> +     * and (2) further killing of index tuples. Even when WAL is prohibited
> +     * on the master, it's still the master, so the former is OK; and since
> +     * killing index tuples doesn't generate WAL, the latter is also OK.
> +     * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
> +     */
> +    XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
> +    s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...


>  /*
>   * Subroutine to try to fetch and validate a prior checkpoint record.
>   *
> @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
>       */
>      WalSndWaitStopping();
>  
> +    /*
> +     * The restartpoint, checkpoint, or xlog rotation will be performed if the
> +     * WAL writing is permitted.
> +     */
>      if (RecoveryInProgress())
>          CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> -    else
> +    else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we atually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.


>      ShutdownCLOG();
>      ShutdownCommitTs();
>      ShutdownSUBTRANS();
> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 1b8cd7bacd4..aa4cdd57ec1 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
>  
>          HandleAutoVacLauncherInterrupts();
>  
> +        /* If the server is read only just go back to sleep. */
> +        if (!XLogInsertAllowed())
> +            continue;
> +

I think we really should have a different functions for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.



> @@ -342,6 +344,28 @@ CheckpointerMain(void)
>          AbsorbSyncRequests();
>          HandleCheckpointerInterrupts();
>  
> +        wal_state = GetWALProhibitState();
> +
> +        if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
> +        {
> +            /* Complete WAL prohibit state change request */
> +            CompleteWALProhibitChange(wal_state);
> +            continue;
> +        }
> +        else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
> +        {
> +            /*
> +             * Don't do anything until someone wakes us up.  For example a
> +             * backend might later on request us to put the system back to
> +             * read-write wal prohibit sate.
> +             */
> +            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> +                             WAIT_EVENT_CHECKPOINTER_MAIN);
> +            continue;
> +        }
> +        Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
> +
>          /*
>           * Detect a pending checkpoint request by checking whether the flags
>           * word in shared memory is nonzero.  We shouldn't need to acquire the
> @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
>  
>      return FirstCall;
>  }

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over?  That seems very much not
practical. What am I missing?


> +/*
> + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
> + */
> +void
> +send_signal_to_checkpointer(int signum)
> +{
> +    if (CheckpointerShmem->checkpointer_pid == 0)
> +        elog(ERROR, "checkpointer is not running");
> +
> +    if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
> +        elog(ERROR, "could not signal checkpointer: %m");
> +}

Sudden switch to a different naming style...




Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,

Thanks for your time.

>
> Thomas, there's one point below that could be relevant for you. You can
> search for your name and/or checkpoint...
>
>
> On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> > diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> > index 42050ab7195..0ac826d3c2f 100644
> > --- a/src/backend/nodes/readfuncs.c
> > +++ b/src/backend/nodes/readfuncs.c
> > @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
> >       READ_DONE();
> >  }
> >
> > +/*
> > + * _readAlterSystemWALProhibitState
> > + */
> > +static AlterSystemWALProhibitState *
> > +_readAlterSystemWALProhibitState(void)
> > +{
> > +     READ_LOCALS(AlterSystemWALProhibitState);
> > +
> > +     READ_BOOL_FIELD(WALProhibited);
> > +
> > +     READ_DONE();
> > +}
> > +
>
> Why do we need readfuncs support for this?
>

I thought we need that from your previous comment[1].

> > +
> > +/*
> > + * AlterSystemSetWALProhibitState
> > + *
> > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > + */
> > +static void
> > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > +{
> > +     /* some code */
> > +     elog(INFO, "AlterSystemSetWALProhibitState() called");
> > +}
>
> As long as it's not implemented it seems better to return an ERROR.
>

Ok, will add an error in the next version.

> > @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
> >       VariableSetStmt *setstmt;       /* SET subcommand */
> >  } AlterSystemStmt;
> >
> > +/* ----------------------
> > + *           Alter System Read Statement
> > + * ----------------------
> > + */
> > +typedef struct AlterSystemWALProhibitState
> > +{
> > +     NodeTag         type;
> > +     bool            WALProhibited;
> > +} AlterSystemWALProhibitState;
> > +
>
> All the nearby fields use under_score_style names.
>

I am not sure which nearby fields having the underscore that you are referring
to. Probably "WALProhibited" needs to be renamed to  "walprohibited" to be
inline with the nearby fields.

>
> > From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> >
> > Implementation:
> >
> >  1. When a user tried to change server state to WAL-Prohibited using
> >     ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> >     raises request to checkpointer by marking current state to inprogress in
> >     shared memory.  Checkpointer, noticing that the current state is has
>
> "is has"
>
> >     WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> >     then acknowledges back to the backend who requested the state change once
> >     the transition has been completed.  Final state will be updated in control
> >     file to make it persistent across the system restarts.
>
> What makes checkpointer the right backend to do this work?
>

Once we've initiated the change to a read-only state, we probably want to
always either finish that change or go back to read-write, even if the process
that initiated the change is interrupted. Leaving the system in a
half-way-in-between state long term seems bad. Maybe we would have put some
background process, but choose the checkpointer in charge of making the state
change and to avoid the new background process to keep the first version patch
simple.  The checkpointer isn't likely to get killed, but if it does, it will
be relaunched and the new one can clean things up.  Probably later we might want
such a background worker that will be isn't likely to get killed.

>
> >  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> >     it is already in a transaction and the transaction already assigned XID,
> >     then the backend will be killed by throwing FATAL(XXX: need more discussion
> >     on this)
>
>
> >  3. Otherwise, if that backend running transaction which yet to get XID
> >     assigned we don't need to do anything special
>
> Somewhat garbled sentence...
>
>
> >  4. A new transaction (from existing or new backend) starts as a read-only
> >     transaction.
>
> Maybe "(in an existing or in a new backend)"?
>
>
> >  5. Autovacuum launcher as well as checkpointer will don't do anything in
> >     WAL-Prohibited server state until someone wakes us up.  E.g. a backend
> >     might later on request us to put the system back to read-write.
>
> "will don't do anything", "might later on request us"
>

Ok, I'll fix all of this. I usually don't much focus on the commit message text
but I try to make it as much as possible sane enough.

>
> >  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> >     and xlog rotation. Starting up again will perform crash recovery(XXX:
> >     need some discussion on this as well) but the end of recovery checkpoint
> >     will be skipped and it will be performed when the system changed to
> >     WAL-Permitted mode.
>
> Hm, this has some interesting interactions with some of Thomas' recent
> hacking.
>

I would be so thankful for the help.

>
> >  8. Only super user can toggle WAL-Prohibit state.
>
> Hm. I don't quite agree with this. We try to avoid if (superuser())
> style checks these days, because they can't be granted to other
> users. Look at how e.g. pg_promote() - an operation of similar severity
> - is handled. We just revoke the permission from public in
> system_views.sql:
> REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
>

Ok, currently we don't have SQL callable function to change the system
read-write state.  Do you want me to add that? If so, any naming suggesting? How
about pg_make_system_read_only(bool)  or have two function as
pg_make_system_read_only(void) & pg_make_system_read_write(void).

>
> >  9. Add system_is_read_only GUC show the system state -- will true when system
> >     is wal prohibited or in recovery.
>
> *shows the system state. There's also some oddity in the second part of
> the sentence.
>
> Is it really correct to show system_is_read_only as true during
> recovery? For one, recovery could end soon after, putting the system
> into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
> during recovery the database state actually changes if there are changes
> to replay.  ISTM it would not be a good idea to mix ASRO and
> pg_is_in_recovery() into one GUC.
>

Well, whether the system is in recovery or wal prohibited state it is read-only
for the user perspective, isn't it?

>
> > --- /dev/null
> > +++ b/src/backend/access/transam/walprohibit.c
> > @@ -0,0 +1,321 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * walprohibit.c
> > + *           PostgreSQL write-ahead log prohibit states
> > + *
> > + *
> > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> > + *
> > + * src/backend/access/transam/walprohibit.c
> > + *
> > + *-------------------------------------------------------------------------
> > + */
> > +#include "postgres.h"
> > +
> > +#include "access/walprohibit.h"
> > +#include "pgstat.h"
> > +#include "port/atomics.h"
> > +#include "postmaster/bgwriter.h"
> > +#include "storage/condition_variable.h"
> > +#include "storage/procsignal.h"
> > +#include "storage/shmem.h"
> > +
> > +/*
> > + * Shared-memory WAL prohibit state
> > + */
> > +typedef struct WALProhibitStateData
> > +{
> > +     /* Indicates current WAL prohibit state */
> > +     pg_atomic_uint32 SharedWALProhibitState;
> > +
> > +     /* Startup checkpoint pending */
> > +     bool            checkpointPending;
> > +
> > +     /* Signaled when requested WAL prohibit state changes */
> > +     ConditionVariable walprohibit_cv;
>
> You're using three different naming styles for as many members.
>

Ill fix in the next version.

>
> > +/*
> > + * ProcessBarrierWALProhibit()
> > + *
> > + * Handle WAL prohibit state change request.
> > + */
> > +bool
> > +ProcessBarrierWALProhibit(void)
> > +{
> > +     /*
> > +      * Kill off any transactions that have an XID *before* allowing the system
> > +      * to go WAL prohibit state.
> > +      */
> > +     if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
>
> Hm. I wonder if this check is good enough. If you look at
> RecordTransactionCommit() we also WAL log in some cases where no xid was
> assigned.  This is particularly true of (auto-)vacuum, but also for HOT
> pruning.
>
> I think it'd be good to put the logic of this check into xlog.c and
> mirror the logic in RecordTransactionCommit(). And add cross-referencing
> comments to RecordTransactionCommit and the new function, reminding our
> futures selves that both places need to be modified.
>

I am not sure I have understood this, here is the snip from the implementation
detail from the first post[2]:

"Open transactions that don't have an XID are not killed, but will get an ERROR
if they try to acquire an XID later, or if they try to write WAL without
acquiring an XID (e.g. VACUUM).  To make that happen, the patch adds a new
coding rule: a critical section that will write WAL must be preceded by a call
to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID().
The latter variants are used when we know for certain that inserting WAL here
must be OK, either because we have an XID (we would have been killed by a change
to read-only if one had occurred) or for some other reason."

Do let me  know if you want further clarification.

>
> > +     {
> > +             /* Should be here only for the WAL prohibit state. */
> > +             Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
>
> There are no races where an ASRO READ ONLY is quickly followed by ASRO
> READ WRITE where this could be reached?
>

No, right now SetWALProhibitState() doesn't allow two transient wal prohibit
states at a time.

>
> > +/*
> > + * AlterSystemSetWALProhibitState()
> > + *
> > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > + */
> > +void
> > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > +{
> > +     uint32          state;
> > +
> > +     if (!superuser())
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > +                              errmsg("must be superuser to execute ALTER SYSTEM command")));
>
> See comments about this above.
>
>
> > +     /* Alter WAL prohibit state not allowed during recovery */
> > +     PreventCommandDuringRecovery("ALTER SYSTEM");
> > +
> > +     /* Requested state */
> > +     state = stmt->WALProhibited ?
> > +             WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
> > +
> > +     /*
> > +      * Since we yet to convey this WAL prohibit state to all backend mark it
> > +      * in-progress.
> > +      */
> > +     state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
> > +
> > +     if (!SetWALProhibitState(state))
> > +             return;                                 /* server is already in the desired state */
> > +
>
> This use of bitmasks seems unnecessary to me. I'd rather have one param
> for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
> for WALPROHIBIT_TRANSITION_IN_PROGRESS
>

Ok.

How about the new version of  SetWALProhibitState function as :
SetWALProhibitState(bool wal_prohibited, bool is_final_state)  ?

>
>
> > +/*
> > + * RequestWALProhibitChange()
> > + *
> > + * Request checkpointer to make the WALProhibitState to read-only.
> > + */
> > +static void
> > +RequestWALProhibitChange(void)
> > +{
> > +     /* Must not be called from checkpointer */
> > +     Assert(!AmCheckpointerProcess());
> > +     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > +
> > +     /*
> > +      * If in a standalone backend, just do it ourselves.
> > +      */
> > +     if (!IsPostmasterEnvironment)
> > +     {
> > +             CompleteWALProhibitChange(GetWALProhibitState());
> > +             return;
> > +     }
> > +
> > +     send_signal_to_checkpointer(SIGINT);
> > +
> > +     /* Wait for the state to change to read-only */
> > +     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
> > +     for (;;)
> > +     {
> > +             /* We'll be done once in-progress flag bit is cleared */
> > +             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > +                     break;
> > +
> > +             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
> > +                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
> > +     }
> > +     ConditionVariableCancelSleep();
>
> What if somebody concurrently changes the state back to READ WRITE?
> Won't we unnecessarily wait here?
>

Yes, there will be wait.

> That's probably fine, because we would just wait until that transition
> is complete too. But at least a comment about that would be
> good. Alternatively a "ASRO transitions completed counter" or such might
> be a better idea?
>

Ok, will add comments but could you please elaborate little a bit about "ASRO
transitions completed counter"  and is there any existing counter I can refer
to?

>
> > +/*
> > + * CompleteWALProhibitChange()
> > + *
> > + * Checkpointer will call this to complete the requested WAL prohibit state
> > + * transition.
> > + */
> > +void
> > +CompleteWALProhibitChange(uint32 wal_state)
> > +{
> > +     uint64          barrierGeneration;
> > +
> > +     /*
> > +      * Must be called from checkpointer. Otherwise, it must be single-user
> > +      * backend.
> > +      */
> > +     Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
> > +     Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > +
> > +     /*
> > +      * WAL prohibit state change is initiated. We need to complete the state
> > +      * transition by setting requested WAL prohibit state in all backends.
> > +      */
> > +     elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
> > +
> > +     /* Emit global barrier */
> > +     barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
> > +     WaitForProcSignalBarrier(barrierGeneration);
> > +
> > +     /* And flush all writes. */
> > +     XLogFlush(GetXLogWriteRecPtr());
>
> Hm, maybe I'm missing something, but why is the write pointer the right
> thing to flush? That won't include records that haven't been written to
> disk yet... We also need to trigger writing out all WAL that is as of
> yet unwritten, no?  Without having thought a lot about it, it seems that
> GetXLogInsertRecPtr() would be the right thing to flush?
>

TBH, I am not an expert in this area.  I wants to flush the latest record
pointer that needs to be flushed, I think  GetXLogInsertRecPtr() would be fine
if is the latest one. Note that wal flushes are not blocked in read-only mode.

>
> > +     /* Set final state by clearing in-progress flag bit */
> > +     if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > +     {
> > +             bool            wal_prohibited;
> > +
> > +             wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
> > +
> > +             /* Update the control file to make state persistent */
> > +             SetControlFileWALProhibitFlag(wal_prohibited);
>
> Hm. Is there an issue with not WAL logging the control file change? Is
> there a scenario where we a crash + recovery would end up overwriting
> this?
>

I am not sure. If the system crash before update this that means we haven't
acknowledged the system state change. And the server will be restarted with the
previous state.

Could you please explain what bothering you.

>
> > +             if (wal_prohibited)
> > +                     ereport(LOG, (errmsg("system is now read only")));
> > +             else
> > +             {
> > +                     /*
> > +                      * Request checkpoint if the end-of-recovery checkpoint has been
> > +                      * skipped previously.
> > +                      */
> > +                     if (WALProhibitState->checkpointPending)
> > +                     {
> > +                             RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
> > +                                                               CHECKPOINT_IMMEDIATE);
> > +                             WALProhibitState->checkpointPending = false;
> > +                     }
> > +                     ereport(LOG, (errmsg("system is now read write")));
> > +             }
> > +     }
> > +
> > +     /* Wake up the backend who requested the state change */
> > +     ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
>
> Could be multiple backends, right?
>

Yes, you are correct, will fix that.

>
> > +}
> > +
> > +/*
> > + * GetWALProhibitState()
> > + *
> > + * Atomically return the current server WAL prohibited state
> > + */
> > +uint32
> > +GetWALProhibitState(void)
> > +{
> > +     return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
> > +}
>
> Is there an issue with needing memory barriers here?
>
>
> > +/*
> > + * SetWALProhibitState()
> > + *
> > + * Change current WAL prohibit state to the input state.
> > + *
> > + * If the server is already completely moved to the requested WAL prohibit
> > + * state, or if the desired state is same as the current state, return false,
> > + * indicating that the server state did not change. Else return true.
> > + */
> > +bool
> > +SetWALProhibitState(uint32 new_state)
> > +{
> > +     bool            state_updated = false;
> > +     uint32          cur_state;
> > +
> > +     cur_state = GetWALProhibitState();
> > +
> > +     /* Server is already in requested state */
> > +     if (new_state == cur_state ||
> > +             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > +             return false;
> > +
> > +     /* Prevent concurrent contrary in progress transition state setting */
> > +     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
> > +             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > +     {
> > +             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
> > +                     ereport(ERROR,
> > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                      errmsg("system state transition to read only is already in progress"),
> > +                                      errhint("Try after sometime again.")));
> > +             else
> > +                     ereport(ERROR,
> > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                      errmsg("system state transition to read write is already in progress"),
> > +                                      errhint("Try after sometime again.")));
> > +     }
> > +
> > +     /* Update new state in share memory */
> > +     state_updated =
> > +             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
> > +                                                                        &cur_state, new_state);
> > +
> > +     if (!state_updated)
> > +             ereport(ERROR,
> > +                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                              errmsg("system read write state concurrently changed"),
> > +                              errhint("Try after sometime again.")));
> > +
>
> I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
> of a loop. I think there's platforms (basically all load-linked /
> store-conditional architectures) where than can fail spuriously.
>
> Also, there's no memory barrier around GetWALProhibitState, so there's
> no guarantee it's not an out-of-date value you're starting with.
>

How about having some kind of lock instead what Robert have suggested
previously[3] ?

>
> > +/
> > + * MarkCheckPointSkippedInWalProhibitState()
> > + *
> > + * Sets checkpoint pending flag so that it can be performed next time while
> > + * changing system state to WAL permitted.
> > + */
> > +void
> > +MarkCheckPointSkippedInWalProhibitState(void)
> > +{
> > +     WALProhibitState->checkpointPending = true;
> > +}
>
> I don't *at all* like this living outside of xlog.c. I think this should
> be moved there, and merged with deferring checkpoints in other cases
> (promotions, not immediately performing a checkpoint after recovery).

Here we want to perform the checkpoint sometime quite later when the
system state changes to read-write. For that, I think we need some flag
if we want this in xlog.c then we can have that flag in XLogCtl.


> There's state in ControlFile *and* here for essentially the same thing.
>

I am sorry to trouble you much, but I haven't understood this too.

>
>
> > +      * If it is not currently possible to insert write-ahead log records,
> > +      * either because we are still in recovery or because ALTER SYSTEM READ
> > +      * ONLY has been executed, force this to be a read-only transaction.
> > +      * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> > +      * us from modifying data during recovery when !XLogInsertAllowed(), but
> > +      * this gives the normal indication to the user that the transaction is
> > +      * read-only.
> > +      *
> > +      * On the other hand, we only need to set the startedInRecovery flag when
> > +      * the transaction started during recovery, and not when WAL is otherwise
> > +      * prohibited. This information is used by RelationGetIndexScan() to
> > +      * decide whether to permit (1) relying on existing killed-tuple markings
> > +      * and (2) further killing of index tuples. Even when WAL is prohibited
> > +      * on the master, it's still the master, so the former is OK; and since
> > +      * killing index tuples doesn't generate WAL, the latter is also OK.
> > +      * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
> > +      */
> > +     XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
> > +     s->startedInRecovery = RecoveryInProgress();
>
> It's somewhat ugly that we call RecoveryInProgress() once in
> XLogInsertAllowed() and then again directly here... It's probably fine
> runtime cost wise, but...
>
>
> >  /*
> >   * Subroutine to try to fetch and validate a prior checkpoint record.
> >   *
> > @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
> >        */
> >       WalSndWaitStopping();
> >
> > +     /*
> > +      * The restartpoint, checkpoint, or xlog rotation will be performed if the
> > +      * WAL writing is permitted.
> > +      */
> >       if (RecoveryInProgress())
> >               CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> > -     else
> > +     else if (XLogInsertAllowed())
>
> Not sure I like going via XLogInsertAllowed(), that seems like a
> confusing indirection here. And it encompasses things we atually don't
> want to check for - it's fragile to also look at LocalXLogInsertAllowed
> here imo.
>
>
> >       ShutdownCLOG();
> >       ShutdownCommitTs();
> >       ShutdownSUBTRANS();
> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > index 1b8cd7bacd4..aa4cdd57ec1 100644
> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
> > @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
> >
> >               HandleAutoVacLauncherInterrupts();
> >
> > +             /* If the server is read only just go back to sleep. */
> > +             if (!XLogInsertAllowed())
> > +                     continue;
> > +
>
> I think we really should have a different functions for places like
> this. We don't want to generally hide bugs like e.g. starting the
> autovac launcher in recovery, but this would.
>

So, we need a separate function like XLogInsertAllowed() and a global variable
like LocalXLogInsertAllowed for the caching wal prohibit state.

>
> > @@ -342,6 +344,28 @@ CheckpointerMain(void)
> >               AbsorbSyncRequests();
> >               HandleCheckpointerInterrupts();
> >
> > +             wal_state = GetWALProhibitState();
> > +
> > +             if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
> > +             {
> > +                     /* Complete WAL prohibit state change request */
> > +                     CompleteWALProhibitChange(wal_state);
> > +                     continue;
> > +             }
> > +             else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
> > +             {
> > +                     /*
> > +                      * Don't do anything until someone wakes us up.  For example a
> > +                      * backend might later on request us to put the system back to
> > +                      * read-write wal prohibit sate.
> > +                      */
> > +                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> > +                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
> > +                     continue;
> > +             }
> > +             Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
> > +
> >               /*
> >                * Detect a pending checkpoint request by checking whether the flags
> >                * word in shared memory is nonzero.  We shouldn't need to acquire the
> > @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
> >
> >       return FirstCall;
> >  }
>
> So, if we're in the middle of a paced checkpoint with a large
> checkpoint_timeout - a sensible real world configuration - we'll not
> process ASRO until that checkpoint is over?  That seems very much not
> practical. What am I missing?
>

Yes, the process doing ASRO will wait until that checkpoint is over.

>
> > +/*
> > + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
> > + */
> > +void
> > +send_signal_to_checkpointer(int signum)
> > +{
> > +     if (CheckpointerShmem->checkpointer_pid == 0)
> > +             elog(ERROR, "checkpointer is not running");
> > +
> > +     if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
> > +             elog(ERROR, "could not signal checkpointer: %m");
> > +}
>
> Sudden switch to a different naming style...
>

My bad, sorry, will fix that.

Regards,
Amul

1] http://postgr.es/m/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de
2] http://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
3] http://postgr.es/m/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Hi Andres,

The attached patch has fixed the issue that you have raised & I have confirmed
in my previous email.  Also, I tried to improve some of the things that you have
pointed but for those changes, I am a little unsure and looking forward to the
inputs/suggestions/confirmation on that, therefore 0003 patch is marked WIP.

Please have a look at my inline reply below for the things that are changes in
the attached version and need inputs:

On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
>
> Thanks for your time.
>
> >
> > Thomas, there's one point below that could be relevant for you. You can
> > search for your name and/or checkpoint...
> >
> >
> > On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> > > diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> > > index 42050ab7195..0ac826d3c2f 100644
> > > --- a/src/backend/nodes/readfuncs.c
> > > +++ b/src/backend/nodes/readfuncs.c
> > > @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
> > >       READ_DONE();
> > >  }
> > >
> > > +/*
> > > + * _readAlterSystemWALProhibitState
> > > + */
> > > +static AlterSystemWALProhibitState *
> > > +_readAlterSystemWALProhibitState(void)
> > > +{
> > > +     READ_LOCALS(AlterSystemWALProhibitState);
> > > +
> > > +     READ_BOOL_FIELD(WALProhibited);
> > > +
> > > +     READ_DONE();
> > > +}
> > > +
> >
> > Why do we need readfuncs support for this?
> >
>
> I thought we need that from your previous comment[1].
>
> > > +
> > > +/*
> > > + * AlterSystemSetWALProhibitState
> > > + *
> > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > > + */
> > > +static void
> > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > > +{
> > > +     /* some code */
> > > +     elog(INFO, "AlterSystemSetWALProhibitState() called");
> > > +}
> >
> > As long as it's not implemented it seems better to return an ERROR.
> >
>
> Ok, will add an error in the next version.
>
> > > @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
> > >       VariableSetStmt *setstmt;       /* SET subcommand */
> > >  } AlterSystemStmt;
> > >
> > > +/* ----------------------
> > > + *           Alter System Read Statement
> > > + * ----------------------
> > > + */
> > > +typedef struct AlterSystemWALProhibitState
> > > +{
> > > +     NodeTag         type;
> > > +     bool            WALProhibited;
> > > +} AlterSystemWALProhibitState;
> > > +
> >
> > All the nearby fields use under_score_style names.
> >
>
> I am not sure which nearby fields having the underscore that you are referring
> to. Probably "WALProhibited" needs to be renamed to  "walprohibited" to be
> inline with the nearby fields.
>
> >
> > > From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> > > From: Amul Sul <amul.sul@enterprisedb.com>
> > > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > > Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> > >
> > > Implementation:
> > >
> > >  1. When a user tried to change server state to WAL-Prohibited using
> > >     ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> > >     raises request to checkpointer by marking current state to inprogress in
> > >     shared memory.  Checkpointer, noticing that the current state is has
> >
> > "is has"
> >
> > >     WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> > >     then acknowledges back to the backend who requested the state change once
> > >     the transition has been completed.  Final state will be updated in control
> > >     file to make it persistent across the system restarts.
> >
> > What makes checkpointer the right backend to do this work?
> >
>
> Once we've initiated the change to a read-only state, we probably want to
> always either finish that change or go back to read-write, even if the process
> that initiated the change is interrupted. Leaving the system in a
> half-way-in-between state long term seems bad. Maybe we would have put some
> background process, but choose the checkpointer in charge of making the state
> change and to avoid the new background process to keep the first version patch
> simple.  The checkpointer isn't likely to get killed, but if it does, it will
> be relaunched and the new one can clean things up.  Probably later we might want
> such a background worker that will be isn't likely to get killed.
>
> >
> > >  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> > >     it is already in a transaction and the transaction already assigned XID,
> > >     then the backend will be killed by throwing FATAL(XXX: need more discussion
> > >     on this)
> >
> >
> > >  3. Otherwise, if that backend running transaction which yet to get XID
> > >     assigned we don't need to do anything special
> >
> > Somewhat garbled sentence...
> >
> >
> > >  4. A new transaction (from existing or new backend) starts as a read-only
> > >     transaction.
> >
> > Maybe "(in an existing or in a new backend)"?
> >
> >
> > >  5. Autovacuum launcher as well as checkpointer will don't do anything in
> > >     WAL-Prohibited server state until someone wakes us up.  E.g. a backend
> > >     might later on request us to put the system back to read-write.
> >
> > "will don't do anything", "might later on request us"
> >
>
> Ok, I'll fix all of this. I usually don't much focus on the commit message text
> but I try to make it as much as possible sane enough.
>
> >
> > >  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> > >     and xlog rotation. Starting up again will perform crash recovery(XXX:
> > >     need some discussion on this as well) but the end of recovery checkpoint
> > >     will be skipped and it will be performed when the system changed to
> > >     WAL-Permitted mode.
> >
> > Hm, this has some interesting interactions with some of Thomas' recent
> > hacking.
> >
>
> I would be so thankful for the help.
>
> >
> > >  8. Only super user can toggle WAL-Prohibit state.
> >
> > Hm. I don't quite agree with this. We try to avoid if (superuser())
> > style checks these days, because they can't be granted to other
> > users. Look at how e.g. pg_promote() - an operation of similar severity
> > - is handled. We just revoke the permission from public in
> > system_views.sql:
> > REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
> >
>
> Ok, currently we don't have SQL callable function to change the system
> read-write state.  Do you want me to add that? If so, any naming suggesting? How
> about pg_make_system_read_only(bool)  or have two function as
> pg_make_system_read_only(void) & pg_make_system_read_write(void).
>

In the attached version I added SQL callable function as
pg_alter_wal_prohibit_state(bool), and another suggestion for the naming is
welcome.

For the permission denied error for ASRO READ-ONLY/READ-WRITE, I have added
ereport() in AlterSystemSetWALProhibitState() instead of aclcheck_error() and
the hint is added. Any suggestions?

> >
> > >  9. Add system_is_read_only GUC show the system state -- will true when system
> > >     is wal prohibited or in recovery.
> >
> > *shows the system state. There's also some oddity in the second part of
> > the sentence.
> >
> > Is it really correct to show system_is_read_only as true during
> > recovery? For one, recovery could end soon after, putting the system
> > into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
> > during recovery the database state actually changes if there are changes
> > to replay.  ISTM it would not be a good idea to mix ASRO and
> > pg_is_in_recovery() into one GUC.
> >
>
> Well, whether the system is in recovery or wal prohibited state it is read-only
> for the user perspective, isn't it?
>
> >
> > > --- /dev/null
> > > +++ b/src/backend/access/transam/walprohibit.c
> > > @@ -0,0 +1,321 @@
> > > +/*-------------------------------------------------------------------------
> > > + *
> > > + * walprohibit.c
> > > + *           PostgreSQL write-ahead log prohibit states
> > > + *
> > > + *
> > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> > > + *
> > > + * src/backend/access/transam/walprohibit.c
> > > + *
> > > + *-------------------------------------------------------------------------
> > > + */
> > > +#include "postgres.h"
> > > +
> > > +#include "access/walprohibit.h"
> > > +#include "pgstat.h"
> > > +#include "port/atomics.h"
> > > +#include "postmaster/bgwriter.h"
> > > +#include "storage/condition_variable.h"
> > > +#include "storage/procsignal.h"
> > > +#include "storage/shmem.h"
> > > +
> > > +/*
> > > + * Shared-memory WAL prohibit state
> > > + */
> > > +typedef struct WALProhibitStateData
> > > +{
> > > +     /* Indicates current WAL prohibit state */
> > > +     pg_atomic_uint32 SharedWALProhibitState;
> > > +
> > > +     /* Startup checkpoint pending */
> > > +     bool            checkpointPending;
> > > +
> > > +     /* Signaled when requested WAL prohibit state changes */
> > > +     ConditionVariable walprohibit_cv;
> >
> > You're using three different naming styles for as many members.
> >
>
> Ill fix in the next version.
>
> >
> > > +/*
> > > + * ProcessBarrierWALProhibit()
> > > + *
> > > + * Handle WAL prohibit state change request.
> > > + */
> > > +bool
> > > +ProcessBarrierWALProhibit(void)
> > > +{
> > > +     /*
> > > +      * Kill off any transactions that have an XID *before* allowing the system
> > > +      * to go WAL prohibit state.
> > > +      */
> > > +     if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny()))
> >
> > Hm. I wonder if this check is good enough. If you look at
> > RecordTransactionCommit() we also WAL log in some cases where no xid was
> > assigned.  This is particularly true of (auto-)vacuum, but also for HOT
> > pruning.
> >
> > I think it'd be good to put the logic of this check into xlog.c and
> > mirror the logic in RecordTransactionCommit(). And add cross-referencing
> > comments to RecordTransactionCommit and the new function, reminding our
> > futures selves that both places need to be modified.
> >
>
> I am not sure I have understood this, here is the snip from the implementation
> detail from the first post[2]:
>
> "Open transactions that don't have an XID are not killed, but will get an ERROR
> if they try to acquire an XID later, or if they try to write WAL without
> acquiring an XID (e.g. VACUUM).  To make that happen, the patch adds a new
> coding rule: a critical section that will write WAL must be preceded by a call
> to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID().
> The latter variants are used when we know for certain that inserting WAL here
> must be OK, either because we have an XID (we would have been killed by a change
> to read-only if one had occurred) or for some other reason."
>
> Do let me  know if you want further clarification.
>
> >
> > > +     {
> > > +             /* Should be here only for the WAL prohibit state. */
> > > +             Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY);
> >
> > There are no races where an ASRO READ ONLY is quickly followed by ASRO
> > READ WRITE where this could be reached?
> >
>
> No, right now SetWALProhibitState() doesn't allow two transient wal prohibit
> states at a time.
>
> >
> > > +/*
> > > + * AlterSystemSetWALProhibitState()
> > > + *
> > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > > + */
> > > +void
> > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > > +{
> > > +     uint32          state;
> > > +
> > > +     if (!superuser())
> > > +             ereport(ERROR,
> > > +                             (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > > +                              errmsg("must be superuser to execute ALTER SYSTEM command")));
> >
> > See comments about this above.
> >
> >
> > > +     /* Alter WAL prohibit state not allowed during recovery */
> > > +     PreventCommandDuringRecovery("ALTER SYSTEM");
> > > +
> > > +     /* Requested state */
> > > +     state = stmt->WALProhibited ?
> > > +             WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE;
> > > +
> > > +     /*
> > > +      * Since we yet to convey this WAL prohibit state to all backend mark it
> > > +      * in-progress.
> > > +      */
> > > +     state |= WALPROHIBIT_TRANSITION_IN_PROGRESS;
> > > +
> > > +     if (!SetWALProhibitState(state))
> > > +             return;                                 /* server is already in the desired state */
> > > +
> >
> > This use of bitmasks seems unnecessary to me. I'd rather have one param
> > for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one
> > for WALPROHIBIT_TRANSITION_IN_PROGRESS
> >
>
> Ok.
>
> How about the new version of  SetWALProhibitState function as :
> SetWALProhibitState(bool wal_prohibited, bool is_final_state)  ?
>

I have added the same.

> >
> >
> > > +/*
> > > + * RequestWALProhibitChange()
> > > + *
> > > + * Request checkpointer to make the WALProhibitState to read-only.
> > > + */
> > > +static void
> > > +RequestWALProhibitChange(void)
> > > +{
> > > +     /* Must not be called from checkpointer */
> > > +     Assert(!AmCheckpointerProcess());
> > > +     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > > +
> > > +     /*
> > > +      * If in a standalone backend, just do it ourselves.
> > > +      */
> > > +     if (!IsPostmasterEnvironment)
> > > +     {
> > > +             CompleteWALProhibitChange(GetWALProhibitState());
> > > +             return;
> > > +     }
> > > +
> > > +     send_signal_to_checkpointer(SIGINT);
> > > +
> > > +     /* Wait for the state to change to read-only */
> > > +     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
> > > +     for (;;)
> > > +     {
> > > +             /* We'll be done once in-progress flag bit is cleared */
> > > +             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > +                     break;
> > > +
> > > +             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
> > > +                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
> > > +     }
> > > +     ConditionVariableCancelSleep();
> >
> > What if somebody concurrently changes the state back to READ WRITE?
> > Won't we unnecessarily wait here?
> >
>
> Yes, there will be wait.
>
> > That's probably fine, because we would just wait until that transition
> > is complete too. But at least a comment about that would be
> > good. Alternatively a "ASRO transitions completed counter" or such might
> > be a better idea?
> >
>
> Ok, will add comments but could you please elaborate little a bit about "ASRO
> transitions completed counter"  and is there any existing counter I can refer
> to?
>
> >
> > > +/*
> > > + * CompleteWALProhibitChange()
> > > + *
> > > + * Checkpointer will call this to complete the requested WAL prohibit state
> > > + * transition.
> > > + */
> > > +void
> > > +CompleteWALProhibitChange(uint32 wal_state)
> > > +{
> > > +     uint64          barrierGeneration;
> > > +
> > > +     /*
> > > +      * Must be called from checkpointer. Otherwise, it must be single-user
> > > +      * backend.
> > > +      */
> > > +     Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment);
> > > +     Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > > +
> > > +     /*
> > > +      * WAL prohibit state change is initiated. We need to complete the state
> > > +      * transition by setting requested WAL prohibit state in all backends.
> > > +      */
> > > +     elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state");
> > > +
> > > +     /* Emit global barrier */
> > > +     barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT);
> > > +     WaitForProcSignalBarrier(barrierGeneration);
> > > +
> > > +     /* And flush all writes. */
> > > +     XLogFlush(GetXLogWriteRecPtr());
> >
> > Hm, maybe I'm missing something, but why is the write pointer the right
> > thing to flush? That won't include records that haven't been written to
> > disk yet... We also need to trigger writing out all WAL that is as of
> > yet unwritten, no?  Without having thought a lot about it, it seems that
> > GetXLogInsertRecPtr() would be the right thing to flush?
> >
>
> TBH, I am not an expert in this area.  I wants to flush the latest record
> pointer that needs to be flushed, I think  GetXLogInsertRecPtr() would be fine
> if is the latest one. Note that wal flushes are not blocked in read-only mode.
>

Used GetXLogInsertRecPtr().

> >
> > > +     /* Set final state by clearing in-progress flag bit */
> > > +     if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > > +     {
> > > +             bool            wal_prohibited;
> > > +
> > > +             wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0;
> > > +
> > > +             /* Update the control file to make state persistent */
> > > +             SetControlFileWALProhibitFlag(wal_prohibited);
> >
> > Hm. Is there an issue with not WAL logging the control file change? Is
> > there a scenario where we a crash + recovery would end up overwriting
> > this?
> >
>
> I am not sure. If the system crash before update this that means we haven't
> acknowledged the system state change. And the server will be restarted with the
> previous state.
>
> Could you please explain what bothering you.
>
> >
> > > +             if (wal_prohibited)
> > > +                     ereport(LOG, (errmsg("system is now read only")));
> > > +             else
> > > +             {
> > > +                     /*
> > > +                      * Request checkpoint if the end-of-recovery checkpoint has been
> > > +                      * skipped previously.
> > > +                      */
> > > +                     if (WALProhibitState->checkpointPending)
> > > +                     {
> > > +                             RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
> > > +                                                               CHECKPOINT_IMMEDIATE);
> > > +                             WALProhibitState->checkpointPending = false;
> > > +                     }
> > > +                     ereport(LOG, (errmsg("system is now read write")));
> > > +             }
> > > +     }
> > > +
> > > +     /* Wake up the backend who requested the state change */
> > > +     ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv);
> >
> > Could be multiple backends, right?
> >
>
> Yes, you are correct, will fix that.
>
> >
> > > +}
> > > +
> > > +/*
> > > + * GetWALProhibitState()
> > > + *
> > > + * Atomically return the current server WAL prohibited state
> > > + */
> > > +uint32
> > > +GetWALProhibitState(void)
> > > +{
> > > +     return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState);
> > > +}
> >
> > Is there an issue with needing memory barriers here?
> >
> >
> > > +/*
> > > + * SetWALProhibitState()
> > > + *
> > > + * Change current WAL prohibit state to the input state.
> > > + *
> > > + * If the server is already completely moved to the requested WAL prohibit
> > > + * state, or if the desired state is same as the current state, return false,
> > > + * indicating that the server state did not change. Else return true.
> > > + */
> > > +bool
> > > +SetWALProhibitState(uint32 new_state)
> > > +{
> > > +     bool            state_updated = false;
> > > +     uint32          cur_state;
> > > +
> > > +     cur_state = GetWALProhibitState();
> > > +
> > > +     /* Server is already in requested state */
> > > +     if (new_state == cur_state ||
> > > +             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > +             return false;
> > > +
> > > +     /* Prevent concurrent contrary in progress transition state setting */
> > > +     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
> > > +             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > +     {
> > > +             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
> > > +                     ereport(ERROR,
> > > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > +                                      errmsg("system state transition to read only is already in progress"),
> > > +                                      errhint("Try after sometime again.")));
> > > +             else
> > > +                     ereport(ERROR,
> > > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > +                                      errmsg("system state transition to read write is already in progress"),
> > > +                                      errhint("Try after sometime again.")));
> > > +     }
> > > +
> > > +     /* Update new state in share memory */
> > > +     state_updated =
> > > +             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
> > > +                                                                        &cur_state, new_state);
> > > +
> > > +     if (!state_updated)
> > > +             ereport(ERROR,
> > > +                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > +                              errmsg("system read write state concurrently changed"),
> > > +                              errhint("Try after sometime again.")));
> > > +
> >
> > I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
> > of a loop. I think there's platforms (basically all load-linked /
> > store-conditional architectures) where than can fail spuriously.
> >
> > Also, there's no memory barrier around GetWALProhibitState, so there's
> > no guarantee it's not an out-of-date value you're starting with.
> >
>
> How about having some kind of lock instead what Robert have suggested
> previously[3] ?
>

I would like to discuss this point more. In the attached version I have added
WALProhibitLock to protect shared walprohibit state updates.  I was a little
unsure do we want another spinlock what XLogCtlData has which is mostly used to
read the shared variable and for the update, both are used e.g. LogwrtResult.

Right now I haven't added and shared walprohibit state was fetch using a
volatile pointer. Do we need a spinlock there, I am not sure why? Thoughts?

> >
> > > +/
> > > + * MarkCheckPointSkippedInWalProhibitState()
> > > + *
> > > + * Sets checkpoint pending flag so that it can be performed next time while
> > > + * changing system state to WAL permitted.
> > > + */
> > > +void
> > > +MarkCheckPointSkippedInWalProhibitState(void)
> > > +{
> > > +     WALProhibitState->checkpointPending = true;
> > > +}
> >
> > I don't *at all* like this living outside of xlog.c. I think this should
> > be moved there, and merged with deferring checkpoints in other cases
> > (promotions, not immediately performing a checkpoint after recovery).
>
> Here we want to perform the checkpoint sometime quite later when the
> system state changes to read-write. For that, I think we need some flag
> if we want this in xlog.c then we can have that flag in XLogCtl.
>

Right now I have added a new variable to XLogCtlData and moved this code to
xlog.c.

>
> > There's state in ControlFile *and* here for essentially the same thing.
> >
>
> I am sorry to trouble you much, but I haven't understood this too.
>
> >
> >
> > > +      * If it is not currently possible to insert write-ahead log records,
> > > +      * either because we are still in recovery or because ALTER SYSTEM READ
> > > +      * ONLY has been executed, force this to be a read-only transaction.
> > > +      * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> > > +      * us from modifying data during recovery when !XLogInsertAllowed(), but
> > > +      * this gives the normal indication to the user that the transaction is
> > > +      * read-only.
> > > +      *
> > > +      * On the other hand, we only need to set the startedInRecovery flag when
> > > +      * the transaction started during recovery, and not when WAL is otherwise
> > > +      * prohibited. This information is used by RelationGetIndexScan() to
> > > +      * decide whether to permit (1) relying on existing killed-tuple markings
> > > +      * and (2) further killing of index tuples. Even when WAL is prohibited
> > > +      * on the master, it's still the master, so the former is OK; and since
> > > +      * killing index tuples doesn't generate WAL, the latter is also OK.
> > > +      * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
> > > +      */
> > > +     XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
> > > +     s->startedInRecovery = RecoveryInProgress();
> >
> > It's somewhat ugly that we call RecoveryInProgress() once in
> > XLogInsertAllowed() and then again directly here... It's probably fine
> > runtime cost wise, but...
> >
> >
> > >  /*
> > >   * Subroutine to try to fetch and validate a prior checkpoint record.
> > >   *
> > > @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
> > >        */
> > >       WalSndWaitStopping();
> > >
> > > +     /*
> > > +      * The restartpoint, checkpoint, or xlog rotation will be performed if the
> > > +      * WAL writing is permitted.
> > > +      */
> > >       if (RecoveryInProgress())
> > >               CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> > > -     else
> > > +     else if (XLogInsertAllowed())
> >
> > Not sure I like going via XLogInsertAllowed(), that seems like a
> > confusing indirection here. And it encompasses things we atually don't
> > want to check for - it's fragile to also look at LocalXLogInsertAllowed
> > here imo.
> >
> >
> > >       ShutdownCLOG();
> > >       ShutdownCommitTs();
> > >       ShutdownSUBTRANS();
> > > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > > index 1b8cd7bacd4..aa4cdd57ec1 100644
> > > --- a/src/backend/postmaster/autovacuum.c
> > > +++ b/src/backend/postmaster/autovacuum.c
> > > @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
> > >
> > >               HandleAutoVacLauncherInterrupts();
> > >
> > > +             /* If the server is read only just go back to sleep. */
> > > +             if (!XLogInsertAllowed())
> > > +                     continue;
> > > +
> >
> > I think we really should have a different functions for places like
> > this. We don't want to generally hide bugs like e.g. starting the
> > autovac launcher in recovery, but this would.
> >
>
> So, we need a separate function like XLogInsertAllowed() and a global variable
> like LocalXLogInsertAllowed for the caching wal prohibit state.
>
> >
> > > @@ -342,6 +344,28 @@ CheckpointerMain(void)
> > >               AbsorbSyncRequests();
> > >               HandleCheckpointerInterrupts();
> > >
> > > +             wal_state = GetWALProhibitState();
> > > +
> > > +             if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
> > > +             {
> > > +                     /* Complete WAL prohibit state change request */
> > > +                     CompleteWALProhibitChange(wal_state);
> > > +                     continue;
> > > +             }
> > > +             else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
> > > +             {
> > > +                     /*
> > > +                      * Don't do anything until someone wakes us up.  For example a
> > > +                      * backend might later on request us to put the system back to
> > > +                      * read-write wal prohibit sate.
> > > +                      */
> > > +                     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> > > +                                                      WAIT_EVENT_CHECKPOINTER_MAIN);
> > > +                     continue;
> > > +             }
> > > +             Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
> > > +
> > >               /*
> > >                * Detect a pending checkpoint request by checking whether the flags
> > >                * word in shared memory is nonzero.  We shouldn't need to acquire the
> > > @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
> > >
> > >       return FirstCall;
> > >  }
> >
> > So, if we're in the middle of a paced checkpoint with a large
> > checkpoint_timeout - a sensible real world configuration - we'll not
> > process ASRO until that checkpoint is over?  That seems very much not
> > practical. What am I missing?
> >
>
> Yes, the process doing ASRO will wait until that checkpoint is over.
>
> >
> > > +/*
> > > + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
> > > + */
> > > +void
> > > +send_signal_to_checkpointer(int signum)
> > > +{
> > > +     if (CheckpointerShmem->checkpointer_pid == 0)
> > > +             elog(ERROR, "checkpointer is not running");
> > > +
> > > +     if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
> > > +             elog(ERROR, "could not signal checkpointer: %m");
> > > +}
> >
> > Sudden switch to a different naming style...
> >
>
> My bad, sorry, will fix that.
>
> 1] http://postgr.es/m/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de
> 2] http://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
> 3] http://postgr.es/m/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com


Thank you !

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Tue, Sep 8, 2020 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> This pattern seems like it'll get unwieldy with more than one barrier
> type. And won't flag "unhandled" barrier types either (already the case,
> I know). We could go for something like:
>
>     while (flags != 0)
>     {
>         barrier_bit = pg_rightmost_one_pos32(flags);
>         barrier_type = 1 >> barrier_bit;
>
>         switch (barrier_type)
>         {
>                 case PROCSIGNAL_BARRIER_PLACEHOLDER:
>                     processed = ProcessBarrierPlaceholder();
>         }
>
>         if (processed)
>             BARRIER_CLEAR_BIT(flags, barrier_type);
>     }
>
> But perhaps that's too complicated?

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.

> For this to be correct, wouldn't flags need to be volatile? Otherwise
> this might use a register value for flags, which might not contain the
> correct value at this point.

I think you're right.

> Perhaps a comment explaining why we have to clear bits first would be
> good?

Probably a good idea.

[ snipping assorted comments with which I agree ]

> It might be good to add a warning to WaitForProcSignalBarrier() or by
> pss_barrierCheckMask indicating that it's *not* OK to look at
> pss_barrierCheckMask when checking whether barriers have been processed.

Not sure I understand this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
 On Tue, Sep 15, 2020 at 2:35 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Hi Andres,
>
> The attached patch has fixed the issue that you have raised & I have confirmed
> in my previous email.  Also, I tried to improve some of the things that you have
> pointed but for those changes, I am a little unsure and looking forward to the
> inputs/suggestions/confirmation on that, therefore 0003 patch is marked WIP.
>
> Please have a look at my inline reply below for the things that are changes in
> the attached version and need inputs:
>
> On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote:
> > >
[... Skipped ....]
> > >
> > >
> > > > +/*
> > > > + * RequestWALProhibitChange()
> > > > + *
> > > > + * Request checkpointer to make the WALProhibitState to read-only.
> > > > + */
> > > > +static void
> > > > +RequestWALProhibitChange(void)
> > > > +{
> > > > +     /* Must not be called from checkpointer */
> > > > +     Assert(!AmCheckpointerProcess());
> > > > +     Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > > > +
> > > > +     /*
> > > > +      * If in a standalone backend, just do it ourselves.
> > > > +      */
> > > > +     if (!IsPostmasterEnvironment)
> > > > +     {
> > > > +             CompleteWALProhibitChange(GetWALProhibitState());
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     send_signal_to_checkpointer(SIGINT);
> > > > +
> > > > +     /* Wait for the state to change to read-only */
> > > > +     ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv);
> > > > +     for (;;)
> > > > +     {
> > > > +             /* We'll be done once in-progress flag bit is cleared */
> > > > +             if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > > +                     break;
> > > > +
> > > > +             ConditionVariableSleep(&WALProhibitState->walprohibit_cv,
> > > > +                                                        WAIT_EVENT_WALPROHIBIT_STATE_CHANGE);
> > > > +     }
> > > > +     ConditionVariableCancelSleep();
> > >
> > > What if somebody concurrently changes the state back to READ WRITE?
> > > Won't we unnecessarily wait here?
> > >
> >
> > Yes, there will be wait.
> >
> > > That's probably fine, because we would just wait until that transition
> > > is complete too. But at least a comment about that would be
> > > good. Alternatively a "ASRO transitions completed counter" or such might
> > > be a better idea?
> > >
> >
> > Ok, will add comments but could you please elaborate little a bit about "ASRO
> > transitions completed counter"  and is there any existing counter I can refer
> > to?
> >

In an off-list discussion, Robert had explained to me this counter thing and
its requirement.

I tried to add the same as "shared WAL prohibited state generation" in the
attached version. The implementation is quite similar to the generation counter
in the super barrier. In the attached version, when a backend makes a request
for the WAL prohibit state changes then a generation number will be given to
that backend to wait on and that wait will be ended when the shared generation
counter changes.

> > >
[... Skipped ....]
> > > > +/*
> > > > + * SetWALProhibitState()
> > > > + *
> > > > + * Change current WAL prohibit state to the input state.
> > > > + *
> > > > + * If the server is already completely moved to the requested WAL prohibit
> > > > + * state, or if the desired state is same as the current state, return false,
> > > > + * indicating that the server state did not change. Else return true.
> > > > + */
> > > > +bool
> > > > +SetWALProhibitState(uint32 new_state)
> > > > +{
> > > > +     bool            state_updated = false;
> > > > +     uint32          cur_state;
> > > > +
> > > > +     cur_state = GetWALProhibitState();
> > > > +
> > > > +     /* Server is already in requested state */
> > > > +     if (new_state == cur_state ||
> > > > +             new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > > +             return false;
> > > > +
> > > > +     /* Prevent concurrent contrary in progress transition state setting */
> > > > +     if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
> > > > +             (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > > > +     {
> > > > +             if (cur_state & WALPROHIBIT_STATE_READ_ONLY)
> > > > +                     ereport(ERROR,
> > > > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > > +                                      errmsg("system state transition to read only is already in progress"),
> > > > +                                      errhint("Try after sometime again.")));
> > > > +             else
> > > > +                     ereport(ERROR,
> > > > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > > +                                      errmsg("system state transition to read write is already in progress"),
> > > > +                                      errhint("Try after sometime again.")));
> > > > +     }
> > > > +
> > > > +     /* Update new state in share memory */
> > > > +     state_updated =
> > > > +             pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState,
> > > > +                                                                        &cur_state, new_state);
> > > > +
> > > > +     if (!state_updated)
> > > > +             ereport(ERROR,
> > > > +                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > > +                              errmsg("system read write state concurrently changed"),
> > > > +                              errhint("Try after sometime again.")));
> > > > +
> > >
> > > I don't think it's safe to use pg_atomic_compare_exchange_u32() outside
> > > of a loop. I think there's platforms (basically all load-linked /
> > > store-conditional architectures) where than can fail spuriously.
> > >
> > > Also, there's no memory barrier around GetWALProhibitState, so there's
> > > no guarantee it's not an out-of-date value you're starting with.
> > >
> >
> > How about having some kind of lock instead what Robert have suggested
> > previously[3] ?
> >
>
> I would like to discuss this point more. In the attached version I have added
> WALProhibitLock to protect shared walprohibit state updates.  I was a little
> unsure do we want another spinlock what XLogCtlData has which is mostly used to
> read the shared variable and for the update, both are used e.g. LogwrtResult.
>
> Right now I haven't added and shared walprohibit state was fetch using a
> volatile pointer. Do we need a spinlock there, I am not sure why? Thoughts?
>

I reverted this WALProhibitLock implementation since with changes in the
attached version I don't think we need that locking.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't mind a loop, but that one looks broken. We have to clear the
> bit before we call the function that processes that type of barrier.
> Otherwise, if we succeed in absorbing the barrier but a new instance
> of the same barrier arrives meanwhile, we'll fail to realize that we
> need to absorb the new one.

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to just refuse the barrier until after the
backend exits the critical section of code will work out better.

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't mind a loop, but that one looks broken. We have to clear the
> > bit before we call the function that processes that type of barrier.
> > Otherwise, if we succeed in absorbing the barrier but a new instance
> > of the same barrier arrives meanwhile, we'll fail to realize that we
> > need to absorb the new one.
>
> Here's a new version of the patch for allowing errors in
> barrier-handling functions and/or rejection of barriers by those
> functions. I think this responds to all of the previous review
> comments from Andres. Also, here is an 0002 which is a handy bit of
> test code that I wrote. It's not for commit, but it is useful for
> finding bugs.
>
> In addition to improving 0001 based on the review comments, I also
> tried to write a better commit message for it, but it might still be
> possible to do better there. It's a bit hard to explain the idea in
> the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
> with an XID -- and possibly a bunch of sub-XIDs, and possibly while
> idle-in-transaction -- can elect to FATAL rather than absorbing the
> barrier. I suspect for other barrier types we might have certain
> (hopefully short) stretches of code where a barrier of a particular
> type can't be absorbed because we're in the middle of doing something
> that relies on the previous value of whatever state is protected by
> the barrier. Holding off interrupts in those stretches of code would
> prevent the barrier from being absorbed, but would also prevent query
> cancel, backend termination, and absorption of other barrier types, so
> it seems possible that just allowing the barrier-absorption function
> for a barrier of that type to just refuse the barrier until after the
> backend exits the critical section of code will work out better.
>
> Just for kicks, I tried running 'make installcheck-parallel' while
> emitting placeholder barriers every 0.05 s after altering the
> barrier-absorption function to always return false, just to see how
> ugly that was. In round figures, it made it take 24 s vs. 21 s, so
> it's actually not that bad. However, it all depends on how many times
> you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
> that the effect might be very non-uniform. That is, if you can get the
> code to be running a tight loop that does little real work but does
> CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
> barrier, it will probably suck. Therefore, I'm inclined to think that
> the fairly strong cautionary logic in the patch is reasonable, but
> perhaps it can be better worded somehow. Thoughts welcome.
>
> I have not rebased the remainder of the patch series over these two.
>
That I'll do.

On a quick look at the latest 0001 patch, the following hunk to reset leftover
flags seems to be unnecessary:

+ /*
+ * If some barrier types were not successfully absorbed, we will have
+ * to try again later.
+ */
+ if (!success)
+ {
+ ResetProcSignalBarrierBits(flags);
+ return;
+ }

When the ProcessBarrierPlaceholder() function returns false without an error,
that barrier flag gets reset within the while loop.  The case when it has an
error, the rest of the flags get reset in the catch block.  Correct me if I am
missing something here.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Oct 8, 2020 at 3:52 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I don't mind a loop, but that one looks broken. We have to clear the
> > > bit before we call the function that processes that type of barrier.
> > > Otherwise, if we succeed in absorbing the barrier but a new instance
> > > of the same barrier arrives meanwhile, we'll fail to realize that we
> > > need to absorb the new one.
> >
> > Here's a new version of the patch for allowing errors in
> > barrier-handling functions and/or rejection of barriers by those
> > functions. I think this responds to all of the previous review
> > comments from Andres. Also, here is an 0002 which is a handy bit of
> > test code that I wrote. It's not for commit, but it is useful for
> > finding bugs.
> >
> > In addition to improving 0001 based on the review comments, I also
> > tried to write a better commit message for it, but it might still be
> > possible to do better there. It's a bit hard to explain the idea in
> > the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
> > with an XID -- and possibly a bunch of sub-XIDs, and possibly while
> > idle-in-transaction -- can elect to FATAL rather than absorbing the
> > barrier. I suspect for other barrier types we might have certain
> > (hopefully short) stretches of code where a barrier of a particular
> > type can't be absorbed because we're in the middle of doing something
> > that relies on the previous value of whatever state is protected by
> > the barrier. Holding off interrupts in those stretches of code would
> > prevent the barrier from being absorbed, but would also prevent query
> > cancel, backend termination, and absorption of other barrier types, so
> > it seems possible that just allowing the barrier-absorption function
> > for a barrier of that type to just refuse the barrier until after the
> > backend exits the critical section of code will work out better.
> >
> > Just for kicks, I tried running 'make installcheck-parallel' while
> > emitting placeholder barriers every 0.05 s after altering the
> > barrier-absorption function to always return false, just to see how
> > ugly that was. In round figures, it made it take 24 s vs. 21 s, so
> > it's actually not that bad. However, it all depends on how many times
> > you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
> > that the effect might be very non-uniform. That is, if you can get the
> > code to be running a tight loop that does little real work but does
> > CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
> > barrier, it will probably suck. Therefore, I'm inclined to think that
> > the fairly strong cautionary logic in the patch is reasonable, but
> > perhaps it can be better worded somehow. Thoughts welcome.
> >
> > I have not rebased the remainder of the patch series over these two.
> >
> That I'll do.
>

Attaching a rebased version includes Robert's patches for the latest master
head.

> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop.  The case when it has an
> error, the rest of the flags get reset in the catch block.  Correct me if I am
> missing something here.
>

Robert, could you please confirm this?

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop.  The case when it has an
> error, the rest of the flags get reset in the catch block.  Correct me if I am
> missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Andres, do you like the new loop better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:


On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop.  The case when it has an
> error, the rest of the flags get reset in the catch block.  Correct me if I am
> missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Sure, Ill update that. Thanks for the confirmation.


Andres, do you like the new loop better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Nov 20, 2020 at 11:13 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
>> > On a quick look at the latest 0001 patch, the following hunk to reset leftover
>> > flags seems to be unnecessary:
>> >
>> > + /*
>> > + * If some barrier types were not successfully absorbed, we will have
>> > + * to try again later.
>> > + */
>> > + if (!success)
>> > + {
>> > + ResetProcSignalBarrierBits(flags);
>> > + return;
>> > + }
>> >
>> > When the ProcessBarrierPlaceholder() function returns false without an error,
>> > that barrier flag gets reset within the while loop.  The case when it has an
>> > error, the rest of the flags get reset in the catch block.  Correct me if I am
>> > missing something here.
>>
>> Good catch. I think you're right. Do you want to update accordingly?
>
>
> Sure, Ill update that. Thanks for the confirmation.
>

Attached is the updated version where unnecessary ResetProcSignalBarrierBits()
call in 0001 patch is removed. The rest of the patches are unchanged, thanks.

>>
>> Andres, do you like the new loop better?
>>

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Sat, Sep 12, 2020 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> > So, if we're in the middle of a paced checkpoint with a large
> > checkpoint_timeout - a sensible real world configuration - we'll not
> > process ASRO until that checkpoint is over?  That seems very much not
> > practical. What am I missing?
>
> Yes, the process doing ASRO will wait until that checkpoint is over.

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes. We want this feature to
respond within milliseconds or a few seconds, not minutes. So we need
something better here. I'm inclined to think that we should try to
CompleteWALProhibitChange() at the same places we
AbsorbSyncRequests(). We know from experience that bad things happen
if we fail to absorb sync requests in a timely fashion, so we probably
have enough calls to AbsorbSyncRequests() to make sure that we always
do that work in a timely fashion. So, if we do this work in the same
place, then it will also be done in a timely fashion.

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
On 2020-11-20 11:23:44 -0500, Robert Haas wrote:
> Andres, do you like the new loop better?

I do!



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> That's not good. On a typical busy system, a system is going to be in
> the middle of a checkpoint most of the time, and the checkpoint will
> take a long time to finish - maybe minutes.

Or hours, even. Due to the cost of FPWs it can make a lot of sense to
reduce the frequency of that cost...


> We want this feature to respond within milliseconds or a few seconds,
> not minutes. So we need something better here.

Indeed.


> I'm inclined to think
> that we should try to CompleteWALProhibitChange() at the same places
> we AbsorbSyncRequests(). We know from experience that bad things
> happen if we fail to absorb sync requests in a timely fashion, so we
> probably have enough calls to AbsorbSyncRequests() to make sure that
> we always do that work in a timely fashion. So, if we do this work in
> the same place, then it will also be done in a timely fashion.

Sounds sane, without having looked in detail.


> I'm not 100% sure whether that introduces any other problems.
> Certainly, we're not going to be able to finish the checkpoint once
> we've gone read-only, so we'll fail when we try to write the WAL
> record for that, or maybe earlier if there's anything else that tries
> to write WAL. Either the checkpoint needs to error out, like any other
> attempt to write WAL, and we can attempt a new checkpoint if and when
> we go read/write, or else we need to finish writing stuff out to disk
> but not actually write the checkpoint completion record (or any other
> WAL) unless and until the system goes back into read/write mode - and
> then at that point the previously-started checkpoint will finish
> normally. The latter seems better if we can make it work, but the
> former is probably also acceptable. What you've got right now is not.

I mostly wonder which of those two has which implications for how many
FPWs we need to redo. Presumably stalling but not cancelling the current
checkpoint is better?

Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > That's not good. On a typical busy system, a system is going to be in
> > the middle of a checkpoint most of the time, and the checkpoint will
> > take a long time to finish - maybe minutes.
>
> Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> reduce the frequency of that cost...
>
>
> > We want this feature to respond within milliseconds or a few seconds,
> > not minutes. So we need something better here.
>
> Indeed.
>
>
> > I'm inclined to think
> > that we should try to CompleteWALProhibitChange() at the same places
> > we AbsorbSyncRequests(). We know from experience that bad things
> > happen if we fail to absorb sync requests in a timely fashion, so we
> > probably have enough calls to AbsorbSyncRequests() to make sure that
> > we always do that work in a timely fashion. So, if we do this work in
> > the same place, then it will also be done in a timely fashion.
>
> Sounds sane, without having looked in detail.
>

Understood & agreed that we need to change the system state as soon as possible.

I can see AbsorbSyncRequests() is called from 4 routing as
CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
CheckpointerMain().  Out of 4, the first three executes with an interrupt is on
hod which will cause a problem when we do emit barrier and wait for those
barriers absorption by all the process including itself and will cause an
infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(),
do not wait on self to absorb barrier.  Let that get absorbed at a later point
in time when the interrupt is resumed.  I assumed that we cannot do barrier
processing right away since there could be other barriers (maybe in the future)
including ours that should not process while the interrupt is on hold.

>
> > I'm not 100% sure whether that introduces any other problems.
> > Certainly, we're not going to be able to finish the checkpoint once
> > we've gone read-only, so we'll fail when we try to write the WAL
> > record for that, or maybe earlier if there's anything else that tries
> > to write WAL. Either the checkpoint needs to error out, like any other
> > attempt to write WAL, and we can attempt a new checkpoint if and when
> > we go read/write, or else we need to finish writing stuff out to disk
> > but not actually write the checkpoint completion record (or any other
> > WAL) unless and until the system goes back into read/write mode - and
> > then at that point the previously-started checkpoint will finish
> > normally. The latter seems better if we can make it work, but the
> > former is probably also acceptable. What you've got right now is not.
>
> I mostly wonder which of those two has which implications for how many
> FPWs we need to redo. Presumably stalling but not cancelling the current
> checkpoint is better?
>

Also, I like to uphold this idea of stalling a checkpointer's work in the middle
instead of canceling it. But here, we need to take care of shutdown requests and
death of postmaster cases that can cancel this stalling.  If that happens we
need to make sure that no unwanted wal insertion happens afterward and for that
LocalXLogInsertAllowed flag needs to be updated correctly since the wal
prohibits barrier processing was skipped for the checkpointer since it emits
that barrier as mentioned above.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > That's not good. On a typical busy system, a system is going to be in
> > > the middle of a checkpoint most of the time, and the checkpoint will
> > > take a long time to finish - maybe minutes.
> >
> > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > reduce the frequency of that cost...
> >
> >
> > > We want this feature to respond within milliseconds or a few seconds,
> > > not minutes. So we need something better here.
> >
> > Indeed.
> >
> >
> > > I'm inclined to think
> > > that we should try to CompleteWALProhibitChange() at the same places
> > > we AbsorbSyncRequests(). We know from experience that bad things
> > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > we always do that work in a timely fashion. So, if we do this work in
> > > the same place, then it will also be done in a timely fashion.
> >
> > Sounds sane, without having looked in detail.
> >
>
> Understood & agreed that we need to change the system state as soon as possible.
>
> I can see AbsorbSyncRequests() is called from 4 routing as
> CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> CheckpointerMain().  Out of 4, the first three executes with an interrupt is on
> hod which will cause a problem when we do emit barrier and wait for those
> barriers absorption by all the process including itself and will cause an
> infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(),
> do not wait on self to absorb barrier.  Let that get absorbed at a later point
> in time when the interrupt is resumed.  I assumed that we cannot do barrier
> processing right away since there could be other barriers (maybe in the future)
> including ours that should not process while the interrupt is on hold.
>

CreateCheckPoint() holds CheckpointLock LW at start and releases at the end
which puts interrupt on hold.  This kinda surprising that we were holding this
lock and putting interrupt on hots for a long time.  We do need that
CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do
something easy to ensure that instead of the lock? Probably holding off
interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions?

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Mon, Dec 14, 2020 at 8:03 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote:
> > > > That's not good. On a typical busy system, a system is going to be in
> > > > the middle of a checkpoint most of the time, and the checkpoint will
> > > > take a long time to finish - maybe minutes.
> > >
> > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to
> > > reduce the frequency of that cost...
> > >
> > >
> > > > We want this feature to respond within milliseconds or a few seconds,
> > > > not minutes. So we need something better here.
> > >
> > > Indeed.
> > >
> > >
> > > > I'm inclined to think
> > > > that we should try to CompleteWALProhibitChange() at the same places
> > > > we AbsorbSyncRequests(). We know from experience that bad things
> > > > happen if we fail to absorb sync requests in a timely fashion, so we
> > > > probably have enough calls to AbsorbSyncRequests() to make sure that
> > > > we always do that work in a timely fashion. So, if we do this work in
> > > > the same place, then it will also be done in a timely fashion.
> > >
> > > Sounds sane, without having looked in detail.
> > >
> >
> > Understood & agreed that we need to change the system state as soon as possible.
> >
> > I can see AbsorbSyncRequests() is called from 4 routing as
> > CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and
> > CheckpointerMain().  Out of 4, the first three executes with an interrupt is on
> > hod which will cause a problem when we do emit barrier and wait for those
> > barriers absorption by all the process including itself and will cause an
> > infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(),
> > do not wait on self to absorb barrier.  Let that get absorbed at a later point
> > in time when the interrupt is resumed.  I assumed that we cannot do barrier
> > processing right away since there could be other barriers (maybe in the future)
> > including ours that should not process while the interrupt is on hold.
> >
>
> CreateCheckPoint() holds CheckpointLock LW at start and releases at the end
> which puts interrupt on hold.  This kinda surprising that we were holding this
> lock and putting interrupt on hots for a long time.  We do need that
> CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do
> something easy to ensure that instead of the lock? Probably holding off
> interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions?
>

To move development, testing, and the review forward, I have commented out the
code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
added the changes for the checkpointer so that system read-write state change
request can be processed as soon as possible, as suggested by Robert[1].

I have started a new thread[2] to understand the need for the CheckpointLock in
CreateCheckPoint() function. Until then we can continue work on this feature by
skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is
marked WIP.

1] http://postgr.es/m/CA+TgmoYexwDQjdd1=15KMz+7VfHVx8VHNL2qjRRK92P=CSZDxg@mail.gmail.com
2] http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:
> To move development, testing, and the review forward, I have commented out the
> code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> added the changes for the checkpointer so that system read-write state change
> request can be processed as soon as possible, as suggested by Robert[1].
>
> I have started a new thread[2] to understand the need for the CheckpointLock in
> CreateCheckPoint() function. Until then we can continue work on this feature by
> skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is
> marked WIP.

Based on the favorable review comment from Andres upthread and also
your feedback, I committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:
> To move development, testing, and the review forward, I have commented out the
> code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> added the changes for the checkpointer so that system read-write state change
> request can be processed as soon as possible, as suggested by Robert[1].

I am extremely doubtful about SetWALProhibitState()'s claim that "The
final state can only be requested by the checkpointer or by the
single-user so that there will be no chance that the server is already
in the desired final state." It seems like there is an obvious race
condition: CompleteWALProhibitChange() is called with a cur_state_gen
argument which embeds the last state we saw, but there's nothing to
keep it from changing between the time we saw it and the time that
function calls SetWALProhibitState(), is there? We aren't holding any
lock. It seems to me that SetWALProhibitState() needs to be rewritten
to avoid this assumption.

On a related note, SetWALProhibitState() has only two callers. One
passes is_final_state as true, and the other as false: it's never a
variable. The two cases are handled mostly differently. This doesn't
seem good. A lot of the logic in this function should probably be
moved to the calling sites, especially because it's almost certainly
wrong for this function to be basing what it does on the *current* WAL
prohibit state rather than the WAL prohibit state that was in effect
at the time we made the decision to call this function in the first
place. As I mentioned in the previous paragraph, that's a built-in
race condition. To put that another way, this function should NOT feel
free to call GetWALProhibitStateGen().

I don't really see why we should have both an SQL callable function
pg_alter_wal_prohibit_state() and also a DDL command for this. If
we're going to go with a functional interface, and I guess the idea of
that is to make it so GRANT EXECUTE works, then why not just get rid
of the DDL?

RequestWALProhibitChange() doesn't look very nice. It seems like it's
basically the second half of pg_alter_wal_prohibit_state(), not being
called from anywhere else. It doesn't seem to add anything to separate
it out like this; the interface between the two is not especially
clean.

It seems odd that ProcessWALProhibitStateChangeRequest() returns
without doing anything if !AmCheckpointerProcess(), rather than having
that be an Assert(). Why is it like that?

I think WALProhibitStateShmemInit() would probably look more similar
to other functions if it did if (found) { stuff; } rather than if
(!found) return; stuff; -- but I might be wrong about the existing
precedent.

The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
seems confusingly-named, because we have other reasons for skipping a
checkpoint that are not what we're talking about here. I think this is
talking about whether we've performed a checkpoint after recovery, and
the naming should reflect that. But I think there's something else
wrong with the design, too: why is this protected by a spinlock? I
have questions in both directions. On the one hand, I wonder why we
need any kind of lock at all. On the other hand, if we do need a lock,
I wonder why a spinlock that protects only the setting and clearing of
the flag and nothing else is sufficient. There are zero comments
explaining what the idea behind this locking regime is, and I can't
understand why it should be correct.

In fact, I think this area needs a broader rethink. Like, the way you
integrated that stuff into StartupXLog(), it sure looks to me like we
might skip the checkpoint but still try to write other WAL records.
Before we reach the offending segment of code, we call
UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
Both of those are going to potentially write WAL. I guess you could
argue that's OK, on the grounds that neither function is necessarily
going to log anything, but I don't think I believe that. If I make my
server read only, take the OS down, change some GUCs, and then start
it again, I don't expect it to PANIC.

Also, I doubt that it's OK to skip the checkpoint as this code does
and then go ahead and execute recovery_end_command and update the
control file anyway. It sure looks like the existing code is written
with the assumption that the checkpoint happens before those other
things. One idea I just had was: suppose that, if the system is READ
ONLY, we don't actually exit recovery right away, and the startup
process doesn't exit. Instead we just sit there and wait for the
system to be made read-write again before doing anything else. But
then if hot_standby=false, there's no way for someone to execute a
ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
seems bad. So perhaps we need to let in regular connections *as if*
the system were read-write while postponing not just the
end-of-recovery checkpoint but also the other associated things like
UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command,
control file update, etc. until the end of recovery. Or maybe that's
not the right idea either, but regardless of what we do here it needs
clear comments justifying it. The current version of the patch does
not have any.

I think that you've mis-positioned the check in autovacuum.c. Note
that the comment right afterwards says: "a worker finished, or
postmaster signaled failure to start a worker". Those are things we
should still check for even when the system is R/O. What we don't want
to do in that case is start new workers. I would suggest revising the
comment that starts with "There are some conditions that..." to
mention three conditions. The new one would be that the system is in a
read-only state. I'd mention that first, making the existing ones #2
and #3, and then add the code to "continue;" in that case right after
that comment, before setting current_time.

SendsSignalToCheckpointer() has multiple problems. As far as the name,
it should at least be "Send" rather than "Sends" but the corresponding
functions elsewhere have names like SendPostmasterSignal() not
SendSignalToPostmaster(). Also, why is it OK for it to use elog()
rather than ereport()? Also, why is it an error if the checkpointer's
not running, rather than just having the next checkpointer do it when
it's relaunched? Also, why pass SIGINT as an argument if there's only
one caller? A related thing that's also odd is that sending SIGINT
calls ReqCheckpointHandler() not anything specific to prohibiting WAL.
That is probably OK because that function now just sets the latch. But
then we could stop sending SIGINT to the checkpointer at all and just
send SIGUSR1, which would also set the latch, without using up a
signal. I wonder if we should make that change as a separate
preparatory patch. It seems like that would clear things up; it would
remove the oddity that this patch is invoking a handler called
ReqCheckpointerHandler() with no intention of requesting a checkpoint,
because ReqCheckpointerHandler() would be gone. That problem could
also be fixed by renaming ReqCheckpointerHandler() to something
clearer, but that seems inferior.

This is probably not a complete list of problems. Review from others
would be appreciated.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Jan 20, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote:
> > To move development, testing, and the review forward, I have commented out the
> > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> > added the changes for the checkpointer so that system read-write state change
> > request can be processed as soon as possible, as suggested by Robert[1].
>
> I am extremely doubtful about SetWALProhibitState()'s claim that "The
> final state can only be requested by the checkpointer or by the
> single-user so that there will be no chance that the server is already
> in the desired final state." It seems like there is an obvious race
> condition: CompleteWALProhibitChange() is called with a cur_state_gen
> argument which embeds the last state we saw, but there's nothing to
> keep it from changing between the time we saw it and the time that
> function calls SetWALProhibitState(), is there? We aren't holding any
> lock. It seems to me that SetWALProhibitState() needs to be rewritten
> to avoid this assumption.
>

It is not like that, let me explain. When a user backend requests to alter WAL
prohibit state by using ASRO/ASRW DDL with the previous patch or calling
pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
set to the transition state i.e. going-read-only or going-read-write if it is
not already.  If another backend trying to request the same alteration to the
wal prohibit state then nothing going to be changed in shared memory but that
backend needs to wait until the transition to the final wal prohibited state
completes.  If a backend tries to request for the opposite state than the
previous which is in progress then it will see an error as "system state
transition to read only/write is already in progress".  At a time only one
transition state can be set.

For the case where transition state changes to the complete states i.e.
read-only/read-write that can only be changed by the checkpointer or standalone
backend, there won't be any concurrency to change transition state to complete
state.

> On a related note, SetWALProhibitState() has only two callers. One
> passes is_final_state as true, and the other as false: it's never a
> variable. The two cases are handled mostly differently. This doesn't
> seem good. A lot of the logic in this function should probably be
> moved to the calling sites, especially because it's almost certainly
> wrong for this function to be basing what it does on the *current* WAL
> prohibit state rather than the WAL prohibit state that was in effect
> at the time we made the decision to call this function in the first
> place. As I mentioned in the previous paragraph, that's a built-in
> race condition. To put that another way, this function should NOT feel
> free to call GetWALProhibitStateGen().
>

Understood. I have removed SetWALProhibitState() and moved the respective code
to the caller in the attached version.

> I don't really see why we should have both an SQL callable function
> pg_alter_wal_prohibit_state() and also a DDL command for this. If
> we're going to go with a functional interface, and I guess the idea of
> that is to make it so GRANT EXECUTE works, then why not just get rid
> of the DDL?
>

Ok, dropped the patch of the DDL command. If in the future we want it back, I
can add that again.

Now, I am a little bit concerned about the current function name. How about
pg_set_wal_prohibit_state(bool) name or have two functions as
pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
other suggestions?

> RequestWALProhibitChange() doesn't look very nice. It seems like it's
> basically the second half of pg_alter_wal_prohibit_state(), not being
> called from anywhere else. It doesn't seem to add anything to separate
> it out like this; the interface between the two is not especially
> clean.
>

Ok, moved that code in pg_alter_wal_prohibit_state() in the attached version.

> It seems odd that ProcessWALProhibitStateChangeRequest() returns
> without doing anything if !AmCheckpointerProcess(), rather than having
> that be an Assert(). Why is it like that?
>

Like AbsorbSyncRequests().

> I think WALProhibitStateShmemInit() would probably look more similar
> to other functions if it did if (found) { stuff; } rather than if
> (!found) return; stuff; -- but I might be wrong about the existing
> precedent.
>

Ok, did the same in the attached version.

> The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff
> seems confusingly-named, because we have other reasons for skipping a
> checkpoint that are not what we're talking about here. I think this is
> talking about whether we've performed a checkpoint after recovery, and
> the naming should reflect that. But I think there's something else
> wrong with the design, too: why is this protected by a spinlock? I
> have questions in both directions. On the one hand, I wonder why we
> need any kind of lock at all. On the other hand, if we do need a lock,
> I wonder why a spinlock that protects only the setting and clearing of
> the flag and nothing else is sufficient. There are zero comments
> explaining what the idea behind this locking regime is, and I can't
> understand why it should be correct.
>

Renamed those functions to SetRecoveryCheckpointSkippedFlag() and
RecoveryCheckpointIsSkipped() respectively and remove the lock which is not
needed. Updated comment for lastRecoveryCheckpointSkipped variable for the lock
requirement.

> In fact, I think this area needs a broader rethink. Like, the way you
> integrated that stuff into StartupXLog(), it sure looks to me like we
> might skip the checkpoint but still try to write other WAL records.
> Before we reach the offending segment of code, we call
> UpdateFullPageWrites(). Afterwards, we call XLogReportParameters().
> Both of those are going to potentially write WAL. I guess you could
> argue that's OK, on the grounds that neither function is necessarily
> going to log anything, but I don't think I believe that. If I make my
> server read only, take the OS down, change some GUCs, and then start
> it again, I don't expect it to PANIC.
>

If you think that there will be panic when UpdateFullPageWrites() and/or
XLogReportParameters() tries to write WAL since the shared memory state for WAL
prohibited is set then it is not like that.  For those functions, WAL write is
explicitly enabled by calling LocalSetXLogInsertAllowed().

I was under the impression that there won't be any problem if we allow the
writing WAL to UpdateFullPageWrites() and XLogReportParameters().  It can be
considered as an exception since it is fine that this WAL record is not streamed
to standby while graceful failover, I may be wrong though.

> Also, I doubt that it's OK to skip the checkpoint as this code does
> and then go ahead and execute recovery_end_command and update the
> control file anyway. It sure looks like the existing code is written
> with the assumption that the checkpoint happens before those other
> things.

Hmm, here we could go wrong. I need to look at this part carefully.

> One idea I just had was: suppose that, if the system is READ
> ONLY, we don't actually exit recovery right away, and the startup
> process doesn't exit. Instead we just sit there and wait for the
> system to be made read-write again before doing anything else. But
> then if hot_standby=false, there's no way for someone to execute a
> ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which
> seems bad. So perhaps we need to let in regular connections *as if*
> the system were read-write while postponing not just the
> end-of-recovery checkpoint but also the other associated things like
> UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command,
> control file update, etc. until the end of recovery. Or maybe that's
> not the right idea either, but regardless of what we do here it needs
> clear comments justifying it. The current version of the patch does
> not have any.
>

Will get back to you on this.  Let me think more on this and the previous
point.

> I think that you've mis-positioned the check in autovacuum.c. Note
> that the comment right afterwards says: "a worker finished, or
> postmaster signaled failure to start a worker". Those are things we
> should still check for even when the system is R/O. What we don't want
> to do in that case is start new workers. I would suggest revising the
> comment that starts with "There are some conditions that..." to
> mention three conditions. The new one would be that the system is in a
> read-only state. I'd mention that first, making the existing ones #2
> and #3, and then add the code to "continue;" in that case right after
> that comment, before setting current_time.
>

Done.

> SendsSignalToCheckpointer() has multiple problems. As far as the name,
> it should at least be "Send" rather than "Sends" but the corresponding

"Sends" is unacceptable, it is a typo.

> functions elsewhere have names like SendPostmasterSignal() not
> SendSignalToPostmaster(). Also, why is it OK for it to use elog()
> rather than ereport()? Also, why is it an error if the checkpointer's
> not running, rather than just having the next checkpointer do it when
> it's relaunched?

Ok, now the function only returns true or false. It's up to the caller what to
do with that. In our case, the caller will issue a warning only. If you want
this could be a NOTICE as well.

> Also, why pass SIGINT as an argument if there's only
> one caller?

I thoughts, anybody can also reuse it to send some other signal to the
checkpointer process in the future.

> A related thing that's also odd is that sending SIGINT
> calls ReqCheckpointHandler() not anything specific to prohibiting WAL.
> That is probably OK because that function now just sets the latch. But
> then we could stop sending SIGINT to the checkpointer at all and just
> send SIGUSR1, which would also set the latch, without using up a
> signal. I wonder if we should make that change as a separate
> preparatory patch. It seems like that would clear things up; it would
> remove the oddity that this patch is invoking a handler called
> ReqCheckpointerHandler() with no intention of requesting a checkpoint,
> because ReqCheckpointerHandler() would be gone. That problem could
> also be fixed by renaming ReqCheckpointerHandler() to something
> clearer, but that seems inferior.
>

I am not clear on this part. In the attached version I am sending SIGUSR1
instead of SIGINT, which works for me.

> This is probably not a complete list of problems. Review from others
> would be appreciated.
>

Thanks a lot.

The attached version does not address all your comments, I'll continue my work
on that.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:
> It is not like that, let me explain. When a user backend requests to alter WAL
> prohibit state by using ASRO/ASRW DDL with the previous patch or calling
> pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
> set to the transition state i.e. going-read-only or going-read-write if it is
> not already.  If another backend trying to request the same alteration to the
> wal prohibit state then nothing going to be changed in shared memory but that
> backend needs to wait until the transition to the final wal prohibited state
> completes.  If a backend tries to request for the opposite state than the
> previous which is in progress then it will see an error as "system state
> transition to read only/write is already in progress".  At a time only one
> transition state can be set.

Hrm. Well, then that needs to be abundantly clear in the relevant comments.

> Now, I am a little bit concerned about the current function name. How about
> pg_set_wal_prohibit_state(bool) name or have two functions as
> pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
> other suggestions?

How about pg_prohibit_wal(true|false)?

> > It seems odd that ProcessWALProhibitStateChangeRequest() returns
> > without doing anything if !AmCheckpointerProcess(), rather than having
> > that be an Assert(). Why is it like that?
>
> Like AbsorbSyncRequests().

Well, that can be called not from the checkpointer, according to the
comments.  Specifically from the postmaster, I guess.  Again, comments
please.

> If you think that there will be panic when UpdateFullPageWrites() and/or
> XLogReportParameters() tries to write WAL since the shared memory state for WAL
> prohibited is set then it is not like that.  For those functions, WAL write is
> explicitly enabled by calling LocalSetXLogInsertAllowed().
>
> I was under the impression that there won't be any problem if we allow the
> writing WAL to UpdateFullPageWrites() and XLogReportParameters().  It can be
> considered as an exception since it is fine that this WAL record is not streamed
> to standby while graceful failover, I may be wrong though.

I don't think that's OK. I mean, the purpose of the feature is to
prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
the principle of least surprise.

> I am not clear on this part. In the attached version I am sending SIGUSR1
> instead of SIGINT, which works for me.

OK.

> The attached version does not address all your comments, I'll continue my work
> on that.

Some thoughts on this version:

+/* Extract last two bits */
+#define        WALPROHIBIT_CURRENT_STATE(stateGeneration)      \
+       ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define        WALPROHIBIT_NEXT_STATE(stateGeneration) \
+       WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))

This is really confusing. First, the comment looks like it applies to
both based on how it is positioned, but that's clearly not true.
Second, the naming is really hard to understand. Third, there don't
seem to be comments explaining the theory of what is going on here.
Fourth, stateGeneration refers not to which generation of state we've
got here but to the combination of the state and the generation.
However, it's not clear that we ever really use the generation for
anything.

I think that the direction you went with this is somewhat different
from what I had in mind. That may be OK, but let me just explain the
difference. We both had in mind the idea that the low two bits of the
state would represent the current state and the upper bits would
represent the state generation. However, I wasn't necessarily
imagining that the only supported operation was making the combined
value go up by 1. For instance, I had thought that perhaps the effect
of trying to go read-only when we're in the middle of going read-write
would be to cancel the previous operation and start the new one. What
you have instead is that it errors out. So in your model a change
always has to finish before the next one can start, which in turn
means that the sequence is completely linear. In my idea the
state+generation might go from say 1 to 7, because trying to go
read-write would cancel the previous attempt to go read-only and
replace it with an attempt to go the other direction, and from 7 we
might go to to 9 if somebody now tries to go read-only again before
that finishes. In your model, there's never any sort of cancellation
of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.

One disadvantage of the way you've got it from a user perspective is
that if I'm writing a tool, I might get an error telling me that the
state change I'm trying to make is already in progress, and then I
have to retry. With the other design, I might attempt a state change
and have it fail because the change can't be completed, but I won't
ever fail because I attempt a state change and it can't be started
because we're in the wrong starting state. So, with this design, as
the tool author, I may not be able to just say, well, I tried to
change the state and it didn't work, so report the error to the user.
I think with the other approach that would be more viable. But I might
be wrong here; it would be interesting to hear what other people
think.

I dislike the use of the term state_gen or StateGen to refer to the
combination of a state and a generation. That seems unintuitive. I'm
tempted to propose that we just call it a counter, and, assuming we
stick with the design as you now have it, explain it with a comment
like this in walprohibit.h:

"There are four possible states. A brand new database cluster is
always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
make it read only, then we enter the state
WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
tries to make it read write, we will enter the state
WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
transitions are the only ones possible; for example, if we're
currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
read-write will produce an error, and a second attempt to go read-only
will not cause a state change. Thus, we can represent the state as a
shared-memory counter that whose value only ever changes by adding 1.
The initial value at postmaster startup is either 0 or 2, depending on
whether the control file specifies the the system is starting
read-only or read-write."

And then maybe change all the state_gen references to reference
wal_prohibit_counter or, where a shorter name is appropriate, counter.

I think this might be clearer if we used different data types for the
state and the state/generation combination, with functions to convert
between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
etc. maybe do:

typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;

And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:

static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
    return (WALProhibitState) (wal_prohibit_counter & 3);
}

I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
honestly. It's just a macro to add 1 to an integer. And you don't even
use it consistently. Like pg_alter_wal_prohibit_state() does this:

+       /* Server is already in requested state */
+       if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+               PG_RETURN_VOID();

But then later does this:

+               next_state_gen = cur_state_gen + 1;

Which is exactly the same thing as what you computed above using
WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
sure how to structure this to make it as simple as possible, but I
don't think this is it.

Honestly this whole logic here seems correct but a bit hard to follow.
Like, maybe:

wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
switch (GetWALProhibitState(wal_prohibit_counter))
{
case WALPROHIBIT_STATE_READ_WRITE:
if (!walprohibit) return;
increment = true;
break;
case WALPROHIBIT_STATE_GOING_READ_WRITE:
if (walprohibit) ereport(ERROR, ...);
break;
...
}

And then just:

if (increment)
    wal_prohibit_counter =
pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
target_counter_value = wal_prohibit_counter + 1;
// random stuff
// eventually wait until the counter reaches >= target_counter_value

This might not be exactly the right idea though. I'm just looking for
a way to make it clearer, because I find it a bit hard to understand
right now. Maybe you or someone else will have a better idea.

+               success =
pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+
                          &cur_state_gen, next_state_gen);
+               Assert(success);

I am almost positive that this is not OK. I think on some platforms
atomics just randomly fail some percentage of the time. You always
need a retry loop. Anyway, what happens if two people enter this
function at the same time and both read the same starting counter
value before either does anything?

+               /* To be sure that any later reads of memory happen
strictly after this. */
+               pg_memory_barrier();

You don't need a memory barrier after use of an atomic. The atomic
includes a barrier.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, Jan 26, 2021 at 2:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:
> > It is not like that, let me explain. When a user backend requests to alter WAL
> > prohibit state by using ASRO/ASRW DDL with the previous patch or calling
> > pg_alter_wal_prohibit_state() then WAL prohibit state in shared memory will be
> > set to the transition state i.e. going-read-only or going-read-write if it is
> > not already.  If another backend trying to request the same alteration to the
> > wal prohibit state then nothing going to be changed in shared memory but that
> > backend needs to wait until the transition to the final wal prohibited state
> > completes.  If a backend tries to request for the opposite state than the
> > previous which is in progress then it will see an error as "system state
> > transition to read only/write is already in progress".  At a time only one
> > transition state can be set.
>
> Hrm. Well, then that needs to be abundantly clear in the relevant comments.
>
> > Now, I am a little bit concerned about the current function name. How about
> > pg_set_wal_prohibit_state(bool) name or have two functions as
> > pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any
> > other suggestions?
>
> How about pg_prohibit_wal(true|false)?
>

LGTM. Used this.

> > > It seems odd that ProcessWALProhibitStateChangeRequest() returns
> > > without doing anything if !AmCheckpointerProcess(), rather than having
> > > that be an Assert(). Why is it like that?
> >
> > Like AbsorbSyncRequests().
>
> Well, that can be called not from the checkpointer, according to the
> comments.  Specifically from the postmaster, I guess.  Again, comments
> please.
>

Done.

> > If you think that there will be panic when UpdateFullPageWrites() and/or
> > XLogReportParameters() tries to write WAL since the shared memory state for WAL
> > prohibited is set then it is not like that.  For those functions, WAL write is
> > explicitly enabled by calling LocalSetXLogInsertAllowed().
> >
> > I was under the impression that there won't be any problem if we allow the
> > writing WAL to UpdateFullPageWrites() and XLogReportParameters().  It can be
> > considered as an exception since it is fine that this WAL record is not streamed
> > to standby while graceful failover, I may be wrong though.
>
> I don't think that's OK. I mean, the purpose of the feature is to
> prohibit WAL. If it doesn't do that, I believe it will fail to satisfy
> the principle of least surprise.
>

Yes, you are correct.

I am still on this. The things that worried me here are the wal records sequence
being written in the startup process -- UpdateFullPageWrites() generate record
just before the recovery check-point record and XLogReportParameters() just
after that but before any other backend could write any wal record. We might
also need to follow the same sequence while changing the system to read-write.

But in our case maintaining this sequence seems to be a little difficult. let me
explain, when a backend executes a function (ie. pg_prohibit_wal(false)) to
make the system read-write then that system state changes will be conveyed by
the Checkpointer process to all existing backends using global barrier and then
checkpoint might want to write those records. While checkpoint in progress, few
existing backends who might have absorbed barriers can write new records that
might come before aforesaid wal record sequence to be written. Also, we might
think that we could write these records before emitting the super barrier which
also might not solve the problem because a new backend could connect the server
just after the read-write system state change request was made but before
Checkpointer could pick that. Such a backend could write WAL before the
Checkpointer could, (see IsWALProhibited()).

Apart from this I also had a thought on the point recovery_end_command execution
that happens just after the recovery end checkpoint in the Startup process.
I think, first of all, why should we go and execute this command if we are
read-only?   I don't think there will be any use to boot-up a read-only server
as standby, which itself is read-only to some extent.  Also, pg_basebackup from
read-only is not allowed, a new standby cannot be set up. I think,
IMHO, we should simply error-out if tried to boot-up read-only server as
standby using standby.signal file, thoughts?

> > I am not clear on this part. In the attached version I am sending SIGUSR1
> > instead of SIGINT, which works for me.
>
> OK.
>
> > The attached version does not address all your comments, I'll continue my work
> > on that.
>
> Some thoughts on this version:
>
> +/* Extract last two bits */
> +#define        WALPROHIBIT_CURRENT_STATE(stateGeneration)      \
> +       ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
> +#define        WALPROHIBIT_NEXT_STATE(stateGeneration) \
> +       WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
>
> This is really confusing. First, the comment looks like it applies to
> both based on how it is positioned, but that's clearly not true.
> Second, the naming is really hard to understand. Third, there don't
> seem to be comments explaining the theory of what is going on here.
> Fourth, stateGeneration refers not to which generation of state we've
> got here but to the combination of the state and the generation.
> However, it's not clear that we ever really use the generation for
> anything.
>
> I think that the direction you went with this is somewhat different
> from what I had in mind. That may be OK, but let me just explain the
> difference. We both had in mind the idea that the low two bits of the
> state would represent the current state and the upper bits would
> represent the state generation. However, I wasn't necessarily
> imagining that the only supported operation was making the combined
> value go up by 1. For instance, I had thought that perhaps the effect
> of trying to go read-only when we're in the middle of going read-write
> would be to cancel the previous operation and start the new one. What
> you have instead is that it errors out. So in your model a change
> always has to finish before the next one can start, which in turn
> means that the sequence is completely linear. In my idea the
> state+generation might go from say 1 to 7, because trying to go
> read-write would cancel the previous attempt to go read-only and
> replace it with an attempt to go the other direction, and from 7 we
> might go to to 9 if somebody now tries to go read-only again before
> that finishes. In your model, there's never any sort of cancellation
> of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.
>

Yes, that made implementation quite simple. I was under the impression that we
might not have that much concurrency that so many backends might be trying to
change the system state so quickly.

> One disadvantage of the way you've got it from a user perspective is
> that if I'm writing a tool, I might get an error telling me that the
> state change I'm trying to make is already in progress, and then I
> have to retry. With the other design, I might attempt a state change
> and have it fail because the change can't be completed, but I won't
> ever fail because I attempt a state change and it can't be started
> because we're in the wrong starting state. So, with this design, as
> the tool author, I may not be able to just say, well, I tried to
> change the state and it didn't work, so report the error to the user.
> I think with the other approach that would be more viable. But I might
> be wrong here; it would be interesting to hear what other people
> think.
>

Thinking a little bit more, I agree that your approach is more viable as it can
cancel previously in-progress state.

For e.g. in a graceful failure future, the master might have detected that he
lost the connection to all standby and immediately calls the function to change
the system state to read-only. But, it regains the connection soon and wants to
back to read-write then it might need to wait until the previous state
completion. That might be the worst if the system is quite busy and/or any
backend which might have stuck or too busy and could not absorb the barrier.

If you want, I try to change the way you have thought, in the next version.

> I dislike the use of the term state_gen or StateGen to refer to the
> combination of a state and a generation. That seems unintuitive. I'm
> tempted to propose that we just call it a counter, and, assuming we
> stick with the design as you now have it, explain it with a comment
> like this in walprohibit.h:
>
> "There are four possible states. A brand new database cluster is
> always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to
> make it read only, then we enter the state
> WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we
> enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently
> tries to make it read write, we will enter the state
> WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete,
> we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state
> transitions are the only ones possible; for example, if we're
> currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go
> read-write will produce an error, and a second attempt to go read-only
> will not cause a state change. Thus, we can represent the state as a
> shared-memory counter that whose value only ever changes by adding 1.
> The initial value at postmaster startup is either 0 or 2, depending on
> whether the control file specifies the the system is starting
> read-only or read-write."
>

Thanks, added the same.

> And then maybe change all the state_gen references to reference
> wal_prohibit_counter or, where a shorter name is appropriate, counter.
>

Done.

> I think this might be clearer if we used different data types for the
> state and the state/generation combination, with functions to convert
> between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0
> etc. maybe do:
>
> typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;
>
> And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:
>
> static inline WALProhibitState
> GetWALProhibitState(uint32 wal_prohibit_counter)
> {
>     return (WALProhibitState) (wal_prohibit_counter & 3);
> }
>

Done.

> I don't really know why we need WALPROHIBIT_NEXT_STATE at all,
> honestly. It's just a macro to add 1 to an integer. And you don't even
> use it consistently. Like pg_alter_wal_prohibit_state() does this:
>
> +       /* Server is already in requested state */
> +       if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
> +               PG_RETURN_VOID();
>
> But then later does this:
>
> +               next_state_gen = cur_state_gen + 1;
>
> Which is exactly the same thing as what you computed above using
> WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly
> sure how to structure this to make it as simple as possible, but I
> don't think this is it.
>
> Honestly this whole logic here seems correct but a bit hard to follow.
> Like, maybe:
>
> wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
> switch (GetWALProhibitState(wal_prohibit_counter))
> {
> case WALPROHIBIT_STATE_READ_WRITE:
> if (!walprohibit) return;
> increment = true;
> break;
> case WALPROHIBIT_STATE_GOING_READ_WRITE:
> if (walprohibit) ereport(ERROR, ...);
> break;
> ...
> }
>
> And then just:
>
> if (increment)
>     wal_prohibit_counter =
> pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
> target_counter_value = wal_prohibit_counter + 1;
> // random stuff
> // eventually wait until the counter reaches >= target_counter_value
>
> This might not be exactly the right idea though. I'm just looking for
> a way to make it clearer, because I find it a bit hard to understand
> right now. Maybe you or someone else will have a better idea.
>

Yeah, this makes code much cleaner than before, did the same in the attached
version. Thanks again.

> +               success =
> pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
> +
>                           &cur_state_gen, next_state_gen);
> +               Assert(success);
>
> I am almost positive that this is not OK. I think on some platforms
> atomics just randomly fail some percentage of the time. You always
> need a retry loop. Anyway, what happens if two people enter this
> function at the same time and both read the same starting counter
> value before either does anything?
>
> +               /* To be sure that any later reads of memory happen
> strictly after this. */
> +               pg_memory_barrier();
>
> You don't need a memory barrier after use of an atomic. The atomic
> includes a barrier.

Understood, removed.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jan 28, 2021 at 7:17 AM Amul Sul <sulamul@gmail.com> wrote:
> I am still on this. The things that worried me here are the wal records sequence
> being written in the startup process -- UpdateFullPageWrites() generate record
> just before the recovery check-point record and XLogReportParameters() just
> after that but before any other backend could write any wal record. We might
> also need to follow the same sequence while changing the system to read-write.

I was able to chat with Andres about this topic for a while today and
he made some proposals which seemed pretty good to me. I can't promise
that what I'm about to write is an entirely faithful representation of
what he said, but hopefully it's not so far off that he gets mad at me
or something.

1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with.

2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments.

3. If the system is starting up read only (and the read-only state
didn't get cleared because of #1 above) then don't call
XLogAcceptWrites() at the end of StartupXLOG() and instead have the
checkpointer do it later when the system is going read-write for the
first time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Andres Freund
Date:
Hi,

On 2021-02-16 17:11:06 -0500, Robert Haas wrote:
> I can't promise that what I'm about to write is an entirely faithful
> representation of what he said, but hopefully it's not so far off that
> he gets mad at me or something.

Seems accurate - and also I'm way too tired that I'd be mad ;)


> 1. If the server starts up and is read-only and
> ArchiveRecoveryRequested, clear the read-only state in memory and also
> in the control file, log a message saying that this has been done, and
> proceed. This makes some other cases simpler to deal with.

It seems also to make sense from a behaviour POV to me: Imagine a
"smooth" planned failover with ASRO:
1) ASRO on primary
2) promote standby
3) edit primary config to include primary_conninfo, add standby.signal
4) restart "read only primary"

There's not really any spot in which it'd be useful to do disable ASRO,
right? But 4) should make the node a normal standby.

Greetings,

Andres Freund



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Feb 17, 2021 at 7:50 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-02-16 17:11:06 -0500, Robert Haas wrote:

Thank you very much to both of you !

> > I can't promise that what I'm about to write is an entirely faithful
> > representation of what he said, but hopefully it's not so far off that
> > he gets mad at me or something.
>
> Seems accurate - and also I'm way too tired that I'd be mad ;)
>
>
> > 1. If the server starts up and is read-only and
> > ArchiveRecoveryRequested, clear the read-only state in memory and also
> > in the control file, log a message saying that this has been done, and
> > proceed. This makes some other cases simpler to deal with.
>
> It seems also to make sense from a behaviour POV to me: Imagine a
> "smooth" planned failover with ASRO:
> 1) ASRO on primary
> 2) promote standby
> 3) edit primary config to include primary_conninfo, add standby.signal
> 4) restart "read only primary"
>
> There's not really any spot in which it'd be useful to do disable ASRO,
> right? But 4) should make the node a normal standby.
>

Understood.

In the attached version I have made the changes accordingly what Robert has
summarised in his previous mail[1].

In addition to that, I also move the code that updates the control file to
XLogAcceptWrites() which will also get skipped when the system is read-only (wal
prohibited).  The system will be in the crash recovery, and that will
change once we do the end-of-recovery checkpoint and the WAL writes operation
which we were skipping from startup.  The benefit of keeping the system in
recovery mode is that it fixes my concern[2] where other backends could connect
and write wal records while we were changing the system to read-write. Now, no
other backends allow a wal write; UpdateFullPageWrites(), end-of-recovery
checkpoint, and XLogReportParameters() operations will be performed in the same
sequence as it is in the startup while changing the system to read-write.

Regards,
Amul

1] http://postgr.es/m/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
2] http://postgr.es/m/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Prabhat Sahu
Date:
Hi all,
While testing this feature with v20-patch, the server is crashing with below steps.

Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute below statements, it will crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);

-- Slave:
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

postgres=# SELECT pg_prohibit_wal(true);
 pg_prohibit_wal
-----------------
 
(1 row)

postgres=# SELECT pg_prohibit_wal(false);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

-- Below are the stack trace:
[prabhat@localhost bin]$ gdb -q -c /tmp/data_slave/core.35273 postgres
Reading symbols from /home/prabhat/PG/PGsrcNew/postgresql/inst/bin/postgres...done.
[New LWP 35273]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer                                                       '.
Program terminated with signal 6, Aborted.
#0  0x00007fa876233387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007fa876233387 in raise () from /lib64/libc.so.6
#1  0x00007fa876234a78 in abort () from /lib64/libc.so.6
#2  0x0000000000aea31c in ExceptionalCondition (conditionName=0xb8c998 "ThisTimeLineID != 0 || IsBootstrapProcessingMode()",
    errorType=0xb8956d "FailedAssertion", fileName=0xb897c0 "xlog.c", lineNumber=8611) at assert.c:69
#3  0x0000000000588eb5 in InitXLOGAccess () at xlog.c:8611
#4  0x0000000000588ae6 in LocalSetXLogInsertAllowed () at xlog.c:8483
#5  0x00000000005881bb in XLogAcceptWrites (needChkpt=true, xlogreader=0x0, EndOfLog=0, EndOfLogTLI=0) at xlog.c:8008
#6  0x00000000005751ed in ProcessWALProhibitStateChangeRequest () at walprohibit.c:361
#7  0x000000000088c69f in CheckpointerMain () at checkpointer.c:355
#8  0x000000000059d7db in AuxiliaryProcessMain (argc=2, argv=0x7ffd1290d060) at bootstrap.c:455
#9  0x000000000089fc5f in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5416
#10 0x000000000089f782 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5128
#11 <signal handler called>
#12 0x00007fa8762f2983 in __select_nocancel () from /lib64/libc.so.6
#13 0x000000000089b511 in ServerLoop () at postmaster.c:1700
#14 0x000000000089af00 in PostmasterMain (argc=5, argv=0x15b8460) at postmaster.c:1408
#15 0x000000000079c23a in main (argc=5, argv=0x15b8460) at main.c:209
(gdb)

kindly let me know if you need more inputs on this.

On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>
> On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:
>>
>> On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:
>> >
>> > On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> > >
>> > > On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>[....]
>
> One of the patch (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the latest patchset does not apply successfully.
>
> http://cfbot.cputube.org/patch_32_2602.log
>
> === applying patch ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
>
> Hunk #15 succeeded at 2604 (offset -13 lines).
> 1 out of 15 hunks FAILED -- saving rejects to file src/backend/access/nbtree/nbtpage.c.rej
> patching file src/backend/access/spgist/spgdoinsert.c
>
> It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
>

Thanks, I am getting one more failure for the vacuumlazy.c. on the
latest master head(d75288fb27b), I fixed that in attached version.

Regards,
Amul


--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Mar 19, 2021 at 7:17 PM Prabhat Sahu
<prabhat.sahu@enterprisedb.com> wrote:
>
> Hi all,
> While testing this feature with v20-patch, the server is crashing with below steps.
>
> Steps to reproduce:
> 1. Configure master-slave replication setup.
> 2. Connect to Slave.
> 3. Execute below statements, it will crash the server:
> SELECT pg_prohibit_wal(true);
> SELECT pg_prohibit_wal(false);
>
> -- Slave:
> postgres=# select pg_is_in_recovery();
>  pg_is_in_recovery
> -------------------
>  t
> (1 row)
>
> postgres=# SELECT pg_prohibit_wal(true);
>  pg_prohibit_wal
> -----------------
>
> (1 row)
>
> postgres=# SELECT pg_prohibit_wal(false);
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !?>

Thanks Prabhat.

The assertion failure is due to wrong assumptions for the flag that were used
for the XLogAcceptWrites() call. In the case of standby, the startup process
never reached the place where it could call XLogAcceptWrites() and update the
respective flag. Due to this flag value, pg_prohibit_wal() function does
alter system state in recovery state which is incorrect.

In the attached function I took enum value for that flag so that
pg_prohibit_wal() is only allowed in the recovery mode, iff that flag indicates
that XLogAcceptWrites() has been skipped previously.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).

Regards,
Amul


On Mon, Mar 22, 2021 at 12:13 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, Mar 19, 2021 at 7:17 PM Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
> >
> > Hi all,
> > While testing this feature with v20-patch, the server is crashing with below steps.
> >
> > Steps to reproduce:
> > 1. Configure master-slave replication setup.
> > 2. Connect to Slave.
> > 3. Execute below statements, it will crash the server:
> > SELECT pg_prohibit_wal(true);
> > SELECT pg_prohibit_wal(false);
> >
> > -- Slave:
> > postgres=# select pg_is_in_recovery();
> >  pg_is_in_recovery
> > -------------------
> >  t
> > (1 row)
> >
> > postgres=# SELECT pg_prohibit_wal(true);
> >  pg_prohibit_wal
> > -----------------
> >
> > (1 row)
> >
> > postgres=# SELECT pg_prohibit_wal(false);
> > WARNING:  terminating connection because of crash of another server process
> > DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
> > HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
> > !?>
>
> Thanks Prabhat.
>
> The assertion failure is due to wrong assumptions for the flag that were used
> for the XLogAcceptWrites() call. In the case of standby, the startup process
> never reached the place where it could call XLogAcceptWrites() and update the
> respective flag. Due to this flag value, pg_prohibit_wal() function does
> alter system state in recovery state which is incorrect.
>
> In the attached function I took enum value for that flag so that
> pg_prohibit_wal() is only allowed in the recovery mode, iff that flag indicates
> that XLogAcceptWrites() has been skipped previously.
>
> Regards,
> Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Bharath Rupireddy
Date:
On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).

Some minor comments on 0001:
Isn't it "might not be running"?
+                 errdetail("Checkpointer might not running."),

Isn't it  "Try again after sometime"?
+                         errhint("Try after sometime again.")));

Can we have ereport(DEBUG1 just to be consistent(although it doesn't
make any difference from elog(DEBUG1) with the new log messages
introduced in the patch?
+    elog(DEBUG1, "waiting for backends to adopt requested WAL
prohibit state change");

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>

Thanks Bharath for your review.

> On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).
>
> Some minor comments on 0001:
> Isn't it "might not be running"?
> +                 errdetail("Checkpointer might not running."),
>

Ok, fixed in the attached version.

> Isn't it  "Try again after sometime"?
> +                         errhint("Try after sometime again.")));
>

Ok, done.

> Can we have ereport(DEBUG1 just to be consistent(although it doesn't
> make any difference from elog(DEBUG1) with the new log messages
> introduced in the patch?
> +    elog(DEBUG1, "waiting for backends to adopt requested WAL
> prohibit state change");
>

I think it's fine; many existing places have used elog(DEBUG1, ....) too.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Rotten again, attached the rebased version.

Regards,
Amul

On Mon, Apr 5, 2021 at 5:27 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
>
> Thanks Bharath for your review.
>
> > On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).
> >
> > Some minor comments on 0001:
> > Isn't it "might not be running"?
> > +                 errdetail("Checkpointer might not running."),
> >
>
> Ok, fixed in the attached version.
>
> > Isn't it  "Try again after sometime"?
> > +                         errhint("Try after sometime again.")));
> >
>
> Ok, done.
>
> > Can we have ereport(DEBUG1 just to be consistent(although it doesn't
> > make any difference from elog(DEBUG1) with the new log messages
> > introduced in the patch?
> > +    elog(DEBUG1, "waiting for backends to adopt requested WAL
> > prohibit state change");
> >
>
> I think it's fine; many existing places have used elog(DEBUG1, ....) too.
>
> Regards,
> Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Rebased again.

On Wed, Apr 7, 2021 at 12:38 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Rotten again, attached the rebased version.
>
> Regards,
> Amul
>
> On Mon, Apr 5, 2021 at 5:27 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> >
> > Thanks Bharath for your review.
> >
> > > On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
> > > >
> > > > Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).
> > >
> > > Some minor comments on 0001:
> > > Isn't it "might not be running"?
> > > +                 errdetail("Checkpointer might not running."),
> > >
> >
> > Ok, fixed in the attached version.
> >
> > > Isn't it  "Try again after sometime"?
> > > +                         errhint("Try after sometime again.")));
> > >
> >
> > Ok, done.
> >
> > > Can we have ereport(DEBUG1 just to be consistent(although it doesn't
> > > make any difference from elog(DEBUG1) with the new log messages
> > > introduced in the patch?
> > > +    elog(DEBUG1, "waiting for backends to adopt requested WAL
> > > prohibit state change");
> > >
> >
> > I think it's fine; many existing places have used elog(DEBUG1, ....) too.
> >
> > Regards,
> > Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:
> Rebased again.

I started to look at this today, and didn't get very far, but I have a
few comments. The main one is that I don't think this patch implements
the design proposed in
https://www.postgresql.org/message-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

The first part of that proposal said this:

"1. If the server starts up and is read-only and
ArchiveRecoveryRequested, clear the read-only state in memory and also
in the control file, log a message saying that this has been done, and
proceed. This makes some other cases simpler to deal with."

As I read it, the patch clears the read-only state in memory, does not
clear it in the control file, and does not log a message.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the
following things from StartupXLOG() into that function: (1) the call
to UpdateFullPageWrites(), (2) the following block of code that does
either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
CreateCheckPoint(), (3) the next block of code that runs
recovery_end_command, (4) the call to XLogReportParameters(), and (5)
the call to CompleteCommitTsInitialization(). Call the new function
from the place where we now call XLogReportParameters(). This would
mean that (1)-(3) happen later than they do now, which might require
some adjustments."

Now you moved that code, but you also moved (6)
CompleteCommitTsInitialization(), (7) setting the control file to
DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
(9) requesting a checkpoint if we were just promoted. That's not what
was proposed. One result of this is that the server now thinks it's in
recovery even after the startup process has exited.
RecoveryInProgress() is still returning true everywhere. But that is
inconsistent with what Andres and I were recommending in
http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

I also noticed that 0001 does not compile without 0002, so the
separation into multiple patches is not clean. I would actually
suggest that the first patch in the series should just create
XLogAcceptWrites() with the minimum amount of adjustment to make that
work. That would potentially let us commit that change independently,
which would be good, because then if we accidentally break something,
it'll be easier to pin down to that particular change instead of being
mixed with everything else this needs to change.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, May 7, 2021 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:
> > Rebased again.
>
> I started to look at this today, and didn't get very far, but I have a
> few comments. The main one is that I don't think this patch implements
> the design proposed in
> https://www.postgresql.org/message-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
>
> The first part of that proposal said this:
>
> "1. If the server starts up and is read-only and
> ArchiveRecoveryRequested, clear the read-only state in memory and also
> in the control file, log a message saying that this has been done, and
> proceed. This makes some other cases simpler to deal with."
>
> As I read it, the patch clears the read-only state in memory, does not
> clear it in the control file, and does not log a message.
>

The state in the control file also gets cleared. Though, after
clearing in memory the state patch doesn't really do the immediate
change to the control file, it relies on the next UpdateControlFile()
to do that.

Regarding log message I think I have skipped that intentionally, to
avoid confusing log as "system is now read write" when we do start as
hot-standby which is not really read-write.

> The second part of this proposal was:
>
> "2. Create a new function with a name like XLogAcceptWrites(). Move the
> following things from StartupXLOG() into that function: (1) the call
> to UpdateFullPageWrites(), (2) the following block of code that does
> either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> CreateCheckPoint(), (3) the next block of code that runs
> recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> the call to CompleteCommitTsInitialization(). Call the new function
> from the place where we now call XLogReportParameters(). This would
> mean that (1)-(3) happen later than they do now, which might require
> some adjustments."
>
> Now you moved that code, but you also moved (6)
> CompleteCommitTsInitialization(), (7) setting the control file to
> DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> (9) requesting a checkpoint if we were just promoted. That's not what
> was proposed. One result of this is that the server now thinks it's in
> recovery even after the startup process has exited.
> RecoveryInProgress() is still returning true everywhere. But that is
> inconsistent with what Andres and I were recommending in
> http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
>

Regarding modified approach, I tried to explain that why I did
this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

> I also noticed that 0001 does not compile without 0002, so the
> separation into multiple patches is not clean. I would actually
> suggest that the first patch in the series should just create
> XLogAcceptWrites() with the minimum amount of adjustment to make that
> work. That would potentially let us commit that change independently,
> which would be good, because then if we accidentally break something,
> it'll be easier to pin down to that particular change instead of being
> mixed with everything else this needs to change.
>

Ok, I will try in the next version.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:
> The state in the control file also gets cleared. Though, after
> clearing in memory the state patch doesn't really do the immediate
> change to the control file, it relies on the next UpdateControlFile()
> to do that.

But when will that happen? If you're relying on some very nearby code,
that might be OK, but perhaps a comment is in order. If you're just
thinking it's going to happen eventually, I think that's not good
enough.

> Regarding log message I think I have skipped that intentionally, to
> avoid confusing log as "system is now read write" when we do start as
> hot-standby which is not really read-write.

I think the message should not be phrased that way. In fact, I think
now that we've moved to calling this pg_prohibit_wal() rather than
ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
maybe some comments and function names as well. Perhaps something
like:

system is read only -> WAL is now prohibited
system is read write -> WAL is no longer prohibited

And then for this particular case, maybe something like:

clearing WAL prohibition because the system is in archive recovery

> > The second part of this proposal was:
> >
> > "2. Create a new function with a name like XLogAcceptWrites(). Move the
> > following things from StartupXLOG() into that function: (1) the call
> > to UpdateFullPageWrites(), (2) the following block of code that does
> > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> > CreateCheckPoint(), (3) the next block of code that runs
> > recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> > the call to CompleteCommitTsInitialization(). Call the new function
> > from the place where we now call XLogReportParameters(). This would
> > mean that (1)-(3) happen later than they do now, which might require
> > some adjustments."
> >
> > Now you moved that code, but you also moved (6)
> > CompleteCommitTsInitialization(), (7) setting the control file to
> > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> > (9) requesting a checkpoint if we were just promoted. That's not what
> > was proposed. One result of this is that the server now thinks it's in
> > recovery even after the startup process has exited.
> > RecoveryInProgress() is still returning true everywhere. But that is
> > inconsistent with what Andres and I were recommending in
> > http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
>
> Regarding modified approach, I tried to explain that why I did
> this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

I am not able to understand what problem you are seeing there. If
we're in crash recovery, then nobody can connect to the database, so
there can't be any concurrent activity. If we're in archive recovery,
we now clear the WAL-is-prohibited flag so that we will go read-write
directly at the end of recovery. We can and should refuse any effort
to call pg_prohibit_wal() during recovery. If we reached the end of
crash recovery and are now permitting read-only connections, why would
anyone be able to write WAL before the system has been changed to
read-write? If that can happen, it's a bug, not a reason to change the
design.

Maybe your concern here is about ordering: the process that is going
to run XLogAcceptWrites() needs to allow xlog writes locally before we
tell other backends that they also can xlog writes; otherwise, some
other records could slip in before UpdateFullPageWrites() and similar
have run, which we probably don't want. But that's why
LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
what we need in this situation, we should be able to tweak it so it
does.

If your concern is something else, can you spell it out for me again
because I'm not getting it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Mon, May 10, 2021 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:
> > The state in the control file also gets cleared. Though, after
> > clearing in memory the state patch doesn't really do the immediate
> > change to the control file, it relies on the next UpdateControlFile()
> > to do that.
>
> But when will that happen? If you're relying on some very nearby code,
> that might be OK, but perhaps a comment is in order. If you're just
> thinking it's going to happen eventually, I think that's not good
> enough.
>

Ok.

> > Regarding log message I think I have skipped that intentionally, to
> > avoid confusing log as "system is now read write" when we do start as
> > hot-standby which is not really read-write.
>
> I think the message should not be phrased that way. In fact, I think
> now that we've moved to calling this pg_prohibit_wal() rather than
> ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
> maybe some comments and function names as well. Perhaps something
> like:
>
> system is read only -> WAL is now prohibited
> system is read write -> WAL is no longer prohibited
>
> And then for this particular case, maybe something like:
>
> clearing WAL prohibition because the system is in archive recovery
>

Ok, thanks for the suggestions.

> > > The second part of this proposal was:
> > >
> > > "2. Create a new function with a name like XLogAcceptWrites(). Move the
> > > following things from StartupXLOG() into that function: (1) the call
> > > to UpdateFullPageWrites(), (2) the following block of code that does
> > > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> > > CreateCheckPoint(), (3) the next block of code that runs
> > > recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> > > the call to CompleteCommitTsInitialization(). Call the new function
> > > from the place where we now call XLogReportParameters(). This would
> > > mean that (1)-(3) happen later than they do now, which might require
> > > some adjustments."
> > >
> > > Now you moved that code, but you also moved (6)
> > > CompleteCommitTsInitialization(), (7) setting the control file to
> > > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> > > (9) requesting a checkpoint if we were just promoted. That's not what
> > > was proposed. One result of this is that the server now thinks it's in
> > > recovery even after the startup process has exited.
> > > RecoveryInProgress() is still returning true everywhere. But that is
> > > inconsistent with what Andres and I were recommending in
> > > http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
> >
> > Regarding modified approach, I tried to explain that why I did
> > this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com
>
> I am not able to understand what problem you are seeing there. If
> we're in crash recovery, then nobody can connect to the database, so
> there can't be any concurrent activity. If we're in archive recovery,
> we now clear the WAL-is-prohibited flag so that we will go read-write
> directly at the end of recovery. We can and should refuse any effort
> to call pg_prohibit_wal() during recovery. If we reached the end of
> crash recovery and are now permitting read-only connections, why would
> anyone be able to write WAL before the system has been changed to
> read-write? If that can happen, it's a bug, not a reason to change the
> design.
>
> Maybe your concern here is about ordering: the process that is going
> to run XLogAcceptWrites() needs to allow xlog writes locally before we
> tell other backends that they also can xlog writes; otherwise, some
> other records could slip in before UpdateFullPageWrites() and similar
> have run, which we probably don't want. But that's why
> LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
> what we need in this situation, we should be able to tweak it so it
> does.
>

Yes, we don't want any write slip in before UpdateFullPageWrites().
Recently[1], we have decided to let the Checkpointed process call
XLogAcceptWrites() unconditionally.

Here problem is that when a backend executes the
pg_prohibit_wal(false) function to make the system read-write, the wal
prohibited state is set to inprogress(ie.
WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
Next, Checkpointer will convey this system change to all existing
backends using a global barrier, and after that final wal prohibited
state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
While Checkpointer is in the progress of conveying this global
barrier,  any new backend can connect at that time and can write a new
record because the inprogress read-write state is equivalent to the
final read-write state iff LocalXLogInsertAllowed != 0 for that
backend.  And, that new record could slip in before or in between
records to be written by XLogAcceptWrites().

1] http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Yes, we don't want any write slip in before UpdateFullPageWrites().
> Recently[1], we have decided to let the Checkpointed process call
> XLogAcceptWrites() unconditionally.
>
> Here problem is that when a backend executes the
> pg_prohibit_wal(false) function to make the system read-write, the wal
> prohibited state is set to inprogress(ie.
> WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
> Next, Checkpointer will convey this system change to all existing
> backends using a global barrier, and after that final wal prohibited
> state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
> While Checkpointer is in the progress of conveying this global
> barrier,  any new backend can connect at that time and can write a new
> record because the inprogress read-write state is equivalent to the
> final read-write state iff LocalXLogInsertAllowed != 0 for that
> backend.  And, that new record could slip in before or in between
> records to be written by XLogAcceptWrites().
>
> 1] http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
and signaled to the checkpointer.  The checkpointer should first call
XLogAcceptWrites and then it should inform other backends through the
global barrier?  Are we worried that if we have written the WAL in
XLogAcceptWrites but later if we could not set the state to
WALPROHIBIT_STATE_READ_WRITE?  Then maybe we can inform all the
backend first but before setting the state to
WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, May 11, 2021 at 11:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > Yes, we don't want any write slip in before UpdateFullPageWrites().
> > Recently[1], we have decided to let the Checkpointed process call
> > XLogAcceptWrites() unconditionally.
> >
> > Here problem is that when a backend executes the
> > pg_prohibit_wal(false) function to make the system read-write, the wal
> > prohibited state is set to inprogress(ie.
> > WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
> > Next, Checkpointer will convey this system change to all existing
> > backends using a global barrier, and after that final wal prohibited
> > state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
> > While Checkpointer is in the progress of conveying this global
> > barrier,  any new backend can connect at that time and can write a new
> > record because the inprogress read-write state is equivalent to the
> > final read-write state iff LocalXLogInsertAllowed != 0 for that
> > backend.  And, that new record could slip in before or in between
> > records to be written by XLogAcceptWrites().
> >
> > 1] http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
>
> But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
> and signaled to the checkpointer.  The checkpointer should first call
> XLogAcceptWrites and then it should inform other backends through the
> global barrier?  Are we worried that if we have written the WAL in
> XLogAcceptWrites but later if we could not set the state to
> WALPROHIBIT_STATE_READ_WRITE?  Then maybe we can inform all the
> backend first but before setting the state to
> WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?
>

I get why you think that, I wasn't very precise in briefing the problem.

Any new backend that gets connected right after the shared memory
state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
default allowed to do the WAL writes.  Such backends can perform write
operation before the checkpointer does the XLogAcceptWrites(). Also,
possible that a backend could connect at the same time checkpointer
performing XLogAcceptWrites() and can write a wal.

So, having XLogAcceptWrites() before does not really solve my concern.
Note that the previous patch XLogAcceptWrites() does get called before
global barrier emission.

Please let me know if it is not yet cleared to you, thanks.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

> I get why you think that, I wasn't very precise in briefing the problem.
>
> Any new backend that gets connected right after the shared memory
> state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> default allowed to do the WAL writes.  Such backends can perform write
> operation before the checkpointer does the XLogAcceptWrites().

Okay, make sense now. But my next question is why do we allow backends
to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
wait until the shared memory state is changed to
WALPROHIBIT_STATE_READ_WRITE?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:
>
> > I get why you think that, I wasn't very precise in briefing the problem.
> >
> > Any new backend that gets connected right after the shared memory
> > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > default allowed to do the WAL writes.  Such backends can perform write
> > operation before the checkpointer does the XLogAcceptWrites().
>
> Okay, make sense now. But my next question is why do we allow backends
> to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> wait until the shared memory state is changed to
> WALPROHIBIT_STATE_READ_WRITE?
>

Ok, good question.

Now let's first try to understand the Checkpointer's work.

When Checkpointer sees the wal prohibited state is an in-progress state, then
it first emits the global barrier and waits until all backers absorb that.
After that it set the final requested WAL prohibit state.

When other backends absorb those barriers then appropriate action is taken
(e.g. abort the read-write transaction if moving to read-only) by them. Also,
LocalXLogInsertAllowed flags get reset in it and that backend needs to call
XLogInsertAllowed() to get the right value for it, which further decides WAL
writes permitted or prohibited.

Consider an example that the system is trying to change to read-write and for
that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
Checkpointer starts its work.  If we want to treat that system as read-only for
the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
think about the behavior of the backend that has absorbed the barrier and reset
the LocalXLogInsertAllowed flag.  That backend eventually going to call
XLogInsertAllowed() to get the actual value for it and by seeing the current
state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed
again same as it was before for the read-only state.

Now the question is when this value should get reset again so that backend can
be read-write? We are done with a barrier and that backend never going to come
back to read-write again.

One solution, I think, is to set the final state before emitting the barrier
but as per the current design that should get set after all barrier processing.
Let's see what Robert says on this.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > > I get why you think that, I wasn't very precise in briefing the problem.
> > >
> > > Any new backend that gets connected right after the shared memory
> > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > > default allowed to do the WAL writes.  Such backends can perform write
> > > operation before the checkpointer does the XLogAcceptWrites().
> >
> > Okay, make sense now. But my next question is why do we allow backends
> > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> > wait until the shared memory state is changed to
> > WALPROHIBIT_STATE_READ_WRITE?
> >
>
> Ok, good question.
>
> Now let's first try to understand the Checkpointer's work.
>
> When Checkpointer sees the wal prohibited state is an in-progress state, then
> it first emits the global barrier and waits until all backers absorb that.
> After that it set the final requested WAL prohibit state.
>
> When other backends absorb those barriers then appropriate action is taken
> (e.g. abort the read-write transaction if moving to read-only) by them. Also,
> LocalXLogInsertAllowed flags get reset in it and that backend needs to call
> XLogInsertAllowed() to get the right value for it, which further decides WAL
> writes permitted or prohibited.
>
> Consider an example that the system is trying to change to read-write and for
> that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
> Checkpointer starts its work.  If we want to treat that system as read-only for
> the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
> think about the behavior of the backend that has absorbed the barrier and reset
> the LocalXLogInsertAllowed flag.  That backend eventually going to call
> XLogInsertAllowed() to get the actual value for it and by seeing the current
> state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed
> again same as it was before for the read-only state.

I might be missing something, but assume the behavior should be like this

1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
-> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
the barrier, we can immediately abort any read-write transaction(and
stop allowing WAL writing), because once we ensure that all session
has responded that now they have no read-write transaction then we can
safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
WALPROHIBIT_STATE_READ_ONLY.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
to consider the system as read-write, instead, we should wait until
the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

So your problem is that on receiving the barrier we need to call
LocalXLogInsertAllowed() from the backend, but how does that matter?
you can still make IsWALProhibited() return true.

I don't know the complete code so I might be missing something but at
least that is what I would expect from the design POV.


Other than this point, I think the state names READ_ONLY, READ_WRITE
are a bit confusing no? because actually, these states represent
whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
are putting the system under a Read-only state.  For example, if you
are doing some write operation on an unlogged table will be allowed, I
guess because that will not generate the WAL until you commit (because
commit generates WAL) right? so practically, we are just blocking the
WAL, not the write operation.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > > I get why you think that, I wasn't very precise in briefing the problem.
> > > >
> > > > Any new backend that gets connected right after the shared memory
> > > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> > > > default allowed to do the WAL writes.  Such backends can perform write
> > > > operation before the checkpointer does the XLogAcceptWrites().
> > >
> > > Okay, make sense now. But my next question is why do we allow backends
> > > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> > > wait until the shared memory state is changed to
> > > WALPROHIBIT_STATE_READ_WRITE?
> > >
> >
> > Ok, good question.
> >
> > Now let's first try to understand the Checkpointer's work.
> >
> > When Checkpointer sees the wal prohibited state is an in-progress state, then
> > it first emits the global barrier and waits until all backers absorb that.
> > After that it set the final requested WAL prohibit state.
> >
> > When other backends absorb those barriers then appropriate action is taken
> > (e.g. abort the read-write transaction if moving to read-only) by them. Also,
> > LocalXLogInsertAllowed flags get reset in it and that backend needs to call
> > XLogInsertAllowed() to get the right value for it, which further decides WAL
> > writes permitted or prohibited.
> >
> > Consider an example that the system is trying to change to read-write and for
> > that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
> > Checkpointer starts its work.  If we want to treat that system as read-only for
> > the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
> > think about the behavior of the backend that has absorbed the barrier and reset
> > the LocalXLogInsertAllowed flag.  That backend eventually going to call
> > XLogInsertAllowed() to get the actual value for it and by seeing the current
> > state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed
> > again same as it was before for the read-only state.
>
> I might be missing something, but assume the behavior should be like this
>
> 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> the barrier, we can immediately abort any read-write transaction(and
> stop allowing WAL writing), because once we ensure that all session
> has responded that now they have no read-write transaction then we can
> safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> WALPROHIBIT_STATE_READ_ONLY.
>

Yes, that's what the current patch is doing from the first patch version.

> 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> to consider the system as read-write, instead, we should wait until
> the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
>

I am sure that only not enough will have the same issue where
LocalXLogInsertAllowed gets set the same as the read-only as described in
my previous reply.

> So your problem is that on receiving the barrier we need to call
> LocalXLogInsertAllowed() from the backend, but how does that matter?
> you can still make IsWALProhibited() return true.
>

Note that LocalXLogInsertAllowed is a local flag for a backend, not a
function, and in the server code at every place, we don't rely on
IsWALProhibited() instead we do rely on LocalXLogInsertAllowed
flags before wal writes and that check made via XLogInsertAllowed().

> I don't know the complete code so I might be missing something but at
> least that is what I would expect from the design POV.
>
>
> Other than this point, I think the state names READ_ONLY, READ_WRITE
> are a bit confusing no? because actually, these states represent
> whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
> are putting the system under a Read-only state.  For example, if you
> are doing some write operation on an unlogged table will be allowed, I
> guess because that will not generate the WAL until you commit (because
> commit generates WAL) right? so practically, we are just blocking the
> WAL, not the write operation.
>

This read-only and read-write are the wal prohibited states though we
are using for read-only/read-write system in the discussion and the
complete macro name is WALPROHIBIT_STATE_READ_ONLY and
WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make
implementation confusing.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I might be missing something, but assume the behavior should be like this
> >
> > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > the barrier, we can immediately abort any read-write transaction(and
> > stop allowing WAL writing), because once we ensure that all session
> > has responded that now they have no read-write transaction then we can
> > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > WALPROHIBIT_STATE_READ_ONLY.
> >
>
> Yes, that's what the current patch is doing from the first patch version.
>
> > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > to consider the system as read-write, instead, we should wait until
> > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> >
>
> I am sure that only not enough will have the same issue where
> LocalXLogInsertAllowed gets set the same as the read-only as described in
> my previous reply.

Okay, but while browsing the code I do not see any direct if condition
based on the "LocalXLogInsertAllowed" variable, can you point me to
some references?
I only see one if check on this variable and that is in
XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
you are already checking IsWALProhibited.  No?


> > Other than this point, I think the state names READ_ONLY, READ_WRITE
> > are a bit confusing no? because actually, these states represent
> > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we
> > are putting the system under a Read-only state.  For example, if you
> > are doing some write operation on an unlogged table will be allowed, I
> > guess because that will not generate the WAL until you commit (because
> > commit generates WAL) right? so practically, we are just blocking the
> > WAL, not the write operation.
> >
>
> This read-only and read-write are the wal prohibited states though we
> are using for read-only/read-write system in the discussion and the
> complete macro name is WALPROHIBIT_STATE_READ_ONLY and
> WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make
> implementation confusing.

Fine, I am not too particular about these names.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I might be missing something, but assume the behavior should be like this
> > >
> > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > the barrier, we can immediately abort any read-write transaction(and
> > > stop allowing WAL writing), because once we ensure that all session
> > > has responded that now they have no read-write transaction then we can
> > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > WALPROHIBIT_STATE_READ_ONLY.
> > >
> >
> > Yes, that's what the current patch is doing from the first patch version.
> >
> > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > to consider the system as read-write, instead, we should wait until
> > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > >
> >
> > I am sure that only not enough will have the same issue where
> > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > my previous reply.
>
> Okay, but while browsing the code I do not see any direct if condition
> based on the "LocalXLogInsertAllowed" variable, can you point me to
> some references?
> I only see one if check on this variable and that is in
> XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> you are already checking IsWALProhibited.  No?
>

I am not sure I understood this. Where am I checking IsWALProhibited()?

IsWALProhibited() is called by XLogInsertAllowed() once when
LocalXLogInsertAllowed is in a reset state, and that result will be
cached in LocalXLogInsertAllowed and will be used in the subsequent
XLogInsertAllowed() call.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I might be missing something, but assume the behavior should be like this
> > > >
> > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > > the barrier, we can immediately abort any read-write transaction(and
> > > > stop allowing WAL writing), because once we ensure that all session
> > > > has responded that now they have no read-write transaction then we can
> > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > > WALPROHIBIT_STATE_READ_ONLY.
> > > >
> > >
> > > Yes, that's what the current patch is doing from the first patch version.
> > >
> > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > > to consider the system as read-write, instead, we should wait until
> > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > > >
> > >
> > > I am sure that only not enough will have the same issue where
> > > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > > my previous reply.
> >
> > Okay, but while browsing the code I do not see any direct if condition
> > based on the "LocalXLogInsertAllowed" variable, can you point me to
> > some references?
> > I only see one if check on this variable and that is in
> > XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> > you are already checking IsWALProhibited.  No?
> >
>
> I am not sure I understood this. Where am I checking IsWALProhibited()?
>
> IsWALProhibited() is called by XLogInsertAllowed() once when
> LocalXLogInsertAllowed is in a reset state, and that result will be
> cached in LocalXLogInsertAllowed and will be used in the subsequent
> XLogInsertAllowed() call.

Okay, got what you were trying to say.  But that can be easily
fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE
then what we can do is don't allow to write the WAL but let's not set
the LocalXLogInsertAllowed to 0.  So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(), I know this will add a performance penalty
but this is for the short period until we are in the intermediate
state.  After that as soon as it will set to
WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:


On Tue, 11 May 2021 at 7:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I might be missing something, but assume the behavior should be like this
> > > >
> > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > > the barrier, we can immediately abort any read-write transaction(and
> > > > stop allowing WAL writing), because once we ensure that all session
> > > > has responded that now they have no read-write transaction then we can
> > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > > WALPROHIBIT_STATE_READ_ONLY.
> > > >
> > >
> > > Yes, that's what the current patch is doing from the first patch version.
> > >
> > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > > to consider the system as read-write, instead, we should wait until
> > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > > >
> > >
> > > I am sure that only not enough will have the same issue where
> > > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > > my previous reply.
> >
> > Okay, but while browsing the code I do not see any direct if condition
> > based on the "LocalXLogInsertAllowed" variable, can you point me to
> > some references?
> > I only see one if check on this variable and that is in
> > XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> > you are already checking IsWALProhibited.  No?
> >
>
> I am not sure I understood this. Where am I checking IsWALProhibited()?
>
> IsWALProhibited() is called by XLogInsertAllowed() once when
> LocalXLogInsertAllowed is in a reset state, and that result will be
> cached in LocalXLogInsertAllowed and will be used in the subsequent
> XLogInsertAllowed() call.

Okay, got what you were trying to say.  But that can be easily
fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE
then what we can do is don't allow to write the WAL but let's not set
the LocalXLogInsertAllowed to 0.  So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(), I know this will add a performance penalty
but this is for the short period until we are in the intermediate
state.  After that as soon as it will set to
WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.

I think I have much easier solution than this, will post that with update version patch set tomorrow.

Regards,
Amul


Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:
> I think I have much easier solution than this, will post that with update version patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.

I also entirely agree with Dilip that we should do some renaming to
get rid of the read-write/read-only terminology, now that this is no
longer part of the syntax. In fact I made the exact same point in my
last review. The WALPROHIBIT_STATE_* constants are just one thing of
many that needs to be included in that renaming.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:
> > I think I have much easier solution than this, will post that with update version patch set tomorrow.
>
> I don't know what you have in mind, but based on this discussion, it
> seems to me that we should just have 5 states instead of 4:
>
> 1. WAL is permitted.
> 2. WAL is being prohibited but some backends may not know about the change yet.
> 3. WAL is prohibited.
> 4. WAL is in the process of being permitted but XLogAcceptWrites() may
> not have been called yet.
> 5. WAL is in the process of being permitted and XLogAcceptWrites() has
> been called but some backends may not know about the change yet.
>
> If we're in state #3 and someone does pg_prohibit_wal(false) then we
> enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
> state #5, and pushes out a barrier. Then it waits for the barrier to
> be absorbed and, when it has been, it moves us to state #1. Then if
> someone does pg_prohibit_wal(true) we move to state #2. The
> checkpointer pushes out a barrier and waits for it to be absorbed.
> Then it calls XLogFlush() and afterward moves us to state #3.
>
> We can have any (reasonable) number of states that we want. There's
> nothing magical about 4.

Your idea makes sense, but IMHO, if we are first writing
XLogAcceptWrites() and then pushing out the barrier, then I don't
understand the meaning of having state #4.  I mean whenever any
backend receives the barrier the system will always be in state #5.
So what do we want to do with state #4?

Is it just to make the state machine better?  I mean in the checkpoint
process, we don't need separate "if checks" whether the
XLogAcceptWrites() is called or not, instead we can just rely on the
state, if it is #4 then we have to call XLogAcceptWrites().  If so
then I think it's okay to have an additional state, just wanted to
know what idea you had in mind?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, May 12, 2021 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:
> > > I think I have much easier solution than this, will post that with update version patch set tomorrow.
> >
> > I don't know what you have in mind, but based on this discussion, it
> > seems to me that we should just have 5 states instead of 4:
> >

I had to have two different ideas, the first one is a little bit
aligned with the approach you mentioned below but without introducing
a new state. Basically, what we want is to restrict any backend that
connects to the server and write a WAL record while we are doing
XLogAcceptWrites(). For XLogAcceptWrites() skip we do already have a
flag for that, when that flag is set (i.e. XLogAcceptWrites() skipped
previously) then treat the system as read-only (i.e. WAL prohibited)
until XLogAcceptWrites() finishes. In that case, our IsWALProhibited()
function will be:

bool
IsWALProhibited(void)
{
    WALProhibitState cur_state;

    /*
     * If essential operations are needed to enable wal writes are skipped
     * previously then treat this state as WAL prohibited until that gets
     * done.
     */
    if (unlikely(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_SKIPPED))
        return true;

    cur_state = GetWALProhibitState(GetWALProhibitCounter());

    return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
            cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
}

Another idea that I want to propose & did the changes according to in
the attached version is making IsWALProhibited() something like this:

bool
IsWALProhibited(void)
{
    /* Other than read-write state will be considered as read-only */
    return (GetWALProhibitState(GetWALProhibitCounter()) !=
            WALPROHIBIT_STATE_READ_WRITE);
}

But this needs some additional changes to CompleteWALProhibitChange()
function where the final in-memory system state update happens
differently i.e. before or after emitting a global barrier.

When in-memory WAL prohibited state is _GOING_READ_WRITE then
in-memory state immediately changes to _READ_WRITE.  After that global
barrier is emitted for other backends to change their local state.
This should be harmless because a _READ_WRITE system could have
_READ_ONLY and _READ_WRITE backends.

But when the in-memory WAL prohibited state is _GOING_READ_ONLY then
in-memory update for the final state setting is not going to happen
before the global barrier. We cannot say the system is _READ_ONLY
until we ensure that all backends are _READ_ONLY.

For more details please have a look at CompleteWALProhibitChange().
Note that XLogAcceptWrites() happens before
CompleteWALProhibitChange() so if any backend connect while
XLogAcceptWrites() is in progress and will not allow WAL writes until it
gets finished and CompleteWALProhibitChange() executed.

The second approach is much better, IMO, because IsWALProhibited() is
much lighter which would run a number of times when a new backend
connects and/or its LocalXLogInsertAllowed cached value gets reset.
Perhaps, you could argue that the number of calls might not be that
much due to the locally cached value in LocalXLogInsertAllowed, but I
am in favour of having less work.

Apart from this, I made a separate patch for XLogAcceptWrites()
refactoring. Now, each patch can be compiled without having the next
patch on top of it.

> > 1. WAL is permitted.
> > 2. WAL is being prohibited but some backends may not know about the change yet.
> > 3. WAL is prohibited.
> > 4. WAL is in the process of being permitted but XLogAcceptWrites() may
> > not have been called yet.
> > 5. WAL is in the process of being permitted and XLogAcceptWrites() has
> > been called but some backends may not know about the change yet.
> >
> > If we're in state #3 and someone does pg_prohibit_wal(false) then we
> > enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
> > state #5, and pushes out a barrier. Then it waits for the barrier to
> > be absorbed and, when it has been, it moves us to state #1. Then if
> > someone does pg_prohibit_wal(true) we move to state #2. The
> > checkpointer pushes out a barrier and waits for it to be absorbed.
> > Then it calls XLogFlush() and afterward moves us to state #3.
> >
> > We can have any (reasonable) number of states that we want. There's
> > nothing magical about 4.
>
> Your idea makes sense, but IMHO, if we are first writing
> XLogAcceptWrites() and then pushing out the barrier, then I don't
> understand the meaning of having state #4.  I mean whenever any
> backend receives the barrier the system will always be in state #5.
> So what do we want to do with state #4?
>
> Is it just to make the state machine better?  I mean in the checkpoint
> process, we don't need separate "if checks" whether the
> XLogAcceptWrites() is called or not, instead we can just rely on the
> state, if it is #4 then we have to call XLogAcceptWrites().  If so
> then I think it's okay to have an additional state, just wanted to
> know what idea you had in mind?
>
AFAICU, that proposed state #4 is to restrict the newly connected
backend from WAL writes.  My first approach doing the same by changing
IsWALProhibited() a bit.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Your idea makes sense, but IMHO, if we are first writing
> XLogAcceptWrites() and then pushing out the barrier, then I don't
> understand the meaning of having state #4.  I mean whenever any
> backend receives the barrier the system will always be in state #5.
> So what do we want to do with state #4?

Well, if you don't have that, how does the checkpointer know that it's
supposed to push out the barrier?

You and Amul both seem to want to merge states #4 and #5. But how to
make that work? Basically what you are both saying is that, after we
move into the "going read-write" state, backends aren't immediately
told that they can write WAL, but have to keep checking back. But this
could be expensive. If you have one state that means that the
checkpointer has been requested to run XLogAcceptWrites() and push out
a barrier, and another state to mean that it has done so, then you
avoid that. Maybe that overhead wouldn't be large anyway, but it seems
like it's only necessary because you're trying to merge two states
which, from a logical point of view, are separate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Thu, May 13, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Your idea makes sense, but IMHO, if we are first writing
> > XLogAcceptWrites() and then pushing out the barrier, then I don't
> > understand the meaning of having state #4.  I mean whenever any
> > backend receives the barrier the system will always be in state #5.
> > So what do we want to do with state #4?
>
> Well, if you don't have that, how does the checkpointer know that it's
> supposed to push out the barrier?
>
> You and Amul both seem to want to merge states #4 and #5. But how to
> make that work? Basically what you are both saying is that, after we
> move into the "going read-write" state, backends aren't immediately
> told that they can write WAL, but have to keep checking back. But this
> could be expensive. If you have one state that means that the
> checkpointer has been requested to run XLogAcceptWrites() and push out
> a barrier, and another state to mean that it has done so, then you
> avoid that. Maybe that overhead wouldn't be large anyway, but it seems
> like it's only necessary because you're trying to merge two states
> which, from a logical point of view, are separate.

I don't have an objection to having 5 states, just wanted to
understand your reasoning.  So it makes sense to me.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:
>

Thanks for the updated patch, while going through I noticed this comment.

+ /*
+ * WAL prohibit state changes not allowed during recovery except the crash
+ * recovery case.
+ */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow state change during recovery?  Do you still
need it after the latest changes you discussed here, I mean now
XLogAcceptWrites() being called before sending barrier to backends.
So now we are not afraid that the backend will write WAL before we
call XLogAcceptWrites().  So now IMHO, we don't need to keep the
system in recovery until pg_prohibit_wal(false) is called, right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:
> >
>
> Thanks for the updated patch, while going through I noticed this comment.
>
> + /*
> + * WAL prohibit state changes not allowed during recovery except the crash
> + * recovery case.
> + */
> + PreventCommandDuringRecovery("pg_prohibit_wal()");
>
> Why do we need to allow state change during recovery?  Do you still
> need it after the latest changes you discussed here, I mean now
> XLogAcceptWrites() being called before sending barrier to backends.
> So now we are not afraid that the backend will write WAL before we
> call XLogAcceptWrites().  So now IMHO, we don't need to keep the
> system in recovery until pg_prohibit_wal(false) is called, right?
>

Your understanding is correct, and the previous patch also does the same, but
the code comment is wrong.  Fixed in the attached version, also rebased for the
latest master head. Sorry for the confusion.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Thu, May 13, 2021 at 2:54 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> >
> > Thanks for the updated patch, while going through I noticed this comment.
> >
> > + /*
> > + * WAL prohibit state changes not allowed during recovery except the crash
> > + * recovery case.
> > + */
> > + PreventCommandDuringRecovery("pg_prohibit_wal()");
> >
> > Why do we need to allow state change during recovery?  Do you still
> > need it after the latest changes you discussed here, I mean now
> > XLogAcceptWrites() being called before sending barrier to backends.
> > So now we are not afraid that the backend will write WAL before we
> > call XLogAcceptWrites().  So now IMHO, we don't need to keep the
> > system in recovery until pg_prohibit_wal(false) is called, right?
> >
>
> Your understanding is correct, and the previous patch also does the same, but
> the code comment is wrong.  Fixed in the attached version, also rebased for the
> latest master head. Sorry for the confusion.

Great thanks.  I will review the remaining patch soon.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Great thanks.  I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
     Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
     Assert(mainrdata_len == 0);

+    /*
+     * WAL permission must have checked before entering the critical section.
+     * Otherwise, WAL prohibited error will force system panic.
+     */
+    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
!CritSectionCount);
+
     /* cross-check on whether we should be here or not */
-    if (!XLogInsertAllowed())
-        elog(ERROR, "cannot make new WAL entries during recovery");
+    CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section,
instead if we are here we must be sure that
WAL is permitted, so better put an assert.  Even if that is ensured by
some other mean then also I don't
see any reason for calling this error generating function.

===

+CheckWALPermitted(void)
+{
+    if (!XLogInsertAllowed())
+        ereport(ERROR,
+                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+                 errmsg("system is now read only")));
+

system is now read only ->  wal is prohibited (in error message)

===

-     * We can't write WAL in recovery mode, so there's no point trying to
+     * We can't write WAL during read-only mode, so there's no point trying to

during read-only mode -> if WAL is prohibited or WAL recovery in
progress (add recovery in progress and also modify read-only to wal
prohibited)

===

+        if (!XLogInsertAllowed())
         {
             GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-            GUC_check_errmsg("cannot set transaction read-write mode
during recovery");
+            GUC_check_errmsg("cannot set transaction read-write mode
while system is read only");
             return false;
         }

system is read only -> WAL is prohibited

===

I think that's all, I have to say about 0003.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Great thanks.  I will review the remaining patch soon.
>
> I have reviewed v28-0003, and I have some comments on this.
>
> ===
> @@ -126,9 +127,14 @@ XLogBeginInsert(void)
>      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
>      Assert(mainrdata_len == 0);
>
> +    /*
> +     * WAL permission must have checked before entering the critical section.
> +     * Otherwise, WAL prohibited error will force system panic.
> +     */
> +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
> !CritSectionCount);
> +
>      /* cross-check on whether we should be here or not */
> -    if (!XLogInsertAllowed())
> -        elog(ERROR, "cannot make new WAL entries during recovery");
> +    CheckWALPermitted();
>
> We must not call CheckWALPermitted inside the critical section,
> instead if we are here we must be sure that
> WAL is permitted, so better put an assert.  Even if that is ensured by
> some other mean then also I don't
> see any reason for calling this error generating function.
>

I understand that we should not have an error inside a critical section but
this check is not wrong. Patch has enough checking so that errors due to WAL
prohibited state must not hit in the critical section, see assert just before
CheckWALPermitted().  Before entering into the critical section, we do have an
explicit WAL prohibited check. And to make sure that check has been done for
all current critical section for the wal writes, we have aforesaid assert
checking, for more detail on this please have a look at the "WAL prohibited
system state" section of src/backend/access/transam/README added in 0004 patch.
This assertion also ensures that future development does not miss the WAL
prohibited state check before entering into a newly added critical section for
WAL writes.

> ===
>
> +CheckWALPermitted(void)
> +{
> +    if (!XLogInsertAllowed())
> +        ereport(ERROR,
> +                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
> +                 errmsg("system is now read only")));
> +
>
> system is now read only ->  wal is prohibited (in error message)
>
> ===
>
> -     * We can't write WAL in recovery mode, so there's no point trying to
> +     * We can't write WAL during read-only mode, so there's no point trying to
>
> during read-only mode -> if WAL is prohibited or WAL recovery in
> progress (add recovery in progress and also modify read-only to wal
> prohibited)
>
> ===
>
> +        if (!XLogInsertAllowed())
>          {
>              GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
> -            GUC_check_errmsg("cannot set transaction read-write mode
> during recovery");
> +            GUC_check_errmsg("cannot set transaction read-write mode
> while system is read only");
>              return false;
>          }
>
> system is read only -> WAL is prohibited
>
> ===

Fixed all in the attached version.

>
> I think that's all, I have to say about 0003.
>

Thanks for the review.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Mon, May 17, 2021 at 11:48 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Great thanks.  I will review the remaining patch soon.
> >
> > I have reviewed v28-0003, and I have some comments on this.
> >
> > ===
> > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> >      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
> >      Assert(mainrdata_len == 0);
> >
> > +    /*
> > +     * WAL permission must have checked before entering the critical section.
> > +     * Otherwise, WAL prohibited error will force system panic.
> > +     */
> > +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
> > !CritSectionCount);
> > +
> >      /* cross-check on whether we should be here or not */
> > -    if (!XLogInsertAllowed())
> > -        elog(ERROR, "cannot make new WAL entries during recovery");
> > +    CheckWALPermitted();
> >
> > We must not call CheckWALPermitted inside the critical section,
> > instead if we are here we must be sure that
> > WAL is permitted, so better put an assert.  Even if that is ensured by
> > some other mean then also I don't
> > see any reason for calling this error generating function.
> >
>
> I understand that we should not have an error inside a critical section but
> this check is not wrong. Patch has enough checking so that errors due to WAL
> prohibited state must not hit in the critical section, see assert just before
> CheckWALPermitted().  Before entering into the critical section, we do have an
> explicit WAL prohibited check. And to make sure that check has been done for
> all current critical section for the wal writes, we have aforesaid assert
> checking, for more detail on this please have a look at the "WAL prohibited
> system state" section of src/backend/access/transam/README added in 0004 patch.
> This assertion also ensures that future development does not miss the WAL
> prohibited state check before entering into a newly added critical section for
> WAL writes.

I think we need CheckWALPermitted(); check, in XLogBeginInsert()
function because if XLogBeginInsert() maybe called outside critical
section e.g. pg_truncate_visibility_map() then we should error out.
So this check make sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attached is rebase for the latest master head.  Also, I added one more
refactoring code that deduplicates the code setting database state in the
control file. The same code set the database state is also needed for this
feature.

Regards.
Amul

On Mon, May 17, 2021 at 1:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 11:48 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Great thanks.  I will review the remaining patch soon.
> > >
> > > I have reviewed v28-0003, and I have some comments on this.
> > >
> > > ===
> > > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> > >      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
> > >      Assert(mainrdata_len == 0);
> > >
> > > +    /*
> > > +     * WAL permission must have checked before entering the critical section.
> > > +     * Otherwise, WAL prohibited error will force system panic.
> > > +     */
> > > +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
> > > !CritSectionCount);
> > > +
> > >      /* cross-check on whether we should be here or not */
> > > -    if (!XLogInsertAllowed())
> > > -        elog(ERROR, "cannot make new WAL entries during recovery");
> > > +    CheckWALPermitted();
> > >
> > > We must not call CheckWALPermitted inside the critical section,
> > > instead if we are here we must be sure that
> > > WAL is permitted, so better put an assert.  Even if that is ensured by
> > > some other mean then also I don't
> > > see any reason for calling this error generating function.
> > >
> >
> > I understand that we should not have an error inside a critical section but
> > this check is not wrong. Patch has enough checking so that errors due to WAL
> > prohibited state must not hit in the critical section, see assert just before
> > CheckWALPermitted().  Before entering into the critical section, we do have an
> > explicit WAL prohibited check. And to make sure that check has been done for
> > all current critical section for the wal writes, we have aforesaid assert
> > checking, for more detail on this please have a look at the "WAL prohibited
> > system state" section of src/backend/access/transam/README added in 0004 patch.
> > This assertion also ensures that future development does not miss the WAL
> > prohibited state check before entering into a newly added critical section for
> > WAL writes.
>
> I think we need CheckWALPermitted(); check, in XLogBeginInsert()
> function because if XLogBeginInsert() maybe called outside critical
> section e.g. pg_truncate_visibility_map() then we should error out.
> So this check make sense to me.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is rebase for the latest master head.  Also, I added one more
> refactoring code that deduplicates the code setting database state in the
> control file. The same code set the database state is also needed for this
> feature.

I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
XLogReaderFree() then have to move down even further. Since the goal
here is to get to a situation where we sometimes XLogAcceptWrites()
after InRecovery = false, it didn't seem nice for this refactoring
patch to still end up with a situation where this stuff happens while
InRecovery = true. In fact, with the patch, the amount of code that
runs with InRecovery = true actually *increases*, which is not what I
think should be happening here. That's why the patch ends up having to
adjust SetMultiXactIdLimit to not Assert(!InRecovery).

And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:

XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);

For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.

So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and if the latter how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.

My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:

    LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

    XLogCtl->LogwrtResult = LogwrtResult;

    XLogCtl->LogwrtRqst.Write = EndOfLog;
    XLogCtl->LogwrtRqst.Flush = EndOfLog;

Presumably we could relatively easily change things around so that we
finish one of those values ... probably one of the "write" values ..
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then XLogAcceptWrites() can fish those values out of there as well,
which should be enough to make it work and do the same thing
regardless of which process is calling it. But I have run out of time
for today so have not explored coding that up.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> My 0003 is where I see some lingering problems. It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog?

On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.

As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:

    /*
     * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
     * the end-of-log. It could be different from the timeline that EndOfLog
     * nominally belongs to, if there was a timeline switch in that segment,
     * and we were reading the old WAL from a segment belonging to a higher
     * timeline.
     */
    EndOfLogTLI = xlogreader->seg.ws_tli;

The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?
Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > My 0003 is where I see some lingering problems. It creates
> > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > need the xlogreader. But it doesn't really solve the problem of how
> > checkpointer.c would be able to call this function with proper
> > arguments. It is at least better in not needing two arguments to
> > decide what to do, but how is checkpointer.c supposed to know what to
> > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > what to pass for EndOfLogTLI and EndOfLog?
>
> On further study, I found another problem: the way my patch set leaves
> things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> will not be correctly initialized in any process other than the
> startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> would just be skipped. Your 0001 seems to have the same problem. You
> added Assert(AmStartupProcess()) to the inside of the if
> (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> Outside the startup process, ArchiveRecoveryRequested will always be
> false, but the point is that the associated stuff should be done if
> ArchiveRecoveryRequested would have been true in the startup process.
> Both of our patch sets leave things in a state where that would never
> happen, which is not good. Unless I'm missing something, it seems like
> maybe you didn't test your patches to verify that, when the
> XLogAcceptWrites() call comes from the checkpointer, all the same
> things happen that would have happened had it been called from the
> startup process. That would be a really good thing to have tested
> before posting your patches.
>

My bad, I am extremely sorry about that. I usually do test my patches,
but somehow I failed to test this change due to manually testing the
whole ASRO feature and hurrying in posting the newest version.

I will try to be more careful next time.

> As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> several TLIs stored in XLogCtl. None of them seem to be precisely the
> same thing as EndLogTLI, but I am hoping that replayEndTLI is close
> enough. I found out pretty quickly through testing that replayEndTLI
> isn't always valid -- it ends up 0 if we don't enter recovery. That's
> not really a problem, though, because we only need it to be valid if
> ArchiveRecoveryRequested. The code that initializes and updates it
> seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> = true will force InRecovery = true. So it looks to me like
> replayEndTLI will always be initialized in the cases where we need a
> value. It's not yet entirely clear to me if it has to have the same
> value as EndOfLogTLI. I find this code comment quite mysterious:
>
>     /*
>      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>      * the end-of-log. It could be different from the timeline that EndOfLog
>      * nominally belongs to, if there was a timeline switch in that segment,
>      * and we were reading the old WAL from a segment belonging to a higher
>      * timeline.
>      */
>     EndOfLogTLI = xlogreader->seg.ws_tli;
>
> The thing is, if we were reading old WAL from a segment belonging to a
> higher timeline, wouldn't we have switched to that new timeline?

AFAIUC, by browsing the code, yes, we are switching to the new
timeline.  Along with lastReplayedTLI, lastReplayedEndRecPtr is also
the same as the EndOfLog that we needed when ArchiveRecoveryRequested
is true.

I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.

> Suppose we want WAL segment 246 from TLI 1, but we don't have that
> segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> the TLI 2 version, we'd need to have TLI 2 in the history of the
> recovery_target_timeline. And if that is the case, then we would have
> to replay through the record where the timeline changes. And if we do
> that, then the discrepancy postulated by the comment cannot still
> exist by the time we reach this code, because this code is only
> reached after we finish WAL redo. So I'm baffled as to how this can
> happen, but considering how many cases there are in this code, I sure
> can't promise that it doesn't. The fact that we have few tests for any
> of this doesn't help either.

I am not an expert in this area, but will try to spend some more time
on understanding and testing.

1] postgr.es/m/555DD101.7080209@iki.fi

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Jul 28, 2021 at 4:37 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > My 0003 is where I see some lingering problems. It creates
> > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > > need the xlogreader. But it doesn't really solve the problem of how
> > > checkpointer.c would be able to call this function with proper
> > > arguments. It is at least better in not needing two arguments to
> > > decide what to do, but how is checkpointer.c supposed to know what to
> > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > > what to pass for EndOfLogTLI and EndOfLog?
> >
> > On further study, I found another problem: the way my patch set leaves
> > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> > will not be correctly initialized in any process other than the
> > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> > would just be skipped. Your 0001 seems to have the same problem. You
> > added Assert(AmStartupProcess()) to the inside of the if
> > (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> > Outside the startup process, ArchiveRecoveryRequested will always be
> > false, but the point is that the associated stuff should be done if
> > ArchiveRecoveryRequested would have been true in the startup process.
> > Both of our patch sets leave things in a state where that would never
> > happen, which is not good. Unless I'm missing something, it seems like
> > maybe you didn't test your patches to verify that, when the
> > XLogAcceptWrites() call comes from the checkpointer, all the same
> > things happen that would have happened had it been called from the
> > startup process. That would be a really good thing to have tested
> > before posting your patches.
> >
>
> My bad, I am extremely sorry about that. I usually do test my patches,
> but somehow I failed to test this change due to manually testing the
> whole ASRO feature and hurrying in posting the newest version.
>
> I will try to be more careful next time.
>

I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.

When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
  (errmsg("starting archive recovery")));
  }

- /*
- * Take ownership of the wakeup latch if we're going to sleep during
- * recovery.
- */
  if (ArchiveRecoveryRequested)
+ {
+ /*
+ * Take ownership of the wakeup latch if we're going to sleep during
+ * recovery.
+ */
  OwnLatch(&XLogCtl->recoveryWakeupLatch);

+ /*
+ * Since archive recovery is requested, we cannot be in a wal prohibited
+ * state.
+ */
+ if (ControlFile->wal_prohibited)
+ {
+ /* No need to hold ControlFileLock yet, we aren't up far enough */
+ ControlFile->wal_prohibited = false;
+ ControlFile->time = (pg_time_t) time(NULL);
+ UpdateControlFile();
+
+ ereport(LOG,
+ (errmsg("clearing WAL prohibition because the system is in archive
recovery")));
+ }
+ }
+


> > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> > several TLIs stored in XLogCtl. None of them seem to be precisely the
> > same thing as EndLogTLI, but I am hoping that replayEndTLI is close
> > enough. I found out pretty quickly through testing that replayEndTLI
> > isn't always valid -- it ends up 0 if we don't enter recovery. That's
> > not really a problem, though, because we only need it to be valid if
> > ArchiveRecoveryRequested. The code that initializes and updates it
> > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> > = true will force InRecovery = true. So it looks to me like
> > replayEndTLI will always be initialized in the cases where we need a
> > value. It's not yet entirely clear to me if it has to have the same
> > value as EndOfLogTLI. I find this code comment quite mysterious:
> >
> >     /*
> >      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >      * the end-of-log. It could be different from the timeline that EndOfLog
> >      * nominally belongs to, if there was a timeline switch in that segment,
> >      * and we were reading the old WAL from a segment belonging to a higher
> >      * timeline.
> >      */
> >     EndOfLogTLI = xlogreader->seg.ws_tli;
> >
> > The thing is, if we were reading old WAL from a segment belonging to a
> > higher timeline, wouldn't we have switched to that new timeline?
>
> AFAIUC, by browsing the code, yes, we are switching to the new
> timeline.  Along with lastReplayedTLI, lastReplayedEndRecPtr is also
> the same as the EndOfLog that we needed when ArchiveRecoveryRequested
> is true.
>
> I went through the original commit 7cbee7c0a1db and the thread[1] but
> didn't find any related discussion for that.
>
> > Suppose we want WAL segment 246 from TLI 1, but we don't have that
> > segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> > the TLI 2 version, we'd need to have TLI 2 in the history of the
> > recovery_target_timeline. And if that is the case, then we would have
> > to replay through the record where the timeline changes. And if we do
> > that, then the discrepancy postulated by the comment cannot still
> > exist by the time we reach this code, because this code is only
> > reached after we finish WAL redo. So I'm baffled as to how this can
> > happen, but considering how many cases there are in this code, I sure
> > can't promise that it doesn't. The fact that we have few tests for any
> > of this doesn't help either.
>
> I am not an expert in this area, but will try to spend some more time
> on understanding and testing.
>
> 1] postgr.es/m/555DD101.7080209@iki.fi
>
> Regards,
> Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Dilip Kumar
Date:
On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:
>
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:
>
> @@ -6614,13 +6629,30 @@ StartupXLOG(void)
>   (errmsg("starting archive recovery")));
>   }
>
> - /*
> - * Take ownership of the wakeup latch if we're going to sleep during
> - * recovery.
> - */
>   if (ArchiveRecoveryRequested)
> + {
> + /*
> + * Take ownership of the wakeup latch if we're going to sleep during
> + * recovery.
> + */
>   OwnLatch(&XLogCtl->recoveryWakeupLatch);
>
> + /*
> + * Since archive recovery is requested, we cannot be in a wal prohibited
> + * state.
> + */
> + if (ControlFile->wal_prohibited)
> + {
> + /* No need to hold ControlFileLock yet, we aren't up far enough */
> + ControlFile->wal_prohibited = false;
> + ControlFile->time = (pg_time_t) time(NULL);
> + UpdateControlFile();
> +

Is there some reason why we are forcing 'wal_prohibited' to off if we
are doing archive recovery?  It might have already been discussed, but
I could not find it on a quick look into the thread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Jul 29, 2021 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
> >
> > @@ -6614,13 +6629,30 @@ StartupXLOG(void)
> >   (errmsg("starting archive recovery")));
> >   }
> >
> > - /*
> > - * Take ownership of the wakeup latch if we're going to sleep during
> > - * recovery.
> > - */
> >   if (ArchiveRecoveryRequested)
> > + {
> > + /*
> > + * Take ownership of the wakeup latch if we're going to sleep during
> > + * recovery.
> > + */
> >   OwnLatch(&XLogCtl->recoveryWakeupLatch);
> >
> > + /*
> > + * Since archive recovery is requested, we cannot be in a wal prohibited
> > + * state.
> > + */
> > + if (ControlFile->wal_prohibited)
> > + {
> > + /* No need to hold ControlFileLock yet, we aren't up far enough */
> > + ControlFile->wal_prohibited = false;
> > + ControlFile->time = (pg_time_t) time(NULL);
> > + UpdateControlFile();
> > +
>
> Is there some reason why we are forcing 'wal_prohibited' to off if we
> are doing archive recovery?  It might have already been discussed, but
> I could not find it on a quick look into the thread.
>

Here is: https://postgr.es/m/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Prabhat Sahu
Date:
Hi,

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:

Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.

This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.

Hi,

I have been testing “ALTER SYSTEM READ ONLY” and wrote a few tap test cases for this feature.
Please find the test case(Draft version) attached herewith, to be applied on top of the v30 patch by Amul.
Kindly have a review and let me know the required changes.
--

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attached is the rebase version on top of the latest master head
includes refactoring patches posted by Robert.

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
>
> Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> kind of been wondering: why not have XLogAcceptWrites() be the
> responsibility of the checkpointer all the time, in every case? That
> would require fixing some more things, and this is one of them, but
> then it would be consistent, which means that any bugs would be likely
> to get found and fixed. If calling XLogAcceptWrites() from the
> checkpointer is some funny case that only happens when the system
> crashes while WAL is prohibited, then we might fail to notice that we
> have a bug.
>

Unfortunately, I didn't get much time to think about this and don't
have a strong opinion on it either.

> This is especially true given that we have very little test coverage
> in this area. Andres was ranting to me about this earlier this week,
> and I wasn't sure he was right, but then I noticed that we have
> exactly zero tests in the entire source tree that make use of
> recovery_end_command. We really need a TAP test for that, I think.
> It's too scary to do much reorganization of the code without having
> any tests at all for the stuff we're moving around. Likewise, we're
> going to need TAP tests for the stuff that is specific to this patch.
> For example, we should have a test that crashes the server while it's
> read only, brings it back up, checks that we still can't write WAL,
> then re-enables WAL, and checks that we now can write WAL. There are
> probably a bunch of other things that we should test, too.
>

Yes, my next plan is to work on the TAP tests and look into the patch
posted by Prabhat to improve test coverage.

Regards,
Amul Sul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attached is the rebased version for the latest master head. Also,
added tap tests to test some part of this feature and a separate patch
to test recovery_end_command execution.

I have also been through Prabhat's patch which helps me to write
current tests, but I am not sure about the few basic tests that he
included in the tap test which can be done using pg_regress otherwise,
e.g. checking permission to execute the pg_prohibit_wal() function.
Those basic tests I am yet to add, is it ok to add those tests in
pg_regress instead of TAP? The problem I see is that all the tests
covering a feature will not be together, which I think is not correct.

What is usual practice, can have a few tests in TAP and a few in
pg_regress for the same feature?

Regards,
Amul





On Wed, Aug 4, 2021 at 6:26 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebase version on top of the latest master head
> includes refactoring patches posted by Robert.
>
> On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> > > I was too worried about how I could miss that & after thinking more
> > > about that, I realized that the operation for ArchiveRecoveryRequested
> > > is never going to be skipped in the startup process and that never
> > > left for the checkpoint process to do that later. That is the reason
> > > that assert was added there.
> > >
> > > When ArchiveRecoveryRequested, the server will no longer be in
> > > the wal prohibited mode, we implicitly change the state to
> > > wal-permitted. Here is the snip from the 0003 patch:
> >
> > Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> > kind of been wondering: why not have XLogAcceptWrites() be the
> > responsibility of the checkpointer all the time, in every case? That
> > would require fixing some more things, and this is one of them, but
> > then it would be consistent, which means that any bugs would be likely
> > to get found and fixed. If calling XLogAcceptWrites() from the
> > checkpointer is some funny case that only happens when the system
> > crashes while WAL is prohibited, then we might fail to notice that we
> > have a bug.
> >
>
> Unfortunately, I didn't get much time to think about this and don't
> have a strong opinion on it either.
>
> > This is especially true given that we have very little test coverage
> > in this area. Andres was ranting to me about this earlier this week,
> > and I wasn't sure he was right, but then I noticed that we have
> > exactly zero tests in the entire source tree that make use of
> > recovery_end_command. We really need a TAP test for that, I think.
> > It's too scary to do much reorganization of the code without having
> > any tests at all for the stuff we're moving around. Likewise, we're
> > going to need TAP tests for the stuff that is specific to this patch.
> > For example, we should have a test that crashes the server while it's
> > read only, brings it back up, checks that we still can't write WAL,
> > then re-enables WAL, and checks that we now can write WAL. There are
> > probably a bunch of other things that we should test, too.
> >
>
> Yes, my next plan is to work on the TAP tests and look into the patch
> posted by Prabhat to improve test coverage.
>
> Regards,
> Amul Sul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Tue, Aug 31, 2021 at 8:16 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is the rebased version for the latest master head. Also,
> added tap tests to test some part of this feature and a separate patch
> to test recovery_end_command execution.

It looks like you haven't given any thought to writing that in a way
that will work on Windows?

> What is usual practice, can have a few tests in TAP and a few in
> pg_regress for the same feature?

Sure, there's no problem with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:


On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:


> On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?

Ok will do that tomorrow, thanks.

Regards,
Amul

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, Sep 7, 2021 at 10:02 PM Amul Sul <sulamul@gmail.com> wrote:
>
>
>
> On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>
>>
>>
>> > On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
>> >
>> > Attached is the rebased version for the latest master head.
>>
>> Hi Amul!
>>
>> Could you please rebase again?
>
>
> Ok will do that tomorrow, thanks.
>

Here is the rebased version. I have added a few more test cases,
perhaps needing more tests and optimization to it, that I'll try in
the next version.  I dropped the patch for recovery_end_command
testing & will post that separately.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Here is the rebased version.

v33-0004

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h
indirectlyincluded from a much larger set of files.  Maybe that's ok.  I don't know.  But it seems you are doing this
merelyto get the symbol (not even the definition) for struct DBState.  I'd recommend rearranging the code so this isn't
necessary,but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from
xlogdesc.c,xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c. 

v33-0005

This patch makes bool XLogInsertAllowed() more complicated than before.  The result used to depend mostly on the value
ofLocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress().
Therewas an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean
"true"and "false", with the basis for that rule being the coding of XLogInsertAllowed().  Now that the function is more
complicated,this rule seems even more arcane.  Can we change the logic to not depend on casting an integer to bool? 

The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e.
walwrites permitted". 

The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than
actuallydoing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint
process".

The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a
readerof the function name could already surmise.  Perhaps the comment can better clarify what is happening: "Set up
walprobibit shared state" 

The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase:  "As in
ProcessSyncRequests,we don't want to stop wal prohibit change requests".  The nearby comment reads "stop absorbing".  I
thinkthis one should read "stop processing".  This same comment is used again below.   Then a third comment reads "For
thesame reason mentioned previously for the wal prohibit state change request check."  That third comment is too glib. 

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and
WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear.  Waiting for a
statechange is clear enough.  But how is waiting on a state different? 

xlog.h defines a new enum.  I don't find any of it clear; not the comment, nor the name of the enum, nor the names of
thevalues: 

/* State of work that enables wal writes */
typedef enum XLogAcceptWritesState
{
    XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
    XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
    XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
} XLogAcceptWritesState;

This enum seems to have been written from the point of view of someone who already knew what it was for.  It needs to
bewritten in a way that will be clear to people who have no idea what it is for. 

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building
indexes"reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */ 

The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new
functioncomment says: 

/*
 * In opposite to the above assertion if a transaction doesn't have valid XID
 * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
 * prohibited.  Therefore, we need to explicitly error out before entering into
 * the critical section.
 */

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which
needswal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process.  I'm given
towonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be.  Does it
triggera panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster.  If there
issome reason this is not a problem, I think the comment should explain it.  In particular, why is it sufficient to
checkwhether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed
throughthe lifetime of that critical section? 

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal.  Why should pg_prohibit_wal differ from
othersignal sending functions in whether it returns a boolean?  If you believe it must always succeed, you can still
defineit as returning a boolean and always return true.  That leaves the door open to future code changes which might
needto return false for some reason. 

But I also don't like the idea that existing transactions with xids are immediately killed.  Shouldn't this function
takean optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into
WALPROHIBIT_STATE_GOING_READ_ONLYfor a period of time before killing remaining transactions? 

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and
pg_prohibit_wal(false)means to allow wal.  Wouldn't a different function named pg_allow_wal() make it more clear?  This
alsowould be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout
parameterwhen allowing wal is less clearly useful. 

That's enough code review for now.  Next I will review your regression tests....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Sep 9, 2021 at 1:42 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building
indexes"reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */ 

Honestly that sentence doesn't sound very clear even with a different verb.

> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which
needswal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process.  I'm given
towonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be.  Does it
triggera panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster.  If there
issome reason this is not a problem, I think the comment should explain it.  In particular, why is it sufficient to
checkwhether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed
throughthe lifetime of that critical section? 

The idea here is that if a transaction already has an XID assigned, we
have to kill it off before we can declare the system read-only,
because it will definitely write WAL when the transaction ends: either
a commit record, or an abort record, but definitely something. So
cases where we write WAL without necessarily having an XID require
special handling. They have to check whether WAL has become prohibited
and error out if so, and they need to do so before entering the
critical section - because if the problem were detected for the first
time inside the critical section it would escalate to a PANIC, which
we do not want. Places where we're guaranteed to have an XID - e.g.
inserting a heap tuple - don't need a run-time check before entering
the critical section, because the code can't be reached in the first
place if the system is WAL-read-only.

> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and
pg_prohibit_wal(false)means to allow wal.  Wouldn't a different function named pg_allow_wal() make it more clear?  This
alsowould be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout
parameterwhen allowing wal is less clearly useful. 

Hmm, I find pg_prohibit_wal(true/false) better than pg_prohibit_wal()
and pg_allow_wal(), and would prefer pg_prohibit_wal(true/false,
timeout) over pg_prohibit_wal(timeout) and pg_allow_wal(), because I
think then once you find that one function you know how to do
everything about that feature, whereas the other way you need to find
both functions to have the whole story. That said, I can see why
somebody else might prefer something else.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 9, 2021, at 11:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> They have to check whether WAL has become prohibited
> and error out if so, and they need to do so before entering the
> critical section - because if the problem were detected for the first
> time inside the critical section it would escalate to a PANIC, which
> we do not want.

But that is the part that is still not clear.  Should the comment say that a concurrent change to prohibit wal after
thecurrent process checks but before the current process exists the critical section will result in a panic?  What is
unclearabout the comment is that it implies that a check before the critical section is sufficient, but ordinarily one
wouldexpect a lock to be held and the check-and-lock dance to carefully avoid any race condition.  If somehow this is
safe,the logic for why it is safe should be spelled out.  If not, a mia culpa saying, "hey, were not terribly safe
aboutthis" should be explicit in the comment. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>

Thank you, for looking at the patch.  Please see my reply inline below:

>
> > On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > Here is the rebased version.
>
> v33-0004
>
> This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h
indirectlyincluded from a much larger set of files.  Maybe that's ok.  I don't know.  But it seems you are doing this
merelyto get the symbol (not even the definition) for struct DBState.  I'd recommend rearranging the code so this isn't
necessary,but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from
xlogdesc.c,xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c. 
>

Yes, you are correct, xlog.h is included in more than 150 files. I was
wondering if we can have a forward declaration instead of including
pg_control.h (e.g. The same way struct XLogRecData was declared in
xlog.h). Perhaps, DBState is enum & I don't see we have done the same
for enum elsewhere as we are doing for structures, but that seems to
be fine, IMO.

Earlier, I was unsure before preparing this patch, but since that
makes sense (I assumed) and minimizes duplications, can we go ahead
and post separately with the same change in StartupXLOG() which I have
skipped for the same reason mentioned in patch commit-msg.

> v33-0005
>
> This patch makes bool XLogInsertAllowed() more complicated than before.  The result used to depend mostly on the
valueof LocalXLogInsertAllowed except that when that value was negative, the result was determined by
RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary
coercibleto boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed().  Now that
thefunction is more complicated, this rule seems even more arcane.  Can we change the logic to not depend on casting an
integerto bool? 
>

We can't use a boolean variable because LocalXLogInsertAllowed
represents three states as, 1 means "wal is allowed'', 0 for "wal is
disallowed", and -1 is for "need to check".

> The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only
i.e.wal writes permitted". 
>
> The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than
actuallydoing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint
process".
>
> The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a
readerof the function name could already surmise.  Perhaps the comment can better clarify what is happening: "Set up
walprobibit shared state" 
>
> The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase:  "As in
ProcessSyncRequests,we don't want to stop wal prohibit change requests".  The nearby comment reads "stop absorbing".  I
thinkthis one should read "stop processing".  This same comment is used again below.   Then a third comment reads "For
thesame reason mentioned previously for the wal prohibit state change request check."  That third comment is too glib. 
>
> tcop/utility.c needlessly includes "access/walprohibit.h"
>
> wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and
WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear.  Waiting for a
statechange is clear enough.  But how is waiting on a state different? 
>
> xlog.h defines a new enum.  I don't find any of it clear; not the comment, nor the name of the enum, nor the names of
thevalues: 
>
> /* State of work that enables wal writes */
> typedef enum XLogAcceptWritesState
> {
>     XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
>     XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
>     XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
> } XLogAcceptWritesState;
>
> This enum seems to have been written from the point of view of someone who already knew what it was for.  It needs to
bewritten in a way that will be clear to people who have no idea what it is for. 
>
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building
indexes"reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */ 
>

Will try to think about the pointed code comments for the improvements.

> The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the
newfunction comment says: 
>

CheckWALPermitted() calls XLogInsertAllowed() does check the
LocalXLogInsertAllowed flag which is local to that process only, and
nobody else reads that concurrently.

> /*
>  * In opposite to the above assertion if a transaction doesn't have valid XID
>  * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
>  * prohibited.  Therefore, we need to explicitly error out before entering into
>  * the critical section.
>  */
>
> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which
needswal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process.  I'm given
towonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be.  Does it
triggera panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster.  If there
issome reason this is not a problem, I think the comment should explain it.  In particular, why is it sufficient to
checkwhether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed
throughthe lifetime of that critical section? 
>

Hm, interrupts absorption are disabled inside the critical section.
The wal prohibited state for that process (here vacuum) will never get
set until it sees the interrupts & the system will not be said wal
prohibited until every process sees that interrupts. I am not sure we
should explain the characteristics of the critical section at this
place, if want, we can add a brief saying that inside the critical
section we should not worry about the state change which never happens
because interrupts are disabled there.

> v33-0007:
>
> I don't really like what the documentation has to say about pg_prohibit_wal.  Why should pg_prohibit_wal differ from
othersignal sending functions in whether it returns a boolean?  If you believe it must always succeed, you can still
defineit as returning a boolean and always return true.  That leaves the door open to future code changes which might
needto return false for some reason. 
>

Ok, I am fine to always return true.

> But I also don't like the idea that existing transactions with xids are immediately killed.  Shouldn't this function
takean optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into
WALPROHIBIT_STATE_GOING_READ_ONLYfor a period of time before killing remaining transactions? 
>

Ok, will check.

> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and
pg_prohibit_wal(false)means to allow wal.  Wouldn't a different function named pg_allow_wal() make it more clear?  This
alsowould be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout
parameterwhen allowing wal is less clearly useful. 
>

Like Robert, I am too inclined to have a single function that is easy
to remember. Apart from this, recently while testing this patch with
pgbench where I have exhausted the connection limit and want to change
the system's prohibited state in between but I was unable to do that,
I wish I could do that using the pg_clt option. How about having a
pg_clt option to alter wal prohibited state?

> That's enough code review for now.  Next I will review your regression tests....
>
Thanks again.



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 10, 2021, at 7:36 AM, Amul Sul <sulamul@gmail.com> wrote:
>
>> v33-0005
>>
>> This patch makes bool XLogInsertAllowed() more complicated than before.  The result used to depend mostly on the
valueof LocalXLogInsertAllowed except that when that value was negative, the result was determined by
RecoveryInProgress().There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary
coercibleto boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed().  Now that
thefunction is more complicated, this rule seems even more arcane.  Can we change the logic to not depend on casting an
integerto bool? 
>>
>
> We can't use a boolean variable because LocalXLogInsertAllowed
> represents three states as, 1 means "wal is allowed'', 0 for "wal is
> disallowed", and -1 is for "need to check".

I'm complaining that we're using an integer rather than an enum.  I'm ok if we define it so that WAL_ALLOWABLE_UNKNOWN
=-1, WAL_DISALLOWED = 0, WAL_ALLOWED = 1 or such, but the logic of the function has gotten complicated enough that
havingto remember which number represents which logical condition has become a (small) mental burden.  Given how hard
theWAL code is to read and fully grok, I'd rather avoid any unnecessary burden, even small ones. 

>> The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the
newfunction comment says: 
>>
>
> CheckWALPermitted() calls XLogInsertAllowed() does check the
> LocalXLogInsertAllowed flag which is local to that process only, and
> nobody else reads that concurrently.
>
>> /*
>> * In opposite to the above assertion if a transaction doesn't have valid XID
>> * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
>> * prohibited.  Therefore, we need to explicitly error out before entering into
>> * the critical section.
>> */
>>
>> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which
needswal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process.  I'm given
towonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be.  Does it
triggera panic?  I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster.  If there
issome reason this is not a problem, I think the comment should explain it.  In particular, why is it sufficient to
checkwhether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed
throughthe lifetime of that critical section? 
>>
>
> Hm, interrupts absorption are disabled inside the critical section.
> The wal prohibited state for that process (here vacuum) will never get
> set until it sees the interrupts & the system will not be said wal
> prohibited until every process sees that interrupts. I am not sure we
> should explain the characteristics of the critical section at this
> place, if want, we can add a brief saying that inside the critical
> section we should not worry about the state change which never happens
> because interrupts are disabled there.

I think the fact that interrupts are disabled during critical sections is understood, so there is no need to mention
that. The problem is that the method for taking the system read-only is less generally known, and readers of other
sectionsof code need to jump to the definition of CheckWALPermitted to read the comments and understand what it does.
Takefor example a code stanza from heapam.c: 

    if (needwal)
        CheckWALPermitted();

    /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();

Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an
interruptmight get processed *during* CheckWALPermitted, since that function isn't atomic.  It might happen after the
checkis meaningfully finished but before the function actually returns.  So I'm not inclined to believe that the way
thisall works is dependent on interrupts being blocked.  So I think, maybe this is all protected by some other scheme.
Butwhat?  It's not clear from the code comments for CheckWALPermitted, so I'm left having to reverse engineer the
systemto understand it. 

One interpretation is that the signal handler will exit() my backend if it receives a signal saying that the system is
goingread-only, so there is no race condition.  But then why the call to CheckWALPermitted()?  If this interpretation
werecorrect, we'd happily enter the critical section without checking, secure in the knowledge that as long as we
haven'texited yet, all is ok. 

Another interpretation is that the whole thing is just a performance trick.  Maybe we're ok with the idea that we will
occasionallymiss the fact that wal is prohibited, do whatever work we need in the critical section, and then fail
later. But if that is true, it had better not be a panic, because designing the system to panic 1% of the time (or
whateverpercent it works out to be) isn't project style.  So looking into the critical section in the heapam.c code, I
see:

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);

        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);

And jumping to the definition of XLogBeginInsert() I see

    /*
     * WAL permission must have checked before entering the critical section.
     * Otherwise, WAL prohibited error will force system panic.
     */

So now I'm flummoxed.  Is it that the code is broken, or is it that I don't know what the strategy behind all this is?
Ifthere were a code comment saying how this all works, I'd be in a better position to either know that it is truly safe
oralternately know that the strategy is wrong. 

Even if my analysis that this is all flawed is incorrect, I still think that a code comment would help.

>> v33-0007:
>>
>> I don't really like what the documentation has to say about pg_prohibit_wal.  Why should pg_prohibit_wal differ from
othersignal sending functions in whether it returns a boolean?  If you believe it must always succeed, you can still
defineit as returning a boolean and always return true.  That leaves the door open to future code changes which might
needto return false for some reason. 
>>
>
> Ok, I am fine to always return true.

Ok.

>> But I also don't like the idea that existing transactions with xids are immediately killed.  Shouldn't this function
takean optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into
WALPROHIBIT_STATE_GOING_READ_ONLYfor a period of time before killing remaining transactions? 
>>
>
> Ok, will check.
>
>> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and
pg_prohibit_wal(false)means to allow wal.  Wouldn't a different function named pg_allow_wal() make it more clear?  This
alsowould be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout
parameterwhen allowing wal is less clearly useful. 
>>
>
> Like Robert, I am too inclined to have a single function that is easy
> to remember.

For C language functions that take a bool argument, I can jump to the definition using ctags, and I assume most other
developerscan do so using whatever IDE they like.  For SQL functions, it's a bit harder to jump to the definition,
particularlyif you are logged into a production server where non-essential software is intentionally missing.  Then you
haveto wonder, what exactly is the boolean argument toggling here? 

I don't feel strongly about this, though, and you don't need to change it.

> Apart from this, recently while testing this patch with
> pgbench where I have exhausted the connection limit and want to change
> the system's prohibited state in between but I was unable to do that,
> I wish I could do that using the pg_clt option. How about having a
> pg_clt option to alter wal prohibited state?

I'd have to review the implementation, but sure, that sounds like a useful ability.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 10, 2021, at 8:42 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Take for example a code stanza from heapam.c:
>
>    if (needwal)
>        CheckWALPermitted();
>
>    /* NO EREPORT(ERROR) from here till changes are logged */
>    START_CRIT_SECTION();
>
> Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that
aninterrupt might get processed *during* CheckWALPermitted, since that function isn't atomic.  

A better example may be found in ginmetapage.c:

        needwal = RelationNeedsWAL(indexrel);
        if (needwal)
        {
            CheckWALPermitted();
            computeLeafRecompressWALData(leaf);
        }

        /* Apply changes to page */
        START_CRIT_SECTION();

Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument
can'tbe made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is
running.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Fri, Sep 10, 2021 at 12:20 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> A better example may be found in ginmetapage.c:
>
>         needwal = RelationNeedsWAL(indexrel);
>         if (needwal)
>         {
>             CheckWALPermitted();
>             computeLeafRecompressWALData(leaf);
>         }
>
>         /* Apply changes to page */
>         START_CRIT_SECTION();

Yeah, that looks sketchy. Why not move CheckWALPermitted() down a line?

> Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument
can'tbe made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is
running.

I think the relevant question here is not "could a signal handler
fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
question is the former, then there's no hope of ever making it work
because there's always a race condition. But the signal handler is
only setting flags whose only effect is to make a subsequent
CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
ProcessInterrupts().

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 10, 2021, at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think the relevant question here is not "could a signal handler
> fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
> question is the former, then there's no hope of ever making it work
> because there's always a race condition. But the signal handler is
> only setting flags whose only effect is to make a subsequent
> CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
> the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
> ProcessInterrupts().

Ok, that makes more sense.  I was reviewing the code after first reviewing the documentation changes, which lead me to
believethe system was designed to respond more quickly than that: 

+    WAL prohibited is a read-only system state. Any permitted user can call
+    <function>pg_prohibit_wal</function> function to forces the system into
+    a WAL prohibited mode where insert write ahead log will be prohibited until
+    the same function executed to change that state to read-write. Like Hot

and

+    Otherwise, it will be <literal>off</literal>.  When the user requests WAL
+    prohibited state, at that moment if any existing session is already running
+    a transaction, and that transaction has already been performed or planning
+    to perform wal write operations then the session running that transaction
+    will be terminated.

"forces the system" in the first part, and "at that moment ... that transaction will be terminated" sounds heavier
handedthan something which merely sets a flag asking the backend to exit.  I was reading that as more immediate and
thentrying to figure out how the signal handling could possibly work, and failing to see how. 

The README:

+Any
+backends which receive WAL prohibited system state transition barrier interrupt
+need to stop WAL writing immediately.  For barrier absorption the backed(s) will
+kill the running transaction which has valid XID indicates that the transaction
+has performed and/or planning WAL write.

uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is
heavierhanded than it is. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Fri, Sep 10, 2021 at 1:16 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is
heavierhanded than it is.
 

It's intended to be just as immediate as e.g. pg_cancel_backend() and
pg_terminate_backend(), which work just the same way, but not any more
so. I guess we could look at how things are worded in those cases.
From a user perspective such things are usually pretty immediate, but
not as immediate as firing a signal handler. Computers are fast.[1]

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://www.youtube.com/watch?v=6xijhqU8r2A



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:
>
> (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.

Andres,

I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned
pageheld by the cursor, when another session disables WAL.  It would be very interesting to test how the vacuum handles
thatspecific change.  I have not figured out the cleanest way to do this, though, as we don't as a project yet have a
standardway of setting up xid exhaustion in a regression test, do we?  The closest I saw to it was your work in [1],
butthat doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which
isn'tuseable from inside an isolation test.  Do you have a suggestion how best to continue developing out the test
infrastructure?


Amul,

The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the
isolationtester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test
toabort and all subsequent permutations to not run.  Attached patch v34-0009 fixes that. 

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a
differentsession.  The expected output is worth reviewing to see how this plays out.  I don't see anything in there
whichis obviously wrong, but some of it is a bit clunky.  For example, by the time the client sees an error "FATAL:
WALis now prohibited", the system may already have switched back to read-write.  Also, it is a bit strange to get one
ofthese errors on an attempted ROLLBACK.  Once again, not wrong as such, but clunky. 





[1]
https://www.postgresql.org/message-id/flat/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com#e10861372aec22119b66756ecbac581c

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
, On Sat, Jul 24, 2021 at 1:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is rebase for the latest master head.  Also, I added one more
> > refactoring code that deduplicates the code setting database state in the
> > control file. The same code set the database state is also needed for this
> > feature.
>
> I started studying 0001 today and found that it rearranged the order
> of operations in StartupXLOG() more than I was expecting. It does, as
> per previous discussions, move a bunch of things to the place where we
> now call XLogParamters(). But, unsatisfyingly, InRecovery = false and
> XLogReaderFree() then have to move down even further. Since the goal
> here is to get to a situation where we sometimes XLogAcceptWrites()
> after InRecovery = false, it didn't seem nice for this refactoring
> patch to still end up with a situation where this stuff happens while
> InRecovery = true. In fact, with the patch, the amount of code that
> runs with InRecovery = true actually *increases*, which is not what I
> think should be happening here. That's why the patch ends up having to
> adjust SetMultiXactIdLimit to not Assert(!InRecovery).
>
> And then I started to wonder how this was ever going to work as part
> of the larger patch set, because as you have it here,
> XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
> XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
> checkpointer is calling that at a later time after the user issues
> pg_prohibit_wal(false), it's going to have none of those things. So I
> had a quick look at that part of the code and found this in
> checkpointer.c:
>
> XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
>
> For those following along from home, the additional "true" is a bool
> needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
> this is very satisfying. The whole purpose of passing the xlogreader
> is so we can figure out whether we need a checkpoint (never mind the
> question of whether the existing algorithm for determining that is
> really sensible) but now we need a second argument that basically
> serves the same purpose since one of the two callers to this function
> won't have an xlogreader. And then we're passing the EndOfLog and
> EndOfLogTLI as dummy values which seems like it's probably just
> totally wrong, but if for some reason it works correctly there sure
> don't seem to be any comments explaining why.
>
> So I started doing a bit of hacking myself and ended up with the
> attached, which I think is not completely the right thing yet but I
> think it's better than your version. I split this into three parts.
> 0001 splits up the logic that currently decides whether to write an
> end-of-recovery record or a checkpoint record and if the latter how
> the checkpoint ought to be performed into two functions.
> DetermineRecoveryXlogAction() figures out what we want to do, and
> PerformRecoveryXlogAction() does it. It also moves the code to run
> recovery_end_command and related stuff into a new function
> CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
> UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
> CleanupAfterArchiveRecovery() to just before we
> XLogReportParameters(). Because of the refactoring done by 0001, this
> is only a small amount of code movement. Because of the separation
> between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
> the latter doesn't need the xlogreader. So we can do
> DetermineRecoveryXlogAction() at the same time as now, while the
> xlogreader is available, and then we don't need it later when we
> PerformRecoveryXlogAction(), because we already know what we need to
> know. I think this is all fine as far as it goes.
>
> My 0003 is where I see some lingering problems. It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
> seem too problematic, because that value has been stored in four (!)
> places inside XLogCtl by this code:
>
>     LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
>
>     XLogCtl->LogwrtResult = LogwrtResult;
>
>     XLogCtl->LogwrtRqst.Write = EndOfLog;
>     XLogCtl->LogwrtRqst.Flush = EndOfLog;
>
> Presumably we could relatively easily change things around so that we
> finish one of those values ... probably one of the "write" values ..
> back out of XLogCtl instead of passing it as a parameter. That would
> work just as well from the checkpointer as from the startup process,
> and there seems to be no way for the value to change until after
> XLogAcceptWrites() has been called, so it seems fine. But that doesn't
> help for the other arguments. What I'm thinking is that we should just
> arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
> then XLogAcceptWrites() can fish those values out of there as well,
> which should be enough to make it work and do the same thing
> regardless of which process is calling it. But I have run out of time
> for today so have not explored coding that up.
>

I have spent some time thinking about making XLogAcceptWrites()
independent, and for that, we need to get rid of its arguments which
are available only in the startup process. The first argument
xlogaction deduced by the DetermineRecoveryXlogAction(). If we are able to
make this function logic independent and can deduce that xlogaction in
any process, we can skip xlogaction argument passing.

DetermineRecoveryXlogAction() function depends on a few global
variables, valid only in the startup process are InRecovery,
ArchiveRecoveryRequested & LocalPromoteIsTriggered.  Out of
three LocalPromoteIsTriggered's value is already available in shared
memory and that can be fetched by calling LocalPromoteIsTriggered().
InRecovery's value can be guessed by as long as DBState in the control
file doesn't get changed unexpectedly until XLogAcceptWrites()
executes.  If the DBState was not a clean shutdown, then surely the
server has gone through recovery. If we could rely on DBState in the
control file then we are good to go. For the last one,
ArchiveRecoveryRequested, I don't see any existing and appropriate
shared memory or control file information, so that can be identified
if the archive recovery was requested or not. Initially, I thought to
use SharedRecoveryState which is always set to RECOVERY_STATE_ARCHIVE,
if  the archive recovery requested. But there is another case where
SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of
ArchiveRecoveryRequested value, that is the presence of a backup label
file.  If we want to use SharedRecoveryState, we need one more state
which could differentiate between ArchiveRecoveryRequested and the
backup label file presence case.  To move ahead, I have copied
ArchiveRecoveryRequested into shared memory and it will be cleared
once archive cleanup is finished. With all these changes, we could get
rid of xlogaction argument and DetermineRecoveryXlogAction() function.
Could move its logic to PerformRecoveryXLogAction() directly.

Now, the remaining two arguments of XLogAcceptWrites() are required
for the CleanupAfterArchiveRecovery() function. Along with these two
arguments, this function requires ArchiveRecoveryRequested and
ThisTimeLineID which are again global variables.  With the previous
changes, we have got ArchiveRecoveryRequested into shared memory.
And for ThisTimeLineID, I don't think we need to do anything since this
value is available with all the backend as per the following comment:

"
/*
 * ThisTimeLineID will be same in all backends --- it identifies current
 * WAL timeline for the database system.
 */
TimeLineID  ThisTimeLineID = 0;
"

In addition to the four places that Robert has pointed for EndOfLog,
XLogCtl->lastSegSwitchLSN also holds EndOfLog value and that doesn't
seem to change until WAL write is enabled. For EndOfLogTLI, I think we
can safely use XLogCtl->replayEndTLI. Currently, The EndOfLogTLI is
the timeline ID of the last record that xlogreader reads, but this
xlogreader was simply re-fetching the last record which we have
replied in redo loop if it was in recovery, if not in recovery, we
don't need to worry since this value is needed only in case of
ArchiveRecoveryRequested = true, which implicitly forces redo and sets
replayEndTLI value.

With all the above changes XLogAcceptWrites() can be called from other
processes but I haven't tested that. This finding is still not
complete and not too clean, perhaps, posting the patches with
aforesaid changes just to confirm the direction and forward the
discussion, thanks.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Sep 15, 2021 at 6:49 AM Amul Sul <sulamul@gmail.com> wrote:
>  Initially, I thought to
> use SharedRecoveryState which is always set to RECOVERY_STATE_ARCHIVE,
> if  the archive recovery requested. But there is another case where
> SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of
> ArchiveRecoveryRequested value, that is the presence of a backup label
> file.

Right, there's a difference between whether archive recovery has been
*requested* and whether it is actually *happening*.

> If we want to use SharedRecoveryState, we need one more state
> which could differentiate between ArchiveRecoveryRequested and the
> backup label file presence case.  To move ahead, I have copied
> ArchiveRecoveryRequested into shared memory and it will be cleared
> once archive cleanup is finished. With all these changes, we could get
> rid of xlogaction argument and DetermineRecoveryXlogAction() function.
> Could move its logic to PerformRecoveryXLogAction() directly.

Putting these changes into 0001 seems to make no sense. It seems like
they should be part of 0003, or maybe a new 0004 patch.

> And for ThisTimeLineID, I don't think we need to do anything since this
> value is available with all the backend as per the following comment:
> "
> /*
>  * ThisTimeLineID will be same in all backends --- it identifies current
>  * WAL timeline for the database system.
>  */
> TimeLineID  ThisTimeLineID = 0;

I'm not sure I find that argument totally convincing. The two
variables aren't assigned at exactly the same places in the code,
nonwithstanding the comment. I'm not saying you're wrong. I'm just
saying I don't believe it just because the comment says so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Putting these changes into 0001 seems to make no sense. It seems like
> they should be part of 0003, or maybe a new 0004 patch.

After looking at this a little bit more, I think it's really necessary
to separate out all of your changes into separate patches at least for
initial review. It's particularly important to separate code movement
changes from other kinds of changes. 0001 was just moving code before,
and so was 0002, but now both are making other changes, which is not
easy to see from looking at the 'git diff' output. For that reason
it's not so easy to understand exactly what you've changed here and
analyze it.

I poked around a little bit at these patches, looking for
perhaps-interesting global variables upon which the code called from
XLogAcceptWrites() would depend with your patches applied. The most
interesting ones seem to be (1) ThisTimeLineID, which you mentioned
and which may be fine but I'm not totally convinced yet, (2)
LocalXLogInsertAllowed, which is probably not broken but I'm thinking
we may want to redesign that mechanism somehow to make it cleaner, and
(3) CheckpointStats, which is called from RemoveXlogFile which is
called from RemoveNonParentXlogFiles which is called from
CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
This last one is actually pretty weird already in the existing code.
It sort of looks like RemoveXlogFile() only expects to be called from
the checkpointer (or a standalone backend) so that it can update
CheckpointStats and have that just work, but actually it's also called
from the startup process when a timeline switch happens. I don't know
whether the fact that the increments to ckpt_segs_recycled get lost in
that case should be considered an intentional behavior that should be
preserved or an inadvertent mistake.

So I think you've covered most of the necessary things here, with
probably some more discussion needed on whether you've done the right
things...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Sep 15, 2021 at 9:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Putting these changes into 0001 seems to make no sense. It seems like
> > they should be part of 0003, or maybe a new 0004 patch.
>
> After looking at this a little bit more, I think it's really necessary
> to separate out all of your changes into separate patches at least for
> initial review. It's particularly important to separate code movement
> changes from other kinds of changes. 0001 was just moving code before,
> and so was 0002, but now both are making other changes, which is not
> easy to see from looking at the 'git diff' output. For that reason
> it's not so easy to understand exactly what you've changed here and
> analyze it.
>

Ok, understood, I have separated my changes into 0001 and 0002 patch,
and the refactoring patches start from 0003.

In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
memory as said previously. Coping ArchiveRecoveryRequested value to
shared memory is not really interesting, and I think somehow we should
reuse existing variable, (perhaps, with some modification of the
information it can store, e.g. adding one more enum value for
SharedRecoveryState or something else), thinking on the same.

In addition to that, I tried to turn down the scope of
ArchiveRecoveryRequested global variable. Now, this is a static
variable, and the scope is limited to xlog.c file like
LocalXLogInsertAllowed and can be accessed through the newly added
function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
me know what you think about the approach.

In 0002 patch is a mixed one where I tried to remove the dependencies
on global variables and local variables belonging to StartupXLOG(). I
am still worried about the InRecovery value that needs to be deduced
afterward inside XLogAcceptWrites().  Currently, relying on
ControlFile->state != DB_SHUTDOWNED check but I think that will not be
good for ASRO where we plan to skip XLogAcceptWrites() work only and
let the StartupXLOG() do the rest of the work as it is where it will
going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
might need some ugly kludge to call PerformRecoveryXLogAction() in
checkpointer irrespective of DBState, which makes me a bit
uncomfortable.

> I poked around a little bit at these patches, looking for
> perhaps-interesting global variables upon which the code called from
> XLogAcceptWrites() would depend with your patches applied. The most
> interesting ones seem to be (1) ThisTimeLineID, which you mentioned
> and which may be fine but I'm not totally convinced yet, (2)
> LocalXLogInsertAllowed, which is probably not broken but I'm thinking
> we may want to redesign that mechanism somehow to make it cleaner, and

Thanks for the off-list detailed explanation on this.

For somebody else who might be reading this, the concern here is (not
really a concern, it is a good thing to improve) the
LocalSetXLogInsertAllowed() function call, is a kind of hack that
enables wal writes irrespective of RecoveryInProgress() for a shorter
period. E.g. see following code in StartupXLOG:

"
  LocalSetXLogInsertAllowed();
  UpdateFullPageWrites();
  LocalXLogInsertAllowed = -1;
....
....
  /*
   * If any of the critical GUCs have changed, log them before we allow
   * backends to write WAL.
   */
  LocalSetXLogInsertAllowed();
  XLogReportParameters();
"

Instead of explicitly enabling wal insert, somehow that implicitly
allowed for the startup process and/or the checkpointer doing the
first checkpoint and/or wal writes after the recovery. Well, the
current LocalSetXLogInsertAllowed() mechanism is not really harming
anything or bad and does not necessarily need to change but it would
be nice if we were able to come up with something much cleaner,
bug-free, and 100% perfect enough design.

(Hope I am not missing anything from the discussion).

> (3) CheckpointStats, which is called from RemoveXlogFile which is
> called from RemoveNonParentXlogFiles which is called from
> CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> This last one is actually pretty weird already in the existing code.
> It sort of looks like RemoveXlogFile() only expects to be called from
> the checkpointer (or a standalone backend) so that it can update
> CheckpointStats and have that just work, but actually it's also called
> from the startup process when a timeline switch happens. I don't know
> whether the fact that the increments to ckpt_segs_recycled get lost in
> that case should be considered an intentional behavior that should be
> preserved or an inadvertent mistake.
>

Maybe I could be wrong, but I think that is intentional.  It removes
pre-allocated or bogus files of the old timeline which are not
supposed to be considered in stats. The comments for
CheckpointStatsData might not be clear but comment at the calling
RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:

"
/*
 * Before we continue on the new timeline, clean up any
 * (possibly bogus) future WAL segments on the old
 * timeline.
 */
RemoveNonParentXlogFiles(EndRecPtr, ThisTimeLineID);
....
....

 * We switched to a new timeline. Clean up segments on the old
 * timeline.
 *
 * If there are any higher-numbered segments on the old timeline,
 * remove them. They might contain valid WAL, but they might also be
 * pre-allocated files containing garbage. In any case, they are not
 * part of the new timeline's history so we don't need them.
 */
RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
"

> So I think you've covered most of the necessary things here, with
> probably some more discussion needed on whether you've done the right
> things...
>

Thanks, Robert, for your time.

Regards,
Amul Sul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Hi Mark,

I have tried to fix your review comment in the attached version,
please see my inline reply below.

On Fri, Sep 10, 2021 at 8:06 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> >
> >
>
> Thank you, for looking at the patch.  Please see my reply inline below:
>
> >
> > > On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > Here is the rebased version.
> >
> > v33-0004
> >
> > This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h
indirectlyincluded from a much larger set of files.  Maybe that's ok.  I don't know.  But it seems you are doing this
merelyto get the symbol (not even the definition) for struct DBState.  I'd recommend rearranging the code so this isn't
necessary,but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from
xlogdesc.c,xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c. 
> >
>
> Yes, you are correct, xlog.h is included in more than 150 files. I was
> wondering if we can have a forward declaration instead of including
> pg_control.h (e.g. The same way struct XLogRecData was declared in
> xlog.h). Perhaps, DBState is enum & I don't see we have done the same
> for enum elsewhere as we are doing for structures, but that seems to
> be fine, IMO.
>
> Earlier, I was unsure before preparing this patch, but since that
> makes sense (I assumed) and minimizes duplications, can we go ahead
> and post separately with the same change in StartupXLOG() which I have
> skipped for the same reason mentioned in patch commit-msg.
>

FYI, I have posted this patch separately [1] & drop it from the current set.

> > v33-0005
> > The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only
i.e.wal writes permitted". 
> >

Fixed.

> > The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than
actuallydoing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint
process".
> >

I am not sure I understood the concern, what comments should you
think? This function helps to signal the checkpointer, but doesn't
tell what it is supposed to do.

> > The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a
readerof the function name could already surmise.  Perhaps the comment can better clarify what is happening: "Set up
walprobibit shared state" 
> >

Done.

> > The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase:  "As in
ProcessSyncRequests,we don't want to stop wal prohibit change requests".  The nearby comment reads "stop absorbing".  I
thinkthis one should read "stop processing".  This same comment is used again below.   Then a third comment reads "For
thesame reason mentioned previously for the wal prohibit state change request check."  That third comment is too glib. 
> >

Ok, "stop processing" is used.  I think the third comment should be
fine instead of coping the same again, however, I change that comment
a bit for more clarity as "For the same reason mentioned previously
for the same function call".

> > tcop/utility.c needlessly includes "access/walprohibit.h"
> >
> > wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and
WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear.  Waiting for a
statechange is clear enough.  But how is waiting on a state different? 
> >

WAIT_EVENT_WALPROHIBIT_STATE_CHANGE gets set in pg_prohibit_wal()
while waiting for the system to prohibit state change.
WAIT_EVENT_WALPROHIBIT_STATE is set for the checkpointer process when
it sees the system is in a WAL PROHIBITED state & stops there. But I
think it makes sense to have only one, i.e.
WAIT_EVENT_WALPROHIBIT_STATE_CHANGE.  The same can be used for
checkpointer since it won't do anything until wal prohibited state
change.

Remove WAIT_EVENT_WALPROHIBIT_STATE in the attached version.

> > xlog.h defines a new enum.  I don't find any of it clear; not the comment, nor the name of the enum, nor the names
ofthe values: 
> >
> > /* State of work that enables wal writes */
> > typedef enum XLogAcceptWritesState
> > {
> >     XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
> >     XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
> >     XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
> > } XLogAcceptWritesState;
> >
> > This enum seems to have been written from the point of view of someone who already knew what it was for.  It needs
tobe written in a way that will be clear to people who have no idea what it is for. 
> >

I tried to avoid the function name in the comment, since the enum name
pretty much resembles the XLogAcceptWrite() function name whose
execution state we are trying to track, but added now, that would be
much clearer.

> > v33-0006:
> >
> > The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building
indexes"reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */ 
> >

Rephrased the comments but I think HAVE XID is much more appropriate
there because that assert function name ends with HaveXID.

Apart from this I have moved CheckWALPermitted() closer to
START_CRIT_SECTION which you have pointed out in your other post and
made a few other changes. Note that patch numbers are changed, I have
rebased my implementation on top of the under discussion refactoring
patches which I have posted previously [2] and reattached the same
here to make CFbot continue its testing.

Note that with the current version patch on the latest master head
getting one issue but can be seen sometimes only where one, the same
INSERT query gets stuck, waiting for WALWriteLock in exclusive mode. I
am not sure if it is due to my changes, but that is not occurring without
my patch. I am looking into that, just in case if anybody wants to
know more, I have attached the backtrace, pg_lock & ps output, see
ps-bt-pg_lock.out.text attached file.

Regards,
Amul

1] https://postgr.es/m/CAAJ_b97nd_ghRpyFV9Djf9RLXkoTbOUqnocq11WGq9TisX09Fw@mail.gmail.com
2] https://postgr.es/m/CAAJ_b96G-oBxDC3C7Y72ER09bsheGHOxBK1HXHVOyHNXjTDmcA@mail.gmail.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Sep 15, 2021 at 4:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:
> >
> > (2) if the session is idle, we also need the top-level abort
> > record to be written immediately, but can't send an error to the client until the next
> > command is issued without losing wire protocol synchronization. For now, we just use
> > FATAL to kill the session; maybe this can be improved in the future.
>
> Andres,
>
> I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned
pageheld by the cursor, when another session disables WAL.  It would be very interesting to test how the vacuum handles
thatspecific change.  I have not figured out the cleanest way to do this, though, as we don't as a project yet have a
standardway of setting up xid exhaustion in a regression test, do we?  The closest I saw to it was your work in [1],
butthat doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which
isn'tuseable from inside an isolation test.  Do you have a suggestion how best to continue developing out the test
infrastructure?
>
>
> Amul,
>
> The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the
isolationtester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test
toabort and all subsequent permutations to not run.  Attached patch v34-0009 fixes that. 
>

Interesting.

> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a
differentsession.  The expected output is worth reviewing to see how this plays out.  I don't see anything in there
whichis obviously wrong, but some of it is a bit clunky.  For example, by the time the client sees an error "FATAL:
WALis now prohibited", the system may already have switched back to read-write.  Also, it is a bit strange to get one
ofthese errors on an attempted ROLLBACK.  Once again, not wrong as such, but clunky. 
>

Can't we do the same in the TAP test? If the intention is only to test
session termination when the system changes to WAL are prohibited then
that I have added in the latest version, but that test does not
reinitiate the same connection again, I think that is not possible
there too.


Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:
>
>> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by
adifferent session.  The expected output is worth reviewing to see how this plays out.  I don't see anything in there
whichis obviously wrong, but some of it is a bit clunky.  For example, by the time the client sees an error "FATAL:
WALis now prohibited", the system may already have switched back to read-write.  Also, it is a bit strange to get one
ofthese errors on an attempted ROLLBACK.  Once again, not wrong as such, but clunky. 
>>
>
> Can't we do the same in the TAP test? If the intention is only to test
> session termination when the system changes to WAL are prohibited then
> that I have added in the latest version, but that test does not
> reinitiate the same connection again, I think that is not possible
> there too.

Perhaps you can point me to a TAP test that does this in a concise fashion.  When I tried writing a TAP test for this,
itwas much longer than the equivalent isolation test spec. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Sep 22, 2021 at 6:59 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> >> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only
bya different session.  The expected output is worth reviewing to see how this plays out.  I don't see anything in
therewhich is obviously wrong, but some of it is a bit clunky.  For example, by the time the client sees an error
"FATAL: WAL is now prohibited", the system may already have switched back to read-write.  Also, it is a bit strange to
getone of these errors on an attempted ROLLBACK.  Once again, not wrong as such, but clunky. 
> >>
> >
> > Can't we do the same in the TAP test? If the intention is only to test
> > session termination when the system changes to WAL are prohibited then
> > that I have added in the latest version, but that test does not
> > reinitiate the same connection again, I think that is not possible
> > there too.
>
> Perhaps you can point me to a TAP test that does this in a concise fashion.  When I tried writing a TAP test for
this,it was much longer than the equivalent isolation test spec. 
>

Yes, that is a bit longer, here is the snip from v35-0010 patch:

+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+ [
+ 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+ $node_primary->connstr('postgres')
+ ],
+ '<',
+ \$mysession_stdin,
+ '>',
+ \$mysession_stdout,
+ '2>',
+ \$mysession_stderr,
+ $psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+ 'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+ 'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL:  WAL is now prohibited/,
+ 'session with open write transaction is terminated');

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Mark Dilger
Date:

> On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Yes, that is a bit longer, here is the snip from v35-0010 patch

Right, that's longer, and only tests one interaction.  The isolation spec I posted upthread tests multiple interactions
betweenthe session which uses cursors and the system going read-only.  Whether the session using a cursor gets a FATAL,
justan ERROR, or neither depends on where it is in the process of opening, using, closing and committing.  I think
that'sinteresting.  If the implementation of the ALTER SESSION READ ONLY feature were to change in such a way as, for
example,to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Sep 22, 2021 at 7:33 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > Yes, that is a bit longer, here is the snip from v35-0010 patch
>
> Right, that's longer, and only tests one interaction.  The isolation spec I posted upthread tests multiple
interactionsbetween the session which uses cursors and the system going read-only.  Whether the session using a cursor
getsa FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing.
I think that's interesting.  If the implementation of the ALTER SESSION READ ONLY feature were to change in such a way
as,for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output. 
>

Agreed.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:
> Ok, understood, I have separated my changes into 0001 and 0002 patch,
> and the refactoring patches start from 0003.

I think it would be better in the other order, with the refactoring
patches at the beginning of the series.

> In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
> memory as said previously. Coping ArchiveRecoveryRequested value to
> shared memory is not really interesting, and I think somehow we should
> reuse existing variable, (perhaps, with some modification of the
> information it can store, e.g. adding one more enum value for
> SharedRecoveryState or something else), thinking on the same.
>
> In addition to that, I tried to turn down the scope of
> ArchiveRecoveryRequested global variable. Now, this is a static
> variable, and the scope is limited to xlog.c file like
> LocalXLogInsertAllowed and can be accessed through the newly added
> function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
> me know what you think about the approach.

I'm not sure yet whether I like this or not, but it doesn't seem like
a terrible idea. You spelled UNKNOWN wrong, though, which does seem
like a terrible idea. :-) "acccsed" is not correct either.

Also, the new comments for ArchiveRecoveryRequested /
ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
substitute the new terminology into the existing comment, but that
means that the purpose of the new "unknown" value is not at all clear.

Consider the following two patch fragments:

+ * SharedArchiveRecoveryRequested indicates whether an archive recovery is
+ * requested. Protected by info_lck.
...
+ * Remember archive recovery request in shared memory state.  A lock is not
+ * needed since we are the only ones who updating this.

These two comments directly contradict each other.

+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->SharedArchiveRecoveryRequested = false;
+ ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+ SpinLockRelease(&XLogCtl->info_lck);

This seems odd to me. In the first place, there doesn't seem to be any
value in clearing this -- we're just expending extra CPU cycles to get
rid of a value that wouldn't be used anyway. In the second place, if
somehow someone checked the value after this point, with this code,
they might get the wrong answer, whereas if you just deleted this,
they would get the right answer.

> In 0002 patch is a mixed one where I tried to remove the dependencies
> on global variables and local variables belonging to StartupXLOG(). I
> am still worried about the InRecovery value that needs to be deduced
> afterward inside XLogAcceptWrites().  Currently, relying on
> ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> good for ASRO where we plan to skip XLogAcceptWrites() work only and
> let the StartupXLOG() do the rest of the work as it is where it will
> going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> might need some ugly kludge to call PerformRecoveryXLogAction() in
> checkpointer irrespective of DBState, which makes me a bit
> uncomfortable.

I think that replacing the if (InRecovery) test with if
(ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
mean, there are three separate places where we set InRecovery = true.
One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
what you have here, but it also can happen if checkPoint.redo <
RecPtr, or if read_backup_label is true and ReadCheckpointRecord
returns non-NULL. Now maybe you're going to tell me that in those
other two cases we can't reach here anyway, but I don't see off-hand
why that should be true, and even if it is true, it seems like kind of
a fragile thing to rely on. I think we need to rely on something in
shared memory that is more explicitly connected to the question of
whether we are in recovery.

The other part of this patch has to do with whether we can use the
return value of GetLastSegSwitchData as a substitute for relying on
EndOfLog. Now as you have it, you end up creating a local variable
called EndOfLog that shadows another such variable in an outer scope,
which probably would not make anyone who noticed things in such a
state very happy. However, that will naturally get fixed if you
reorder the patches as per above, so let's turn to the central
question: is this a good way of getting EndOfLog? The value that would
be in effect at the time this code is executed is set here:

    XLogBeginRead(xlogreader, LastRec);
    record = ReadRecord(xlogreader, PANIC, false);
    EndOfLog = EndRecPtr;

Subsequently we do this:

    /* start the archive_timeout timer and LSN running */
    XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
    XLogCtl->lastSegSwitchLSN = EndOfLog;

So at that point the value that GetLastSegSwitchData() would return
has to match what's in the existing variable. But later XLogWrite()
will change the value. So the question boils down to whether
XLogWrite() could have been called between the assignment just above
and when this code runs. And the answer seems to pretty clear be yes,
because just above this code, we might have done
CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above
that, we did UpdateFullPageWrites(). So I don't think this is right.

> > (3) CheckpointStats, which is called from RemoveXlogFile which is
> > called from RemoveNonParentXlogFiles which is called from
> > CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> > This last one is actually pretty weird already in the existing code.
> > It sort of looks like RemoveXlogFile() only expects to be called from
> > the checkpointer (or a standalone backend) so that it can update
> > CheckpointStats and have that just work, but actually it's also called
> > from the startup process when a timeline switch happens. I don't know
> > whether the fact that the increments to ckpt_segs_recycled get lost in
> > that case should be considered an intentional behavior that should be
> > preserved or an inadvertent mistake. >
>
> Maybe I could be wrong, but I think that is intentional.  It removes
> pre-allocated or bogus files of the old timeline which are not
> supposed to be considered in stats. The comments for
> CheckpointStatsData might not be clear but comment at the calling
> RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:

Sure, I'm not saying the files are being removed by accident. I'm
saying it may be accidental that the removals are (I think) not going
to make it into the stats.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > Ok, understood, I have separated my changes into 0001 and 0002 patch,
> > and the refactoring patches start from 0003.
>
> I think it would be better in the other order, with the refactoring
> patches at the beginning of the series.
>

Ok, will do that. I did this other way to minimize the diff e.g.
deletion diff of RecoveryXlogAction enum and
DetermineRecoveryXlogAction(), etc.

> > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared
> > memory as said previously. Coping ArchiveRecoveryRequested value to
> > shared memory is not really interesting, and I think somehow we should
> > reuse existing variable, (perhaps, with some modification of the
> > information it can store, e.g. adding one more enum value for
> > SharedRecoveryState or something else), thinking on the same.
> >
> > In addition to that, I tried to turn down the scope of
> > ArchiveRecoveryRequested global variable. Now, this is a static
> > variable, and the scope is limited to xlog.c file like
> > LocalXLogInsertAllowed and can be accessed through the newly added
> > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let
> > me know what you think about the approach.
>
> I'm not sure yet whether I like this or not, but it doesn't seem like
> a terrible idea. You spelled UNKNOWN wrong, though, which does seem
> like a terrible idea. :-) "acccsed" is not correct either.
>
> Also, the new comments for ArchiveRecoveryRequested /
> ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is
> substitute the new terminology into the existing comment, but that
> means that the purpose of the new "unknown" value is not at all clear.
>

Ok, will fix those typos and try to improve the comments.

> Consider the following two patch fragments:
>
> + * SharedArchiveRecoveryRequested indicates whether an archive recovery is
> + * requested. Protected by info_lck.
> ...
> + * Remember archive recovery request in shared memory state.  A lock is not
> + * needed since we are the only ones who updating this.
>
> These two comments directly contradict each other.
>

Okay, the first comment is not clear enough, I will fix that too. I
meant we don't need the lock now since we are the only one updating at
this point.

> + SpinLockAcquire(&XLogCtl->info_lck);
> + XLogCtl->SharedArchiveRecoveryRequested = false;
> + ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
> + SpinLockRelease(&XLogCtl->info_lck);
>
> This seems odd to me. In the first place, there doesn't seem to be any
> value in clearing this -- we're just expending extra CPU cycles to get
> rid of a value that wouldn't be used anyway. In the second place, if
> somehow someone checked the value after this point, with this code,
> they might get the wrong answer, whereas if you just deleted this,
> they would get the right answer.
>

Previously, this flag was only valid in the startup process. But now
it will be valid for all the processes and will stay until the whole
server gets restarted. I don't want anybody to use this flag after the
cleanup point and just be sure I am explicitly cleaning this.

By the way, I also don't expect we should go with this approach.  I
proposed this by referring to the PromoteIsTriggered() implementation,
but IMO, it is better to have something else since we just want to perform
archive cleanup operation, and most of the work related to archive
recovery was done inside the StartupXLOG().

Rather than the proposed design, I was thinking of adding one or two
more RecoveryState enums. And while skipping XLogAcceptsWrite() set
XLogCtl->SharedRecoveryState appropriately, so that we can easily
identify that the archive recovery was requested previously and now,
we need to perform its pending cleanup operation. Thoughts?

> > In 0002 patch is a mixed one where I tried to remove the dependencies
> > on global variables and local variables belonging to StartupXLOG(). I
> > am still worried about the InRecovery value that needs to be deduced
> > afterward inside XLogAcceptWrites().  Currently, relying on
> > ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> > good for ASRO where we plan to skip XLogAcceptWrites() work only and
> > let the StartupXLOG() do the rest of the work as it is where it will
> > going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> > might need some ugly kludge to call PerformRecoveryXLogAction() in
> > checkpointer irrespective of DBState, which makes me a bit
> > uncomfortable.
>
> I think that replacing the if (InRecovery) test with if
> (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
> mean, there are three separate places where we set InRecovery = true.
> One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
> what you have here, but it also can happen if checkPoint.redo <
> RecPtr, or if read_backup_label is true and ReadCheckpointRecord
> returns non-NULL. Now maybe you're going to tell me that in those
> other two cases we can't reach here anyway, but I don't see off-hand
> why that should be true, and even if it is true, it seems like kind of
> a fragile thing to rely on. I think we need to rely on something in
> shared memory that is more explicitly connected to the question of
> whether we are in recovery.
>

No, this is the other way. I haven't picked (ControlFile->state !=
DB_SHUTDOWNED) condition because it setting InRecovery, rather, I
picked because InRecovery flag is setting ControlFile->state to either
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, see next if
(InRecovery) block after InRecovery flag gets set. It is certain that
when the system will be InRecovery, it will have the DBState other
than DB_SHUTDOWNED. But that isn't a clean approach for me because
when it will be in WAL prohibited the DBState will be DB_IN_PRODUCTION
which will not work, as I mentioned previously.

I am too thinking about passing this information via shared memory but
trying to somehow avoid this, lets see.

> The other part of this patch has to do with whether we can use the
> return value of GetLastSegSwitchData as a substitute for relying on
> EndOfLog. Now as you have it, you end up creating a local variable
> called EndOfLog that shadows another such variable in an outer scope,
> which probably would not make anyone who noticed things in such a
> state very happy. However, that will naturally get fixed if you
> reorder the patches as per above, so let's turn to the central
> question: is this a good way of getting EndOfLog? The value that would
> be in effect at the time this code is executed is set here:
>
>     XLogBeginRead(xlogreader, LastRec);
>     record = ReadRecord(xlogreader, PANIC, false);
>     EndOfLog = EndRecPtr;
>
> Subsequently we do this:
>
>     /* start the archive_timeout timer and LSN running */
>     XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
>     XLogCtl->lastSegSwitchLSN = EndOfLog;
>
> So at that point the value that GetLastSegSwitchData() would return
> has to match what's in the existing variable. But later XLogWrite()
> will change the value. So the question boils down to whether
> XLogWrite() could have been called between the assignment just above
> and when this code runs. And the answer seems to pretty clear be yes,
> because just above this code, we might have done
> CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above
> that, we did UpdateFullPageWrites(). So I don't think this is right.
>

You are correct, if XLogWrite() called between the lastSegSwitchLSN
value can be changed, but the question is, will that change in our
case. I think it won't, let me explain.

IIUC, lastSegSwitchLSN will change generally in XLogWrite(), if the
previous WAL has been filled up. But if we see closely what will be
going to be written before we do check lastSegSwitchLSN. Currently, we
have a record for full-page write and record for either recovery end
or checkpoint, all these are fixed size and I don't think going to
fill the whole 16MB wal file. Correct me if I am missing something.

> > > (3) CheckpointStats, which is called from RemoveXlogFile which is
> > > called from RemoveNonParentXlogFiles which is called from
> > > CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> > > This last one is actually pretty weird already in the existing code.
> > > It sort of looks like RemoveXlogFile() only expects to be called from
> > > the checkpointer (or a standalone backend) so that it can update
> > > CheckpointStats and have that just work, but actually it's also called
> > > from the startup process when a timeline switch happens. I don't know
> > > whether the fact that the increments to ckpt_segs_recycled get lost in
> > > that case should be considered an intentional behavior that should be
> > > preserved or an inadvertent mistake. >
> >
> > Maybe I could be wrong, but I think that is intentional.  It removes
> > pre-allocated or bogus files of the old timeline which are not
> > supposed to be considered in stats. The comments for
> > CheckpointStatsData might not be clear but comment at the calling
> > RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:
>
> Sure, I'm not saying the files are being removed by accident. I'm
> saying it may be accidental that the removals are (I think) not going
> to make it into the stats.
>

Understood, it looks like I missed the concluding line in the previous
reply.  My point was if deleting bogus files then why should we care
about counting them in stats.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Fri, Sep 24, 2021 at 5:07 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Ok, understood, I have separated my changes into 0001 and 0002 patch,
> > > and the refactoring patches start from 0003.
> >
> > I think it would be better in the other order, with the refactoring
> > patches at the beginning of the series.
> >
>
> Ok, will do that. I did this other way to minimize the diff e.g.
> deletion diff of RecoveryXlogAction enum and
> DetermineRecoveryXlogAction(), etc.
>

I have reversed the patch order. Now refactoring patches will be
first, and the patch that removes the dependencies on global & local
variables will be the last. I did the necessary modification in the
refactoring patches too e.g. removed DetermineRecoveryXlogAction() and
RecoveryXlogAction enum which is no longer needed (thanks to commit #
1d919de5eb3fffa7cc9479ed6d2915fb89794459 to make code simple).

To find the value of InRecovery after we clear it, patch still uses
ControlFile's DBState, but now the check condition changed to a more
specific one which is less confusing.

In casual off-list discussion, the point was made to check
SharedRecoveryState to find out the InRecovery value afterward, and
check that using RecoveryInProgress().  But we can't depend on
SharedRecoveryState because at the start it gets initialized to
RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
Therefore, we can't use RecoveryInProgress() which always returns
true if SharedRecoveryState != RECOVERY_STATE_DONE.

I am posting only refactoring patches for now.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:
> To find the value of InRecovery after we clear it, patch still uses
> ControlFile's DBState, but now the check condition changed to a more
> specific one which is less confusing.
>
> In casual off-list discussion, the point was made to check
> SharedRecoveryState to find out the InRecovery value afterward, and
> check that using RecoveryInProgress().  But we can't depend on
> SharedRecoveryState because at the start it gets initialized to
> RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> Therefore, we can't use RecoveryInProgress() which always returns
> true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Rushabh Lathia
Date:


On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:
> To find the value of InRecovery after we clear it, patch still uses
> ControlFile's DBState, but now the check condition changed to a more
> specific one which is less confusing.
>
> In casual off-list discussion, the point was made to check
> SharedRecoveryState to find out the InRecovery value afterward, and
> check that using RecoveryInProgress().  But we can't depend on
> SharedRecoveryState because at the start it gets initialized to
> RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> Therefore, we can't use RecoveryInProgress() which always returns
> true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected!  Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej


I tried to apply the patch on the master branch head and it's failing
with conflicts.

Later applied patch on below commit and it got applied cleanly:

commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Sep 27 18:48:01 2021 -0400

    Re-enable contrib/bloom's TAP tests.
   
rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent.
  /*
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent.
  */
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent.
  Insert->fullPageWrites = lastFullPageWrites;
warning: 3 lines add whitespace errors.
rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch
 
There are whitespace errors on patch 0003.
 
It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

--
Robert Haas
EDB: http://www.enterprisedb.com




--
Rushabh Lathia

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
   On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
>
>
>
> On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:
>> > To find the value of InRecovery after we clear it, patch still uses
>> > ControlFile's DBState, but now the check condition changed to a more
>> > specific one which is less confusing.
>> >
>> > In casual off-list discussion, the point was made to check
>> > SharedRecoveryState to find out the InRecovery value afterward, and
>> > check that using RecoveryInProgress().  But we can't depend on
>> > SharedRecoveryState because at the start it gets initialized to
>> > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
>> > Therefore, we can't use RecoveryInProgress() which always returns
>> > true if SharedRecoveryState != RECOVERY_STATE_DONE.
>>
>> Uh, this change has crept into 0002, but it should be in 0004 with the
>> rest of the changes to remove dependencies on variables specific to
>> the startup process. Like I said before, we should really be trying to
>> separate code movement from functional changes.

Well, I have to replace the InRecovery flag in that patch since we are
moving code past to the point where the InRecovery flag gets cleared.
If I don't do, then the 0002 patch would be wrong since InRecovery is
always false, and behaviour won't be the same as it was before that
patch.

>> Also, 0002 doesn't
>> actually apply for me. Did you generate these patches with 'git
>> format-patch'?
>>
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
>> patching file src/backend/access/transam/xlog.c
>> Hunk #1 succeeded at 889 (offset 9 lines).
>> Hunk #2 succeeded at 939 (offset 12 lines).
>> Hunk #3 succeeded at 5734 (offset 37 lines).
>> Hunk #4 succeeded at 8038 (offset 70 lines).
>> Hunk #5 succeeded at 8248 (offset 70 lines).
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
>> patching file src/backend/access/transam/xlog.c
>> Reversed (or previously applied) patch detected!  Assume -R? [n]
>> Apply anyway? [n] y
>> Hunk #1 FAILED at 7954.
>> Hunk #2 succeeded at 8079 (offset 70 lines).
>> 1 out of 2 hunks FAILED -- saving rejects to file
>> src/backend/access/transam/xlog.c.rej
>> [rhaas pgsql]$ git reset --hard
>> HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
>> non-recoverable connection failure.
>> [rhaas pgsql]$ patch -p1 <
>> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
>> patching file src/backend/access/transam/xlog.c
>> Reversed (or previously applied) patch detected!  Assume -R? [n]
>> Apply anyway? [n]
>> Skipping patch.
>> 2 out of 2 hunks ignored -- saving rejects to file
>> src/backend/access/transam/xlog.c.rej
>>
>
> I tried to apply the patch on the master branch head and it's failing
> with conflicts.
>

Thanks, Rushabh, for the quick check, I have attached a rebased version for the
latest master head commit # f6b5d05ba9a.

> Later applied patch on below commit and it got applied cleanly:
>
> commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Mon Sep 27 18:48:01 2021 -0400
>
>     Re-enable contrib/bloom's TAP tests.
>
> rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
> rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
> rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent.
>   /*
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent.
>   */
> v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent.
>   Insert->fullPageWrites = lastFullPageWrites;
> warning: 3 lines add whitespace errors.
> rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch
>
> There are whitespace errors on patch 0003.
>

Fixed.

>>
>> It seems to me that the approach you're pursuing here can't work,
>> because the long-term goal is to get to a place where, if the system
>> starts up read-only, XLogAcceptWrites() might not be called until
>> later, after StartupXLOG() has exited. But in that case the control
>> file state would be DB_IN_PRODUCTION. But my idea of using
>> RecoveryInProgress() won't work either, because we set
>> RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
>> differently, the question we want to answer is not "are we in recovery
>> now?" but "did we perform recovery?". After studying the code a bit, I
>> think a good test might be
>> !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
>> gets set to true, then we're certain to enter the if (InRecovery)
>> block that contains the main redo loop. And that block unconditionally
>> does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
>> think that replayEndRecPtr can't be 0 because it's supposed to
>> represent the record we're pretending to have last replayed, as
>> explained by the comments. And while lastReplayedEndRecPtr will get
>> updated later as we replay more records, I think it will never be set
>> back to 0. It's only going to increase, as we replay more records. On
>> the other hand if InRecovery = false then we'll never change it, and
>> it seems that it starts out as 0.
>>

Understood, used lastReplayedEndRecPtr but in 0002 patch for the
aforesaid reason.

>> I was hoping to have more time today to comment on 0004, but the day
>> seems to have gotten away from me. One quick thought is that it looks
>> a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
>> returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
>> ... because there's also replayEndRecPtr, which seems to go with
>> replayEndTLI. It feels like we should use a source for the TLI that
>> clearly matches the source for the corresponding LSN, unless there's
>> some super-good reason to do otherwise.

Agreed, that would be the right thing, but on the latest master head
that might not be the right thing to use because of commit #
ff9f111bce24 that has introduced the following code that changes the
EndOfLog that could be different from replayEndRecPtr:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

With this commit, we have got two new global variables. First,
missingContrecPtr is an EndOfLog which gets stored in shared memory at
few places, and the other one abortedRecPtr that is needed in
XLogAcceptWrite(), which I have exported into shared memory.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Jaime Casanova
Date:
On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
>    On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
> >
> > I tried to apply the patch on the master branch head and it's failing
> > with conflicts.
> >
> 
> Thanks, Rushabh, for the quick check, I have attached a rebased version for the
> latest master head commit # f6b5d05ba9a.
> 

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests 

Test Summary Report
-------------------
t/026_overwrite_contrecord.pl      (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""



-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
>
> On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> >    On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> > <rushabh.lathia@gmail.com> wrote:
> > >
> > > I tried to apply the patch on the master branch head and it's failing
> > > with conflicts.
> > >
> >
> > Thanks, Rushabh, for the quick check, I have attached a rebased version for the
> > latest master head commit # f6b5d05ba9a.
> >
>
> Hi,
>
> I got this error while executing "make check" on src/test/recovery:
>
> """
> t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
> # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Looks like your test exited with 29 just after 1.
> t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
> Failed 2/3 subtests
>
> Test Summary Report
> -------------------
> t/026_overwrite_contrecord.pl      (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 3 tests but ran 1.
> Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
> Result: FAIL
> make: *** [Makefile:23: check] Error 1
> """
>

Thanks for the reporting problem, I am working on it. The cause of
failure is that v37_0004 patch clearing the missingContrecPtr global
variable before CreateOverwriteContrecordRecord() execution, which it
shouldn't.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Oct 7, 2021 at 6:21 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
> <jcasanov@systemguards.com.ec> wrote:
> >
> > On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> > >    On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> > > <rushabh.lathia@gmail.com> wrote:
> > > >
> > > > I tried to apply the patch on the master branch head and it's failing
> > > > with conflicts.
> > > >
> > >
> > > Thanks, Rushabh, for the quick check, I have attached a rebased version for the
> > > latest master head commit # f6b5d05ba9a.
> > >
> >
> > Hi,
> >
> > I got this error while executing "make check" on src/test/recovery:
> >
> > """
> > t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
> > # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > # Looks like your test exited with 29 just after 1.
> > t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
> > Failed 2/3 subtests
> >
> > Test Summary Report
> > -------------------
> > t/026_overwrite_contrecord.pl      (Wstat: 7424 Tests: 1 Failed: 0)
> >   Non-zero exit status: 29
> >   Parse errors: Bad plan.  You planned 3 tests but ran 1.
> > Files=26, Tests=279, 400 wallclock secs ( 0.27 usr  0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
> > Result: FAIL
> > make: *** [Makefile:23: check] Error 1
> > """
> >
>
> Thanks for the reporting problem, I am working on it. The cause of
> failure is that v37_0004 patch clearing the missingContrecPtr global
> variable before CreateOverwriteContrecordRecord() execution, which it
> shouldn't.
>

In the attached version I have fixed this issue by restoring missingContrecPtr.

To handle abortedRecPtr and missingContrecPtr newly added global
variables thought the commit # ff9f111bce24, we don't need to store
them in the shared memory separately, instead, we need a flag that
indicates a broken record found previously, at the end of recovery, so
that we can overwrite contrecord.

The missingContrecPtr is assigned to the EndOfLog, and we have handled
EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
ensure that the lastReplayedEndRecPtr value is the same as the
abortedRecPtr, but I think that is not needed, we can go ahead and
write an overwrite-contrecord starting at lastReplayedEndRecPtr.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote:
> In the attached version I have fixed this issue by restoring missingContrecPtr.
>
> To handle abortedRecPtr and missingContrecPtr newly added global
> variables thought the commit # ff9f111bce24, we don't need to store
> them in the shared memory separately, instead, we need a flag that
> indicates a broken record found previously, at the end of recovery, so
> that we can overwrite contrecord.
>
> The missingContrecPtr is assigned to the EndOfLog, and we have handled
> EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
> ensure that the lastReplayedEndRecPtr value is the same as the
> abortedRecPtr, but I think that is not needed, we can go ahead and
> write an overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not all processes are necessarily equally privileged, but
whether or not we're in recovery is supposed to be something about
which everyone agrees, so answering that question differently in
different processes doesn't seem nice. XLogInsertAllowed() could be
rewritten to check the state directly and make its own determination,
without relying on RecoveryInProgress(), and I think that might be the
right way to go here.

But that isn't entirely problem-free either, because there's a lot of
code that uses RecoveryInProgress() to answer the question "should I
write WAL?" and therefore it's not great if RecoveryInProgress() is
returning an answer that is inconsistent with XLogInsertAllowed().
MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
kind of coding. It probably wouldn't break in practice right away,
because most of that code never runs in the startup process or the
checkpointer and would therefore never notice the difference in
behavior between those two functions, but if in the future we get the
read-only feature that this thread is supposed to be about, we'd have
problems. Not all RecoveryInProgress() calls have this sense - e.g.
sendDir() in basebackup.c is trying to figure out whether recovery
ended during the backup, not whether we can write WAL. But perhaps
this is a good time to go and replace RecoveryInProgress() checks that
are intending to decide whether or not it's OK to write WAL with
XLogInsertAllowed() checks (noting that the return value is reversed).
If we did that, then I think RecoveryInProgress() could also NOT call
InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
which seems like it might be better. But I haven't really tried to
code all of this up, so I'm not really sure how it all works out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Thu, Oct 14, 2021 at 11:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote:
> > In the attached version I have fixed this issue by restoring missingContrecPtr.
> >
> > To handle abortedRecPtr and missingContrecPtr newly added global
> > variables thought the commit # ff9f111bce24, we don't need to store
> > them in the shared memory separately, instead, we need a flag that
> > indicates a broken record found previously, at the end of recovery, so
> > that we can overwrite contrecord.
> >
> > The missingContrecPtr is assigned to the EndOfLog, and we have handled
> > EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> > same as the lastReplayedEndRecPtr, AFAICS.  I have added an assert to
> > ensure that the lastReplayedEndRecPtr value is the same as the
> > abortedRecPtr, but I think that is not needed, we can go ahead and
> > write an overwrite-contrecord starting at lastReplayedEndRecPtr.
>
> I thought that it made sense to commit 0001 and 0002 at this point, so
> I have done that. I think that the treatment of missingContrecPtr and
> abortedRecPtr may need more thought yet, so at least for that reason I
> don't think it's a good idea to proceed with 0004 yet. 0003 is just
> code movement so I guess that can be committed whenever we're
> confident that we know exactly which things we want to end up inside
> XLogAcceptWrites().
>

Ok.

> I do have a few ideas after studying this a bit more:
>
> - I wonder whether, in addition to moving a few things later as 0002
> did, we also ought to think about moving one thing earlier,
> specifically XLogReportParameters(). Right now, we have, I believe,
> four things that write WAL at the end of recovery:
> CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
> PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
> is structured now, we do the first three of those things, and then do
> a bunch of other stuff inside CleanupAfterArchiveRecovery() like
> running recovery_end_command, and removing non-parent xlog files, and
> archiving the partial segment, and then come back and do the fourth
> one. Is there any good reason for that? If not, I think doing them all
> together would be cleaner, and would propose to reverse the order of
> CleanupAfterArchiveRecovery() and XLogReportParameters().
>

Yes, that can be done.

> - If we did that, then I would further propose to adjust things so
> that we remove the call to LocalSetXLogInsertAllowed() and the
> assignment LocalXLogInsertAllowed = -1 from inside
> CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
> after UpdateFullPageWrites(), and the call to
> LocalSetXLogInsertAllowed() just before XLogReportParameters().
> Instead, just let the call to LocalSetXLogInsertAllowed() right before
> CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
> to be much point in flipping that switch off and on again, and the
> fact that we have been doing so is in my view just evidence that
> StartupXLOG() doesn't do a very good job of getting related code all
> into one place.
>

Currently there are three places that are calling
LocalSetXLogInsertAllowed() and resetting that LocalXLogInsertAllowed
flag as StartupXLOG(), CreateEndOfRecoveryRecord() and the
CreateCheckPoint().  By doing the aforementioned code rearrangement we
can get rid of frequent calls from StartupXLOG() and can completely
remove the need for it in CreateEndOfRecoveryRecord() since that gets
called only from StartupXLOG() directly. Whereas CreateCheckPoint()
too gets called from StartupXLOG() when it is running in a standalone
backend only, at that time we don't need to call
LocalSetXLogInsertAllowed()  but if that running in the Checkpointer
process then we need that.

I tried this in the attached version, but I'm a bit skeptical with
changes that are needed for CreateCheckPoint(), those don't seem to be
clean. I am wondering if we could completely remove the need to end of
recovery checkpoint as proposed in [1], that would get rid of
CHECKPOINT_END_OF_RECOVERY operation and the
LocalSetXLogInsertAllowed() requirement in CreateCheckPoint(), and
after that, we were not expecting checkpoint operation in recovery. If
we could do that then we would have LocalSetXLogInsertAllowed() only
at one place i.e. in StartupXLOG (...and in the future in
XLogAcceptWrites()) -- the code that runs only once in a lifetime of
the server and the kludge that the attached patch doing for
CreateCheckPoint() will not be needed.

> - It seems really tempting to invent a fourth RecoveryState value that
> indicates that we are done with REDO but not yet in production, and
> maybe also to rename RecoveryState to something like WALState. I'm
> thinking of something like WAL_STATE_CRASH_RECOVERY,
> WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
> WAL_STATE_PRODUCTION. Then, instead of having
> LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
> the startup process and the checkpointer are allowed to insert WAL
> when the state is WAL_STATE_REDO_COMPLETE, but other processes only
> once we reach WAL_STATE_PRODUCTION. We would set
> WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
> It's necessary to include the checkpointer, or at least I think it is,
> because PerformRecoveryXLogAction() might call RequestCheckpoint(),
> and that's got to work. If we did this, then I think it would also
> solve another problem which the overall patch set has to address
> somehow. Say that we eventually move responsibility for the
> to-be-created XLogAcceptWrites() function from the startup process to
> the checkpointer, as proposed. The checkpointer needs to know when to
> call it ... and the answer with this change is simple: when we reach
> WAL_STATE_REDO_COMPLETE, it's time!
>
> But this idea is not completely problem-free. I spent some time poking
> at it and I think it's a little hard to come up with a satisfying way
> to code XLogInsertAllowed(). Right now that function calls
> RecoveryInProgress(), and if RecoveryInProgress() decides that
> recovery is no longer in progress, it calls InitXLOGAccess(). However,
> that presumes that the only reason you'd call RecoveryInProgress() is
> to figure out whether you should write WAL, which I don't think is
> really true, and it also means that, when the wal state is
> WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
> true in the checkpointer and startup process and false everywhere
> else, which does not sound like a great idea. It seems fine to say
> that xlog insertion is allowed in some processes but not others,
> because not all processes are necessarily equally privileged, but
> whether or not we're in recovery is supposed to be something about
> which everyone agrees, so answering that question differently in
> different processes doesn't seem nice. XLogInsertAllowed() could be
> rewritten to check the state directly and make its own determination,
> without relying on RecoveryInProgress(), and I think that might be the
> right way to go here.
>
> But that isn't entirely problem-free either, because there's a lot of
> code that uses RecoveryInProgress() to answer the question "should I
> write WAL?" and therefore it's not great if RecoveryInProgress() is
> returning an answer that is inconsistent with XLogInsertAllowed().
> MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
> kind of coding. It probably wouldn't break in practice right away,
> because most of that code never runs in the startup process or the
> checkpointer and would therefore never notice the difference in
> behavior between those two functions, but if in the future we get the
> read-only feature that this thread is supposed to be about, we'd have
> problems. Not all RecoveryInProgress() calls have this sense - e.g.
> sendDir() in basebackup.c is trying to figure out whether recovery
> ended during the backup, not whether we can write WAL. But perhaps
> this is a good time to go and replace RecoveryInProgress() checks that
> are intending to decide whether or not it's OK to write WAL with
> XLogInsertAllowed() checks (noting that the return value is reversed).
> If we did that, then I think RecoveryInProgress() could also NOT call
> InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
> which seems like it might be better. But I haven't really tried to
> code all of this up, so I'm not really sure how it all works out.
>

I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
good, but I am not sure about calling it from XLogInsertAllowed()
either, perhaps, both are status check function and general
expectations might be that status checking functions are not going
change and/or initialize the system state. InitXLOGAccess() should
get called from the very first WAL write operation if needed, but if
we don't want to do that, then I would prefer to call InitXLOGAccess()
from XLogInsertAllowed() instead of RecoveryInProgress().

As said before, if we were able to get rid of the need to
end-of-recovery checkpoint [1] then we don't need separate handling in
XLogInsertAllowed() for the Checkpointer process, that would be much
cleaner and for the startup process, we would force
XLogInsertAllowed() return true by calling LocalSetXLogInsertAllowed()
for the time being as we are doing right now.

Regards,
Amul

1] "using an end-of-recovery record in all cases" :
https://postgr.es/m/CAAJ_b95xPx6oHRb5VEatGbp-cLsZApf_9GWGtbv9dsFKiV_VDQ@mail.gmail.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:
> I tried this in the attached version, but I'm a bit skeptical with
> changes that are needed for CreateCheckPoint(), those don't seem to be
> clean.

Yeah, that doesn't look great. I don't think it's entirely correct,
actually, because surely you want LocalXLogInsertAllowed = 0 to be
executed even if !IsPostmasterEnvironment. It's only
LocalXLogInsertAllowed = -1 that we would want to have depend on
IsPostmasterEnvironment. But that's pretty ugly too: I guess the
reason it has to be like is that, if it does that unconditionally, it
will overwrite the temporary value of 1 set by the caller, which will
then cause problems when the caller tries to XLogReportParameters().

I think that problem goes away if we drive the decision off of shared
state rather than a local variable, but I agree that it's otherwise a
bit tricky to untangle. One idea might be to have
LocalSetXLogInsertAllowed return the old value. Then we could use the
same kind of coding we do when switching memory contexts, where we
say:

oldcontext = MemoryContextSwitchTo(something);
// do stuff
MemoryContextSwitchTo(oldcontext);

Here we could maybe do:

oldxlallowed = LocalSetXLogInsertAllowed();
// do stuff
XLogInsertAllowed = oldxlallowed;

That way, instead of CreateCheckPoint() knowing under what
circumstances the caller might have changed the value, it only knows
that some callers might have already changed the value. That seems
better.

> I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
> good, but I am not sure about calling it from XLogInsertAllowed()
> either, perhaps, both are status check function and general
> expectations might be that status checking functions are not going
> change and/or initialize the system state. InitXLOGAccess() should
> get called from the very first WAL write operation if needed, but if
> we don't want to do that, then I would prefer to call InitXLOGAccess()
> from XLogInsertAllowed() instead of RecoveryInProgress().

Well, that's a fair point, too, but it might not be safe to, say, move
this to XLogBeginInsert(). Like, imagine that there's a hypothetical
piece of code that looks like this:

if (RecoveryInProgress())
    ereport(ERROR, errmsg("can't do that in recovery")));

// do something here that depends on ThisTimeLineID or
wal_segment_size or RedoRecPtr

XLogBeginInsert();
....
lsn = XLogInsert(...);

Such code would work correctly the way things are today, but if the
InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
it would fail.

I was curious whether this is just a theoretical problem. It turns out
that it's not. I wrote a couple of just-for-testing patches, which I
attach here. The first one just adjusts things so that we'll fail an
assertion if we try to make use of ThisTimeLineID before we've set it
to a legal value. I had to exempt two places from these checks just
for 'make check-world' to pass; these are shown in the patch, and one
or both of them might be existing bugs -- or maybe not, I haven't
looked too deeply. The second one then adjusts the patch to pretend
that ThisTimeLineID is not necessarily valid just because we've called
InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
that change, I find about a dozen places where, apparently, the early
call to InitXLOGAccess() is critical to getting ThisTimeLineID
adjusted in time. So apparently a change of this type is not entirely
trivial. And this is just a quick test, and just for one of the three
things that get initialized here.

On the other hand, just moving it to XLogInsertAllowed() isn't
risk-free either and would likely require adjusting some of the same
places I found with this test. So I guess if we want to do something
like this we need more study.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, Oct 19, 2021 at 3:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:
> > I tried this in the attached version, but I'm a bit skeptical with
> > changes that are needed for CreateCheckPoint(), those don't seem to be
> > clean.
>
> Yeah, that doesn't look great. I don't think it's entirely correct,
> actually, because surely you want LocalXLogInsertAllowed = 0 to be
> executed even if !IsPostmasterEnvironment. It's only
> LocalXLogInsertAllowed = -1 that we would want to have depend on
> IsPostmasterEnvironment. But that's pretty ugly too: I guess the
> reason it has to be like is that, if it does that unconditionally, it
> will overwrite the temporary value of 1 set by the caller, which will
> then cause problems when the caller tries to XLogReportParameters().
>
> I think that problem goes away if we drive the decision off of shared
> state rather than a local variable, but I agree that it's otherwise a
> bit tricky to untangle. One idea might be to have
> LocalSetXLogInsertAllowed return the old value. Then we could use the
> same kind of coding we do when switching memory contexts, where we
> say:
>
> oldcontext = MemoryContextSwitchTo(something);
> // do stuff
> MemoryContextSwitchTo(oldcontext);
>
> Here we could maybe do:
>
> oldxlallowed = LocalSetXLogInsertAllowed();
> // do stuff
> XLogInsertAllowed = oldxlallowed;
>

Ok, did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() calling
multiple times, but the problem I can see is that with this patch user
is allowed to call LocalSetXLogInsertAllowed() at the time it is
supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
WAL writes are explicitly disabled.

> That way, instead of CreateCheckPoint() knowing under what
> circumstances the caller might have changed the value, it only knows
> that some callers might have already changed the value. That seems
> better.
>
> > I agree that calling InitXLOGAccess() from RecoveryInProgress() is not
> > good, but I am not sure about calling it from XLogInsertAllowed()
> > either, perhaps, both are status check function and general
> > expectations might be that status checking functions are not going
> > change and/or initialize the system state. InitXLOGAccess() should
> > get called from the very first WAL write operation if needed, but if
> > we don't want to do that, then I would prefer to call InitXLOGAccess()
> > from XLogInsertAllowed() instead of RecoveryInProgress().
>
> Well, that's a fair point, too, but it might not be safe to, say, move
> this to XLogBeginInsert(). Like, imagine that there's a hypothetical
> piece of code that looks like this:
>
> if (RecoveryInProgress())
>     ereport(ERROR, errmsg("can't do that in recovery")));
>
> // do something here that depends on ThisTimeLineID or
> wal_segment_size or RedoRecPtr
>
> XLogBeginInsert();
> ....
> lsn = XLogInsert(...);
>
> Such code would work correctly the way things are today, but if the
> InitXLOGAccess() call were deferred until XLogBeginInsert() time, then
> it would fail.
>
> I was curious whether this is just a theoretical problem. It turns out
> that it's not. I wrote a couple of just-for-testing patches, which I
> attach here. The first one just adjusts things so that we'll fail an
> assertion if we try to make use of ThisTimeLineID before we've set it
> to a legal value. I had to exempt two places from these checks just
> for 'make check-world' to pass; these are shown in the patch, and one
> or both of them might be existing bugs -- or maybe not, I haven't
> looked too deeply. The second one then adjusts the patch to pretend
> that ThisTimeLineID is not necessarily valid just because we've called
> InitXLOGAccess() but that it is valid after XLogBeginInsert(). With
> that change, I find about a dozen places where, apparently, the early
> call to InitXLOGAccess() is critical to getting ThisTimeLineID
> adjusted in time. So apparently a change of this type is not entirely
> trivial. And this is just a quick test, and just for one of the three
> things that get initialized here.
>
> On the other hand, just moving it to XLogInsertAllowed() isn't
> risk-free either and would likely require adjusting some of the same
> places I found with this test. So I guess if we want to do something
> like this we need more study.
>

Yeah, that requires a lot of energy and time -- not done anything
related to this in the attached version.

Please have a look at the attached version where the 0001 patch does
change LocalSetXLogInsertAllowed() as said before. 0002 patch moves
XLogReportParameters() closer to other wal write operations and
removes unnecessary LocalSetXLogInsertAllowed() calls. 0003 is code
movements adds XLogAcceptWrites() function same as the before, and
0004 patch tries to remove the dependency. 0004 patch could change
w.r.t. decision that is going to be made for the patch that I
posted[1] to remove abortedRecPtr global variable. For now, I have
copied abortedRecPtr into shared memory. Thanks !

1] https://postgr.es/m/CAAJ_b94Y75ZwMim+gxxexVwf_yzO-dChof90ky0dB2GstspNjA@mail.gmail.com

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:
> Ok, did the same in the attached 0001 patch.
>
> There is no harm in calling LocalSetXLogInsertAllowed() calling
> multiple times, but the problem I can see is that with this patch user
> is allowed to call LocalSetXLogInsertAllowed() at the time it is
> supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
> WAL writes are explicitly disabled.

I've pushed 0001 and 0002 but I reversed the order of them and made a
few other edits.

I don't really see the issue you mention here as a problem. There's
only one place where we set LocalXLogInsertAllowed = 0, and I don't
know that we'll ever have another one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
"Bossart, Nathan"
Date:
On 10/25/21, 7:50 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I've pushed 0001 and 0002 but I reversed the order of them and made a
> few other edits.

My compiler is complaining about oldXLogAllowed possibly being used
uninitialized in CreateCheckPoint().  AFAICT it can just be initially
set to zero to silence this warning because it will, in fact, be
initialized properly when it is used.

Nathan


Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> My compiler is complaining about oldXLogAllowed possibly being used
> uninitialized in CreateCheckPoint().  AFAICT it can just be initially
> set to zero to silence this warning because it will, in fact, be
> initialized properly when it is used.

Hmm, I guess I could have foreseen that, had I been a little bit
smarter than I am. I have committed a change to initialize it to 0 as
you propose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
"Bossart, Nathan"
Date:
On 10/25/21, 1:33 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> My compiler is complaining about oldXLogAllowed possibly being used
>> uninitialized in CreateCheckPoint().  AFAICT it can just be initially
>> set to zero to silence this warning because it will, in fact, be
>> initialized properly when it is used.
>
> Hmm, I guess I could have foreseen that, had I been a little bit
> smarter than I am. I have committed a change to initialize it to 0 as
> you propose.

Thanks!

Nathan


Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:
> > Ok, did the same in the attached 0001 patch.
> >
> > There is no harm in calling LocalSetXLogInsertAllowed() calling
> > multiple times, but the problem I can see is that with this patch user
> > is allowed to call LocalSetXLogInsertAllowed() at the time it is
> > supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
> > WAL writes are explicitly disabled.
>
> I've pushed 0001 and 0002 but I reversed the order of them and made a
> few other edits.
>

Thank you!

I have rebased the remaining patches on top of the latest master head
(commit # e63ce9e8d6a).

In addition to that, I did the additional changes to 0002 where I
haven't included the change that tries to remove arguments of
CleanupAfterArchiveRecovery() in the previous version. Because if we
want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to
replace EndOfLogTLI and EndOfLog arguments respectively, then we also
need to consider the case where EndOfLog is changing if the
abort-record does exist. That can be decided only in XLogAcceptWrite()
before the shared memory value related to abort-record is going to be
clear.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, Oct 26, 2021 at 4:29 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Ok, did the same in the attached 0001 patch.
> > >
> > > There is no harm in calling LocalSetXLogInsertAllowed() calling
> > > multiple times, but the problem I can see is that with this patch user
> > > is allowed to call LocalSetXLogInsertAllowed() at the time it is
> > > supposed not to be called e.g. when LocalXLogInsertAllowed = 0;
> > > WAL writes are explicitly disabled.
> >
> > I've pushed 0001 and 0002 but I reversed the order of them and made a
> > few other edits.
> >
>
> Thank you!
>
> I have rebased the remaining patches on top of the latest master head
> (commit # e63ce9e8d6a).
>
> In addition to that, I did the additional changes to 0002 where I
> haven't included the change that tries to remove arguments of
> CleanupAfterArchiveRecovery() in the previous version. Because if we
> want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to
> replace EndOfLogTLI and EndOfLog arguments respectively, then we also
> need to consider the case where EndOfLog is changing if the
> abort-record does exist. That can be decided only in XLogAcceptWrite()
> before the shared memory value related to abort-record is going to be
> clear.
>

Attached is the rebased version of refactoring as well as the
pg_prohibit_wal feature patches for the latest master head (commit #
39a3105678a).

I was planning to attach the rebased version of isolation test patches
that Mark has posted before[1], but some permutation tests are not
stable, where expected errors get printed differently; therefore, I
dropped that from the attachment, for now.

Regards,
Amul

1] https://postgr.es/m/9BA3BA57-6B7B-45CB-B8D9-6B5EB0104FFA@enterprisedb.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Robert Haas
Date:
On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is the rebased version of refactoring as well as the
> pg_prohibit_wal feature patches for the latest master head (commit #
> 39a3105678a).

I spent a lot of time today studying 0002, and specifically the
question of whether EndOfLog must be the same as
XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter
redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
enter redo, then I think it has to be the same unless something very
weird happens. EndOfLog gets set like this:

    XLogBeginRead(xlogreader, LastRec);
    record = ReadRecord(xlogreader, PANIC, false, replayTLI);
    EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the
same before these three lines of code as it is afterward. However, if
you test with recovery_target=immediate, you can get it to be
different, because in that case we drop out of the redo loop after
calling recoveryStopsBefore() rather than after calling
recoveryStopsAfter(). Similarly I'm fairly sure that if you use
recovery_target_inclusive=off you can likewise get it to be different
(though I discovered the hard way that recovery_target_inclusive=off
is ignored when you use recovery_target_name). It seems like a really
bad thing that neither recovery_target=immediate nor
recovery_target_inclusive=off have any tests, and I think we ought to
add some.

Anyway, in effect, these three lines of code have the effect of
backing up the xlogreader by one record when we stop before rather
than after a record that we're replaying. What that means is that
EndOfLog is going to be the end+1 of the last record that we actually
replayed. There might be one more record that we read but did not
replay, and that record won't impact the value we end up with in
EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
record that we actually replayed. To put that another way, there's no
way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
and before we change LastRec. So in the cases where
XLogCtl->replayEndRecPtr gets initialized at all, it can only be
different from EndOfLog if something different happens when we re-read
the last-replayed WAL record than what happened when we read it the
first time. That seems unlikely, and would be unfortunate it if it did
happen. I am inclined to think that it might be better not to reread
the record at all, though. As far as this patch goes, I think we need
a solution that doesn't involve fetching EndOfLog from a variable
that's only sometimes initialized and then not doing anything with it
except in the cases where it was initialized.

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
think the regression tests contain any scenarios where we run recovery
and the values end up different. However, I think that the code sets
EndOfLogTLI to the TLI of the last WAL file that we read, and I think
XLogCtl->replayEndTLI gets set to the timeline from which that WAL
record originated. So imagine that we are looking for WAL that ought
to be in 000000010000000000000003 but we don't find it; instead we
find 000000020000000000000003 because our recovery target timeline is
2, or something that has 2 in its history. We will read the WAL for
timeline 1 from this file which has timeline 2 in the file name. I
think if recovery ends in this file before the timeline switch, these
values will be different. I did not try to construct a test case for
this today due to not having enough time, so it's possible that I'm
wrong about this, but that's how it looks to me from the code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is the rebased version of refactoring as well as the
> > pg_prohibit_wal feature patches for the latest master head (commit #
> > 39a3105678a).
>
> I spent a lot of time today studying 0002, and specifically the
> question of whether EndOfLog must be the same as
> XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> XLogCtl->replayEndTLI.
>
> The answer to the former question is "no" because, if we don't enter
> redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> enter redo, then I think it has to be the same unless something very
> weird happens. EndOfLog gets set like this:
>
>     XLogBeginRead(xlogreader, LastRec);
>     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
>     EndOfLog = EndRecPtr;
>
> In every case that exists in our regression tests, EndRecPtr is the
> same before these three lines of code as it is afterward. However, if
> you test with recovery_target=immediate, you can get it to be
> different, because in that case we drop out of the redo loop after
> calling recoveryStopsBefore() rather than after calling
> recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> recovery_target_inclusive=off you can likewise get it to be different
> (though I discovered the hard way that recovery_target_inclusive=off
> is ignored when you use recovery_target_name). It seems like a really
> bad thing that neither recovery_target=immediate nor
> recovery_target_inclusive=off have any tests, and I think we ought to
> add some.
>

recovery/t/003_recovery_targets.pl has test for
recovery_target=immediate but not for recovery_target_inclusive=off, we
can add that for recovery_target_lsn, recovery_target_time, and
recovery_target_xid case only where it affects.

> Anyway, in effect, these three lines of code have the effect of
> backing up the xlogreader by one record when we stop before rather
> than after a record that we're replaying. What that means is that
> EndOfLog is going to be the end+1 of the last record that we actually
> replayed. There might be one more record that we read but did not
> replay, and that record won't impact the value we end up with in
> EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> record that we actually replayed. To put that another way, there's no
> way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> and before we change LastRec. So in the cases where
> XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> different from EndOfLog if something different happens when we re-read
> the last-replayed WAL record than what happened when we read it the
> first time. That seems unlikely, and would be unfortunate it if it did
> happen. I am inclined to think that it might be better not to reread
> the record at all, though.

There are two reasons that the record is reread; first, one that you
have just explained where the redo loop drops out due to
recoveryStopsBefore() and another one is that InRecovery is false.

In the formal case at the end, redo while-loop does read a new record
which in effect updates EndRecPtr and when we breaks the loop, we do
reach the place where we do reread record -- where we do read the
record (i.e. LastRec) before the record that redo loop has read and
which correctly sets EndRecPtr. In the latter case, definitely, we
don't need any adjustment to EndRecPtr.

So technically one case needs reread but that is also not needed, we
have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
do not need to reread the record, but EndOfLog and EndOfLogTLI should
be set conditionally something like:

if (InRecovery)
{
    EndOfLog = XLogCtl->lastReplayedEndRecPtr;
    EndOfLogTLI = XLogCtl->lastReplayedTLI;
}
else
{
    EndOfLog = EndRecPtr;
    EndOfLogTLI = replayTLI;
}

> As far as this patch goes, I think we need
> a solution that doesn't involve fetching EndOfLog from a variable
> that's only sometimes initialized and then not doing anything with it
> except in the cases where it was initialized.
>

Another reason could be EndOfLog changes further in the following case:

/*
 * Actually, if WAL ended in an incomplete record, skip the parts that
 * made it through and start writing after the portion that persisted.
 * (It's critical to first write an OVERWRITE_CONTRECORD message, which
 * we'll do as soon as we're open for writing new WAL.)
 */
if (!XLogRecPtrIsInvalid(missingContrecPtr))
{
    Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
    EndOfLog = missingContrecPtr;
}

Now only solution that I can think is to copy EndOfLog (so
EndOfLogTLI) into shared memory.

> As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> think the regression tests contain any scenarios where we run recovery
> and the values end up different. However, I think that the code sets
> EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> record originated. So imagine that we are looking for WAL that ought
> to be in 000000010000000000000003 but we don't find it; instead we
> find 000000020000000000000003 because our recovery target timeline is
> 2, or something that has 2 in its history. We will read the WAL for
> timeline 1 from this file which has timeline 2 in the file name. I
> think if recovery ends in this file before the timeline switch, these
> values will be different. I did not try to construct a test case for
> this today due to not having enough time, so it's possible that I'm
> wrong about this, but that's how it looks to me from the code.
>

I am not sure, I have understood this scenario due to lack of
expertise in this area -- Why would the record we looking that ought
to be in 000000010000000000000003 we don't find it? Possibly WAL
corruption or that file is missing?

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Attached is the rebased version of refactoring as well as the
> > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > 39a3105678a).
> >
> > I spent a lot of time today studying 0002, and specifically the
> > question of whether EndOfLog must be the same as
> > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > XLogCtl->replayEndTLI.
> >
> > The answer to the former question is "no" because, if we don't enter
> > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > enter redo, then I think it has to be the same unless something very
> > weird happens. EndOfLog gets set like this:
> >
> >     XLogBeginRead(xlogreader, LastRec);
> >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> >     EndOfLog = EndRecPtr;
> >
> > In every case that exists in our regression tests, EndRecPtr is the
> > same before these three lines of code as it is afterward. However, if
> > you test with recovery_target=immediate, you can get it to be
> > different, because in that case we drop out of the redo loop after
> > calling recoveryStopsBefore() rather than after calling
> > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > recovery_target_inclusive=off you can likewise get it to be different
> > (though I discovered the hard way that recovery_target_inclusive=off
> > is ignored when you use recovery_target_name). It seems like a really
> > bad thing that neither recovery_target=immediate nor
> > recovery_target_inclusive=off have any tests, and I think we ought to
> > add some.
> >
>
> recovery/t/003_recovery_targets.pl has test for
> recovery_target=immediate but not for recovery_target_inclusive=off, we
> can add that for recovery_target_lsn, recovery_target_time, and
> recovery_target_xid case only where it affects.
>
> > Anyway, in effect, these three lines of code have the effect of
> > backing up the xlogreader by one record when we stop before rather
> > than after a record that we're replaying. What that means is that
> > EndOfLog is going to be the end+1 of the last record that we actually
> > replayed. There might be one more record that we read but did not
> > replay, and that record won't impact the value we end up with in
> > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > record that we actually replayed. To put that another way, there's no
> > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > and before we change LastRec. So in the cases where
> > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > different from EndOfLog if something different happens when we re-read
> > the last-replayed WAL record than what happened when we read it the
> > first time. That seems unlikely, and would be unfortunate it if it did
> > happen. I am inclined to think that it might be better not to reread
> > the record at all, though.
>
> There are two reasons that the record is reread; first, one that you
> have just explained where the redo loop drops out due to
> recoveryStopsBefore() and another one is that InRecovery is false.
>
> In the formal case at the end, redo while-loop does read a new record
> which in effect updates EndRecPtr and when we breaks the loop, we do
> reach the place where we do reread record -- where we do read the
> record (i.e. LastRec) before the record that redo loop has read and
> which correctly sets EndRecPtr. In the latter case, definitely, we
> don't need any adjustment to EndRecPtr.
>
> So technically one case needs reread but that is also not needed, we
> have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> do not need to reread the record, but EndOfLog and EndOfLogTLI should
> be set conditionally something like:
>
> if (InRecovery)
> {
>     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
>     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> }
> else
> {
>     EndOfLog = EndRecPtr;
>     EndOfLogTLI = replayTLI;
> }
>
> > As far as this patch goes, I think we need
> > a solution that doesn't involve fetching EndOfLog from a variable
> > that's only sometimes initialized and then not doing anything with it
> > except in the cases where it was initialized.
> >
>
> Another reason could be EndOfLog changes further in the following case:
>
> /*
>  * Actually, if WAL ended in an incomplete record, skip the parts that
>  * made it through and start writing after the portion that persisted.
>  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>  * we'll do as soon as we're open for writing new WAL.)
>  */
> if (!XLogRecPtrIsInvalid(missingContrecPtr))
> {
>     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>     EndOfLog = missingContrecPtr;
> }
>
> Now only solution that I can think is to copy EndOfLog (so
> EndOfLogTLI) into shared memory.
>
> > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > think the regression tests contain any scenarios where we run recovery
> > and the values end up different. However, I think that the code sets
> > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > record originated. So imagine that we are looking for WAL that ought
> > to be in 000000010000000000000003 but we don't find it; instead we
> > find 000000020000000000000003 because our recovery target timeline is
> > 2, or something that has 2 in its history. We will read the WAL for
> > timeline 1 from this file which has timeline 2 in the file name. I
> > think if recovery ends in this file before the timeline switch, these
> > values will be different. I did not try to construct a test case for
> > this today due to not having enough time, so it's possible that I'm
> > wrong about this, but that's how it looks to me from the code.
> >
>
> I am not sure, I have understood this scenario due to lack of
> expertise in this area -- Why would the record we looking that ought
> to be in 000000010000000000000003 we don't find it? Possibly WAL
> corruption or that file is missing?
>

On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
XLogFileReadAnyTLI(), I think I could make a sense that there could be
a case where the record belong to TLI 1 we are looking for; we might
open the file with TLI 2. But, I am wondering what's wrong if we say
that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
in its file name -- that statement is still true, and that record
should be still accessible from the filename with TLI 1.  Also, if we
going to consider this reading record exists before the timeline
switch point as the EndOfLog then why should be worried about the
latter timeline switch which eventually everything after the EndOfLog
going to be useless for us. We might continue switching TLI and/or
writing the WAL right after EndOfLog, correct me if I am missing
something here.

Further, I still think replayEndTLI has set to the correct value what
we looking for EndOfLogTLI when we go through the redo loop. When it
read the record and finds a change in the current replayTLI then it
updates that as:

if (newReplayTLI != replayTLI)
{
    /* Check that it's OK to switch to this TLI */
    checkTimeLineSwitch(EndRecPtr, newReplayTLI,
                        prevReplayTLI, replayTLI);

    /* Following WAL records should be run with new TLI */
    replayTLI = newReplayTLI;
    switchedTLI = true;
}

Then replayEndTLI gets updated. If we going to skip the reread of
"LastRec" that we were discussing, then I think the following code
that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
(or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
should be enough, AFAICU.

/*
 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
 * the end-of-log. It could be different from the timeline that EndOfLog
 * nominally belongs to, if there was a timeline switch in that segment,
 * and we were reading the old WAL from a segment belonging to a higher
 * timeline.
 */
EndOfLogTLI = xlogreader->seg.ws_tli;

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
  On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > > Attached is the rebased version of refactoring as well as the
> > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > 39a3105678a).
> > >
> > > I spent a lot of time today studying 0002, and specifically the
> > > question of whether EndOfLog must be the same as
> > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > XLogCtl->replayEndTLI.
> > >
> > > The answer to the former question is "no" because, if we don't enter
> > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > enter redo, then I think it has to be the same unless something very
> > > weird happens. EndOfLog gets set like this:
> > >
> > >     XLogBeginRead(xlogreader, LastRec);
> > >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > >     EndOfLog = EndRecPtr;
> > >
> > > In every case that exists in our regression tests, EndRecPtr is the
> > > same before these three lines of code as it is afterward. However, if
> > > you test with recovery_target=immediate, you can get it to be
> > > different, because in that case we drop out of the redo loop after
> > > calling recoveryStopsBefore() rather than after calling
> > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > recovery_target_inclusive=off you can likewise get it to be different
> > > (though I discovered the hard way that recovery_target_inclusive=off
> > > is ignored when you use recovery_target_name). It seems like a really
> > > bad thing that neither recovery_target=immediate nor
> > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > add some.
> > >
> >
> > recovery/t/003_recovery_targets.pl has test for
> > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > can add that for recovery_target_lsn, recovery_target_time, and
> > recovery_target_xid case only where it affects.
> >
> > > Anyway, in effect, these three lines of code have the effect of
> > > backing up the xlogreader by one record when we stop before rather
> > > than after a record that we're replaying. What that means is that
> > > EndOfLog is going to be the end+1 of the last record that we actually
> > > replayed. There might be one more record that we read but did not
> > > replay, and that record won't impact the value we end up with in
> > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > record that we actually replayed. To put that another way, there's no
> > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > and before we change LastRec. So in the cases where
> > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > different from EndOfLog if something different happens when we re-read
> > > the last-replayed WAL record than what happened when we read it the
> > > first time. That seems unlikely, and would be unfortunate it if it did
> > > happen. I am inclined to think that it might be better not to reread
> > > the record at all, though.
> >
> > There are two reasons that the record is reread; first, one that you
> > have just explained where the redo loop drops out due to
> > recoveryStopsBefore() and another one is that InRecovery is false.
> >
> > In the formal case at the end, redo while-loop does read a new record
> > which in effect updates EndRecPtr and when we breaks the loop, we do
> > reach the place where we do reread record -- where we do read the
> > record (i.e. LastRec) before the record that redo loop has read and
> > which correctly sets EndRecPtr. In the latter case, definitely, we
> > don't need any adjustment to EndRecPtr.
> >
> > So technically one case needs reread but that is also not needed, we
> > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > be set conditionally something like:
> >
> > if (InRecovery)
> > {
> >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > }
> > else
> > {
> >     EndOfLog = EndRecPtr;
> >     EndOfLogTLI = replayTLI;
> > }
> >
> > > As far as this patch goes, I think we need
> > > a solution that doesn't involve fetching EndOfLog from a variable
> > > that's only sometimes initialized and then not doing anything with it
> > > except in the cases where it was initialized.
> > >
> >
> > Another reason could be EndOfLog changes further in the following case:
> >
> > /*
> >  * Actually, if WAL ended in an incomplete record, skip the parts that
> >  * made it through and start writing after the portion that persisted.
> >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> >  * we'll do as soon as we're open for writing new WAL.)
> >  */
> > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > {
> >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> >     EndOfLog = missingContrecPtr;
> > }
> >
> > Now only solution that I can think is to copy EndOfLog (so
> > EndOfLogTLI) into shared memory.
> >
> > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > think the regression tests contain any scenarios where we run recovery
> > > and the values end up different. However, I think that the code sets
> > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > record originated. So imagine that we are looking for WAL that ought
> > > to be in 000000010000000000000003 but we don't find it; instead we
> > > find 000000020000000000000003 because our recovery target timeline is
> > > 2, or something that has 2 in its history. We will read the WAL for
> > > timeline 1 from this file which has timeline 2 in the file name. I
> > > think if recovery ends in this file before the timeline switch, these
> > > values will be different. I did not try to construct a test case for
> > > this today due to not having enough time, so it's possible that I'm
> > > wrong about this, but that's how it looks to me from the code.
> > >
> >
> > I am not sure, I have understood this scenario due to lack of
> > expertise in this area -- Why would the record we looking that ought
> > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > corruption or that file is missing?
> >
>
> On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> XLogFileReadAnyTLI(), I think I could make a sense that there could be
> a case where the record belong to TLI 1 we are looking for; we might
> open the file with TLI 2. But, I am wondering what's wrong if we say
> that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> in its file name -- that statement is still true, and that record
> should be still accessible from the filename with TLI 1.  Also, if we
> going to consider this reading record exists before the timeline
> switch point as the EndOfLog then why should be worried about the
> latter timeline switch which eventually everything after the EndOfLog
> going to be useless for us. We might continue switching TLI and/or
> writing the WAL right after EndOfLog, correct me if I am missing
> something here.
>
> Further, I still think replayEndTLI has set to the correct value what
> we looking for EndOfLogTLI when we go through the redo loop. When it
> read the record and finds a change in the current replayTLI then it
> updates that as:
>
> if (newReplayTLI != replayTLI)
> {
>     /* Check that it's OK to switch to this TLI */
>     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
>                         prevReplayTLI, replayTLI);
>
>     /* Following WAL records should be run with new TLI */
>     replayTLI = newReplayTLI;
>     switchedTLI = true;
> }
>
> Then replayEndTLI gets updated. If we going to skip the reread of
> "LastRec" that we were discussing, then I think the following code
> that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> should be enough, AFAICU.
>
> /*
>  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>  * the end-of-log. It could be different from the timeline that EndOfLog
>  * nominally belongs to, if there was a timeline switch in that segment,
>  * and we were reading the old WAL from a segment belonging to a higher
>  * timeline.
>  */
> EndOfLogTLI = xlogreader->seg.ws_tli;
>

I think I found the right case for this, above TLI fetch is needed in
the case where we do restore from the archived WAL files. In my trial,
the archive directory has files as below (Kindly ignore the extra
history file, I perform a few more trials to be sure):

-rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
-rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
-rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
-rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
-rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
-rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history

The timeline is switched in 1F file but the archiver has backup older
timeline file and renamed it. While performing PITR using these
archived files, the .partitial file seems to be skipped from the
restore. The file with the next timeline id is selected to read the
records that belong to the previous timeline id as well (i.e. 4 here,
all the records before timeline switch point). Here is the files
inside pg_wal directory after restore, note that in the current
experiment, I chose recovery_target_xid = <just before the timeline#5
switch point > and then recovery_target_action = 'promote':

-rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
-rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
-rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
-rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
-rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F

The last one is the new WAL file created in that cluster.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
   On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote:
>
>   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > >
> > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > > > Attached is the rebased version of refactoring as well as the
> > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > 39a3105678a).
> > > >
> > > > I spent a lot of time today studying 0002, and specifically the
> > > > question of whether EndOfLog must be the same as
> > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > XLogCtl->replayEndTLI.
> > > >
> > > > The answer to the former question is "no" because, if we don't enter
> > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > enter redo, then I think it has to be the same unless something very
> > > > weird happens. EndOfLog gets set like this:
> > > >
> > > >     XLogBeginRead(xlogreader, LastRec);
> > > >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > >     EndOfLog = EndRecPtr;
> > > >
> > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > same before these three lines of code as it is afterward. However, if
> > > > you test with recovery_target=immediate, you can get it to be
> > > > different, because in that case we drop out of the redo loop after
> > > > calling recoveryStopsBefore() rather than after calling
> > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > is ignored when you use recovery_target_name). It seems like a really
> > > > bad thing that neither recovery_target=immediate nor
> > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > add some.
> > > >
> > >
> > > recovery/t/003_recovery_targets.pl has test for
> > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > can add that for recovery_target_lsn, recovery_target_time, and
> > > recovery_target_xid case only where it affects.
> > >
> > > > Anyway, in effect, these three lines of code have the effect of
> > > > backing up the xlogreader by one record when we stop before rather
> > > > than after a record that we're replaying. What that means is that
> > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > replayed. There might be one more record that we read but did not
> > > > replay, and that record won't impact the value we end up with in
> > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > record that we actually replayed. To put that another way, there's no
> > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > and before we change LastRec. So in the cases where
> > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > different from EndOfLog if something different happens when we re-read
> > > > the last-replayed WAL record than what happened when we read it the
> > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > happen. I am inclined to think that it might be better not to reread
> > > > the record at all, though.
> > >
> > > There are two reasons that the record is reread; first, one that you
> > > have just explained where the redo loop drops out due to
> > > recoveryStopsBefore() and another one is that InRecovery is false.
> > >
> > > In the formal case at the end, redo while-loop does read a new record
> > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > reach the place where we do reread record -- where we do read the
> > > record (i.e. LastRec) before the record that redo loop has read and
> > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > don't need any adjustment to EndRecPtr.
> > >
> > > So technically one case needs reread but that is also not needed, we
> > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > be set conditionally something like:
> > >
> > > if (InRecovery)
> > > {
> > >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > }
> > > else
> > > {
> > >     EndOfLog = EndRecPtr;
> > >     EndOfLogTLI = replayTLI;
> > > }
> > >
> > > > As far as this patch goes, I think we need
> > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > that's only sometimes initialized and then not doing anything with it
> > > > except in the cases where it was initialized.
> > > >
> > >
> > > Another reason could be EndOfLog changes further in the following case:
> > >
> > > /*
> > >  * Actually, if WAL ended in an incomplete record, skip the parts that
> > >  * made it through and start writing after the portion that persisted.
> > >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> > >  * we'll do as soon as we're open for writing new WAL.)
> > >  */
> > > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > > {
> > >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> > >     EndOfLog = missingContrecPtr;
> > > }
> > >
> > > Now only solution that I can think is to copy EndOfLog (so
> > > EndOfLogTLI) into shared memory.
> > >
> > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > > think the regression tests contain any scenarios where we run recovery
> > > > and the values end up different. However, I think that the code sets
> > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > > record originated. So imagine that we are looking for WAL that ought
> > > > to be in 000000010000000000000003 but we don't find it; instead we
> > > > find 000000020000000000000003 because our recovery target timeline is
> > > > 2, or something that has 2 in its history. We will read the WAL for
> > > > timeline 1 from this file which has timeline 2 in the file name. I
> > > > think if recovery ends in this file before the timeline switch, these
> > > > values will be different. I did not try to construct a test case for
> > > > this today due to not having enough time, so it's possible that I'm
> > > > wrong about this, but that's how it looks to me from the code.
> > > >
> > >
> > > I am not sure, I have understood this scenario due to lack of
> > > expertise in this area -- Why would the record we looking that ought
> > > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > > corruption or that file is missing?
> > >
> >
> > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> > XLogFileReadAnyTLI(), I think I could make a sense that there could be
> > a case where the record belong to TLI 1 we are looking for; we might
> > open the file with TLI 2. But, I am wondering what's wrong if we say
> > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> > in its file name -- that statement is still true, and that record
> > should be still accessible from the filename with TLI 1.  Also, if we
> > going to consider this reading record exists before the timeline
> > switch point as the EndOfLog then why should be worried about the
> > latter timeline switch which eventually everything after the EndOfLog
> > going to be useless for us. We might continue switching TLI and/or
> > writing the WAL right after EndOfLog, correct me if I am missing
> > something here.
> >
> > Further, I still think replayEndTLI has set to the correct value what
> > we looking for EndOfLogTLI when we go through the redo loop. When it
> > read the record and finds a change in the current replayTLI then it
> > updates that as:
> >
> > if (newReplayTLI != replayTLI)
> > {
> >     /* Check that it's OK to switch to this TLI */
> >     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
> >                         prevReplayTLI, replayTLI);
> >
> >     /* Following WAL records should be run with new TLI */
> >     replayTLI = newReplayTLI;
> >     switchedTLI = true;
> > }
> >
> > Then replayEndTLI gets updated. If we going to skip the reread of
> > "LastRec" that we were discussing, then I think the following code
> > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> > should be enough, AFAICU.
> >
> > /*
> >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >  * the end-of-log. It could be different from the timeline that EndOfLog
> >  * nominally belongs to, if there was a timeline switch in that segment,
> >  * and we were reading the old WAL from a segment belonging to a higher
> >  * timeline.
> >  */
> > EndOfLogTLI = xlogreader->seg.ws_tli;
> >
>
> I think I found the right case for this, above TLI fetch is needed in
> the case where we do restore from the archived WAL files. In my trial,
> the archive directory has files as below (Kindly ignore the extra
> history file, I perform a few more trials to be sure):
>
> -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
> -rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> -rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> -rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
> -rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
> -rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history
>
> The timeline is switched in 1F file but the archiver has backup older
> timeline file and renamed it. While performing PITR using these
> archived files, the .partitial file seems to be skipped from the
> restore. The file with the next timeline id is selected to read the
> records that belong to the previous timeline id as well (i.e. 4 here,
> all the records before timeline switch point). Here is the files
> inside pg_wal directory after restore, note that in the current
> experiment, I chose recovery_target_xid = <just before the timeline#5
> switch point > and then recovery_target_action = 'promote':
>
> -rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> -rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> -rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
> -rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
> -rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
>
> The last one is the new WAL file created in that cluster.
>

With this experiment, I think it is clear that the EndOfLogTLI can be
different from the replayEndTLI or lastReplayedTLI, and we don't have
any other option to get that into other processes other than exporting
into shared memory.  Similarly, we have bunch of option (e.g.
replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN etc) to get
EndOfLog value but those are not perfect and reliable options.

Therefore, in the attached patch, I have exported EndOfLog and
EndOfLogTLI into shared memory and attached only the refactoring
patches since there a bunch of other work needs to be done on the main
ASRO patches what I discussed with Robert off-list, thanks.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Tue, Nov 23, 2021 at 7:23 PM Amul Sul <sulamul@gmail.com> wrote:
>
>    On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> >   On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote:
> > > >
> > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > >
> > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > > > > Attached is the rebased version of refactoring as well as the
> > > > > > pg_prohibit_wal feature patches for the latest master head (commit #
> > > > > > 39a3105678a).
> > > > >
> > > > > I spent a lot of time today studying 0002, and specifically the
> > > > > question of whether EndOfLog must be the same as
> > > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as
> > > > > XLogCtl->replayEndTLI.
> > > > >
> > > > > The answer to the former question is "no" because, if we don't enter
> > > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do
> > > > > enter redo, then I think it has to be the same unless something very
> > > > > weird happens. EndOfLog gets set like this:
> > > > >
> > > > >     XLogBeginRead(xlogreader, LastRec);
> > > > >     record = ReadRecord(xlogreader, PANIC, false, replayTLI);
> > > > >     EndOfLog = EndRecPtr;
> > > > >
> > > > > In every case that exists in our regression tests, EndRecPtr is the
> > > > > same before these three lines of code as it is afterward. However, if
> > > > > you test with recovery_target=immediate, you can get it to be
> > > > > different, because in that case we drop out of the redo loop after
> > > > > calling recoveryStopsBefore() rather than after calling
> > > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use
> > > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > > is ignored when you use recovery_target_name). It seems like a really
> > > > > bad thing that neither recovery_target=immediate nor
> > > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > > add some.
> > > > >
> > > >
> > > > recovery/t/003_recovery_targets.pl has test for
> > > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > > can add that for recovery_target_lsn, recovery_target_time, and
> > > > recovery_target_xid case only where it affects.
> > > >
> > > > > Anyway, in effect, these three lines of code have the effect of
> > > > > backing up the xlogreader by one record when we stop before rather
> > > > > than after a record that we're replaying. What that means is that
> > > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > > replayed. There might be one more record that we read but did not
> > > > > replay, and that record won't impact the value we end up with in
> > > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > > record that we actually replayed. To put that another way, there's no
> > > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > > and before we change LastRec. So in the cases where
> > > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > > different from EndOfLog if something different happens when we re-read
> > > > > the last-replayed WAL record than what happened when we read it the
> > > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > > happen. I am inclined to think that it might be better not to reread
> > > > > the record at all, though.
> > > >
> > > > There are two reasons that the record is reread; first, one that you
> > > > have just explained where the redo loop drops out due to
> > > > recoveryStopsBefore() and another one is that InRecovery is false.
> > > >
> > > > In the formal case at the end, redo while-loop does read a new record
> > > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > > reach the place where we do reread record -- where we do read the
> > > > record (i.e. LastRec) before the record that redo loop has read and
> > > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > > don't need any adjustment to EndRecPtr.
> > > >
> > > > So technically one case needs reread but that is also not needed, we
> > > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > > be set conditionally something like:
> > > >
> > > > if (InRecovery)
> > > > {
> > > >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > > >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > > }
> > > > else
> > > > {
> > > >     EndOfLog = EndRecPtr;
> > > >     EndOfLogTLI = replayTLI;
> > > > }
> > > >
> > > > > As far as this patch goes, I think we need
> > > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > > that's only sometimes initialized and then not doing anything with it
> > > > > except in the cases where it was initialized.
> > > > >
> > > >
> > > > Another reason could be EndOfLog changes further in the following case:
> > > >
> > > > /*
> > > >  * Actually, if WAL ended in an incomplete record, skip the parts that
> > > >  * made it through and start writing after the portion that persisted.
> > > >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> > > >  * we'll do as soon as we're open for writing new WAL.)
> > > >  */
> > > > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > > > {
> > > >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> > > >     EndOfLog = missingContrecPtr;
> > > > }
> > > >
> > > > Now only solution that I can think is to copy EndOfLog (so
> > > > EndOfLogTLI) into shared memory.
> > > >
> > > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > > > think the regression tests contain any scenarios where we run recovery
> > > > > and the values end up different. However, I think that the code sets
> > > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > > > record originated. So imagine that we are looking for WAL that ought
> > > > > to be in 000000010000000000000003 but we don't find it; instead we
> > > > > find 000000020000000000000003 because our recovery target timeline is
> > > > > 2, or something that has 2 in its history. We will read the WAL for
> > > > > timeline 1 from this file which has timeline 2 in the file name. I
> > > > > think if recovery ends in this file before the timeline switch, these
> > > > > values will be different. I did not try to construct a test case for
> > > > > this today due to not having enough time, so it's possible that I'm
> > > > > wrong about this, but that's how it looks to me from the code.
> > > > >
> > > >
> > > > I am not sure, I have understood this scenario due to lack of
> > > > expertise in this area -- Why would the record we looking that ought
> > > > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > > > corruption or that file is missing?
> > > >
> > >
> > > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> > > XLogFileReadAnyTLI(), I think I could make a sense that there could be
> > > a case where the record belong to TLI 1 we are looking for; we might
> > > open the file with TLI 2. But, I am wondering what's wrong if we say
> > > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> > > in its file name -- that statement is still true, and that record
> > > should be still accessible from the filename with TLI 1.  Also, if we
> > > going to consider this reading record exists before the timeline
> > > switch point as the EndOfLog then why should be worried about the
> > > latter timeline switch which eventually everything after the EndOfLog
> > > going to be useless for us. We might continue switching TLI and/or
> > > writing the WAL right after EndOfLog, correct me if I am missing
> > > something here.
> > >
> > > Further, I still think replayEndTLI has set to the correct value what
> > > we looking for EndOfLogTLI when we go through the redo loop. When it
> > > read the record and finds a change in the current replayTLI then it
> > > updates that as:
> > >
> > > if (newReplayTLI != replayTLI)
> > > {
> > >     /* Check that it's OK to switch to this TLI */
> > >     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
> > >                         prevReplayTLI, replayTLI);
> > >
> > >     /* Following WAL records should be run with new TLI */
> > >     replayTLI = newReplayTLI;
> > >     switchedTLI = true;
> > > }
> > >
> > > Then replayEndTLI gets updated. If we going to skip the reread of
> > > "LastRec" that we were discussing, then I think the following code
> > > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> > > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> > > should be enough, AFAICU.
> > >
> > > /*
> > >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> > >  * the end-of-log. It could be different from the timeline that EndOfLog
> > >  * nominally belongs to, if there was a timeline switch in that segment,
> > >  * and we were reading the old WAL from a segment belonging to a higher
> > >  * timeline.
> > >  */
> > > EndOfLogTLI = xlogreader->seg.ws_tli;
> > >
> >
> > I think I found the right case for this, above TLI fetch is needed in
> > the case where we do restore from the archived WAL files. In my trial,
> > the archive directory has files as below (Kindly ignore the extra
> > history file, I perform a few more trials to be sure):
> >
> > -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
> > -rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> > -rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> > -rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
> > -rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
> > -rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history
> >
> > The timeline is switched in 1F file but the archiver has backup older
> > timeline file and renamed it. While performing PITR using these
> > archived files, the .partitial file seems to be skipped from the
> > restore. The file with the next timeline id is selected to read the
> > records that belong to the previous timeline id as well (i.e. 4 here,
> > all the records before timeline switch point). Here is the files
> > inside pg_wal directory after restore, note that in the current
> > experiment, I chose recovery_target_xid = <just before the timeline#5
> > switch point > and then recovery_target_action = 'promote':
> >
> > -rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> > -rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> > -rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
> > -rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
> > -rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
> >
> > The last one is the new WAL file created in that cluster.
> >
>
> With this experiment, I think it is clear that the EndOfLogTLI can be
> different from the replayEndTLI or lastReplayedTLI, and we don't have
> any other option to get that into other processes other than exporting
> into shared memory.  Similarly, we have bunch of option (e.g.
> replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN etc) to get
> EndOfLog value but those are not perfect and reliable options.
>
> Therefore, in the attached patch, I have exported EndOfLog and
> EndOfLogTLI into shared memory and attached only the refactoring
> patches since there a bunch of other work needs to be done on the main
> ASRO patches what I discussed with Robert off-list, thanks.
>

Attaching the rest of the patches. To execute XLogAcceptWrites() ->
PerformRecoveryXLogAction() in Checkpointer process; ideally, we
should perform full checkpoint but we can't do that using current
PerformRecoveryXLogAction() which would call RequestCheckpoint() with
WAIT flags which make the Checkpointer process wait infinite on itself
to finish the requested checkpoint, bad!!

The option we have is to change RequestCheckpoint() for the
Checkpointer process directly call CreateCheckPoint() as we do for
!IsPostmasterEnvironment case, but problem is that XLogWrite() running
inside Checkpointer process can reach to CreateCheckPoint() and cause
an unexpected behaviour that I have noted previously[1].  The
RequestCheckpoint() from XLogWrite() when inside Checkpointer process
is needed or not is need a separate discussion.   For now, I have
changed PerformRecoveryXLogAction() to call CreateCheckPoint() for the
Checkpointer process; in the v41-0003 version I tried to do the
changes to RequestCheckpoint() to avoid that but that change looks too
ugly.

Another problem is the recursive call to XLogAccepWrite() in the
Checkpointer process due to the aforesaid CreateCheckPoint() call from
PerformRecoveryXLogAction(). The reason is to avoid the delay in
processing WAL prohibit state change requests we do have added
ProcessWALProhibitStateChangeRequest() call multiple places that
Checkpointer can check and process while performing a long-running
checkpoint. When Checkpointer call CreateCheckPoint() from
PerformRecoveryXLogAction() then that can also hit
ProcessWALProhibitStateChangeRequest() and since XLogAccepWrite()
operation not completed yet that tried to do that again. To avoid that
I have added a flag that avoids ProcessWALProhibitStateChangeRequest()
execution is that flag is set, see
ProcessWALProhibitStateChangeRequest() in attached 0003 patch.

Note that both the issues, I noted above are boil down to
CreateCheckPoint() and its need. If we don't need to perform a full
checkpoint in our case then we might not have that recursion issue.
Instead, do the CreateEndOfRecoveryRecord() and then do the full
checkpoint that currently PerformRecoveryXLogAction() does for the
promotion case but not having full checkpoint looks might look scary.
I tried that and works fine for me, but I am not much confident about
that.

Regards,
Amul

1] https://postgr.es/m/CAAJ_b97fPWU_yyOg97Y5AtSvx5mrg2cGyz260swz5x5iPKEM+g@mail.gmail.com

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attaching the later version, has a few additional changes that decide
for the Checkpointer process where it should be halt or not in the wal
prohibited state; those changes are yet to be confirmed and tested
thoroughly, thanks.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Attached is rebase version for the latest maste head(#891624f0ec).

0001 and 0002 patch is changed a bit due to xlog.c refactoring
commit(#70e81861), needing a bit more thought to copy global variables into
right shared memory structure.  Also, I made some changes to the 0003
patch to avoid
XLogAcceptWrites() entrancing suggested in offline discussion.

Regards,
Amul

Attachment

Re: [Patch] ALTER SYSTEM READ ONLY

From
Bharath Rupireddy
Date:
On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
> >
>
> Thanks, I am getting one more failure for the vacuumlazy.c. on the
> latest master head(d75288fb27b), I fixed that in attached version.

Thanks Amul! I haven't looked at the whole thread, I may be repeating
things here, please bear with me.

1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
think the patches still have ALTER SYSTEM READ ONLY references.
2) IIUC, the idea of this patch is not to generate any new WAL when
set as default_transaction_read_only and transaction_read_only can't
guarantee that?
3) IMO, the function name pg_prohibit_wal doesn't look good where it
also allows one to set WAL writes, how about the following functions -
pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
pg_allow_wal or pg_allow_wal_{generation, inserts} without any
arguments and if needed a common function
pg_set_wal_generation_state(read-only/read-write) something like that?
4) It looks like only the checkpointer is setting the WAL prohibit
state? Is there a strong reason for that? Why can't the backend take a
lock on prohibit state in shared memory and set it and let the
checkpointer read it and block itself from writing WAL?
5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
checkpointer? Why?
6) What happens for long-running or in-progress transactions if
someone prohibits WAL in the midst of them? Do these txns fail? Or do
we say that we will allow them to run to completion? Or do we fail
those txns at commit time? One might use this feature to say not let
server go out of disk space, but if we allow in-progress txns to
generate/write WAL, then how can one achieve that with this feature?
Say, I monitor my server in such a way that at 90% of disk space,
prohibit WAL to avoid server crash. But if this feature allows
in-progress txns to generate WAL, then the server may still crash?
7) What are the other use-cases (I can think of - to avoid out of disk
crashes, block/freeze writes to database when the server is
compromised) with this feature? Any usages during/before failover,
promotion or after it?
8) Is there a strong reason that we've picked up conditional variable
wal_prohibit_cv over mutex/lock for updating WALProhibit shared
memory?
9) Any tests that you are planning to add?

Regards,
Bharath Rupireddy.



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
> > >
> >
> > Thanks, I am getting one more failure for the vacuumlazy.c. on the
> > latest master head(d75288fb27b), I fixed that in attached version.
>
> Thanks Amul! I haven't looked at the whole thread, I may be repeating
> things here, please bear with me.
>

Np, thanks for looking into it.

> 1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
> do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
> think the patches still have ALTER SYSTEM READ ONLY references.

Could you please point me to what those references are? I didn't find
any in the v45 version.

> 2) IIUC, the idea of this patch is not to generate any new WAL when
> set as default_transaction_read_only and transaction_read_only can't
> guarantee that?

No. Complete WAL write should be disabled, in other words XLogInsert()
should be restricted.

> 3) IMO, the function name pg_prohibit_wal doesn't look good where it
> also allows one to set WAL writes, how about the following functions -
> pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
> pg_allow_wal or pg_allow_wal_{generation, inserts} without any
> arguments and if needed a common function
> pg_set_wal_generation_state(read-only/read-write) something like that?

There are already similar suggestions before too, but none of that
finalized yet, there are other more challenges that need to be
handled, so we can keep this work at last.

> 4) It looks like only the checkpointer is setting the WAL prohibit
> state? Is there a strong reason for that? Why can't the backend take a
> lock on prohibit state in shared memory and set it and let the
> checkpointer read it and block itself from writing WAL?

Once WAL prohibited state transition is initiated and should be
completed, there is no fallback. What if the backed exit before the
complete transition? Similarly, even if the checkpointer exits,
that will be restarted again and will complete the state transition.

> 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
> checkpointer? Why?

Simply want to wake up the checkpointer process without asking for
specific work in the handle function. Another suitable choice will be
SIGINT, we can choose that too if needed.

> 6) What happens for long-running or in-progress transactions if
> someone prohibits WAL in the midst of them? Do these txns fail? Or do
> we say that we will allow them to run to completion? Or do we fail
> those txns at commit time? One might use this feature to say not let
> server go out of disk space, but if we allow in-progress txns to
> generate/write WAL, then how can one achieve that with this feature?
> Say, I monitor my server in such a way that at 90% of disk space,
> prohibit WAL to avoid server crash. But if this feature allows
> in-progress txns to generate WAL, then the server may still crash?

Read-only transactions will be allowed to continue, and if that
transaction tries to write or any other transaction that has performed
any writes already then the session running that transaction will be
terminated -- the design is described in the first mail of this
thread.

> 7) What are the other use-cases (I can think of - to avoid out of disk
> crashes, block/freeze writes to database when the server is
> compromised) with this feature? Any usages during/before failover,
> promotion or after it?

The important use case is for failover to avoid split-brain situations.

> 8) Is there a strong reason that we've picked up conditional variable
> wal_prohibit_cv over mutex/lock for updating WALProhibit shared
> memory?

I am not sure how that can be done using mutex or lock.

> 9) Any tests that you are planning to add?

Yes, we can. I have added very sophisticated tests that cover most of
my code changes, but that is not enough for such critical code
changes, have a lot of chances of improvement and adding more tests
for this module as well as other parts e.g. some missing coverage of
gin, gists, brin, core features where this patch is adding checks, etc.
Any help will be greatly appreciated.

Regards,
Amul



Re: [Patch] ALTER SYSTEM READ ONLY

From
Jacob Champion
Date:
On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is rebase version for the latest maste head(#891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this
patch dropped out of the commitfest when you withdrew it in January,
but it hasn't been added back with the most recent patchset you
posted. Was that intended, or did you want to re-register it for
review?

--Jacob



Re: [Patch] ALTER SYSTEM READ ONLY

From
Amul Sul
Date:
Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is rebase version for the latest maste head(#891624f0ec).
>
> Hi Amul,
>
> I'm going through past CF triage emails today; I noticed that this
> patch dropped out of the commitfest when you withdrew it in January,
> but it hasn't been added back with the most recent patchset you
> posted. Was that intended, or did you want to re-register it for
> review?
>

Yes, there is a plan to re-register it again but not anytime soon,
once we start to rework the design.

Regards,
Amul