Thread: [Patch] ALTER SYSTEM READ ONLY
Hi,
The attached patch proposes the $Subject feature, which forces the system into a
read-only mode in which inserting write-ahead log (WAL) is prohibited until ALTER
SYSTEM READ WRITE is executed.
The high-level goal is to make the availability/scale-out situation better. The feature
will help HA setups where the master server needs to stop accepting WAL writes
immediately and kick out any transaction that expects to write WAL at its end, for
example when the master's network goes down or replication connections fail.
For example, this feature allows for a controlled switchover without needing to shut
down the master. You can instead make the master read-only, wait until the standby
catches up, and then promote the standby. The master remains available for read
queries throughout, and also for WAL streaming, but without the possibility of any
new write transactions. After switchover is complete, the master can be shut down
and brought back up as a standby without needing to use pg_rewind. (Eventually, it
would be nice to be able to make the read-only master into a standby without having
to restart it, but that is a problem for another patch.)
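
The "wait until the standby catches up" step comes down to comparing WAL positions. Below is a minimal Python sketch of that comparison, not code from the patch. It assumes the LSNs are read as text in PostgreSQL's usual hi/lo hexadecimal form, for example from pg_current_wal_lsn() on the master and pg_last_wal_replay_lsn() on the standby. The helper names lsn_to_int and standby_caught_up are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_caught_up(master_lsn: str, standby_replay_lsn: str) -> bool:
    """True once the standby has replayed everything the master wrote.

    Assumption: the master is already read-only, so master_lsn is a fixed
    target rather than a moving one.
    """
    return lsn_to_int(standby_replay_lsn) >= lsn_to_int(master_lsn)


if __name__ == "__main__":
    print(standby_caught_up("0/3000148", "0/3000060"))  # standby still behind
    print(standby_caught_up("0/3000148", "0/3000148"))  # nothing left to replay
```

The point of making the master read-only first is that the target position stops moving, so a single comparison like this is enough to decide when promotion is safe.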
This might also help in failover scenarios. For example, if you detect that the master
has lost network connectivity to the standby, you might make it read-only after 30 s,
and promote the standby after 60 s, so that you never have two writable masters at
the same time. In this case, there's still some split-brain, but it's still better than what
we have now.
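
The timing argument can be checked with a toy model. This Python sketch only illustrates the 30 s / 60 s example above. It assumes each node measures the time since it noticed the connectivity loss on its own clock, and the function names and the "skew" parameter (how much earlier the standby noticed the loss than the master) are hypothetical.

```python
READ_ONLY_AFTER_S = 30  # master makes itself read-only after this long
PROMOTE_AFTER_S = 60    # standby is promoted after this long


def master_writable(elapsed_on_master: float) -> bool:
    return elapsed_on_master < READ_ONLY_AFTER_S


def standby_writable(elapsed_on_standby: float) -> bool:
    return elapsed_on_standby >= PROMOTE_AFTER_S


def two_writable_masters(elapsed_on_master: float, skew: float) -> bool:
    """True if both nodes accept writes at the same moment.

    skew is how many seconds earlier the standby started counting.
    """
    return master_writable(elapsed_on_master) and standby_writable(
        elapsed_on_master + skew
    )


if __name__ == "__main__":
    print(any(two_writable_masters(t, 0) for t in range(120)))   # no overlap
    print(any(two_writable_masters(t, 31) for t in range(120)))  # margin exceeded
```

In this model the 30-second gap between the two timeouts is the safety margin: the overlap only appears once the two nodes disagree by more than that about when the connection was lost.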
Design:
----------
The proposed feature is built atop the global barrier mechanism (commit [1]), which
coordinates global state changes across all active backends. A backend that executes
the ALTER SYSTEM READ { ONLY | WRITE } command places a request with the
checkpointer process to change the WAL state to WAL prohibited or WAL permitted,
respectively. When the checkpointer process sees the WAL prohibit state change
request, it emits a global barrier and waits until all backends that participate in
ProcSignal have absorbed it. Once that is done, the WAL read/write state in shared
memory and in the control file is updated so that XLogInsertAllowed() returns
accordingly.
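
The request/barrier/publish sequence described above can be pictured with a small single-threaded Python model. It is a sketch under stated assumptions, not the patch's implementation: the class and method names are hypothetical, "emit a barrier and wait" is reduced to a plain loop, and shared memory and the control file are ordinary attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    wal_prohibited: bool = False  # this backend's local view of the state

    def absorb_barrier(self, wal_prohibited: bool) -> None:
        self.wal_prohibited = wal_prohibited


@dataclass
class Cluster:
    backends: List[Backend]
    shared_wal_prohibited: bool = False        # stands in for shared memory
    control_file_wal_prohibited: bool = False  # stands in for the control file
    pending_request: Optional[bool] = None

    def alter_system_read(self, read_only: bool) -> None:
        """Backend side: only leaves a request for the checkpointer."""
        self.pending_request = read_only

    def checkpointer_cycle(self) -> None:
        """Checkpointer side: barrier first, then publish the new state."""
        if self.pending_request is None:
            return
        target = self.pending_request
        for backend in self.backends:  # emit barrier, wait for absorption
            backend.absorb_barrier(target)
        self.shared_wal_prohibited = target
        self.control_file_wal_prohibited = target
        self.pending_request = None

    def xlog_insert_allowed(self) -> bool:
        return not self.shared_wal_prohibited
```

Usage: after `cluster.alter_system_read(True)` nothing has changed yet; only the next `cluster.checkpointer_cycle()` makes `xlog_insert_allowed()` return False, and by then every backend has absorbed the barrier.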
If there are open transactions that have acquired an XID, the sessions are killed
before the barrier is absorbed. They can't commit without writing WAL, and they
can't abort without writing WAL, either, so we must at least abort the transaction. We
don't necessarily need to kill the session, but it's hard to avoid in all cases because
(1) if there are subtransactions active, we need to force the top-level abort record to
be written immediately, but we can't really do that while keeping the subtransactions
on the transaction stack, and (2) if the session is idle, we also need the top-level abort
record to be written immediately, but can't send an error to the client until the next
command is issued without losing wire protocol synchronization. For now, we just use
FATAL to kill the session; maybe this can be improved in the future.
Open transactions that don't have an XID are not killed, but will get an ERROR if they
try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
To make that happen, the patch adds a new coding rule: a critical section that will write
WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
AssertWALPermitted_HaveXID(). The two Assert variants are used when we know for certain
that inserting WAL here must be OK, either because we have an XID (we would have
been killed by a change to read-only if one had occurred) or for some other reason.
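
The two rules above (sessions that hold an XID are terminated, and every WAL-writing critical section is preceded by a permission check) can be modelled in a few lines of Python. This is a toy sketch, not the patch's C code: the exception classes stand in for FATAL, ERROR and PANIC, the Session class and its method names are hypothetical, and the only PostgreSQL behavior relied on is that an error raised inside a critical section is escalated to PANIC.

```python
from contextlib import contextmanager


class FatalError(Exception):
    """Stands in for FATAL: the session is terminated."""


class ReadOnlyError(Exception):
    """Stands in for an ordinary ERROR: only the statement fails."""


class Panic(Exception):
    """Stands in for PANIC: an error raised inside a critical section."""


class Session:
    def __init__(self, has_xid=False):
        self.has_xid = has_xid
        self.wal_prohibited = False
        self.wal = []  # records this session managed to insert

    def absorb_read_only_barrier(self):
        # With an XID, neither commit nor abort is possible without WAL.
        if self.has_xid:
            raise FatalError("system is now read only")
        self.wal_prohibited = True

    def check_wal_permitted(self):
        # Must be called before entering the critical section.
        if self.wal_prohibited:
            raise ReadOnlyError("system is now read only")

    @contextmanager
    def critical_section(self):
        try:
            yield
        except Exception as exc:  # any error in here is escalated
            raise Panic(str(exc)) from exc

    def xlog_insert(self, record):
        if self.wal_prohibited:
            raise ReadOnlyError("cannot insert WAL now")
        self.wal.append(record)

    def write_following_rule(self, record):
        self.check_wal_permitted()     # fails with a plain ERROR
        with self.critical_section():
            self.xlog_insert(record)

    def write_ignoring_rule(self, record):
        with self.critical_section():  # no check beforehand
            self.xlog_insert(record)   # failure here becomes PANIC
```

The design point is that the check turns what would be a PANIC inside the critical section into an ordinary ERROR before it.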
The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
SYSTEM READ WRITE update not only the shared memory state but also the control
file, so that changes survive a restart.
The transition between read-write and read-only is a pretty major transition, so we emit
a log message for each successful execution of an ALTER SYSTEM READ { ONLY | WRITE }
command. Also, we have added a new GUC, system_is_read_only, which returns "on"
when the system is in the WAL prohibited state or in recovery.
Another part of the patch that I am quite uneasy about, and that needs discussion, is that when the
system is shut down in the read-only state, we skip the shutdown checkpoint. At restart,
startup recovery is performed first, and the read-only state is restored afterwards to
prohibit further WAL writes, irrespective of whether the recovery checkpoint succeeded.
The concern here is that if this startup recovery checkpoint did not succeed, it will never
happen, even if the system is later put back into read-write mode. Thoughts?
Quick demo:
----------------
There are a few active sessions. Session 1 has performed some writes and then sat idle
in its transaction for a while. In the meantime, a superuser in session 2 successfully
switched the system to read-only with the ALTER SYSTEM READ ONLY command, which
kills session 1. Any other backend that tries to run a write transaction after that
gets a read-only error.
------------- SESSION 1 -------------
session_1=# BEGIN;
BEGIN
session_1=*# CREATE TABLE foo AS SELECT i FROM generate_series(1,5) i;
SELECT 5
------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ ONLY;
ALTER SYSTEM
------------- SESSION 1 -------------
session_1=*# COMMIT;
FATAL: system is now read only
HINT: Cannot continue a transaction if it has performed writes while system is read only.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
ERROR: cannot execute CREATE TABLE in a read-only transaction
------------- SESSION 4 -------------
session_4=# CHECKPOINT;
ERROR: system is now read only
The system can be put back into read-write mode with ALTER SYSTEM READ WRITE:
------------- SESSION 2 -------------
session_2=# ALTER SYSTEM READ WRITE;
ALTER SYSTEM
------------- SESSION 3 -------------
session_3=# CREATE TABLE foo_bar (i int);
CREATE TABLE
------------- SESSION 4 -------------
session_4=# CHECKPOINT;
CHECKPOINT
TODOs:
-----------
1. Documentation.
Attachments summary:
------------------------------
I tried to split the changes so that they are easy to read and show the
incremental implementation.
0001: Patch by Robert that adds the ability to raise an error during global barrier absorption.
0002: Implements the ALTER SYSTEM READ { ONLY | WRITE } syntax and psql tab
completion support for it.
0003: A basic implementation where the system accepts the $Subject command
and changes the system to read-only by emitting a barrier.
0004: Enhances this so that the backend executing the $Subject command
only places a request with the checkpointer, which is responsible for changing
the state by emitting the barrier. Also stores the state in the control file so
that it persists across server restarts.
0005: Tightens the checks to prevent errors inside critical sections.
0006: Documentation - WIP
Credit:
-------
The feature is one part of Andres Freund's high-level design ideas for inbuilt
graceful failover for PostgreSQL. Feature implementation design by Robert Haas.
Initial patch by Amit Khandekar; further work and improvements by me, under Robert's
guidance, including this mail writeup.
Ref:
----
[1] Global barrier commit # 16a4e4aecd47da7a6c4e1ebc20f6dd1a13f9133b
Thank you !
Regards,
Amul Sul
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v1-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch
- v1-0006-Documentation-WIP.patch
- v1-0002-Add-alter-system-read-only-write-syntax.patch
- v1-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch
- v1-0001-Allow-error-or-refusal-while-absorbing-barriers.patch
- v1-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch
On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:
>
> Hi,
>
> Attached patch proposes $Subject feature which forces the system into read-only
> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
> WRITE executed.
>
> The high-level goal is to make the availability/scale-out situation better. The feature
> will help HA setup where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting WAL writes at the end, in case
> of network down on master or replication connections failures.
>
> For example, this feature allows for a controlled switchover without needing to shut
> down the master. You can instead make the master read-only, wait until the standby
> catches up, and then promote the standby. The master remains available for read
> queries throughout, and also for WAL streaming, but without the possibility of any
> new write transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind. (Eventually, it
> would be nice to be able to make the read-only master into a standby without having
> to restart it, but that is a problem for another patch.)
>
> This might also help in failover scenarios. For example, if you detect that the master
> has lost network connectivity to the standby, you might make it read-only after 30 s,
> and promote the standby after 60 s, so that you never have two writable masters at
> the same time. In this case, there's still some split-brain, but it's still better than what
> we have now.
>
> Design:
> ----------
> The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
> global state changes to all active backends. Backends which executed
> ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> process to change the requested WAL read/write state aka WAL prohibited and WAL
> permitted state respectively. When the checkpointer process sees the WAL prohibit
> state change request, it emits a global barrier and waits until all backends that
> participate in the ProcSignal absorbs it. Once it has done the WAL read/write state in
> share memory and control file will be updated so that XLogInsertAllowed() returns
> accordingly.
>

Do we prohibit the checkpointer to write dirty pages and write a
checkpoint record as well? If so, will the checkpointer process
writes the current dirty pages and writes a checkpoint record or we
skip that as well?

> If there are open transactions that have acquired an XID, the sessions are killed
> before the barrier is absorbed.
>

What about prepared transactions?

> They can't commit without writing WAL, and they
> can't abort without writing WAL, either, so we must at least abort the transaction. We
> don't necessarily need to kill the session, but it's hard to avoid in all cases because
> (1) if there are subtransactions active, we need to force the top-level abort record to
> be written immediately, but we can't really do that while keeping the subtransactions
> on the transaction stack, and (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.
>
> Open transactions that don't have an XID are not killed, but will get an ERROR if they
> try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM).
>

What if vacuum is on an unlogged relation? Do we allow writes via
vacuum to unlogged relation?

> To make that happen, the patch adds a new coding rule: a critical section that will write
> WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or
> AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain
> that inserting WAL here must be OK, either because we have an XID (we would have
> been killed by a change to read-only if one had occurred) or for some other reason.
>
> The ALTER SYSTEM READ WRITE command can be used to reverse the effects of
> ALTER SYSTEM READ ONLY. Both ALTER SYSTEM READ ONLY and ALTER
> SYSTEM READ WRITE update not only the shared memory state but also the control
> file, so that changes survive a restart.
>
> The transition between read-write and read-only is a pretty major transition, so we emit
> log message for each successful execution of a ALTER SYSTEM READ {ONLY | WRITE}
> command. Also, we have added a new GUC system_is_read_only which returns "on"
> when the system is in WAL prohibited state or recovery.
>
> Another part of the patch that quite uneasy and need a discussion is that when the
> shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> startup recovery will be performed and latter the read-only state will be restored to
> prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> even if it's later put back into read-write mode.
>

I am not able to understand this problem. What do you mean by
"recovery checkpoint succeed or not", do you add a try..catch and skip
any error while performing recovery checkpoint?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 6/16/20 7:25 PM, amul sul wrote:
> Attached patch proposes $Subject feature which forces the system into read-only
> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
> WRITE executed.

Thanks Amul.

1) ALTER SYSTEM

postgres=# alter system read only;
ALTER SYSTEM
postgres=# alter system reset all;
ALTER SYSTEM
postgres=# create table t1(n int);
ERROR: cannot execute CREATE TABLE in a read-only transaction

Initially i thought after firing 'Alter system reset all' , it will be
back to normal.

can't we have a syntax like - "Alter system set read_only='True' ; "

so that ALTER SYSTEM command syntax should be same for all.

postgres=# \h alter system
Command: ALTER SYSTEM
Description: change a server configuration parameter
Syntax:
ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' | DEFAULT }

ALTER SYSTEM RESET configuration_parameter
ALTER SYSTEM RESET ALL

How we are going to justify this in help command of ALTER SYSTEM ?

2)When i connected to postgres in a single user mode , i was not able to
set the system in read only

[edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres

PostgreSQL stand-alone backend 14devel
backend> alter system read only;
ERROR: checkpointer is not running

backend>

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well? If so, will the checkpointer process
> writes the current dirty pages and writes a checkpoint record or we
> skip that as well?

I think the definition of this feature should be that you can't write
WAL. So, it's OK to write dirty pages in general, for example to allow
for buffer replacement so we can continue to run read-only queries. But
there's no reason for the checkpointer to do it: it shouldn't try to
checkpoint, and therefore it shouldn't write dirty pages either. (I'm
not sure if this is how the patch currently works; I'm describing how I
think it should work.)

> > If there are open transactions that have acquired an XID, the sessions are killed
> > before the barrier is absorbed.
>
> What about prepared transactions?

They don't matter. The problem with a running transaction that has an
XID is that somebody might end the session, and then we'd have to write
either a commit record or an abort record. But a prepared transaction
doesn't have that problem. You can't COMMIT PREPARED or ROLLBACK
PREPARED while the system is read-only, as I suppose anybody would
expect, but their mere existence isn't a problem.

> What if vacuum is on an unlogged relation? Do we allow writes via
> vacuum to unlogged relation?

Interesting question. I was thinking that we should probably teach the
autovacuum launcher to stop launching workers while the system is in a
READ ONLY state, but what about existing workers? Anything that
generates invalidation messages, acquires an XID, or writes WAL has to
be blocked in a read-only state; but I'm not sure to what extent the
first two of those things would be a problem for vacuuming an unlogged
table. I think you couldn't truncate it, at least, because that acquires
an XID.

> > Another part of the patch that quite uneasy and need a discussion is that when the
> > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first
> > startup recovery will be performed and latter the read-only state will be restored to
> > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The
> > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen
> > even if it's later put back into read-write mode.
>
> I am not able to understand this problem. What do you mean by
> "recovery checkpoint succeed or not", do you add a try..catch and skip
> any error while performing recovery checkpoint?

What I think should happen is that the end-of-recovery checkpoint should
be skipped, and then if the system is put back into read-write mode
later we should do it then. But I think right now the patch performs the
end-of-recovery checkpoint before restoring the read-only state, which
seems 100% wrong to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
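
The ordering suggested in the reply above (skip the end-of-recovery checkpoint while read-only, and perform it when the system returns to read-write) can be sketched as a tiny state machine in Python. This models the suggestion only, not what the v1 patch does. The Server class and its attributes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    wal_prohibited: bool            # state restored from the control file
    checkpoint_pending: bool = False
    checkpoints_done: int = 0

    def _checkpoint(self) -> None:
        # A checkpoint writes WAL, so it must never run while read-only.
        assert not self.wal_prohibited
        self.checkpoints_done += 1

    def finish_recovery(self) -> None:
        if self.wal_prohibited:
            self.checkpoint_pending = True  # skip now, remember it
        else:
            self._checkpoint()

    def alter_system_read_write(self) -> None:
        self.wal_prohibited = False
        if self.checkpoint_pending:
            self._checkpoint()              # the deferred checkpoint
            self.checkpoint_pending = False
```

With this ordering the checkpoint is deferred rather than lost, which is the concern raised in the original mail.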
On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote:
> 1) ALTER SYSTEM
>
> postgres=# alter system read only;
> ALTER SYSTEM
> postgres=# alter system reset all;
> ALTER SYSTEM
> postgres=# create table t1(n int);
> ERROR: cannot execute CREATE TABLE in a read-only transaction
>
> Initially i thought after firing 'Alter system reset all' , it will be
> back to normal.
>
> can't we have a syntax like - "Alter system set read_only='True' ; "

No, this needs to be separate from the GUC-modification syntax, I think.
It's a different kind of state change. It doesn't, and can't, just edit
postgresql.auto.conf.

> 2)When i connected to postgres in a single user mode , i was not able to
> set the system in read only
>
> [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres
>
> PostgreSQL stand-alone backend 14devel
> backend> alter system read only;
> ERROR: checkpointer is not running
>
> backend>

Hmm, that's an interesting finding. I wonder what happens if you make
the system read only, shut it down, and then restart it in single-user
mode. Given what you see here, I bet you can't put it back into a
read-write state from single user mode either, which seems like a
problem. Either single-user mode should allow changing between R/O and
R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ
ONLY and always allow writes anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Tue, Jun 16, 2020 at 7:26 PM amul sul <sulamul@gmail.com> wrote:
>> Attached patch proposes $Subject feature which forces the system into read-only
>> mode where insert write-ahead log will be prohibited until ALTER SYSTEM READ
>> WRITE executed.

> Do we prohibit the checkpointer to write dirty pages and write a
> checkpoint record as well?

I think this is a really bad idea and should simply be rejected.

Aside from the points you mention, such a switch would break autovacuum.
It would break the ability for scans to do HOT-chain cleanup, which would
likely lead to some odd behaviors (if, eg, somebody flips the switch
between where that's supposed to happen and where an update needs to
happen on the same page). It would break the ability for indexscans to do
killed-tuple marking, which is critical for performance in some scenarios.
It would break the ability to set tuple hint bits, which is even more
critical for performance. It'd possibly break, or at least complicate,
logic in index AMs to deal with index format updates --- I'm fairly sure
there are places that will try to update out-of-date data structures
rather than cope with the old structure, even in nominally read-only
searches.

I also think that putting such a thing into ALTER SYSTEM has got big
logical problems. Someday we will probably want to have ALTER SYSTEM
write WAL so that standby servers can absorb the settings changes.
But if writing WAL is disabled, how can you ever turn the thing off again?

Lastly, the arguments in favor seem pretty bogus. HA switchover normally
involves just killing the primary server, not expecting that you can
leisurely issue some commands to it first. Commands that involve a whole
bunch of subtle interlocking --- and, therefore, aren't going to work if
anything has gone wrong already anywhere in the server --- seem like a
particularly poor thing to be hanging your HA strategy on. I also wonder
what this accomplishes that couldn't be done much more simply by killing
the walsenders.

In short, I see a huge amount of complexity here, an ongoing source of
hard-to-identify, hard-to-fix bugs, and not very much real usefulness.

regards, tom lane
On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Aside from the points you mention, such a switch would break autovacuum. > It would break the ability for scans to do HOT-chain cleanup, which would > likely lead to some odd behaviors (if, eg, somebody flips the switch > between where that's supposed to happen and where an update needs to > happen on the same page). It would break the ability for indexscans to do > killed-tuple marking, which is critical for performance in some scenarios. > It would break the ability to set tuple hint bits, which is even more > critical for performance. It'd possibly break, or at least complicate, > logic in index AMs to deal with index format updates --- I'm fairly sure > there are places that will try to update out-of-date data structures > rather than cope with the old structure, even in nominally read-only > searches. This seems like pretty dubious hand-waving. Of course, things that write WAL are going to be broken by a switch that prevents writing WAL; but if they were not, there would be no purpose in having such a switch, so that's not really an argument. But you seem to have mixed in some things that don't require writing WAL, and claimed without evidence that those would somehow also be broken. I don't think that's the case, but even if it were, so what? We live with all of these restrictions on standbys anyway. > I also think that putting such a thing into ALTER SYSTEM has got big > logical problems. Someday we will probably want to have ALTER SYSTEM > write WAL so that standby servers can absorb the settings changes. > But if writing WAL is disabled, how can you ever turn the thing off again? I mean, the syntax that we use for a feature like this is arbitrary. I picked this one, so I like it, but it can easily be changed if other people want something else. The rest of this argument doesn't seem to me to make very much sense. 
The existing ALTER SYSTEM functionality to modify a text configuration file isn't replicated today and I'm not sure why we should make it so, considering that replication generally only considers things that are guaranteed to be the same on the master and the standby, which this is not. But even if we did, that has nothing to do with whether some functionality that changes the system state without changing a text file ought to also be replicated. This is a piece of cluster management functionality and it makes no sense to replicate it. And no right-thinking person would ever propose to change a feature that renders the system read-only in such a way that it was impossible to deactivate it. That would be nuts. > Lastly, the arguments in favor seem pretty bogus. HA switchover normally > involves just killing the primary server, not expecting that you can > leisurely issue some commands to it first. Yeah, that's exactly the problem I want to fix. If you kill the master server, then you have interrupted service, even for read-only queries. That sucks. Also, even if you don't care about interrupting service on the master, it's actually sorta hard to guarantee a clean switchover. The walsenders are supposed to send all the WAL from the master before exiting, but if the connection is broken for some reason, then the master is down and the standbys can't stream the rest of the WAL. You can start it up again, but then you might generate more WAL. You can try to copy the WAL around manually from one pg_wal directory to another, but that's not a very nice thing for users to need to do manually, and seems buggy and error-prone. And how do you figure out where the WAL ends on the master and make sure that the standby replayed it all? If the master is up, it's easy: you just use the same queries you use all the time. If the master is down, you have to use some different technique that involves manually examining files or scrutinizing pg_controldata output. 
It's actually very difficult to get this right. > Commands that involve a whole > bunch of subtle interlocking --- and, therefore, aren't going to work if > anything has gone wrong already anywhere in the server --- seem like a > particularly poor thing to be hanging your HA strategy on. It's important not to conflate controlled switchover with failover. When there's a failover, you have to accept some risk of data loss or service interruption; but a controlled switchover does not need to carry the same risks and there are plenty of systems out there where it doesn't. > I also wonder > what this accomplishes that couldn't be done much more simply by killing > the walsenders. Killing the walsenders does nothing ... the clients immediately reconnect. > In short, I see a huge amount of complexity here, an ongoing source of > hard-to-identify, hard-to-fix bugs, and not very much real usefulness. I do think this is complex and the risk of bugs that are hard to identify or hard to fix certainly needs to be considered. I strenuously disagree with the idea that there is not very much real usefulness. Getting failover set up in a way that actually works robustly is, in my experience, one of the two or three most serious challenges my employer's customers face today. The core server support we provide for that is breathtakingly primitive, and it's urgent that we do better. Cloud providers are moving users from PostgreSQL to their own forks of PostgreSQL in vast numbers in large part because users don't want to deal with this crap, and the cloud providers have made it so they don't have to. People running PostgreSQL themselves need complex third-party tools and even then the experience isn't as good as what a major cloud provider would offer. This patch is not going to fix that, but I think it's a step in the right direction, and I hope others will agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
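A minimal sketch of the "master is up" check mentioned above, assuming the two WAL positions have already been fetched as text, for example with pg_current_wal_lsn() on the read-only primary and pg_last_wal_replay_lsn() on the standby. The helper names parse_lsn and standby_caught_up are made up for illustration; they are not part of PostgreSQL or of the patch.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in PostgreSQL's textual 'hi/lo' hex form to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_caught_up(primary_lsn: str, standby_replay_lsn: str) -> bool:
    """Return True once the standby has replayed up to the primary's position."""
    return parse_lsn(standby_replay_lsn) >= parse_lsn(primary_lsn)
```

Once the primary no longer generates WAL, its position stops moving, so a single successful comparison would be enough before promoting the standby.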
Robert Haas <robertmhaas@gmail.com> writes: > This seems like pretty dubious hand-waving. Of course, things that > write WAL are going to be broken by a switch that prevents writing > WAL; but if they were not, there would be no purpose in having such a > switch, so that's not really an argument. But you seem to have mixed > in some things that don't require writing WAL, and claimed without > evidence that those would somehow also be broken. Which of the things I mentioned don't require writing WAL? You're right that these are the same things that we already forbid on a standby, for the same reason, so maybe it won't be as hard to identify them as I feared. I wonder whether we should envision this as "demote primary to standby" rather than an independent feature. >> I also think that putting such a thing into ALTER SYSTEM has got big >> logical problems. > ... no right-thinking person would ever propose to > change a feature that renders the system read-only in such a way that > it was impossible to deactivate it. That would be nuts. My point was that putting this in ALTER SYSTEM paints us into a corner as to what we can do with ALTER SYSTEM in the future: we won't ever be able to make that do anything that would require writing WAL. And I don't entirely believe your argument that that will never be something we'd want to do. regards, tom lane
On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Which of the things I mentioned don't require writing WAL? Writing hint bits and marking index tuples as killed do not write WAL unless checksums are enabled. > You're right that these are the same things that we already forbid on a > standby, for the same reason, so maybe it won't be as hard to identify > them as I feared. I wonder whether we should envision this as "demote > primary to standby" rather than an independent feature. See my comments on the nearby pg_demote thread. I think we want both. > >> I also think that putting such a thing into ALTER SYSTEM has got big > >> logical problems. > > > ... no right-thinking person would ever propose to > > change a feature that renders the system read-only in such a way that > > it was impossible to deactivate it. That would be nuts. > > My point was that putting this in ALTER SYSTEM paints us into a corner > as to what we can do with ALTER SYSTEM in the future: we won't ever be > able to make that do anything that would require writing WAL. And I > don't entirely believe your argument that that will never be something > we'd want to do. I think that depends a lot on how you view ALTER SYSTEM. I believe it would be reasonable to view ALTER SYSTEM as a catch-all for commands that make system-wide state changes, even if those changes are not all of the same kind as each other; some might be machine-local, and others cluster-wide; some WAL-logged, and others not. I don't think it's smart to view ALTER SYSTEM through a lens that boxes it into only editing postgresql.auto.conf; if that were so, we ought to have called it ALTER CONFIGURATION FILE or something rather than ALTER SYSTEM. For that reason, I do not see the choice of syntax as painting us into a corner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 17, 2020 at 12:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Which of the things I mentioned don't require writing WAL? > Writing hint bits and marking index tuples as killed do not write WAL > unless checksums are enabled. And your point is? I thought enabling checksums was considered good practice these days. >> You're right that these are the same things that we already forbid on a >> standby, for the same reason, so maybe it won't be as hard to identify >> them as I feared. I wonder whether we should envision this as "demote >> primary to standby" rather than an independent feature. > See my comments on the nearby pg_demote thread. I think we want both. Well, if pg_demote can be done for X amount of effort, and largely gets the job done, while this requires 10X or 100X the effort and introduces 10X or 100X as many bugs, I'm not especially convinced that we want both. regards, tom lane
On Wed, Jun 17, 2020 at 12:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Writing hint bits and marking index tuples as killed do not write WAL > > unless checksums are enabled. > > And your point is? I thought enabling checksums was considered > good practice these days. I don't want to have an argument about what typical or best practices are; I wasn't trying to make any point about that one way or the other. I'm just saying that the operations you listed don't necessarily all write WAL. In any event, even if they did, the larger point is that standbys work like that, too, so it's not unprecedented or illogical to think of such things. > >> You're right that these are the same things that we already forbid on a > >> standby, for the same reason, so maybe it won't be as hard to identify > >> them as I feared. I wonder whether we should envision this as "demote > >> primary to standby" rather than an independent feature. > > > See my comments on the nearby pg_demote thread. I think we want both. > > Well, if pg_demote can be done for X amount of effort, and largely > gets the job done, while this requires 10X or 100X the effort and > introduces 10X or 100X as many bugs, I'm not especially convinced > that we want both. Sure: if two features duplicate each other, and one of them is way more work and way more buggy, then it's silly to have both, and we should just accept the easy, bug-free one. However, as I said in the other email to which I referred you, I currently believe that these two features actually don't duplicate each other and that using them both together would be quite beneficial. Also, even if they did, I don't know where you are getting the idea that this feature will be 10X or 100X more work and more buggy than the other one. I have looked at this code prior to it being posted, but I haven't looked at the other code at all; I am guessing that you have looked at neither.
I would be happy if you did, because it is often the case that architectural issues that escape other people are apparent to you upon examination, and it's always nice to know about those earlier rather than later so that one can decide to (a) give up or (b) fix them. But I see no point in speculating in the abstract that such issues may exist and that they may be more severe in one case than the other. My own guess is that, properly implemented, they are within 2-3X of each other in one direction or the other, not 10-100X. It is almost unbelievable to me that the pg_demote patch could be 100X simpler than this one; if it were, I'd be practically certain it was a 5-minute hack job unworthy of any serious consideration. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-06-17 12:07:22 -0400, Robert Haas wrote: > On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I also think that putting such a thing into ALTER SYSTEM has got big > > logical problems. Someday we will probably want to have ALTER SYSTEM > > write WAL so that standby servers can absorb the settings changes. > > But if writing WAL is disabled, how can you ever turn the thing off again? > > I mean, the syntax that we use for a feature like this is arbitrary. I > picked this one, so I like it, but it can easily be changed if other > people want something else. The rest of this argument doesn't seem to > me to make very much sense. The existing ALTER SYSTEM functionality to > modify a text configuration file isn't replicated today and I'm not > sure why we should make it so, considering that replication generally > only considers things that are guaranteed to be the same on the master > and the standby, which this is not. But even if we did, that has > nothing to do with whether some functionality that changes the system > state without changing a text file ought to also be replicated. This > is a piece of cluster management functionality and it makes no sense > to replicate it. And no right-thinking person would ever propose to > change a feature that renders the system read-only in such a way that > it was impossible to deactivate it. That would be nuts. I agree that the concrete syntax here doesn't seem to matter much. If this worked by actually putting a GUC into the config file, it would perhaps matter a bit more, but it doesn't afaict. It seems good to avoid new top-level statements, and ALTER SYSTEM seems to fit well. I wonder if there's an argument about wanting to be able to execute this command over a physical replication connection? 
I think this feature fairly obviously is a building block for "gracefully failover to this standby", and it seems like it'd be nicer if that didn't potentially require two pg_hba.conf entries for the to-be-promoted primary on the current/old primary? > > Lastly, the arguments in favor seem pretty bogus. HA switchover normally > > involves just killing the primary server, not expecting that you can > > leisurely issue some commands to it first. > > Yeah, that's exactly the problem I want to fix. If you kill the master > server, then you have interrupted service, even for read-only queries. > That sucks. Also, even if you don't care about interrupting service on > the master, it's actually sorta hard to guarantee a clean switchover. > The walsenders are supposed to send all the WAL from the master before > exiting, but if the connection is broken for some reason, then the > master is down and the standbys can't stream the rest of the WAL. You > can start it up again, but then you might generate more WAL. You can > try to copy the WAL around manually from one pg_wal directory to > another, but that's not a very nice thing for users to need to do > manually, and seems buggy and error-prone. Also (I'm sure you're aware) if you just non-gracefully shut down the old primary, you're going to have to rewind the old primary to be able to use it as a standby. And if you non-gracefully stop you're gonna incur checkpoint overhead, which is *massive* on non-toy databases. There's a huge practical difference between a minor version upgrade causing 10s of unavailability and causing 5min-30min. > And how do you figure out where the WAL ends on the master and make > sure that the standby replayed it all? If the master is up, it's easy: > you just use the same queries you use all the time. If the master is > down, you have to use some different technique that involves manually > examining files or scrutinizing pg_controldata output. It's actually > very difficult to get this right. 
Yea, it's absurdly hard. I think it's really kind of ridiculous that we expect others to get this right if we, the developers of this stuff, can't really get it right because it's so complicated. Which imo makes this: > > Commands that involve a whole > > bunch of subtle interlocking --- and, therefore, aren't going to work if > > anything has gone wrong already anywhere in the server --- seem like a > > particularly poor thing to be hanging your HA strategy on. more of an argument for having this type of stuff builtin. > It's important not to conflate controlled switchover with failover. > When there's a failover, you have to accept some risk of data loss or > service interruption; but a controlled switchover does not need to > carry the same risks and there are plenty of systems out there where > it doesn't. Yup. Greetings, Andres Freund
On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Do we prohibit the checkpointer to write dirty pages and write a > > checkpoint record as well? If so, will the checkpointer process > > writes the current dirty pages and writes a checkpoint record or we > > skip that as well? > > I think the definition of this feature should be that you can't write > WAL. So, it's OK to write dirty pages in general, for example to allow > for buffer replacement so we can continue to run read-only queries. > But there's no reason for the checkpointer to do it: it shouldn't try > to checkpoint, and therefore it shouldn't write dirty pages either. > (I'm not sure if this is how the patch currently works; I'm describing > how I think it should work.) > You are correct -- writing dirty pages is not restricted. > > > If there are open transactions that have acquired an XID, the sessions are killed > > > before the barrier is absorbed. > > > > What about prepared transactions? > > They don't matter. The problem with a running transaction that has an > XID is that somebody might end the session, and then we'd have to > write either a commit record or an abort record. But a prepared > transaction doesn't have that problem. You can't COMMIT PREPARED or > ROLLBACK PREPARED while the system is read-only, as I suppose anybody > would expect, but their mere existence isn't a problem. > > > What if vacuum is on an unlogged relation? Do we allow writes via > > vacuum to unlogged relation? > > Interesting question. I was thinking that we should probably teach the > autovacuum launcher to stop launching workers while the system is in a > READ ONLY state, but what about existing workers? 
Anything that > generates invalidation messages, acquires an XID, or writes WAL has to > be blocked in a read-only state; but I'm not sure to what extent the > first two of those things would be a problem for vacuuming an unlogged > table. I think you couldn't truncate it, at least, because that > acquires an XID. > > > > Another part of the patch that quite uneasy and need a discussion is that when the > > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first > > > startup recovery will be performed and latter the read-only state will be restored to > > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The > > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen > > > even if it's later put back into read-write mode. > > > > I am not able to understand this problem. What do you mean by > > "recovery checkpoint succeed or not", do you add a try..catch and skip > > any error while performing recovery checkpoint? > > What I think should happen is that the end-of-recovery checkpoint > should be skipped, and then if the system is put back into read-write > mode later we should do it then. But I think right now the patch > performs the end-of-recovery checkpoint before restoring the read-only > state, which seems 100% wrong to me. > Yeah, we need more thought on how to proceed further. I kind of agree with Robert that the current behavior is not right, since writing the end-of-recovery checkpoint violates the no-WAL-write rule. Regards, Amul
On Wed, Jun 17, 2020 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 9:51 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > > 1) ALTER SYSTEM > > > > postgres=# alter system read only; > > ALTER SYSTEM > > postgres=# alter system reset all; > > ALTER SYSTEM > > postgres=# create table t1(n int); > > ERROR: cannot execute CREATE TABLE in a read-only transaction > > > > Initially i thought after firing 'Alter system reset all' , it will be > > back to normal. > > > > can't we have a syntax like - "Alter system set read_only='True' ; " > > No, this needs to be separate from the GUC-modification syntax, I > think. It's a different kind of state change. It doesn't, and can't, > just edit postgresql.auto.conf. > > > 2)When i connected to postgres in a single user mode , i was not able to > > set the system in read only > > > > [edb@tushar-ldap-docker bin]$ ./postgres --single -D data postgres > > > > PostgreSQL stand-alone backend 14devel > > backend> alter system read only; > > ERROR: checkpointer is not running > > > > backend> > > Hmm, that's an interesting finding. I wonder what happens if you make > the system read only, shut it down, and then restart it in single-user > mode. Given what you see here, I bet you can't put it back into a > read-write state from single user mode either, which seems like a > problem. Either single-user mode should allow changing between R/O and > R/W, or alternatively single-user mode should ignore ALTER SYSTEM READ > ONLY and always allow writes anyway. > Ok, will try to enable changing between R/O and R/W in the next version. Thanks Tushar for the testing. Regards, Amul
On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Do we prohibit the checkpointer to write dirty pages and write a > > checkpoint record as well? If so, will the checkpointer process > > writes the current dirty pages and writes a checkpoint record or we > > skip that as well? > > I think the definition of this feature should be that you can't write > WAL. So, it's OK to write dirty pages in general, for example to allow > for buffer replacement so we can continue to run read-only queries. > For buffer replacement, we often also have to perform XLogFlush; what do we do for that? We can't proceed without doing that, and erroring out from there means stopping the read-only query from the user's perspective. > But there's no reason for the checkpointer to do it: it shouldn't try > to checkpoint, and therefore it shouldn't write dirty pages either. > What is the harm in doing the checkpoint before we put the system into READ ONLY state? The advantage is that we can at least reduce the recovery time if we allow writing a checkpoint record. > > > What if vacuum is on an unlogged relation? Do we allow writes via > > vacuum to unlogged relation? > > Interesting question. I was thinking that we should probably teach the > autovacuum launcher to stop launching workers while the system is in a > READ ONLY state, but what about existing workers? Anything that > generates invalidation messages, acquires an XID, or writes WAL has to > be blocked in a read-only state; but I'm not sure to what extent the > first two of those things would be a problem for vacuuming an unlogged > table. I think you couldn't truncate it, at least, because that > acquires an XID. > If the truncate operation errors out, then won't the system again trigger a new autovacuum worker for the same relation, as we update stats at the end?
Also, in general for regular tables, if there is an error while it tries to write WAL, it could again trigger the autovacuum worker for the same relation. If this is true, then it will unnecessarily generate a lot of dirty pages, and I don't think it would be good for the system to behave that way. > > > Another part of the patch that quite uneasy and need a discussion is that when the > > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first > > > startup recovery will be performed and latter the read-only state will be restored to > > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The > > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen > > > even if it's later put back into read-write mode. > > > > I am not able to understand this problem. What do you mean by > > "recovery checkpoint succeed or not", do you add a try..catch and skip > > any error while performing recovery checkpoint? > > What I think should happen is that the end-of-recovery checkpoint > should be skipped, and then if the system is put back into read-write > mode later we should do it then. > But then if we have to perform recovery again, it will start from the previous checkpoint. I think we have to live with it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, 17 Jun 2020 12:07:22 -0400 Robert Haas <robertmhaas@gmail.com> wrote: [...] > > Commands that involve a whole > > bunch of subtle interlocking --- and, therefore, aren't going to work if > > anything has gone wrong already anywhere in the server --- seem like a > > particularly poor thing to be hanging your HA strategy on. > > It's important not to conflate controlled switchover with failover. > When there's a failover, you have to accept some risk of data loss or > service interruption; but a controlled switchover does not need to > carry the same risks and there are plenty of systems out there where > it doesn't. Yes. Maybe we should make sure the wording we are using is the same for everyone. I have already heard or read "failover", "controlled failover", "switchover" and "controlled switchover"; this is confusing. My definition of switchover is: swapping primary and secondary status between two replicating instances, with no data loss. This is a controlled procedure where all steps must succeed to complete. If a step fails, the procedure falls back to the original primary with no data loss. However, Wikipedia has a broader definition, including situations where the switchover is executed upon a failure: https://en.wikipedia.org/wiki/Switchover Regards,
On Tue, 16 Jun 2020 at 14:56, amul sul <sulamul@gmail.com> wrote:
> The high-level goal is to make the availability/scale-out situation better. The feature
> will help HA setup where the master server needs to stop accepting WAL writes
> immediately and kick out any transaction expecting WAL writes at the end, in case
> of network down on master or replication connections failures.
> For example, this feature allows for a controlled switchover without needing to shut
> down the master. You can instead make the master read-only, wait until the standby
> catches up, and then promote the standby. The master remains available for read
> queries throughout, and also for WAL streaming, but without the possibility of any
> new write transactions. After switchover is complete, the master can be shut down
> and brought back up as a standby without needing to use pg_rewind. (Eventually, it
> would be nice to be able to make the read-only master into a standby without having
> to restart it, but that is a problem for another patch.)
> This might also help in failover scenarios. For example, if you detect that the master
> has lost network connectivity to the standby, you might make it read-only after 30 s,
> and promote the standby after 60 s, so that you never have two writable masters at
> the same time. In this case, there's still some split-brain, but it's still better than what
> we have now.
> If there are open transactions that have acquired an XID, the sessions are killed
> before the barrier is absorbed.
> inbuilt graceful failover for PostgreSQL
That doesn't appear to be very graceful. Perhaps objections could be assuaged by having a smoother transition and perhaps not even a full barrier, initially.
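The 30 s / 60 s timing from the original proposal can be sketched as follows. This is only an illustration: it assumes both nodes notice the lost connection at about the same moment, and the constants and function names are hypothetical, not settings or APIs from the patch.

```python
READ_ONLY_AFTER_S = 30.0  # old master stops accepting writes after this long
PROMOTE_AFTER_S = 60.0    # standby is promoted only after this long


def old_master_is_writable(seconds_since_link_lost: float) -> bool:
    """The old master stays writable only until the read-only deadline."""
    return seconds_since_link_lost < READ_ONLY_AFTER_S


def standby_may_promote(seconds_since_link_lost: float) -> bool:
    """The standby may become writable only after the promote deadline."""
    return seconds_since_link_lost >= PROMOTE_AFTER_S


def two_writable_masters(seconds_since_link_lost: float) -> bool:
    """True only if both nodes would accept writes at the same moment."""
    return (old_master_is_writable(seconds_since_link_lost)
            and standby_may_promote(seconds_since_link_lost))
```

The 30-second gap between the two deadlines is the margin that absorbs differences in when each side detects the outage.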
On Wed, Jun 17, 2020 at 9:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Lastly, the arguments in favor seem pretty bogus. HA switchover normally > > involves just killing the primary server, not expecting that you can > > leisurely issue some commands to it first. > > Yeah, that's exactly the problem I want to fix. If you kill the master > server, then you have interrupted service, even for read-only queries. > Yeah, but if there is a synchronous standby (a standby that provides sync replication), the user can always route connections to it (automatically, if there is some middleware that can detect and route the connection to the standby). > That sucks. Also, even if you don't care about interrupting service on > the master, it's actually sorta hard to guarantee a clean switchover. > Fair enough. However, it is not described in the initial email (unless I have missed it; there is a mention that this patch is one part of that bigger feature but no further explanation of that bigger feature) how this feature will allow a clean switchover. I think before we put the system into READ ONLY state, there could be some WAL which we haven't sent to the standby; what do we do for that? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 18, 2020 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 8:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Wed, Jun 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Do we prohibit the checkpointer to write dirty pages and write a > > > checkpoint record as well? If so, will the checkpointer process > > > writes the current dirty pages and writes a checkpoint record or we > > > skip that as well? > > > > I think the definition of this feature should be that you can't write > > WAL. So, it's OK to write dirty pages in general, for example to allow > > for buffer replacement so we can continue to run read-only queries. > > > > For buffer replacement, many-a-times we have to also perform > XLogFlush, what do we do for that? We can't proceed without doing > that and erroring out from there means stopping read-only query from > the user perspective. > The read-only state does not restrict XLogFlush(). > > But there's no reason for the checkpointer to do it: it shouldn't try > > to checkpoint, and therefore it shouldn't write dirty pages either. > > > > What is the harm in doing the checkpoint before we put the system into > READ ONLY state? The advantage is that we can at least reduce the > recovery time if we allow writing checkpoint record. > The checkpoint could take a long time, whereas the intent is to switch to the read-only state quickly. > > > > > What if vacuum is on an unlogged relation? Do we allow writes via > > > vacuum to unlogged relation? > > > > Interesting question. I was thinking that we should probably teach the > > autovacuum launcher to stop launching workers while the system is in a > > READ ONLY state, but what about existing workers? Anything that > > generates invalidation messages, acquires an XID, or writes WAL has to > > be blocked in a read-only state; but I'm not sure to what extent the > > first two of those things would be a problem for vacuuming an unlogged > > table.
I think you couldn't truncate it, at least, because that > > acquires an XID. > > > > If the truncate operation errors out, then won't the system will again > trigger a new autovacuum worker for the same relation as we update > stats at the end? Also, in general for regular tables, if there is an > error while it tries to WAL, it could again trigger the autovacuum > worker for the same relation. If this is true then unnecessarily it > will generate a lot of dirty pages and don't think it will be good for > the system to behave that way? > No new autovacuum worker will be forked in the read-only state, and existing ones will get an error if they try to write WAL after barrier absorption. > > > > Another part of the patch that quite uneasy and need a discussion is that when the > > > > shutdown in the read-only state we do skip shutdown checkpoint and at a restart, first > > > > startup recovery will be performed and latter the read-only state will be restored to > > > > prohibit further WAL write irrespective of recovery checkpoint succeed or not. The > > > > concern is here if this startup recovery checkpoint wasn't ok, then it will never happen > > > > even if it's later put back into read-write mode. > > > > > > I am not able to understand this problem. What do you mean by > > > "recovery checkpoint succeed or not", do you add a try..catch and skip > > > any error while performing recovery checkpoint? > > > > What I think should happen is that the end-of-recovery checkpoint > > should be skipped, and then if the system is put back into read-write > > mode later we should do it then. > > > > But then if we have to perform recovery again, it will start from the > previous checkpoint. I think we have to live with it. > Let me explain the case: if we skip the end-of-recovery checkpoint while starting the system in read-only mode, and then later change the state to read-write and do a few write operations and online checkpoints, will that be fine?
I have yet to explore those things. Regards, Amul
On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > For buffer replacement, many-a-times we have to also perform > XLogFlush, what do we do for that? We can't proceed without doing > that and erroring out from there means stopping read-only query from > the user perspective. I think we should stop WAL writes, then XLogFlush() once, then declare the system R/O. After that there might be more XLogFlush() calls but there won't be any new WAL, so they won't do anything. > > But there's no reason for the checkpointer to do it: it shouldn't try > > to checkpoint, and therefore it shouldn't write dirty pages either. > > What is the harm in doing the checkpoint before we put the system into > READ ONLY state? The advantage is that we can at least reduce the > recovery time if we allow writing checkpoint record. Well, as Andres says in http://postgr.es/m/20200617180546.yucxtiupvxghxss6@alap3.anarazel.de it can take a really long time. > > Interesting question. I was thinking that we should probably teach the > > autovacuum launcher to stop launching workers while the system is in a > > READ ONLY state, but what about existing workers? Anything that > > generates invalidation messages, acquires an XID, or writes WAL has to > > be blocked in a read-only state; but I'm not sure to what extent the > > first two of those things would be a problem for vacuuming an unlogged > > table. I think you couldn't truncate it, at least, because that > > acquires an XID. > > > > If the truncate operation errors out, then won't the system will again > trigger a new autovacuum worker for the same relation as we update > stats at the end? Not if we do what I said in that paragraph. If we're not launching new workers we can't again trigger a worker for the same relation. > Also, in general for regular tables, if there is an > error while it tries to WAL, it could again trigger the autovacuum > worker for the same relation. 
If this is true then unnecessarily it > will generate a lot of dirty pages and don't think it will be good for > the system to behave that way? I don't see how this would happen. VACUUM can't really dirty pages without writing WAL, can it? And, anyway, if there's an error, we're not going to try again for the same relation unless we launch new workers. > > What I think should happen is that the end-of-recovery checkpoint > > should be skipped, and then if the system is put back into read-write > > mode later we should do it then. > > But then if we have to perform recovery again, it will start from the > previous checkpoint. I think we have to live with it. Yeah. I don't think it's that bad. The case where you shut down the system while it's read-only should be a somewhat unusual one. Normally you would mark it read only and then promote a standby and shut the old master down (or demote it). But what you want is that if it does happen to go down for some reason before all the WAL is streamed, you can bring it back up and finish streaming the WAL without generating any new WAL. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 18, 2020 at 7:19 AM amul sul <sulamul@gmail.com> wrote: > Let me explain the case, if we do skip the end-of-recovery checkpoint while > starting the system in read-only mode and then later changing the state to > read-write and do a few write operations and online checkpoints, that will be > fine? I am yet to explore those things. I think we'd want the FIRST write operation to be the end-of-recovery checkpoint, before the system is fully read-write. And then after that completes you could do other things. It would be good if we can get an opinion from Andres about this, since I think he has thought about this stuff quite a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
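The ordering suggested above (the end-of-recovery checkpoint is the first write, and ordinary writes are admitted only afterwards) can be sketched with a toy state machine. This is a minimal illustration, not code from the patch: ToySystem, the TOY_* states and the toy_* functions are hypothetical names, and the "checkpoint" is modelled as a single WAL record.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states; the real patch tracks this differently. */
typedef enum ToyWalState
{
    TOY_READ_ONLY,          /* no WAL may be written */
    TOY_GOING_READ_WRITE,   /* only the pending checkpoint may write WAL */
    TOY_READ_WRITE          /* normal operation */
} ToyWalState;

typedef struct ToySystem
{
    ToyWalState state;
    bool        eor_checkpoint_pending; /* skipped at startup while read only */
    unsigned    wal_records;            /* WAL records written so far */
    unsigned    checkpoints;            /* checkpoints completed so far */
} ToySystem;

/* An ordinary backend write: refused until the system is fully read-write. */
bool
toy_backend_write(ToySystem *sys)
{
    if (sys->state != TOY_READ_WRITE)
        return false;
    sys->wal_records++;
    return true;
}

/* Switch to read-write, running the deferred checkpoint before anything else. */
void
toy_alter_system_read_write(ToySystem *sys)
{
    assert(sys->state == TOY_READ_ONLY);
    sys->state = TOY_GOING_READ_WRITE;
    if (sys->eor_checkpoint_pending)
    {
        sys->wal_records++;     /* the checkpoint record is the first write */
        sys->checkpoints++;
        sys->eor_checkpoint_pending = false;
    }
    sys->state = TOY_READ_WRITE;
}
```

In this model a backend write attempted before toy_alter_system_read_write() returns is refused, so the checkpoint record is always the first WAL written after the read-only period.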
On Thu, Jun 18, 2020 at 6:39 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> That doesn't appear to be very graceful. Perhaps objections could be assuaged by having a smoother transition and perhaps not even a full barrier, initially.
Yeah, it's not ideal, though still better than what we have now. What do you mean by "a smoother transition and perhaps not even a full barrier"? I think if you want to switch the primary to another machine and make the old primary into a standby, you really need to arrest WAL writes completely. It would be better to make existing write transactions ERROR rather than FATAL, but there are some very difficult cases there, so I would like to leave that as a possible later improvement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 18 Jun 2020 10:52:49 -0400 Robert Haas <robertmhaas@gmail.com> wrote: [...] > But what you want is that if it does happen to go down for some reason before > all the WAL is streamed, you can bring it back up and finish streaming the > WAL without generating any new WAL. Thanks to cascading replication, it could be very possible without this READ ONLY mode, just in recovery mode, isn't it? Regards,
On Thu, Jun 18, 2020 at 11:08 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > Thanks to cascading replication, it could be very possible without this READ > ONLY mode, just in recovery mode, isn't it? Yeah, perhaps. I just wrote an email about that over on the demote thread, so I won't repeat it here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 18, 2020 at 8:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 18, 2020 at 5:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > For buffer replacement, many-a-times we have to also perform > > XLogFlush, what do we do for that? We can't proceed without doing > > that and erroring out from there means stopping read-only query from > > the user perspective. > > I think we should stop WAL writes, then XLogFlush() once, then declare > the system R/O. After that there might be more XLogFlush() calls but > there won't be any new WAL, so they won't do anything. > Yeah, the proposed v1 patch does the same. Regards, Amul
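The sequence discussed here (stop WAL inserts, flush once, then declare the system read only) can be illustrated with a toy model. This is a sketch under stated assumptions rather than the patch's code: ToyWal and the toy_* functions are hypothetical, and LSNs are plain byte counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of WAL positions; not PostgreSQL's real data structures. */
typedef struct ToyWal
{
    uint64_t insert_lsn;    /* end of WAL generated so far */
    uint64_t flushed_lsn;   /* end of WAL known to be on disk */
    bool     prohibited;    /* true once new WAL inserts are refused */
    bool     read_only;     /* true once the system is declared read only */
} ToyWal;

/* Append nbytes of WAL; refused once WAL is prohibited. */
bool
toy_wal_insert(ToyWal *wal, uint64_t nbytes)
{
    if (wal->prohibited)
        return false;
    wal->insert_lsn += nbytes;
    return true;
}

/* Flush WAL up to 'upto'; returns true only if there was something to flush. */
bool
toy_wal_flush(ToyWal *wal, uint64_t upto)
{
    if (upto > wal->insert_lsn)
        upto = wal->insert_lsn;
    if (upto <= wal->flushed_lsn)
        return false;           /* already on disk: nothing to do */
    wal->flushed_lsn = upto;
    return true;
}

/* Step 1: stop inserts. Step 2: flush once. Step 3: declare read only. */
void
toy_make_read_only(ToyWal *wal)
{
    wal->prohibited = true;
    (void) toy_wal_flush(wal, wal->insert_lsn);
    assert(wal->flushed_lsn == wal->insert_lsn);
    wal->read_only = true;
}
```

Because no WAL can be inserted after step 1, every later flush request in this model finds nothing to write, which is why a read-only query that triggers buffer replacement would not be blocked on it.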
Hi All,
Attaching a new set of patches rebased atop the latest master head, which includes the following changes:
1. Enabling ALTER SYSTEM READ { ONLY | WRITE } support for single-user mode, as discussed here [1].
2. Now skipping the startup checkpoint if the system is in read-only mode, as discussed [2].
3. While changing the system state to READ-WRITE, a new checkpoint request will be made.
All these changes are part of the v2-0004 patch; the rest of the patches are the same as in v1.
Regards,
Amul
1] https://postgr.es/m/CAAJ_b96WPPt-=vyjpPUy8pG0vAvLgpjLukCZONUkvdR1_exrKA@mail.gmail.com
2] https://postgr.es/m/CAAJ_b95hddJrgciCfri2NkTLdEUSz6zdMSjoDuWPFPBFvJy+Kg@mail.gmail.com
Attachment
- v2-0006-Documentation-WIP.patch
- v2-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch
- v2-0001-Allow-error-or-refusal-while-absorbing-barriers.patch
- v2-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch
- v2-0002-Add-alter-system-read-only-write-syntax.patch
- v2-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch
On 6/22/20 11:59 AM, Amul Sul wrote:
> 2. Now skipping the startup checkpoint if the system is read-only mode, as
> discussed [2].
I am not able to run pg_checksums after shutting down my server in read-only mode.
Steps -
1. initdb (./initdb -k -D data)
2. start the server (./pg_ctl -D data start)
3. connect to psql (./psql postgres)
4. Fire query (alter system read only;)
5. shutdown the server (./pg_ctl -D data stop)
6. pg_checksums
[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$
Result - (when server is not in read only)
[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned: 916
Blocks scanned: 2976
Bad checksums: 0
Data checksum version: 1
--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, Jun 24, 2020 at 1:54 PM tushar <tushar.ahuja@enterprisedb.com> wrote:
>
> On 6/22/20 11:59 AM, Amul Sul wrote:
> > 2. Now skipping the startup checkpoint if the system is read-only mode, as
> > discussed [2].
>
> I am not able to perform pg_checksums o/p after shutting down my server
> in read only mode .
>
> Steps -
>
> 1.initdb (./initdb -k -D data)
> 2.start the server(./pg_ctl -D data start)
> 3.connect to psql (./psql postgres)
> 4.Fire query (alter system read only;)
> 5.shutdown the server(./pg_ctl -D data stop)
> 6.pg_checksums
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> pg_checksums: error: cluster must be shut down
> [edb@tushar-ldap-docker bin]$
>
> Result - (when server is not in read only)
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> Checksum operation completed
> Files scanned: 916
> Blocks scanned: 2976
> Bad checksums: 0
> Data checksum version: 1
>
I think that's expected since the server wasn't shut down cleanly; a similar error can be seen with any server that has been shut down in immediate mode (pg_ctl -D data_dir -m i).
Regards,
Amul
Hi, On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote: > On 6/22/20 11:59 AM, Amul Sul wrote: > > 2. Now skipping the startup checkpoint if the system is read-only mode, as > > discussed [2]. > > I am not able to perform pg_checksums o/p after shutting down my server in > read only mode . > > Steps - > > 1.initdb (./initdb -k -D data) > 2.start the server(./pg_ctl -D data start) > 3.connect to psql (./psql postgres) > 4.Fire query (alter system read only;) > 5.shutdown the server(./pg_ctl -D data stop) > 6.pg_checksums > > [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data > pg_checksums: error: cluster must be shut down > [edb@tushar-ldap-docker bin]$ What's the 'Database cluster state' from pg_controldata at this point? Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
On Fri, Jun 26, 2020 at 10:11:41AM +0530, Amul Sul wrote:
> I think that's expected since the server isn't clean shutdown, similar error can
> be seen with any server which has been shutdown in immediate mode
> (pg_ctl -D data_dir -m i).
Any operation working on on-disk relation blocks needs to have a consistent state, and a clean shutdown gives this guarantee thanks to the shutdown checkpoint (see also pg_rewind). There are two states in the control file, shutdown for a primary and shutdown while in recovery, to cover that. So if you stop the server cleanly but fail to see a proper state with pg_checksums, it seems to me that the proposed patch does not correctly handle the state of the cluster in the control file at shutdown. That's not good.
--
Michael
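The control-file check described above can be sketched as follows. This is a minimal standalone illustration: the ToyDBState enum is a local stand-in loosely modelled on the cluster states reported by pg_controldata, the toy_* functions are hypothetical, and only the error text is taken from the pg_checksums output quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for a few control-file cluster states (hypothetical names). */
typedef enum ToyDBState
{
    TOY_DB_SHUTDOWNED,              /* clean shutdown of a primary */
    TOY_DB_SHUTDOWNED_IN_RECOVERY,  /* clean shutdown while in recovery */
    TOY_DB_IN_CRASH_RECOVERY,
    TOY_DB_IN_PRODUCTION            /* what the thread reports after READ ONLY + stop */
} ToyDBState;

/* Only the two "shut down" states guarantee consistent on-disk blocks. */
bool
toy_is_cleanly_shut_down(ToyDBState state)
{
    return state == TOY_DB_SHUTDOWNED ||
           state == TOY_DB_SHUTDOWNED_IN_RECOVERY;
}

/* Returns NULL if an offline tool may proceed, else the refusal message. */
const char *
toy_offline_tool_precheck(ToyDBState state)
{
    assert(state <= TOY_DB_IN_PRODUCTION);
    if (toy_is_cleanly_shut_down(state))
        return NULL;
    return "cluster must be shut down";
}
```

With the shutdown checkpoint skipped, the state stays at the equivalent of TOY_DB_IN_PRODUCTION, so a check of this shape refuses to run even though the server process has exited.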
On Fri, Jun 26, 2020 at 12:15 PM Michael Banck <michael.banck@credativ.de> wrote: > > Hi, > > On Wed, Jun 24, 2020 at 01:54:29PM +0530, tushar wrote: > > On 6/22/20 11:59 AM, Amul Sul wrote: > > > 2. Now skipping the startup checkpoint if the system is read-only mode, as > > > discussed [2]. > > > > I am not able to perform pg_checksums o/p after shutting down my server in > > read only mode . > > > > Steps - > > > > 1.initdb (./initdb -k -D data) > > 2.start the server(./pg_ctl -D data start) > > 3.connect to psql (./psql postgres) > > 4.Fire query (alter system read only;) > > 5.shutdown the server(./pg_ctl -D data stop) > > 6.pg_checksums > > > > [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data > > pg_checksums: error: cluster must be shut down > > [edb@tushar-ldap-docker bin]$ > > What's the 'Database cluster state' from pg_controldata at this point? > "in production" Regards, Amul
On Fri, Jun 26, 2020 at 5:59 AM Michael Paquier <michael@paquier.xyz> wrote: > Any operation working on on-disk relation blocks needs to have a > consistent state, and a clean shutdown gives this guarantee thanks to > the shutdown checkpoint (see also pg_rewind). There are two states in > the control file, shutdown for a primary and shutdown while in > recovery to cover that. So if you stop the server cleanly but fail to > see a proper state with pg_checksums, it seems to me that the proposed > patch does not handle correctly the state of the cluster in the > control file at shutdown. That's not good. I think it is actually very good. If a feature that supposedly prevents writing WAL permitted a shutdown checkpoint to be written, it would be failing to accomplish its design goal. There is not much of a use case for a feature that stops WAL from being written except when it doesn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attached is a rebased version for the latest master head[1]. Regards, Amul 1] Commit # 101f903e51f52bf595cd8177d2e0bc6fe9000762
Attachment
- v3-0001-Allow-error-or-refusal-while-absorbing-barriers.patch
- v3-0004-Use-checkpointer-to-make-system-READ-ONLY-or-READ.patch
- v3-0002-Add-alter-system-read-only-write-syntax.patch
- v3-0003-Implement-ALTER-SYSTEM-READ-ONLY-using-global-bar.patch
- v3-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WAL.patch
- v3-0006-Documentation-WIP.patch
Hi All,
I was testing the feature on top of the v3 patch and found a "pg_upgrade" failure after running "alter system read only;", as shown below:
-- Steps:
./initdb -D data
./pg_ctl -D data -l logs start -c
./psql postgres
alter system read only;
\q
./pg_ctl -D data -l logs stop -c
./initdb -D data2
./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520
[edb@localhost bin]$ ./pg_upgrade -b . -B . -d data -D data2 -p 5555 -P 5520
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
The source cluster was not shut down cleanly.
Failure, exiting
-- Below are the logs
2020-07-16 11:04:20.305 IST [105788] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-07-16 11:04:20.309 IST [105788] LOG: listening on IPv6 address "::1", port 5432
2020-07-16 11:04:20.309 IST [105788] LOG: listening on IPv4 address "127.0.0.1", port 5432
2020-07-16 11:04:20.321 IST [105788] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-07-16 11:04:20.347 IST [105789] LOG: database system was shut down at 2020-07-16 11:04:20 IST
2020-07-16 11:04:20.352 IST [105788] LOG: database system is ready to accept connections
2020-07-16 11:04:20.534 IST [105790] LOG: system is now read only
2020-07-16 11:04:20.542 IST [105788] LOG: received fast shutdown request
2020-07-16 11:04:20.543 IST [105788] LOG: aborting any active transactions
2020-07-16 11:04:20.544 IST [105788] LOG: background worker "logical replication launcher" (PID 105795) exited with exit code 1
2020-07-16 11:04:20.544 IST [105790] LOG: shutting down
2020-07-16 11:04:20.544 IST [105790] LOG: skipping shutdown checkpoint because the system is read only
2020-07-16 11:04:20.551 IST [105788] LOG: database system is shut down
On Tue, Jul 14, 2020 at 12:08 PM Amul Sul <sulamul@gmail.com> wrote:
Attached is a rebased version for the latest master head[1].
Regards,
Amul
1] Commit # 101f903e51f52bf595cd8177d2e0bc6fe9000762
With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 16, 2020 at 2:12 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
Hi All,
I was testing the feature on top of v3 patch and found the "pg_upgrade" failure after keeping "alter system read only;" as below:
That's expected. You can't perform a clean shutdown without writing WAL.
Hello, I think we should really term this feature, as it stands, as a means to solely stop WAL writes from happening. The feature doesn't truly make the system read-only (e.g. dirty buffer flushes may succeed the system being put into a read-only state), which does make it confusing to a degree. Ideally, if we were to have a read-only system, we should be able to run pg_checksums on it, or take file-system snapshots etc, without the need to shut down the cluster. It would also enable an interesting use case: we should also be able to do a live upgrade on any running cluster and entertain read-only queries at the same time, given that all the cluster's files will be immutable? So if we are not going to address those cases, we should change the syntax and remove the notion of read-only. It could be: ALTER SYSTEM SET wal_writes TO off|on; or ALTER SYSTEM SET prohibit_wal TO off|on; If we are going to try to make it truly read-only, and cater to the other use cases, we have to: Perform a checkpoint before declaring the system read-only (i.e. before the command returns). This may be expensive of course, as Andres has pointed out in this thread, but it is a price that has to be paid. If we do this checkpoint, then we can avoid an additional shutdown checkpoint and an end-of-recovery checkpoint (if we restart the primary after a crash while in read-only mode). Also, we would have to prevent any operation that touches control files, which I am not sure we do today in the current patch. Why not have the best of both worlds? Consider: ALTER SYSTEM SET read_only to {off, on, wal}; -- on: wal writes off + no writes to disk -- off: default -- wal: only wal writes off Of course, there can probably be better syntax for the above. Regards, Soumyadeep (VMware)
+1 to this feature; I have been thinking about it for some time. There are several use cases for marking a database read only (no transaction log generation). Some examples in a hosted service scenario are: 1/ when a customer runs out of storage space, 2/ upgrading the server to a different major version (the current server can be set to read only, a new one can be built, and then DNS can be switched), 3/ if a user wants to force a database to read only and not accept writes, maybe to import or export a database.
Thanks,
Satya
On Wed, Jul 22, 2020 at 3:04 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
Hello,
I think we should really term this feature, as it stands, as a means to
solely stop WAL writes from happening.
The feature doesn't truly make the system read-only (e.g. dirty buffer
flushes may succeed the system being put into a read-only state), which
does make it confusing to a degree.
Ideally, if we were to have a read-only system, we should be able to run
pg_checksums on it, or take file-system snapshots etc, without the need
to shut down the cluster. It would also enable an interesting use case:
we should also be able to do a live upgrade on any running cluster and
entertain read-only queries at the same time, given that all the
cluster's files will be immutable?
So if we are not going to address those cases, we should change the
syntax and remove the notion of read-only. It could be:
ALTER SYSTEM SET wal_writes TO off|on;
or
ALTER SYSTEM SET prohibit_wal TO off|on;
If we are going to try to make it truly read-only, and cater to the
other use cases, we have to:
Perform a checkpoint before declaring the system read-only (i.e. before
the command returns). This may be expensive of course, as Andres has
pointed out in this thread, but it is a price that has to be paid. If we
do this checkpoint, then we can avoid an additional shutdown checkpoint
and an end-of-recovery checkpoint (if we restart the primary after a
crash while in read-only mode). Also, we would have to prevent any
operation that touches control files, which I am not sure we do today in
the current patch.
Why not have the best of both worlds? Consider:
ALTER SYSTEM SET read_only to {off, on, wal};
-- on: wal writes off + no writes to disk
-- off: default
-- wal: only wal writes off
Of course, there can probably be better syntax for the above.
Regards,
Soumyadeep (VMware)
Hi Amul, On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote: > The proposed feature is built atop of super barrier mechanism commit[1] to > coordinate > global state changes to all active backends. Backends which executed > ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer > process to change the requested WAL read/write state aka WAL prohibited and > WAL > permitted state respectively. When the checkpointer process sees the WAL > prohibit > state change request, it emits a global barrier and waits until all > backends that > participate in the ProcSignal absorbs it. Why should the checkpointer have the responsibility of setting the state of the system to read-only? Maybe this should be the postmaster's responsibility - the checkpointer should just handle requests to checkpoint. I think the backend requesting the read-only transition should signal the postmaster, which in turn, will take on the aforesaid responsibilities. The postmaster, could also additionally request a checkpoint, using RequestCheckpoint() (if we want to support the read-onlyness discussed in [1]). checkpointer.c should not be touched by this feature. Following on, any condition variable used by the backend to wait for the ALTER SYSTEM command to finish (the patch uses CheckpointerShmem->readonly_cv), could be housed in ProcGlobal. Regards, Soumyadeep (VMware) [1] https://www.postgresql.org/message-id/CAE-ML%2B-zdWODAyWNs_Eu-siPxp_3PGbPkiSg%3DtoLeW9iS_eioA%40mail.gmail.com
On Thu, Jul 23, 2020 at 3:33 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
>
> Hello,
>
> I think we should really term this feature, as it stands, as a means to
> solely stop WAL writes from happening.
>
True.
> The feature doesn't truly make the system read-only (e.g. dirty buffer
> flushes may succeed the system being put into a read-only state), which
> does make it confusing to a degree.
>
> Ideally, if we were to have a read-only system, we should be able to run
> pg_checksums on it, or take file-system snapshots etc, without the need
> to shut down the cluster. It would also enable an interesting use case:
> we should also be able to do a live upgrade on any running cluster and
> entertain read-only queries at the same time, given that all the
> cluster's files will be immutable?
>
Read-only is for the queries.
The aim of this feature is to prevent new WAL records from being generated, not to prevent them from being flushed to disk, or streamed to standbys, or anything else. The rest should happen as normal.
If you can't flush WAL, then you might not be able to evict some number of buffers, which in the worst case could be large. That's because you can't evict a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise, you wouldn't be following the WAL-before-data rule). And having a potentially large number of unevictable buffers around sounds terrible, not only for performance, but also for having the system keep working at all.
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;
>
> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.
>
The intention is to change the system to read-only ASAP; the checkpoint will make it much slower.
I don't think we can skip the control file updates that are needed to make the read-only state persistent across a restart.
> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.
>
Sure, thanks for the suggestions. Changing the syntax is not the hard part; we can choose the better one later.
Regards,
Amul
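The WAL-before-data rule mentioned in this reply can be made concrete with a short sketch. It is an illustration under stated assumptions, not PostgreSQL's buffer manager: ToyBuffer and the toy_* functions are hypothetical, and LSNs are plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy buffer descriptor: just the two fields the rule depends on. */
typedef struct ToyBuffer
{
    bool     dirty;     /* page differs from its on-disk copy */
    uint64_t page_lsn;  /* LSN of the last WAL record that touched the page */
} ToyBuffer;

/*
 * WAL-before-data: a dirty page may be written out (and its buffer reused)
 * only once WAL has been flushed at least up to the page's LSN.
 */
bool
toy_can_evict(const ToyBuffer *buf, uint64_t flushed_lsn)
{
    if (!buf->dirty)
        return true;    /* clean pages need no write at all */
    return buf->page_lsn <= flushed_lsn;
}

/* Count buffers that stay pinned down if WAL cannot be flushed any further. */
size_t
toy_count_unevictable(const ToyBuffer *bufs, size_t n, uint64_t flushed_lsn)
{
    size_t stuck = 0;

    assert(bufs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!toy_can_evict(&bufs[i], flushed_lsn))
            stuck++;
    }
    return stuck;
}
```

If WAL flushing were also blocked, every dirty buffer whose page LSN lies beyond the last flushed position would be counted as stuck here, which is the scenario the reply wants to avoid by still allowing flushes.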
On Thu, Jul 23, 2020 at 4:34 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
> +1 to this feature and I have been thinking about it for sometime. There are several use cases with marking database readonly (no transaction log generation). Some of the examples in a hosted service scenario are 1/ when customer runs out of storage space, 2/ Upgrading the server to a different major version (current server can be set to read only, new one can be built and then switch DNS), 3/ If user wants to force a database to read only and not accept writes, may be for import / export a database.
>
Thanks for voting & listing the realistic use cases.
Regards,
Amul
On Thu, Jul 23, 2020 at 6:08 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
>
> Hi Amul,
>
Thanks, Soumyadeep, for looking at the patch and sharing your thoughts.
> On Tue, Jun 16, 2020 at 6:56 AM amul sul <sulamul@gmail.com> wrote:
> > The proposed feature is built atop of super barrier mechanism commit[1] to coordinate
> > global state changes to all active backends. Backends which executed
> > ALTER SYSTEM READ { ONLY | WRITE } command places request to checkpointer
> > process to change the requested WAL read/write state aka WAL prohibited and WAL
> > permitted state respectively. When the checkpointer process sees the WAL prohibit
> > state change request, it emits a global barrier and waits until all backends that
> > participate in the ProcSignal absorbs it.
>
> Why should the checkpointer have the responsibility of setting the state
> of the system to read-only? Maybe this should be the postmaster's
> responsibility - the checkpointer should just handle requests to
> checkpoint.
Well, once we've initiated the change to a read-only state, we probably want to always either finish that change or go back to read-write, even if the process that initiated the change is interrupted. Leaving the system in a half-way-in-between state long term seems bad. We could have added a new background process for this, but to keep the first version of the patch simple we put the checkpointer in charge of making the state change. The checkpointer isn't likely to get killed, but if it does, it will be relaunched and the new one can clean things up. On the other hand, I agree that making the checkpointer responsible for more than one thing might not be a good idea, but I don't think the postmaster should do work that any background process can do.
> I think the backend requesting the read-only transition
> should signal the postmaster, which in turn, will take on the aforesaid
> responsibilities. The postmaster, could also additionally request a
> checkpoint, using RequestCheckpoint() (if we want to support the
> read-onlyness discussed in [1]). checkpointer.c should not be touched by
> this feature.
>
> Following on, any condition variable used by the backend to wait for the
> ALTER SYSTEM command to finish (the patch uses
> CheckpointerShmem->readonly_cv), could be housed in ProcGlobal.
>
That is relevant only if we don't want to use the checkpointer process.
Regards,
Amul
On Thu, Jul 23, 2020 at 3:42 AM Amul Sul <sulamul@gmail.com> wrote: > The aim of this feature is preventing new WAL records from being generated, not > preventing them from being flushed to disk, or streamed to standbys, or anything > else. The rest should happen as normal. > > If you can't flush WAL, then you might not be able to evict some number of > buffers, which in the worst case could be large. That's because you can't evict > a dirty buffer until WAL has been flushed up to the buffer's LSN (otherwise, > you wouldn't be following the WAL-before-data rule). And having a potentially > large number of unevictable buffers around sounds terrible, not only for > performance, but also for having the system keep working at all. In the read-only level I was suggesting, I wasn't suggesting that we stop WAL flushes, in fact we should flush the WAL before we mark the system as read-only. Once the system declares itself as read-only, it will not perform any more on-disk changes; It may perform all the flushes it needs as a part of the read-only request handling. WAL should still stream to the secondary of course, even after you mark the primary as read-only. > Read-only is for the queries. What I am saying is it doesn't have to be just the queries. I think we can cater to all the other use cases simply by forcing a checkpoint before marking the system as read-only. > The intention is to change the system to read-only ASAP; the checkpoint will > make it much slower. I agree - if one needs that speed, then they can do the equivalent of: ALTER SYSTEM SET read_only to 'wal'; and the expensive checkpoint you mentioned can be avoided. > I don't think we can skip control file updates that need to make read-only > state persistent across the restart. I was referring to control file updates post the read-only state change. Any updates done as a part of the state change is totally cool. Regards, Soumyadeep (VMware)
On Thu, Jul 23, 2020 at 3:57 AM Amul Sul <sulamul@gmail.com> wrote: > Well, once we've initiated the change to a read-only state, we probably want to > always either finish that change or go back to read-write, even if the process > that initiated the change is interrupted. Leaving the system in a > half-way-in-between state long term seems bad. Maybe we would have put some > background process, but choose the checkpointer in charge of making the state > change and to avoid the new background process to keep the first version patch > simple. The checkpointer isn't likely to get killed, but if it does, it will > be relaunched and the new one can clean things up. On the other hand, I agree > making the checkpointer responsible for more than one thing might not > be a good idea > but I don't think the postmaster should do the work that any > background process can > do. +1 for doing it in a background process rather than in the backend itself (as we can't risk doing it in a backend as it can crash and won't restart and clean up as a background process would). As my co-worker pointed out to me, doing the work in the postmaster is a very bad idea as we don't want delays in serving connection requests on account of the barrier that comes with this patch. I would like to see this responsibility in a separate auxiliary process but I guess having it in the checkpointer isn't the end of the world. Regards, Soumyadeep (VMware)
On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think we'd want the FIRST write operation to be the end-of-recovery
> checkpoint, before the system is fully read-write. And then after that
> completes you could do other things.
I can't see why this is necessary from a correctness or performance point of view. Maybe I'm missing something.
In case it is necessary, the patch set does not wait for the checkpoint to complete before marking the system as read-write. Refer:
    /* Set final state by clearing in-progress flag bit */
    if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
    {
        if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
            ereport(LOG, (errmsg("system is now read only")));
        else
        {
            /* Request checkpoint */
            RequestCheckpoint(CHECKPOINT_IMMEDIATE);
            ereport(LOG, (errmsg("system is now read write")));
        }
    }
We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before we SetWALProhibitState() and do the ereport(), if we have a read-write state change request.
Also, we currently request this checkpoint even if there was no startup recovery, and we don't set CHECKPOINT_END_OF_RECOVERY in the case where the read-write request does follow a startup recovery. So it should really be:
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT | CHECKPOINT_END_OF_RECOVERY);
We would need to convey that an end-of-recovery checkpoint is pending in shmem somehow (and only if one such checkpoint is pending should we do it as a part of the read-write request handling). Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
    /*
     * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
     */
and then check for that.
Some minor comments about the code (some of them probably don't warrant immediate attention, but for the record...):
1. There are some places where we can use a local variable to store the result of RelationNeedsWAL() to avoid repeated calls to it. E.g. brin_doupdate()
2. Similarly, we can also capture the calls to GetWALProhibitState() in a local variable where applicable. E.g. inside WALProhibitRequest().
3. Some of the functions that were added, such as GetWALProhibitState(), IsWALProhibited() etc., could be declared static inline.
4. IsWALProhibited(): Shouldn't it really be:
    bool
    IsWALProhibited(void)
    {
        uint32 walProhibitState = GetWALProhibitState();
        return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0 &&
               (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
    }
5. I think the comments:
    /* Must be performing an INSERT or UPDATE, so we'll have an XID */
and
    /* Can reach here from VACUUM, so need not have an XID */
can be internalized in the function/macro comment header.
6. Typo:
    ConditionVariable readonly_cv; /* signaled when ckpt_started advances */
We need to update the comment here.
Regards,
Soumyadeep (VMware)
Hi,
> From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 27 Mar 2020 05:05:38 -0400
> Subject: [PATCH v3 2/6] Add alter system read only/write syntax
>
> Note that syntax doesn't have any implementation.
> ---
>  src/backend/nodes/copyfuncs.c | 12 ++++++++++++
>  src/backend/nodes/equalfuncs.c | 9 +++++++++
>  src/backend/parser/gram.y | 13 +++++++++++++
>  src/backend/tcop/utility.c | 20 ++++++++++++++++++++
>  src/bin/psql/tab-complete.c | 6 ++++--
>  src/include/nodes/nodes.h | 1 +
>  src/include/nodes/parsenodes.h | 10 ++++++++++
>  src/tools/pgindent/typedefs.list | 1 +
>  8 files changed, 70 insertions(+), 2 deletions(-)
Shouldn't there be at outfuncs support as well? Perhaps we even need
readfuncs, not immediately sure.
> From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.
>
> Implementation:
>
> 1. When a user tried to change server state to WAL-Prohibited using
> ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will emit
> PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
> barrier has been absorbed by all the backends.
>
> 2. When a backend receives the WAL-Prohibited barrier, at that moment if
> it is already in a transaction and the transaction already assigned XID,
> then the backend will be killed by throwing FATAL(XXX: need more discussion
> on this)
I think we should consider introducing XACTFATAL or such, guaranteeing
the transaction gets aborted, without requiring a FATAL. This has been
needed for enough cases that it's worthwhile.
There are several cases where we WAL log without having an xid
assigned. E.g. when HOT pruning during syscache lookups or such.
Are there any cases where the check for being in recovery is followed by a
CHECK_FOR_INTERRUPTS, before the WAL logging is done?
> 3. Otherwise, if that backend running transaction which yet to get XID
> assigned we don't need to do anything special, simply call
> ResetLocalXLogInsertAllowed() so that any future WAL insert in will check
> XLogInsertAllowed() first which set ready only state appropriately.
>
> 4. A new transaction (from existing or new backend) starts as a read-only
> transaction.
Why do we need 4)? And doesn't that have the potential to be
unnecessarily problematic if a the server is subsequently brought out of
the readonly state again?
> 5. Auxiliary processes like autovacuum launcher, background writer,
> checkpointer and walwriter will don't do anything in WAL-Prohibited
> server state until someone wakes us up. E.g. a backend might later on
> request us to put the system back to read-write.
Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
anything in this state. bgwriter for example is even running entirely
normally in a hot standby node?
> 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> and xlog rotation. Starting up again will perform crash recovery(XXX:
> need some discussion on this as well)
>
> 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
>
> 8. Only super user can toggle WAL-Prohibit state.
>
> 9. Add system_is_read_only GUC show the system state -- will true when system
> is wal prohibited or in recovery.
> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> +    if (!superuser())
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                 errmsg("must be superuser to execute ALTER SYSTEM command")));
ISTM we should rather do this in a GRANTable manner. We've worked
substantially towards that in the last few years.
>
> +    /*
> +     * WALProhibited indicates if we have stopped allowing WAL writes.
> +     * Protected by info_lck.
> +     */
> +    bool WALProhibited;
> +
>      /*
>       * SharedHotStandbyActive indicates if we allow hot standby queries to be
>       * run. Protected by info_lck.
> @@ -7962,6 +7969,25 @@ StartupXLOG(void)
>          RequestCheckpoint(CHECKPOINT_FORCE);
>  }
>
> +void
> +MakeReadOnlyXLOG(void)
> +{
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    XLogCtl->WALProhibited = true;
> +    SpinLockRelease(&XLogCtl->info_lck);
> +}
> +
> +/*
> + * Is the system still in WAL prohibited state?
> + */
> +bool
> +IsWALProhibited(void)
> +{
> +    volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +    return xlogctl->WALProhibited;
> +}
What does this kind of locking achieving? It doesn't protect against
concurrent ALTER SYSTEM SET READ ONLY or such?
> +        /*
> +         * If the server is in WAL-Prohibited state then don't do anything until
> +         * someone wakes us up. E.g. a backend might later on request us to put
> +         * the system back to read-write.
> +         */
> +        if (IsWALProhibited())
> +        {
> +            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> +                             WAIT_EVENT_CHECKPOINTER_MAIN);
> +            continue;
> +        }
> +
>          /*
>           * Detect a pending checkpoint request by checking whether the flags
>           * word in shared memory is nonzero. We shouldn't need to acquire the
So if the ASRO happens while a checkpoint, potentially with a
checkpoint_timeout = 60d, it'll not take effect until the checkpoint has
finished.
But uh, as far as I can tell, the code would simply continue an
in-progress checkpoint, despite having absorbed the barrier. And then
we'd PANIC when doing the XLogInsert()?
> diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
> new file mode 100644
> index 00000000000..619c33cd780
> --- /dev/null
> +++ b/src/include/access/walprohibit.h
Not sure I like the mix of xlog/wal prefix for pretty closely related
files... I'm not convinced it's worth having a separate file for this,
fwiw.
> From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Fri, 15 May 2020 06:39:43 -0400
> Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
> READ-WRITE
>
> Till the previous commit, the backend used to do this, but now the backend
> requests checkpointer to do it. Checkpointer, noticing that the current state
> is has WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request,
> and then acknowledges back to the backend who requested the state change.
>
> Note that this commit also enables ALTER SYSTEM READ WRITE support and make WAL
> prohibited state persistent across the system restarts.
The split between the previous commit and this commit seems more
confusing than useful to me.
> +/*
> + * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
> + * read-only.
> + */
> +void
> +WALProhibitRequest(void)
> +{
> +    /* Must not be called from checkpointer */
> +    Assert(!AmCheckpointerProcess());
> +    Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> +
> +    /*
> +     * If in a standalone backend, just do it ourselves.
> +     */
> +    if (!IsPostmasterEnvironment)
> +    {
> +        performWALProhibitStateChange(GetWALProhibitState());
> +        return;
> +    }
> +
> +    if (CheckpointerShmem->checkpointer_pid == 0)
> +        elog(ERROR, "checkpointer is not running");
> +
> +    if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
> +        elog(ERROR, "could not signal checkpointer: %m");
> +
> +    /* Wait for the state to change to read-only */
> +    ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
> +    for (;;)
> +    {
> +        /* We'll be done once in-progress flag bit is cleared */
> +        if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> +            break;
> +
> +        elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
> +        ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
> +                               WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
> +    }
> +    ConditionVariableCancelSleep();
> +    elog(DEBUG1, "Done WALProhibitRequest");
> +}
Isn't it possible that the system could have been changed back to be
read-write by the time the wakeup is being processed?
> From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
> From: Amul Sul <amul.sul@enterprisedb.com>
> Date: Tue, 14 Jul 2020 02:10:55 -0400
> Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
> write
Isn't that the wrong order? This needs to come before the feature is
enabled, no?
> @@ -758,6 +759,9 @@ brinbuildempty(Relation index)
>          ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
>      LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
>
> +    /* Building indexes will have an XID */
> +    AssertWALPermitted_HaveXID();
> +
Ugh, that's a pretty ugly naming scheme mix.
> @@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
>      if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) &&
>          brin_can_do_samepage_update(oldbuf, origsz, newsz))
>      {
> +        /* Can reach here from VACUUM, so need not have an XID */
> +        if (RelationNeedsWAL(idxrel))
> +            CheckWALPermitted();
> +
Hm.
Maybe I am confused, but why is that dependent on
RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if
no WAL is emitted?
> #include "access/genam.h"
> #include "access/gist_private.h"
> #include "access/transam.h"
> +#include "access/walprohibit.h"
> #include "commands/vacuum.h"
> #include "lib/integerset.h"
> #include "miscadmin.h"
The number of places that now need this new header - pretty much the
same set of files that do XLogInsert, already requiring an xlog* header
to be included - drives me further towards the conclusion that it's not
a good idea to have it separate.
> extern void ProcessInterrupts(void);
>
> +#ifdef USE_ASSERT_CHECKING
> +typedef enum
> +{
> +    WALPERMIT_UNCHECKED,
> +    WALPERMIT_CHECKED,
> +    WALPERMIT_CHECKED_AND_USED
> +} WALPermitCheckState;
> +
> +/* in access/walprohibit.c */
> +extern WALPermitCheckState walpermit_checked_state;
> +
> +/*
> + * Reset walpermit_checked flag when no longer in the critical section.
> + * Otherwise, marked checked and used.
> + */
> +#define RESET_WALPERMIT_CHECKED_STATE() \
> +do { \
> +    walpermit_checked_state = CritSectionCount ? \
> +        WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \
> +} while(0)
> +#else
> +#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0)
> +#endif
> +
Why are these in headers? And why is this tied to CritSectionCount?
Greetings,
Andres Freund
On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 7:54 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I think we'd want the FIRST write operation to be the end-of-recovery
> > checkpoint, before the system is fully read-write. And then after that
> > completes you could do other things.
>
> I can't see why this is necessary from a correctness or performance
> point of view. Maybe I'm missing something.
>
> In case it is necessary, the patch set does not wait for the checkpoint to
> complete before marking the system as read-write. Refer:
>
> /* Set final state by clearing in-progress flag bit */
> if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> {
> if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> ereport(LOG, (errmsg("system is now read only")));
> else
> {
> /* Request checkpoint */
> RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> ereport(LOG, (errmsg("system is now read write")));
> }
> }
>
> We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> we SetWALProhibitState() and do the ereport(), if we have a read-write
> state change request.
>
+1, I too have the same question.
FWIW, I don't think we can request CHECKPOINT_WAIT at this place; otherwise, I
think it will be a deadlock case -- the checkpointer process waiting for itself.
> Also, we currently request this checkpoint even if there was no startup
> recovery and we don't set CHECKPOINT_END_OF_RECOVERY in the case where
> the read-write request does follow a startup recovery.
> So it should really be:
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
> CHECKPOINT_END_OF_RECOVERY);
> We would need to convey that an end-of-recovery-checkpoint is pending in
> shmem somehow (and only if one such checkpoint is pending, should we do
> it as a part of the read-write request handling).
> Maybe we can set CHECKPOINT_END_OF_RECOVERY in ckpt_flags where we do:
> /*
> * Skip end-of-recovery checkpoint if the system is in WAL prohibited state.
> */
> and then check for that.
>
Yep, we need some indication that the end-of-recovery checkpoint was skipped at
startup, but I haven't added that since I wasn't sure whether we really need
CHECKPOINT_END_OF_RECOVERY, as per the previous concern.
> Some minor comments about the code (some of them probably doesn't
> warrant immediate attention, but for the record...):
>
> 1. There are some places where we can use a local variable to store the
> result of RelationNeedsWAL() to avoid repeated calls to it. E.g.
> brin_doupdate()
>
Ok.
> 2. Similarly, we can also capture the calls to GetWALProhibitState() in
> a local variable where applicable. E.g. inside WALProhibitRequest().
>
I don't think so.
> 3. Some of the functions that were added such as GetWALProhibitState(),
> IsWALProhibited() etc could be declared static inline.
>
IsWALProhibited() can be static, but not GetWALProhibitState(), since it needs
to be accessible from other files.
> 4. IsWALProhibited(): Shouldn't it really be:
> bool
> IsWALProhibited(void)
> {
> uint32 walProhibitState = GetWALProhibitState();
> return (walProhibitState & WALPROHIBIT_STATE_READ_ONLY) != 0
> && (walProhibitState & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
> }
>
I think the current one is better; it allows read-write transactions from an
existing backend which has absorbed the barrier, or from a new backend, while we
are changing the state to read-write, on the assumption that we never fall back.
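For reference, the difference between the two checks discussed in point 4 can be laid out over the two flag bits. The following is only a standalone sketch, not code from the patch: the bit values are made up for illustration, and the "current" variant assumes the existing IsWALProhibited() tests only the read-only bit, which is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the patch's real definitions are not shown here. */
#define WALPROHIBIT_STATE_READ_ONLY         0x0001u
#define WALPROHIBIT_TRANSITION_IN_PROGRESS  0x0002u

/* Assumed "current" check: the read-only bit alone decides. */
static bool
is_wal_prohibited_current(uint32_t state)
{
    /* Only the two known bits are expected in this sketch. */
    assert((state & ~(WALPROHIBIT_STATE_READ_ONLY |
                      WALPROHIBIT_TRANSITION_IN_PROGRESS)) == 0);
    return (state & WALPROHIBIT_STATE_READ_ONLY) != 0;
}

/* Variant from review point 4: prohibited only once the transition is done. */
static bool
is_wal_prohibited_proposed(uint32_t state)
{
    return (state & WALPROHIBIT_STATE_READ_ONLY) != 0 &&
           (state & WALPROHIBIT_TRANSITION_IN_PROGRESS) == 0;
}
```

Under these assumptions the two variants agree everywhere except for the state READ_ONLY | TRANSITION_IN_PROGRESS, i.e. while a change to read-only is still in flight: the first already reports WAL as prohibited, the second not yet.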
> 5. I think the comments:
> /* Must be performing an INSERT or UPDATE, so we'll have an XID */
> and
> /* Can reach here from VACUUM, so need not have an XID */
> can be internalized in the function/macro comment header.
>
Ok.
> 6. Typo: ConditionVariable readonly_cv; /* signaled when ckpt_started
> advances */
> We need to update the comment here.
>
Ok.
Will try to address all the above review comments in the next version along with
Andres' concern/suggestion. Thanks again for your time.
Regards,
Amul
On Fri, Jul 24, 2020 at 7:34 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
Thanks for looking at the patch.
>
> > From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 27 Mar 2020 05:05:38 -0400
> > Subject: [PATCH v3 2/6] Add alter system read only/write syntax
> >
> > Note that syntax doesn't have any implementation.
> > ---
> >  src/backend/nodes/copyfuncs.c | 12 ++++++++++++
> >  src/backend/nodes/equalfuncs.c | 9 +++++++++
> >  src/backend/parser/gram.y | 13 +++++++++++++
> >  src/backend/tcop/utility.c | 20 ++++++++++++++++++++
> >  src/bin/psql/tab-complete.c | 6 ++++--
> >  src/include/nodes/nodes.h | 1 +
> >  src/include/nodes/parsenodes.h | 10 ++++++++++
> >  src/tools/pgindent/typedefs.list | 1 +
> >  8 files changed, 70 insertions(+), 2 deletions(-)
>
> Shouldn't there be at outfuncs support as well? Perhaps we even need
> readfuncs, not immediately sure.
Ok, can add that as well.
>
> >
> > From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global barrier.
> >
> > Implementation:
> >
> > 1. When a user tried to change server state to WAL-Prohibited using
> > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will emit
> > PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait until the
> > barrier has been absorbed by all the backends.
> >
> > 2. When a backend receives the WAL-Prohibited barrier, at that moment if
> > it is already in a transaction and the transaction already assigned XID,
> > then the backend will be killed by throwing FATAL(XXX: need more discussion
> > on this)
>
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.
>
As far as I am aware, the existing code in PostgresMain() uses FATAL to
terminate the connection when protocol synchronization is lost. Currently,
among proposals, this one and another, "Terminate the idle sessions" [1], use
FATAL, afaik.
>
> There are several cases where we WAL log without having an xid
> assigned. E.g. when HOT pruning during syscache lookups or such. Are
> there any cases where the check for being in recovery is followed by a
> CHECK_FOR_INTERRUPTS, before the WAL logging is done?
>
In the case of an operation without an xid, an error will be raised just before
the point where the WAL record is expected. I haven't found the places you are
asking about at a glance; I will try to search for them, but I am sure the
current implementation is not missing those places where it is supposed to
check the prohibited state and complain.
Quick question: is it possible that pruning will happen with a SELECT query?
It would be helpful if you or someone else could point me to the place where
WAL can be generated even in the case of read-only queries.
>
> >
> > 3. Otherwise, if that backend running transaction which yet to get XID
> > assigned we don't need to do anything special, simply call
> > ResetLocalXLogInsertAllowed() so that any future WAL insert in will check
> > XLogInsertAllowed() first which set ready only state appropriately.
> >
> > 4. A new transaction (from existing or new backend) starts as a read-only
> > transaction.
>
> Why do we need 4)? And doesn't that have the potential to be
> unnecessarily problematic if a the server is subsequently brought out of
> the readonly state again?
A transaction that was started in the read-only system state will be read-only
until the end. I think that shouldn't be too problematic.
>
> >
> > 5. Auxiliary processes like autovacuum launcher, background writer,
> > checkpointer and walwriter will don't do anything in WAL-Prohibited
> > server state until someone wakes us up. E.g.
> > a backend might later on
> > request us to put the system back to read-write.
>
> Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
> anything in this state. bgwriter for example is even running entirely
> normally in a hot standby node?
I think I missed updating the description when I reverted the walwriter
changes. The current version doesn't have any changes to the walwriter. And
bgwriter too behaves the same as it does on a system in recovery. Will update
this, sorry for the confusion.
>
> >
> > 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> > and xlog rotation. Starting up again will perform crash recovery(XXX:
> > need some discussion on this as well)
> >
> > 7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
> >
> > 8. Only super user can toggle WAL-Prohibit state.
> >
> > 9. Add system_is_read_only GUC show the system state -- will true when system
> > is wal prohibited or in recovery.
>
> >
> > +/*
> > + * AlterSystemSetWALProhibitState
> > + *
> > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> > + */
> > +void
> > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> > +{
> > +    if (!superuser())
> > +        ereport(ERROR,
> > +                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > +                 errmsg("must be superuser to execute ALTER SYSTEM command")));
>
> ISTM we should rather do this in a GRANTable manner. We've worked
> substantially towards that in the last few years.
>
I added this to be in line with AlterSystemSetConfigFile(); if we want to do
it in a GRANTable manner, I will try that.
>
> >
> >
> > +    /*
> > +     * WALProhibited indicates if we have stopped allowing WAL writes.
> > +     * Protected by info_lck.
> > +     */
> > +    bool WALProhibited;
> > +
> >      /*
> >       * SharedHotStandbyActive indicates if we allow hot standby queries to be
> >       * run. Protected by info_lck.
> > @@ -7962,6 +7969,25 @@ StartupXLOG(void)
> >          RequestCheckpoint(CHECKPOINT_FORCE);
> >  }
> >
> > +void
> > +MakeReadOnlyXLOG(void)
> > +{
> > +    SpinLockAcquire(&XLogCtl->info_lck);
> > +    XLogCtl->WALProhibited = true;
> > +    SpinLockRelease(&XLogCtl->info_lck);
> > +}
> > +
> > +/*
> > + * Is the system still in WAL prohibited state?
> > + */
> > +bool
> > +IsWALProhibited(void)
> > +{
> > +    volatile XLogCtlData *xlogctl = XLogCtl;
> > +
> > +    return xlogctl->WALProhibited;
> > +}
>
> What does this kind of locking achieving? It doesn't protect against
> concurrent ALTER SYSTEM SET READ ONLY or such?
>
The 0004 patch improves that.
>
> >
> > +        /*
> > +         * If the server is in WAL-Prohibited state then don't do anything until
> > +         * someone wakes us up. E.g. a backend might later on request us to put
> > +         * the system back to read-write.
> > +         */
> > +        if (IsWALProhibited())
> > +        {
> > +            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> > +                             WAIT_EVENT_CHECKPOINTER_MAIN);
> > +            continue;
> > +        }
> > +
> >          /*
> >           * Detect a pending checkpoint request by checking whether the flags
> >           * word in shared memory is nonzero. We shouldn't need to acquire the
>
> So if the ASRO happens while a checkpoint, potentially with a
> checkpoint_timeout = 60d, it'll not take effect until the checkpoint has
> finished.
>
> But uh, as far as I can tell, the code would simply continue an
> in-progress checkpoint, despite having absorbed the barrier. And then
> we'd PANIC when doing the XLogInsert()?
I think this might not be the case with the next checkpointer changes in the
0004 patch.
>
> > diff --git a/src/include/access/walprohibit.h b/src/include/access/walprohibit.h
> > new file mode 100644
> > index 00000000000..619c33cd780
> > --- /dev/null
> > +++ b/src/include/access/walprohibit.h
>
> Not sure I like the mix of xlog/wal prefix for pretty closely related
> files... I'm not convinced it's worth having a separate file for this,
> fwiw.
I see.
>
> >
> > From 5600adc647bd729e4074ecf13e97b9f297e9d5c6 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 15 May 2020 06:39:43 -0400
> > Subject: [PATCH v3 4/6] Use checkpointer to make system READ-ONLY or
> > READ-WRITE
> >
> > Till the previous commit, the backend used to do this, but now the backend
> > requests checkpointer to do it. Checkpointer, noticing that the current state
> > is has WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request,
> > and then acknowledges back to the backend who requested the state change.
> >
> > Note that this commit also enables ALTER SYSTEM READ WRITE support and make WAL
> > prohibited state persistent across the system restarts.
>
> The split between the previous commit and this commit seems more
> confusing than useful to me.
Looking at the previous two review comments, I agree with you. My intention was
to make things easier for the reviewer. Will merge this patch with the previous
one.
>
> > +/*
> > + * WALProhibitedRequest: Request checkpointer to make the WALProhibitState to
> > + * read-only.
> > + */
> > +void
> > +WALProhibitRequest(void)
> > +{
> > +    /* Must not be called from checkpointer */
> > +    Assert(!AmCheckpointerProcess());
> > +    Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS);
> > +
> > +    /*
> > +     * If in a standalone backend, just do it ourselves.
> > +     */
> > +    if (!IsPostmasterEnvironment)
> > +    {
> > +        performWALProhibitStateChange(GetWALProhibitState());
> > +        return;
> > +    }
> > +
> > +    if (CheckpointerShmem->checkpointer_pid == 0)
> > +        elog(ERROR, "checkpointer is not running");
> > +
> > +    if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
> > +        elog(ERROR, "could not signal checkpointer: %m");
> > +
> > +    /* Wait for the state to change to read-only */
> > +    ConditionVariablePrepareToSleep(&CheckpointerShmem->readonly_cv);
> > +    for (;;)
> > +    {
> > +        /* We'll be done once in-progress flag bit is cleared */
> > +        if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS))
> > +            break;
> > +
> > +        elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
> > +        ConditionVariableSleep(&CheckpointerShmem->readonly_cv,
> > +                               WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE);
> > +    }
> > +    ConditionVariableCancelSleep();
> > +    elog(DEBUG1, "Done WALProhibitRequest");
> > +}
>
> Isn't it possible that the system could have been changed back to be
> read-write by the time the wakeup is being processed?
You have a point: the second backend will see the ASRW executed successfully
despite any changes by this. I think it is better to raise an error for the
second backend instead of staying silent. Will do that.
>
> > From 0b7426fc4708cc0e4ad333da3b35e473658bba28 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Tue, 14 Jul 2020 02:10:55 -0400
> > Subject: [PATCH v3 5/6] Error or Assert before START_CRIT_SECTION for WAL
> > write
>
> Isn't that the wrong order? This needs to come before the feature is
> enabled, no?
>
Agreed, but IMHO let it be; my intention behind the split is to make the code
easy to read, and I don't think they are going to be checked in separately,
except 0001.
> > > > @@ -758,6 +759,9 @@ brinbuildempty(Relation index) > > ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); > > LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); > > > > + /* Building indexes will have an XID */ > > + AssertWALPermitted_HaveXID(); > > + > > Ugh, that's a pretty ugly naming scheme mix. > Ok. > > > > > @@ -176,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange, > > if (((BrinPageFlags(oldpage) & BRIN_EVACUATE_PAGE) == 0) && > > brin_can_do_samepage_update(oldbuf, origsz, newsz)) > > { > > + /* Can reach here from VACUUM, so need not have an XID */ > > + if (RelationNeedsWAL(idxrel)) > > + CheckWALPermitted(); > > + > > Hm. Maybe I am confused, but why is that dependent on > RelationNeedsWAL()? Shouldn't readonly actually mean readonly, even if > no WAL is emitted? > To avoid the unnecessary error for the case where the wal record will not be generated. > > > #include "access/genam.h" > > #include "access/gist_private.h" > > #include "access/transam.h" > > +#include "access/walprohibit.h" > > #include "commands/vacuum.h" > > #include "lib/integerset.h" > > #include "miscadmin.h" > > The number of places that now need this new header - pretty much the > same set of files that do XLogInsert, already requiring an xlog* header > to be included - drives me further towards the conclusion that it's not > a good idea to have it separate. > Noted. > > > extern void ProcessInterrupts(void); > > > > +#ifdef USE_ASSERT_CHECKING > > +typedef enum > > +{ > > + WALPERMIT_UNCHECKED, > > + WALPERMIT_CHECKED, > > + WALPERMIT_CHECKED_AND_USED > > +} WALPermitCheckState; > > + > > +/* in access/walprohibit.c */ > > +extern WALPermitCheckState walpermit_checked_state; > > + > > +/* > > + * Reset walpermit_checked flag when no longer in the critical section. > > + * Otherwise, marked checked and used. > > + */ > > +#define RESET_WALPERMIT_CHECKED_STATE() \ > > +do { \ > > + walpermit_checked_state = CritSectionCount ? 
\ > > + WALPERMIT_CHECKED_AND_USED : WALPERMIT_UNCHECKED; \ > > +} while(0) > > +#else > > +#define RESET_WALPERMIT_CHECKED_STATE() ((void) 0) > > +#endif > > + > > Why are these in headers? And why is this tied to CritSectionCount? > If it is too bad we could think to move that. In the critical section, we don't want the walpermit_checked_state flag to be reset by XLogResetInsertion() otherwise following XLogBeginInsert() will have an assertion. The idea is that anything that checks the flag changes it from UNCHECKED to CHECKED. XLogResetInsertion() sets it to CHECKED_AND_USED if in a critical section and to UNCHECKED otherwise (i.e. when CritSectionCount == 0). Regards, Amul 1] https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;

This doesn't really work because of the considerations mentioned in
http://postgr.es/m/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.

It's basically impossible to create a system for fast failover that
involves a checkpoint. See my comments at
http://postgr.es/m/CA+TgmoYe8uCgtYFGfnv3vWpZTygsdkSu2F4MNiqhkar_UKbWfQ@mail.gmail.com
- you can't achieve five nines or even four nines of availability if
you have to wait for a checkpoint that might take twenty minutes. I
have nothing against a feature that does what you're describing, but
this feature is designed to make fast failover easier to accomplish,
and it's not going to succeed if it involves a checkpoint.

> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.

There are a few things you can imagine doing here:

1. Freeze WAL writes but allow dirty buffers to be flushed afterward.
This is the most useful thing for fast failover, I would argue,
because it's quick and the fact that some dirty buffers may not be
written doesn't matter.

2. Freeze WAL writes except a final checkpoint which will flush dirty
buffers along the way. This is like shutting the system down cleanly
and bringing it back up as a standby, except without performing a
shutdown.

3. Freeze WAL writes and write out all dirty buffers without actually
checkpointing. This is sort of a hybrid of #1 and #2. It's probably
not much faster than #2 but it avoids generating any more WAL.

4. Freeze WAL writes and just keep all the dirty buffers cached,
without writing them out. This seems like a bad idea for the reasons
mentioned in Amul's reply. The system might not be able to respond
even to read-only queries any more if shared_buffers is full of
unevictable dirty buffers.

Either #2 or #3 is sufficient to take a filesystem level snapshot of
the cluster while it's running, but I'm not sure why that's
interesting. You can already do that sort of thing by using
pg_basebackup or by running pg_start_backup() and pg_stop_backup() and
copying the directory in the middle, and you can do all of that while
the cluster is accepting writes, which seems like it will usually be
more convenient. If you do want this, you have several options, like
running a checkpoint immediately followed by ALTER SYSTEM READ ONLY
(so that the amount of WAL generated during the backup is small but
maybe not none); or shutting down the system cleanly and restarting it
as a standby; or maybe using the proposed pg_ctl demote feature
mentioned on a separate thread.

Contrary to what you write, I don't think either #2 or #3 is
sufficient to enable checksums, at least not without some more
engineering, because the server would cache the state from the control
file, and a bunch of blocks from the database. I guess it would work
if you did a server restart afterward, but I think there are better
ways of supporting online checksum enabling that don't require
shutting down the server, or even making it read-only; and there's
been significant work done on those already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> In the read-only level I was suggesting, I wasn't suggesting that we
> stop WAL flushes, in fact we should flush the WAL before we mark the
> system as read-only. Once the system declares itself as read-only, it
> will not perform any more on-disk changes; It may perform all the
> flushes it needs as a part of the read-only request handling.

I think that's already how the patch works, or at least how it should
work. You stop new writes, flush any existing WAL, and then declare
the system read-only. That can all be done quickly.

> What I am saying is it doesn't have to be just the queries. I think we
> can cater to all the other use cases simply by forcing a checkpoint
> before marking the system as read-only.

But that part can't, which means that if we did that, it would break
the feature for the originally intended use case. I'm not on board
with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 23, 2020 at 10:04 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.

Seems like that would need a separate discussion, apart from this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> > ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> > if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> > ereport(LOG, (errmsg("system is now read only")));
> > else
> > {
> > /* Request checkpoint */
> > RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> > ereport(LOG, (errmsg("system is now read write")));
> > }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
> FWIW, I don't think we can request CHECKPOINT_WAIT for this place, otherwise,
> it will be deadlock case -- checkpointer process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().

> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState() since it
> needed to
> be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.

Regards,
Soumyadeep
On Fri, Jul 24, 2020 at 7:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > So if we are not going to address those cases, we should change the
> > syntax and remove the notion of read-only. It could be:
> >
> > ALTER SYSTEM SET wal_writes TO off|on;
> > or
> > ALTER SYSTEM SET prohibit_wal TO off|on;
>
> This doesn't really work because of the considerations mentioned in
> http://postgr.es/m/CA+TgmoakCtzOZr0XEqaLFiMBcjE2rGcBAzf4EybpXjtNetpSVw@mail.gmail.com

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy dirty buffer flushes)
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may think of static on-disk
files immediately.

> Contrary to what you write, I don't think either #2 or #3 is
> sufficient to enable checksums, at least not without some more
> engineering, because the server would cache the state from the control
> file, and a bunch of blocks from the database. I guess it would work
> if you did a server restart afterward, but I think there are better
> ways of supporting online checksum enabling that don't require
> shutting down the server, or even making it read-only; and there's
> been significant work done on those already.

Agreed. As you mentioned, if we did do #2 or #3, we would be able to do
pg_checksums on a server that was shut down or that had crashed while it
was in a read-only state, which is what Michael was asking for in [1]. I
think it's just cleaner if we allow for this.

I don't have enough context to enumerate use cases for the advantages or
opportunities that would come with an assurance that the cluster's files
are frozen (and not covered by any existing utilities), but surely there
are some? Like the possibility of pg_upgrade on a running server while
it can entertain read-only queries? Surely, that's a nice one!

Of course, some or all of these utilities would need to be taught about
read-only mode.

Regards,
Soumyadeep

[1] http://postgr.es/m/20200626095921.GF1504@paquier.xyz
On Fri, Jul 24, 2020 at 7:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > In the read-only level I was suggesting, I wasn't suggesting that we
> > stop WAL flushes, in fact we should flush the WAL before we mark the
> > system as read-only. Once the system declares itself as read-only, it
> > will not perform any more on-disk changes; It may perform all the
> > flushes it needs as a part of the read-only request handling.
>
> I think that's already how the patch works, or at least how it should
> work. You stop new writes, flush any existing WAL, and then declare
> the system read-only. That can all be done quickly.
>

True, except for the fact that it allows dirty buffers to be flushed
after the ALTER command returns.

> > What I am saying is it doesn't have to be just the queries. I think we
> > can cater to all the other use cases simply by forcing a checkpoint
> > before marking the system as read-only.
>
> But that part can't, which means that if we did that, it would break
> the feature for the originally intended use case. I'm not on board
> with that.
>

Referring to the options you presented in [1]:
I am saying that we should allow for both: with a checkpoint (#2) (can
also be #3) and without a checkpoint (#1) before having the ALTER command
return, by having different levels of read-onlyness.

We should have syntax variants for these. The syntax should not be an
ALTER SYSTEM SET as you have pointed out before. Perhaps:

ALTER SYSTEM READ ONLY; -- #2 or #3
ALTER SYSTEM READ ONLY WAL; -- #1
ALTER SYSTEM READ WRITE;

or even:

ALTER SYSTEM FREEZE; -- #2 or #3
ALTER SYSTEM FREEZE WAL; -- #1
ALTER SYSTEM UNFREEZE;

Regards,
Soumyadeep (VMware)

[1] http://postgr.es/m/CA+TgmoZ-c3Dz9QwHwmm4bc36N4u0XZ2OyENewMf+BwokbYdK9Q@mail.gmail.com
Hi,
The attached version is updated w.r.t. some of the review comments
from Soumyadeep and Andres.
Two things from Andres' review comments are not addressed:
1. Only a superuser is allowed to execute AlterSystemSetWALProhibitState(). As
per Andres, we should instead do this in a GRANTable manner. I tried that but
got a little confused about the roles we could use for ASRO and didn't see
a really appropriate one. pg_signal_backend could have suited ASRO, where we
terminate some of the backends, but a user granted this role is not supposed
to terminate a superuser backend. If we used that, we would need to check for
a superuser backend and raise an error or warning. Other candidate roles are
pg_write_server_files or pg_execute_server_program, but I am not sure we should
use either of these; it seems a bit confusing to me. Any suggestions, or am I
missing something here?
2. About the walprohibit.c/.h files: Andres' concern about the file name is
that WAL-related file names start with xlog. I think renaming to xlog* would
not be correct and would be more confusing, since the functions/variables/macros
inside walprohibit.c/.h contain the walprohibit keyword. Another concern is
that, because of the separate file, we have to include it in many places, but
I think that is a one-time pain and worth it to keep the code modularised.
Andres, Robert, do let me know your opinion on this; if you think we should
merge walprohibit.c/.h into xlog.c/.h, I will do that in the next version.
Changes in the attached version are:
1. Renamed readonly_cv to walprohibit_cv.
2. Removed repetitive comments for CheckWALPermitted() &
AssertWALPermitted_HaveXID().
3. Renamed AssertWALPermitted_HaveXID() to AssertWALPermittedHaveXID().
4. Changes to avoid repeated RelationNeedsWAL() calls.
5. IsWALProhibited() made static inline function.
6. Added outfuncs and readfuncs functions.
7. Added an error for when a read-only state transition is in progress and
another backend tries to make the system read-write, or vice versa. Previously
the 2nd backend saw the command as executed successfully when it wasn't.
8. Merged checkpointer code changes patch to 0002.
Regards,
Amul
Attachment
On Fri, Jul 24, 2020 at 10:40 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> > ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> > if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> > ereport(LOG, (errmsg("system is now read only")));
> > else
> > {
> > /* Request checkpoint */
> > RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> > ereport(LOG, (errmsg("system is now read write")));
> > }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
>
>
> FWIW, I don't think we can request CHECKPOINT_WAIT for this place, otherwise,
> it will be deadlock case -- checkpointer process waiting for itself.
We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().
Only setting the flag would have been enough for now; the next loop of
CheckpointerMain() is going to call CreateCheckPoint() anyway, without
waiting. I used RequestCheckpoint() to avoid duplicating the flag-setting code.
Also, I think RequestCheckpoint() is better so that we don't need to deal
with the standalone backend; the only imperfection is that it will
unnecessarily signal itself, which would be fine I guess.
> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState() since it
> needed to
> be accessible from other files.
If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.
Well, the current patch set also has a few inline functions in the header file.
But I don't think we can do the same for GetWALProhibitState() without changing
the scope of the XLogCtl structure, which is local to the xlog.c file, and
changing the XLogCtl scope would be a bad idea.
Regards,
Amul
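The trade-off in the exchange above (a static inline helper in a header versus a structure that stays private to one file) can be sketched in a single translation unit. This is an illustration only, not the patch's code: every `demo_*` name and the `DEMO_STATE_READ_ONLY` value are hypothetical, and the two halves stand in for a header and a .c file.

```c
#include <stdbool.h>
#include <stdint.h>

/* ---- What a header could expose (all names are hypothetical) ---- */

#define DEMO_STATE_READ_ONLY 0x0001u

/* Out-of-line accessor: callers never see the structure behind it. */
extern uint32_t demo_get_wal_prohibit_state(void);

/*
 * A static inline helper is usable from every file that includes the
 * header, provided it touches only things the header declares.
 */
static inline bool
demo_is_wal_prohibited(void)
{
    return (demo_get_wal_prohibit_state() & DEMO_STATE_READ_ONLY) != 0;
}

/* ---- What stays private to a single .c file ---- */

struct demo_ctl
{
    uint32_t wal_prohibit_state;
};

static struct demo_ctl demo_ctl_data = {0};

uint32_t
demo_get_wal_prohibit_state(void)
{
    return demo_ctl_data.wal_prohibit_state;
}

/* Setter used only to exercise the sketch. */
void
demo_set_wal_prohibit_state(uint32_t state)
{
    demo_ctl_data.wal_prohibit_state = state;
}
```

Making the accessor itself inline would require moving `struct demo_ctl` and its instance into the header, which corresponds to the scope change the reply above wants to avoid.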
On Fri, Jul 24, 2020 at 3:12 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
> think we should say "READ ONLY" if we still allow on-disk file changes
> after the ALTER SYSTEM command returns (courtesy dirty buffer flushes)
> because it does introduce confusion, especially to an audience not privy
> to this thread. When people hear "read-only" they may think of static on-disk
> files immediately.

They might think of a variety of things that are not a correct
interpretation of what the feature does, but I think the way to handle
that is to document it properly. I don't think making WAL a grammar
keyword just for this is a good idea. I'm not totally stuck on this
particular syntax if there's consensus on something else, but I
seriously doubt that there will be consensus around adding parser
keywords for this.

> I don't have enough context to enumerate use cases for the advantages or
> opportunities that would come with an assurance that the cluster's files
> are frozen (and not covered by any existing utilities), but surely there
> are some? Like the possibility of pg_upgrade on a running server while
> it can entertain read-only queries? Surely, that's a nice one!

I think that this feature is plenty complicated enough already, and we
shouldn't make it more complicated to cater to additional use cases,
especially when those use cases are somewhat uncertain and would
probably require additional work in other parts of the system.

For instance, I think it would be great to have an option to start the
postmaster in a strictly "don't write ANYTHING" mode where regardless
of the cluster state it won't write any data files or any WAL or even
the control file. It would be useful for poking around on damaged
clusters without making things worse. And it's somewhat related to the
topic of this thread, but it's not THAT closely related. It's better
to add features one at a time; you can always add more later, but if
you make the individual ones too big and hard they don't get done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attached is a rebased version on top of the latest master head (# 3e98c0bafb2).

Regards,
Amul
Attachment
On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is a rebased on top of the latest master head (# 3e98c0bafb2).

Does anyone, especially anyone named Andres Freund, have comments on
0001? That work is somewhat independent of the rest of this patch set
from a theoretical point of view, and it seems like if nobody sees a
problem with the line of attack there, it would make sense to go ahead
and commit that part. Considering that this global barrier stuff is
new and that I'm not sure how well we really understand the problems
yet, there's a possibility that we might end up revising these details
again. I understand that most people, including me, are somewhat
reluctant to see experimental code get committed, but in this case that
ship has basically sailed already, since neither of the patches that
we thought would use the barrier mechanism end up making it into v13.
I don't think it's really making things any worse to try to improve
the mechanism.

0002 isn't separately committable, but I don't see anything wrong with it.

Regarding 0003:

I don't understand why ProcessBarrierWALProhibit() can safely assert
that the WALPROHIBIT_STATE_READ_ONLY is set.

+ errhint("Cannot continue a transaction if it has performed writes while system is read only.")));

This sentence is bad because it makes it sound like the current
transaction successfully performed a write after the system had
already become read-only. I think something like errdetail("Sessions
with open write transactions must be terminated.") would be better.

I think SetWALProhibitState() could be in walprohibit.c rather than
xlog.c. Also, this function appears to have obvious race conditions.
It fetches the current state, then thinks things over while holding no
lock, and then unconditionally updates the current state. What happens
if somebody else has changed the state in the meantime? I had sort of
imagined that we'd use something like pg_atomic_uint32 for this and
manipulate it using compare-and-swap operations. Using some kind of
lock is probably fine, too, but you have to hold it long enough that
the variable can't change under you while you're still deciding
whether it's OK to modify it, or else recheck after reacquiring the
lock that the value doesn't differ from what you expect.

I think the choice to use info_lck to synchronize
SharedWALProhibitState is very strange -- what is the justification
for that? I thought the idea might be that we frequently need to check
SharedWALProhibitState at times when we'd be holding info_lck anyway,
but it looks to me like you always do separate acquisitions of
info_lck just for this, in which case I don't see why we should use it
here instead of a separate lock. For that matter, why does this need
to be part of XLogCtlData rather than a separate shared memory area
that is private to walprohibit.c?

- else
+ /*
+ * Can't perform checkpoint or xlog rotation without writing WAL.
+ */
+ else if (XLogInsertAllowed())

Not project style.

+ case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE:

Can we drop the word SYSTEM here to make this shorter, or would that
break some convention?

+/*
+ * NB: The return string should be the same as the _ShowOption() for boolean
+ * type.
+ */
+ static const char *
+ show_system_is_read_only(void)
+{

I'm not sure the comment is appropriate here, but I'm very sure the
extra spaces before "static" and "show" are not per style.

+ /* We'll be done once in-progress flag bit is cleared */

Another whitespace mistake.

+ elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer");
+ elog(DEBUG1, "Done WALProhibitRequest");

I think these should be removed.

Can WALProhibitRequest() and performWALProhibitStateChange() be moved
to walprohibit.c, just to bring more of the code for this feature
together in one place? Maybe we could also rename them to
RequestWALProhibitChange() and CompleteWALProhibitChange()?

- * think it should leave the child state in place.
+ * think it should leave the child state in place. Note that the upper
+ * transaction will be a force to ready-only irrespective of its previous
+ * status if the server state is WAL prohibited.
 */
- XactReadOnly = s->prevXactReadOnly;
+ XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed();

Both instances of this pattern seem sketchy to me. You don't expect
that reverting the state to a previous state will instead change to a
different state that doesn't match up with what you had before. What
is the bad thing that would happen if we did not make this change?

- * Else, must check to see if we're still in recovery.
+ * Else, must check to see if we're still in recovery

Spurious change.

+ /* Request checkpoint */
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE);
+ ereport(LOG, (errmsg("system is now read write")));

This does not seem right. Perhaps the intention here was that the
system should perform a checkpoint when it switches to read-write
state after having skipped the startup checkpoint. But why would we do
this unconditionally in all cases where we just went to a read-write
state?

There's probably quite a bit more to say about 0003 but I think I'm
running too low on mental energy to say more now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
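The compare-and-swap approach suggested in the review above can be sketched with C11 atomics. This is a standalone illustration, not the patch's code and not PostgreSQL's pg_atomic API: the two flag names are taken from the thread but their bit values are assumed, and `begin_wal_prohibit_transition()` and `complete_wal_prohibit_transition()` are hypothetical helpers. The point is that the decision and the update happen against the same observed value, so a concurrent change forces a re-decision instead of being overwritten.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag names follow the thread; the bit values are assumptions. */
#define WALPROHIBIT_STATE_READ_ONLY         0x0001u
#define WALPROHIBIT_TRANSITION_IN_PROGRESS  0x0002u

static _Atomic uint32_t shared_wal_prohibit_state = 0;

/*
 * Try to start a transition to the requested state. Returns true only if
 * this caller installed the in-progress state; returns false if another
 * transition is under way or the system is already in the requested state.
 */
static bool
begin_wal_prohibit_transition(bool read_only)
{
    uint32_t cur = atomic_load(&shared_wal_prohibit_state);

    for (;;)
    {
        uint32_t target;

        if (cur & WALPROHIBIT_TRANSITION_IN_PROGRESS)
            return false;       /* someone else is mid-transition */

        if (((cur & WALPROHIBIT_STATE_READ_ONLY) != 0) == read_only)
            return false;       /* nothing to change */

        target = (read_only ? WALPROHIBIT_STATE_READ_ONLY : 0u) |
            WALPROHIBIT_TRANSITION_IN_PROGRESS;

        /* On failure, cur is refreshed and the checks above run again. */
        if (atomic_compare_exchange_weak(&shared_wal_prohibit_state,
                                         &cur, target))
            return true;
    }
}

/* Clear the in-progress bit once the transition has been carried out. */
static void
complete_wal_prohibit_transition(void)
{
    atomic_fetch_and(&shared_wal_prohibit_state,
                     ~WALPROHIBIT_TRANSITION_IN_PROGRESS);
}
```

A caller that gets `false` back can distinguish the two cases by re-reading the state, for example to report an error when a conflicting transition is already in progress.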
On Sat, Aug 29, 2020 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote: > > Attached is a rebased on top of the latest master head (# 3e98c0bafb2). > > Does anyone, especially anyone named Andres Freund, have comments on > 0001? That work is somewhat independent of the rest of this patch set > from a theoretical point of view, and it seems like if nobody sees a > problem with the line of attack there, it would make sense to go ahead > and commit that part. Considering that this global barrier stuff is > new and that I'm not sure how well we really understand the problems > yet, there's a possibility that we might end up revising these details > again. I understand that most people, including me, are somewhat > reluctant to see experimental code get committed, in this case that > ship has basically sailed already, since neither of the patches that > we thought would use the barrier mechanism end up making it into v13. > I don't think it's really making things any worse to try to improve > the mechanism. > > 0002 isn't separately committable, but I don't see anything wrong with it. > > Regarding 0003: > > I don't understand why ProcessBarrierWALProhibit() can safely assert > that the WALPROHIBIT_STATE_READ_ONLY is set. > IF blocks entered to kill a transaction have valid XID & this happens only in case of system state changing to READ_ONLY. > + errhint("Cannot continue a > transaction if it has performed writes while system is read only."))); > > This sentence is bad because it makes it sound like the current > transaction successfully performed a write after the system had > already become read-only. I think something like errdetail("Sessions > with open write transactions must be terminated.") would be better. > Ok, changed as suggested in the attached version. > I think SetWALProhibitState() could be in walprohibit.c rather than > xlog.c. 
Also, this function appears to have obvious race conditions. > It fetches the current state, then thinks things over while holding no > lock, and then unconditionally updates the current state. What happens > if somebody else has changed the state in the meantime? I had sort of > imagined that we'd use something like pg_atomic_uint32 for this and > manipulate it using compare-and-swap operations. Using some kind of > lock is probably fine, too, but you have to hold it long enough that > the variable can't change under you while you're still deciding > whether it's OK to modify it, or else recheck after reacquiring the > lock that the value doesn't differ from what you expect. > > I think the choice to use info_lck to synchronize > SharedWALProhibitState is very strange -- what is the justification > for that? I thought the idea might be that we frequently need to check > SharedWALProhibitState at times when we'd be holding info_lck anyway, > but it looks to me like you always do separate acquisitions of > info_lck just for this, in which case I don't see why we should use it > here instead of a separate lock. For that matter, why does this need > to be part of XLogCtlData rather than a separate shared memory area > that is private to walprohibit.c? > In the attached patch I added a separate shared memory structure for WAL prohibit state. SharedWALProhibitState is now pg_atomic_uint32 and part of that structure instead of XLogCtlData. The shared state will be changed using a compare-and-swap operation. I hope that should be enough to avoid said race conditions. > - else > + /* > + * Can't perform checkpoint or xlog rotation without writing WAL. > + */ > + else if (XLogInsertAllowed()) > > Not project style. > Corrected. > + case WAIT_EVENT_SYSTEM_WALPROHIBIT_STATE_CHANGE: > > Can we drop the word SYSTEM here to make this shorter, or would that > break some convention? > No issue, removed SYSTEM. 
> +/* > + * NB: The return string should be the same as the _ShowOption() for boolean > + * type. > + */ > + static const char * > + show_system_is_read_only(void) > +{ > Fixed. > I'm not sure the comment is appropriate here, but I'm very sure the > extra spaces before "static" and "show" are not per style. > > + /* We'll be done once in-progress flag bit is cleared */ > > Another whitespace mistake. > Fixed. > + elog(DEBUG1, "WALProhibitRequest: Waiting for checkpointer"); > + elog(DEBUG1, "Done WALProhibitRequest"); > > I think these should be removed. > Removed. > Can WALProhibitRequest() and performWALProhibitStateChange() be moved > to walprohibit.c, just to bring more of the code for this feature > together in one place? Maybe we could also rename them to > RequestWALProhibitChange() and CompleteWALProhibitChange()? > Yes, I have moved these functions to walprohibit.c and renamed as suggested. For this, I needed to add few helper functions to send a signal to checkpointer and update Control File, as send_signal_to_checkpointer & SetControlFileWALProhibitFlag() respectively, since checkpointer_pid or ControlFile are not directly accessible from walprohibit.c > - * think it should leave the child state in place. > + * think it should leave the child state in place. Note that the upper > + * transaction will be a force to ready-only irrespective of > its previous > + * status if the server state is WAL prohibited. > */ > - XactReadOnly = s->prevXactReadOnly; > + XactReadOnly = s->prevXactReadOnly || !XLogInsertAllowed(); > > Both instances of this pattern seem sketchy to me. You don't expect > that reverting the state to a previous state will instead change to a > different state that doesn't match up with what you had before. What > is the bad thing that would happen if we did not make this change? > We can drop these changes now since we are simply terminating sessions for those who have performed or expected to perform write operations. 
> -     * Else, must check to see if we're still in recovery.
> +     * Else, must check to see if we're still in recovery
>
> Spurious change.
>

Fixed.

> +    /* Request checkpoint */
> +    RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> +    ereport(LOG, (errmsg("system is now read write")));
>
> This does not seem right. Perhaps the intention here was that the
> system should perform a checkpoint when it switches to read-write
> state after having skipped the startup checkpoint. But why would we do
> this unconditionally in all cases where we just went to a read-write
> state?
>

You are correct, since this could be expensive if the system is
read-only for only a short period. In the initial version I did this
unconditionally to avoid adding shared-memory variables to XLogCtlData,
but now that the WAL prohibit state has its own shared-memory
structure, I have added the required variable there. The checkpoint is
now performed conditionally, with the CHECKPOINT_END_OF_RECOVERY and
CHECKPOINT_IMMEDIATE flags, the same as we do in the startup process.
Note that to record that the end-of-recovery checkpoint was skipped by
the startup process, I have added a helper function,
MarkCheckPointSkippedInWalProhibitState(); I am not sure the name I
have chosen is the best fit.

> There's probably quite a bit more to say about 0003 but I think I'm
> running too low on mental energy to say more now.
>

Thanks for your time and suggestions.

Regards,
Amul
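
The compare-and-swap approach discussed in this exchange can be sketched in isolation. This is only an illustration under stated assumptions: it uses C11 <stdatomic.h> rather than PostgreSQL's pg_atomic_* API, and the flag values, the result enum, and start_transition() are hypothetical names, not taken from the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag values, loosely modelled on the names in the patch. */
#define STATE_READ_WRITE   UINT32_C(0)
#define STATE_READ_ONLY    UINT32_C(1)
#define STATE_IN_PROGRESS  UINT32_C(2)

typedef enum
{
    TRANSITION_STARTED,     /* this caller moved the state to in-progress */
    TRANSITION_NOT_NEEDED,  /* already in, or already moving to, the target */
    TRANSITION_BUSY         /* the opposite transition is still in progress */
} TransitionResult;

static _Atomic uint32_t shared_state = STATE_READ_WRITE;

/*
 * Try to begin a transition to target (STATE_READ_ONLY or STATE_READ_WRITE).
 * The decision is always made against the value the CAS will compare with,
 * so the state cannot change between "deciding" and "updating".
 */
static TransitionResult
start_transition(uint32_t target)
{
    uint32_t cur = atomic_load(&shared_state);

    for (;;)
    {
        if ((cur & ~STATE_IN_PROGRESS) == target)
            return TRANSITION_NOT_NEEDED;
        if (cur & STATE_IN_PROGRESS)
            return TRANSITION_BUSY;

        /*
         * On failure (including a spurious one) cur is refreshed with the
         * current value and the checks above are repeated.
         */
        if (atomic_compare_exchange_weak(&shared_state, &cur,
                                         target | STATE_IN_PROGRESS))
            return TRANSITION_STARTED;
    }
}
```

Because every retry re-evaluates the checks against the freshly observed value, a concurrent change shows up as "not needed" or "busy" rather than as a lost update.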
Hi,

On 2020-08-28 15:53:29 -0400, Robert Haas wrote:
> On Wed, Aug 19, 2020 at 6:28 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is a rebased on top of the latest master head (# 3e98c0bafb2).
>
> Does anyone, especially anyone named Andres Freund, have comments on
> 0001? That work is somewhat independent of the rest of this patch set
> from a theoretical point of view, and it seems like if nobody sees a
> problem with the line of attack there, it would make sense to go ahead
> and commit that part.

It'd be easier to review the proposed commit if it included reasoning
about the change...

In particular, it looks to me like the commit actually implements two
different changes:
1) Allow a barrier function to "reject" a set barrier, because it can't
   be set in that moment
2) Allow barrier functions to raise errors

and there's not much of an explanation as to why (probably somewhere
upthread, but ...)

 /*
  * ProcSignalShmemSize
@@ -486,17 +490,59 @@ ProcessProcSignalBarrier(void)
     flags = pg_atomic_exchange_u32(&MyProcSignalSlot->pss_barrierCheckMask, 0);

     /*
-     * Process each type of barrier. It's important that nothing we call from
-     * here throws an error, because pss_barrierCheckMask has already been
-     * cleared. If we jumped out of here before processing all barrier types,
-     * then we'd forget about the need to do so later.
-     *
-     * NB: It ought to be OK to call the barrier-processing functions
-     * unconditionally, but it's more efficient to call only the ones that
-     * might need us to do something based on the flags.
+     * If there are no flags set, then we can skip doing any real work.
+     * Otherwise, establish a PG_TRY block, so that we don't lose track of
+     * which types of barrier processing are needed if an ERROR occurs.
      */
-    if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER))
-        ProcessBarrierPlaceholder();
+    if (flags != 0)
+    {
+        PG_TRY();
+        {
+            /*
+             * Process each type of barrier.
+             * The barrier-processing functions
+             * should normally return true, but may return false if the barrier
+             * can't be absorbed at the current time. This should be rare,
+             * because it's pretty expensive. Every single
+             * CHECK_FOR_INTERRUPTS() will return here until we manage to
+             * absorb the barrier, and that cost will add up in a hurry.
+             *
+             * NB: It ought to be OK to call the barrier-processing functions
+             * unconditionally, but it's more efficient to call only the ones
+             * that might need us to do something based on the flags.
+             */
+            if (BARRIER_SHOULD_CHECK(flags, PROCSIGNAL_BARRIER_PLACEHOLDER)
+                && ProcessBarrierPlaceholder())
+                BARRIER_CLEAR_BIT(flags, PROCSIGNAL_BARRIER_PLACEHOLDER);

This pattern seems like it'll get unwieldy with more than one barrier
type. And won't flag "unhandled" barrier types either (already the case,
I know). We could go for something like:

    while (flags != 0)
    {
        barrier_bit = pg_rightmost_one_pos32(flags);
        barrier_type = 1 >> barrier_bit;

        switch (barrier_type)
        {
            case PROCSIGNAL_BARRIER_PLACEHOLDER:
                processed = ProcessBarrierPlaceholder();
        }

        if (processed)
            BARRIER_CLEAR_BIT(flags, barrier_type);
    }

But perhaps that's too complicated?

+        }
+        PG_CATCH();
+        {
+            /*
+             * If an ERROR occurred, add any flags that weren't yet handled
+             * back into pss_barrierCheckMask, and reset the global variables
+             * so that we try again the next time we check for interrupts.
+             */
+            pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+                                   flags);

For this to be correct, wouldn't flags need to be volatile? Otherwise
this might use a register value for flags, which might not contain the
correct value at this point.

Perhaps a comment explaining why we have to clear bits first would be
good?

+            ProcSignalBarrierPending = true;
+            InterruptPending = true;
+
+            PG_RE_THROW();
+        }
+        PG_END_TRY();

+        /*
+         * If some barrier was not successfully absorbed, we will have to try
+         * again later.
+         */
+        if (flags != 0)
+        {
+            pg_atomic_fetch_or_u32(&MyProcSignalSlot->pss_barrierCheckMask,
+                                   flags);
+            ProcSignalBarrierPending = true;
+            InterruptPending = true;
+            return;
+        }
+    }

I wish there were a way we could combine the PG_CATCH and this instance
of the same code. I'd probably just move into a helper.

It might be good to add a warning to WaitForProcSignalBarrier() or by
pss_barrierCheckMask indicating that it's *not* OK to look at
pss_barrierCheckMask when checking whether barriers have been processed.

> Considering that this global barrier stuff is
> new and that I'm not sure how well we really understand the problems
> yet, there's a possibility that we might end up revising these details
> again. I understand that most people, including me, are somewhat
> reluctant to see experimental code get committed, in this case that
> ship has basically sailed already, since neither of the patches that
> we thought would use the barrier mechanism end up making it into v13.
> I don't think it's really making things any worse to try to improve
> the mechanism.

Yea, I have no problem with this.

Greetings,

Andres Freund
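
The loop-over-set-bits idea from this review can be sketched as a standalone function. This is an illustration only, not the patch's code: the handler table, process_barriers(), and the two example handlers are hypothetical, and a small helper stands in for pg_rightmost_one_pos32(). Note that the pseudocode in the mail writes `1 >> barrier_bit`, where a left shift (or simply the bit position itself) was presumably intended; the sketch uses the bit position as the barrier type. It also retires each bit from the working copy regardless of outcome, so the loop terminates even when a handler declines, and returns the bits that still need a retry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A handler returns true once its barrier has been absorbed. */
typedef bool (*BarrierHandler) (void);

/* Position of the lowest set bit; word must be nonzero. */
static int
rightmost_one_pos32(uint32_t word)
{
    int pos = 0;

    while ((word & 1u) == 0)
    {
        word >>= 1;
        pos++;
    }
    return pos;
}

/*
 * Run the handler for every bit set in flags.  Returns the bits that were
 * not absorbed: either the handler returned false, or no handler exists
 * for that bit (a real implementation would probably report the latter
 * as an error instead).
 */
static uint32_t
process_barriers(uint32_t flags, const BarrierHandler handlers[], int nhandlers)
{
    uint32_t pending = flags;
    uint32_t unabsorbed = 0;

    while (pending != 0)
    {
        int      bit = rightmost_one_pos32(pending);
        uint32_t mask = UINT32_C(1) << bit;
        bool     processed = false;

        if (bit < nhandlers && handlers[bit] != NULL)
            processed = handlers[bit] ();

        if (!processed)
            unabsorbed |= mask;

        pending &= ~mask;       /* each bit is visited exactly once */
    }
    return unabsorbed;
}

/* Example handlers: one absorbs its barrier, one asks to be retried. */
static bool absorb_ok(void)    { return true; }
static bool absorb_later(void) { return false; }
```

The caller would OR the returned mask back into the shared check mask, which is the same step the patch performs in both the PG_CATCH block and the "try again later" branch.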
Hi, Thomas, there's one point below that could be relevant for you. You can search for your name and/or checkpoint... On 2020-09-01 16:43:10 +0530, Amul Sul wrote: > diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c > index 42050ab7195..0ac826d3c2f 100644 > --- a/src/backend/nodes/readfuncs.c > +++ b/src/backend/nodes/readfuncs.c > @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void) > READ_DONE(); > } > > +/* > + * _readAlterSystemWALProhibitState > + */ > +static AlterSystemWALProhibitState * > +_readAlterSystemWALProhibitState(void) > +{ > + READ_LOCALS(AlterSystemWALProhibitState); > + > + READ_BOOL_FIELD(WALProhibited); > + > + READ_DONE(); > +} > + Why do we need readfuncs support for this? > + > +/* > + * AlterSystemSetWALProhibitState > + * > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. > + */ > +static void > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > +{ > + /* some code */ > + elog(INFO, "AlterSystemSetWALProhibitState() called"); > +} As long as it's not implemented it seems better to return an ERROR. > @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt > VariableSetStmt *setstmt; /* SET subcommand */ > } AlterSystemStmt; > > +/* ---------------------- > + * Alter System Read Statement > + * ---------------------- > + */ > +typedef struct AlterSystemWALProhibitState > +{ > + NodeTag type; > + bool WALProhibited; > +} AlterSystemWALProhibitState; > + All the nearby fields use under_score_style names. > From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001 > From: Amul Sul <amul.sul@enterprisedb.com> > Date: Fri, 19 Jun 2020 06:29:36 -0400 > Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier. > > Implementation: > > 1. When a user tried to change server state to WAL-Prohibited using > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() > raises request to checkpointer by marking current state to inprogress in > shared memory. 
Checkpointer, noticing that the current state is has "is has" > WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and > then acknowledges back to the backend who requested the state change once > the transition has been completed. Final state will be updated in control > file to make it persistent across the system restarts. What makes checkpointer the right backend to do this work? > 2. When a backend receives the WAL-Prohibited barrier, at that moment if > it is already in a transaction and the transaction already assigned XID, > then the backend will be killed by throwing FATAL(XXX: need more discussion > on this) > 3. Otherwise, if that backend running transaction which yet to get XID > assigned we don't need to do anything special Somewhat garbled sentence... > 4. A new transaction (from existing or new backend) starts as a read-only > transaction. Maybe "(in an existing or in a new backend)"? > 5. Autovacuum launcher as well as checkpointer will don't do anything in > WAL-Prohibited server state until someone wakes us up. E.g. a backend > might later on request us to put the system back to read-write. "will don't do anything", "might later on request us" > 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint > and xlog rotation. Starting up again will perform crash recovery(XXX: > need some discussion on this as well) but the end of recovery checkpoint > will be skipped and it will be performed when the system changed to > WAL-Permitted mode. Hm, this has some interesting interactions with some of Thomas' recent hacking. > 8. Only super user can toggle WAL-Prohibit state. Hm. I don't quite agree with this. We try to avoid if (superuser()) style checks these days, because they can't be granted to other users. Look at how e.g. pg_promote() - an operation of similar severity - is handled. We just revoke the permission from public in system_views.sql: REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public; > 9. 
Add system_is_read_only GUC show the system state -- will true when system > is wal prohibited or in recovery. *shows the system state. There's also some oddity in the second part of the sentence. Is it really correct to show system_is_read_only as true during recovery? For one, recovery could end soon after, putting the system into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also, during recovery the database state actually changes if there are changes to replay. ISTM it would not be a good idea to mix ASRO and pg_is_in_recovery() into one GUC. > --- /dev/null > +++ b/src/backend/access/transam/walprohibit.c > @@ -0,0 +1,321 @@ > +/*------------------------------------------------------------------------- > + * > + * walprohibit.c > + * PostgreSQL write-ahead log prohibit states > + * > + * > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > + * > + * src/backend/access/transam/walprohibit.c > + * > + *------------------------------------------------------------------------- > + */ > +#include "postgres.h" > + > +#include "access/walprohibit.h" > +#include "pgstat.h" > +#include "port/atomics.h" > +#include "postmaster/bgwriter.h" > +#include "storage/condition_variable.h" > +#include "storage/procsignal.h" > +#include "storage/shmem.h" > + > +/* > + * Shared-memory WAL prohibit state > + */ > +typedef struct WALProhibitStateData > +{ > + /* Indicates current WAL prohibit state */ > + pg_atomic_uint32 SharedWALProhibitState; > + > + /* Startup checkpoint pending */ > + bool checkpointPending; > + > + /* Signaled when requested WAL prohibit state changes */ > + ConditionVariable walprohibit_cv; You're using three different naming styles for as many members. > +/* > + * ProcessBarrierWALProhibit() > + * > + * Handle WAL prohibit state change request. > + */ > +bool > +ProcessBarrierWALProhibit(void) > +{ > + /* > + * Kill off any transactions that have an XID *before* allowing the system > + * to go WAL prohibit state. 
> + */ > + if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny())) Hm. I wonder if this check is good enough. If you look at RecordTransactionCommit() we also WAL log in some cases where no xid was assigned. This is particularly true of (auto-)vacuum, but also for HOT pruning. I think it'd be good to put the logic of this check into xlog.c and mirror the logic in RecordTransactionCommit(). And add cross-referencing comments to RecordTransactionCommit and the new function, reminding our futures selves that both places need to be modified. > + { > + /* Should be here only for the WAL prohibit state. */ > + Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY); There are no races where an ASRO READ ONLY is quickly followed by ASRO READ WRITE where this could be reached? > +/* > + * AlterSystemSetWALProhibitState() > + * > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. > + */ > +void > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > +{ > + uint32 state; > + > + if (!superuser()) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("must be superuser to execute ALTER SYSTEM command"))); See comments about this above. > + /* Alter WAL prohibit state not allowed during recovery */ > + PreventCommandDuringRecovery("ALTER SYSTEM"); > + > + /* Requested state */ > + state = stmt->WALProhibited ? > + WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE; > + > + /* > + * Since we yet to convey this WAL prohibit state to all backend mark it > + * in-progress. > + */ > + state |= WALPROHIBIT_TRANSITION_IN_PROGRESS; > + > + if (!SetWALProhibitState(state)) > + return; /* server is already in the desired state */ > + This use of bitmasks seems unnecessary to me. I'd rather have one param for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one for WALPROHIBIT_TRANSITION_IN_PROGRESS > +/* > + * RequestWALProhibitChange() > + * > + * Request checkpointer to make the WALProhibitState to read-only. 
> + */ > +static void > +RequestWALProhibitChange(void) > +{ > + /* Must not be called from checkpointer */ > + Assert(!AmCheckpointerProcess()); > + Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS); > + > + /* > + * If in a standalone backend, just do it ourselves. > + */ > + if (!IsPostmasterEnvironment) > + { > + CompleteWALProhibitChange(GetWALProhibitState()); > + return; > + } > + > + send_signal_to_checkpointer(SIGINT); > + > + /* Wait for the state to change to read-only */ > + ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv); > + for (;;) > + { > + /* We'll be done once in-progress flag bit is cleared */ > + if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > + break; > + > + ConditionVariableSleep(&WALProhibitState->walprohibit_cv, > + WAIT_EVENT_WALPROHIBIT_STATE_CHANGE); > + } > + ConditionVariableCancelSleep(); What if somebody concurrently changes the state back to READ WRITE? Won't we unnecessarily wait here? That's probably fine, because we would just wait until that transition is complete too. But at least a comment about that would be good. Alternatively a "ASRO transitions completed counter" or such might be a better idea? > +/* > + * CompleteWALProhibitChange() > + * > + * Checkpointer will call this to complete the requested WAL prohibit state > + * transition. > + */ > +void > +CompleteWALProhibitChange(uint32 wal_state) > +{ > + uint64 barrierGeneration; > + > + /* > + * Must be called from checkpointer. Otherwise, it must be single-user > + * backend. > + */ > + Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment); > + Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS); > + > + /* > + * WAL prohibit state change is initiated. We need to complete the state > + * transition by setting requested WAL prohibit state in all backends. 
> + */ > + elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state"); > + > + /* Emit global barrier */ > + barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT); > + WaitForProcSignalBarrier(barrierGeneration); > + > + /* And flush all writes. */ > + XLogFlush(GetXLogWriteRecPtr()); Hm, maybe I'm missing something, but why is the write pointer the right thing to flush? That won't include records that haven't been written to disk yet... We also need to trigger writing out all WAL that is as of yet unwritten, no? Without having thought a lot about it, it seems that GetXLogInsertRecPtr() would be the right thing to flush? > + /* Set final state by clearing in-progress flag bit */ > + if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS))) > + { > + bool wal_prohibited; > + > + wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0; > + > + /* Update the control file to make state persistent */ > + SetControlFileWALProhibitFlag(wal_prohibited); Hm. Is there an issue with not WAL logging the control file change? Is there a scenario where we a crash + recovery would end up overwriting this? > + if (wal_prohibited) > + ereport(LOG, (errmsg("system is now read only"))); > + else > + { > + /* > + * Request checkpoint if the end-of-recovery checkpoint has been > + * skipped previously. > + */ > + if (WALProhibitState->checkpointPending) > + { > + RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | > + CHECKPOINT_IMMEDIATE); > + WALProhibitState->checkpointPending = false; > + } > + ereport(LOG, (errmsg("system is now read write"))); > + } > + } > + > + /* Wake up the backend who requested the state change */ > + ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv); Could be multiple backends, right? 
> +} > + > +/* > + * GetWALProhibitState() > + * > + * Atomically return the current server WAL prohibited state > + */ > +uint32 > +GetWALProhibitState(void) > +{ > + return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState); > +} Is there an issue with needing memory barriers here? > +/* > + * SetWALProhibitState() > + * > + * Change current WAL prohibit state to the input state. > + * > + * If the server is already completely moved to the requested WAL prohibit > + * state, or if the desired state is same as the current state, return false, > + * indicating that the server state did not change. Else return true. > + */ > +bool > +SetWALProhibitState(uint32 new_state) > +{ > + bool state_updated = false; > + uint32 cur_state; > + > + cur_state = GetWALProhibitState(); > + > + /* Server is already in requested state */ > + if (new_state == cur_state || > + new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS)) > + return false; > + > + /* Prevent concurrent contrary in progress transition state setting */ > + if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) && > + (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > + { > + if (cur_state & WALPROHIBIT_STATE_READ_ONLY) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("system state transition to read only is already in progress"), > + errhint("Try after sometime again."))); > + else > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("system state transition to read write is already in progress"), > + errhint("Try after sometime again."))); > + } > + > + /* Update new state in share memory */ > + state_updated = > + pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState, > + &cur_state, new_state); > + > + if (!state_updated) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("system read write state concurrently changed"), > + errhint("Try after sometime again."))); > + I 
don't think it's safe to use pg_atomic_compare_exchange_u32() outside of
a loop. I think there's platforms (basically all load-linked /
store-conditional architectures) where that can fail spuriously.

Also, there's no memory barrier around GetWALProhibitState, so there's
no guarantee it's not an out-of-date value you're starting with.

> +/*
> + * MarkCheckPointSkippedInWalProhibitState()
> + *
> + * Sets checkpoint pending flag so that it can be performed next time while
> + * changing system state to WAL permitted.
> + */
> +void
> +MarkCheckPointSkippedInWalProhibitState(void)
> +{
> +    WALProhibitState->checkpointPending = true;
> +}

I don't *at all* like this living outside of xlog.c. I think this should
be moved there, and merged with deferring checkpoints in other cases
(promotions, not immediately performing a checkpoint after recovery).
There's state in ControlFile *and* here for essentially the same thing.

> +     * If it is not currently possible to insert write-ahead log records,
> +     * either because we are still in recovery or because ALTER SYSTEM READ
> +     * ONLY has been executed, force this to be a read-only transaction.
> +     * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> +     * us from modifying data during recovery when !XLogInsertAllowed(), but
> +     * this gives the normal indication to the user that the transaction is
> +     * read-only.
> +     *
> +     * On the other hand, we only need to set the startedInRecovery flag when
> +     * the transaction started during recovery, and not when WAL is otherwise
> +     * prohibited. This information is used by RelationGetIndexScan() to
> +     * decide whether to permit (1) relying on existing killed-tuple markings
> +     * and (2) further killing of index tuples. Even when WAL is prohibited
> +     * on the master, it's still the master, so the former is OK; and since
> +     * killing index tuples doesn't generate WAL, the latter is also OK.
> +     * See comments in RelationGetIndexScan() and MarkBufferDirtyHint().
> +     */
> +    XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed();
> +    s->startedInRecovery = RecoveryInProgress();

It's somewhat ugly that we call RecoveryInProgress() once in
XLogInsertAllowed() and then again directly here... It's probably fine
runtime cost wise, but...

>  /*
>   * Subroutine to try to fetch and validate a prior checkpoint record.
>   *
> @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg)
>       */
>      WalSndWaitStopping();
>
> +    /*
> +     * The restartpoint, checkpoint, or xlog rotation will be performed if the
> +     * WAL writing is permitted.
> +     */
>      if (RecoveryInProgress())
>          CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> -    else
> +    else if (XLogInsertAllowed())

Not sure I like going via XLogInsertAllowed(), that seems like a
confusing indirection here. And it encompasses things we actually don't
want to check for - it's fragile to also look at LocalXLogInsertAllowed
here imo.

>      ShutdownCLOG();
>      ShutdownCommitTs();
>      ShutdownSUBTRANS();
> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 1b8cd7bacd4..aa4cdd57ec1 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[])
>
>          HandleAutoVacLauncherInterrupts();
>
> +        /* If the server is read only just go back to sleep. */
> +        if (!XLogInsertAllowed())
> +            continue;
> +

I think we really should have different functions for places like
this. We don't want to generally hide bugs like e.g. starting the
autovac launcher in recovery, but this would.

> @@ -342,6 +344,28 @@ CheckpointerMain(void)
>          AbsorbSyncRequests();
>          HandleCheckpointerInterrupts();
>
> +        wal_state = GetWALProhibitState();
> +
> +        if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)
> +        {
> +            /* Complete WAL prohibit state change request */
> +            CompleteWALProhibitChange(wal_state);
> +            continue;
> +        }
> +        else if (wal_state & WALPROHIBIT_STATE_READ_ONLY)
> +        {
> +            /*
> +             * Don't do anything until someone wakes us up. For example a
> +             * backend might later on request us to put the system back to
> +             * read-write wal prohibit sate.
> +             */
> +            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> +                             WAIT_EVENT_CHECKPOINTER_MAIN);
> +            continue;
> +        }
> +        Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE);
> +
>          /*
>           * Detect a pending checkpoint request by checking whether the flags
>           * word in shared memory is nonzero. We shouldn't need to acquire the
> @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void)
>
>      return FirstCall;
>  }

So, if we're in the middle of a paced checkpoint with a large
checkpoint_timeout - a sensible real world configuration - we'll not
process ASRO until that checkpoint is over? That seems very much not
practical. What am I missing?

> +/*
> + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process.
> + */
> +void
> +send_signal_to_checkpointer(int signum)
> +{
> +    if (CheckpointerShmem->checkpointer_pid == 0)
> +        elog(ERROR, "checkpointer is not running");
> +
> +    if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0)
> +        elog(ERROR, "could not signal checkpointer: %m");
> +}

Sudden switch to a different naming style...

Greetings,

Andres Freund
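
The "transitions completed counter" floated in this review, as an alternative to waiting for the in-progress bit to clear, might look roughly like the following. It is a sketch under loud assumptions: it uses POSIX threads in place of PostgreSQL's ConditionVariable and shared memory, and TransitionTracker, transitions_completed(), complete_transition(), and wait_for_transition_after() are hypothetical names that do not appear in the patch.

```c
#include <pthread.h>
#include <stdint.h>

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    uint64_t        completed;  /* number of finished state transitions */
} TransitionTracker;

static TransitionTracker tracker = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

/* Snapshot the counter; a requester calls this before asking for a change. */
static uint64_t
transitions_completed(void)
{
    uint64_t n;

    pthread_mutex_lock(&tracker.lock);
    n = tracker.completed;
    pthread_mutex_unlock(&tracker.lock);
    return n;
}

/* Called by whoever finishes a transition (the checkpointer, in the patch). */
static void
complete_transition(void)
{
    pthread_mutex_lock(&tracker.lock);
    tracker.completed++;
    /* Broadcast, since several requesters may be waiting at once. */
    pthread_cond_broadcast(&tracker.changed);
    pthread_mutex_unlock(&tracker.lock);
}

/* Block until at least one transition has finished after the snapshot. */
static void
wait_for_transition_after(uint64_t seen)
{
    pthread_mutex_lock(&tracker.lock);
    while (tracker.completed <= seen)
        pthread_cond_wait(&tracker.changed, &tracker.lock);
    pthread_mutex_unlock(&tracker.lock);
}
```

With a counter, a waiter is released by the completion of its own request even if another backend immediately starts the opposite transition, whereas a waiter that watches only the in-progress bit would keep sleeping through that second transition.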
On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, Thanks for your time. > > Thomas, there's one point below that could be relevant for you. You can > search for your name and/or checkpoint... > > > On 2020-09-01 16:43:10 +0530, Amul Sul wrote: > > diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c > > index 42050ab7195..0ac826d3c2f 100644 > > --- a/src/backend/nodes/readfuncs.c > > +++ b/src/backend/nodes/readfuncs.c > > @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void) > > READ_DONE(); > > } > > > > +/* > > + * _readAlterSystemWALProhibitState > > + */ > > +static AlterSystemWALProhibitState * > > +_readAlterSystemWALProhibitState(void) > > +{ > > + READ_LOCALS(AlterSystemWALProhibitState); > > + > > + READ_BOOL_FIELD(WALProhibited); > > + > > + READ_DONE(); > > +} > > + > > Why do we need readfuncs support for this? > I thought we need that from your previous comment[1]. > > + > > +/* > > + * AlterSystemSetWALProhibitState > > + * > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. > > + */ > > +static void > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > > +{ > > + /* some code */ > > + elog(INFO, "AlterSystemSetWALProhibitState() called"); > > +} > > As long as it's not implemented it seems better to return an ERROR. > Ok, will add an error in the next version. > > @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt > > VariableSetStmt *setstmt; /* SET subcommand */ > > } AlterSystemStmt; > > > > +/* ---------------------- > > + * Alter System Read Statement > > + * ---------------------- > > + */ > > +typedef struct AlterSystemWALProhibitState > > +{ > > + NodeTag type; > > + bool WALProhibited; > > +} AlterSystemWALProhibitState; > > + > > All the nearby fields use under_score_style names. > I am not sure which nearby fields having the underscore that you are referring to. 
Probably "WALProhibited" needs to be renamed to "walprohibited" to be
in line with the nearby fields.

>
> > From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> > From: Amul Sul <amul.sul@enterprisedb.com>
> > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> >
> > Implementation:
> >
> > 1. When a user tried to change server state to WAL-Prohibited using
> > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> > raises request to checkpointer by marking current state to inprogress in
> > shared memory. Checkpointer, noticing that the current state is has
>
> "is has"
>
> > WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> > then acknowledges back to the backend who requested the state change once
> > the transition has been completed. Final state will be updated in control
> > file to make it persistent across the system restarts.
>
> What makes checkpointer the right backend to do this work?
>

Once we've initiated the change to a read-only state, we probably want
to always either finish that change or go back to read-write, even if
the process that initiated the change is interrupted. Leaving the
system in a half-way-in-between state long term seems bad. We could
have added a dedicated background process for this, but to keep the
first version of the patch simple we put the checkpointer in charge of
making the state change instead of adding a new process. The
checkpointer isn't likely to get killed, but if it does, it will be
relaunched and the new one can clean things up. Later we might want a
dedicated background worker that is similarly unlikely to get killed.

>
> > 2. When a backend receives the WAL-Prohibited barrier, at that moment if
> > it is already in a transaction and the transaction already assigned XID,
> > then the backend will be killed by throwing FATAL(XXX: need more discussion
> > on this)
> >
> > 3.
Otherwise, if that backend running transaction which yet to get XID
> > assigned we don't need to do anything special
>
> Somewhat garbled sentence...
>
>
> > 4. A new transaction (from existing or new backend) starts as a read-only
> > transaction.
>
> Maybe "(in an existing or in a new backend)"?
>
>
> > 5. Autovacuum launcher as well as checkpointer will don't do anything in
> > WAL-Prohibited server state until someone wakes us up. E.g. a backend
> > might later on request us to put the system back to read-write.
>
> "will don't do anything", "might later on request us"
>

Ok, I'll fix all of this. I usually don't focus much on the commit
message text, but I do try to keep it reasonably sane.

>
> > 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> > and xlog rotation. Starting up again will perform crash recovery(XXX:
> > need some discussion on this as well) but the end of recovery checkpoint
> > will be skipped and it will be performed when the system changed to
> > WAL-Permitted mode.
>
> Hm, this has some interesting interactions with some of Thomas' recent
> hacking.
>

I would be so thankful for the help.

>
> > 8. Only super user can toggle WAL-Prohibit state.
>
> Hm. I don't quite agree with this. We try to avoid if (superuser())
> style checks these days, because they can't be granted to other
> users. Look at how e.g. pg_promote() - an operation of similar severity
> - is handled. We just revoke the permission from public in
> system_views.sql:
> REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;
>

Ok, currently we don't have an SQL-callable function to change the
system read-write state. Do you want me to add that? If so, any naming
suggestions? How about pg_make_system_read_only(bool), or two
functions, pg_make_system_read_only(void) and
pg_make_system_read_write(void)?

>
> > 9. Add system_is_read_only GUC show the system state -- will true when system
> > is wal prohibited or in recovery.
>
> *shows the system state. There's also some oddity in the second part of
> the sentence.
>
> Is it really correct to show system_is_read_only as true during
> recovery? For one, recovery could end soon after, putting the system
> into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
> during recovery the database state actually changes if there are changes
> to replay. ISTM it would not be a good idea to mix ASRO and
> pg_is_in_recovery() into one GUC.
>

Well, whether the system is in recovery or in the WAL prohibited state,
it is read-only from the user's perspective, isn't it?

>
> > --- /dev/null
> > +++ b/src/backend/access/transam/walprohibit.c
> > @@ -0,0 +1,321 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * walprohibit.c
> > + *    PostgreSQL write-ahead log prohibit states
> > + *
> > + *
> > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> > + *
> > + * src/backend/access/transam/walprohibit.c
> > + *
> > + *-------------------------------------------------------------------------
> > + */
> > +#include "postgres.h"
> > +
> > +#include "access/walprohibit.h"
> > +#include "pgstat.h"
> > +#include "port/atomics.h"
> > +#include "postmaster/bgwriter.h"
> > +#include "storage/condition_variable.h"
> > +#include "storage/procsignal.h"
> > +#include "storage/shmem.h"
> > +
> > +/*
> > + * Shared-memory WAL prohibit state
> > + */
> > +typedef struct WALProhibitStateData
> > +{
> > +    /* Indicates current WAL prohibit state */
> > +    pg_atomic_uint32 SharedWALProhibitState;
> > +
> > +    /* Startup checkpoint pending */
> > +    bool checkpointPending;
> > +
> > +    /* Signaled when requested WAL prohibit state changes */
> > +    ConditionVariable walprohibit_cv;
>
> You're using three different naming styles for as many members.
>

I'll fix this in the next version.

>
> > +/*
> > + * ProcessBarrierWALProhibit()
> > + *
> > + * Handle WAL prohibit state change request.
> > + */ > > +bool > > +ProcessBarrierWALProhibit(void) > > +{ > > + /* > > + * Kill off any transactions that have an XID *before* allowing the system > > + * to go WAL prohibit state. > > + */ > > + if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny())) > > Hm. I wonder if this check is good enough. If you look at > RecordTransactionCommit() we also WAL log in some cases where no xid was > assigned. This is particularly true of (auto-)vacuum, but also for HOT > pruning. > > I think it'd be good to put the logic of this check into xlog.c and > mirror the logic in RecordTransactionCommit(). And add cross-referencing > comments to RecordTransactionCommit and the new function, reminding our > futures selves that both places need to be modified. > I am not sure I have understood this, here is the snip from the implementation detail from the first post[2]: "Open transactions that don't have an XID are not killed, but will get an ERROR if they try to acquire an XID later, or if they try to write WAL without acquiring an XID (e.g. VACUUM). To make that happen, the patch adds a new coding rule: a critical section that will write WAL must be preceded by a call to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID(). The latter variants are used when we know for certain that inserting WAL here must be OK, either because we have an XID (we would have been killed by a change to read-only if one had occurred) or for some other reason." Do let me know if you want further clarification. > > > + { > > + /* Should be here only for the WAL prohibit state. */ > > + Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY); > > There are no races where an ASRO READ ONLY is quickly followed by ASRO > READ WRITE where this could be reached? > No, right now SetWALProhibitState() doesn't allow two transient wal prohibit states at a time. > > > +/* > > + * AlterSystemSetWALProhibitState() > > + * > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. 
> > + */ > > +void > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > > +{ > > + uint32 state; > > + > > + if (!superuser()) > > + ereport(ERROR, > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > > + errmsg("must be superuser to execute ALTER SYSTEM command"))); > > See comments about this above. > > > > + /* Alter WAL prohibit state not allowed during recovery */ > > + PreventCommandDuringRecovery("ALTER SYSTEM"); > > + > > + /* Requested state */ > > + state = stmt->WALProhibited ? > > + WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE; > > + > > + /* > > + * Since we yet to convey this WAL prohibit state to all backend mark it > > + * in-progress. > > + */ > > + state |= WALPROHIBIT_TRANSITION_IN_PROGRESS; > > + > > + if (!SetWALProhibitState(state)) > > + return; /* server is already in the desired state */ > > + > > This use of bitmasks seems unnecessary to me. I'd rather have one param > for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one > for WALPROHIBIT_TRANSITION_IN_PROGRESS > Ok. How about the new version of SetWALProhibitState function as : SetWALProhibitState(bool wal_prohibited, bool is_final_state) ? > > > > +/* > > + * RequestWALProhibitChange() > > + * > > + * Request checkpointer to make the WALProhibitState to read-only. > > + */ > > +static void > > +RequestWALProhibitChange(void) > > +{ > > + /* Must not be called from checkpointer */ > > + Assert(!AmCheckpointerProcess()); > > + Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS); > > + > > + /* > > + * If in a standalone backend, just do it ourselves. 
> > + */ > > + if (!IsPostmasterEnvironment) > > + { > > + CompleteWALProhibitChange(GetWALProhibitState()); > > + return; > > + } > > + > > + send_signal_to_checkpointer(SIGINT); > > + > > + /* Wait for the state to change to read-only */ > > + ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv); > > + for (;;) > > + { > > + /* We'll be done once in-progress flag bit is cleared */ > > + if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > + break; > > + > > + ConditionVariableSleep(&WALProhibitState->walprohibit_cv, > > + WAIT_EVENT_WALPROHIBIT_STATE_CHANGE); > > + } > > + ConditionVariableCancelSleep(); > > What if somebody concurrently changes the state back to READ WRITE? > Won't we unnecessarily wait here? > Yes, there will be wait. > That's probably fine, because we would just wait until that transition > is complete too. But at least a comment about that would be > good. Alternatively a "ASRO transitions completed counter" or such might > be a better idea? > Ok, will add comments but could you please elaborate little a bit about "ASRO transitions completed counter" and is there any existing counter I can refer to? > > > +/* > > + * CompleteWALProhibitChange() > > + * > > + * Checkpointer will call this to complete the requested WAL prohibit state > > + * transition. > > + */ > > +void > > +CompleteWALProhibitChange(uint32 wal_state) > > +{ > > + uint64 barrierGeneration; > > + > > + /* > > + * Must be called from checkpointer. Otherwise, it must be single-user > > + * backend. > > + */ > > + Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment); > > + Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS); > > + > > + /* > > + * WAL prohibit state change is initiated. We need to complete the state > > + * transition by setting requested WAL prohibit state in all backends. 
> > + */ > > + elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state"); > > + > > + /* Emit global barrier */ > > + barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT); > > + WaitForProcSignalBarrier(barrierGeneration); > > + > > + /* And flush all writes. */ > > + XLogFlush(GetXLogWriteRecPtr()); > > Hm, maybe I'm missing something, but why is the write pointer the right > thing to flush? That won't include records that haven't been written to > disk yet... We also need to trigger writing out all WAL that is as of > yet unwritten, no? Without having thought a lot about it, it seems that > GetXLogInsertRecPtr() would be the right thing to flush? > TBH, I am not an expert in this area. I wants to flush the latest record pointer that needs to be flushed, I think GetXLogInsertRecPtr() would be fine if is the latest one. Note that wal flushes are not blocked in read-only mode. > > > + /* Set final state by clearing in-progress flag bit */ > > + if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS))) > > + { > > + bool wal_prohibited; > > + > > + wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0; > > + > > + /* Update the control file to make state persistent */ > > + SetControlFileWALProhibitFlag(wal_prohibited); > > Hm. Is there an issue with not WAL logging the control file change? Is > there a scenario where we a crash + recovery would end up overwriting > this? > I am not sure. If the system crash before update this that means we haven't acknowledged the system state change. And the server will be restarted with the previous state. Could you please explain what bothering you. > > > + if (wal_prohibited) > > + ereport(LOG, (errmsg("system is now read only"))); > > + else > > + { > > + /* > > + * Request checkpoint if the end-of-recovery checkpoint has been > > + * skipped previously. 
> > + */ > > + if (WALProhibitState->checkpointPending) > > + { > > + RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | > > + CHECKPOINT_IMMEDIATE); > > + WALProhibitState->checkpointPending = false; > > + } > > + ereport(LOG, (errmsg("system is now read write"))); > > + } > > + } > > + > > + /* Wake up the backend who requested the state change */ > > + ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv); > > Could be multiple backends, right? > Yes, you are correct, will fix that. > > > +} > > + > > +/* > > + * GetWALProhibitState() > > + * > > + * Atomically return the current server WAL prohibited state > > + */ > > +uint32 > > +GetWALProhibitState(void) > > +{ > > + return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState); > > +} > > Is there an issue with needing memory barriers here? > > > > +/* > > + * SetWALProhibitState() > > + * > > + * Change current WAL prohibit state to the input state. > > + * > > + * If the server is already completely moved to the requested WAL prohibit > > + * state, or if the desired state is same as the current state, return false, > > + * indicating that the server state did not change. Else return true. 
> > + */ > > +bool > > +SetWALProhibitState(uint32 new_state) > > +{ > > + bool state_updated = false; > > + uint32 cur_state; > > + > > + cur_state = GetWALProhibitState(); > > + > > + /* Server is already in requested state */ > > + if (new_state == cur_state || > > + new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > + return false; > > + > > + /* Prevent concurrent contrary in progress transition state setting */ > > + if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) && > > + (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > + { > > + if (cur_state & WALPROHIBIT_STATE_READ_ONLY) > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("system state transition to read only is already in progress"), > > + errhint("Try after sometime again."))); > > + else > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("system state transition to read write is already in progress"), > > + errhint("Try after sometime again."))); > > + } > > + > > + /* Update new state in share memory */ > > + state_updated = > > + pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState, > > + &cur_state, new_state); > > + > > + if (!state_updated) > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("system read write state concurrently changed"), > > + errhint("Try after sometime again."))); > > + > > I don't think it's safe to use pg_atomic_compare_exchange_u32() outside > of a loop. I think there's platforms (basically all load-linked / > store-conditional architectures) where than can fail spuriously. > > Also, there's no memory barrier around GetWALProhibitState, so there's > no guarantee it's not an out-of-date value you're starting with. > How about having some kind of lock instead what Robert have suggested previously[3] ? 
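For illustration, the retry-loop and stale-read concerns raised here can be sketched standalone. This is only a sketch under stated assumptions: it uses C11 <stdatomic.h> rather than PostgreSQL's pg_atomic_* API, the flag values and the names set_wal_prohibit_state() and SetResult are hypothetical, and it returns a status code where the patch raises an ERROR. The point it shows is that a failed compare-and-exchange reloads the current value, so the checks are simply re-run instead of treating the failure as a concurrent change.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values are assumptions for this sketch, not taken from the patch. */
#define WALPROHIBIT_STATE_READ_WRITE        0x0000u
#define WALPROHIBIT_STATE_READ_ONLY         0x0001u
#define WALPROHIBIT_TRANSITION_IN_PROGRESS  0x0002u

/* Hypothetical result codes; the patch reports conflicts via ereport(ERROR). */
typedef enum
{
    STATE_CHANGED,      /* this call installed new_state */
    STATE_UNCHANGED,    /* already in, or already heading to, that state */
    STATE_CONFLICT      /* a contrary transition is in progress */
} SetResult;

/* Stand-in for the shared-memory state word. */
static _Atomic uint32_t shared_wal_prohibit_state = WALPROHIBIT_STATE_READ_WRITE;

static SetResult
set_wal_prohibit_state(uint32_t new_state)
{
    /* Sequentially consistent load, so we do not start from a stale value. */
    uint32_t    cur = atomic_load(&shared_wal_prohibit_state);

    for (;;)
    {
        /* Server is already in (or moving to) the requested state. */
        if (new_state == cur ||
            new_state == (cur | WALPROHIBIT_TRANSITION_IN_PROGRESS))
            return STATE_UNCHANGED;

        /* Refuse to start a transition while another one is in progress. */
        if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) &&
            (cur & WALPROHIBIT_TRANSITION_IN_PROGRESS))
            return STATE_CONFLICT;

        /*
         * The weak form may fail spuriously. On any failure, cur is updated
         * to the value currently stored, and the loop re-evaluates the
         * checks against it.
         */
        if (atomic_compare_exchange_weak(&shared_wal_prohibit_state,
                                         &cur, new_state))
            return STATE_CHANGED;
    }
}
```

Whether a loop like this or a lock is the better fit depends on how much other state has to change together with the flag word; the sketch only covers the single-word case.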
>
> > +/*
> > + * MarkCheckPointSkippedInWalProhibitState()
> > + *
> > + * Sets checkpoint pending flag so that it can be performed next time while
> > + * changing system state to WAL permitted.
> > + */
> > +void
> > +MarkCheckPointSkippedInWalProhibitState(void)
> > +{
> > +    WALProhibitState->checkpointPending = true;
> > +}
>
> I don't *at all* like this living outside of xlog.c. I think this should
> be moved there, and merged with deferring checkpoints in other cases
> (promotions, not immediately performing a checkpoint after recovery).

Here we want to perform the checkpoint considerably later, when the system
state changes to read-write. For that, I think we need some flag; if we want
this in xlog.c, then we can keep that flag in XLogCtl.

> There's state in ControlFile *and* here for essentially the same thing.

I am sorry to trouble you so much, but I haven't understood this either.

>
>
> > + * If it is not currently possible to insert write-ahead log records,
> > + * either because we are still in recovery or because ALTER SYSTEM READ
> > + * ONLY has been executed, force this to be a read-only transaction.
> > + * We have lower level defences in XLogBeginInsert() and elsewhere to stop
> > + * us from modifying data during recovery when !XLogInsertAllowed(), but
> > + * this gives the normal indication to the user that the transaction is
> > + * read-only.
> > + *
> > + * On the other hand, we only need to set the startedInRecovery flag when
> > + * the transaction started during recovery, and not when WAL is otherwise
> > + * prohibited. This information is used by RelationGetIndexScan() to
> > + * decide whether to permit (1) relying on existing killed-tuple markings
> > + * and (2) further killing of index tuples. Even when WAL is prohibited
> > + * on the master, it's still the master, so the former is OK; and since
> > + * killing index tuples doesn't generate WAL, the latter is also OK.
> > + * See comments in RelationGetIndexScan() and MarkBufferDirtyHint(). > > + */ > > + XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed(); > > + s->startedInRecovery = RecoveryInProgress(); > > It's somewhat ugly that we call RecoveryInProgress() once in > XLogInsertAllowed() and then again directly here... It's probably fine > runtime cost wise, but... > > > > /* > > * Subroutine to try to fetch and validate a prior checkpoint record. > > * > > @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg) > > */ > > WalSndWaitStopping(); > > > > + /* > > + * The restartpoint, checkpoint, or xlog rotation will be performed if the > > + * WAL writing is permitted. > > + */ > > if (RecoveryInProgress()) > > CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); > > - else > > + else if (XLogInsertAllowed()) > > Not sure I like going via XLogInsertAllowed(), that seems like a > confusing indirection here. And it encompasses things we atually don't > want to check for - it's fragile to also look at LocalXLogInsertAllowed > here imo. > > > > ShutdownCLOG(); > > ShutdownCommitTs(); > > ShutdownSUBTRANS(); > > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c > > index 1b8cd7bacd4..aa4cdd57ec1 100644 > > --- a/src/backend/postmaster/autovacuum.c > > +++ b/src/backend/postmaster/autovacuum.c > > @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[]) > > > > HandleAutoVacLauncherInterrupts(); > > > > + /* If the server is read only just go back to sleep. */ > > + if (!XLogInsertAllowed()) > > + continue; > > + > > I think we really should have a different functions for places like > this. We don't want to generally hide bugs like e.g. starting the > autovac launcher in recovery, but this would. > So, we need a separate function like XLogInsertAllowed() and a global variable like LocalXLogInsertAllowed for the caching wal prohibit state. 
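To make the distinction concrete, here is a minimal standalone sketch of why a dedicated check differs from the combined one. Everything in it is a hypothetical stand-in: the two static flags model server state, and xlog_insert_allowed() and wal_is_prohibited() are illustrative names, not functions from the patch or from PostgreSQL. The combined check answers "no" both in recovery and when WAL is prohibited, so a caller that should never run in recovery would quietly go back to sleep; the dedicated check asserts that assumption instead of hiding it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for server state. */
static bool in_recovery = false;
static bool wal_prohibited = false;

/*
 * Combined check, in the spirit of XLogInsertAllowed(): false for either
 * reason, so the caller cannot tell recovery from a WAL-prohibited state.
 */
static bool
xlog_insert_allowed(void)
{
    return !in_recovery && !wal_prohibited;
}

/*
 * Dedicated check for callers that must not be running during recovery
 * (for example, something like the autovacuum launcher). A caller reaching
 * this in recovery trips the assertion rather than being silently skipped.
 */
static bool
wal_is_prohibited(void)
{
    assert(!in_recovery);
    return wal_prohibited;
}
```

With a split like this, the launcher loop would test only the WAL-prohibited condition, and a launcher started during recovery would still show up as a bug.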
> > > @@ -342,6 +344,28 @@ CheckpointerMain(void) > > AbsorbSyncRequests(); > > HandleCheckpointerInterrupts(); > > > > + wal_state = GetWALProhibitState(); > > + > > + if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) > > + { > > + /* Complete WAL prohibit state change request */ > > + CompleteWALProhibitChange(wal_state); > > + continue; > > + } > > + else if (wal_state & WALPROHIBIT_STATE_READ_ONLY) > > + { > > + /* > > + * Don't do anything until someone wakes us up. For example a > > + * backend might later on request us to put the system back to > > + * read-write wal prohibit sate. > > + */ > > + (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1, > > + WAIT_EVENT_CHECKPOINTER_MAIN); > > + continue; > > + } > > + Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE); > > + > > /* > > * Detect a pending checkpoint request by checking whether the flags > > * word in shared memory is nonzero. We shouldn't need to acquire the > > @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void) > > > > return FirstCall; > > } > > So, if we're in the middle of a paced checkpoint with a large > checkpoint_timeout - a sensible real world configuration - we'll not > process ASRO until that checkpoint is over? That seems very much not > practical. What am I missing? > Yes, the process doing ASRO will wait until that checkpoint is over. > > > +/* > > + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process. > > + */ > > +void > > +send_signal_to_checkpointer(int signum) > > +{ > > + if (CheckpointerShmem->checkpointer_pid == 0) > > + elog(ERROR, "checkpointer is not running"); > > + > > + if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0) > > + elog(ERROR, "could not signal checkpointer: %m"); > > +} > > Sudden switch to a different naming style... > My bad, sorry, will fix that. 
Regards,
Amul

[1] http://postgr.es/m/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de
[2] http://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
[3] http://postgr.es/m/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com
Hi Andres, The attached patch has fixed the issue that you have raised & I have confirmed in my previous email. Also, I tried to improve some of the things that you have pointed but for those changes, I am a little unsure and looking forward to the inputs/suggestions/confirmation on that, therefore 0003 patch is marked WIP. Please have a look at my inline reply below for the things that are changes in the attached version and need inputs: On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > Thanks for your time. > > > > > Thomas, there's one point below that could be relevant for you. You can > > search for your name and/or checkpoint... > > > > > > On 2020-09-01 16:43:10 +0530, Amul Sul wrote: > > > diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c > > > index 42050ab7195..0ac826d3c2f 100644 > > > --- a/src/backend/nodes/readfuncs.c > > > +++ b/src/backend/nodes/readfuncs.c > > > @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void) > > > READ_DONE(); > > > } > > > > > > +/* > > > + * _readAlterSystemWALProhibitState > > > + */ > > > +static AlterSystemWALProhibitState * > > > +_readAlterSystemWALProhibitState(void) > > > +{ > > > + READ_LOCALS(AlterSystemWALProhibitState); > > > + > > > + READ_BOOL_FIELD(WALProhibited); > > > + > > > + READ_DONE(); > > > +} > > > + > > > > Why do we need readfuncs support for this? > > > > I thought we need that from your previous comment[1]. > > > > + > > > +/* > > > + * AlterSystemSetWALProhibitState > > > + * > > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. > > > + */ > > > +static void > > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > > > +{ > > > + /* some code */ > > > + elog(INFO, "AlterSystemSetWALProhibitState() called"); > > > +} > > > > As long as it's not implemented it seems better to return an ERROR. 
> > > > Ok, will add an error in the next version. > > > > @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt > > > VariableSetStmt *setstmt; /* SET subcommand */ > > > } AlterSystemStmt; > > > > > > +/* ---------------------- > > > + * Alter System Read Statement > > > + * ---------------------- > > > + */ > > > +typedef struct AlterSystemWALProhibitState > > > +{ > > > + NodeTag type; > > > + bool WALProhibited; > > > +} AlterSystemWALProhibitState; > > > + > > > > All the nearby fields use under_score_style names. > > > > I am not sure which nearby fields having the underscore that you are referring > to. Probably "WALProhibited" needs to be renamed to "walprohibited" to be > inline with the nearby fields. > > > > > > From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001 > > > From: Amul Sul <amul.sul@enterprisedb.com> > > > Date: Fri, 19 Jun 2020 06:29:36 -0400 > > > Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier. > > > > > > Implementation: > > > > > > 1. When a user tried to change server state to WAL-Prohibited using > > > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() > > > raises request to checkpointer by marking current state to inprogress in > > > shared memory. Checkpointer, noticing that the current state is has > > > > "is has" > > > > > WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and > > > then acknowledges back to the backend who requested the state change once > > > the transition has been completed. Final state will be updated in control > > > file to make it persistent across the system restarts. > > > > What makes checkpointer the right backend to do this work? > > > > Once we've initiated the change to a read-only state, we probably want to > always either finish that change or go back to read-write, even if the process > that initiated the change is interrupted. Leaving the system in a > half-way-in-between state long term seems bad. 
Maybe we would have put some > background process, but choose the checkpointer in charge of making the state > change and to avoid the new background process to keep the first version patch > simple. The checkpointer isn't likely to get killed, but if it does, it will > be relaunched and the new one can clean things up. Probably later we might want > such a background worker that will be isn't likely to get killed. > > > > > > 2. When a backend receives the WAL-Prohibited barrier, at that moment if > > > it is already in a transaction and the transaction already assigned XID, > > > then the backend will be killed by throwing FATAL(XXX: need more discussion > > > on this) > > > > > > > 3. Otherwise, if that backend running transaction which yet to get XID > > > assigned we don't need to do anything special > > > > Somewhat garbled sentence... > > > > > > > 4. A new transaction (from existing or new backend) starts as a read-only > > > transaction. > > > > Maybe "(in an existing or in a new backend)"? > > > > > > > 5. Autovacuum launcher as well as checkpointer will don't do anything in > > > WAL-Prohibited server state until someone wakes us up. E.g. a backend > > > might later on request us to put the system back to read-write. > > > > "will don't do anything", "might later on request us" > > > > Ok, I'll fix all of this. I usually don't much focus on the commit message text > but I try to make it as much as possible sane enough. > > > > > > 6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint > > > and xlog rotation. Starting up again will perform crash recovery(XXX: > > > need some discussion on this as well) but the end of recovery checkpoint > > > will be skipped and it will be performed when the system changed to > > > WAL-Permitted mode. > > > > Hm, this has some interesting interactions with some of Thomas' recent > > hacking. > > > > I would be so thankful for the help. > > > > > > 8. Only super user can toggle WAL-Prohibit state. 
> > > > Hm. I don't quite agree with this. We try to avoid if (superuser()) > > style checks these days, because they can't be granted to other > > users. Look at how e.g. pg_promote() - an operation of similar severity > > - is handled. We just revoke the permission from public in > > system_views.sql: > > REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public; > > > > Ok, currently we don't have SQL callable function to change the system > read-write state. Do you want me to add that? If so, any naming suggesting? How > about pg_make_system_read_only(bool) or have two function as > pg_make_system_read_only(void) & pg_make_system_read_write(void). > In the attached version I added SQL callable function as pg_alter_wal_prohibit_state(bool), and another suggestion for the naming is welcome. For the permission denied error for ASRO READ-ONLY/READ-WRITE, I have added ereport() in AlterSystemSetWALProhibitState() instead of aclcheck_error() and the hint is added. Any suggestions? > > > > > 9. Add system_is_read_only GUC show the system state -- will true when system > > > is wal prohibited or in recovery. > > > > *shows the system state. There's also some oddity in the second part of > > the sentence. > > > > Is it really correct to show system_is_read_only as true during > > recovery? For one, recovery could end soon after, putting the system > > into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also, > > during recovery the database state actually changes if there are changes > > to replay. ISTM it would not be a good idea to mix ASRO and > > pg_is_in_recovery() into one GUC. > > > > Well, whether the system is in recovery or wal prohibited state it is read-only > for the user perspective, isn't it? 
> > > > > > --- /dev/null > > > +++ b/src/backend/access/transam/walprohibit.c > > > @@ -0,0 +1,321 @@ > > > +/*------------------------------------------------------------------------- > > > + * > > > + * walprohibit.c > > > + * PostgreSQL write-ahead log prohibit states > > > + * > > > + * > > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > > + * > > > + * src/backend/access/transam/walprohibit.c > > > + * > > > + *------------------------------------------------------------------------- > > > + */ > > > +#include "postgres.h" > > > + > > > +#include "access/walprohibit.h" > > > +#include "pgstat.h" > > > +#include "port/atomics.h" > > > +#include "postmaster/bgwriter.h" > > > +#include "storage/condition_variable.h" > > > +#include "storage/procsignal.h" > > > +#include "storage/shmem.h" > > > + > > > +/* > > > + * Shared-memory WAL prohibit state > > > + */ > > > +typedef struct WALProhibitStateData > > > +{ > > > + /* Indicates current WAL prohibit state */ > > > + pg_atomic_uint32 SharedWALProhibitState; > > > + > > > + /* Startup checkpoint pending */ > > > + bool checkpointPending; > > > + > > > + /* Signaled when requested WAL prohibit state changes */ > > > + ConditionVariable walprohibit_cv; > > > > You're using three different naming styles for as many members. > > > > Ill fix in the next version. > > > > > > +/* > > > + * ProcessBarrierWALProhibit() > > > + * > > > + * Handle WAL prohibit state change request. > > > + */ > > > +bool > > > +ProcessBarrierWALProhibit(void) > > > +{ > > > + /* > > > + * Kill off any transactions that have an XID *before* allowing the system > > > + * to go WAL prohibit state. > > > + */ > > > + if (FullTransactionIdIsValid(GetTopFullTransactionIdIfAny())) > > > > Hm. I wonder if this check is good enough. If you look at > > RecordTransactionCommit() we also WAL log in some cases where no xid was > > assigned. This is particularly true of (auto-)vacuum, but also for HOT > > pruning. 
> > > > I think it'd be good to put the logic of this check into xlog.c and > > mirror the logic in RecordTransactionCommit(). And add cross-referencing > > comments to RecordTransactionCommit and the new function, reminding our > > futures selves that both places need to be modified. > > > > I am not sure I have understood this, here is the snip from the implementation > detail from the first post[2]: > > "Open transactions that don't have an XID are not killed, but will get an ERROR > if they try to acquire an XID later, or if they try to write WAL without > acquiring an XID (e.g. VACUUM). To make that happen, the patch adds a new > coding rule: a critical section that will write WAL must be preceded by a call > to CheckWALPermitted(), AssertWALPermitted(), or AssertWALPermitted_HaveXID(). > The latter variants are used when we know for certain that inserting WAL here > must be OK, either because we have an XID (we would have been killed by a change > to read-only if one had occurred) or for some other reason." > > Do let me know if you want further clarification. > > > > > > + { > > > + /* Should be here only for the WAL prohibit state. */ > > > + Assert(GetWALProhibitState() & WALPROHIBIT_STATE_READ_ONLY); > > > > There are no races where an ASRO READ ONLY is quickly followed by ASRO > > READ WRITE where this could be reached? > > > > No, right now SetWALProhibitState() doesn't allow two transient wal prohibit > states at a time. > > > > > > +/* > > > + * AlterSystemSetWALProhibitState() > > > + * > > > + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement. > > > + */ > > > +void > > > +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt) > > > +{ > > > + uint32 state; > > > + > > > + if (!superuser()) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > > > + errmsg("must be superuser to execute ALTER SYSTEM command"))); > > > > See comments about this above. 
> > > > > > > + /* Alter WAL prohibit state not allowed during recovery */ > > > + PreventCommandDuringRecovery("ALTER SYSTEM"); > > > + > > > + /* Requested state */ > > > + state = stmt->WALProhibited ? > > > + WALPROHIBIT_STATE_READ_ONLY : WALPROHIBIT_STATE_READ_WRITE; > > > + > > > + /* > > > + * Since we yet to convey this WAL prohibit state to all backend mark it > > > + * in-progress. > > > + */ > > > + state |= WALPROHIBIT_TRANSITION_IN_PROGRESS; > > > + > > > + if (!SetWALProhibitState(state)) > > > + return; /* server is already in the desired state */ > > > + > > > > This use of bitmasks seems unnecessary to me. I'd rather have one param > > for WALPROHIBIT_STATE_READ_ONLY / WALPROHIBIT_STATE_READ_WRITE and one > > for WALPROHIBIT_TRANSITION_IN_PROGRESS > > > > Ok. > > How about the new version of SetWALProhibitState function as : > SetWALProhibitState(bool wal_prohibited, bool is_final_state) ? > I have added the same. > > > > > > > +/* > > > + * RequestWALProhibitChange() > > > + * > > > + * Request checkpointer to make the WALProhibitState to read-only. > > > + */ > > > +static void > > > +RequestWALProhibitChange(void) > > > +{ > > > + /* Must not be called from checkpointer */ > > > + Assert(!AmCheckpointerProcess()); > > > + Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS); > > > + > > > + /* > > > + * If in a standalone backend, just do it ourselves. 
> > > + */ > > > + if (!IsPostmasterEnvironment) > > > + { > > > + CompleteWALProhibitChange(GetWALProhibitState()); > > > + return; > > > + } > > > + > > > + send_signal_to_checkpointer(SIGINT); > > > + > > > + /* Wait for the state to change to read-only */ > > > + ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv); > > > + for (;;) > > > + { > > > + /* We'll be done once in-progress flag bit is cleared */ > > > + if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > + break; > > > + > > > + ConditionVariableSleep(&WALProhibitState->walprohibit_cv, > > > + WAIT_EVENT_WALPROHIBIT_STATE_CHANGE); > > > + } > > > + ConditionVariableCancelSleep(); > > > > What if somebody concurrently changes the state back to READ WRITE? > > Won't we unnecessarily wait here? > > > > Yes, there will be wait. > > > That's probably fine, because we would just wait until that transition > > is complete too. But at least a comment about that would be > > good. Alternatively a "ASRO transitions completed counter" or such might > > be a better idea? > > > > Ok, will add comments but could you please elaborate little a bit about "ASRO > transitions completed counter" and is there any existing counter I can refer > to? > > > > > > +/* > > > + * CompleteWALProhibitChange() > > > + * > > > + * Checkpointer will call this to complete the requested WAL prohibit state > > > + * transition. > > > + */ > > > +void > > > +CompleteWALProhibitChange(uint32 wal_state) > > > +{ > > > + uint64 barrierGeneration; > > > + > > > + /* > > > + * Must be called from checkpointer. Otherwise, it must be single-user > > > + * backend. > > > + */ > > > + Assert(AmCheckpointerProcess() || !IsPostmasterEnvironment); > > > + Assert(wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS); > > > + > > > + /* > > > + * WAL prohibit state change is initiated. We need to complete the state > > > + * transition by setting requested WAL prohibit state in all backends. 
> > > + */ > > > + elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state"); > > > + > > > + /* Emit global barrier */ > > > + barrierGeneration = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_WALPROHIBIT); > > > + WaitForProcSignalBarrier(barrierGeneration); > > > + > > > + /* And flush all writes. */ > > > + XLogFlush(GetXLogWriteRecPtr()); > > > > Hm, maybe I'm missing something, but why is the write pointer the right > > thing to flush? That won't include records that haven't been written to > > disk yet... We also need to trigger writing out all WAL that is as of > > yet unwritten, no? Without having thought a lot about it, it seems that > > GetXLogInsertRecPtr() would be the right thing to flush? > > > > TBH, I am not an expert in this area. I wants to flush the latest record > pointer that needs to be flushed, I think GetXLogInsertRecPtr() would be fine > if is the latest one. Note that wal flushes are not blocked in read-only mode. > Used GetXLogInsertRecPtr(). > > > > > + /* Set final state by clearing in-progress flag bit */ > > > + if (SetWALProhibitState(wal_state & ~(WALPROHIBIT_TRANSITION_IN_PROGRESS))) > > > + { > > > + bool wal_prohibited; > > > + > > > + wal_prohibited = (wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0; > > > + > > > + /* Update the control file to make state persistent */ > > > + SetControlFileWALProhibitFlag(wal_prohibited); > > > > Hm. Is there an issue with not WAL logging the control file change? Is > > there a scenario where we a crash + recovery would end up overwriting > > this? > > > > I am not sure. If the system crash before update this that means we haven't > acknowledged the system state change. And the server will be restarted with the > previous state. > > Could you please explain what bothering you. 
> > > > > > + if (wal_prohibited) > > > + ereport(LOG, (errmsg("system is now read only"))); > > > + else > > > + { > > > + /* > > > + * Request checkpoint if the end-of-recovery checkpoint has been > > > + * skipped previously. > > > + */ > > > + if (WALProhibitState->checkpointPending) > > > + { > > > + RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | > > > + CHECKPOINT_IMMEDIATE); > > > + WALProhibitState->checkpointPending = false; > > > + } > > > + ereport(LOG, (errmsg("system is now read write"))); > > > + } > > > + } > > > + > > > + /* Wake up the backend who requested the state change */ > > > + ConditionVariableBroadcast(&WALProhibitState->walprohibit_cv); > > > > Could be multiple backends, right? > > > > Yes, you are correct, will fix that. > > > > > > +} > > > + > > > +/* > > > + * GetWALProhibitState() > > > + * > > > + * Atomically return the current server WAL prohibited state > > > + */ > > > +uint32 > > > +GetWALProhibitState(void) > > > +{ > > > + return pg_atomic_read_u32(&WALProhibitState->SharedWALProhibitState); > > > +} > > > > Is there an issue with needing memory barriers here? > > > > > > > +/* > > > + * SetWALProhibitState() > > > + * > > > + * Change current WAL prohibit state to the input state. > > > + * > > > + * If the server is already completely moved to the requested WAL prohibit > > > + * state, or if the desired state is same as the current state, return false, > > > + * indicating that the server state did not change. Else return true. 
> > > + */ > > > +bool > > > +SetWALProhibitState(uint32 new_state) > > > +{ > > > + bool state_updated = false; > > > + uint32 cur_state; > > > + > > > + cur_state = GetWALProhibitState(); > > > + > > > + /* Server is already in requested state */ > > > + if (new_state == cur_state || > > > + new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > + return false; > > > + > > > + /* Prevent concurrent contrary in progress transition state setting */ > > > + if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) && > > > + (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > + { > > > + if (cur_state & WALPROHIBIT_STATE_READ_ONLY) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > + errmsg("system state transition to read only is already in progress"), > > > + errhint("Try after sometime again."))); > > > + else > > > + ereport(ERROR, > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > + errmsg("system state transition to read write is already in progress"), > > > + errhint("Try after sometime again."))); > > > + } > > > + > > > + /* Update new state in share memory */ > > > + state_updated = > > > + pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState, > > > + &cur_state, new_state); > > > + > > > + if (!state_updated) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > + errmsg("system read write state concurrently changed"), > > > + errhint("Try after sometime again."))); > > > + > > > > I don't think it's safe to use pg_atomic_compare_exchange_u32() outside > > of a loop. I think there's platforms (basically all load-linked / > > store-conditional architectures) where than can fail spuriously. > > > > Also, there's no memory barrier around GetWALProhibitState, so there's > > no guarantee it's not an out-of-date value you're starting with. > > > > How about having some kind of lock instead what Robert have suggested > previously[3] ? 
> I would like to discuss this point more. In the attached version I have added WALProhibitLock to protect shared walprohibit state updates. I was a little unsure do we want another spinlock what XLogCtlData has which is mostly used to read the shared variable and for the update, both are used e.g. LogwrtResult. Right now I haven't added and shared walprohibit state was fetch using a volatile pointer. Do we need a spinlock there, I am not sure why? Thoughts? > > > > > +/ > > > + * MarkCheckPointSkippedInWalProhibitState() > > > + * > > > + * Sets checkpoint pending flag so that it can be performed next time while > > > + * changing system state to WAL permitted. > > > + */ > > > +void > > > +MarkCheckPointSkippedInWalProhibitState(void) > > > +{ > > > + WALProhibitState->checkpointPending = true; > > > +} > > > > I don't *at all* like this living outside of xlog.c. I think this should > > be moved there, and merged with deferring checkpoints in other cases > > (promotions, not immediately performing a checkpoint after recovery). > > Here we want to perform the checkpoint sometime quite later when the > system state changes to read-write. For that, I think we need some flag > if we want this in xlog.c then we can have that flag in XLogCtl. > Right now I have added a new variable to XLogCtlData and moved this code to xlog.c. > > > There's state in ControlFile *and* here for essentially the same thing. > > > > I am sorry to trouble you much, but I haven't understood this too. > > > > > > > > + * If it is not currently possible to insert write-ahead log records, > > > + * either because we are still in recovery or because ALTER SYSTEM READ > > > + * ONLY has been executed, force this to be a read-only transaction. 
> > > + * We have lower level defences in XLogBeginInsert() and elsewhere to stop > > > + * us from modifying data during recovery when !XLogInsertAllowed(), but > > > + * this gives the normal indication to the user that the transaction is > > > + * read-only. > > > + * > > > + * On the other hand, we only need to set the startedInRecovery flag when > > > + * the transaction started during recovery, and not when WAL is otherwise > > > + * prohibited. This information is used by RelationGetIndexScan() to > > > + * decide whether to permit (1) relying on existing killed-tuple markings > > > + * and (2) further killing of index tuples. Even when WAL is prohibited > > > + * on the master, it's still the master, so the former is OK; and since > > > + * killing index tuples doesn't generate WAL, the latter is also OK. > > > + * See comments in RelationGetIndexScan() and MarkBufferDirtyHint(). > > > + */ > > > + XactReadOnly = DefaultXactReadOnly || !XLogInsertAllowed(); > > > + s->startedInRecovery = RecoveryInProgress(); > > > > It's somewhat ugly that we call RecoveryInProgress() once in > > XLogInsertAllowed() and then again directly here... It's probably fine > > runtime cost wise, but... > > > > > > > /* > > > * Subroutine to try to fetch and validate a prior checkpoint record. > > > * > > > @@ -8508,9 +8564,13 @@ ShutdownXLOG(int code, Datum arg) > > > */ > > > WalSndWaitStopping(); > > > > > > + /* > > > + * The restartpoint, checkpoint, or xlog rotation will be performed if the > > > + * WAL writing is permitted. > > > + */ > > > if (RecoveryInProgress()) > > > CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); > > > - else > > > + else if (XLogInsertAllowed()) > > > > Not sure I like going via XLogInsertAllowed(), that seems like a > > confusing indirection here. And it encompasses things we atually don't > > want to check for - it's fragile to also look at LocalXLogInsertAllowed > > here imo. 
> > > > > > > ShutdownCLOG(); > > > ShutdownCommitTs(); > > > ShutdownSUBTRANS(); > > > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c > > > index 1b8cd7bacd4..aa4cdd57ec1 100644 > > > --- a/src/backend/postmaster/autovacuum.c > > > +++ b/src/backend/postmaster/autovacuum.c > > > @@ -652,6 +652,10 @@ AutoVacLauncherMain(int argc, char *argv[]) > > > > > > HandleAutoVacLauncherInterrupts(); > > > > > > + /* If the server is read only just go back to sleep. */ > > > + if (!XLogInsertAllowed()) > > > + continue; > > > + > > > > I think we really should have a different functions for places like > > this. We don't want to generally hide bugs like e.g. starting the > > autovac launcher in recovery, but this would. > > > > So, we need a separate function like XLogInsertAllowed() and a global variable > like LocalXLogInsertAllowed for the caching wal prohibit state. > > > > > > @@ -342,6 +344,28 @@ CheckpointerMain(void) > > > AbsorbSyncRequests(); > > > HandleCheckpointerInterrupts(); > > > > > > + wal_state = GetWALProhibitState(); > > > + > > > + if (wal_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) > > > + { > > > + /* Complete WAL prohibit state change request */ > > > + CompleteWALProhibitChange(wal_state); > > > + continue; > > > + } > > > + else if (wal_state & WALPROHIBIT_STATE_READ_ONLY) > > > + { > > > + /* > > > + * Don't do anything until someone wakes us up. For example a > > > + * backend might later on request us to put the system back to > > > + * read-write wal prohibit sate. > > > + */ > > > + (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1, > > > + WAIT_EVENT_CHECKPOINTER_MAIN); > > > + continue; > > > + } > > > + Assert(wal_state == WALPROHIBIT_STATE_READ_WRITE); > > > + > > > /* > > > * Detect a pending checkpoint request by checking whether the flags > > > * word in shared memory is nonzero. 
We shouldn't need to acquire the > > > @@ -1323,3 +1347,16 @@ FirstCallSinceLastCheckpoint(void) > > > > > > return FirstCall; > > > } > > > > So, if we're in the middle of a paced checkpoint with a large > > checkpoint_timeout - a sensible real world configuration - we'll not > > process ASRO until that checkpoint is over? That seems very much not > > practical. What am I missing? > > > > Yes, the process doing ASRO will wait until that checkpoint is over. > > > > > > +/* > > > + * send_signal_to_checkpointer allows a process to send a signal to the checkpoint process. > > > + */ > > > +void > > > +send_signal_to_checkpointer(int signum) > > > +{ > > > + if (CheckpointerShmem->checkpointer_pid == 0) > > > + elog(ERROR, "checkpointer is not running"); > > > + > > > + if (kill(CheckpointerShmem->checkpointer_pid, signum) != 0) > > > + elog(ERROR, "could not signal checkpointer: %m"); > > > +} > > > > Sudden switch to a different naming style... > > > > My bad, sorry, will fix that. > > 1] http://postgr.es/m/20200724020402.2byiiufsd7pw4hsp@alap3.anarazel.de > 2] http://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com > 3] http://postgr.es/m/CA+TgmoYMyw-m3O5XQ8tRy4mdEArGcfXr+9niO5Fmq1wVdKxYmQ@mail.gmail.com Thank you ! Regards, Amul
On Tue, Sep 8, 2020 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:
> This pattern seems like it'll get unwieldy with more than one barrier
> type. And won't flag "unhandled" barrier types either (already the case,
> I know). We could go for something like:
>
>     while (flags != 0)
>     {
>         barrier_bit = pg_rightmost_one_pos32(flags);
>         barrier_type = 1 << barrier_bit;
>
>         switch (barrier_type)
>         {
>             case PROCSIGNAL_BARRIER_PLACEHOLDER:
>                 processed = ProcessBarrierPlaceholder();
>         }
>
>         if (processed)
>             BARRIER_CLEAR_BIT(flags, barrier_type);
>     }
>
> But perhaps that's too complicated?

I don't mind a loop, but that one looks broken. We have to clear the
bit before we call the function that processes that type of barrier.
Otherwise, if we succeed in absorbing the barrier but a new instance
of the same barrier arrives meanwhile, we'll fail to realize that we
need to absorb the new one.

> For this to be correct, wouldn't flags need to be volatile? Otherwise
> this might use a register value for flags, which might not contain the
> correct value at this point.

I think you're right.

> Perhaps a comment explaining why we have to clear bits first would be
> good?

Probably a good idea.

[ snipping assorted comments with which I agree ]

> It might be good to add a warning to WaitForProcSignalBarrier() or by
> pss_barrierCheckMask indicating that it's *not* OK to look at
> pss_barrierCheckMask when checking whether barriers have been processed.

Not sure I understand this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
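The ordering argument in this message (clear the pending bit first, then process the barrier) can be illustrated with a small standalone sketch in C11. This is not PostgreSQL's procsignal code: `demo_pending`, `demo_absorb()`, `demo_process_barriers()` and the `demo_simulate_arrival` switch are hypothetical names made up for the example, and C11 `<stdatomic.h>` stands in for PostgreSQL's own atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NUM_TYPES 32

/* Hypothetical shared state: one pending bit per barrier type. */
static _Atomic uint32_t demo_pending;
static int  demo_absorbed[DEMO_NUM_TYPES];  /* absorptions per type */
static bool demo_simulate_arrival;          /* test hook, see below */

/* Hypothetical handler that absorbs one barrier type. */
static void
demo_absorb(int type)
{
    assert(type >= 0 && type < DEMO_NUM_TYPES);
    demo_absorbed[type]++;

    /*
     * Simulate a second barrier of the same type arriving while we are
     * still busy absorbing the first one.
     */
    if (demo_simulate_arrival)
    {
        demo_simulate_arrival = false;
        atomic_fetch_or(&demo_pending, UINT32_C(1) << type);
    }
}

static void
demo_process_barriers(void)
{
    /*
     * Fetch and clear every pending bit *before* calling any handler.
     * A bit that is set again during processing therefore survives and
     * is picked up by the next call.
     */
    uint32_t flags = atomic_exchange(&demo_pending, 0);

    for (int type = 0; type < DEMO_NUM_TYPES; type++)
    {
        if (flags & (UINT32_C(1) << type))
            demo_absorb(type);
    }
}
```

If the bit were cleared only after `demo_absorb()` returned, the simulated second arrival would be wiped out together with the first, which is the failure mode described above.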
On Tue, Sep 15, 2020 at 2:35 PM Amul Sul <sulamul@gmail.com> wrote: > > Hi Andres, > > The attached patch has fixed the issue that you have raised & I have confirmed > in my previous email. Also, I tried to improve some of the things that you have > pointed but for those changes, I am a little unsure and looking forward to the > inputs/suggestions/confirmation on that, therefore 0003 patch is marked WIP. > > Please have a look at my inline reply below for the things that are changes in > the attached version and need inputs: > > On Sat, Sep 12, 2020 at 10:52 AM Amul Sul <sulamul@gmail.com> wrote: > > > > On Thu, Sep 10, 2020 at 2:33 AM Andres Freund <andres@anarazel.de> wrote: > > > [... Skipped ....] > > > > > > > > > > +/* > > > > + * RequestWALProhibitChange() > > > > + * > > > > + * Request checkpointer to make the WALProhibitState to read-only. > > > > + */ > > > > +static void > > > > +RequestWALProhibitChange(void) > > > > +{ > > > > + /* Must not be called from checkpointer */ > > > > + Assert(!AmCheckpointerProcess()); > > > > + Assert(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS); > > > > + > > > > + /* > > > > + * If in a standalone backend, just do it ourselves. 
> > > > + */ > > > > + if (!IsPostmasterEnvironment) > > > > + { > > > > + CompleteWALProhibitChange(GetWALProhibitState()); > > > > + return; > > > > + } > > > > + > > > > + send_signal_to_checkpointer(SIGINT); > > > > + > > > > + /* Wait for the state to change to read-only */ > > > > + ConditionVariablePrepareToSleep(&WALProhibitState->walprohibit_cv); > > > > + for (;;) > > > > + { > > > > + /* We'll be done once in-progress flag bit is cleared */ > > > > + if (!(GetWALProhibitState() & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > > + break; > > > > + > > > > + ConditionVariableSleep(&WALProhibitState->walprohibit_cv, > > > > + WAIT_EVENT_WALPROHIBIT_STATE_CHANGE); > > > > + } > > > > + ConditionVariableCancelSleep(); > > > > > > What if somebody concurrently changes the state back to READ WRITE? > > > Won't we unnecessarily wait here? > > > > > > > Yes, there will be wait. > > > > > That's probably fine, because we would just wait until that transition > > > is complete too. But at least a comment about that would be > > > good. Alternatively a "ASRO transitions completed counter" or such might > > > be a better idea? > > > > > > > Ok, will add comments but could you please elaborate little a bit about "ASRO > > transitions completed counter" and is there any existing counter I can refer > > to? > > In an off-list discussion, Robert had explained to me this counter thing and its requirement. I tried to add the same as "shared WAL prohibited state generation" in the attached version. The implementation is quite similar to the generation counter in the super barrier. In the attached version, when a backend makes a request for the WAL prohibit state changes then a generation number will be given to that backend to wait on and that wait will be ended when the shared generation counter changes. > > > [... Skipped ....] > > > > +/* > > > > + * SetWALProhibitState() > > > > + * > > > > + * Change current WAL prohibit state to the input state. 
> > > > + * > > > > + * If the server is already completely moved to the requested WAL prohibit > > > > + * state, or if the desired state is same as the current state, return false, > > > > + * indicating that the server state did not change. Else return true. > > > > + */ > > > > +bool > > > > +SetWALProhibitState(uint32 new_state) > > > > +{ > > > > + bool state_updated = false; > > > > + uint32 cur_state; > > > > + > > > > + cur_state = GetWALProhibitState(); > > > > + > > > > + /* Server is already in requested state */ > > > > + if (new_state == cur_state || > > > > + new_state == (cur_state | WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > > + return false; > > > > + > > > > + /* Prevent concurrent contrary in progress transition state setting */ > > > > + if ((new_state & WALPROHIBIT_TRANSITION_IN_PROGRESS) && > > > > + (cur_state & WALPROHIBIT_TRANSITION_IN_PROGRESS)) > > > > + { > > > > + if (cur_state & WALPROHIBIT_STATE_READ_ONLY) > > > > + ereport(ERROR, > > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > > + errmsg("system state transition to read only is already in progress"), > > > > + errhint("Try after sometime again."))); > > > > + else > > > > + ereport(ERROR, > > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > > + errmsg("system state transition to read write is already in progress"), > > > > + errhint("Try after sometime again."))); > > > > + } > > > > + > > > > + /* Update new state in share memory */ > > > > + state_updated = > > > > + pg_atomic_compare_exchange_u32(&WALProhibitState->SharedWALProhibitState, > > > > + &cur_state, new_state); > > > > + > > > > + if (!state_updated) > > > > + ereport(ERROR, > > > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > > + errmsg("system read write state concurrently changed"), > > > > + errhint("Try after sometime again."))); > > > > + > > > > > > I don't think it's safe to use pg_atomic_compare_exchange_u32() outside > > > of a loop. 
I think there's platforms (basically all load-linked / > > > store-conditional architectures) where than can fail spuriously. > > > > > > Also, there's no memory barrier around GetWALProhibitState, so there's > > > no guarantee it's not an out-of-date value you're starting with. > > > > > > > How about having some kind of lock instead what Robert have suggested > > previously[3] ? > > > > I would like to discuss this point more. In the attached version I have added > WALProhibitLock to protect shared walprohibit state updates. I was a little > unsure do we want another spinlock what XLogCtlData has which is mostly used to > read the shared variable and for the update, both are used e.g. LogwrtResult. > > Right now I haven't added and shared walprohibit state was fetch using a > volatile pointer. Do we need a spinlock there, I am not sure why? Thoughts? > I reverted this WALProhibitLock implementation since with changes in the attached version I don't think we need that locking. Regards, Amul
Attached is a rebased version for the latest master head(#e21cbb4b893). Regards, Amul
On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't mind a loop, but that one looks broken. We have to clear the
> bit before we call the function that processes that type of barrier.
> Otherwise, if we succeed in absorbing the barrier but a new instance
> of the same barrier arrives meanwhile, we'll fail to realize that we
> need to absorb the new one.

Here's a new version of the patch for allowing errors in
barrier-handling functions and/or rejection of barriers by those
functions. I think this responds to all of the previous review
comments from Andres. Also, here is an 0002 which is a handy bit of
test code that I wrote. It's not for commit, but it is useful for
finding bugs.

In addition to improving 0001 based on the review comments, I also
tried to write a better commit message for it, but it might still be
possible to do better there. It's a bit hard to explain the idea in
the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process
with an XID -- and possibly a bunch of sub-XIDs, and possibly while
idle-in-transaction -- can elect to FATAL rather than absorbing the
barrier. I suspect for other barrier types we might have certain
(hopefully short) stretches of code where a barrier of a particular
type can't be absorbed because we're in the middle of doing something
that relies on the previous value of whatever state is protected by
the barrier. Holding off interrupts in those stretches of code would
prevent the barrier from being absorbed, but would also prevent query
cancel, backend termination, and absorption of other barrier types, so
it seems possible that just allowing the barrier-absorption function
for a barrier of that type to just refuse the barrier until after the
backend exits the critical section of code will work out better.

Just for kicks, I tried running 'make installcheck-parallel' while
emitting placeholder barriers every 0.05 s after altering the
barrier-absorption function to always return false, just to see how
ugly that was. In round figures, it made it take 24 s vs. 21 s, so
it's actually not that bad. However, it all depends on how many times
you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine
that the effect might be very non-uniform. That is, if you can get the
code to be running a tight loop that does little real work but does
CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of
barrier, it will probably suck. Therefore, I'm inclined to think that
the fairly strong cautionary logic in the patch is reasonable, but
perhaps it can be better worded somehow. Thoughts welcome.

I have not rebased the remainder of the patch series over these two.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't mind a loop, but that one looks broken. We have to clear the > > bit before we call the function that processes that type of barrier. > > Otherwise, if we succeed in absorbing the barrier but a new instance > > of the same barrier arrives meanwhile, we'll fail to realize that we > > need to absorb the new one. > > Here's a new version of the patch for allowing errors in > barrier-handling functions and/or rejection of barriers by those > functions. I think this responds to all of the previous review > comments from Andres. Also, here is an 0002 which is a handy bit of > test code that I wrote. It's not for commit, but it is useful for > finding bugs. > > In addition to improving 0001 based on the review comments, I also > tried to write a better commit message for it, but it might still be > possible to do better there. It's a bit hard to explain the idea in > the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process > with an XID -- and possibly a bunch of sub-XIDs, and possibly while > idle-in-transaction -- can elect to FATAL rather than absorbing the > barrier. I suspect for other barrier types we might have certain > (hopefully short) stretches of code where a barrier of a particular > type can't be absorbed because we're in the middle of doing something > that relies on the previous value of whatever state is protected by > the barrier. Holding off interrupts in those stretches of code would > prevent the barrier from being absorbed, but would also prevent query > cancel, backend termination, and absorption of other barrier types, so > it seems possible that just allowing the barrier-absorption function > for a barrier of that type to just refuse the barrier until after the > backend exits the critical section of code will work out better. 
> > Just for kicks, I tried running 'make installcheck-parallel' while > emitting placeholder barriers every 0.05 s after altering the > barrier-absorption function to always return false, just to see how > ugly that was. In round figures, it made it take 24 s vs. 21 s, so > it's actually not that bad. However, it all depends on how many times > you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine > that the effect might be very non-uniform. That is, if you can get the > code to be running a tight loop that does little real work but does > CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of > barrier, it will probably suck. Therefore, I'm inclined to think that > the fairly strong cautionary logic in the patch is reasonable, but > perhaps it can be better worded somehow. Thoughts welcome. > > I have not rebased the remainder of the patch series over these two. > That I'll do. On a quick look at the latest 0001 patch, the following hunk to reset leftover flags seems to be unnecessary: + /* + * If some barrier types were not successfully absorbed, we will have + * to try again later. + */ + if (!success) + { + ResetProcSignalBarrierBits(flags); + return; + } When the ProcessBarrierPlaceholder() function returns false without an error, that barrier flag gets reset within the while loop. The case when it has an error, the rest of the flags get reset in the catch block. Correct me if I am missing something here. Regards, Amul
On Thu, Oct 8, 2020 at 3:52 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Oct 7, 2020 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Wed, Sep 16, 2020 at 3:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > I don't mind a loop, but that one looks broken. We have to clear the > > > bit before we call the function that processes that type of barrier. > > > Otherwise, if we succeed in absorbing the barrier but a new instance > > > of the same barrier arrives meanwhile, we'll fail to realize that we > > > need to absorb the new one. > > > > Here's a new version of the patch for allowing errors in > > barrier-handling functions and/or rejection of barriers by those > > functions. I think this responds to all of the previous review > > comments from Andres. Also, here is an 0002 which is a handy bit of > > test code that I wrote. It's not for commit, but it is useful for > > finding bugs. > > > > In addition to improving 0001 based on the review comments, I also > > tried to write a better commit message for it, but it might still be > > possible to do better there. It's a bit hard to explain the idea in > > the abstract. For ALTER SYSTEM READ ONLY, the idea is that a process > > with an XID -- and possibly a bunch of sub-XIDs, and possibly while > > idle-in-transaction -- can elect to FATAL rather than absorbing the > > barrier. I suspect for other barrier types we might have certain > > (hopefully short) stretches of code where a barrier of a particular > > type can't be absorbed because we're in the middle of doing something > > that relies on the previous value of whatever state is protected by > > the barrier. 
Holding off interrupts in those stretches of code would > > prevent the barrier from being absorbed, but would also prevent query > > cancel, backend termination, and absorption of other barrier types, so > > it seems possible that just allowing the barrier-absorption function > > for a barrier of that type to just refuse the barrier until after the > > backend exits the critical section of code will work out better. > > > > Just for kicks, I tried running 'make installcheck-parallel' while > > emitting placeholder barriers every 0.05 s after altering the > > barrier-absorption function to always return false, just to see how > > ugly that was. In round figures, it made it take 24 s vs. 21 s, so > > it's actually not that bad. However, it all depends on how many times > > you hit CHECK_FOR_INTERRUPTS() how quickly, so it's easy to imagine > > that the effect might be very non-uniform. That is, if you can get the > > code to be running a tight loop that does little real work but does > > CHECK_FOR_INTERRUPTS() while refusing to absorb outstanding type of > > barrier, it will probably suck. Therefore, I'm inclined to think that > > the fairly strong cautionary logic in the patch is reasonable, but > > perhaps it can be better worded somehow. Thoughts welcome. > > > > I have not rebased the remainder of the patch series over these two. > > > That I'll do. > Attaching a rebased version includes Robert's patches for the latest master head. > On a quick look at the latest 0001 patch, the following hunk to reset leftover > flags seems to be unnecessary: > > + /* > + * If some barrier types were not successfully absorbed, we will have > + * to try again later. > + */ > + if (!success) > + { > + ResetProcSignalBarrierBits(flags); > + return; > + } > > When the ProcessBarrierPlaceholder() function returns false without an error, > that barrier flag gets reset within the while loop. The case when it has an > error, the rest of the flags get reset in the catch block. 
Correct me if I am > missing something here. > Robert, could you please confirm this? Regards, Amul
Attachment
- v10-0001-Allow-for-error-or-refusal-while-absorbing-barri.patch
- v10-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v10-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch
- v10-0003-Add-alter-system-read-only-write-syntax.patch
- v10-0004-Implement-ALTER-SYSTEM-READ-ONLY-using-global-ba.patch
- v10-0006-WIP-Documentation.patch
On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop. The case when it has an
> error, the rest of the flags get reset in the catch block. Correct me if I am
> missing something here.

Good catch. I think you're right. Do you want to update accordingly?

Andres, do you like the new loop better?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
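The point being confirmed here, that a refused barrier is already re-armed inside the loop so no trailing reset is needed, can be shown with a small standalone C11 sketch. It is only a model of the idea, not the 0001 patch: `demo_pending_mask`, `demo_busy`, `demo_try_absorb()` and `demo_absorb_all()` are hypothetical names, and error handling (the catch block mentioned above) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TYPES 2

static _Atomic uint32_t demo_pending_mask;  /* one bit per barrier type */
static bool demo_busy;              /* true: type 0 cannot be absorbed now */
static int  demo_calls[DEMO_TYPES]; /* handler invocations per type */

/* Hypothetical handler: may refuse by returning false. */
static bool
demo_try_absorb(int type)
{
    assert(type >= 0 && type < DEMO_TYPES);
    demo_calls[type]++;
    return !(type == 0 && demo_busy);
}

/* Returns true if every pending barrier was absorbed. */
static bool
demo_absorb_all(void)
{
    uint32_t flags = atomic_exchange(&demo_pending_mask, 0);
    bool     success = true;

    for (int type = 0; type < DEMO_TYPES; type++)
    {
        uint32_t mask = UINT32_C(1) << type;

        if (!(flags & mask))
            continue;

        if (!demo_try_absorb(type))
        {
            /* Re-arm just this bit so a later call retries it. */
            atomic_fetch_or(&demo_pending_mask, mask);
            success = false;
        }
    }

    /* Nothing left to reset here: refused bits were re-armed above. */
    return success;
}
```

A caller would simply invoke `demo_absorb_all()` again at its next interrupt check; only the refused type is retried, while types that were absorbed stay cleared.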
On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
> On a quick look at the latest 0001 patch, the following hunk to reset leftover
> flags seems to be unnecessary:
>
> + /*
> + * If some barrier types were not successfully absorbed, we will have
> + * to try again later.
> + */
> + if (!success)
> + {
> + ResetProcSignalBarrierBits(flags);
> + return;
> + }
>
> When the ProcessBarrierPlaceholder() function returns false without an error,
> that barrier flag gets reset within the while loop. The case when it has an
> error, the rest of the flags get reset in the catch block. Correct me if I am
> missing something here.
Good catch. I think you're right. Do you want to update accordingly?
Sure, I'll update that. Thanks for the confirmation.
Andres, do you like the new loop better?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 20, 2020 at 11:13 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Fri, 20 Nov 2020 at 9:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Oct 8, 2020 at 6:23 AM Amul Sul <sulamul@gmail.com> wrote:
>> > On a quick look at the latest 0001 patch, the following hunk to reset leftover
>> > flags seems to be unnecessary:
>> >
>> > + /*
>> > + * If some barrier types were not successfully absorbed, we will have
>> > + * to try again later.
>> > + */
>> > + if (!success)
>> > + {
>> > + ResetProcSignalBarrierBits(flags);
>> > + return;
>> > + }
>> >
>> > When the ProcessBarrierPlaceholder() function returns false without an error,
>> > that barrier flag gets reset within the while loop. The case when it has an
>> > error, the rest of the flags get reset in the catch block. Correct me if I am
>> > missing something here.
>>
>> Good catch. I think you're right. Do you want to update accordingly?
>
> Sure, I'll update that. Thanks for the confirmation.
>

Attached is the updated version, where the unnecessary
ResetProcSignalBarrierBits() call in the 0001 patch is removed. The rest of
the patches are unchanged, thanks.

>>
>> Andres, do you like the new loop better?
>>

Regards,
Amul
Attachment
- v11-0004-Implement-ALTER-SYSTEM-READ-ONLY-using-global-ba.patch
- v11-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v11-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch
- v11-0001-Allow-for-error-or-refusal-while-absorbing-barri.patch
- v11-0006-WIP-Documentation.patch
- v11-0003-Add-alter-system-read-only-write-syntax.patch
On Sat, Sep 12, 2020 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> > So, if we're in the middle of a paced checkpoint with a large
> > checkpoint_timeout - a sensible real world configuration - we'll not
> > process ASRO until that checkpoint is over? That seems very much not
> > practical. What am I missing?
>
> Yes, the process doing ASRO will wait until that checkpoint is over.

That's not good. On a typical busy system, a system is going to be in
the middle of a checkpoint most of the time, and the checkpoint will
take a long time to finish - maybe minutes. We want this feature to
respond within milliseconds or a few seconds, not minutes. So we need
something better here.

I'm inclined to think that we should try to CompleteWALProhibitChange()
at the same places we AbsorbSyncRequests(). We know from experience that
bad things happen if we fail to absorb sync requests in a timely
fashion, so we probably have enough calls to AbsorbSyncRequests() to
make sure that we always do that work in a timely fashion. So, if we do
this work in the same place, then it will also be done in a timely
fashion.

I'm not 100% sure whether that introduces any other problems.
Certainly, we're not going to be able to finish the checkpoint once
we've gone read-only, so we'll fail when we try to write the WAL
record for that, or maybe earlier if there's anything else that tries
to write WAL. Either the checkpoint needs to error out, like any other
attempt to write WAL, and we can attempt a new checkpoint if and when
we go read/write, or else we need to finish writing stuff out to disk
but not actually write the checkpoint completion record (or any other
WAL) unless and until the system goes back into read/write mode - and
then at that point the previously-started checkpoint will finish
normally. The latter seems better if we can make it work, but the
former is probably also acceptable. What you've got right now is not.

--
Robert Haas
EDB: http://www.enterprisedb.com
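The suggestion to service the state change at the same points where sync requests are absorbed can be pictured with a toy model of a paced checkpoint loop. This is a sketch under stated assumptions, not checkpointer code: `demo_paced_checkpoint()`, `demo_poll_transition()` and the two variables are hypothetical, "writing a buffer" is just a counter increment, and the open question of what the checkpoint itself should do after the switch (error out or stall) is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request flag, set by a "backend" in this toy model. */
static bool demo_transition_requested;
/* Number of buffers written when the request was serviced (-1: never). */
static int  demo_completed_after = -1;

/* Called at every point where the loop would also absorb sync requests. */
static void
demo_poll_transition(int buffers_written)
{
    assert(buffers_written >= 0);

    if (demo_transition_requested)
    {
        demo_transition_requested = false;
        demo_completed_after = buffers_written;
    }
}

/*
 * Toy paced checkpoint: "writes" nbuffers buffers one at a time and
 * polls for a pending state change after each write.  request_at is the
 * loop iteration at which the simulated request arrives.
 */
static int
demo_paced_checkpoint(int nbuffers, int request_at)
{
    int written = 0;

    for (int i = 0; i < nbuffers; i++)
    {
        if (i == request_at)
            demo_transition_requested = true;   /* simulated request */

        written++;                              /* "write" one buffer */
        demo_poll_transition(written);
    }
    return written;
}
```

In this model the request is serviced one buffer write after it arrives, rather than only once the whole checkpoint has finished, which is the latency difference the message is concerned with.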
On 2020-11-20 11:23:44 -0500, Robert Haas wrote: > Andres, do you like the new loop better? I do!
Hi, On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > That's not good. On a typical busy system, a system is going to be in > the middle of a checkpoint most of the time, and the checkpoint will > take a long time to finish - maybe minutes. Or hours, even. Due to the cost of FPWs it can make a lot of sense to reduce the frequency of that cost... > We want this feature to respond within milliseconds or a few seconds, > not minutes. So we need something better here. Indeed. > I'm inclined to think > that we should try to CompleteWALProhibitChange() at the same places > we AbsorbSyncRequests(). We know from experience that bad things > happen if we fail to absorb sync requests in a timely fashion, so we > probably have enough calls to AbsorbSyncRequests() to make sure that > we always do that work in a timely fashion. So, if we do this work in > the same place, then it will also be done in a timely fashion. Sounds sane, without having looked in detail. > I'm not 100% sure whether that introduces any other problems. > Certainly, we're not going to be able to finish the checkpoint once > we've gone read-only, so we'll fail when we try to write the WAL > record for that, or maybe earlier if there's anything else that tries > to write WAL. Either the checkpoint needs to error out, like any other > attempt to write WAL, and we can attempt a new checkpoint if and when > we go read/write, or else we need to finish writing stuff out to disk > but not actually write the checkpoint completion record (or any other > WAL) unless and until the system goes back into read/write mode - and > then at that point the previously-started checkpoint will finish > normally. The latter seems better if we can make it work, but the > former is probably also acceptable. What you've got right now is not. I mostly wonder which of those two has which implications for how many FPWs we need to redo. Presumably stalling but not cancelling the current checkpoint is better? Greetings, Andres Freund
On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > > That's not good. On a typical busy system, a system is going to be in > > the middle of a checkpoint most of the time, and the checkpoint will > > take a long time to finish - maybe minutes. > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to > reduce the frequency of that cost... > > > > We want this feature to respond within milliseconds or a few seconds, > > not minutes. So we need something better here. > > Indeed. > > > > I'm inclined to think > > that we should try to CompleteWALProhibitChange() at the same places > > we AbsorbSyncRequests(). We know from experience that bad things > > happen if we fail to absorb sync requests in a timely fashion, so we > > probably have enough calls to AbsorbSyncRequests() to make sure that > > we always do that work in a timely fashion. So, if we do this work in > > the same place, then it will also be done in a timely fashion. > > Sounds sane, without having looked in detail. > Understood, and agreed that we need to change the system state as soon as possible. AbsorbSyncRequests() is called from four routines: CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and CheckpointerMain(). The first three execute with interrupts on hold, which is a problem: when we emit the barrier and then wait for it to be absorbed by all processes, including the checkpointer itself, we wait forever. I think that can be fixed by teaching WaitForProcSignalBarrier() not to wait for the calling process to absorb the barrier; it can be absorbed later, when interrupts are resumed. I assumed that we cannot do the barrier processing right away, since there could be other barriers (maybe in the future), including ours, that should not be processed while interrupts are on hold. > > > I'm not 100% sure whether that introduces any other problems. 
> > Certainly, we're not going to be able to finish the checkpoint once > > we've gone read-only, so we'll fail when we try to write the WAL > > record for that, or maybe earlier if there's anything else that tries > > to write WAL. Either the checkpoint needs to error out, like any other > > attempt to write WAL, and we can attempt a new checkpoint if and when > > we go read/write, or else we need to finish writing stuff out to disk > > but not actually write the checkpoint completion record (or any other > > WAL) unless and until the system goes back into read/write mode - and > > then at that point the previously-started checkpoint will finish > > normally. The latter seems better if we can make it work, but the > > former is probably also acceptable. What you've got right now is not. > > I mostly wonder which of those two has which implications for how many > FPWs we need to redo. Presumably stalling but not cancelling the current > checkpoint is better? > I also prefer the idea of stalling the checkpointer's work in the middle instead of canceling it. But then we need to handle shutdown requests and postmaster death, either of which can interrupt the stall. If that happens, we need to make sure that no unwanted WAL insertion happens afterward. For that, the LocalXLogInsertAllowed flag needs to be updated correctly, because WAL prohibit barrier processing is skipped for the checkpointer, which is the process that emits that barrier, as mentioned above. Regards, Amul
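Editor's sketch of the self-wait problem described in the message above: if the process that emits a barrier also waits for itself to absorb it, and it cannot absorb anything while interrupts are held, the wait never ends. The code below is a simplified model under stated assumptions: a fixed NPROCS, a per-process "absorbed generation" array, and a predicate in place of a real wait loop. None of these names come from PostgreSQL's procsignal code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPROCS 4   /* hypothetical fixed number of participating processes */

/* Newest barrier generation each process has absorbed (model only). */
typedef struct BarrierSim
{
    uint64_t absorbed_gen[NPROCS];
} BarrierSim;

/*
 * Returns true when every process the caller must wait for has absorbed
 * barrier generation 'gen'. With skip_self set, the caller's own slot is
 * ignored, on the assumption that the caller will absorb the barrier
 * later, once it resumes interrupts.
 */
static bool
barrier_absorbed_by_all(const BarrierSim *b, uint64_t gen,
                        int self, bool skip_self)
{
    assert(self >= 0 && self < NPROCS);

    for (int i = 0; i < NPROCS; i++)
    {
        if (skip_self && i == self)
            continue;           /* do not wait on ourselves */
        if (b->absorbed_gen[i] < gen)
            return false;       /* this process still has to absorb it */
    }
    return true;
}
```

A real implementation would sleep and recheck this condition; the predicate form just makes the difference visible. When every other process has absorbed the barrier and the emitter has not, the check without skip_self stays false indefinitely, while the check with skip_self succeeds.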
On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > > > That's not good. On a typical busy system, a system is going to be in > > > the middle of a checkpoint most of the time, and the checkpoint will > > > take a long time to finish - maybe minutes. > > > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to > > reduce the frequency of that cost... > > > > > > > We want this feature to respond within milliseconds or a few seconds, > > > not minutes. So we need something better here. > > > > Indeed. > > > > > > > I'm inclined to think > > > that we should try to CompleteWALProhibitChange() at the same places > > > we AbsorbSyncRequests(). We know from experience that bad things > > > happen if we fail to absorb sync requests in a timely fashion, so we > > > probably have enough calls to AbsorbSyncRequests() to make sure that > > > we always do that work in a timely fashion. So, if we do this work in > > > the same place, then it will also be done in a timely fashion. > > > > Sounds sane, without having looked in detail. > > > > Understood & agreed that we need to change the system state as soon as possible. > > I can see AbsorbSyncRequests() is called from 4 routing as > CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and > CheckpointerMain(). Out of 4, the first three executes with an interrupt is on > hod which will cause a problem when we do emit barrier and wait for those > barriers absorption by all the process including itself and will cause an > infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(), > do not wait on self to absorb barrier. Let that get absorbed at a later point > in time when the interrupt is resumed. 
I assumed that we cannot do barrier > processing right away since there could be other barriers (maybe in the future) > including ours that should not process while the interrupt is on hold. > CreateCheckPoint() acquires the CheckpointLock LWLock at the start and releases it at the end, which puts interrupts on hold for the whole duration. It is kind of surprising that we hold this lock, and therefore hold off interrupts, for such a long time. We need CheckpointLock only to ensure that one checkpoint happens at a time. Can't we do something simpler to ensure that instead of taking the lock? Holding off interrupts for so long doesn't seem like a good idea. Thoughts/Suggestions? Regards, Amul
On Mon, Dec 14, 2020 at 8:03 PM Amul Sul <sulamul@gmail.com> wrote: > > On Mon, Dec 14, 2020 at 11:28 AM Amul Sul <sulamul@gmail.com> wrote: > > > > On Thu, Dec 10, 2020 at 6:04 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2020-12-09 16:13:06 -0500, Robert Haas wrote: > > > > That's not good. On a typical busy system, a system is going to be in > > > > the middle of a checkpoint most of the time, and the checkpoint will > > > > take a long time to finish - maybe minutes. > > > > > > Or hours, even. Due to the cost of FPWs it can make a lot of sense to > > > reduce the frequency of that cost... > > > > > > > > > > We want this feature to respond within milliseconds or a few seconds, > > > > not minutes. So we need something better here. > > > > > > Indeed. > > > > > > > > > > I'm inclined to think > > > > that we should try to CompleteWALProhibitChange() at the same places > > > > we AbsorbSyncRequests(). We know from experience that bad things > > > > happen if we fail to absorb sync requests in a timely fashion, so we > > > > probably have enough calls to AbsorbSyncRequests() to make sure that > > > > we always do that work in a timely fashion. So, if we do this work in > > > > the same place, then it will also be done in a timely fashion. > > > > > > Sounds sane, without having looked in detail. > > > > > > > Understood & agreed that we need to change the system state as soon as possible. > > > > I can see AbsorbSyncRequests() is called from 4 routing as > > CheckpointWriteDelay(), ProcessSyncRequests(), SyncPostCheckpoint() and > > CheckpointerMain(). Out of 4, the first three executes with an interrupt is on > > hod which will cause a problem when we do emit barrier and wait for those > > barriers absorption by all the process including itself and will cause an > > infinite wait. I think that can be fixed by teaching WaitForProcSignalBarrier(), > > do not wait on self to absorb barrier. 
Let that get absorbed at a later point > > in time when the interrupt is resumed. I assumed that we cannot do barrier > > processing right away since there could be other barriers (maybe in the future) > > including ours that should not process while the interrupt is on hold. > > > > CreateCheckPoint() holds CheckpointLock LW at start and releases at the end > which puts interrupt on hold. This kinda surprising that we were holding this > lock and putting interrupt on hots for a long time. We do need that > CheckpointLock just to ensure that one checkpoint happens at a time. Can't we do > something easy to ensure that instead of the lock? Probably holding off > interrupts for so long doesn't seem to be a good idea. Thoughts/Suggestions? > To move development, testing, and the review forward, I have commented out the code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and added the changes for the checkpointer so that system read-write state change request can be processed as soon as possible, as suggested by Robert[1]. I have started a new thread[2] to understand the need for the CheckpointLock in CreateCheckPoint() function. Until then we can continue work on this feature by skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is marked WIP. 1] http://postgr.es/m/CA+TgmoYexwDQjdd1=15KMz+7VfHVx8VHNL2qjRRK92P=CSZDxg@mail.gmail.com 2] http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com Regards, Amul
Attachment
- v12-0001-Allow-for-error-or-refusal-while-absorbing-barri.patch
- v12-0002-Test-module-for-barriers.-NOT-FOR-COMMIT.patch
- v12-0004-WIP-Implement-ALTER-SYSTEM-READ-ONLY-using-globa.patch
- v12-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v12-0003-Add-alter-system-read-only-write-syntax.patch
- v12-0006-WIP-Documentation.patch
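Editor's sketch relating to the CheckpointLock question raised above ("Can't we do something easy to ensure that instead of the lock?"). This is only an illustration of one generic technique, an atomic "in progress" flag claimed with compare-and-exchange, and not what the patch or the separate thread settled on. The function names are hypothetical. One caveat worth stating: unlike an LWLock, which PostgreSQL releases during error cleanup, a bare flag like this stays set if the holder errors out, so real code would need its own cleanup path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical marker: true while some process is running a checkpoint. */
static atomic_bool checkpoint_in_progress = false;

/*
 * Try to claim the right to run a checkpoint. Returns true if the caller
 * claimed it, false if another checkpoint is already running. Nothing is
 * held while the checkpoint runs, so interrupts need not be held off.
 */
static bool
try_begin_checkpoint(void)
{
    bool expected = false;

    return atomic_compare_exchange_strong(&checkpoint_in_progress,
                                          &expected, true);
}

/* Release the claim taken by try_begin_checkpoint(). */
static void
end_checkpoint(void)
{
    assert(atomic_load(&checkpoint_in_progress));
    atomic_store(&checkpoint_in_progress, false);
}
```

A second caller that finds the flag set gets false back immediately and can decide whether to skip or retry, instead of blocking.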
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote: > To move development, testing, and the review forward, I have commented out the > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > added the changes for the checkpointer so that system read-write state change > request can be processed as soon as possible, as suggested by Robert[1]. > > I have started a new thread[2] to understand the need for the CheckpointLock in > CreateCheckPoint() function. Until then we can continue work on this feature by > skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is > marked WIP. Based on the favorable review comment from Andres upthread and also your feedback, I committed 0001. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote: > To move development, testing, and the review forward, I have commented out the > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > added the changes for the checkpointer so that system read-write state change > request can be processed as soon as possible, as suggested by Robert[1]. I am extremely doubtful about SetWALProhibitState()'s claim that "The final state can only be requested by the checkpointer or by the single-user so that there will be no chance that the server is already in the desired final state." It seems like there is an obvious race condition: CompleteWALProhibitChange() is called with a cur_state_gen argument which embeds the last state we saw, but there's nothing to keep it from changing between the time we saw it and the time that function calls SetWALProhibitState(), is there? We aren't holding any lock. It seems to me that SetWALProhibitState() needs to be rewritten to avoid this assumption. On a related note, SetWALProhibitState() has only two callers. One passes is_final_state as true, and the other as false: it's never a variable. The two cases are handled mostly differently. This doesn't seem good. A lot of the logic in this function should probably be moved to the calling sites, especially because it's almost certainly wrong for this function to be basing what it does on the *current* WAL prohibit state rather than the WAL prohibit state that was in effect at the time we made the decision to call this function in the first place. As I mentioned in the previous paragraph, that's a built-in race condition. To put that another way, this function should NOT feel free to call GetWALProhibitStateGen(). I don't really see why we should have both an SQL callable function pg_alter_wal_prohibit_state() and also a DDL command for this. 
If we're going to go with a functional interface, and I guess the idea of that is to make it so GRANT EXECUTE works, then why not just get rid of the DDL? RequestWALProhibitChange() doesn't look very nice. It seems like it's basically the second half of pg_alter_wal_prohibit_state(), not being called from anywhere else. It doesn't seem to add anything to separate it out like this; the interface between the two is not especially clean. It seems odd that ProcessWALProhibitStateChangeRequest() returns without doing anything if !AmCheckpointerProcess(), rather than having that be an Assert(). Why is it like that? I think WALProhibitStateShmemInit() would probably look more similar to other functions if it did if (found) { stuff; } rather than if (!found) return; stuff; -- but I might be wrong about the existing precedent. The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff seems confusingly-named, because we have other reasons for skipping a checkpoint that are not what we're talking about here. I think this is talking about whether we've performed a checkpoint after recovery, and the naming should reflect that. But I think there's something else wrong with the design, too: why is this protected by a spinlock? I have questions in both directions. On the one hand, I wonder why we need any kind of lock at all. On the other hand, if we do need a lock, I wonder why a spinlock that protects only the setting and clearing of the flag and nothing else is sufficient. There are zero comments explaining what the idea behind this locking regime is, and I can't understand why it should be correct. In fact, I think this area needs a broader rethink. Like, the way you integrated that stuff into StartupXLog(), it sure looks to me like we might skip the checkpoint but still try to write other WAL records. Before we reach the offending segment of code, we call UpdateFullPageWrites(). Afterwards, we call XLogReportParameters(). 
Both of those are going to potentially write WAL. I guess you could argue that's OK, on the grounds that neither function is necessarily going to log anything, but I don't think I believe that. If I make my server read only, take the OS down, change some GUCs, and then start it again, I don't expect it to PANIC. Also, I doubt that it's OK to skip the checkpoint as this code does and then go ahead and execute recovery_end_command and update the control file anyway. It sure looks like the existing code is written with the assumption that the checkpoint happens before those other things. One idea I just had was: suppose that, if the system is READ ONLY, we don't actually exit recovery right away, and the startup process doesn't exit. Instead we just sit there and wait for the system to be made read-write again before doing anything else. But then if hot_standby=false, there's no way for someone to execute a ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which seems bad. So perhaps we need to let in regular connections *as if* the system were read-write while postponing not just the end-of-recovery checkpoint but also the other associated things like UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command, control file update, etc. until the end of recovery. Or maybe that's not the right idea either, but regardless of what we do here it needs clear comments justifying it. The current version of the patch does not have any. I think that you've mis-positioned the check in autovacuum.c. Note that the comment right afterwards says: "a worker finished, or postmaster signaled failure to start a worker". Those are things we should still check for even when the system is R/O. What we don't want to do in that case is start new workers. I would suggest revising the comment that starts with "There are some conditions that..." to mention three conditions. The new one would be that the system is in a read-only state. 
I'd mention that first, making the existing ones #2 and #3, and then add the code to "continue;" in that case right after that comment, before setting current_time. SendsSignalToCheckpointer() has multiple problems. As far as the name, it should at least be "Send" rather than "Sends" but the corresponding functions elsewhere have names like SendPostmasterSignal() not SendSignalToPostmaster(). Also, why is it OK for it to use elog() rather than ereport()? Also, why is it an error if the checkpointer's not running, rather than just having the next checkpointer do it when it's relaunched? Also, why pass SIGINT as an argument if there's only one caller? A related thing that's also odd is that sending SIGINT calls ReqCheckpointHandler() not anything specific to prohibiting WAL. That is probably OK because that function now just sets the latch. But then we could stop sending SIGINT to the checkpointer at all and just send SIGUSR1, which would also set the latch, without using up a signal. I wonder if we should make that change as a separate preparatory patch. It seems like that would clear things up; it would remove the oddity that this patch is invoking a handler called ReqCheckpointerHandler() with no intention of requesting a checkpoint, because ReqCheckpointerHandler() would be gone. That problem could also be fixed by renaming ReqCheckpointerHandler() to something clearer, but that seems inferior. This is probably not a complete list of problems. Review from others would be appreciated. -- Robert Haas EDB: http://www.enterprisedb.com
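Editor's sketch of the race described at the start of this review: a caller reads the shared state, decides what to do, and then updates it, with nothing preventing a change in between. One common way to close such a window, shown here purely as an illustration and not as what the patch does, is to make the update conditional on the value the caller originally observed, using compare-and-exchange. The variable and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared word holding the WAL prohibit state. */
static _Atomic uint32_t shared_wal_prohibit_state = 0;

/*
 * Move the shared state to 'next', but only if it still holds 'observed',
 * the value the caller saw when it decided to make the change. Returns
 * false, leaving the state untouched, if someone changed it in between;
 * the caller must then re-read the state and decide again.
 */
static bool
complete_transition(uint32_t observed, uint32_t next)
{
    uint32_t expected = observed;

    return atomic_compare_exchange_strong(&shared_wal_prohibit_state,
                                          &expected, next);
}
```

Written this way, the function never consults the current state on its own; it acts only on the value that was in effect when the caller made its decision.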
On Wed, Jan 20, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 14, 2021 at 6:29 AM Amul Sul <sulamul@gmail.com> wrote: > > To move development, testing, and the review forward, I have commented out the > > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > > added the changes for the checkpointer so that system read-write state change > > request can be processed as soon as possible, as suggested by Robert[1]. > > I am extremely doubtful about SetWALProhibitState()'s claim that "The > final state can only be requested by the checkpointer or by the > single-user so that there will be no chance that the server is already > in the desired final state." It seems like there is an obvious race > condition: CompleteWALProhibitChange() is called with a cur_state_gen > argument which embeds the last state we saw, but there's nothing to > keep it from changing between the time we saw it and the time that > function calls SetWALProhibitState(), is there? We aren't holding any > lock. It seems to me that SetWALProhibitState() needs to be rewritten > to avoid this assumption. > It is not like that; let me explain. When a user backend requests a change to the WAL prohibit state, either with the ASRO/ASRW DDL from the previous patch or by calling pg_alter_wal_prohibit_state(), the WAL prohibit state in shared memory is set to the corresponding transition state, i.e. going-read-only or going-read-write, if it is not set already. If another backend requests the same change, nothing is changed in shared memory, but that backend has to wait until the transition to the final WAL prohibit state completes. If a backend requests the opposite of the transition that is in progress, it gets an error such as "system state transition to read only/write is already in progress". Only one transition state can be set at a time. For the case where transition state changes to the complete states i.e. 
read-only/read-write that can only be changed by the checkpointer or standalone backend, there won't be any concurrency to change transition state to complete state. > On a related note, SetWALProhibitState() has only two callers. One > passes is_final_state as true, and the other as false: it's never a > variable. The two cases are handled mostly differently. This doesn't > seem good. A lot of the logic in this function should probably be > moved to the calling sites, especially because it's almost certainly > wrong for this function to be basing what it does on the *current* WAL > prohibit state rather than the WAL prohibit state that was in effect > at the time we made the decision to call this function in the first > place. As I mentioned in the previous paragraph, that's a built-in > race condition. To put that another way, this function should NOT feel > free to call GetWALProhibitStateGen(). > Understood. I have removed SetWALProhibitState() and moved the respective code to the caller in the attached version. > I don't really see why we should have both an SQL callable function > pg_alter_wal_prohibit_state() and also a DDL command for this. If > we're going to go with a functional interface, and I guess the idea of > that is to make it so GRANT EXECUTE works, then why not just get rid > of the DDL? > Ok, dropped the patch of the DDL command. If in the future we want it back, I can add that again. Now, I am a little bit concerned about the current function name. How about pg_set_wal_prohibit_state(bool) name or have two functions as pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void) or any other suggestions? > RequestWALProhibitChange() doesn't look very nice. It seems like it's > basically the second half of pg_alter_wal_prohibit_state(), not being > called from anywhere else. It doesn't seem to add anything to separate > it out like this; the interface between the two is not especially > clean. 
> Ok, moved that code in pg_alter_wal_prohibit_state() in the attached version. > It seems odd that ProcessWALProhibitStateChangeRequest() returns > without doing anything if !AmCheckpointerProcess(), rather than having > that be an Assert(). Why is it like that? > Like AbsorbSyncRequests(). > I think WALProhibitStateShmemInit() would probably look more similar > to other functions if it did if (found) { stuff; } rather than if > (!found) return; stuff; -- but I might be wrong about the existing > precedent. > Ok, did the same in the attached version. > The SetLastCheckPointSkipped() and LastCheckPointIsSkipped() stuff > seems confusingly-named, because we have other reasons for skipping a > checkpoint that are not what we're talking about here. I think this is > talking about whether we've performed a checkpoint after recovery, and > the naming should reflect that. But I think there's something else > wrong with the design, too: why is this protected by a spinlock? I > have questions in both directions. On the one hand, I wonder why we > need any kind of lock at all. On the other hand, if we do need a lock, > I wonder why a spinlock that protects only the setting and clearing of > the flag and nothing else is sufficient. There are zero comments > explaining what the idea behind this locking regime is, and I can't > understand why it should be correct. > Renamed those functions to SetRecoveryCheckpointSkippedFlag() and RecoveryCheckpointIsSkipped() respectively and remove the lock which is not needed. Updated comment for lastRecoveryCheckpointSkipped variable for the lock requirement. > In fact, I think this area needs a broader rethink. Like, the way you > integrated that stuff into StartupXLog(), it sure looks to me like we > might skip the checkpoint but still try to write other WAL records. > Before we reach the offending segment of code, we call > UpdateFullPageWrites(). Afterwards, we call XLogReportParameters(). 
> Both of those are going to potentially write WAL. I guess you could > argue that's OK, on the grounds that neither function is necessarily > going to log anything, but I don't think I believe that. If I make my > server read only, take the OS down, change some GUCs, and then start > it again, I don't expect it to PANIC. > If you think that there will be panic when UpdateFullPageWrites() and/or XLogReportParameters() tries to write WAL since the shared memory state for WAL prohibited is set then it is not like that. For those functions, WAL write is explicitly enabled by calling LocalSetXLogInsertAllowed(). I was under the impression that there won't be any problem if we allow the writing WAL to UpdateFullPageWrites() and XLogReportParameters(). It can be considered as an exception since it is fine that this WAL record is not streamed to standby while graceful failover, I may be wrong though. > Also, I doubt that it's OK to skip the checkpoint as this code does > and then go ahead and execute recovery_end_command and update the > control file anyway. It sure looks like the existing code is written > with the assumption that the checkpoint happens before those other > things. Hmm, here we could go wrong. I need to look at this part carefully. > One idea I just had was: suppose that, if the system is READ > ONLY, we don't actually exit recovery right away, and the startup > process doesn't exit. Instead we just sit there and wait for the > system to be made read-write again before doing anything else. But > then if hot_standby=false, there's no way for someone to execute a > ALTER SYSTEM READ WRITE and/or pg_alter_wal_prohibit_state(), which > seems bad. So perhaps we need to let in regular connections *as if* > the system were read-write while postponing not just the > end-of-recovery checkpoint but also the other associated things like > UpdateFullPageWrites(), XLogReportParameters(), recovery_end_command, > control file update, etc. until the end of recovery. 
Or maybe that's > not the right idea either, but regardless of what we do here it needs > clear comments justifying it. The current version of the patch does > not have any. > Will get back to you on this. Let me think more on this and the previous point. > I think that you've mis-positioned the check in autovacuum.c. Note > that the comment right afterwards says: "a worker finished, or > postmaster signaled failure to start a worker". Those are things we > should still check for even when the system is R/O. What we don't want > to do in that case is start new workers. I would suggest revising the > comment that starts with "There are some conditions that..." to > mention three conditions. The new one would be that the system is in a > read-only state. I'd mention that first, making the existing ones #2 > and #3, and then add the code to "continue;" in that case right after > that comment, before setting current_time. > Done. > SendsSignalToCheckpointer() has multiple problems. As far as the name, > it should at least be "Send" rather than "Sends" but the corresponding "Sends" is wrong; it was a typo. > functions elsewhere have names like SendPostmasterSignal() not > SendSignalToPostmaster(). Also, why is it OK for it to use elog() > rather than ereport()? Also, why is it an error if the checkpointer's > not running, rather than just having the next checkpointer do it when > it's relaunched? OK, the function now only returns true or false, and it is up to the caller what to do with that. In our case, the caller only issues a warning. If you prefer, this could be a NOTICE instead. > Also, why pass SIGINT as an argument if there's only > one caller? I thought that someone might reuse it to send some other signal to the checkpointer process in the future. > A related thing that's also odd is that sending SIGINT > calls ReqCheckpointHandler() not anything specific to prohibiting WAL. > That is probably OK because that function now just sets the latch. 
But > then we could stop sending SIGINT to the checkpointer at all and just > send SIGUSR1, which would also set the latch, without using up a > signal. I wonder if we should make that change as a separate > preparatory patch. It seems like that would clear things up; it would > remove the oddity that this patch is invoking a handler called > ReqCheckpointerHandler() with no intention of requesting a checkpoint, > because ReqCheckpointerHandler() would be gone. That problem could > also be fixed by renaming ReqCheckpointerHandler() to something > clearer, but that seems inferior. > I am not clear on this part. In the attached version I am sending SIGUSR1 instead of SIGINT, which works for me. > This is probably not a complete list of problems. Review from others > would be appreciated. > Thanks a lot. The attached version does not address all your comments, I'll continue my work on that. Regards, Amul
Attachment
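Editor's sketch of the request rules Amul describes in the message above: a request moves the system into a transition state, a second request for the same change waits, a request for the opposite change is rejected while a transition is in progress, and only the checkpointer (or a standalone backend) moves a transition state to its final state. The enum and function names are hypothetical, and the REQ_ALREADY_DONE outcome for a request that matches the current final state is an assumption, since the message does not spell that case out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two final and two transition states. */
typedef enum
{
    WP_READ_WRITE,
    WP_GOING_READ_ONLY,
    WP_READ_ONLY,
    WP_GOING_READ_WRITE
} WalProhibitSim;

typedef enum
{
    REQ_STARTED,            /* transition state was set by this request */
    REQ_WAIT,               /* same change already in progress: wait for it */
    REQ_ALREADY_DONE,       /* assumption: already in the requested state */
    REQ_ERROR_IN_PROGRESS   /* opposite transition is in progress */
} RequestResult;

/* What a backend's request does to the shared state. */
static RequestResult
request_change(WalProhibitSim *state, bool want_read_only)
{
    switch (*state)
    {
        case WP_READ_WRITE:
            if (!want_read_only)
                return REQ_ALREADY_DONE;
            *state = WP_GOING_READ_ONLY;
            return REQ_STARTED;
        case WP_GOING_READ_ONLY:
            return want_read_only ? REQ_WAIT : REQ_ERROR_IN_PROGRESS;
        case WP_READ_ONLY:
            if (want_read_only)
                return REQ_ALREADY_DONE;
            *state = WP_GOING_READ_WRITE;
            return REQ_STARTED;
        case WP_GOING_READ_WRITE:
            return want_read_only ? REQ_ERROR_IN_PROGRESS : REQ_WAIT;
    }
    return REQ_ERROR_IN_PROGRESS;   /* not reached for valid states */
}

/* Only the checkpointer-side step turns a transition into a final state. */
static void
checkpointer_complete(WalProhibitSim *state)
{
    if (*state == WP_GOING_READ_ONLY)
        *state = WP_READ_ONLY;
    else if (*state == WP_GOING_READ_WRITE)
        *state = WP_READ_WRITE;
}
```

Because backends can only set a transition state and only one process completes it, the states are visited in a fixed cycle: read-write, going-read-only, read-only, going-read-write, and back to read-write.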
On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:
> It is not like that, let me explain. When a user backend requests to alter the WAL prohibit state by using the ASRO/ASRW DDL with the previous patch or by calling pg_alter_wal_prohibit_state(), the WAL prohibit state in shared memory will be set to the transition state, i.e. going-read-only or going-read-write, if it is not already. If another backend tries to request the same alteration to the WAL prohibit state, then nothing is going to be changed in shared memory, but that backend needs to wait until the transition to the final WAL prohibited state completes. If a backend requests the opposite of the transition that is in progress, it will see an error such as "system state transition to read only/write is already in progress". Only one transition state can be set at a time.
Hrm. Well, then that needs to be abundantly clear in the relevant comments.
> Now, I am a little bit concerned about the current function name. How about the name pg_set_wal_prohibit_state(bool), or two functions, pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void), or any other suggestions?
How about pg_prohibit_wal(true|false)?
> > It seems odd that ProcessWALProhibitStateChangeRequest() returns without doing anything if !AmCheckpointerProcess(), rather than having that be an Assert(). Why is it like that?
> Like AbsorbSyncRequests().
Well, that can be called not from the checkpointer, according to the comments. Specifically from the postmaster, I guess. Again, comments please.
> If you think that there will be a panic when UpdateFullPageWrites() and/or XLogReportParameters() tries to write WAL because the shared memory state for WAL prohibited is set, it is not like that. For those functions, WAL writes are explicitly enabled by calling LocalSetXLogInsertAllowed().
> I was under the impression that there won't be any problem if we allow UpdateFullPageWrites() and XLogReportParameters() to write WAL. It can be considered an exception, since it is fine that this WAL record is not streamed to the standby during a graceful failover. I may be wrong, though.
I don't think that's OK. I mean, the purpose of the feature is to prohibit WAL. If it doesn't do that, I believe it will fail to satisfy the principle of least surprise.
> I am not clear on this part. In the attached version I am sending SIGUSR1 instead of SIGINT, which works for me.
OK.
> The attached version does not address all your comments, I'll continue my work on that.
Some thoughts on this version:
+/* Extract last two bits */
+#define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
+ ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
+#define WALPROHIBIT_NEXT_STATE(stateGeneration) \
+ WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
This is really confusing. First, the comment looks like it applies to both based on how it is positioned, but that's clearly not true. Second, the naming is really hard to understand. Third, there don't seem to be comments explaining the theory of what is going on here. Fourth, stateGeneration refers not to which generation of state we've got here but to the combination of the state and the generation. However, it's not clear that we ever really use the generation for anything.
I think that the direction you went with this is somewhat different from what I had in mind. That may be OK, but let me just explain the difference. We both had in mind the idea that the low two bits of the state would represent the current state and the upper bits would represent the state generation. However, I wasn't necessarily imagining that the only supported operation was making the combined value go up by 1. For instance, I had thought that perhaps the effect of trying to go read-only when we're in the middle of going read-write would be to cancel the previous operation and start the new one. What you have instead is that it errors out. So in your model a change always has to finish before the next one can start, which in turn means that the sequence is completely linear. In my idea the state+generation might go from say 1 to 7, because trying to go read-write would cancel the previous attempt to go read-only and replace it with an attempt to go the other direction, and from 7 we might go to 9 if somebody now tries to go read-only again before that finishes. In your model, there's never any sort of cancellation of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.
One disadvantage of the way you've got it from a user perspective is that if I'm writing a tool, I might get an error telling me that the state change I'm trying to make is already in progress, and then I have to retry. With the other design, I might attempt a state change and have it fail because the change can't be completed, but I won't ever fail because I attempt a state change and it can't be started because we're in the wrong starting state. So, with this design, as the tool author, I may not be able to just say, well, I tried to change the state and it didn't work, so report the error to the user. I think with the other approach that would be more viable. But I might be wrong here; it would be interesting to hear what other people think.
I dislike the use of the term state_gen or StateGen to refer to the combination of a state and a generation. That seems unintuitive. I'm tempted to propose that we just call it a counter, and, assuming we stick with the design as you now have it, explain it with a comment like this in walprohibit.h:
"There are four possible states. A brand new database cluster is always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to make it read only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently tries to make it read write, we will enter the state WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete, we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state transitions are the only ones possible; for example, if we're currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will produce an error, and a second attempt to go read-only will not cause a state change. Thus, we can represent the state as a shared-memory counter whose value only ever changes by adding 1. The initial value at postmaster startup is either 0 or 2, depending on whether the control file specifies the system is starting read-only or read-write."
And then maybe change all the state_gen references to reference wal_prohibit_counter or, where a shorter name is appropriate, counter.
I think this might be clearer if we used different data types for the state and the state/generation combination, with functions to convert between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0 etc. maybe do:
typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;
And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:
static inline WALProhibitState
GetWALProhibitState(uint32 wal_prohibit_counter)
{
    return (WALProhibitState) (wal_prohibit_counter & 3);
}
I don't really know why we need WALPROHIBIT_NEXT_STATE at all, honestly. It's just a macro to add 1 to an integer. And you don't even use it consistently. Like pg_alter_wal_prohibit_state() does this:
+ /* Server is already in requested state */
+ if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
+     PG_RETURN_VOID();
But then later does this:
+ next_state_gen = cur_state_gen + 1;
Which is exactly the same thing as what you computed above using WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly sure how to structure this to make it as simple as possible, but I don't think this is it.
Honestly this whole logic here seems correct but a bit hard to follow. Like, maybe:
wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
switch (GetWALProhibitState(wal_prohibit_counter))
{
case WALPROHIBIT_STATE_READ_WRITE:
    if (!walprohibit) return;
    increment = true;
    break;
case WALPROHIBIT_STATE_GOING_READ_WRITE:
    if (walprohibit) ereport(ERROR, ...);
    break;
...
}
And then just:
if (increment)
    wal_prohibit_counter = pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
target_counter_value = wal_prohibit_counter + 1;
// random stuff
// eventually wait until the counter reaches >= target_counter_value
This might not be exactly the right idea though. I'm just looking for a way to make it clearer, because I find it a bit hard to understand right now. Maybe you or someone else will have a better idea.
+ success = pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
+                                          &cur_state_gen, next_state_gen);
+ Assert(success);
I am almost positive that this is not OK. I think on some platforms atomics just randomly fail some percentage of the time. You always need a retry loop. Anyway, what happens if two people enter this function at the same time and both read the same starting counter value before either does anything?
+ /* To be sure that any later reads of memory happen strictly after this. */
+ pg_memory_barrier();
You don't need a memory barrier after use of an atomic. The atomic includes a barrier.
--
Robert Haas
EDB: http://www.enterprisedb.com
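The linear counter and the retry loop discussed in this review can be sketched outside PostgreSQL. The block below is only an illustration under stated assumptions: it uses C11 <stdatomic.h> rather than the pg_atomic_* API, the RequestResult codes and request_wal_prohibit_change() are hypothetical names, and the numeric values 1 to 3 of the enum are inferred from the order of the transitions (only READ_WRITE = 0 is stated in the thread). The C11 weak compare-exchange is allowed to fail spuriously, which is why the update sits in a loop instead of behind an assertion; the same loop also handles two callers that read the same starting value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed numbering: the counter cycles through the states in this order. */
typedef enum
{
    WALPROHIBIT_STATE_READ_WRITE = 0,
    WALPROHIBIT_STATE_GOING_READ_ONLY = 1,
    WALPROHIBIT_STATE_READ_ONLY = 2,
    WALPROHIBIT_STATE_GOING_READ_WRITE = 3
} WALProhibitState;

/* Hypothetical result codes, used only by this sketch. */
typedef enum
{
    REQUEST_NOTHING_TO_DO,      /* already in the requested stable state */
    REQUEST_STARTED,            /* this caller moved the counter forward */
    REQUEST_JOINED,             /* same transition already in progress */
    REQUEST_CONFLICT            /* opposite transition in progress */
} RequestResult;

/* The low two bits of the counter are the state. */
static inline WALProhibitState
GetWALProhibitState(uint32_t wal_prohibit_counter)
{
    return (WALProhibitState) (wal_prohibit_counter & 3);
}

/*
 * Ask for WAL to be prohibited (true) or permitted (false). On STARTED or
 * JOINED, *target_counter_value is the value the caller should wait for.
 */
static RequestResult
request_wal_prohibit_change(_Atomic uint32_t *counter, bool walprohibit,
                            uint32_t *target_counter_value)
{
    uint32_t    cur;

    assert(counter != NULL && target_counter_value != NULL);
    cur = atomic_load(counter);

    for (;;)
    {
        switch (GetWALProhibitState(cur))
        {
            case WALPROHIBIT_STATE_READ_WRITE:
                if (!walprohibit)
                    return REQUEST_NOTHING_TO_DO;
                break;          /* fall out of the switch and try to start */
            case WALPROHIBIT_STATE_READ_ONLY:
                if (walprohibit)
                    return REQUEST_NOTHING_TO_DO;
                break;
            case WALPROHIBIT_STATE_GOING_READ_ONLY:
                if (!walprohibit)
                    return REQUEST_CONFLICT;
                *target_counter_value = cur + 1;
                return REQUEST_JOINED;
            case WALPROHIBIT_STATE_GOING_READ_WRITE:
                if (walprohibit)
                    return REQUEST_CONFLICT;
                *target_counter_value = cur + 1;
                return REQUEST_JOINED;
        }

        /*
         * Stable state and a change is wanted: bump the counter by one.
         * On failure (spurious or a concurrent update) cur is refreshed
         * with the current value and the decision above is re-evaluated.
         */
        if (atomic_compare_exchange_weak(counter, &cur, cur + 1))
        {
            *target_counter_value = cur + 2;
            return REQUEST_STARTED;
        }
    }
}
```

Starting from a counter of 0, a first request to prohibit WAL moves the counter to 1 and hands back a target of 2; a second identical request joins the same transition, and a request in the opposite direction is reported as a conflict.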
On Tue, Jan 26, 2021 at 2:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 21, 2021 at 9:47 AM Amul Sul <sulamul@gmail.com> wrote:
> > It is not like that, let me explain. When a user backend requests to alter the WAL prohibit state by using the ASRO/ASRW DDL with the previous patch or by calling pg_alter_wal_prohibit_state(), the WAL prohibit state in shared memory will be set to the transition state, i.e. going-read-only or going-read-write, if it is not already. If another backend tries to request the same alteration to the WAL prohibit state, then nothing is going to be changed in shared memory, but that backend needs to wait until the transition to the final WAL prohibited state completes. If a backend requests the opposite of the transition that is in progress, it will see an error such as "system state transition to read only/write is already in progress". Only one transition state can be set at a time.
>
> Hrm. Well, then that needs to be abundantly clear in the relevant comments.
>
> > Now, I am a little bit concerned about the current function name. How about the name pg_set_wal_prohibit_state(bool), or two functions, pg_set_wal_prohibit_state(void) and pg_unset_wal_prohibit_state(void), or any other suggestions?
>
> How about pg_prohibit_wal(true|false)?
>
LGTM. Used this.
> > > It seems odd that ProcessWALProhibitStateChangeRequest() returns without doing anything if !AmCheckpointerProcess(), rather than having that be an Assert(). Why is it like that?
> >
> > Like AbsorbSyncRequests().
>
> Well, that can be called not from the checkpointer, according to the comments. Specifically from the postmaster, I guess. Again, comments please.
>
Done.
> > If you think that there will be a panic when UpdateFullPageWrites() and/or XLogReportParameters() tries to write WAL because the shared memory state for WAL prohibited is set, it is not like that. For those functions, WAL writes are explicitly enabled by calling LocalSetXLogInsertAllowed().
> >
> > I was under the impression that there won't be any problem if we allow UpdateFullPageWrites() and XLogReportParameters() to write WAL. It can be considered an exception, since it is fine that this WAL record is not streamed to the standby during a graceful failover. I may be wrong, though.
>
> I don't think that's OK. I mean, the purpose of the feature is to prohibit WAL. If it doesn't do that, I believe it will fail to satisfy the principle of least surprise.
>
Yes, you are correct.
I am still working on this. What worries me here is the sequence of WAL records written by the startup process -- UpdateFullPageWrites() generates a record just before the recovery checkpoint record, and XLogReportParameters() generates one just after it, before any other backend can write any WAL record. We might also need to follow the same sequence while changing the system to read-write.
But in our case maintaining this sequence seems to be a little difficult. Let me explain: when a backend executes a function (i.e. pg_prohibit_wal(false)) to make the system read-write, that state change is conveyed by the checkpointer process to all existing backends using a global barrier, and then the checkpointer might want to write those records. While the checkpoint is in progress, existing backends that have already absorbed the barrier can write new records, which might land before the aforesaid WAL record sequence.
We might think that we could write these records before emitting the super barrier, but that might not solve the problem either, because a new backend could connect to the server just after the read-write state change request was made but before the checkpointer could pick it up. Such a backend could write WAL before the checkpointer does (see IsWALProhibited()).
Apart from this, I also had a thought about the recovery_end_command execution that happens just after the end-of-recovery checkpoint in the startup process. First of all, why should we execute this command if we are read-only? I don't think there is any use in booting a read-only server as a standby, which is itself read-only to some extent. Also, pg_basebackup from a read-only server is not allowed, so a new standby cannot be set up from it. IMHO, we should simply error out on an attempt to boot a read-only server as a standby using a standby.signal file. Thoughts?
> > I am not clear on this part. In the attached version I am sending SIGUSR1 instead of SIGINT, which works for me.
>
> OK.
>
> > The attached version does not address all your comments, I'll continue my work on that.
>
> Some thoughts on this version:
>
> +/* Extract last two bits */
> +#define WALPROHIBIT_CURRENT_STATE(stateGeneration) \
> + ((uint32)(stateGeneration) & ((uint32) ((1 << 2) - 1)))
> +#define WALPROHIBIT_NEXT_STATE(stateGeneration) \
> + WALPROHIBIT_CURRENT_STATE((stateGeneration + 1))
>
> This is really confusing. First, the comment looks like it applies to both based on how it is positioned, but that's clearly not true. Second, the naming is really hard to understand. Third, there don't seem to be comments explaining the theory of what is going on here. Fourth, stateGeneration refers not to which generation of state we've got here but to the combination of the state and the generation. However, it's not clear that we ever really use the generation for anything.
>
> I think that the direction you went with this is somewhat different from what I had in mind. That may be OK, but let me just explain the difference. We both had in mind the idea that the low two bits of the state would represent the current state and the upper bits would represent the state generation. However, I wasn't necessarily imagining that the only supported operation was making the combined value go up by 1. For instance, I had thought that perhaps the effect of trying to go read-only when we're in the middle of going read-write would be to cancel the previous operation and start the new one. What you have instead is that it errors out. So in your model a change always has to finish before the next one can start, which in turn means that the sequence is completely linear. In my idea the state+generation might go from say 1 to 7, because trying to go read-write would cancel the previous attempt to go read-only and replace it with an attempt to go the other direction, and from 7 we might go to 9 if somebody now tries to go read-only again before that finishes. In your model, there's never any sort of cancellation of that kind, so you can only go 0->1->2->3->4->5->6->7->8->9 etc.
>
Yes, that made the implementation quite simple. I was under the impression that we would not have so much concurrency that many backends would be trying to change the system state so quickly.
> One disadvantage of the way you've got it from a user perspective is that if I'm writing a tool, I might get an error telling me that the state change I'm trying to make is already in progress, and then I have to retry. With the other design, I might attempt a state change and have it fail because the change can't be completed, but I won't ever fail because I attempt a state change and it can't be started because we're in the wrong starting state. So, with this design, as the tool author, I may not be able to just say, well, I tried to change the state and it didn't work, so report the error to the user. I think with the other approach that would be more viable. But I might be wrong here; it would be interesting to hear what other people think.
>
Thinking a little bit more, I agree that your approach is more viable, as it can cancel a previously in-progress state change. For example, in a future graceful-failover scenario, the master might detect that it has lost the connection to all standbys and immediately call the function to change the system state to read-only. But if it regains the connection soon and wants to go back to read-write, it would need to wait until the previous state change completes. That could be at its worst if the system is quite busy and/or some backend is stuck or too busy to absorb the barrier.
If you want, I will try to change it to the way you had in mind in the next version.
> I dislike the use of the term state_gen or StateGen to refer to the combination of a state and a generation. That seems unintuitive. I'm tempted to propose that we just call it a counter, and, assuming we stick with the design as you now have it, explain it with a comment like this in walprohibit.h:
>
> "There are four possible states. A brand new database cluster is always initially WALPROHIBIT_STATE_READ_WRITE. If the user tries to make it read only, then we enter the state WALPROHIBIT_STATE_GOING_READ_ONLY. When the transition is complete, we enter the state WALPROHIBIT_STATE_READ_ONLY. If the user subsequently tries to make it read write, we will enter the state WALPROHIBIT_STATE_GOING_READ_WRITE. When that transition is complete, we will enter the state WALPROHIBIT_STATE_READ_WRITE. These four state transitions are the only ones possible; for example, if we're currently in state WALPROHIBIT_STATE_GOING_READ_ONLY, an attempt to go read-write will produce an error, and a second attempt to go read-only will not cause a state change. Thus, we can represent the state as a shared-memory counter whose value only ever changes by adding 1. The initial value at postmaster startup is either 0 or 2, depending on whether the control file specifies the system is starting read-only or read-write."
>
Thanks, added the same.
> And then maybe change all the state_gen references to reference wal_prohibit_counter or, where a shorter name is appropriate, counter.
>
Done.
> I think this might be clearer if we used different data types for the state and the state/generation combination, with functions to convert between them. e.g. instead of define WALPROHIBIT_STATE_READ_WRITE 0 etc. maybe do:
>
> typedef enum { ... = 0, ... = 1, ... = 2, ... = 3 } WALProhibitState;
>
> And then instead of WALPROHIBIT_CURRENT_STATE perhaps something like:
>
> static inline WALProhibitState
> GetWALProhibitState(uint32 wal_prohibit_counter)
> {
>     return (WALProhibitState) (wal_prohibit_counter & 3);
> }
>
Done.
> I don't really know why we need WALPROHIBIT_NEXT_STATE at all, honestly. It's just a macro to add 1 to an integer. And you don't even use it consistently. Like pg_alter_wal_prohibit_state() does this:
>
> + /* Server is already in requested state */
> + if (WALPROHIBIT_NEXT_STATE(new_transition_state) == cur_state)
> +     PG_RETURN_VOID();
>
> But then later does this:
>
> + next_state_gen = cur_state_gen + 1;
>
> Which is exactly the same thing as what you computed above using WALPROHIBIT_NEXT_STATE() but spelled differently. I am not exactly sure how to structure this to make it as simple as possible, but I don't think this is it.
>
> Honestly this whole logic here seems correct but a bit hard to follow. Like, maybe:
>
> wal_prohibit_counter = pg_atomic_read_u32(&WALProhibitState->shared_counter);
> switch (GetWALProhibitState(wal_prohibit_counter))
> {
> case WALPROHIBIT_STATE_READ_WRITE:
>     if (!walprohibit) return;
>     increment = true;
>     break;
> case WALPROHIBIT_STATE_GOING_READ_WRITE:
>     if (walprohibit) ereport(ERROR, ...);
>     break;
> ...
> }
>
> And then just:
>
> if (increment)
>     wal_prohibit_counter = pg_atomic_add_fetch_u32(&WALProhibitState->shared_counter, 1);
> target_counter_value = wal_prohibit_counter + 1;
> // random stuff
> // eventually wait until the counter reaches >= target_counter_value
>
> This might not be exactly the right idea though. I'm just looking for a way to make it clearer, because I find it a bit hard to understand right now. Maybe you or someone else will have a better idea.
>
Yeah, this makes the code much cleaner than before; I did the same in the attached version. Thanks again.
> + success = pg_atomic_compare_exchange_u32(&WALProhibitState->shared_state_generation,
> +                                          &cur_state_gen, next_state_gen);
> + Assert(success);
>
> I am almost positive that this is not OK. I think on some platforms atomics just randomly fail some percentage of the time. You always need a retry loop. Anyway, what happens if two people enter this function at the same time and both read the same starting counter value before either does anything?
>
> + /* To be sure that any later reads of memory happen strictly after this. */
> + pg_memory_barrier();
>
> You don't need a memory barrier after use of an atomic. The atomic includes a barrier.
Understood, removed.
Regards,
Amul
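For comparison, the cancellable design described in the quoted review (state in the low two bits, generation in the upper bits, a new request replacing an in-progress one) might be encoded as below. This is a hedged sketch, not code from any patch version: start_transition(), finish_transition() and the macro names are hypothetical, and the encoding is simply one choice that reproduces the 1 -> 7 -> 9 example from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WALPROHIBIT_STATE_BITS 2
#define WALPROHIBIT_STATE_MASK ((1u << WALPROHIBIT_STATE_BITS) - 1)

/* Same assumed numbering as the linear design: odd values are transitions. */
enum
{
    STATE_READ_WRITE = 0,
    STATE_GOING_READ_ONLY = 1,
    STATE_READ_ONLY = 2,
    STATE_GOING_READ_WRITE = 3
};

static inline uint32_t
state_of(uint32_t value)
{
    return value & WALPROHIBIT_STATE_MASK;
}

static inline uint32_t
generation_of(uint32_t value)
{
    return value >> WALPROHIBIT_STATE_BITS;
}

/*
 * Compute the combined value after a request to prohibit (true) or permit
 * (false) WAL. A request that matches the current stable state or the
 * transition already in progress changes nothing; anything else cancels
 * what is in progress by starting a new generation.
 */
static uint32_t
start_transition(uint32_t current, bool walprohibit)
{
    uint32_t    stable = walprohibit ? STATE_READ_ONLY : STATE_READ_WRITE;
    uint32_t    transition = walprohibit ? STATE_GOING_READ_ONLY
                                         : STATE_GOING_READ_WRITE;

    if (state_of(current) == stable || state_of(current) == transition)
        return current;

    return ((generation_of(current) + 1) << WALPROHIBIT_STATE_BITS) | transition;
}

/*
 * Complete the transition in progress. Adding one may carry into the
 * generation bits (3 -> 4); that is harmless here because the generation
 * only has to differ between attempts.
 */
static uint32_t
finish_transition(uint32_t current)
{
    assert((state_of(current) & 1) != 0);   /* must be a transition state */
    return current + 1;
}
```

Under this encoding a waiter that started its request at value 7 can tell that it was cancelled if it later observes anything other than 7 or 8.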
On Thu, Jan 28, 2021 at 7:17 AM Amul Sul <sulamul@gmail.com> wrote:
> I am still on this. The things that worried me here are the wal records sequence being written in the startup process -- UpdateFullPageWrites() generate record just before the recovery check-point record and XLogReportParameters() just after that but before any other backend could write any wal record. We might also need to follow the same sequence while changing the system to read-write.
I was able to chat with Andres about this topic for a while today and he made some proposals which seemed pretty good to me. I can't promise that what I'm about to write is an entirely faithful representation of what he said, but hopefully it's not so far off that he gets mad at me or something.
1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with.
2. Create a new function with a name like XLogAcceptWrites(). Move the following things from StartupXLOG() into that function: (1) the call to UpdateFullPageWrites(), (2) the following block of code that does either CreateEndOfRecoveryRecord() or RequestCheckpoint() or CreateCheckPoint(), (3) the next block of code that runs recovery_end_command, (4) the call to XLogReportParameters(), and (5) the call to CompleteCommitTsInitialization(). Call the new function from the place where we now call XLogReportParameters(). This would mean that (1)-(3) happen later than they do now, which might require some adjustments.
3. If the system is starting up read only (and the read-only state didn't get cleared because of #1 above) then don't call XLogAcceptWrites() at the end of StartupXLOG() and instead have the checkpointer do it later when the system is going read-write for the first time.
--
Robert Haas
EDB: http://www.enterprisedb.com
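The three numbered proposals above amount to a small decision at startup about who runs XLogAcceptWrites() and when. The sketch below models only that decision; StartupPlan and plan_startup() are hypothetical names for illustration, and nothing here touches the control file or real recovery code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the startup process should do. */
typedef struct StartupPlan
{
    bool        clear_read_only;                /* proposal 1 */
    bool        accept_writes_in_startup;       /* normal end of StartupXLOG() */
    bool        accept_writes_in_checkpointer;  /* proposal 3: deferred */
} StartupPlan;

static StartupPlan
plan_startup(bool wal_prohibited, bool archive_recovery_requested)
{
    StartupPlan plan = {false, false, false};

    /*
     * Proposal 1: a read-only server that is asked to do archive recovery
     * drops its read-only state (in memory and in the control file) and
     * then proceeds as usual.
     */
    if (wal_prohibited && archive_recovery_requested)
    {
        plan.clear_read_only = true;
        wal_prohibited = false;
    }

    /*
     * Proposal 3: if the server is still read-only, XLogAcceptWrites() is
     * skipped in StartupXLOG() and left to the checkpointer for the first
     * switch to read-write. Otherwise it runs where it does today.
     */
    if (wal_prohibited)
        plan.accept_writes_in_checkpointer = true;
    else
        plan.accept_writes_in_startup = true;

    /* Exactly one process ends up responsible for XLogAcceptWrites(). */
    assert(plan.accept_writes_in_startup != plan.accept_writes_in_checkpointer);
    return plan;
}
```

Whichever process runs it, the work inside XLogAcceptWrites() keeps the order listed in proposal 2.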
Hi,
On 2021-02-16 17:11:06 -0500, Robert Haas wrote:
> I can't promise that what I'm about to write is an entirely faithful representation of what he said, but hopefully it's not so far off that he gets mad at me or something.
Seems accurate - and also I'm way too tired that I'd be mad ;)
> 1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with.
It seems also to make sense from a behaviour POV to me: Imagine a "smooth" planned failover with ASRO:
1) ASRO on primary
2) promote standby
3) edit primary config to include primary_conninfo, add standby.signal
4) restart "read only primary"
There's not really any spot in which it'd be useful to disable ASRO, right? But 4) should make the node a normal standby.
Greetings,
Andres Freund
On Wed, Feb 17, 2021 at 7:50 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-02-16 17:11:06 -0500, Robert Haas wrote:
Thank you very much to both of you!
> > I can't promise that what I'm about to write is an entirely faithful representation of what he said, but hopefully it's not so far off that he gets mad at me or something.
>
> Seems accurate - and also I'm way too tired that I'd be mad ;)
>
> > 1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with.
>
> It seems also to make sense from a behaviour POV to me: Imagine a "smooth" planned failover with ASRO:
> 1) ASRO on primary
> 2) promote standby
> 3) edit primary config to include primary_conninfo, add standby.signal
> 4) restart "read only primary"
>
> There's not really any spot in which it'd be useful to disable ASRO, right? But 4) should make the node a normal standby.
>
Understood. In the attached version I have made the changes according to what Robert summarised in his previous mail [1].
In addition to that, I also moved the code that updates the control file into XLogAcceptWrites(), which will also get skipped when the system is read-only (WAL prohibited). The system will stay in the crash recovery state, and that will change once we do the end-of-recovery checkpoint and the WAL write operations that were skipped at startup. The benefit of keeping the system in recovery mode is that it fixes my concern [2] that other backends could connect and write WAL records while we were changing the system to read-write. Now no other backend is allowed to write WAL; the UpdateFullPageWrites(), end-of-recovery checkpoint, and XLogReportParameters() operations will be performed in the same sequence as at startup while changing the system to read-write.
Regards,
Amul
[1] http://postgr.es/m/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
[2] http://postgr.es/m/CAAJ_b97xX-nqRyM_uXzecpH9aSgoMROrDNhrg1N51fDCDwoy2g@mail.gmail.com
Hi all,
While testing this feature with the v20 patch, the server crashes with the steps below.
Steps to reproduce:
1. Configure master-slave replication setup.
2. Connect to Slave.
3. Execute the statements below; they crash the server:
SELECT pg_prohibit_wal(true);
SELECT pg_prohibit_wal(false);
-- Slave:
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
postgres=# SELECT pg_prohibit_wal(true);
pg_prohibit_wal
-----------------
(1 row)
postgres=# SELECT pg_prohibit_wal(false);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
-- Below is the stack trace:
[prabhat@localhost bin]$ gdb -q -c /tmp/data_slave/core.35273 postgres
Reading symbols from /home/prabhat/PG/PGsrcNew/postgresql/inst/bin/postgres...done.
[New LWP 35273]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.
#0 0x00007fa876233387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00007fa876233387 in raise () from /lib64/libc.so.6
#1 0x00007fa876234a78 in abort () from /lib64/libc.so.6
#2 0x0000000000aea31c in ExceptionalCondition (conditionName=0xb8c998 "ThisTimeLineID != 0 || IsBootstrapProcessingMode()",
errorType=0xb8956d "FailedAssertion", fileName=0xb897c0 "xlog.c", lineNumber=8611) at assert.c:69
#3 0x0000000000588eb5 in InitXLOGAccess () at xlog.c:8611
#4 0x0000000000588ae6 in LocalSetXLogInsertAllowed () at xlog.c:8483
#5 0x00000000005881bb in XLogAcceptWrites (needChkpt=true, xlogreader=0x0, EndOfLog=0, EndOfLogTLI=0) at xlog.c:8008
#6 0x00000000005751ed in ProcessWALProhibitStateChangeRequest () at walprohibit.c:361
#7 0x000000000088c69f in CheckpointerMain () at checkpointer.c:355
#8 0x000000000059d7db in AuxiliaryProcessMain (argc=2, argv=0x7ffd1290d060) at bootstrap.c:455
#9 0x000000000089fc5f in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5416
#10 0x000000000089f782 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5128
#11 <signal handler called>
#12 0x00007fa8762f2983 in __select_nocancel () from /lib64/libc.so.6
#13 0x000000000089b511 in ServerLoop () at postmaster.c:1700
#14 0x000000000089af00 in PostmasterMain (argc=5, argv=0x15b8460) at postmaster.c:1408
#15 0x000000000079c23a in main (argc=5, argv=0x15b8460) at main.c:209
(gdb)
Kindly let me know if you need more input on this.
On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
On Sun, Mar 14, 2021 at 11:51 PM Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote:
>
> On Tue, Mar 9, 2021 at 3:31 PM Amul Sul <sulamul@gmail.com> wrote:
>>
>> On Thu, Mar 4, 2021 at 11:02 PM Amul Sul <sulamul@gmail.com> wrote:
>> >
>> > On Wed, Mar 3, 2021 at 8:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> > >
>> > > On Tue, Mar 2, 2021 at 7:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>[....]
>
> One of the patches (v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch) from the latest patchset does not apply successfully.
>
> http://cfbot.cputube.org/patch_32_2602.log
>
> === applying patch ./v18-0002-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
>
> Hunk #15 succeeded at 2604 (offset -13 lines).
> 1 out of 15 hunks FAILED -- saving rejects to file src/backend/access/nbtree/nbtpage.c.rej
> patching file src/backend/access/spgist/spgdoinsert.c
>
> It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
>
Thanks. I am getting one more failure, in vacuumlazy.c, on the
latest master head (d75288fb27b); I fixed that in the attached version.
Regards,
Amul
--
With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 19, 2021 at 7:17 PM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>
> Hi all,
> While testing this feature with v20-patch, the server is crashing with below steps.
>
> Steps to reproduce:
> 1. Configure master-slave replication setup.
> 2. Connect to Slave.
> 3. Execute below statements, it will crash the server:
> SELECT pg_prohibit_wal(true);
> SELECT pg_prohibit_wal(false);
>
> -- Slave:
> postgres=# select pg_is_in_recovery();
> pg_is_in_recovery
> -------------------
> t
> (1 row)
>
> postgres=# SELECT pg_prohibit_wal(true);
> pg_prohibit_wal
> -----------------
>
> (1 row)
>
> postgres=# SELECT pg_prohibit_wal(false);
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !?>

Thanks Prabhat.

The assertion failure is due to wrong assumptions about the flag that was used for the XLogAcceptWrites() call. In the case of a standby, the startup process never reaches the place where it could call XLogAcceptWrites() and update the respective flag. Because of this flag value, pg_prohibit_wal() alters the system state while in recovery, which is incorrect.

In the attached version I use an enum value for that flag, so that pg_prohibit_wal() is only allowed in recovery mode iff that flag indicates that XLogAcceptWrites() has been skipped previously.

Regards,
Amul
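As an illustration of the fix described above, here is a minimal sketch of a tri-state flag guarding the state change. The enum, its value names, and wal_prohibit_change_allowed() are hypothetical and chosen only for this example; the actual patch may name and structure this differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical replacement for a boolean "has XLogAcceptWrites() run?" flag.
 * A boolean cannot tell "never reached" (standby) apart from "skipped".
 */
typedef enum
{
    ACCEPT_WRITES_PENDING,  /* startup process never got that far (e.g. standby) */
    ACCEPT_WRITES_SKIPPED,  /* reached the call site but skipped: WAL was prohibited */
    ACCEPT_WRITES_DONE      /* XLogAcceptWrites() has been executed */
} AcceptWritesState;

/*
 * Returns true when a WAL prohibit state change request may be honored.
 * In recovery it is allowed only if XLogAcceptWrites() was skipped earlier.
 */
static bool
wal_prohibit_change_allowed(bool in_recovery, AcceptWritesState state)
{
    assert(state <= ACCEPT_WRITES_DONE);

    if (!in_recovery)
        return true;
    return state == ACCEPT_WRITES_SKIPPED;
}
```

With this shape, the standby case from the report (in recovery, flag still pending) is rejected up front instead of reaching the assertion in InitXLOGAccess().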
Attachment
Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).

Regards,
Amul

On Mon, Mar 22, 2021 at 12:13 PM Amul Sul <sulamul@gmail.com> wrote:
> [....]
Attachment
On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).

Some minor comments on 0001:

Isn't it "might not be running"?
+ errdetail("Checkpointer might not running."),

Isn't it "Try again after sometime"?
+ errhint("Try after sometime again.")));

Can we have ereport(DEBUG1 just to be consistent (although it doesn't make any difference from elog(DEBUG1) with the new log messages introduced in the patch?
+ elog(DEBUG1, "waiting for backends to adopt requested WAL prohibit state change");

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 5, 2021 at 4:45 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>

Thanks Bharath for your review.

> On Mon, Apr 5, 2021 at 11:02 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > Attached is the rebase version for the latest master head(commit # 9f6f1f9b8e6).
>
> Some minor comments on 0001:
> Isn't it "might not be running"?
> + errdetail("Checkpointer might not running."),
>

Ok, fixed in the attached version.

> Isn't it "Try again after sometime"?
> + errhint("Try after sometime again.")));
>

Ok, done.

> Can we have ereport(DEBUG1 just to be consistent(although it doesn't
> make any difference from elog(DEBUG1) with the new log messages
> introduced in the patch?
> + elog(DEBUG1, "waiting for backends to adopt requested WAL
> prohibit state change");
>

I think it's fine; many existing places have used elog(DEBUG1, ....) too.

Regards,
Amul
Attachment
Rotten again, attached the rebased version.

Regards,
Amul

On Mon, Apr 5, 2021 at 5:27 PM Amul Sul <sulamul@gmail.com> wrote:
> [....]
Attachment
Rebased again.

On Wed, Apr 7, 2021 at 12:38 PM Amul Sul <sulamul@gmail.com> wrote:
> [....]
Attachment
On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:
> Rebased again.

I started to look at this today, and didn't get very far, but I have a few comments. The main one is that I don't think this patch implements the design proposed in https://www.postgresql.org/message-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com

The first part of that proposal said this:

"1. If the server starts up and is read-only and ArchiveRecoveryRequested, clear the read-only state in memory and also in the control file, log a message saying that this has been done, and proceed. This makes some other cases simpler to deal with."

As I read it, the patch clears the read-only state in memory, does not clear it in the control file, and does not log a message.

The second part of this proposal was:

"2. Create a new function with a name like XLogAcceptWrites(). Move the following things from StartupXLOG() into that function: (1) the call to UpdateFullPageWrites(), (2) the following block of code that does either CreateEndOfRecoveryRecord() or RequestCheckpoint() or CreateCheckPoint(), (3) the next block of code that runs recovery_end_command, (4) the call to XLogReportParameters(), and (5) the call to CompleteCommitTsInitialization(). Call the new function from the place where we now call XLogReportParameters(). This would mean that (1)-(3) happen later than they do now, which might require some adjustments."

Now you moved that code, but you also moved (6) CompleteCommitTsInitialization(), (7) setting the control file to DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and (9) requesting a checkpoint if we were just promoted. That's not what was proposed. One result of this is that the server now thinks it's in recovery even after the startup process has exited. RecoveryInProgress() is still returning true everywhere. But that is inconsistent with what Andres and I were recommending in http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

I also noticed that 0001 does not compile without 0002, so the separation into multiple patches is not clean. I would actually suggest that the first patch in the series should just create XLogAcceptWrites() with the minimum amount of adjustment to make that work. That would potentially let us commit that change independently, which would be good, because then if we accidentally break something, it'll be easier to pin down to that particular change instead of being mixed with everything else this needs to change.

--
Robert Haas
EDB: http://www.enterprisedb.com
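To make the scope of the proposed function concrete, here is a small standalone sketch of the five steps quoted above, in the order listed. Everything in it is a stand-in: the stub_* functions only record that they ran, and sketch_XLogAcceptWrites() is a hypothetical name, not the function from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* The five steps named in the quoted proposal, in order. */
typedef enum
{
    STEP_UPDATE_FULL_PAGE_WRITES,       /* (1) UpdateFullPageWrites() */
    STEP_END_OF_RECOVERY_OR_CHECKPOINT, /* (2) end-of-recovery record or a checkpoint */
    STEP_RECOVERY_END_COMMAND,          /* (3) run recovery_end_command */
    STEP_REPORT_PARAMETERS,             /* (4) XLogReportParameters() */
    STEP_COMPLETE_COMMIT_TS_INIT        /* (5) CompleteCommitTsInitialization() */
} AcceptStep;

#define MAX_CALLS 8
static AcceptStep call_log[MAX_CALLS];
static size_t ncalls = 0;

/* Record a step so the resulting order can be inspected. */
static void
note(AcceptStep step)
{
    assert(ncalls < MAX_CALLS);
    call_log[ncalls++] = step;
}

/* Stubs standing in for the real routines; they do no actual work. */
static void stub_UpdateFullPageWrites(void)           { note(STEP_UPDATE_FULL_PAGE_WRITES); }
static void stub_EndOfRecoveryOrCheckpoint(void)      { note(STEP_END_OF_RECOVERY_OR_CHECKPOINT); }
static void stub_RunRecoveryEndCommand(void)          { note(STEP_RECOVERY_END_COMMAND); }
static void stub_XLogReportParameters(void)           { note(STEP_REPORT_PARAMETERS); }
static void stub_CompleteCommitTsInitialization(void) { note(STEP_COMPLETE_COMMIT_TS_INIT); }

/* Only steps (1)-(5); nothing about control file state or recovery state. */
static void
sketch_XLogAcceptWrites(void)
{
    stub_UpdateFullPageWrites();
    stub_EndOfRecoveryOrCheckpoint();
    stub_RunRecoveryEndCommand();
    stub_XLogReportParameters();
    stub_CompleteCommitTsInitialization();
}
```

The point of the sketch is what it leaves out: setting DB_IN_PRODUCTION, setting RECOVERY_STATE_DONE, and the post-promotion checkpoint request are not part of the function in the proposal as quoted.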
On Fri, May 7, 2021 at 1:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Apr 12, 2021 at 10:04 AM Amul Sul <sulamul@gmail.com> wrote:
> > Rebased again.
>
> I started to look at this today, and didn't get very far, but I have a
> few comments. The main one is that I don't think this patch implements
> the design proposed in
> https://www.postgresql.org/message-id/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
>
> The first part of that proposal said this:
>
> "1. If the server starts up and is read-only and
> ArchiveRecoveryRequested, clear the read-only state in memory and also
> in the control file, log a message saying that this has been done, and
> proceed. This makes some other cases simpler to deal with."
>
> As I read it, the patch clears the read-only state in memory, does not
> clear it in the control file, and does not log a message.
>

The state in the control file also gets cleared. However, after clearing the in-memory state, the patch does not immediately change the control file; it relies on the next UpdateControlFile() to do that.

Regarding the log message, I think I skipped that intentionally, to avoid a confusing log entry such as "system is now read write" when we start as a hot standby, which is not really read-write.

> The second part of this proposal was:
>
> "2. Create a new function with a name like XLogAcceptWrites(). Move the
> following things from StartupXLOG() into that function: (1) the call
> to UpdateFullPageWrites(), (2) the following block of code that does
> either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> CreateCheckPoint(), (3) the next block of code that runs
> recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> the call to CompleteCommitTsInitialization(). Call the new function
> from the place where we now call XLogReportParameters(). This would
> mean that (1)-(3) happen later than they do now, which might require
> some adjustments."
>
> Now you moved that code, but you also moved (6)
> CompleteCommitTsInitialization(), (7) setting the control file to
> DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> (9) requesting a checkpoint if we were just promoted. That's not what
> was proposed. One result of this is that the server now thinks it's in
> recovery even after the startup process has exited.
> RecoveryInProgress() is still returning true everywhere. But that is
> inconsistent with what Andres and I were recommending in
> http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
>

Regarding the modified approach, I tried to explain why I did this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

> I also noticed that 0001 does not compile without 0002, so the
> separation into multiple patches is not clean. I would actually
> suggest that the first patch in the series should just create
> XLogAcceptWrites() with the minimum amount of adjustment to make that
> work. That would potentially let us commit that change independently,
> which would be good, because then if we accidentally break something,
> it'll be easier to pin down to that particular change instead of being
> mixed with everything else this needs to change.
>

Ok, I will try in the next version.

Regards,
Amul
On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:
> The state in the control file also gets cleared. Though, after
> clearing in memory the state patch doesn't really do the immediate
> change to the control file, it relies on the next UpdateControlFile()
> to do that.

But when will that happen? If you're relying on some very nearby code, that might be OK, but perhaps a comment is in order. If you're just thinking it's going to happen eventually, I think that's not good enough.

> Regarding log message I think I have skipped that intentionally, to
> avoid confusing log as "system is now read write" when we do start as
> hot-standby which is not really read-write.

I think the message should not be phrased that way. In fact, I think now that we've moved to calling this pg_prohibit_wal() rather than ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and maybe some comments and function names as well. Perhaps something like:

system is read only -> WAL is now prohibited
system is read write -> WAL is no longer prohibited

And then for this particular case, maybe something like:

clearing WAL prohibition because the system is in archive recovery

> > The second part of this proposal was:
> >
> > "2. Create a new function with a name like XLogAcceptWrites(). Move the
> > following things from StartupXLOG() into that function: (1) the call
> > to UpdateFullPageWrites(), (2) the following block of code that does
> > either CreateEndOfRecoveryRecord() or RequestCheckpoint() or
> > CreateCheckPoint(), (3) the next block of code that runs
> > recovery_end_command, (4) the call to XLogReportParameters(), and (5)
> > the call to CompleteCommitTsInitialization(). Call the new function
> > from the place where we now call XLogReportParameters(). This would
> > mean that (1)-(3) happen later than they do now, which might require
> > some adjustments."
> >
> > Now you moved that code, but you also moved (6)
> > CompleteCommitTsInitialization(), (7) setting the control file to
> > DB_IN_PRODUCTION, (8) setting the state to RECOVERY_STATE_DONE, and
> > (9) requesting a checkpoint if we were just promoted. That's not what
> > was proposed. One result of this is that the server now thinks it's in
> > recovery even after the startup process has exited.
> > RecoveryInProgress() is still returning true everywhere. But that is
> > inconsistent with what Andres and I were recommending in
> > http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com
>
> Regarding modified approach, I tried to explain that why I did
> this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com

I am not able to understand what problem you are seeing there. If we're in crash recovery, then nobody can connect to the database, so there can't be any concurrent activity. If we're in archive recovery, we now clear the WAL-is-prohibited flag so that we will go read-write directly at the end of recovery. We can and should refuse any effort to call pg_prohibit_wal() during recovery. If we reached the end of crash recovery and are now permitting read-only connections, why would anyone be able to write WAL before the system has been changed to read-write? If that can happen, it's a bug, not a reason to change the design.

Maybe your concern here is about ordering: the process that is going to run XLogAcceptWrites() needs to allow xlog writes locally before we tell other backends that they also can xlog writes; otherwise, some other records could slip in before UpdateFullPageWrites() and similar have run, which we probably don't want. But that's why LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do what we need in this situation, we should be able to tweak it so it does.

If your concern is something else, can you spell it out for me again because I'm not getting it?

--
Robert Haas
EDB: http://www.enterprisedb.com
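The ordering argument above can be modeled in a few lines. The sketch below is a single-process toy, not PostgreSQL code: shared_wal_allowed, try_insert() and go_read_write_local_first() are hypothetical names, and the local override only plays the role the mail assigns to LocalSetXLogInsertAllowed().

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { REC_ACCEPT_WRITES, REC_BACKEND } RecKind;

#define WAL_CAP 8
static RecKind wal[WAL_CAP];            /* toy WAL: just the order of records */
static int wal_len = 0;

static bool shared_wal_allowed = false; /* what ordinary backends consult */

/* An insert succeeds if WAL is allowed globally or the caller has a local override. */
static bool
try_insert(RecKind kind, bool local_override)
{
    if (!shared_wal_allowed && !local_override)
        return false;
    assert(wal_len < WAL_CAP);
    wal[wal_len++] = kind;
    return true;
}

/* The process that runs XLogAcceptWrites() goes read-write locally first. */
static void
go_read_write_local_first(void)
{
    bool local_override = true;         /* step 1: permit WAL for this process only */

    try_insert(REC_ACCEPT_WRITES, local_override); /* step 2: its own records */
    shared_wal_allowed = true;          /* step 3: only now let everyone else write */
}
```

In this model a backend insert attempted before step 3 is refused, so no other record can land ahead of the ones written in step 2.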
On Mon, May 10, 2021 at 9:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, May 9, 2021 at 1:26 AM Amul Sul <sulamul@gmail.com> wrote:
> > The state in the control file also gets cleared. Though, after
> > clearing in memory the state patch doesn't really do the immediate
> > change to the control file, it relies on the next UpdateControlFile()
> > to do that.
>
> But when will that happen? If you're relying on some very nearby code,
> that might be OK, but perhaps a comment is in order. If you're just
> thinking it's going to happen eventually, I think that's not good
> enough.
>

Ok.

> > Regarding log message I think I have skipped that intentionally, to
> > avoid confusing log as "system is now read write" when we do start as
> > hot-standby which is not really read-write.
>
> I think the message should not be phrased that way. In fact, I think
> now that we've moved to calling this pg_prohibit_wal() rather than
> ALTER SYSTEM READ ONLY, a lot of messages need to be rethought, and
> maybe some comments and function names as well. Perhaps something
> like:
>
> system is read only -> WAL is now prohibited
> system is read write -> WAL is no longer prohibited
>
> And then for this particular case, maybe something like:
>
> clearing WAL prohibition because the system is in archive recovery
>

Ok, thanks for the suggestions.

> > > The second part of this proposal was:
> > > [....]
> >
> > Regarding modified approach, I tried to explain that why I did
> > this in http://postgr.es/m/CAAJ_b96Yb4jaW6oU1bVYEBaf=TQ-QL+mMT1ExfwvNZEr7XRyoQ@mail.gmail.com
>
> I am not able to understand what problem you are seeing there. If
> we're in crash recovery, then nobody can connect to the database, so
> there can't be any concurrent activity. If we're in archive recovery,
> we now clear the WAL-is-prohibited flag so that we will go read-write
> directly at the end of recovery. We can and should refuse any effort
> to call pg_prohibit_wal() during recovery. If we reached the end of
> crash recovery and are now permitting read-only connections, why would
> anyone be able to write WAL before the system has been changed to
> read-write? If that can happen, it's a bug, not a reason to change the
> design.
>
> Maybe your concern here is about ordering: the process that is going
> to run XLogAcceptWrites() needs to allow xlog writes locally before we
> tell other backends that they also can xlog writes; otherwise, some
> other records could slip in before UpdateFullPageWrites() and similar
> have run, which we probably don't want. But that's why
> LocalSetXLogInsertAllowed() was invented, and if it doesn't quite do
> what we need in this situation, we should be able to tweak it so it
> does.
>

Yes, we don't want any write to slip in before UpdateFullPageWrites(). Recently [1], we decided to let the Checkpointer process call XLogAcceptWrites() unconditionally.

The problem here is that when a backend executes pg_prohibit_wal(false) to make the system read-write, the WAL prohibit state is set to in-progress (i.e. WALPROHIBIT_STATE_GOING_READ_WRITE) and then the Checkpointer is signaled. Next, the Checkpointer conveys this system change to all existing backends using a global barrier, and after that the final WAL prohibit state is set to read-write (i.e. WALPROHIBIT_STATE_READ_WRITE). While the Checkpointer is still conveying this global barrier, a new backend can connect and write a new record, because the in-progress read-write state is equivalent to the final read-write state iff LocalXLogInsertAllowed != 0 for that backend. That new record could slip in before, or in between, the records to be written by XLogAcceptWrites().

[1] http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

Regards,
Amul
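The interleaving described above can be replayed deterministically in a small model. This is only a sketch of the behavior as stated in the mail: the three-value enum ordering, new_backend_may_insert() and replay_problem_interleaving() are assumptions made for illustration, and real backends run concurrently rather than in a fixed script.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    WALPROHIBIT_STATE_READ_ONLY,
    WALPROHIBIT_STATE_GOING_READ_WRITE,
    WALPROHIBIT_STATE_READ_WRITE
} WALProhibitState;

typedef enum { REC_BACKEND, REC_ACCEPT_WRITES } RecKind;

static WALProhibitState shared_state = WALPROHIBIT_STATE_READ_ONLY;
static RecKind wal[8];                  /* toy WAL: just the order of records */
static int wal_len = 0;

static void
append(RecKind kind)
{
    assert(wal_len < 8);
    wal[wal_len++] = kind;
}

/*
 * As described in the mail: for a freshly connected backend the in-progress
 * read-write state is treated the same as the final read-write state.
 */
static bool
new_backend_may_insert(void)
{
    return shared_state == WALPROHIBIT_STATE_GOING_READ_WRITE ||
           shared_state == WALPROHIBIT_STATE_READ_WRITE;
}

/* One possible schedule that exhibits the problem. */
static void
replay_problem_interleaving(void)
{
    shared_state = WALPROHIBIT_STATE_GOING_READ_WRITE; /* pg_prohibit_wal(false) */

    if (new_backend_may_insert())       /* a new connection writes right away */
        append(REC_BACKEND);

    append(REC_ACCEPT_WRITES);          /* checkpointer's XLogAcceptWrites() records */
    shared_state = WALPROHIBIT_STATE_READ_WRITE;
}
```

After the replay, the backend's record sits ahead of the XLogAcceptWrites() records, which is the slip-in the mail is concerned about.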
On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Yes, we don't want any write slip in before UpdateFullPageWrites().
> Recently[1], we have decided to let the Checkpointer process call
> XLogAcceptWrites() unconditionally.
>
> Here problem is that when a backend executes the
> pg_prohibit_wal(false) function to make the system read-write, the wal
> prohibited state is set to inprogress(ie.
> WALPROHIBIT_STATE_GOING_READ_WRITE) and then Checkpointer is signaled.
> Next, Checkpointer will convey this system change to all existing
> backends using a global barrier, and after that final wal prohibited
> state is set to the read-write(i.e. WALPROHIBIT_STATE_READ_WRITE).
> While Checkpointer is in the progress of conveying this global
> barrier, any new backend can connect at that time and can write a new
> record because the inprogress read-write state is equivalent to the
> final read-write state iff LocalXLogInsertAllowed != 0 for that
> backend. And, that new record could slip in before or in between
> records to be written by XLogAcceptWrites().
>
> 1] http://postgr.es/m/CA+TgmoZYQN=rcYE-iXWnjdvMAoH+7Jaqsif3U2k8xqXipBaS7A@mail.gmail.com

But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE and the checkpointer is signaled, shouldn't the checkpointer first call XLogAcceptWrites() and only then inform other backends through the global barrier? Are we worried about the case where we have written the WAL in XLogAcceptWrites() but later cannot set the state to WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the backends first, but call XLogAcceptWrites() before setting the state to WALPROHIBIT_STATE_READ_WRITE?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 11, 2021 at 11:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 10, 2021 at 10:25 PM Amul Sul <sulamul@gmail.com> wrote:
> > [....]
>
> But, IIUC, once the state is set to WALPROHIBIT_STATE_GOING_READ_WRITE
> and signaled to the checkpointer. The checkpointer should first call
> XLogAcceptWrites and then it should inform other backends through the
> global barrier? Are we worried that if we have written the WAL in
> XLogAcceptWrites but later if we could not set the state to
> WALPROHIBIT_STATE_READ_WRITE? Then maybe we can inform all the
> backend first but before setting the state to
> WALPROHIBIT_STATE_READ_WRITE, we can call XLogAcceptWrites?
>

I get why you think that; I wasn't very precise in describing the problem.

Any new backend that connects right after the shared memory state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will by default be allowed to write WAL. Such backends can perform write operations before the checkpointer runs XLogAcceptWrites(). It is also possible for a backend to connect while the checkpointer is performing XLogAcceptWrites() and write WAL at that point. So calling XLogAcceptWrites() first does not really solve my concern. Note that in the previous patch XLogAcceptWrites() does get called before the global barrier is emitted.

Please let me know if it is still not clear to you, thanks.

Regards,
Amul
On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:

> I get why you think that, I wasn't very precise in briefing the problem.
>
> Any new backend that gets connected right after the shared memory
> state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by
> default allowed to do the WAL writes. Such backends can perform write
> operation before the checkpointer does the XLogAcceptWrites().

Okay, that makes sense now. But my next question is: why do we allow backends to write WAL in the WALPROHIBIT_STATE_GOING_READ_WRITE state? Why don't we wait until the shared memory state is changed to WALPROHIBIT_STATE_READ_WRITE?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote:
> > [....]
>
> Okay, make sense now. But my next question is why do we allow backends
> to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we
> wait until the shared memory state is changed to
> WALPROHIBIT_STATE_READ_WRITE?
>

Ok, good question.

Let's first try to understand the Checkpointer's work.

When the Checkpointer sees that the WAL prohibit state is an in-progress state, it first emits the global barrier and waits until all backends absorb it. After that it sets the final requested WAL prohibit state.

When other backends absorb the barrier, they take the appropriate action (e.g. abort the read-write transaction if moving to read-only). The LocalXLogInsertAllowed flag is also reset in that backend, which then needs to call XLogInsertAllowed() to get the right value for it; that value in turn decides whether WAL writes are permitted or prohibited.

Consider an example where the system is trying to change to read-write, and for that the WAL prohibit state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before the Checkpointer starts its work. If we want to treat the system as read-only for the WALPROHIBIT_STATE_GOING_READ_WRITE state as well, then we need to think about the behavior of a backend that has absorbed the barrier and reset the LocalXLogInsertAllowed flag. That backend is eventually going to call XLogInsertAllowed() to get the actual value, and on seeing the current state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed to the same value it had before, for the read-only state.

Now the question is: when should this value get reset again so that the backend can become read-write? We are done with the barrier, so that backend is never going to come back to read-write again.

One solution, I think, is to set the final state before emitting the barrier, but as per the current design it should be set after all barrier processing. Let's see what Robert says on this.

Regards,
Amul
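The stuck-cache concern can be shown with a toy model of one backend. The names below are illustrative only: local_allowed stands in for LocalXLogInsertAllowed (assumed here to be a tri-state with -1 meaning "not computed yet"), absorb_barrier() and xlog_insert_allowed() are hypothetical, and the sketch deliberately implements the variant under discussion, in which in-progress states count as prohibited.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    WALPROHIBIT_STATE_READ_ONLY,
    WALPROHIBIT_STATE_GOING_READ_WRITE,
    WALPROHIBIT_STATE_READ_WRITE
} WALProhibitState;

static WALProhibitState shared_state = WALPROHIBIT_STATE_READ_ONLY;

/* Per-backend cache: -1 = unknown, 0 = WAL prohibited, 1 = WAL permitted. */
static int local_allowed = -1;

/* Absorbing the global barrier forgets the cached answer. */
static void
absorb_barrier(void)
{
    local_allowed = -1;
}

/*
 * Variant under discussion: only the final READ_WRITE state permits WAL.
 * The answer is computed once and then served from the cache.
 */
static bool
xlog_insert_allowed(void)
{
    if (local_allowed < 0)
        local_allowed = (shared_state == WALPROHIBIT_STATE_READ_WRITE) ? 1 : 0;
    return local_allowed == 1;
}
```

If the backend recomputes while the state is still WALPROHIBIT_STATE_GOING_READ_WRITE, it caches "prohibited" and keeps that answer after the shared state moves to WALPROHIBIT_STATE_READ_WRITE, because no further barrier arrives to reset the cache.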
On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [....]
>
> Ok, good question.
>
> Now let's first try to understand the Checkpointer's work.
>
> When Checkpointer sees the wal prohibited state is an in-progress state, then
> it first emits the global barrier and waits until all backends absorb that.
> After that it set the final requested WAL prohibit state.
>
> When other backends absorb those barriers then appropriate action is taken
> (e.g. abort the read-write transaction if moving to read-only) by them. Also,
> LocalXLogInsertAllowed flags get reset in it and that backend needs to call
> XLogInsertAllowed() to get the right value for it, which further decides WAL
> writes permitted or prohibited.
>
> Consider an example that the system is trying to change to read-write and for
> that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before
> Checkpointer starts its work. If we want to treat that system as read-only for
> the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. Then we might need to
> think about the behavior of the backend that has absorbed the barrier and reset
> the LocalXLogInsertAllowed flag. That backend eventually going to call
> XLogInsertAllowed() to get the actual value for it and by seeing the current
> state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed
> again same as it was before for the read-only state.

I might be missing something, but I assume the behavior should be like this:

1. If the state is changing from WALPROHIBIT_STATE_READ_WRITE -> WALPROHIBIT_STATE_READ_ONLY, then as soon as a backend processes the barrier, it can immediately abort any read-write transaction (and stop allowing WAL writes), because once we have ensured that every session has responded that it no longer has a read-write transaction, we can safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to WALPROHIBIT_STATE_READ_ONLY.

2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> WALPROHIBIT_STATE_READ_WRITE, then we don't need to let the backend consider the system as read-write; instead, we should wait until the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.

So your problem is that on receiving the barrier we need to call LocalXLogInsertAllowed() from the backend, but how does that matter? You can still make IsWALProhibited() return true. I don't know the complete code so I might be missing something, but at least that is what I would expect from the design POV.

Other than this point, I think the state names READ_ONLY and READ_WRITE are a bit confusing, no? These states actually represent whether WAL is allowed or not, but READ_ONLY and READ_WRITE read as if we are putting the system into a read-only state. For example, a write operation on an unlogged table will be allowed, I guess, because that will not generate WAL until you commit (because commit generates WAL), right? So practically, we are just blocking WAL, not the write operation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 11, 2021 at 3:38 PM Amul Sul <sulamul@gmail.com> wrote: > > > > On Tue, May 11, 2021 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, May 11, 2021 at 2:16 PM Amul Sul <sulamul@gmail.com> wrote: > > > > > > > I get why you think that, I wasn't very precise in briefing the problem. > > > > > > > > Any new backend that gets connected right after the shared memory > > > > state changes to WALPROHIBIT_STATE_GOING_READ_WRITE will be by > > > > default allowed to do the WAL writes. Such backends can perform write > > > > operation before the checkpointer does the XLogAcceptWrites(). > > > > > > Okay, make sense now. But my next question is why do we allow backends > > > to write WAL in WALPROHIBIT_STATE_GOING_READ_WRITE state? why don't we > > > wait until the shared memory state is changed to > > > WALPROHIBIT_STATE_READ_WRITE? > > > > > > > Ok, good question. > > > > Now let's first try to understand the Checkpointer's work. > > > > When Checkpointer sees the wal prohibited state is an in-progress state, then > > it first emits the global barrier and waits until all backers absorb that. > > After that it set the final requested WAL prohibit state. > > > > When other backends absorb those barriers then appropriate action is taken > > (e.g. abort the read-write transaction if moving to read-only) by them. Also, > > LocalXLogInsertAllowed flags get reset in it and that backend needs to call > > XLogInsertAllowed() to get the right value for it, which further decides WAL > > writes permitted or prohibited. > > > > Consider an example that the system is trying to change to read-write and for > > that wal prohibited state is set to WALPROHIBIT_STATE_GOING_READ_WRITE before > > Checkpointer starts its work. If we want to treat that system as read-only for > > the WALPROHIBIT_STATE_GOING_READ_WRITE state as well. 
Then we might need to > > think about the behavior of the backend that has absorbed the barrier and reset > > the LocalXLogInsertAllowed flag. That backend eventually going to call > > XLogInsertAllowed() to get the actual value for it and by seeing the current > > state as WALPROHIBIT_STATE_GOING_READ_WRITE, it will set LocalXLogInsertAllowed > > again same as it was before for the read-only state. > > I might be missing something, but assume the behavior should be like this > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > the barrier, we can immediately abort any read-write transaction(and > stop allowing WAL writing), because once we ensure that all session > has responded that now they have no read-write transaction then we can > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > WALPROHIBIT_STATE_READ_ONLY. > Yes, that's what the current patch is doing from the first patch version. > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > to consider the system as read-write, instead, we should wait until > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > I am sure that only not enough will have the same issue where LocalXLogInsertAllowed gets set the same as the read-only as described in my previous reply. > So your problem is that on receiving the barrier we need to call > LocalXLogInsertAllowed() from the backend, but how does that matter? > you can still make IsWALProhibited() return true. > Note that LocalXLogInsertAllowed is a local flag for a backend, not a function, and in the server code at every place, we don't rely on IsWALProhibited() instead we do rely on LocalXLogInsertAllowed flags before wal writes and that check made via XLogInsertAllowed(). 
> I don't know the complete code so I might be missing something but at > least that is what I would expect from the design POV. > > > Other than this point, I think the state names READ_ONLY, READ_WRITE > are a bit confusing no? because actually, these states represent > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we > are putting the system under a Read-only state. For example, if you > are doing some write operation on an unlogged table will be allowed, I > guess because that will not generate the WAL until you commit (because > commit generates WAL) right? so practically, we are just blocking the > WAL, not the write operation. > This read-only and read-write are the wal prohibited states though we are using for read-only/read-write system in the discussion and the complete macro name is WALPROHIBIT_STATE_READ_ONLY and WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make implementation confusing. Regards, Amul
On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote: > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I might be missing something, but assume the behavior should be like this > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > the barrier, we can immediately abort any read-write transaction(and > > stop allowing WAL writing), because once we ensure that all session > > has responded that now they have no read-write transaction then we can > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > WALPROHIBIT_STATE_READ_ONLY. > > > > Yes, that's what the current patch is doing from the first patch version. > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > to consider the system as read-write, instead, we should wait until > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > I am sure that only not enough will have the same issue where > LocalXLogInsertAllowed gets set the same as the read-only as described in > my previous reply. Okay, but while browsing the code I do not see any direct if condition based on the "LocalXLogInsertAllowed" variable, can you point me to some references? I only see one if check on this variable and that is in XLogInsertAllowed() function, but now in XLogInsertAllowed() function, you are already checking IsWALProhibited. No? > > Other than this point, I think the state names READ_ONLY, READ_WRITE > > are a bit confusing no? because actually, these states represent > > whether WAL is allowed or not, but READ_ONLY, READ_WRITE seems like we > > are putting the system under a Read-only state. 
For example, if you > > are doing some write operation on an unlogged table will be allowed, I > > guess because that will not generate the WAL until you commit (because > > commit generates WAL) right? so practically, we are just blocking the > > WAL, not the write operation. > > > > This read-only and read-write are the wal prohibited states though we > are using for read-only/read-write system in the discussion and the > complete macro name is WALPROHIBIT_STATE_READ_ONLY and > WALPROHIBIT_STATE_READ_WRITE, I am not sure why that would make > implementation confusing. Fine, I am not too particular about these names. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote: > > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I might be missing something, but assume the behavior should be like this > > > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > > the barrier, we can immediately abort any read-write transaction(and > > > stop allowing WAL writing), because once we ensure that all session > > > has responded that now they have no read-write transaction then we can > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > > WALPROHIBIT_STATE_READ_ONLY. > > > > > > > Yes, that's what the current patch is doing from the first patch version. > > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > > to consider the system as read-write, instead, we should wait until > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > > > > I am sure that only not enough will have the same issue where > > LocalXLogInsertAllowed gets set the same as the read-only as described in > > my previous reply. > > Okay, but while browsing the code I do not see any direct if condition > based on the "LocalXLogInsertAllowed" variable, can you point me to > some references? > I only see one if check on this variable and that is in > XLogInsertAllowed() function, but now in XLogInsertAllowed() function, > you are already checking IsWALProhibited. No? > I am not sure I understood this. Where am I checking IsWALProhibited()? IsWALProhibited() is called by XLogInsertAllowed() once when LocalXLogInsertAllowed is in a reset state, and that result will be cached in LocalXLogInsertAllowed and will be used in the subsequent XLogInsertAllowed() call. 
Regards, Amul
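The caching behaviour described in this message can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not code from the patch: the tri-state encoding of LocalXLogInsertAllowed (-1 meaning "reset"), the AbsorbWALProhibitBarrier() helper, and the plain global standing in for shared memory are all hypothetical, and IsWALProhibited() is assumed to treat every state other than READ_WRITE as prohibited.

```c
#include <assert.h>
#include <stdbool.h>

/* State names follow the thread; the storage below is a hypothetical stand-in. */
typedef enum
{
    WALPROHIBIT_STATE_READ_WRITE,
    WALPROHIBIT_STATE_GOING_READ_ONLY,
    WALPROHIBIT_STATE_READ_ONLY,
    WALPROHIBIT_STATE_GOING_READ_WRITE
} WALProhibitState;

static WALProhibitState shared_state = WALPROHIBIT_STATE_READ_ONLY;

/* Backend-local cache (assumed encoding): -1 = reset, 0 = no, 1 = yes. */
static int LocalXLogInsertAllowed = -1;

/* Counts reads of the shared state, for illustration only. */
static int shared_state_lookups = 0;

/* Assumed variant: anything other than READ_WRITE prohibits WAL. */
static bool
IsWALProhibited(void)
{
    shared_state_lookups++;
    return shared_state != WALPROHIBIT_STATE_READ_WRITE;
}

/* Consults the shared state only when the cache is reset, then caches. */
static bool
XLogInsertAllowed(void)
{
    if (LocalXLogInsertAllowed < 0)
        LocalXLogInsertAllowed = IsWALProhibited() ? 0 : 1;
    return LocalXLogInsertAllowed == 1;
}

/* Hypothetical helper: what a backend does when it absorbs the barrier. */
static void
AbsorbWALProhibitBarrier(void)
{
    LocalXLogInsertAllowed = -1;
}

/* Walks through the stale-cache scenario raised in the thread. */
static void
demo_stale_cache(void)
{
    shared_state_lookups = 0;
    shared_state = WALPROHIBIT_STATE_GOING_READ_WRITE;
    AbsorbWALProhibitBarrier();
    assert(!XLogInsertAllowed());       /* caches "not allowed" */

    shared_state = WALPROHIBIT_STATE_READ_WRITE;
    assert(!XLogInsertAllowed());       /* still the stale cached answer */
    assert(shared_state_lookups == 1);  /* shared state was read only once */
}
```

In this model the backend keeps refusing WAL after the shared state reaches READ_WRITE, because nothing resets its cache once the barrier has already been absorbed. That is the concern being discussed above.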
On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote: > > On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote: > > > > > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I might be missing something, but assume the behavior should be like this > > > > > > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE > > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process > > > > the barrier, we can immediately abort any read-write transaction(and > > > > stop allowing WAL writing), because once we ensure that all session > > > > has responded that now they have no read-write transaction then we can > > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to > > > > WALPROHIBIT_STATE_READ_ONLY. > > > > > > > > > > Yes, that's what the current patch is doing from the first patch version. > > > > > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY -> > > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend > > > > to consider the system as read-write, instead, we should wait until > > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE. > > > > > > > > > > I am sure that only not enough will have the same issue where > > > LocalXLogInsertAllowed gets set the same as the read-only as described in > > > my previous reply. > > > > Okay, but while browsing the code I do not see any direct if condition > > based on the "LocalXLogInsertAllowed" variable, can you point me to > > some references? > > I only see one if check on this variable and that is in > > XLogInsertAllowed() function, but now in XLogInsertAllowed() function, > > you are already checking IsWALProhibited. No? > > > > I am not sure I understood this. Where am I checking IsWALProhibited()? 
> > IsWALProhibited() is called by XLogInsertAllowed() once when > LocalXLogInsertAllowed is in a reset state, and that result will be > cached in LocalXLogInsertAllowed and will be used in the subsequent > XLogInsertAllowed() call. Okay, got what you were trying to say. But that can be easily fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE then what we can do is don't allow to write the WAL but let's not set the LocalXLogInsertAllowed to 0. So until we are in the intermediate state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely on GetWALProhibitState(), I know this will add a performance penalty but this is for the short period until we are in the intermediate state. After that as soon as it will set to WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set LocalXLogInsertAllowed to 1. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
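Dilip's suggestion (refuse WAL during the intermediate state, but do not cache that answer) might look roughly like the following. This is a simplified sketch, not the patch: ReadSharedWALProhibitState() is a hypothetical accessor standing in for GetWALProhibitState(), the -1/0/1 encoding of the local flag is assumed, and shared memory is modelled as a plain global.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    WALPROHIBIT_STATE_READ_WRITE,
    WALPROHIBIT_STATE_GOING_READ_ONLY,
    WALPROHIBIT_STATE_READ_ONLY,
    WALPROHIBIT_STATE_GOING_READ_WRITE
} WALProhibitState;

/* Hypothetical stand-in for the shared-memory state. */
static WALProhibitState shared_state = WALPROHIBIT_STATE_READ_ONLY;

/* Backend-local cache (assumed encoding): -1 = reset, 0 = no, 1 = yes. */
static int LocalXLogInsertAllowed = -1;

/* Hypothetical accessor; the patch's GetWALProhibitState() differs. */
static WALProhibitState
ReadSharedWALProhibitState(void)
{
    return shared_state;
}

static bool
XLogInsertAllowed(void)
{
    WALProhibitState state;

    if (LocalXLogInsertAllowed >= 0)
        return LocalXLogInsertAllowed == 1;

    state = ReadSharedWALProhibitState();

    /*
     * Intermediate state: refuse WAL but leave the cache reset, so the
     * next call looks at the shared state again.
     */
    if (state == WALPROHIBIT_STATE_GOING_READ_WRITE)
        return false;

    LocalXLogInsertAllowed = (state == WALPROHIBIT_STATE_READ_WRITE) ? 1 : 0;
    return LocalXLogInsertAllowed == 1;
}

/* The backend recovers on its own once the final state is published. */
static void
demo_no_cache_in_intermediate_state(void)
{
    LocalXLogInsertAllowed = -1;
    shared_state = WALPROHIBIT_STATE_GOING_READ_WRITE;
    assert(!XLogInsertAllowed());
    assert(LocalXLogInsertAllowed == -1);   /* nothing was cached */

    shared_state = WALPROHIBIT_STATE_READ_WRITE;
    assert(XLogInsertAllowed());
    assert(LocalXLogInsertAllowed == 1);    /* cached from here on */
}
```

The cost Dilip mentions shows up as one shared-state read per XLogInsertAllowed() call for as long as the intermediate state lasts.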
On Tue, 11 May 2021 at 7:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, May 11, 2021 at 6:56 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 6:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 4:50 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > On Tue, May 11, 2021 at 4:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I might be missing something, but assume the behavior should be like this
> > > >
> > > > 1. If the state is getting changed from WALPROHIBIT_STATE_READ_WRITE
> > > > -> WALPROHIBIT_STATE_READ_ONLY, then as soon as the backend process
> > > > the barrier, we can immediately abort any read-write transaction(and
> > > > stop allowing WAL writing), because once we ensure that all session
> > > > has responded that now they have no read-write transaction then we can
> > > > safely change the state from WALPROHIBIT_STATE_GOING_READ_ONLY to
> > > > WALPROHIBIT_STATE_READ_ONLY.
> > > >
> > >
> > > Yes, that's what the current patch is doing from the first patch version.
> > >
> > > > 2. OTOH, if we are changing from WALPROHIBIT_STATE_READ_ONLY ->
> > > > WALPROHIBIT_STATE_READ_WRITE, then we don't need to allow the backend
> > > > to consider the system as read-write, instead, we should wait until
> > > > the shared state is changed to WALPROHIBIT_STATE_READ_WRITE.
> > > >
> > >
> > > I am sure that only not enough will have the same issue where
> > > LocalXLogInsertAllowed gets set the same as the read-only as described in
> > > my previous reply.
> >
> > Okay, but while browsing the code I do not see any direct if condition
> > based on the "LocalXLogInsertAllowed" variable, can you point me to
> > some references?
> > I only see one if check on this variable and that is in
> > XLogInsertAllowed() function, but now in XLogInsertAllowed() function,
> > you are already checking IsWALProhibited. No?
> >
>
> I am not sure I understood this. Where am I checking IsWALProhibited()?
>
> IsWALProhibited() is called by XLogInsertAllowed() once when
> LocalXLogInsertAllowed is in a reset state, and that result will be
> cached in LocalXLogInsertAllowed and will be used in the subsequent
> XLogInsertAllowed() call.
Okay, got what you were trying to say. But that can be easily
fixable, I mean if the state is WALPROHIBIT_STATE_GOING_READ_WRITE
then what we can do is don't allow to write the WAL but let's not set
the LocalXLogInsertAllowed to 0. So until we are in the intermediate
state WALPROHIBIT_STATE_GOING_READ_WRITE, we will always have to rely
on GetWALProhibitState(), I know this will add a performance penalty
but this is for the short period until we are in the intermediate
state. After that as soon as it will set to
WALPROHIBIT_STATE_READ_WRITE then the XLogInsertAllowed() will set
LocalXLogInsertAllowed to 1.
I think I have a much easier solution than this; I will post it with the updated patch set tomorrow.
Regards,
Amul
On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:
> I think I have much easier solution than this, will post that with update version patch set tomorrow.

I don't know what you have in mind, but based on this discussion, it
seems to me that we should just have 5 states instead of 4:

1. WAL is permitted.
2. WAL is being prohibited but some backends may not know about the change yet.
3. WAL is prohibited.
4. WAL is in the process of being permitted but XLogAcceptWrites() may
not have been called yet.
5. WAL is in the process of being permitted and XLogAcceptWrites() has
been called but some backends may not know about the change yet.

If we're in state #3 and someone does pg_prohibit_wal(false) then we
enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
state #5, and pushes out a barrier. Then it waits for the barrier to
be absorbed and, when it has been, it moves us to state #1. Then if
someone does pg_prohibit_wal(true) we move to state #2. The
checkpointer pushes out a barrier and waits for it to be absorbed.
Then it calls XLogFlush() and afterward moves us to state #3.

We can have any (reasonable) number of states that we want. There's
nothing magical about 4.

I also entirely agree with Dilip that we should do some renaming to
get rid of the read-write/read-only terminology, now that this is no
longer part of the syntax. In fact I made the exact same point in my
last review. The WALPROHIBIT_STATE_* constants are just one thing of
many that needs to be included in that renaming.

--
Robert Haas
EDB: http://www.enterprisedb.com
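The five-state proposal above can be modelled as a small state machine. The sketch below is illustrative only: the enum names, RequestWalProhibit(), CheckpointerProcessRequest(), and the three stub functions are hypothetical, and the stubs merely count calls where the real server would run XLogAcceptWrites(), emit and wait for a global barrier, or call XLogFlush().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the five states listed above. */
typedef enum
{
    WAL_PERMITTED = 1,              /* state #1 */
    WAL_GOING_PROHIBITED,           /* state #2 */
    WAL_PROHIBITED,                 /* state #3 */
    WAL_GOING_PERMITTED_PENDING,    /* state #4: accept-writes not yet run */
    WAL_GOING_PERMITTED_ACCEPTED    /* state #5: accept-writes done */
} WalState;

static WalState wal_state = WAL_PERMITTED;

/* Call counters for the stubs below (illustration only). */
static int accept_writes_calls = 0;
static int barriers_absorbed = 0;
static int flush_calls = 0;

static void AcceptWritesStub(void)       { accept_writes_calls++; }
static void EmitBarrierAndWaitStub(void) { barriers_absorbed++; }
static void FlushStub(void)              { flush_calls++; }

/* Models pg_prohibit_wal(prohibit); returns false if no transition applies. */
static bool
RequestWalProhibit(bool prohibit)
{
    if (prohibit && wal_state == WAL_PERMITTED)
    {
        wal_state = WAL_GOING_PROHIBITED;
        return true;
    }
    if (!prohibit && wal_state == WAL_PROHIBITED)
    {
        wal_state = WAL_GOING_PERMITTED_PENDING;
        return true;
    }
    return false;
}

/* Models the checkpointer driving an in-progress state to its final state. */
static void
CheckpointerProcessRequest(void)
{
    if (wal_state == WAL_GOING_PERMITTED_PENDING)
    {
        AcceptWritesStub();
        wal_state = WAL_GOING_PERMITTED_ACCEPTED;
    }

    if (wal_state == WAL_GOING_PERMITTED_ACCEPTED)
    {
        EmitBarrierAndWaitStub();
        wal_state = WAL_PERMITTED;
    }
    else if (wal_state == WAL_GOING_PROHIBITED)
    {
        EmitBarrierAndWaitStub();
        FlushStub();
        wal_state = WAL_PROHIBITED;
    }
}

/* #3 -> #4 -> #5 -> #1 -> #2 -> #3, as described in the message above. */
static void
demo_round_trip(void)
{
    accept_writes_calls = barriers_absorbed = flush_calls = 0;
    wal_state = WAL_PROHIBITED;

    assert(RequestWalProhibit(false));
    assert(wal_state == WAL_GOING_PERMITTED_PENDING);
    CheckpointerProcessRequest();
    assert(wal_state == WAL_PERMITTED && accept_writes_calls == 1);

    assert(RequestWalProhibit(true));
    assert(wal_state == WAL_GOING_PROHIBITED);
    CheckpointerProcessRequest();
    assert(wal_state == WAL_PROHIBITED && flush_calls == 1);
}
```

Keeping #4 and #5 as distinct values lets the checkpointer tell from the state alone whether the accept-writes step is still outstanding, for example if it is re-entered part way through.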
On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote: > > I think I have much easier solution than this, will post that with update version patch set tomorrow. > > I don't know what you have in mind, but based on this discussion, it > seems to me that we should just have 5 states instead of 4: > > 1. WAL is permitted. > 2. WAL is being prohibited but some backends may not know about the change yet. > 3. WAL is prohibited. > 4. WAL is in the process of being permitted but XLogAcceptWrites() may > not have been called yet. > 5. WAL is in the process of being permitted and XLogAcceptWrites() has > been called but some backends may not know about the change yet. > > If we're in state #3 and someone does pg_prohibit_wal(false) then we > enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to > state #5, and pushes out a barrier. Then it waits for the barrier to > be absorbed and, when it has been, it moves us to state #1. Then if > someone does pg_prohibit_wal(true) we move to state #2. The > checkpointer pushes out a barrier and waits for it to be absorbed. > Then it calls XLogFlush() and afterward moves us to state #3. > > We can have any (reasonable) number of states that we want. There's > nothing magical about 4. Your idea makes sense, but IMHO, if we are first writing XLogAcceptWrites() and then pushing out the barrier, then I don't understand the meaning of having state #4. I mean whenever any backend receives the barrier the system will always be in state #5. So what do we want to do with state #4? Is it just to make the state machine better? I mean in the checkpoint process, we don't need separate "if checks" whether the XLogAcceptWrites() is called or not, instead we can just rely on the state, if it is #4 then we have to call XLogAcceptWrites(). If so then I think it's okay to have an additional state, just wanted to know what idea you had in mind? 
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 12, 2021 at 11:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 11:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, May 11, 2021 at 11:17 AM Amul Sul <sulamul@gmail.com> wrote:
> > > I think I have much easier solution than this, will post that with update version patch set tomorrow.
> >
> > I don't know what you have in mind, but based on this discussion, it
> > seems to me that we should just have 5 states instead of 4:
> >

I had two different ideas in mind. The first one is somewhat aligned with
the approach you mentioned below, but without introducing a new state.
Basically, what we want is to prevent any backend that connects to the
server from writing a WAL record while we are doing XLogAcceptWrites().
We already have a flag that records XLogAcceptWrites() having been
skipped; while that flag is set (i.e. XLogAcceptWrites() was skipped
previously), treat the system as read-only (i.e. WAL prohibited) until
XLogAcceptWrites() finishes. In that case, our IsWALProhibited() function
will be:

bool
IsWALProhibited(void)
{
    WALProhibitState cur_state;

    /*
     * If essential operations are needed to enable wal writes are skipped
     * previously then treat this state as WAL prohibited until that gets
     * done.
     */
    if (unlikely(GetXLogWriteAllowedState() == XLOG_ACCEPT_WRITES_SKIPPED))
        return true;

    cur_state = GetWALProhibitState(GetWALProhibitCounter());

    return (cur_state != WALPROHIBIT_STATE_READ_WRITE &&
            cur_state != WALPROHIBIT_STATE_GOING_READ_WRITE);
}

Another idea, which I want to propose and have implemented in the attached
version, is to make IsWALProhibited() something like this:

bool
IsWALProhibited(void)
{
    /* Other than read-write state will be considered as read-only */
    return (GetWALProhibitState(GetWALProhibitCounter()) !=
            WALPROHIBIT_STATE_READ_WRITE);
}

But this needs some additional changes to the CompleteWALProhibitChange()
function, where the final in-memory system state update happens at a
different point, i.e. before or after emitting a global barrier, depending
on the direction of the change.

When the in-memory WAL prohibited state is _GOING_READ_WRITE, the in-memory
state immediately changes to _READ_WRITE. After that, the global barrier is
emitted so that other backends change their local state. This should be
harmless because a _READ_WRITE system can have both _READ_ONLY and
_READ_WRITE backends.

But when the in-memory WAL prohibited state is _GOING_READ_ONLY, the
in-memory update to the final state does not happen before the global
barrier. We cannot say the system is _READ_ONLY until we ensure that all
backends are _READ_ONLY.

For more details please have a look at CompleteWALProhibitChange().

Note that XLogAcceptWrites() happens before CompleteWALProhibitChange(), so
any backend that connects while XLogAcceptWrites() is in progress will not
be allowed to write WAL until it finishes and CompleteWALProhibitChange()
has executed.

The second approach is much better, IMO, because IsWALProhibited() is much
lighter; it runs a number of times when a new backend connects and/or its
cached LocalXLogInsertAllowed value gets reset. Perhaps you could argue
that the number of calls might not be that large due to the locally cached
value in LocalXLogInsertAllowed, but I am in favour of doing less work.

Apart from this, I made a separate patch for the XLogAcceptWrites()
refactoring. Now each patch can be compiled without having the next patch
on top of it.

> > 1. WAL is permitted.
> > 2. WAL is being prohibited but some backends may not know about the change yet.
> > 3. WAL is prohibited.
> > 4. WAL is in the process of being permitted but XLogAcceptWrites() may
> > not have been called yet.
> > 5. WAL is in the process of being permitted and XLogAcceptWrites() has
> > been called but some backends may not know about the change yet.
> >
> > If we're in state #3 and someone does pg_prohibit_wal(false) then we
> > enter state #4. The checkpointer calls XLogAcceptWrites(), moves us to
> > state #5, and pushes out a barrier. Then it waits for the barrier to
> > be absorbed and, when it has been, it moves us to state #1. Then if
> > someone does pg_prohibit_wal(true) we move to state #2. The
> > checkpointer pushes out a barrier and waits for it to be absorbed.
> > Then it calls XLogFlush() and afterward moves us to state #3.
> >
> > We can have any (reasonable) number of states that we want. There's
> > nothing magical about 4.
>
> Your idea makes sense, but IMHO, if we are first writing
> XLogAcceptWrites() and then pushing out the barrier, then I don't
> understand the meaning of having state #4. I mean whenever any
> backend receives the barrier the system will always be in state #5.
> So what do we want to do with state #4?
>
> Is it just to make the state machine better? I mean in the checkpoint
> process, we don't need separate "if checks" whether the
> XLogAcceptWrites() is called or not, instead we can just rely on the
> state, if it is #4 then we have to call XLogAcceptWrites(). If so
> then I think it's okay to have an additional state, just wanted to
> know what idea you had in mind?
>

AFAICU, the proposed state #4 is meant to restrict newly connected backends
from writing WAL. My first approach does the same by changing
IsWALProhibited() a bit.

Regards,
Amul
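The ordering Amul describes for CompleteWALProhibitChange() can be sketched as follows. This is a toy model based only on the prose in this message, not on the attached patch: the global standing in for shared memory, EmitBarrierAndWaitStub(), and state_seen_at_barrier are hypothetical, and IsWALProhibited() here drops the counter argument used in the real code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    WALPROHIBIT_STATE_READ_WRITE,
    WALPROHIBIT_STATE_GOING_READ_ONLY,
    WALPROHIBIT_STATE_READ_ONLY,
    WALPROHIBIT_STATE_GOING_READ_WRITE
} WALProhibitState;

/* Hypothetical stand-in for the shared-memory state. */
static WALProhibitState shared_state = WALPROHIBIT_STATE_READ_WRITE;

/* Records the state that backends absorbing the barrier would observe. */
static WALProhibitState state_seen_at_barrier = WALPROHIBIT_STATE_READ_WRITE;

/* Second approach: anything other than READ_WRITE counts as prohibited. */
static bool
IsWALProhibited(void)
{
    return shared_state != WALPROHIBIT_STATE_READ_WRITE;
}

/* Stub for emitting the global barrier and waiting for absorption. */
static void
EmitBarrierAndWaitStub(void)
{
    state_seen_at_barrier = shared_state;
}

static void
CompleteWALProhibitChange(void)
{
    switch (shared_state)
    {
        case WALPROHIBIT_STATE_GOING_READ_WRITE:
            /* Publish the final state first, then tell the backends. */
            shared_state = WALPROHIBIT_STATE_READ_WRITE;
            EmitBarrierAndWaitStub();
            break;

        case WALPROHIBIT_STATE_GOING_READ_ONLY:
            /* All backends must stop writing before the system is read-only. */
            EmitBarrierAndWaitStub();
            shared_state = WALPROHIBIT_STATE_READ_ONLY;
            break;

        default:
            break;              /* no change in progress */
    }
}

/* Checks what absorbing backends would see in each direction. */
static void
demo_ordering(void)
{
    shared_state = WALPROHIBIT_STATE_GOING_READ_WRITE;
    assert(IsWALProhibited());          /* new backends are still refused */
    CompleteWALProhibitChange();
    assert(state_seen_at_barrier == WALPROHIBIT_STATE_READ_WRITE);
    assert(!IsWALProhibited());

    shared_state = WALPROHIBIT_STATE_GOING_READ_ONLY;
    CompleteWALProhibitChange();
    assert(state_seen_at_barrier == WALPROHIBIT_STATE_GOING_READ_ONLY);
    assert(shared_state == WALPROHIBIT_STATE_READ_ONLY);
}
```

Because the final READ_WRITE value is visible before the barrier goes out, a backend that resets its local flag while absorbing the barrier reads a state that already permits WAL.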
Attachment
On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Your idea makes sense, but IMHO, if we are first writing
> XLogAcceptWrites() and then pushing out the barrier, then I don't
> understand the meaning of having state #4. I mean whenever any
> backend receives the barrier the system will always be in state #5.
> So what do we want to do with state #4?

Well, if you don't have that, how does the checkpointer know that it's
supposed to push out the barrier?

You and Amul both seem to want to merge states #4 and #5. But how to
make that work? Basically what you are both saying is that, after we
move into the "going read-write" state, backends aren't immediately
told that they can write WAL, but have to keep checking back. But this
could be expensive. If you have one state that means that the
checkpointer has been requested to run XLogAcceptWrites() and push out
a barrier, and another state to mean that it has done so, then you
avoid that. Maybe that overhead wouldn't be large anyway, but it seems
like it's only necessary because you're trying to merge two states
which, from a logical point of view, are separate.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, May 13, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 12, 2021 at 1:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Your idea makes sense, but IMHO, if we are first writing > > XLogAcceptWrites() and then pushing out the barrier, then I don't > > understand the meaning of having state #4. I mean whenever any > > backend receives the barrier the system will always be in state #5. > > So what do we want to do with state #4? > > Well, if you don't have that, how does the checkpointer know that it's > supposed to push out the barrier? > > You and Amul both seem to want to merge states #4 and #5. But how to > make that work? Basically what you are both saying is that, after we > move into the "going read-write" state, backends aren't immediately > told that they can write WAL, but have to keep checking back. But this > could be expensive. If you have one state that means that the > checkpointer has been requested to run XLogAcceptWrites() and push out > a barrier, and another state to mean that it has done so, then you > avoid that. Maybe that overhead wouldn't be large anyway, but it seems > like it's only necessary because you're trying to merge two states > which, from a logical point of view, are separate. I don't have an objection to having 5 states, just wanted to understand your reasoning. So it makes sense to me. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote:
>

Thanks for the updated patch, while going through I noticed this comment.

+ /*
+  * WAL prohibit state changes not allowed during recovery except the crash
+  * recovery case.
+  */
+ PreventCommandDuringRecovery("pg_prohibit_wal()");

Why do we need to allow state change during recovery? Do you still
need it after the latest changes you discussed here, I mean now
XLogAcceptWrites() being called before sending barrier to backends.
So now we are not afraid that the backend will write WAL before we
call XLogAcceptWrites(). So now IMHO, we don't need to keep the
system in recovery until pg_prohibit_wal(false) is called, right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote: > > > > Thanks for the updated patch, while going through I noticed this comment. > > + /* > + * WAL prohibit state changes not allowed during recovery except the crash > + * recovery case. > + */ > + PreventCommandDuringRecovery("pg_prohibit_wal()"); > > Why do we need to allow state change during recovery? Do you still > need it after the latest changes you discussed here, I mean now > XLogAcceptWrites() being called before sending barrier to backends. > So now we are not afraid that the backend will write WAL before we > call XLogAcceptWrites(). So now IMHO, we don't need to keep the > system in recovery until pg_prohibit_wal(false) is called, right? > Your understanding is correct, and the previous patch also does the same, but the code comment is wrong. Fixed in the attached version, also rebased for the latest master head. Sorry for the confusion. Regards, Amul
Attachment
On Thu, May 13, 2021 at 2:54 PM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, May 13, 2021 at 12:36 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, May 12, 2021 at 5:55 PM Amul Sul <sulamul@gmail.com> wrote: > > > > > > > Thanks for the updated patch, while going through I noticed this comment. > > > > + /* > > + * WAL prohibit state changes not allowed during recovery except the crash > > + * recovery case. > > + */ > > + PreventCommandDuringRecovery("pg_prohibit_wal()"); > > > > Why do we need to allow state change during recovery? Do you still > > need it after the latest changes you discussed here, I mean now > > XLogAcceptWrites() being called before sending barrier to backends. > > So now we are not afraid that the backend will write WAL before we > > call XLogAcceptWrites(). So now IMHO, we don't need to keep the > > system in recovery until pg_prohibit_wal(false) is called, right? > > > > Your understanding is correct, and the previous patch also does the same, but > the code comment is wrong. Fixed in the attached version, also rebased for the > latest master head. Sorry for the confusion. Great thanks. I will review the remaining patch soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Great thanks. I will review the remaining patch soon.

I have reviewed v28-0003, and I have some comments on this.

===
@@ -126,9 +127,14 @@ XLogBeginInsert(void)
     Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
     Assert(mainrdata_len == 0);

+    /*
+     * WAL permission must have checked before entering the critical section.
+     * Otherwise, WAL prohibited error will force system panic.
+     */
+    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
+
     /* cross-check on whether we should be here or not */
-    if (!XLogInsertAllowed())
-        elog(ERROR, "cannot make new WAL entries during recovery");
+    CheckWALPermitted();

We must not call CheckWALPermitted inside the critical section,
instead if we are here we must be sure that
WAL is permitted, so better put an assert. Even if that is ensured by
some other mean then also I don't
see any reason for calling this error generating function.

===

+CheckWALPermitted(void)
+{
+    if (!XLogInsertAllowed())
+        ereport(ERROR,
+                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+                 errmsg("system is now read only")));
+

system is now read only -> wal is prohibited (in error message)

===

- * We can't write WAL in recovery mode, so there's no point trying to
+ * We can't write WAL during read-only mode, so there's no point trying to

during read-only mode -> if WAL is prohibited or WAL recovery in
progress (add recovery in progress and also modify read-only to wal
prohibited)

===

+    if (!XLogInsertAllowed())
     {
         GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
-        GUC_check_errmsg("cannot set transaction read-write mode during recovery");
+        GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
         return false;
     }

system is read only -> WAL is prohibited

===

I think that's all, I have to say about 0003.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Great thanks. I will review the remaining patch soon.
>
> I have reviewed v28-0003, and I have some comments on this.
>
> ===
> @@ -126,9 +127,14 @@ XLogBeginInsert(void)
>      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
>      Assert(mainrdata_len == 0);
>
> +    /*
> +     * WAL permission must have checked before entering the critical section.
> +     * Otherwise, WAL prohibited error will force system panic.
> +     */
> +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
> +
>      /* cross-check on whether we should be here or not */
> -    if (!XLogInsertAllowed())
> -        elog(ERROR, "cannot make new WAL entries during recovery");
> +    CheckWALPermitted();
>
> We must not call CheckWALPermitted inside the critical section,
> instead if we are here we must be sure that WAL is permitted, so
> better put an assert. Even if that is ensured by some other mean then
> also I don't see any reason for calling this error generating function.
>
I understand that we should not have an error inside a critical section but
this check is not wrong. Patch has enough checking so that errors due to WAL
prohibited state must not hit in the critical section, see assert just before
CheckWALPermitted(). Before entering into the critical section, we do have an
explicit WAL prohibited check. And to make sure that check has been done for
all current critical section for the wal writes, we have aforesaid assert
checking, for more detail on this please have a look at the "WAL prohibited
system state" section of src/backend/access/transam/README added in 0004 patch.
This assertion also ensures that future development does not miss the WAL
prohibited state check before entering into a newly added critical section for
WAL writes.
> ===
>
> +CheckWALPermitted(void)
> +{
> +    if (!XLogInsertAllowed())
> +        ereport(ERROR,
> +                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
> +                 errmsg("system is now read only")));
> +
>
> system is now read only -> wal is prohibited (in error message)
>
> ===
>
> - * We can't write WAL in recovery mode, so there's no point trying to
> + * We can't write WAL during read-only mode, so there's no point trying to
>
> during read-only mode -> if WAL is prohibited or WAL recovery in
> progress (add recovery in progress and also modify read-only to wal
> prohibited)
>
> ===
>
> +    if (!XLogInsertAllowed())
>      {
>          GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
> -        GUC_check_errmsg("cannot set transaction read-write mode during recovery");
> +        GUC_check_errmsg("cannot set transaction read-write mode while system is read only");
>          return false;
>      }
>
> system is read only -> WAL is prohibited
>
> ===
Fixed all in the attached version.
>
> I think that's all, I have to say about 0003.
>
Thanks for the review.
Regards,
Amul
On Mon, May 17, 2021 at 11:48 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Great thanks. I will review the remaining patch soon.
> >
> > I have reviewed v28-0003, and I have some comments on this.
> >
> > ===
> > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> >      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
> >      Assert(mainrdata_len == 0);
> >
> > +    /*
> > +     * WAL permission must have checked before entering the critical section.
> > +     * Otherwise, WAL prohibited error will force system panic.
> > +     */
> > +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
> > +
> >      /* cross-check on whether we should be here or not */
> > -    if (!XLogInsertAllowed())
> > -        elog(ERROR, "cannot make new WAL entries during recovery");
> > +    CheckWALPermitted();
> >
> > We must not call CheckWALPermitted inside the critical section,
> > instead if we are here we must be sure that WAL is permitted, so
> > better put an assert. Even if that is ensured by some other mean then
> > also I don't see any reason for calling this error generating function.
> >
>
> I understand that we should not have an error inside a critical section but
> this check is not wrong. Patch has enough checking so that errors due to WAL
> prohibited state must not hit in the critical section, see assert just before
> CheckWALPermitted(). Before entering into the critical section, we do have an
> explicit WAL prohibited check. And to make sure that check has been done for
> all current critical section for the wal writes, we have aforesaid assert
> checking, for more detail on this please have a look at the "WAL prohibited
> system state" section of src/backend/access/transam/README added in 0004 patch.
> This assertion also ensures that future development does not miss the WAL
> prohibited state check before entering into a newly added critical section for
> WAL writes.
I think we need the CheckWALPermitted() check in XLogBeginInsert()
because XLogBeginInsert() may be called outside a critical section,
e.g. from pg_truncate_visibility_map(), and then we should error out.
So this check makes sense to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
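The rule discussed above (check WAL permission before entering a critical section, assert inside XLogBeginInsert() that the check happened, and raise an ordinary error only outside a critical section) can be modelled in a few lines of standalone C. This is only a sketch, not PostgreSQL code: the lower-case names are hypothetical stand-ins for CheckWALPermitted(), CritSectionCount and XLogBeginInsert(), return codes replace ereport(ERROR) and Assert(), and resetting the checked flag when the outermost critical section ends is an assumption about how the patch behaves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the names only mirror the ones in the patch. */
typedef enum { WALPERMIT_UNCHECKED, WALPERMIT_CHECKED } WalPermitCheckState;
typedef enum { INSERT_OK, INSERT_ERROR, INSERT_ASSERT_FAILED } InsertResult;

static bool wal_prohibited = false;     /* models the system-wide WAL state */
static int  crit_section_count = 0;     /* models CritSectionCount */
static WalPermitCheckState walpermit_checked_state = WALPERMIT_UNCHECKED;

/*
 * Caller-side check, done BEFORE entering a critical section.
 * Returns false where the real function would ereport(ERROR).
 */
static bool
check_wal_permitted(void)
{
    if (wal_prohibited)
        return false;
    walpermit_checked_state = WALPERMIT_CHECKED;
    return true;
}

static void
start_crit_section(void)
{
    crit_section_count++;
}

static void
end_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
    /* Assumption: the "checked" flag is reset when the outermost section ends. */
    if (crit_section_count == 0)
        walpermit_checked_state = WALPERMIT_UNCHECKED;
}

/* Models the start of XLogBeginInsert(). */
static InsertResult
xlog_begin_insert(void)
{
    /* Models the Assert: inside a critical section, the check must be done. */
    if (crit_section_count > 0 && walpermit_checked_state == WALPERMIT_UNCHECKED)
        return INSERT_ASSERT_FAILED;

    /* Cross-check; an error here is only safe outside a critical section. */
    if (wal_prohibited)
        return INSERT_ERROR;

    return INSERT_OK;
}
```

A caller that writes WAL in a critical section would call check_wal_permitted() first, then start_crit_section(), then xlog_begin_insert(). A caller outside any critical section can rely on the cross-check alone, which is the pg_truncate_visibility_map() case mentioned above.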
Attached is a rebase for the latest master head. Also, I added one more
refactoring patch that deduplicates the code setting the database state in
the control file. The same code to set the database state is also needed
for this feature.
Regards,
Amul
On Mon, May 17, 2021 at 1:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 11:48 AM Amul Sul <sulamul@gmail.com> wrote:
> >
> > On Sat, May 15, 2021 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Great thanks. I will review the remaining patch soon.
> > >
> > > I have reviewed v28-0003, and I have some comments on this.
> > >
> > > ===
> > > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> > >      Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
> > >      Assert(mainrdata_len == 0);
> > >
> > > +    /*
> > > +     * WAL permission must have checked before entering the critical section.
> > > +     * Otherwise, WAL prohibited error will force system panic.
> > > +     */
> > > +    Assert(walpermit_checked_state != WALPERMIT_UNCHECKED || !CritSectionCount);
> > > +
> > >      /* cross-check on whether we should be here or not */
> > > -    if (!XLogInsertAllowed())
> > > -        elog(ERROR, "cannot make new WAL entries during recovery");
> > > +    CheckWALPermitted();
> > >
> > > We must not call CheckWALPermitted inside the critical section,
> > > instead if we are here we must be sure that WAL is permitted, so
> > > better put an assert. Even if that is ensured by some other mean then
> > > also I don't see any reason for calling this error generating function.
> > >
> >
> > I understand that we should not have an error inside a critical section but
> > this check is not wrong. Patch has enough checking so that errors due to WAL
> > prohibited state must not hit in the critical section, see assert just before
> > CheckWALPermitted(). Before entering into the critical section, we do have an
> > explicit WAL prohibited check. And to make sure that check has been done for
> > all current critical section for the wal writes, we have aforesaid assert
> > checking, for more detail on this please have a look at the "WAL prohibited
> > system state" section of src/backend/access/transam/README added in 0004 patch.
> > This assertion also ensures that future development does not miss the WAL
> > prohibited state check before entering into a newly added critical section for
> > WAL writes.
>
> I think we need the CheckWALPermitted() check in XLogBeginInsert()
> because XLogBeginInsert() may be called outside a critical section,
> e.g. from pg_truncate_visibility_map(), and then we should error out.
> So this check makes sense to me.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is a rebase for the latest master head. Also, I added one more
> refactoring patch that deduplicates the code setting the database state in
> the control file. The same code to set the database state is also needed
> for this feature.
I started studying 0001 today and found that it rearranged the order
of operations in StartupXLOG() more than I was expecting. It does, as
per previous discussions, move a bunch of things to the place where we
now call XLogReportParameters(). But, unsatisfyingly, InRecovery = false
and XLogReaderFree() then have to move down even further. Since the
goal here is to get to a situation where we sometimes
XLogAcceptWrites() after InRecovery = false, it didn't seem nice for
this refactoring patch to still end up with a situation where this
stuff happens while InRecovery = true. In fact, with the patch, the
amount of code that runs with InRecovery = true actually *increases*,
which is not what I think should be happening here. That's why the
patch ends up having to adjust SetMultiXactIdLimit to not
Assert(!InRecovery).
And then I started to wonder how this was ever going to work as part
of the larger patch set, because as you have it here,
XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
checkpointer is calling that at a later time after the user issues
pg_prohibit_wal(false), it's going to have none of those things. So I
had a quick look at that part of the code and found this in
checkpointer.c:
    XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
For those following along from home, the additional "true" is a bool
needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
this is very satisfying. The whole purpose of passing the xlogreader
is so we can figure out whether we need a checkpoint (never mind the
question of whether the existing algorithm for determining that is
really sensible) but now we need a second argument that basically
serves the same purpose since one of the two callers to this function
won't have an xlogreader. And then we're passing the EndOfLog and
EndOfLogTLI as dummy values which seems like it's probably just
totally wrong, but if for some reason it works correctly there sure
don't seem to be any comments explaining why.
So I started doing a bit of hacking myself and ended up with the
attached, which I think is not completely the right thing yet but I
think it's better than your version. I split this into three parts.
0001 splits up the logic that currently decides whether to write an
end-of-recovery record or a checkpoint record and if the latter how
the checkpoint ought to be performed into two functions.
DetermineRecoveryXlogAction() figures out what we want to do, and
PerformRecoveryXlogAction() does it. It also moves the code to run
recovery_end_command and related stuff into a new function
CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
CleanupAfterArchiveRecovery() to just before we
XLogReportParameters(). Because of the refactoring done by 0001, this
is only a small amount of code movement. Because of the separation
between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
the latter doesn't need the xlogreader. So we can do
DetermineRecoveryXlogAction() at the same time as now, while the
xlogreader is available, and then we don't need it later when we
PerformRecoveryXlogAction(), because we already know what we need to
know. I think this is all fine as far as it goes.
My 0003 is where I see some lingering problems. It creates
XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
need the xlogreader. But it doesn't really solve the problem of how
checkpointer.c would be able to call this function with proper
arguments. It is at least better in not needing two arguments to
decide what to do, but how is checkpointer.c supposed to know what to
pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
seem too problematic, because that value has been stored in four (!)
places inside XLogCtl by this code:
    LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
    XLogCtl->LogwrtResult = LogwrtResult;
    XLogCtl->LogwrtRqst.Write = EndOfLog;
    XLogCtl->LogwrtRqst.Flush = EndOfLog;
Presumably we could relatively easily change things around so that we
fish one of those values ... probably one of the "write" values ..
back out of XLogCtl instead of passing it as a parameter. That would
work just as well from the checkpointer as from the startup process,
and there seems to be no way for the value to change until after
XLogAcceptWrites() has been called, so it seems fine. But that doesn't
help for the other arguments. What I'm thinking is that we should just
arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
then XLogAcceptWrites() can fish those values out of there as well,
which should be enough to make it work and do the same thing
regardless of which process is calling it. But I have run out of time
for today so have not explored coding that up.
--
Robert Haas
EDB: http://www.enterprisedb.com
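A rough sketch of the idea floated at the end of this message, storing the end-of-recovery values in shared state so that XLogAcceptWrites() needs no arguments, might look like the following. It is a standalone model under stated assumptions, not PostgreSQL code: the struct, its field names, the lower-case function names and the enum values are all hypothetical, a plain static struct stands in for XLogCtl in shared memory, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL typedefs of the same names. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Hypothetical result of DetermineRecoveryXlogAction(). */
typedef enum
{
    RECOVERY_XLOG_NOTHING,
    RECOVERY_XLOG_WRITE_END_OF_RECOVERY,
    RECOVERY_XLOG_WRITE_CHECKPOINT
} RecoveryXlogAction;

/* Models the extra fields that would live in XLogCtl (shared memory). */
typedef struct
{
    XLogRecPtr          end_of_log;
    TimeLineID          end_of_log_tli;
    RecoveryXlogAction  xlog_action;
    bool                saved;                /* startup process filled this in */
    bool                wal_writes_accepted;
} SharedRecoveryState;

static SharedRecoveryState shared_state;

/*
 * Called by the startup process while the xlogreader is still available,
 * i.e. at the point where the action is determined.
 */
static void
save_end_of_recovery_state(XLogRecPtr end_of_log, TimeLineID end_of_log_tli,
                           RecoveryXlogAction action)
{
    shared_state.end_of_log = end_of_log;
    shared_state.end_of_log_tli = end_of_log_tli;
    shared_state.xlog_action = action;
    shared_state.saved = true;
}

/*
 * Takes no arguments, so it behaves the same whether the startup process
 * or the checkpointer calls it.  Returns the action it would perform.
 */
static RecoveryXlogAction
xlog_accept_writes(void)
{
    assert(shared_state.saved);

    /* The real function would perform the action and the cleanup here. */
    shared_state.wal_writes_accepted = true;
    return shared_state.xlog_action;
}
```

The point of the design is that the decision is made once, where the information exists, and every later caller only reads it back.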
On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> My 0003 is where I see some lingering problems. It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog?
On further study, I found another problem: the way my patch set leaves
things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
will not be correctly initialized in any process other than the
startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
would just be skipped. Your 0001 seems to have the same problem. You
added Assert(AmStartupProcess()) to the inside of the if
(ArchiveRecoveryRequested) block, but that doesn't fix anything.
Outside the startup process, ArchiveRecoveryRequested will always be
false, but the point is that the associated stuff should be done if
ArchiveRecoveryRequested would have been true in the startup process.
Both of our patch sets leave things in a state where that would never
happen, which is not good. Unless I'm missing something, it seems like
maybe you didn't test your patches to verify that, when the
XLogAcceptWrites() call comes from the checkpointer, all the same
things happen that would have happened had it been called from the
startup process. That would be a really good thing to have tested
before posting your patches.
As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
several TLIs stored in XLogCtl. None of them seem to be precisely the
same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
enough. I found out pretty quickly through testing that replayEndTLI
isn't always valid -- it ends up 0 if we don't enter recovery. That's
not really a problem, though, because we only need it to be valid if
ArchiveRecoveryRequested. The code that initializes and updates it
seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
= true will force InRecovery = true. So it looks to me like
replayEndTLI will always be initialized in the cases where we need a
value. It's not yet entirely clear to me if it has to have the same
value as EndOfLogTLI. I find this code comment quite mysterious:
    /*
     * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
     * the end-of-log. It could be different from the timeline that EndOfLog
     * nominally belongs to, if there was a timeline switch in that segment,
     * and we were reading the old WAL from a segment belonging to a higher
     * timeline.
     */
    EndOfLogTLI = xlogreader->seg.ws_tli;
The thing is, if we were reading old WAL from a segment belonging to a
higher timeline, wouldn't we have switched to that new timeline?
Suppose we want WAL segment 246 from TLI 1, but we don't have that
segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
the TLI 2 version, we'd need to have TLI 2 in the history of the
recovery_target_timeline. And if that is the case, then we would have
to replay through the record where the timeline changes. And if we do
that, then the discrepancy postulated by the comment cannot still
exist by the time we reach this code, because this code is only
reached after we finish WAL redo. So I'm baffled as to how this can
happen, but considering how many cases there are in this code, I sure
can't promise that it doesn't. The fact that we have few tests for any
of this doesn't help either.
--
Robert Haas
EDB: http://www.enterprisedb.com
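The ArchiveRecoveryRequested problem described in this message comes down to a process-local variable being consulted in a process that never set it. The standalone model below illustrates only that point. It is not PostgreSQL code: processes are modelled as plain structs, all names are hypothetical, and publishing the flag through a shared copy is an illustrative assumption rather than something proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Models a flag kept in shared memory, visible to every process. */
typedef struct
{
    bool archive_recovery_requested;
} SharedFlags;

/* Models the per-process copy of a static variable. */
typedef struct
{
    bool archive_recovery_requested;
} ProcessLocals;

static SharedFlags shared_flags;

/* Only the startup process learns the real value; here it also publishes it. */
static void
startup_process_init(ProcessLocals *self, bool requested)
{
    self->archive_recovery_requested = requested;
    shared_flags.archive_recovery_requested = requested;
}

/* Any other process keeps the default value of its local copy. */
static void
other_process_init(ProcessLocals *self)
{
    self->archive_recovery_requested = false;
}

/* Deciding from the local copy: correct only in the startup process. */
static bool
cleanup_runs_using_local(const ProcessLocals *self)
{
    return self->archive_recovery_requested;
}

/* Deciding from the shared copy: the same answer in every process. */
static bool
cleanup_runs_using_shared(void)
{
    return shared_flags.archive_recovery_requested;
}
```

With archive recovery requested, the checkpointer's local copy stays false, so a decision based on it would skip the cleanup, while a decision based on the shared copy would not.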
On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > My 0003 is where I see some lingering problems. It creates
> > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > need the xlogreader. But it doesn't really solve the problem of how
> > checkpointer.c would be able to call this function with proper
> > arguments. It is at least better in not needing two arguments to
> > decide what to do, but how is checkpointer.c supposed to know what to
> > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > what to pass for EndOfLogTLI and EndOfLog?
>
> On further study, I found another problem: the way my patch set leaves
> things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> will not be correctly initialized in any process other than the
> startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> would just be skipped. Your 0001 seems to have the same problem. You
> added Assert(AmStartupProcess()) to the inside of the if
> (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> Outside the startup process, ArchiveRecoveryRequested will always be
> false, but the point is that the associated stuff should be done if
> ArchiveRecoveryRequested would have been true in the startup process.
> Both of our patch sets leave things in a state where that would never
> happen, which is not good. Unless I'm missing something, it seems like
> maybe you didn't test your patches to verify that, when the
> XLogAcceptWrites() call comes from the checkpointer, all the same
> things happen that would have happened had it been called from the
> startup process. That would be a really good thing to have tested
> before posting your patches.
>
My bad, I am extremely sorry about that. I usually do test my patches,
but somehow I failed to test this change due to manually testing the
whole ASRO feature and hurrying in posting the newest version.
I will try to be more careful next time.
> As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> several TLIs stored in XLogCtl. None of them seem to be precisely the
> same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
> enough. I found out pretty quickly through testing that replayEndTLI
> isn't always valid -- it ends up 0 if we don't enter recovery. That's
> not really a problem, though, because we only need it to be valid if
> ArchiveRecoveryRequested. The code that initializes and updates it
> seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> = true will force InRecovery = true. So it looks to me like
> replayEndTLI will always be initialized in the cases where we need a
> value. It's not yet entirely clear to me if it has to have the same
> value as EndOfLogTLI. I find this code comment quite mysterious:
>
>     /*
>      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>      * the end-of-log. It could be different from the timeline that EndOfLog
>      * nominally belongs to, if there was a timeline switch in that segment,
>      * and we were reading the old WAL from a segment belonging to a higher
>      * timeline.
>      */
>     EndOfLogTLI = xlogreader->seg.ws_tli;
>
> The thing is, if we were reading old WAL from a segment belonging to a
> higher timeline, wouldn't we have switched to that new timeline?
AFAIUC, by browsing the code, yes, we are switching to the new
timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also
the same as the EndOfLog that we needed when ArchiveRecoveryRequested
is true.
I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.
> Suppose we want WAL segment 246 from TLI 1, but we don't have that
> segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> the TLI 2 version, we'd need to have TLI 2 in the history of the
> recovery_target_timeline. And if that is the case, then we would have
> to replay through the record where the timeline changes. And if we do
> that, then the discrepancy postulated by the comment cannot still
> exist by the time we reach this code, because this code is only
> reached after we finish WAL redo. So I'm baffled as to how this can
> happen, but considering how many cases there are in this code, I sure
> can't promise that it doesn't. The fact that we have few tests for any
> of this doesn't help either.
I am not an expert in this area, but will try to spend some more time
on understanding and testing.
[1] postgr.es/m/555DD101.7080209@iki.fi
Regards,
Amul
On Wed, Jul 28, 2021 at 4:37 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 2:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > My 0003 is where I see some lingering problems. It creates
> > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > > need the xlogreader. But it doesn't really solve the problem of how
> > > checkpointer.c would be able to call this function with proper
> > > arguments. It is at least better in not needing two arguments to
> > > decide what to do, but how is checkpointer.c supposed to know what to
> > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > > what to pass for EndOfLogTLI and EndOfLog?
> >
> > On further study, I found another problem: the way my patch set leaves
> > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> > will not be correctly initialized in any process other than the
> > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> > would just be skipped. Your 0001 seems to have the same problem. You
> > added Assert(AmStartupProcess()) to the inside of the if
> > (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> > Outside the startup process, ArchiveRecoveryRequested will always be
> > false, but the point is that the associated stuff should be done if
> > ArchiveRecoveryRequested would have been true in the startup process.
> > Both of our patch sets leave things in a state where that would never
> > happen, which is not good. Unless I'm missing something, it seems like
> > maybe you didn't test your patches to verify that, when the
> > XLogAcceptWrites() call comes from the checkpointer, all the same
> > things happen that would have happened had it been called from the
> > startup process. That would be a really good thing to have tested
> > before posting your patches.
> >
>
> My bad, I am extremely sorry about that. I usually do test my patches,
> but somehow I failed to test this change due to manually testing the
> whole ASRO feature and hurrying in posting the newest version.
>
> I will try to be more careful next time.
>
I was too worried about how I could miss that & after thinking more
about that, I realized that the operation for ArchiveRecoveryRequested
is never going to be skipped in the startup process and that never
left for the checkpoint process to do that later. That is the reason
that assert was added there.
When ArchiveRecoveryRequested, the server will no longer be in
the wal prohibited mode, we implicitly change the state to
wal-permitted. Here is the snip from the 0003 patch:
@@ -6614,13 +6629,30 @@ StartupXLOG(void)
                 (errmsg("starting archive recovery")));
     }
-    /*
-     * Take ownership of the wakeup latch if we're going to sleep during
-     * recovery.
-     */
     if (ArchiveRecoveryRequested)
+    {
+        /*
+         * Take ownership of the wakeup latch if we're going to sleep during
+         * recovery.
+         */
         OwnLatch(&XLogCtl->recoveryWakeupLatch);
+        /*
+         * Since archive recovery is requested, we cannot be in a wal prohibited
+         * state.
+         */
+        if (ControlFile->wal_prohibited)
+        {
+            /* No need to hold ControlFileLock yet, we aren't up far enough */
+            ControlFile->wal_prohibited = false;
+            ControlFile->time = (pg_time_t) time(NULL);
+            UpdateControlFile();
+
+            ereport(LOG,
+                    (errmsg("clearing WAL prohibition because the system is in archive recovery")));
+        }
+    }
+
> > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> > several TLIs stored in XLogCtl. None of them seem to be precisely the
> > same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
> > enough. I found out pretty quickly through testing that replayEndTLI
> > isn't always valid -- it ends up 0 if we don't enter recovery. That's
> > not really a problem, though, because we only need it to be valid if
> > ArchiveRecoveryRequested. The code that initializes and updates it
> > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> > = true will force InRecovery = true. So it looks to me like
> > replayEndTLI will always be initialized in the cases where we need a
> > value. It's not yet entirely clear to me if it has to have the same
> > value as EndOfLogTLI. I find this code comment quite mysterious:
> >
> >     /*
> >      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >      * the end-of-log. It could be different from the timeline that EndOfLog
> >      * nominally belongs to, if there was a timeline switch in that segment,
> >      * and we were reading the old WAL from a segment belonging to a higher
> >      * timeline.
> >      */
> >     EndOfLogTLI = xlogreader->seg.ws_tli;
> >
> > The thing is, if we were reading old WAL from a segment belonging to a
> > higher timeline, wouldn't we have switched to that new timeline?
>
> AFAIUC, by browsing the code, yes, we are switching to the new
> timeline. Along with lastReplayedTLI, lastReplayedEndRecPtr is also
> the same as the EndOfLog that we needed when ArchiveRecoveryRequested
> is true.
>
> I went through the original commit 7cbee7c0a1db and the thread[1] but
> didn't find any related discussion for that.
>
> > Suppose we want WAL segment 246 from TLI 1, but we don't have that
> > segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> > the TLI 2 version, we'd need to have TLI 2 in the history of the
> > recovery_target_timeline. And if that is the case, then we would have
> > to replay through the record where the timeline changes. And if we do
> > that, then the discrepancy postulated by the comment cannot still
> > exist by the time we reach this code, because this code is only
> > reached after we finish WAL redo. So I'm baffled as to how this can
> > happen, but considering how many cases there are in this code, I sure
> > can't promise that it doesn't. The fact that we have few tests for any
> > of this doesn't help either.
>
> I am not an expert in this area, but will try to spend some more time
> on understanding and testing.
>
> [1] postgr.es/m/555DD101.7080209@iki.fi
>
> Regards,
> Amul
On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:
>
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:
>
> @@ -6614,13 +6629,30 @@ StartupXLOG(void)
>                  (errmsg("starting archive recovery")));
>      }
>
> -    /*
> -     * Take ownership of the wakeup latch if we're going to sleep during
> -     * recovery.
> -     */
>      if (ArchiveRecoveryRequested)
> +    {
> +        /*
> +         * Take ownership of the wakeup latch if we're going to sleep during
> +         * recovery.
> +         */
>          OwnLatch(&XLogCtl->recoveryWakeupLatch);
>
> +        /*
> +         * Since archive recovery is requested, we cannot be in a wal prohibited
> +         * state.
> +         */
> +        if (ControlFile->wal_prohibited)
> +        {
> +            /* No need to hold ControlFileLock yet, we aren't up far enough */
> +            ControlFile->wal_prohibited = false;
> +            ControlFile->time = (pg_time_t) time(NULL);
> +            UpdateControlFile();
> +
Is there some reason why we are forcing 'wal_prohibited' to off if we
are doing archive recovery? It might have already been discussed, but
I could not find it on a quick look into the thread.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 29, 2021 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 5:03 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
> >
> > @@ -6614,13 +6629,30 @@ StartupXLOG(void)
> >                  (errmsg("starting archive recovery")));
> >      }
> >
> > -    /*
> > -     * Take ownership of the wakeup latch if we're going to sleep during
> > -     * recovery.
> > -     */
> >      if (ArchiveRecoveryRequested)
> > +    {
> > +        /*
> > +         * Take ownership of the wakeup latch if we're going to sleep during
> > +         * recovery.
> > +         */
> >          OwnLatch(&XLogCtl->recoveryWakeupLatch);
> >
> > +        /*
> > +         * Since archive recovery is requested, we cannot be in a wal prohibited
> > +         * state.
> > +         */
> > +        if (ControlFile->wal_prohibited)
> > +        {
> > +            /* No need to hold ControlFileLock yet, we aren't up far enough */
> > +            ControlFile->wal_prohibited = false;
> > +            ControlFile->time = (pg_time_t) time(NULL);
> > +            UpdateControlFile();
> > +
>
> Is there some reason why we are forcing 'wal_prohibited' to off if we
> are doing archive recovery? It might have already been discussed, but
> I could not find it on a quick look into the thread.
>
Here it is:
https://postgr.es/m/CA+TgmoZ=CCTbAXxMTYZoGXEgqzOz9smkBWrDpsacpjvFcGCuaw@mail.gmail.com
Regards,
Amul
On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> I was too worried about how I could miss that & after thinking more
> about that, I realized that the operation for ArchiveRecoveryRequested
> is never going to be skipped in the startup process and that never
> left for the checkpoint process to do that later. That is the reason
> that assert was added there.
>
> When ArchiveRecoveryRequested, the server will no longer be in
> the wal prohibited mode, we implicitly change the state to
> wal-permitted. Here is the snip from the 0003 patch:
Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
kind of been wondering: why not have XLogAcceptWrites() be the
responsibility of the checkpointer all the time, in every case? That
would require fixing some more things, and this is one of them, but
then it would be consistent, which means that any bugs would be likely
to get found and fixed. If calling XLogAcceptWrites() from the
checkpointer is some funny case that only happens when the system
crashes while WAL is prohibited, then we might fail to notice that we
have a bug.
This is especially true given that we have very little test coverage
in this area. Andres was ranting to me about this earlier this week,
and I wasn't sure he was right, but then I noticed that we have
exactly zero tests in the entire source tree that make use of
recovery_end_command. We really need a TAP test for that, I think.
It's too scary to do much reorganization of the code without having
any tests at all for the stuff we're moving around. Likewise, we're
going to need TAP tests for the stuff that is specific to this patch.
For example, we should have a test that crashes the server while it's
read only, brings it back up, checks that we still can't write WAL,
then re-enables WAL, and checks that we now can write WAL. There are
probably a bunch of other things that we should test, too.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
>
> Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> kind of been wondering: why not have XLogAcceptWrites() be the
> responsibility of the checkpointer all the time, in every case? That
> would require fixing some more things, and this is one of them, but
> then it would be consistent, which means that any bugs would be likely
> to get found and fixed. If calling XLogAcceptWrites() from the
> checkpointer is some funny case that only happens when the system
> crashes while WAL is prohibited, then we might fail to notice that we
> have a bug.
>
> This is especially true given that we have very little test coverage
> in this area. Andres was ranting to me about this earlier this week,
> and I wasn't sure he was right, but then I noticed that we have
> exactly zero tests in the entire source tree that make use of
> recovery_end_command. We really need a TAP test for that, I think.
> It's too scary to do much reorganization of the code without having
> any tests at all for the stuff we're moving around. Likewise, we're
> going to need TAP tests for the stuff that is specific to this patch.
> For example, we should have a test that crashes the server while it's
> read only, brings it back up, checks that we still can't write WAL,
> then re-enables WAL, and checks that we now can write WAL. There are
> probably a bunch of other things that we should test, too.

Hi,
I have been testing “ALTER SYSTEM READ ONLY” and wrote a few TAP test cases for this feature.
Please find the test cases (draft version) attached; they are to be applied on top of the v30 patch by Amul.
Please review them and let me know what changes are required.
With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com
Attachment
Attached is the rebased version on top of the latest master head, which
includes the refactoring patches posted by Robert.

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
>
> Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> kind of been wondering: why not have XLogAcceptWrites() be the
> responsibility of the checkpointer all the time, in every case? That
> would require fixing some more things, and this is one of them, but
> then it would be consistent, which means that any bugs would be likely
> to get found and fixed. If calling XLogAcceptWrites() from the
> checkpointer is some funny case that only happens when the system
> crashes while WAL is prohibited, then we might fail to notice that we
> have a bug.
>

Unfortunately, I didn't get much time to think about this and don't
have a strong opinion on it either.

> This is especially true given that we have very little test coverage
> in this area. Andres was ranting to me about this earlier this week,
> and I wasn't sure he was right, but then I noticed that we have
> exactly zero tests in the entire source tree that make use of
> recovery_end_command. We really need a TAP test for that, I think.
> It's too scary to do much reorganization of the code without having
> any tests at all for the stuff we're moving around. Likewise, we're
> going to need TAP tests for the stuff that is specific to this patch.
> For example, we should have a test that crashes the server while it's
> read only, brings it back up, checks that we still can't write WAL,
> then re-enables WAL, and checks that we now can write WAL. There are
> probably a bunch of other things that we should test, too.
>

Yes, my next plan is to work on the TAP tests and look into the patch
posted by Prabhat to improve test coverage.

Regards,
Amul Sul
Attachment
- v31-0007-Documentation.patch
- v31-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v31-0005-Implement-wal-prohibit-state-using-global-barrie.patch
- v31-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
- v31-0004-Refactor-add-function-to-set-database-state-in-c.patch
- v31-0002-Postpone-some-end-of-recovery-operations-relatin.patch
- v31-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
Attached is the rebased version for the latest master head. I have also
added TAP tests covering part of this feature and a separate patch
to test recovery_end_command execution.

I have also been through Prabhat's patch, which helped me write the
current tests, but I am not sure about the few basic tests he included
in the TAP test that could otherwise be done using pg_regress, e.g.
checking permission to execute the pg_prohibit_wal() function. I have
yet to add those basic tests; is it OK to add them in pg_regress
instead of TAP? The problem I see is that all the tests covering a
feature will not be together, which I think is not correct.

What is the usual practice? Can we have a few tests in TAP and a few in
pg_regress for the same feature?

Regards,
Amul

On Wed, Aug 4, 2021 at 6:26 PM Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebase version on top of the latest master head
> includes refactoring patches posted by Robert.
>
> On Thu, Jul 29, 2021 at 9:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Jul 28, 2021 at 7:33 AM Amul Sul <sulamul@gmail.com> wrote:
> > > I was too worried about how I could miss that & after thinking more
> > > about that, I realized that the operation for ArchiveRecoveryRequested
> > > is never going to be skipped in the startup process and that never
> > > left for the checkpoint process to do that later. That is the reason
> > > that assert was added there.
> > >
> > > When ArchiveRecoveryRequested, the server will no longer be in
> > > the wal prohibited mode, we implicitly change the state to
> > > wal-permitted. Here is the snip from the 0003 patch:
> >
> > Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> > kind of been wondering: why not have XLogAcceptWrites() be the
> > responsibility of the checkpointer all the time, in every case? That
> > would require fixing some more things, and this is one of them, but
> > then it would be consistent, which means that any bugs would be likely
> > to get found and fixed. If calling XLogAcceptWrites() from the
> > checkpointer is some funny case that only happens when the system
> > crashes while WAL is prohibited, then we might fail to notice that we
> > have a bug.
> >
>
> Unfortunately, I didn't get much time to think about this and don't
> have a strong opinion on it either.
>
> > This is especially true given that we have very little test coverage
> > in this area. Andres was ranting to me about this earlier this week,
> > and I wasn't sure he was right, but then I noticed that we have
> > exactly zero tests in the entire source tree that make use of
> > recovery_end_command. We really need a TAP test for that, I think.
> > It's too scary to do much reorganization of the code without having
> > any tests at all for the stuff we're moving around. Likewise, we're
> > going to need TAP tests for the stuff that is specific to this patch.
> > For example, we should have a test that crashes the server while it's
> > read only, brings it back up, checks that we still can't write WAL,
> > then re-enables WAL, and checks that we now can write WAL. There are
> > probably a bunch of other things that we should test, too.
> >
>
> Yes, my next plan is to work on the TAP tests and look into the patch
> posted by Prabhat to improve test coverage.
>
> Regards,
> Amul Sul
Attachment
- v32-0009-Test-check-recovery_end_command-execution.patch
- v32-0007-Documentation.patch
- v32-0008-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v32-0005-Implement-wal-prohibit-state-using-global-barrie.patch
- v32-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v32-0004-Refactor-add-function-to-set-database-state-in-c.patch
- v32-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
- v32-0002-Postpone-some-end-of-recovery-operations-relatin.patch
- v32-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
On Tue, Aug 31, 2021 at 8:16 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is the rebased version for the latest master head. Also,
> added tap tests to test some part of this feature and a separate patch
> to test recovery_end_command execution.

It looks like you haven't given any thought to writing that in a way
that will work on Windows?

> What is usual practice, can have a few tests in TAP and a few in
> pg_regress for the same feature?

Sure, there's no problem with that.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Attached is the rebased version for the latest master head.

Hi Amul!

Could you please rebase again?

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > Attached is the rebased version for the latest master head.
>
> Hi Amul!
>
> Could you please rebase again?

Ok will do that tomorrow, thanks.
Regards,
Amul
On Tue, Sep 7, 2021 at 10:02 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, 7 Sep 2021 at 8:43 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>
>> > On Aug 31, 2021, at 5:15 AM, Amul Sul <sulamul@gmail.com> wrote:
>> >
>> > Attached is the rebased version for the latest master head.
>>
>> Hi Amul!
>>
>> Could you please rebase again?
>
> Ok will do that tomorrow, thanks.
>

Here is the rebased version. I have added a few more test cases; the
tests probably need extending and some optimization, which I'll try in
the next version. I dropped the patch for recovery_end_command testing
& will post that separately.

Regards,
Amul
Attachment
- v33-0008-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v33-0004-Refactor-add-function-to-set-database-state-in-c.patch
- v33-0005-Implement-wal-prohibit-state-using-global-barrie.patch
- v33-0007-Documentation.patch
- v33-0006-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v33-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
- v33-0002-Postpone-some-end-of-recovery-operations-relatin.patch
- v33-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
> On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Here is the rebased version.

v33-0004

This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.

v33-0005

This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool?

The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".

The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".

The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up walprobibit shared state"

The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.

tcop/utility.c needlessly includes "access/walprohibit.h"

wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?

xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:

    /* State of work that enables wal writes */
    typedef enum XLogAcceptWritesState
    {
        XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
        XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
        XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
    } XLogAcceptWritesState;

This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.

v33-0006:

The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says:

    /*
     * In opposite to the above assertion if a transaction doesn't have valid XID
     * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
     * prohibited. Therefore, we need to explicitly error out before entering into
     * the critical section.
     */

This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

v33-0007:

I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.

But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?

Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

That's enough code review for now. Next I will review your regression tests....

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 9, 2021 at 1:42 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

Honestly that sentence doesn't sound very clear even with a different verb.

> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?

The idea here is that if a transaction already has an XID assigned, we
have to kill it off before we can declare the system read-only,
because it will definitely write WAL when the transaction ends: either
a commit record, or an abort record, but definitely something. So
cases where we write WAL without necessarily having an XID require
special handling. They have to check whether WAL has become prohibited
and error out if so, and they need to do so before entering the
critical section - because if the problem were detected for the first
time inside the critical section it would escalate to a PANIC, which
we do not want. Places where we're guaranteed to have an XID - e.g.
inserting a heap tuple - don't need a run-time check before entering
the critical section, because the code can't be reached in the first
place if the system is WAL-read-only.

> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.

Hmm, I find pg_prohibit_wal(true/false) better than pg_prohibit_wal()
and pg_allow_wal(), and would prefer pg_prohibit_wal(true/false,
timeout) over pg_prohibit_wal(timeout) and pg_allow_wal(), because I
think then once you find that one function you know how to do
everything about that feature, whereas the other way you need to find
both functions to have the whole story. That said, I can see why
somebody else might prefer something else.

--
Robert Haas
EDB: http://www.enterprisedb.com
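The reason for checking before the critical section, as explained in the message above, can be illustrated with a small standalone C model. It is a sketch under stated assumptions rather than PostgreSQL code: `ToyOutcome` and all `toy_*` names are hypothetical, the "WAL prohibited" flag is a plain variable standing in for the backend's view of the system state, and the only behaviour modelled is that an error raised inside a critical section escalates to a PANIC.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of attempting a WAL-logged change in this toy model. */
typedef enum ToyOutcome
{
    TOY_OK,     /* change was WAL-logged */
    TOY_ERROR,  /* ordinary error: only the current operation fails */
    TOY_PANIC   /* error inside a critical section: whole server goes down */
} ToyOutcome;

static bool toy_wal_prohibited = false; /* this backend's view of the state */
static int  toy_crit_section_count = 0; /* > 0 while in a critical section */

void toy_set_wal_prohibited(bool prohibited)
{
    toy_wal_prohibited = prohibited;
}

/* An error raised while inside a critical section escalates to PANIC. */
ToyOutcome toy_raise_wal_prohibited(void)
{
    return (toy_crit_section_count > 0) ? TOY_PANIC : TOY_ERROR;
}

/*
 * A WAL-logged change made by code that may not have an XID (for example
 * VACUUM). With check_before_crit_section set, the prohibited state is
 * detected while a plain ERROR is still possible.
 */
ToyOutcome toy_wal_logged_change(bool check_before_crit_section)
{
    ToyOutcome outcome = TOY_OK;

    if (check_before_crit_section && toy_wal_prohibited)
        return toy_raise_wal_prohibited();      /* plain ERROR */

    toy_crit_section_count++;                   /* enter critical section */
    if (toy_wal_prohibited)
        outcome = toy_raise_wal_prohibited();   /* detected too late: PANIC */
    toy_crit_section_count--;                   /* leave critical section */

    assert(toy_crit_section_count == 0);
    return outcome;
}
```

With WAL prohibited, `toy_wal_logged_change(true)` yields `TOY_ERROR` while `toy_wal_logged_change(false)` yields `TOY_PANIC`. The model deliberately says nothing about whether the state can change between the check and the start of the critical section.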
> On Sep 9, 2021, at 11:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> They have to check whether WAL has become prohibited
> and error out if so, and they need to do so before entering the
> critical section - because if the problem were detected for the first
> time inside the critical section it would escalate to a PANIC, which
> we do not want.

But that is the part that is still not clear. Should the comment say that a concurrent change to prohibit wal after the current process checks but before the current process exits the critical section will result in a panic? What is unclear about the comment is that it implies that a check before the critical section is sufficient, but ordinarily one would expect a lock to be held and the check-and-lock dance to carefully avoid any race condition. If somehow this is safe, the logic for why it is safe should be spelled out. If not, a mea culpa saying, "hey, we're not terribly safe about this" should be explicit in the comment.

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>

Thank you for looking at the patch. Please see my replies inline below:

> > On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > Here is the rebased version.
>
> v33-0004
>
> This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.
>

Yes, you are correct, xlog.h is included in more than 150 files. I was
wondering if we can have a forward declaration instead of including
pg_control.h (e.g. the same way struct XLogRecData is declared in
xlog.h). However, DBState is an enum, and I don't see that we have done
the same for an enum elsewhere as we do for structures, but that seems
to be fine, IMO.

Earlier, I was unsure before preparing this patch, but since that
makes sense (I assumed) and minimizes duplication, can we go ahead and
post it separately with the same change in StartupXLOG(), which I have
skipped for the same reason mentioned in the patch commit message?

> v33-0005
>
> This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the value of LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress(). There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercible to boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that the function is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integer to bool?
>

We can't use a boolean variable because LocalXLogInsertAllowed
represents three states: 1 means "wal is allowed", 0 means "wal is
disallowed", and -1 means "need to check".

> The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".
>
> The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".
>
> The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up walprobibit shared state"
>
> The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.
>
> tcop/utility.c needlessly includes "access/walprohibit.h"
>
> wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?
>
> xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:
>
> /* State of work that enables wal writes */
> typedef enum XLogAcceptWritesState
> {
>     XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
>     XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
>     XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
> } XLogAcceptWritesState;
>
> This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.
>
> v33-0006:
>
> The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */
>

I will think about how to improve the code comments you pointed out.

> The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the new function comment says:
>

CheckWALPermitted() calls XLogInsertAllowed(), which checks the
LocalXLogInsertAllowed flag; that flag is local to the process, and
nobody else reads it concurrently.

> /*
>  * In opposite to the above assertion if a transaction doesn't have valid XID
>  * (e.g. VACUUM) then it won't be killed while changing the system state to WAL
>  * prohibited. Therefore, we need to explicitly error out before entering into
>  * the critical section.
>  */
>
> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needs wal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given to wonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it trigger a panic? I don't like the idea that calling pg_prohibit_wal during a vacuum might panic the cluster. If there is some reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to check whether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed through the lifetime of that critical section?
>

Hm, interrupt absorption is disabled inside a critical section.
The wal prohibited state for that process (here vacuum) will never get
set until it sees the interrupt, and the system will not be declared
wal prohibited until every process has seen that interrupt. I am not
sure we should explain the characteristics of the critical section at
this place; if wanted, we can add a brief note saying that inside the
critical section we need not worry about the state change, which never
happens there because interrupts are disabled.

> v33-0007:
>
> I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.
>

Ok, I am fine to always return true.

> But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?
>

Ok, will check.

> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.
>

Like Robert, I too am inclined to have a single function that is easy
to remember. Apart from this, recently while testing this patch with
pgbench I exhausted the connection limit and wanted to change the
system's prohibited state in between, but I was unable to do that; I
wish I could have done it using a pg_ctl option. How about having a
pg_ctl option to alter the wal prohibited state?

> That's enough code review for now. Next I will review your regression tests....
>

Thanks again.
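The two mechanisms described in the reply above, the three-valued process-local flag and the deferred absorption of the state-change interrupt, can be put together in a small standalone C model. This is a hedged sketch of that description only, not the patch's implementation: `ToyBackend`, `ToyWalInsertState` and every `toy_*` function are hypothetical names, and the rule that nothing is cached while recovery is in progress is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Named values for the three states; the names are illustrative only. */
typedef enum ToyWalInsertState
{
    TOY_WAL_NEED_CHECK = -1,    /* not decided yet, ask the system */
    TOY_WAL_DISALLOWED = 0,
    TOY_WAL_ALLOWED = 1
} ToyWalInsertState;

typedef struct ToyBackend
{
    ToyWalInsertState local_state;  /* process-local, never read by others */
    bool barrier_pending;           /* state-change request not yet absorbed */
    int  crit_section_count;        /* > 0 while inside a critical section */
} ToyBackend;

void toy_backend_init(ToyBackend *b)
{
    b->local_state = TOY_WAL_NEED_CHECK;
    b->barrier_pending = false;
    b->crit_section_count = 0;
}

void toy_start_crit_section(ToyBackend *b)
{
    b->crit_section_count++;
}

void toy_end_crit_section(ToyBackend *b)
{
    assert(b->crit_section_count > 0);
    b->crit_section_count--;
}

/* Checkpointer side: request that this backend stop writing WAL. */
void toy_emit_barrier(ToyBackend *b)
{
    b->barrier_pending = true;
}

/* Backend side: absorb a pending request, but never in a critical section. */
void toy_check_for_interrupts(ToyBackend *b)
{
    if (b->crit_section_count > 0 || !b->barrier_pending)
        return;
    b->local_state = TOY_WAL_DISALLOWED;
    b->barrier_pending = false;
}

/* Tri-state lookup: only "need to check" consults the recovery status. */
bool toy_xlog_insert_allowed(ToyBackend *b, bool recovery_in_progress)
{
    if (b->local_state != TOY_WAL_NEED_CHECK)
        return b->local_state == TOY_WAL_ALLOWED;
    if (recovery_in_progress)
        return false;           /* stay undecided while in recovery */
    b->local_state = TOY_WAL_ALLOWED;
    return true;
}

/* The system counts as WAL-prohibited only once every backend has absorbed. */
bool toy_system_wal_prohibited(const ToyBackend *backends, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (backends[i].local_state != TOY_WAL_DISALLOWED)
            return false;
    return true;
}
```

In this model a backend that receives the request while inside a critical section keeps seeing "allowed" until it leaves the section and processes interrupts, and the system as a whole is not reported as WAL-prohibited before that point. The model does not cover an interrupt arriving between the check and the start of the critical section.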
> On Sep 10, 2021, at 7:36 AM, Amul Sul <sulamul@gmail.com> wrote: > >> v33-0005 >> >> This patch makes bool XLogInsertAllowed() more complicated than before. The result used to depend mostly on the valueof LocalXLogInsertAllowed except that when that value was negative, the result was determined by RecoveryInProgress().There was an arcane rule that LocalXLogInsertAllowed must have the non-negative values binary coercibleto boolean "true" and "false", with the basis for that rule being the coding of XLogInsertAllowed(). Now that thefunction is more complicated, this rule seems even more arcane. Can we change the logic to not depend on casting an integerto bool? >> > > We can't use a boolean variable because LocalXLogInsertAllowed > represents three states as, 1 means "wal is allowed'', 0 for "wal is > disallowed", and -1 is for "need to check". I'm complaining that we're using an integer rather than an enum. I'm ok if we define it so that WAL_ALLOWABLE_UNKNOWN =-1, WAL_DISALLOWED = 0, WAL_ALLOWED = 1 or such, but the logic of the function has gotten complicated enough that havingto remember which number represents which logical condition has become a (small) mental burden. Given how hard theWAL code is to read and fully grok, I'd rather avoid any unnecessary burden, even small ones. >> The new function CheckWALPermitted() seems to test the current state of variables but not lock any of them, and the newfunction comment says: >> > > CheckWALPermitted() calls XLogInsertAllowed() does check the > LocalXLogInsertAllowed flag which is local to that process only, and > nobody else reads that concurrently. > >> /* >> * In opposite to the above assertion if a transaction doesn't have valid XID >> * (e.g. VACUUM) then it won't be killed while changing the system state to WAL >> * prohibited. Therefore, we need to explicitly error out before entering into >> * the critical section. 
>> */ >> >> This suggests to me that a vacuum process can check whether wal is prohibited, then begin a critical section which needswal to be allowed, and concurrently somebody else might disable wal without killing the vacuum process. I'm given towonder what horrors await when the vacuum process does something that needs to be wal logged but cannot be. Does it triggera panic? I don't like the idea that calling pg_prohibit_wal durning a vacuum might panic the cluster. If there issome reason this is not a problem, I think the comment should explain it. In particular, why is it sufficient to checkwhether wal is prohibited before entering the critical section and not necessary to be sure it remains allowed throughthe lifetime of that critical section? >> > > Hm, interrupts absorption are disabled inside the critical section. > The wal prohibited state for that process (here vacuum) will never get > set until it sees the interrupts & the system will not be said wal > prohibited until every process sees that interrupts. I am not sure we > should explain the characteristics of the critical section at this > place, if want, we can add a brief saying that inside the critical > section we should not worry about the state change which never happens > because interrupts are disabled there. I think the fact that interrupts are disabled during critical sections is understood, so there is no need to mention that. The problem is that the method for taking the system read-only is less generally known, and readers of other sectionsof code need to jump to the definition of CheckWALPermitted to read the comments and understand what it does. 
Take for example a code stanza from heapam.c:

    if (needwal)
        CheckWALPermitted();

    /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();

Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an interrupt might get processed *during* CheckWALPermitted, since that function isn't atomic. It might happen after the check is meaningfully finished but before the function actually returns. So I'm not inclined to believe that the way this all works is dependent on interrupts being blocked. So I think, maybe this is all protected by some other scheme. But what? It's not clear from the code comments for CheckWALPermitted, so I'm left having to reverse engineer the system to understand it.

One interpretation is that the signal handler will exit() my backend if it receives a signal saying that the system is going read-only, so there is no race condition. But then why the call to CheckWALPermitted()? If this interpretation were correct, we'd happily enter the critical section without checking, secure in the knowledge that as long as we haven't exited yet, all is ok.

Another interpretation is that the whole thing is just a performance trick. Maybe we're ok with the idea that we will occasionally miss the fact that wal is prohibited, do whatever work we need in the critical section, and then fail later. But if that is true, it had better not be a panic, because designing the system to panic 1% of the time (or whatever percent it works out to be) isn't project style.

So looking into the critical section in the heapam.c code, I see:

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);

    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);

And jumping to the definition of XLogBeginInsert() I see

    /*
     * WAL permission must have checked before entering the critical section.
     * Otherwise, WAL prohibited error will force system panic.
     */

So now I'm flummoxed. Is it that the code is broken, or is it that I don't know what the strategy behind all this is? If there were a code comment saying how this all works, I'd be in a better position to either know that it is truly safe or alternately know that the strategy is wrong.

Even if my analysis that this is all flawed is incorrect, I still think that a code comment would help.

>> v33-0007:
>>
>> I don't really like what the documentation has to say about pg_prohibit_wal. Why should pg_prohibit_wal differ from other signal sending functions in whether it returns a boolean? If you believe it must always succeed, you can still define it as returning a boolean and always return true. That leaves the door open to future code changes which might need to return false for some reason.
>>
>
> Ok, I am fine to always return true.

Ok.

>> But I also don't like the idea that existing transactions with xids are immediately killed. Shouldn't this function take an optional timeout, perhaps defaulting to none, but otherwise allowing the user to put the system into WALPROHIBIT_STATE_GOING_READ_ONLY for a period of time before killing remaining transactions?
>>
>
> Ok, will check.
>
>> Why is this function defined to take a boolean such that pg_prohibit_wal(true) means to prohibit wal and pg_prohibit_wal(false) means to allow wal. Wouldn't a different function named pg_allow_wal() make it more clear? This also would be a better interface if taking the system read-only had a timeout as I suggested above, as such a timeout parameter when allowing wal is less clearly useful.
>>
>
> Like Robert, I am too inclined to have a single function that is easy
> to remember.

For C language functions that take a bool argument, I can jump to the definition using ctags, and I assume most other developers can do so using whatever IDE they like.
For SQL functions, it's a bit harder to jump to the definition, particularly if you are logged into a production server where non-essential software is intentionally missing. Then you have to wonder, what exactly is the boolean argument toggling here?

I don't feel strongly about this, though, and you don't need to change it.

> Apart from this, recently while testing this patch with
> pgbench where I have exhausted the connection limit and want to change
> the system's prohibited state in between but I was unable to do that,
> I wish I could do that using the pg_ctl option. How about having a
> pg_ctl option to alter wal prohibited state?

I'd have to review the implementation, but sure, that sounds like a useful ability.

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
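The enum suggestion in the message above could look roughly like the sketch below. This is an illustration only, not code from the patch set: the type name `WalInsertState`, the function `xlog_insert_allowed()` and the `recovery_in_progress` stub are hypothetical stand-ins for `LocalXLogInsertAllowed`, `XLogInsertAllowed()` and `RecoveryInProgress()`. The enumerator names are the ones floated in the message, and the WAL-prohibit check the patch adds is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state replacing the bare int values -1, 0 and 1. */
typedef enum WalInsertState
{
    WAL_ALLOWABLE_UNKNOWN = -1, /* "need to check" */
    WAL_DISALLOWED = 0,         /* "wal is disallowed" */
    WAL_ALLOWED = 1             /* "wal is allowed" */
} WalInsertState;

/* Process-local cached state; stand-in for LocalXLogInsertAllowed. */
static WalInsertState local_wal_insert_state = WAL_ALLOWABLE_UNKNOWN;

/* Stub standing in for RecoveryInProgress(); an assumption of this sketch. */
static bool recovery_in_progress = true;

/* Stand-in for XLogInsertAllowed(): no integer-to-bool cast is needed. */
static bool
xlog_insert_allowed(void)
{
    switch (local_wal_insert_state)
    {
        case WAL_ALLOWED:
            return true;
        case WAL_DISALLOWED:
            return false;
        case WAL_ALLOWABLE_UNKNOWN:
            break;              /* fall out and check below */
    }

    /* State unknown: inserts are not allowed while recovery is running. */
    if (recovery_in_progress)
        return false;

    /* Recovery is over; cache the answer so later calls take the fast path. */
    local_wal_insert_state = WAL_ALLOWED;
    return true;
}
```

With named states, each branch of the function says which condition it handles, which is the (small) readability gain being asked for.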
> On Sep 10, 2021, at 8:42 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Take for example a code stanza from heapam.c:
>
>     if (needwal)
>         CheckWALPermitted();
>
>     /* NO EREPORT(ERROR) from here till changes are logged */
>     START_CRIT_SECTION();
>
> Now, I know that interrupts won't be processed after starting the critical section, but I can see plain as day that an interrupt might get processed *during* CheckWALPermitted, since that function isn't atomic.

A better example may be found in ginmetapage.c:

    needwal = RelationNeedsWAL(indexrel);
    if (needwal)
    {
        CheckWALPermitted();
        computeLeafRecompressWALData(leaf);
    }

    /* Apply changes to page */
    START_CRIT_SECTION();

Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument can't be made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is running.

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 10, 2021 at 12:20 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> A better example may be found in ginmetapage.c:
>
>     needwal = RelationNeedsWAL(indexrel);
>     if (needwal)
>     {
>         CheckWALPermitted();
>         computeLeafRecompressWALData(leaf);
>     }
>
>     /* Apply changes to page */
>     START_CRIT_SECTION();

Yeah, that looks sketchy. Why not move CheckWALPermitted() down a line?

> Even if CheckWALPermitted is assumed to be close enough to atomic to not be a problem (I don't agree), that argument can't be made here, as computeLeafRecompressWALData is not trivial and signals could easily be processed while it is running.

I think the relevant question here is not "could a signal handler fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant question is the former, then there's no hope of ever making it work because there's always a race condition. But the signal handler is only setting flags whose only effect is to make a subsequent CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call ProcessInterrupts().

--
Robert Haas
EDB: http://www.enterprisedb.com
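The distinction drawn above, between a signal handler firing and an interrupt actually being processed, can be modelled in a few lines. The following is a simplified sketch under stated assumptions, not PostgreSQL's implementation: every name in it (`interrupt_pending`, `crit_section_count`, `wal_prohibited`, `process_interrupts()`, `CHECK_FOR_INTERRUPTS_SKETCH()`) is hypothetical and only mirrors the roles of the real flag, critical-section counter, `ProcessInterrupts()` and `CHECK_FOR_INTERRUPTS()`.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set asynchronously by the signal handler; the handler does nothing else. */
static volatile sig_atomic_t interrupt_pending = 0;

/* Nesting depth of critical sections; interrupts are not serviced inside. */
static int crit_section_count = 0;

/* Process-local view of the WAL state; changed only in process_interrupts(). */
static bool wal_prohibited = false;

/* The handler can run at any instant, but it only records a request. */
static void
handle_signal(int signo)
{
    (void) signo;
    interrupt_pending = 1;
}

/* Acts on a pending request, unless a critical section is open. */
static void
process_interrupts(void)
{
    if (crit_section_count > 0)
        return;                 /* held off until the critical section ends */

    if (interrupt_pending)
    {
        interrupt_pending = 0;
        wal_prohibited = true;  /* the state change becomes visible here */
    }
}

/* Only at these explicit points can the process-local state change. */
#define CHECK_FOR_INTERRUPTS_SKETCH() \
    do { if (interrupt_pending) process_interrupts(); } while (0)

/* Process-local check, in the spirit of CheckWALPermitted(). */
static bool
wal_permitted(void)
{
    return !wal_prohibited;
}
```

In this model, a signal arriving between `wal_permitted()` and the start of a critical section changes nothing the process can observe, as long as no `CHECK_FOR_INTERRUPTS_SKETCH()` sits in between; that is the property the code path in question would have to guarantee.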
> On Sep 10, 2021, at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think the relevant question here is not "could a signal handler
> fire?" but "can we hit a CHECK_FOR_INTERRUPTS()?". If the relevant
> question is the former, then there's no hope of ever making it work
> because there's always a race condition. But the signal handler is
> only setting flags whose only effect is to make a subsequent
> CHECK_FOR_INTERRUPTS() do something, so it doesn't really matter when
> the signal handler can run, but when CHECK_FOR_INTERRUPTS() can call
> ProcessInterrupts().

Ok, that makes more sense. I was reviewing the code after first reviewing the documentation changes, which led me to believe the system was designed to respond more quickly than that:

+ WAL prohibited is a read-only system state. Any permitted user can call
+ <function>pg_prohibit_wal</function> function to forces the system into
+ a WAL prohibited mode where insert write ahead log will be prohibited until
+ the same function executed to change that state to read-write. Like Hot and
+ Otherwise, it will be <literal>off</literal>. When the user requests WAL
+ prohibited state, at that moment if any existing session is already running
+ a transaction, and that transaction has already been performed or planning
+ to perform wal write operations then the session running that transaction
+ will be terminated.

"forces the system" in the first part, and "at that moment ... that transaction will be terminated" sounds heavier handed than something which merely sets a flag asking the backend to exit. I was reading that as more immediate and then trying to figure out how the signal handling could possibly work, and failing to see how.

The README:

+Any
+backends which receive WAL prohibited system state transition barrier interrupt
+need to stop WAL writing immediately.
+For barrier absorption the backed(s) will
+kill the running transaction which has valid XID indicates that the transaction
+has performed and/or planning WAL write.

uses "immediately" and "will kill the running transaction" which reinforced the impression that this mechanism is heavier handed than it is.

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 10, 2021 at 1:16 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> uses "immediately" and "will kill the running transaction" which reenforced the impression that this mechanism is heavier handed than it is.

It's intended to be just as immediate as e.g. pg_cancel_backend() and pg_terminate_backend(), which work just the same way, but not any more so. I guess we could look at how things are worded in those cases. From a user perspective such things are usually pretty immediate, but not as immediate as firing a signal handler. Computers are fast.[1]

--
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://www.youtube.com/watch?v=6xijhqU8r2A
> On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:
>
> (2) if the session is idle, we also need the top-level abort
> record to be written immediately, but can't send an error to the client until the next
> command is issued without losing wire protocol synchronization. For now, we just use
> FATAL to kill the session; maybe this can be improved in the future.

Andres,

I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned page held by the cursor, when another session disables WAL. It would be very interesting to test how the vacuum handles that specific change. I have not figured out the cleanest way to do this, though, as we don't as a project yet have a standard way of setting up xid exhaustion in a regression test, do we? The closest I saw to it was your work in [1], but that doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which isn't useable from inside an isolation test. Do you have a suggestion how best to continue developing out the test infrastructure?

Amul,

The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the isolation tester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test to abort and all subsequent permutations to not run. Attached patch v34-0009 fixes that.

Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.

[1] https://www.postgresql.org/message-id/flat/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com#e10861372aec22119b66756ecbac581c

--
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sat, Jul 24, 2021 at 1:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 1:23 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is rebase for the latest master head. Also, I added one more
> > refactoring code that deduplicates the code setting database state in the
> > control file. The same code set the database state is also needed for this
> > feature.
>
> I started studying 0001 today and found that it rearranged the order
> of operations in StartupXLOG() more than I was expecting. It does, as
> per previous discussions, move a bunch of things to the place where we
> now call XLogReportParameters(). But, unsatisfyingly, InRecovery = false and
> XLogReaderFree() then have to move down even further. Since the goal
> here is to get to a situation where we sometimes XLogAcceptWrites()
> after InRecovery = false, it didn't seem nice for this refactoring
> patch to still end up with a situation where this stuff happens while
> InRecovery = true. In fact, with the patch, the amount of code that
> runs with InRecovery = true actually *increases*, which is not what I
> think should be happening here. That's why the patch ends up having to
> adjust SetMultiXactIdLimit to not Assert(!InRecovery).
>
> And then I started to wonder how this was ever going to work as part
> of the larger patch set, because as you have it here,
> XLogAcceptWrites() takes arguments XLogReaderState *xlogreader,
> XLogRecPtr EndOfLog, and TimeLineID EndOfLogTLI and if the
> checkpointer is calling that at a later time after the user issues
> pg_prohibit_wal(false), it's going to have none of those things. So I
> had a quick look at that part of the code and found this in
> checkpointer.c:
>
> XLogAcceptWrites(true, NULL, InvalidXLogRecPtr, 0);
>
> For those following along from home, the additional "true" is a bool
> needChkpt argument added to XLogAcceptWrites() by 0003. Well, none of
> this is very satisfying.
> The whole purpose of passing the xlogreader
> is so we can figure out whether we need a checkpoint (never mind the
> question of whether the existing algorithm for determining that is
> really sensible) but now we need a second argument that basically
> serves the same purpose since one of the two callers to this function
> won't have an xlogreader. And then we're passing the EndOfLog and
> EndOfLogTLI as dummy values which seems like it's probably just
> totally wrong, but if for some reason it works correctly there sure
> don't seem to be any comments explaining why.
>
> So I started doing a bit of hacking myself and ended up with the
> attached, which I think is not completely the right thing yet but I
> think it's better than your version. I split this into three parts.
> 0001 splits up the logic that currently decides whether to write an
> end-of-recovery record or a checkpoint record and if the latter how
> the checkpoint ought to be performed into two functions.
> DetermineRecoveryXlogAction() figures out what we want to do, and
> PerformRecoveryXlogAction() does it. It also moves the code to run
> recovery_end_command and related stuff into a new function
> CleanupAfterArchiveRecovery(). 0002 then builds on this by postponing
> UpdateFullPageWrites(), PerformRecoveryXLogAction(), and
> CleanupAfterArchiveRecovery() to just before we
> XLogReportParameters(). Because of the refactoring done by 0001, this
> is only a small amount of code movement. Because of the separation
> between DetermineRecoveryXlogAction() and PerformRecoveryXlogAction(),
> the latter doesn't need the xlogreader. So we can do
> DetermineRecoveryXlogAction() at the same time as now, while the
> xlogreader is available, and then we don't need it later when we
> PerformRecoveryXlogAction(), because we already know what we need to
> know. I think this is all fine as far as it goes.
>
> My 0003 is where I see some lingering problems.
> It creates
> XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> need the xlogreader. But it doesn't really solve the problem of how
> checkpointer.c would be able to call this function with proper
> arguments. It is at least better in not needing two arguments to
> decide what to do, but how is checkpointer.c supposed to know what to
> pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> what to pass for EndOfLogTLI and EndOfLog? Actually, EndOfLog doesn't
> seem too problematic, because that value has been stored in four (!)
> places inside XLogCtl by this code:
>
> LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
>
> XLogCtl->LogwrtResult = LogwrtResult;
>
> XLogCtl->LogwrtRqst.Write = EndOfLog;
> XLogCtl->LogwrtRqst.Flush = EndOfLog;
>
> Presumably we could relatively easily change things around so that we
> fish one of those values ... probably one of the "write" values ..
> back out of XLogCtl instead of passing it as a parameter. That would
> work just as well from the checkpointer as from the startup process,
> and there seems to be no way for the value to change until after
> XLogAcceptWrites() has been called, so it seems fine. But that doesn't
> help for the other arguments. What I'm thinking is that we should just
> arrange to store EndOfLogTLI and xlogaction into XLogCtl also, and
> then XLogAcceptWrites() can fish those values out of there as well,
> which should be enough to make it work and do the same thing
> regardless of which process is calling it. But I have run out of time
> for today so have not explored coding that up.
>

I have spent some time thinking about making XLogAcceptWrites() independent, and for that, we need to get rid of its arguments which are available only in the startup process. The first argument, xlogaction, is deduced by DetermineRecoveryXlogAction().
If we are able to make this function's logic independent and can deduce xlogaction in any process, we can skip passing the xlogaction argument. DetermineRecoveryXlogAction() depends on a few global variables that are valid only in the startup process: InRecovery, ArchiveRecoveryRequested & LocalPromoteIsTriggered. Out of the three, LocalPromoteIsTriggered's value is already available in shared memory and can be fetched by calling PromoteIsTriggered(). InRecovery's value can be guessed as long as DBState in the control file doesn't get changed unexpectedly until XLogAcceptWrites() executes. If the DBState was not a clean shutdown, then surely the server has gone through recovery. If we could rely on DBState in the control file then we are good to go. For the last one, ArchiveRecoveryRequested, I don't see any existing and appropriate shared memory or control file information from which it can be identified whether archive recovery was requested or not. Initially, I thought to use SharedRecoveryState, which is always set to RECOVERY_STATE_ARCHIVE if archive recovery was requested. But there is another case where SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of the ArchiveRecoveryRequested value, that is, the presence of a backup label file. If we want to use SharedRecoveryState, we need one more state which could differentiate between ArchiveRecoveryRequested and the backup label file presence case. To move ahead, I have copied ArchiveRecoveryRequested into shared memory and it will be cleared once archive cleanup is finished. With all these changes, we could get rid of the xlogaction argument and the DetermineRecoveryXlogAction() function, and could move its logic to PerformRecoveryXLogAction() directly.

Now, the remaining two arguments of XLogAcceptWrites() are required for the CleanupAfterArchiveRecovery() function. Along with these two arguments, this function requires ArchiveRecoveryRequested and ThisTimeLineID, which are again global variables.
With the previous changes, we have got ArchiveRecoveryRequested into shared memory. And for ThisTimeLineID, I don't think we need to do anything since this value is available with all the backends as per the following comment:

"
/*
 * ThisTimeLineID will be same in all backends --- it identifies current
 * WAL timeline for the database system.
 */
TimeLineID ThisTimeLineID = 0;
"

In addition to the four places that Robert has pointed out for EndOfLog, XLogCtl->lastSegSwitchLSN also holds the EndOfLog value and that doesn't seem to change until WAL write is enabled. For EndOfLogTLI, I think we can safely use XLogCtl->replayEndTLI. Currently, EndOfLogTLI is the timeline ID of the last record that xlogreader reads, but this xlogreader was simply re-fetching the last record which we have replayed in the redo loop if it was in recovery; if not in recovery, we don't need to worry since this value is needed only in case of ArchiveRecoveryRequested = true, which implicitly forces redo and sets the replayEndTLI value.

With all the above changes XLogAcceptWrites() can be called from other processes but I haven't tested that. This finding is still not complete and not too clean; I am posting the patches with the aforesaid changes just to confirm the direction and move the discussion forward, thanks.

Regards,
Amul
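The direction described in this message, where the startup process publishes what only it knows so that the function enabling writes needs no arguments, might be pictured as in the sketch below. It is a single-process illustration under stated assumptions, not the patch: `SharedLogState` stands in for the relevant slice of `XLogCtl`, and `publish_end_of_recovery()`, `accept_writes()`, `cleanup_after_archive_recovery()` and `last_cleanup` are hypothetical names loosely modelled on `XLogAcceptWrites()` and `CleanupAfterArchiveRecovery()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and TimeLineID. */
typedef uint64_t LogPtr;
typedef uint32_t TimeLine;

/* Stand-in for the part of shared memory (XLogCtl) discussed above. */
typedef struct SharedLogState
{
    LogPtr      end_of_log;                 /* EndOfLog, saved at end of recovery */
    TimeLine    end_of_log_tli;             /* EndOfLogTLI, e.g. replayEndTLI */
    bool        archive_recovery_requested; /* copied from the startup process */
    bool        writes_accepted;
} SharedLogState;

static SharedLogState shared_state;

/* Records what the (stubbed) archive cleanup was called with. */
typedef struct CleanupRecord
{
    bool        ran;
    LogPtr      end_of_log;
    TimeLine    tli;
} CleanupRecord;

static CleanupRecord last_cleanup;

/* Startup process: publish the values that only it has at hand. */
static void
publish_end_of_recovery(LogPtr end_of_log, TimeLine tli, bool archive_requested)
{
    shared_state.end_of_log = end_of_log;
    shared_state.end_of_log_tli = tli;
    shared_state.archive_recovery_requested = archive_requested;
}

/* Stub for the cleanup step; it only remembers its inputs. */
static void
cleanup_after_archive_recovery(LogPtr end_of_log, TimeLine tli)
{
    last_cleanup = (CleanupRecord) {true, end_of_log, tli};
}

/* Callable from any process: everything it needs comes from shared state. */
static void
accept_writes(void)
{
    if (shared_state.archive_recovery_requested)
    {
        cleanup_after_archive_recovery(shared_state.end_of_log,
                                       shared_state.end_of_log_tli);
        /* Cleared once archive cleanup is finished, as described above. */
        shared_state.archive_recovery_requested = false;
    }
    shared_state.writes_accepted = true;
}
```

A real implementation would also need locking around the shared fields and a decision on which existing `XLogCtl` members to reuse, which is exactly what the follow-up discussion is about.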
Attachment
On Wed, Sep 15, 2021 at 6:49 AM Amul Sul <sulamul@gmail.com> wrote:
> Initially, I thought to
> use SharedRecoveryState which is always set to RECOVERY_STATE_ARCHIVE,
> if the archive recovery requested. But there is another case where
> SharedRecoveryState could be RECOVERY_STATE_ARCHIVE irrespective of
> ArchiveRecoveryRequested value, that is the presence of a backup label
> file.

Right, there's a difference between whether archive recovery has been *requested* and whether it is actually *happening*.

> If we want to use SharedRecoveryState, we need one more state
> which could differentiate between ArchiveRecoveryRequested and the
> backup label file presence case. To move ahead, I have copied
> ArchiveRecoveryRequested into shared memory and it will be cleared
> once archive cleanup is finished. With all these changes, we could get
> rid of xlogaction argument and DetermineRecoveryXlogAction() function.
> Could move its logic to PerformRecoveryXLogAction() directly.

Putting these changes into 0001 seems to make no sense. It seems like they should be part of 0003, or maybe a new 0004 patch.

> And for ThisTimeLineID, I don't think we need to do anything since this
> value is available with all the backend as per the following comment:
> "
> /*
>  * ThisTimeLineID will be same in all backends --- it identifies current
>  * WAL timeline for the database system.
>  */
> TimeLineID ThisTimeLineID = 0;

I'm not sure I find that argument totally convincing. The two variables aren't assigned at exactly the same places in the code, notwithstanding the comment. I'm not saying you're wrong. I'm just saying I don't believe it just because the comment says so.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Putting these changes into 0001 seems to make no sense. It seems like
> they should be part of 0003, or maybe a new 0004 patch.

After looking at this a little bit more, I think it's really necessary to separate out all of your changes into separate patches at least for initial review. It's particularly important to separate code movement changes from other kinds of changes. 0001 was just moving code before, and so was 0002, but now both are making other changes, which is not easy to see from looking at the 'git diff' output. For that reason it's not so easy to understand exactly what you've changed here and analyze it.

I poked around a little bit at these patches, looking for perhaps-interesting global variables upon which the code called from XLogAcceptWrites() would depend with your patches applied. The most interesting ones seem to be (1) ThisTimeLineID, which you mentioned and which may be fine but I'm not totally convinced yet, (2) LocalXLogInsertAllowed, which is probably not broken but I'm thinking we may want to redesign that mechanism somehow to make it cleaner, and (3) CheckpointStats, which is called from RemoveXlogFile which is called from RemoveNonParentXlogFiles which is called from CleanupAfterArchiveRecovery which is called from XLogAcceptWrites. This last one is actually pretty weird already in the existing code. It sort of looks like RemoveXlogFile() only expects to be called from the checkpointer (or a standalone backend) so that it can update CheckpointStats and have that just work, but actually it's also called from the startup process when a timeline switch happens. I don't know whether the fact that the increments to ckpt_segs_recycled get lost in that case should be considered an intentional behavior that should be preserved or an inadvertent mistake.

So I think you've covered most of the necessary things here, with probably some more discussion needed on whether you've done the right things...

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Sep 15, 2021 at 9:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Sep 15, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Putting these changes into 0001 seems to make no sense. It seems like
> > they should be part of 0003, or maybe a new 0004 patch.
>
> After looking at this a little bit more, I think it's really necessary
> to separate out all of your changes into separate patches at least for
> initial review. It's particularly important to separate code movement
> changes from other kinds of changes. 0001 was just moving code before,
> and so was 0002, but now both are making other changes, which is not
> easy to see from looking at the 'git diff' output. For that reason
> it's not so easy to understand exactly what you've changed here and
> analyze it.
>

Ok, understood, I have separated my changes into the 0001 and 0002 patches, and the refactoring patches start from 0003.

In the 0001 patch, I have copied ArchiveRecoveryRequested to shared memory as said previously. Copying the ArchiveRecoveryRequested value to shared memory is not really interesting, and I think somehow we should reuse an existing variable (perhaps, with some modification of the information it can store, e.g. adding one more enum value for SharedRecoveryState or something else); I am thinking on the same. In addition to that, I tried to narrow the scope of the ArchiveRecoveryRequested global variable. Now, this is a static variable, and the scope is limited to the xlog.c file like LocalXLogInsertAllowed, and it can be accessed through the newly added function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let me know what you think about the approach.

The 0002 patch is a mixed one where I tried to remove the dependencies on global variables and local variables belonging to StartupXLOG(). I am still worried about the InRecovery value that needs to be deduced afterward inside XLogAcceptWrites().
Currently, it relies on the ControlFile->state != DB_SHUTDOWNED check, but I think that will not be good for ASRO, where we plan to skip only the XLogAcceptWrites() work and let StartupXLOG() do the rest of the work as it is, where it is going to update ControlFile's DBState to DB_IN_PRODUCTION; then we might need some ugly kludge to call PerformRecoveryXLogAction() in the checkpointer irrespective of DBState, which makes me a bit uncomfortable.

> I poked around a little bit at these patches, looking for
> perhaps-interesting global variables upon which the code called from
> XLogAcceptWrites() would depend with your patches applied. The most
> interesting ones seem to be (1) ThisTimeLineID, which you mentioned
> and which may be fine but I'm not totally convinced yet, (2)
> LocalXLogInsertAllowed, which is probably not broken but I'm thinking
> we may want to redesign that mechanism somehow to make it cleaner, and

Thanks for the off-list detailed explanation on this.

For somebody else who might be reading this, the concern here is (not really a concern, it is a good thing to improve) that the LocalSetXLogInsertAllowed() function call is a kind of hack that enables wal writes irrespective of RecoveryInProgress() for a short period. E.g. see the following code in StartupXLOG:

"
    LocalSetXLogInsertAllowed();
    UpdateFullPageWrites();
    LocalXLogInsertAllowed = -1;
....
....
    /*
     * If any of the critical GUCs have changed, log them before we allow
     * backends to write WAL.
     */
    LocalSetXLogInsertAllowed();
    XLogReportParameters();
"

Instead of explicitly enabling wal insert, somehow that should be implicitly allowed for the startup process and/or the checkpointer doing the first checkpoint and/or wal writes after the recovery. Well, the current LocalSetXLogInsertAllowed() mechanism is not really harming anything and does not necessarily need to change, but it would be nice if we were able to come up with a much cleaner and less bug-prone design.
(Hope I am not missing anything from the discussion).

> (3) CheckpointStats, which is called from RemoveXlogFile which is
> called from RemoveNonParentXlogFiles which is called from
> CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> This last one is actually pretty weird already in the existing code.
> It sort of looks like RemoveXlogFile() only expects to be called from
> the checkpointer (or a standalone backend) so that it can update
> CheckpointStats and have that just work, but actually it's also called
> from the startup process when a timeline switch happens. I don't know
> whether the fact that the increments to ckpt_segs_recycled get lost in
> that case should be considered an intentional behavior that should be
> preserved or an inadvertent mistake.
>

Maybe I could be wrong, but I think that is intentional. It removes pre-allocated or bogus files of the old timeline which are not supposed to be considered in stats. The comments for CheckpointStatsData might not be clear but the comments at the places calling RemoveNonParentXlogFiles() inside StartupXLOG hint the same:

"
    /*
     * Before we continue on the new timeline, clean up any
     * (possibly bogus) future WAL segments on the old
     * timeline.
     */
    RemoveNonParentXlogFiles(EndRecPtr, ThisTimeLineID);
....
....
     * We switched to a new timeline. Clean up segments on the old
     * timeline.
     *
     * If there are any higher-numbered segments on the old timeline,
     * remove them. They might contain valid WAL, but they might also be
     * pre-allocated files containing garbage. In any case, they are not
     * part of the new timeline's history so we don't need them.
     */
    RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);
"

> So I think you've covered most of the necessary things here, with
> probably some more discussion needed on whether you've done the right
> things...
>

Thanks, Robert, for your time.

Regards,
Amul Sul
Attachment
- v35-0005-Create-XLogAcceptWrites-function-with-code-from-.patch
- v35-0002-miscellaneous-remove-dependency-on-global-and-lo.patch
- v35-0004-Postpone-some-end-of-recovery-operations-relatin.patch
- v35-0001-Store-ArchiveRecoveryRequested-in-shared-memory-.patch
- v35-0003-Refactor-some-end-of-recovery-code-out-of-Startu.patch
Hi Mark,

I have tried to fix your review comment in the attached version, please see my inline reply below.

On Fri, Sep 10, 2021 at 8:06 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Sep 9, 2021 at 11:12 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> >
> >
>
> Thank you, for looking at the patch. Please see my reply inline below:
>
> >
> > > On Sep 8, 2021, at 6:44 AM, Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > Here is the rebased version.
> >
> > v33-0004
> >
> > This patch moves the include of "catalog/pg_control.h" from transam/xlog.c into access/xlog.h, making pg_control.h indirectly included from a much larger set of files. Maybe that's ok. I don't know. But it seems you are doing this merely to get the symbol (not even the definition) for struct DBState. I'd recommend rearranging the code so this isn't necessary, but otherwise you'd at least want to remove the now redundant includes of catalog/pg_control.h from xlogdesc.c, xloginsert.c, auth-scram.c, postmaster.c, misc/pg_controldata.c, and pg_controldata/pg_controldata.c.
> >
>
> Yes, you are correct, xlog.h is included in more than 150 files. I was
> wondering if we can have a forward declaration instead of including
> pg_control.h (e.g. The same way struct XLogRecData was declared in
> xlog.h). Perhaps, DBState is enum & I don't see we have done the same
> for enum elsewhere as we are doing for structures, but that seems to
> be fine, IMO.
>
> Earlier, I was unsure before preparing this patch, but since that
> makes sense (I assumed) and minimizes duplications, can we go ahead
> and post separately with the same change in StartupXLOG() which I have
> skipped for the same reason mentioned in patch commit-msg.
>

FYI, I have posted this patch separately [1] & dropped it from the current set.

> > v33-0005
> > The code comment change in autovacuum.c introduces a non-grammatical sentence: "First, the system is not read only i.e. wal writes permitted".
> >

Fixed.
> > The function comment in checkpointer.c reads more like it toggles the system into allowing something, rather than actually doing that same something: "SendSignalToCheckpointer allows a process to send a signal to the checkpoint process".

I am not sure I understand the concern; what comment would you suggest? This function helps to signal the checkpointer, but it does not say what the checkpointer is supposed to do.

> > The new code comment in ipci.c contains a typo, but more importantly, it doesn't impart any knowledge beyond what a reader of the function name could already surmise. Perhaps the comment can better clarify what is happening: "Set up walprobibit shared state"

Done.

> > The new code comment in sync.c copies and changes a nearby comment but drops part of the verb phrase: "As in ProcessSyncRequests, we don't want to stop wal prohibit change requests". The nearby comment reads "stop absorbing". I think this one should read "stop processing". This same comment is used again below. Then a third comment reads "For the same reason mentioned previously for the wal prohibit state change request check." That third comment is too glib.

Ok, "stop processing" is used. I think the third comment is fine as it is, rather than copying the same text again; however, I changed it a bit for more clarity to "For the same reason mentioned previously for the same function call".

> > tcop/utility.c needlessly includes "access/walprohibit.h"
> >
> > wait_event.h extends enum WaitEventIO with new values WAIT_EVENT_WALPROHIBIT_STATE and WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. I don't find the difference between these two names at all clear. Waiting for a state change is clear enough. But how is waiting on a state different?

WAIT_EVENT_WALPROHIBIT_STATE_CHANGE gets set in pg_prohibit_wal() while waiting for the system to complete the WAL prohibit state change. WAIT_EVENT_WALPROHIBIT_STATE is set for the checkpointer process when it sees that the system is in the WAL prohibited state and stops there.
But I think it makes sense to have only one, i.e. WAIT_EVENT_WALPROHIBIT_STATE_CHANGE. The same can be used for the checkpointer, since it won't do anything until the WAL prohibited state changes. I have removed WAIT_EVENT_WALPROHIBIT_STATE in the attached version.

> > xlog.h defines a new enum. I don't find any of it clear; not the comment, nor the name of the enum, nor the names of the values:
> >
> > /* State of work that enables wal writes */
> > typedef enum XLogAcceptWritesState
> > {
> >     XLOG_ACCEPT_WRITES_PENDING = 0, /* initial state, not started */
> >     XLOG_ACCEPT_WRITES_SKIPPED,     /* skipped wal writes */
> >     XLOG_ACCEPT_WRITES_DONE         /* wal writes are enabled */
> > } XLogAcceptWritesState;
> >
> > This enum seems to have been written from the point of view of someone who already knew what it was for. It needs to be written in a way that will be clear to people who have no idea what it is for.

I tried to avoid the function name in the comment, since the enum name closely resembles the name of the XLogAcceptWrites() function whose execution state we are trying to track, but I have added it now, which should make it much clearer.

> > v33-0006:
> >
> > The new code comments in brin.c and elsewhere should use the verb "require" rather than "have", otherwise "building indexes" reads as a noun phrase rather than as a gerund: /* Building indexes will have an XID */

I rephrased the comments, but I think "have XID" is more appropriate there because the name of that assert function ends with HaveXID. Apart from this, I have moved CheckWALPermitted() closer to START_CRIT_SECTION, which you pointed out in your other post, and made a few other changes.

Note that the patch numbers have changed: I have rebased my implementation on top of the refactoring patches under discussion, which I posted previously [2], and reattached them here so that CFbot continues its testing.
Note that with the current patch version on the latest master head I am seeing one issue, though only occasionally: the same INSERT query gets stuck waiting for WALWriteLock in exclusive mode. I am not sure whether it is due to my changes, but it does not occur without my patch. I am looking into it; in case anybody wants to know more, I have attached the backtrace, pg_lock, and ps output in the attached file ps-bt-pg_lock.out.text.

Regards,
Amul

[1] https://postgr.es/m/CAAJ_b97nd_ghRpyFV9Djf9RLXkoTbOUqnocq11WGq9TisX09Fw@mail.gmail.com
[2] https://postgr.es/m/CAAJ_b96G-oBxDC3C7Y72ER09bsheGHOxBK1HXHVOyHNXjTDmcA@mail.gmail.com
Attachment
- ps-bt-pg_lock.out.text
- v35-0010-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v35-0008-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v35-0006-Allow-RequestCheckpoint-call-from-checkpointer-p.patch
- v35-0007-Implement-wal-prohibit-state-using-global-barrie.patch
- v35-0009-Documentation.patch
- v35-0005-Create-XLogAcceptWrites-function-with-code-from-.patch
- v35-0004-Postpone-some-end-of-recovery-operations-relatin.patch
- v35-0003-Refactor-some-end-of-recovery-code-out-of-Startu.patch
- v35-0002-miscellaneous-remove-dependency-on-global-and-lo.patch
- v35-0001-Store-ArchiveRecoveryRequested-in-shared-memory-.patch
On Wed, Sep 15, 2021 at 4:34 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> > On Jun 16, 2020, at 6:55 AM, amul sul <sulamul@gmail.com> wrote:
> >
> > (2) if the session is idle, we also need the top-level abort
> > record to be written immediately, but can't send an error to the client until the next
> > command is issued without losing wire protocol synchronization. For now, we just use
> > FATAL to kill the session; maybe this can be improved in the future.
>
> Andres,
>
> I'd like to have a patch that tests the impact of a vacuum running for xid wraparound purposes, blocked on a pinned page held by the cursor, when another session disables WAL. It would be very interesting to test how the vacuum handles that specific change. I have not figured out the cleanest way to do this, though, as we don't as a project yet have a standard way of setting up xid exhaustion in a regression test, do we? The closest I saw to it was your work in [1], but that doesn't seem to have made much headway recently, and is designed for the TAP testing infrastructure, which isn't useable from inside an isolation test. Do you have a suggestion how best to continue developing out the test infrastructure?
>
> Amul,
>
> The most obvious way to test how your ALTER SYSTEM READ ONLY feature interacts with concurrent sessions is using the isolation tester in src/test/isolation/, but as it stands now, the first permutation that gets a FATAL causes the test to abort and all subsequent permutations to not run. Attached patch v34-0009 fixes that.
>

Interesting.

> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.
>

Can't we do the same in the TAP test? If the intention is only to test session termination when the system changes to WAL prohibited, then I have added that in the latest version, but that test does not re-initiate the same connection; I think that is not possible there either.

Regards,
Amul
> On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:
>
>> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.
>>
>
> Can't we do the same in the TAP test? If the intention is only to test
> session termination when the system changes to WAL prohibited, then
> I have added that in the latest version, but that test does not
> re-initiate the same connection; I think that is not possible
> there either.

Perhaps you can point me to a TAP test that does this in a concise fashion. When I tried writing a TAP test for this, it was much longer than the equivalent isolation test spec.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 22, 2021 at 6:59 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> > On Sep 22, 2021, at 6:14 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> >> Attached patch v34-0010 adds a test of cursors opened FOR UPDATE interacting with a system that is set read-only by a different session. The expected output is worth reviewing to see how this plays out. I don't see anything in there which is obviously wrong, but some of it is a bit clunky. For example, by the time the client sees an error "FATAL: WAL is now prohibited", the system may already have switched back to read-write. Also, it is a bit strange to get one of these errors on an attempted ROLLBACK. Once again, not wrong as such, but clunky.
> >>
> >
> > Can't we do the same in the TAP test? If the intention is only to test
> > session termination when the system changes to WAL prohibited, then
> > I have added that in the latest version, but that test does not
> > re-initiate the same connection; I think that is not possible
> > there either.
>
> Perhaps you can point me to a TAP test that does this in a concise fashion. When I tried writing a TAP test for this, it was much longer than the equivalent isolation test spec.
>

Yes, that is a bit longer; here is the snippet from the v35-0010 patch:

+my $psql_timeout = IPC::Run::timer(60);
+my ($mysession_stdin, $mysession_stdout, $mysession_stderr) = ('', '', '');
+my $mysession = IPC::Run::start(
+    [
+        'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d',
+        $node_primary->connstr('postgres')
+    ],
+    '<',
+    \$mysession_stdin,
+    '>',
+    \$mysession_stdout,
+    '2>',
+    \$mysession_stderr,
+    $psql_timeout);
+
+# Write in transaction and get backend pid
+$mysession_stdin .= q[
+BEGIN;
+INSERT INTO tab VALUES(7);
+SELECT $$value-7-inserted-into-tab$$;
+];
+$mysession->pump until $mysession_stdout =~ /value-7-inserted-into-tab[\r\n]$/;
+like($mysession_stdout, qr/value-7-inserted-into-tab/,
+    'started write transaction in a session');
+$mysession_stdout = '';
+$mysession_stderr = '';
+
+# Change to WAL prohibited
+$node_primary->safe_psql('postgres', 'SELECT pg_prohibit_wal(true)');
+is($node_primary->safe_psql('postgres', $show_wal_prohibited_query), 'on',
+    'server is changed to wal prohibited by another session');
+
+# Try to commit open write transaction.
+$mysession_stdin .= q[
+COMMIT;
+];
+$mysession->pump;
+like($mysession_stderr, qr/FATAL: WAL is now prohibited/,
+    'session with open write transaction is terminated');

Regards,
Amul
> On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Yes, that is a bit longer, here is the snip from v35-0010 patch

Right, that's longer, and only tests one interaction. The isolation spec I posted upthread tests multiple interactions between the session which uses cursors and the system going read-only. Whether the session using a cursor gets a FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing. I think that's interesting. If the implementation of the ALTER SYSTEM READ ONLY feature were to change in such a way as, for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 22, 2021 at 7:33 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> > On Sep 22, 2021, at 6:39 AM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > Yes, that is a bit longer, here is the snip from v35-0010 patch
>
> Right, that's longer, and only tests one interaction. The isolation spec I posted upthread tests multiple interactions between the session which uses cursors and the system going read-only. Whether the session using a cursor gets a FATAL, just an ERROR, or neither depends on where it is in the process of opening, using, closing and committing. I think that's interesting. If the implementation of the ALTER SYSTEM READ ONLY feature were to change in such a way as, for example, to make the attempt to open the cursor be a FATAL error, you'd see a change in the test output.
>

Agreed.

Regards,
Amul
On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote: > Ok, understood, I have separated my changes into 0001 and 0002 patch, > and the refactoring patches start from 0003. I think it would be better in the other order, with the refactoring patches at the beginning of the series. > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared > memory as said previously. Coping ArchiveRecoveryRequested value to > shared memory is not really interesting, and I think somehow we should > reuse existing variable, (perhaps, with some modification of the > information it can store, e.g. adding one more enum value for > SharedRecoveryState or something else), thinking on the same. > > In addition to that, I tried to turn down the scope of > ArchiveRecoveryRequested global variable. Now, this is a static > variable, and the scope is limited to xlog.c file like > LocalXLogInsertAllowed and can be accessed through the newly added > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let > me know what you think about the approach. I'm not sure yet whether I like this or not, but it doesn't seem like a terrible idea. You spelled UNKNOWN wrong, though, which does seem like a terrible idea. :-) "acccsed" is not correct either. Also, the new comments for ArchiveRecoveryRequested / ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is substitute the new terminology into the existing comment, but that means that the purpose of the new "unknown" value is not at all clear. Consider the following two patch fragments: + * SharedArchiveRecoveryRequested indicates whether an archive recovery is + * requested. Protected by info_lck. ... + * Remember archive recovery request in shared memory state. A lock is not + * needed since we are the only ones who updating this. These two comments directly contradict each other. 
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->SharedArchiveRecoveryRequested = false;
+    ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN;
+    SpinLockRelease(&XLogCtl->info_lck);

This seems odd to me. In the first place, there doesn't seem to be any value in clearing this -- we're just expending extra CPU cycles to get rid of a value that wouldn't be used anyway. In the second place, if somehow someone checked the value after this point, with this code, they might get the wrong answer, whereas if you just deleted this, they would get the right answer.

> In 0002 patch is a mixed one where I tried to remove the dependencies
> on global variables and local variables belonging to StartupXLOG(). I
> am still worried about the InRecovery value that needs to be deduced
> afterward inside XLogAcceptWrites(). Currently, relying on
> ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> good for ASRO where we plan to skip XLogAcceptWrites() work only and
> let the StartupXLOG() do the rest of the work as it is where it will
> going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> might need some ugly kludge to call PerformRecoveryXLogAction() in
> checkpointer irrespective of DBState, which makes me a bit
> uncomfortable.

I think that replacing the if (InRecovery) test with if (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I mean, there are three separate places where we set InRecovery = true. One of those executes if ControlFile->state != DB_SHUTDOWNED, matching what you have here, but it also can happen if checkPoint.redo < RecPtr, or if read_backup_label is true and ReadCheckpointRecord returns non-NULL. Now maybe you're going to tell me that in those other two cases we can't reach here anyway, but I don't see off-hand why that should be true, and even if it is true, it seems like kind of a fragile thing to rely on.
I think we need to rely on something in shared memory that is more explicitly connected to the question of whether we are in recovery.

The other part of this patch has to do with whether we can use the return value of GetLastSegSwitchData as a substitute for relying on EndOfLog. Now as you have it, you end up creating a local variable called EndOfLog that shadows another such variable in an outer scope, which probably would not make anyone who noticed things in such a state very happy. However, that will naturally get fixed if you reorder the patches as per above, so let's turn to the central question: is this a good way of getting EndOfLog? The value that would be in effect at the time this code is executed is set here:

    XLogBeginRead(xlogreader, LastRec);
    record = ReadRecord(xlogreader, PANIC, false);
    EndOfLog = EndRecPtr;

Subsequently we do this:

    /* start the archive_timeout timer and LSN running */
    XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
    XLogCtl->lastSegSwitchLSN = EndOfLog;

So at that point the value that GetLastSegSwitchData() would return has to match what's in the existing variable. But later XLogWrite() will change the value. So the question boils down to whether XLogWrite() could have been called between the assignment just above and when this code runs. And the answer seems pretty clearly to be yes, because just above this code, we might have done CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above that, we did UpdateFullPageWrites(). So I don't think this is right.

> > (3) CheckpointStats, which is called from RemoveXlogFile which is
> > called from RemoveNonParentXlogFiles which is called from
> > CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> > This last one is actually pretty weird already in the existing code.
> > It sort of looks like RemoveXlogFile() only expects to be called from > > the checkpointer (or a standalone backend) so that it can update > > CheckpointStats and have that just work, but actually it's also called > > from the startup process when a timeline switch happens. I don't know > > whether the fact that the increments to ckpt_segs_recycled get lost in > > that case should be considered an intentional behavior that should be > > preserved or an inadvertent mistake. > > > Maybe I could be wrong, but I think that is intentional. It removes > pre-allocated or bogus files of the old timeline which are not > supposed to be considered in stats. The comments for > CheckpointStatsData might not be clear but comment at the calling > RemoveNonParentXlogFiles() place inside StartupXLOG hints the same: Sure, I'm not saying the files are being removed by accident. I'm saying it may be accidental that the removals are (I think) not going to make it into the stats. -- Robert Haas EDB: http://www.enterprisedb.com
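
To make the objection about the InRecovery test concrete, here is a minimal sketch in C. It is not PostgreSQL code: the `StartupInputs` struct and both functions are hypothetical, and the `DBState` enum is cut down to three values. It only models the three conditions listed above and compares them with the single control-file check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Reduced copy of the control-file states; the real enum has more values. */
typedef enum DBState
{
    DB_SHUTDOWNED,
    DB_IN_CRASH_RECOVERY,
    DB_IN_PRODUCTION
} DBState;

/* Hypothetical bundle of the inputs consulted at startup (illustration only). */
typedef struct StartupInputs
{
    DBState    control_state;            /* ControlFile->state */
    XLogRecPtr checkpoint_redo;          /* checkPoint.redo */
    XLogRecPtr checkpoint_loc;           /* RecPtr */
    bool       have_backup_label;        /* read_backup_label succeeded */
    bool       checkpoint_record_found;  /* ReadCheckpointRecord != NULL */
} StartupInputs;

/* Models all three places where InRecovery can become true. */
bool
in_recovery_full(const StartupInputs *in)
{
    return in->control_state != DB_SHUTDOWNED ||
           in->checkpoint_redo < in->checkpoint_loc ||
           (in->have_backup_label && in->checkpoint_record_found);
}

/* Models the proposed substitute: look at the control-file state only. */
bool
in_recovery_from_control_state(const StartupInputs *in)
{
    return in->control_state != DB_SHUTDOWNED;
}
```

With a cleanly shut down control file but `checkpoint_redo < checkpoint_loc`, the first function reports recovery and the second does not. Whether that combination is reachable in practice is exactly the open question in the mail above; the sketch only shows that the two tests are not equivalent as written.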
On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote: > > Ok, understood, I have separated my changes into 0001 and 0002 patch, > > and the refactoring patches start from 0003. > > I think it would be better in the other order, with the refactoring > patches at the beginning of the series. > Ok, will do that. I did this other way to minimize the diff e.g. deletion diff of RecoveryXlogAction enum and DetermineRecoveryXlogAction(), etc. > > In the 0001 patch, I have copied ArchiveRecoveryRequested to shared > > memory as said previously. Coping ArchiveRecoveryRequested value to > > shared memory is not really interesting, and I think somehow we should > > reuse existing variable, (perhaps, with some modification of the > > information it can store, e.g. adding one more enum value for > > SharedRecoveryState or something else), thinking on the same. > > > > In addition to that, I tried to turn down the scope of > > ArchiveRecoveryRequested global variable. Now, this is a static > > variable, and the scope is limited to xlog.c file like > > LocalXLogInsertAllowed and can be accessed through the newly added > > function ArchiveRecoveryIsRequested() (like PromoteIsTriggered()). Let > > me know what you think about the approach. > > I'm not sure yet whether I like this or not, but it doesn't seem like > a terrible idea. You spelled UNKNOWN wrong, though, which does seem > like a terrible idea. :-) "acccsed" is not correct either. > > Also, the new comments for ArchiveRecoveryRequested / > ARCHIVE_RECOVERY_REQUEST_* are really not very clear. All you did is > substitute the new terminology into the existing comment, but that > means that the purpose of the new "unknown" value is not at all clear. > Ok, will fix those typos and try to improve the comments. 
> Consider the following two patch fragments: > > + * SharedArchiveRecoveryRequested indicates whether an archive recovery is > + * requested. Protected by info_lck. > ... > + * Remember archive recovery request in shared memory state. A lock is not > + * needed since we are the only ones who updating this. > > These two comments directly contradict each other. > Okay, the first comment is not clear enough, I will fix that too. I meant we don't need the lock now since we are the only one updating at this point. > + SpinLockAcquire(&XLogCtl->info_lck); > + XLogCtl->SharedArchiveRecoveryRequested = false; > + ArchiveRecoveryRequested = ARCHIVE_RECOVERY_REQUEST_UNKOWN; > + SpinLockRelease(&XLogCtl->info_lck); > > This seems odd to me. In the first place, there doesn't seem to be any > value in clearing this -- we're just expending extra CPU cycles to get > rid of a value that wouldn't be used anyway. In the second place, if > somehow someone checked the value after this point, with this code, > they might get the wrong answer, whereas if you just deleted this, > they would get the right answer. > Previously, this flag was only valid in the startup process. But now it will be valid for all the processes and will stay until the whole server gets restarted. I don't want anybody to use this flag after the cleanup point and just be sure I am explicitly cleaning this. By the way, I also don't expect we should go with this approach. I proposed this by referring to the PromoteIsTriggered() implementation, but IMO, it is better to have something else since we just want to perform archive cleanup operation, and most of the work related to archive recovery was done inside the StartupXLOG(). Rather than the proposed design, I was thinking of adding one or two more RecoveryState enums. 
And while skipping XLogAcceptWrites(), we would set XLogCtl->SharedRecoveryState appropriately, so that we can easily identify that archive recovery was requested previously and that we now need to perform its pending cleanup operation. Thoughts?

> > In 0002 patch is a mixed one where I tried to remove the dependencies
> > on global variables and local variables belonging to StartupXLOG(). I
> > am still worried about the InRecovery value that needs to be deduced
> > afterward inside XLogAcceptWrites(). Currently, relying on
> > ControlFile->state != DB_SHUTDOWNED check but I think that will not be
> > good for ASRO where we plan to skip XLogAcceptWrites() work only and
> > let the StartupXLOG() do the rest of the work as it is where it will
> > going to update ControlFile's DBState to DB_IN_PRODUCTION, then we
> > might need some ugly kludge to call PerformRecoveryXLogAction() in
> > checkpointer irrespective of DBState, which makes me a bit
> > uncomfortable.
>
> I think that replacing the if (InRecovery) test with if
> (ControlFile->state != DB_SHUTDOWNED) is probably just plain wrong. I
> mean, there are three separate places where we set InRecovery = true.
> One of those executes if ControlFile->state != DB_SHUTDOWNED, matching
> what you have here, but it also can happen if checkPoint.redo <
> RecPtr, or if read_backup_label is true and ReadCheckpointRecord
> returns non-NULL. Now maybe you're going to tell me that in those
> other two cases we can't reach here anyway, but I don't see off-hand
> why that should be true, and even if it is true, it seems like kind of
> a fragile thing to rely on. I think we need to rely on something in
> shared memory that is more explicitly connected to the question of
> whether we are in recovery.
>

No, this is the other way.
I did not pick the (ControlFile->state != DB_SHUTDOWNED) condition because it sets InRecovery; rather, I picked it because, once InRecovery is set, ControlFile->state is set to either DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY; see the next if (InRecovery) block after the InRecovery flag gets set. It is certain that when the system is InRecovery, its DBState will be something other than DB_SHUTDOWNED. But that isn't a clean approach to me, because in the WAL prohibited state the DBState will be DB_IN_PRODUCTION, which will not work, as I mentioned previously. I am also thinking about passing this information via shared memory, but I am trying to avoid that somehow; let's see.

> The other part of this patch has to do with whether we can use the
> return value of GetLastSegSwitchData as a substitute for relying on
> EndOfLog. Now as you have it, you end up creating a local variable
> called EndOfLog that shadows another such variable in an outer scope,
> which probably would not make anyone who noticed things in such a
> state very happy. However, that will naturally get fixed if you
> reorder the patches as per above, so let's turn to the central
> question: is this a good way of getting EndOfLog? The value that would
> be in effect at the time this code is executed is set here:
>
> XLogBeginRead(xlogreader, LastRec);
> record = ReadRecord(xlogreader, PANIC, false);
> EndOfLog = EndRecPtr;
>
> Subsequently we do this:
>
> /* start the archive_timeout timer and LSN running */
> XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
> XLogCtl->lastSegSwitchLSN = EndOfLog;
>
> So at that point the value that GetLastSegSwitchData() would return
> has to match what's in the existing variable. But later XLogWrite()
> will change the value. So the question boils down to whether
> XLogWrite() could have been called between the assignment just above
> and when this code runs.
> And the answer seems pretty clearly to be yes,
> because just above this code, we might have done
> CreateEndOfRecoveryRecord() or RequestCheckpoint(), and just above
> that, we did UpdateFullPageWrites(). So I don't think this is right.
>

You are correct that if XLogWrite() is called in between, the lastSegSwitchLSN value can change, but the question is whether it will change in our case. I think it won't; let me explain.

IIUC, lastSegSwitchLSN generally changes in XLogWrite() only when the previous WAL segment has been filled up. So let's look closely at what will be written before we check lastSegSwitchLSN. Currently, we have a record for full-page writes and a record for either end-of-recovery or a checkpoint; all of these are fixed size, and I don't think they are going to fill a whole 16MB WAL file. Correct me if I am missing something.

> > > (3) CheckpointStats, which is called from RemoveXlogFile which is
> > > called from RemoveNonParentXlogFiles which is called from
> > > CleanupAfterArchiveRecovery which is called from XLogAcceptWrites.
> > > This last one is actually pretty weird already in the existing code.
> > > It sort of looks like RemoveXlogFile() only expects to be called from
> > > the checkpointer (or a standalone backend) so that it can update
> > > CheckpointStats and have that just work, but actually it's also called
> > > from the startup process when a timeline switch happens. I don't know
> > > whether the fact that the increments to ckpt_segs_recycled get lost in
> > > that case should be considered an intentional behavior that should be
> > > preserved or an inadvertent mistake.
> > >
> >
> > Maybe I could be wrong, but I think that is intentional. It removes
> > pre-allocated or bogus files of the old timeline which are not
> > supposed to be considered in stats. The comments for
> > CheckpointStatsData might not be clear but comment at the calling
> > RemoveNonParentXlogFiles() place inside StartupXLOG hints the same:
>
> Sure, I'm not saying the files are being removed by accident. I'm
> saying it may be accidental that the removals are (I think) not going
> to make it into the stats.
>

Understood; it looks like I missed the concluding line in the previous reply. My point was: if we are deleting bogus files, why should we care about counting them in the stats?

Regards,
Amul
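
The lastSegSwitchLSN question above comes down to whether a write can complete a WAL segment. Here is a toy model of that arithmetic in C. It is an illustration under stated assumptions, not how XLogWrite() is implemented: the function name is hypothetical, page headers and record alignment are ignored, and the 16MB segment size is the figure mentioned in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Segment size taken from the discussion above (16MB). */
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * Hypothetical helper: returns true if writing nbytes of WAL starting at
 * start_lsn reaches or crosses the end of the segment containing start_lsn.
 * In this model, that is the only event that would move lastSegSwitchLSN.
 */
bool
write_completes_segment(XLogRecPtr start_lsn, uint64_t nbytes)
{
    uint64_t used = start_lsn % WAL_SEGMENT_SIZE;

    return used + nbytes >= WAL_SEGMENT_SIZE;
}
```

In this model the answer depends on the starting offset within the segment as well as on the number of bytes written, so a few small fixed-size records cannot fill a fresh segment but can complete one that is already nearly full. The sketch makes no claim about how likely that is at end of recovery.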
On Fri, Sep 24, 2021 at 5:07 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Thu, Sep 23, 2021 at 11:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Sep 20, 2021 at 11:20 AM Amul Sul <sulamul@gmail.com> wrote:
> > > Ok, understood, I have separated my changes into 0001 and 0002 patch,
> > > and the refactoring patches start from 0003.
> >
> > I think it would be better in the other order, with the refactoring
> > patches at the beginning of the series.
> >
>
> Ok, will do that. I did this other way to minimize the diff e.g.
> deletion diff of RecoveryXlogAction enum and
> DetermineRecoveryXlogAction(), etc.
>

I have reversed the patch order. Now the refactoring patches come first, and the patch that removes the dependencies on global and local variables is last. I made the necessary modifications in the refactoring patches too, e.g. removed DetermineRecoveryXlogAction() and the RecoveryXlogAction enum, which are no longer needed (thanks to commit 1d919de5eb3fffa7cc9479ed6d2915fb89794459 for making the code simpler).

To find the value of InRecovery after we clear it, the patch still uses ControlFile's DBState, but the check condition is now a more specific one, which is less confusing.

In casual off-list discussion, the point was made to check SharedRecoveryState to find out the InRecovery value afterward, using RecoveryInProgress(). But we can't depend on SharedRecoveryState, because at the start it gets initialized to RECOVERY_STATE_CRASH irrespective of InRecovery, which is determined later. Therefore, we can't use RecoveryInProgress(), which always returns true if SharedRecoveryState != RECOVERY_STATE_DONE.

I am posting only refactoring patches for now.

Regards,
Amul
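
The reasoning about SharedRecoveryState can be sketched as a small C model. This is not the real implementation: `SharedModel` and the three functions are hypothetical stand-ins, and only the enum values and the `!= RECOVERY_STATE_DONE` test are taken from the mail above.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum RecoveryState
{
    RECOVERY_STATE_CRASH = 0,
    RECOVERY_STATE_ARCHIVE,
    RECOVERY_STATE_DONE
} RecoveryState;

/* Hypothetical stand-in for the shared-memory field (illustration only). */
typedef struct SharedModel
{
    RecoveryState SharedRecoveryState;
} SharedModel;

/* Models shared-memory initialization: always starts as "crash recovery". */
void
shared_init(SharedModel *s)
{
    s->SharedRecoveryState = RECOVERY_STATE_CRASH;
}

/* Models the end of startup: the state becomes DONE either way. */
void
shared_finish_startup(SharedModel *s)
{
    s->SharedRecoveryState = RECOVERY_STATE_DONE;
}

/* Models the test described for RecoveryInProgress(). */
bool
recovery_in_progress(const SharedModel *s)
{
    return s->SharedRecoveryState != RECOVERY_STATE_DONE;
}
```

In this model `recovery_in_progress()` is true from initialization until startup finishes, and false afterwards, whether or not any WAL was actually replayed. That is why it cannot stand in for the InRecovery flag.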
Attachment
On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:
> To find the value of InRecovery after we clear it, patch still uses
> ControlFile's DBState, but now the check condition changed to a more
> specific one which is less confusing.
>
> In casual off-list discussion, the point was made to check
> SharedRecoveryState to find out the InRecovery value afterward, and
> check that using RecoveryInProgress(). But we can't depend on
> SharedRecoveryState because at the start it gets initialized to
> RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> Therefore, we can't use RecoveryInProgress() which always returns
> true if SharedRecoveryState != RECOVERY_STATE_DONE.

Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?

[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej

It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.

I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.

--
Robert Haas
EDB: http://www.enterprisedb.com
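The "did we perform recovery?" test proposed in the message above can be illustrated with a small standalone model. This is only a sketch, not code from the patch set: XLogCtlSketch, enter_redo(), replay_record() and recovery_was_performed() are hypothetical names, and XLogRecPtr and XLogRecPtrIsInvalid() are simplified stand-ins for the PostgreSQL definitions. It assumes, as the message argues, that lastReplayedEndRecPtr starts out as 0 and never returns to 0 once redo has been entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's XLogRecPtr definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Hypothetical, heavily reduced model of the two shared fields discussed. */
typedef struct
{
	XLogRecPtr	replayEndRecPtr;
	XLogRecPtr	lastReplayedEndRecPtr;	/* assumed to start out as 0 */
} XLogCtlSketch;

/*
 * Models entering the if (InRecovery) block: the replay position is set to
 * a valid LSN and lastReplayedEndRecPtr is unconditionally copied from it.
 */
void
enter_redo(XLogCtlSketch *ctl, XLogRecPtr start)
{
	assert(!XLogRecPtrIsInvalid(start));
	ctl->replayEndRecPtr = start;
	ctl->lastReplayedEndRecPtr = ctl->replayEndRecPtr;
}

/* Models replaying one record: the position only ever moves forward. */
void
replay_record(XLogCtlSketch *ctl, XLogRecPtr end)
{
	if (end > ctl->lastReplayedEndRecPtr)
	{
		ctl->replayEndRecPtr = end;
		ctl->lastReplayedEndRecPtr = end;
	}
}

/* Answers "did we perform recovery?" rather than "are we in recovery now?" */
bool
recovery_was_performed(const XLogCtlSketch *ctl)
{
	return !XLogRecPtrIsInvalid(ctl->lastReplayedEndRecPtr);
}
```

The point of the design is that the answer stays the same after the control file has moved to DB_IN_PRODUCTION and SharedRecoveryState has become RECOVERY_STATE_DONE, because it depends only on whether redo was ever entered.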
On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote:
> To find the value of InRecovery after we clear it, patch still uses
> ControlFile's DBState, but now the check condition changed to a more
> specific one which is less confusing.
>
> In casual off-list discussion, the point was made to check
> SharedRecoveryState to find out the InRecovery value afterward, and
> check that using RecoveryInProgress(). But we can't depend on
> SharedRecoveryState because at the start it gets initialized to
> RECOVERY_STATE_CRASH irrespective of InRecovery that happens later.
> Therefore, we can't use RecoveryInProgress() which always returns
> true if SharedRecoveryState != RECOVERY_STATE_DONE.
Uh, this change has crept into 0002, but it should be in 0004 with the
rest of the changes to remove dependencies on variables specific to
the startup process. Like I said before, we should really be trying to
separate code movement from functional changes. Also, 0002 doesn't
actually apply for me. Did you generate these patches with 'git
format-patch'?
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 889 (offset 9 lines).
Hunk #2 succeeded at 939 (offset 12 lines).
Hunk #3 succeeded at 5734 (offset 37 lines).
Hunk #4 succeeded at 8038 (offset 70 lines).
Hunk #5 succeeded at 8248 (offset 70 lines).
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 7954.
Hunk #2 succeeded at 8079 (offset 70 lines).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlog.c.rej
[rhaas pgsql]$ git reset --hard
HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a
non-recoverable connection failure.
[rhaas pgsql]$ patch -p1 <
~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
patching file src/backend/access/transam/xlog.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file
src/backend/access/transam/xlog.c.rej
I tried to apply the patch on the master branch head and it's failing
with conflicts.
Later applied patch on below commit and it got applied cleanly:
commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Sep 27 18:48:01 2021 -0400
Re-enable contrib/bloom's TAP tests.
rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch
rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch
rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent.
/*
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent.
*/
v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent.
Insert->fullPageWrites = lastFullPageWrites;
warning: 3 lines add whitespace errors.
rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch
There are whitespace errors on patch 0003.
It seems to me that the approach you're pursuing here can't work,
because the long-term goal is to get to a place where, if the system
starts up read-only, XLogAcceptWrites() might not be called until
later, after StartupXLOG() has exited. But in that case the control
file state would be DB_IN_PRODUCTION. But my idea of using
RecoveryInProgress() won't work either, because we set
RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put
differently, the question we want to answer is not "are we in recovery
now?" but "did we perform recovery?". After studying the code a bit, I
think a good test might be
!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery
gets set to true, then we're certain to enter the if (InRecovery)
block that contains the main redo loop. And that block unconditionally
does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I
think that replayEndRecPtr can't be 0 because it's supposed to
represent the record we're pretending to have last replayed, as
explained by the comments. And while lastReplayedEndRecPtr will get
updated later as we replay more records, I think it will never be set
back to 0. It's only going to increase, as we replay more records. On
the other hand if InRecovery = false then we'll never change it, and
it seems that it starts out as 0.
I was hoping to have more time today to comment on 0004, but the day
seems to have gotten away from me. One quick thought is that it looks
a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
... because there's also replayEndRecPtr, which seems to go with
replayEndTLI. It feels like we should use a source for the TLI that
clearly matches the source for the corresponding LSN, unless there's
some super-good reason to do otherwise.
--
Robert Haas
EDB: http://www.enterprisedb.com
Rushabh Lathia
On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > > > On Fri, Oct 1, 2021 at 2:29 AM Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Thu, Sep 30, 2021 at 7:59 AM Amul Sul <sulamul@gmail.com> wrote: >> > To find the value of InRecovery after we clear it, patch still uses >> > ControlFile's DBState, but now the check condition changed to a more >> > specific one which is less confusing. >> > >> > In casual off-list discussion, the point was made to check >> > SharedRecoveryState to find out the InRecovery value afterward, and >> > check that using RecoveryInProgress(). But we can't depend on >> > SharedRecoveryState because at the start it gets initialized to >> > RECOVERY_STATE_CRASH irrespective of InRecovery that happens later. >> > Therefore, we can't use RecoveryInProgress() which always returns >> > true if SharedRecoveryState != RECOVERY_STATE_DONE. >> >> Uh, this change has crept into 0002, but it should be in 0004 with the >> rest of the changes to remove dependencies on variables specific to >> the startup process. Like I said before, we should really be trying to >> separate code movement from functional changes. Well, I have to replace the InRecovery flag in that patch since we are moving code past to the point where the InRecovery flag gets cleared. If I don't do, then the 0002 patch would be wrong since InRecovery is always false, and behaviour won't be the same as it was before that patch. >> Also, 0002 doesn't >> actually apply for me. Did you generate these patches with 'git >> format-patch'? >> >> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch >> patching file src/backend/access/transam/xlog.c >> Hunk #1 succeeded at 889 (offset 9 lines). >> Hunk #2 succeeded at 939 (offset 12 lines). >> Hunk #3 succeeded at 5734 (offset 37 lines). >> Hunk #4 succeeded at 8038 (offset 70 lines). >> Hunk #5 succeeded at 8248 (offset 70 lines). 
>> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch >> patching file src/backend/access/transam/xlog.c >> Reversed (or previously applied) patch detected! Assume -R? [n] >> Apply anyway? [n] y >> Hunk #1 FAILED at 7954. >> Hunk #2 succeeded at 8079 (offset 70 lines). >> 1 out of 2 hunks FAILED -- saving rejects to file >> src/backend/access/transam/xlog.c.rej >> [rhaas pgsql]$ git reset --hard >> HEAD is now at b484ddf4d2 Treat ETIMEDOUT as indicating a >> non-recoverable connection failure. >> [rhaas pgsql]$ patch -p1 < >> ~/Downloads/v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch >> patching file src/backend/access/transam/xlog.c >> Reversed (or previously applied) patch detected! Assume -R? [n] >> Apply anyway? [n] >> Skipping patch. >> 2 out of 2 hunks ignored -- saving rejects to file >> src/backend/access/transam/xlog.c.rej >> > > I tried to apply the patch on the master branch head and it's failing > with conflicts. > Thanks, Rushabh, for the quick check, I have attached a rebased version for the latest master head commit # f6b5d05ba9a. > Later applied patch on below commit and it got applied cleanly: > > commit 7d1aa6bf1c27bf7438179db446f7d1e72ae093d0 > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: Mon Sep 27 18:48:01 2021 -0400 > > Re-enable contrib/bloom's TAP tests. > > rushabh@rushabh:postgresql$ git apply v36-0001-Refactor-some-end-of-recovery-code-out-of-Startu.patch > rushabh@rushabh:postgresql$ git apply v36-0002-Postpone-some-end-of-recovery-operations-relatin.patch > rushabh@rushabh:postgresql$ git apply v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:34: space before tab in indent. > /* > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:38: space before tab in indent. > */ > v36-0003-Create-XLogAcceptWrites-function-with-code-from-.patch:39: space before tab in indent. 
> Insert->fullPageWrites = lastFullPageWrites; > warning: 3 lines add whitespace errors. > rushabh@rushabh:postgresql$ git apply v36-0004-Remove-dependencies-on-startup-process-specifica.patch > > There are whitespace errors on patch 0003. > Fixed. >> >> It seems to me that the approach you're pursuing here can't work, >> because the long-term goal is to get to a place where, if the system >> starts up read-only, XLogAcceptWrites() might not be called until >> later, after StartupXLOG() has exited. But in that case the control >> file state would be DB_IN_PRODUCTION. But my idea of using >> RecoveryInProgress() won't work either, because we set >> RECOVERY_STATE_DONE just after we set DB_IN_PRODUCTION. Put >> differently, the question we want to answer is not "are we in recovery >> now?" but "did we perform recovery?". After studying the code a bit, I >> think a good test might be >> !XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr). If InRecovery >> gets set to true, then we're certain to enter the if (InRecovery) >> block that contains the main redo loop. And that block unconditionally >> does XLogCtl->lastReplayedEndRecPtr = XLogCtl->replayEndRecPtr. I >> think that replayEndRecPtr can't be 0 because it's supposed to >> represent the record we're pretending to have last replayed, as >> explained by the comments. And while lastReplayedEndRecPtr will get >> updated later as we replay more records, I think it will never be set >> back to 0. It's only going to increase, as we replay more records. On >> the other hand if InRecovery = false then we'll never change it, and >> it seems that it starts out as 0. >> Understood, used lastReplayedEndRecPtr but in 0002 patch for the aforesaid reason. >> I was hoping to have more time today to comment on 0004, but the day >> seems to have gotten away from me. 
>> One quick thought is that it looks
>> a bit strange to be getting EndOfLog from GetLastSegSwitchData() which
>> returns lastSegSwitchLSN while getting EndOfLogTLI from replayEndTLI
>> ... because there's also replayEndRecPtr, which seems to go with
>> replayEndTLI. It feels like we should use a source for the TLI that
>> clearly matches the source for the corresponding LSN, unless there's
>> some super-good reason to do otherwise.

Agreed, that would be the right thing, but on the latest master head
that might not be the right thing to use because of commit #
ff9f111bce24, which introduced the following code that can make
EndOfLog differ from replayEndRecPtr:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

With this commit, we have got two new global variables. The first,
missingContrecPtr, is an EndOfLog value which gets stored in shared
memory at a few places. The other one, abortedRecPtr, is needed in
XLogAcceptWrites(), so I have exported it into shared memory.

Regards,
Amul
On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> <rushabh.lathia@gmail.com> wrote:
> >
> > I tried to apply the patch on the master branch head and it's failing
> > with conflicts.
> >
>
> Thanks, Rushabh, for the quick check, I have attached a rebased version for the
> latest master head commit # f6b5d05ba9a.
>

Hi,

I got this error while executing "make check" on src/test/recovery:

"""
t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
# SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Looks like your test exited with 29 just after 1.
t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 2/3 subtests

Test Summary Report
-------------------
t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 3 tests but ran 1.
Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1
"""

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
>
> On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote:
> > On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia
> > <rushabh.lathia@gmail.com> wrote:
> > >
> > > I tried to apply the patch on the master branch head and it's failing
> > > with conflicts.
> > >
> >
> > Thanks, Rushabh, for the quick check, I have attached a rebased version for the
> > latest master head commit # f6b5d05ba9a.
> >
>
> Hi,
>
> I got this error while executing "make check" on src/test/recovery:
>
> """
> t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query:
> # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn()
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Looks like your test exited with 29 just after 1.
> t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
> Failed 2/3 subtests
>
> Test Summary Report
> -------------------
> t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0)
> Non-zero exit status: 29
> Parse errors: Bad plan. You planned 3 tests but ran 1.
> Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU)
> Result: FAIL
> make: *** [Makefile:23: check] Error 1
> """
>

Thanks for reporting the problem, I am working on it. The cause of the
failure is that the v37_0004 patch clears the missingContrecPtr global
variable before CreateOverwriteContrecordRecord() executes, which it
shouldn't.

Regards,
Amul
On Thu, Oct 7, 2021 at 6:21 PM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, Oct 7, 2021 at 5:56 AM Jaime Casanova > <jcasanov@systemguards.com.ec> wrote: > > > > On Tue, Oct 05, 2021 at 04:11:58PM +0530, Amul Sul wrote: > > > On Mon, Oct 4, 2021 at 1:57 PM Rushabh Lathia > > > <rushabh.lathia@gmail.com> wrote: > > > > > > > > I tried to apply the patch on the master branch head and it's failing > > > > with conflicts. > > > > > > > > > > Thanks, Rushabh, for the quick check, I have attached a rebased version for the > > > latest master head commit # f6b5d05ba9a. > > > > > > > Hi, > > > > I got this error while executing "make check" on src/test/recovery: > > > > """ > > t/026_overwrite_contrecord.pl ........ 1/3 # poll_query_until timed out executing this query: > > # SELECT '0/201D4D8'::pg_lsn <= pg_last_wal_replay_lsn() > > # expecting this output: > > # t > > # last actual query output: > > # f > > # with stderr: > > # Looks like your test exited with 29 just after 1. > > t/026_overwrite_contrecord.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00) > > Failed 2/3 subtests > > > > Test Summary Report > > ------------------- > > t/026_overwrite_contrecord.pl (Wstat: 7424 Tests: 1 Failed: 0) > > Non-zero exit status: 29 > > Parse errors: Bad plan. You planned 3 tests but ran 1. > > Files=26, Tests=279, 400 wallclock secs ( 0.27 usr 0.10 sys + 73.78 cusr 59.66 csys = 133.81 CPU) > > Result: FAIL > > make: *** [Makefile:23: check] Error 1 > > """ > > > > Thanks for the reporting problem, I am working on it. The cause of > failure is that v37_0004 patch clearing the missingContrecPtr global > variable before CreateOverwriteContrecordRecord() execution, which it > shouldn't. > In the attached version I have fixed this issue by restoring missingContrecPtr. 
To handle abortedRecPtr and missingContrecPtr, the global variables
newly added through commit # ff9f111bce24, we don't need to store
them in shared memory separately. Instead, we need a flag indicating
that a broken record was found previously, at the end of recovery, so
that we can overwrite the contrecord.

missingContrecPtr is assigned to EndOfLog, and we have handled
EndOfLog previously in the 0004 patch, and abortedRecPtr is the
same as lastReplayedEndRecPtr, AFAICS. I have added an assert to
ensure that the lastReplayedEndRecPtr value is the same as
abortedRecPtr, but I think that is not needed; we can go ahead and
write an overwrite-contrecord starting at lastReplayedEndRecPtr.

Regards,
Amul
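The flag-based idea in the message above can be sketched as follows. This is a hypothetical illustration under the message's own assumption (marked AFAICS there) that abortedRecPtr equals lastReplayedEndRecPtr. EndOfRecoverySketch, its brokenRecordFound flag and overwrite_contrecord_target() are invented names, not taken from the patch or from PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's XLogRecPtr definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical snapshot of what the end-of-recovery code would read back
 * from shared memory under this proposal.
 */
typedef struct
{
	XLogRecPtr	lastReplayedEndRecPtr;	/* assumed equal to abortedRecPtr */
	XLogRecPtr	endOfLog;				/* already set from missingContrecPtr */
	bool		brokenRecordFound;		/* hypothetical flag */
} EndOfRecoverySketch;

/*
 * Returns the LSN at which an overwrite-contrecord would start, or
 * InvalidXLogRecPtr when WAL did not end in a broken record.
 */
XLogRecPtr
overwrite_contrecord_target(const EndOfRecoverySketch *s)
{
	if (!s->brokenRecordFound)
		return InvalidXLogRecPtr;
	return s->lastReplayedEndRecPtr;
}
```

Under this sketch only a single boolean has to be carried past the end of recovery, because both LSNs are derived from values that are already tracked.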
On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote:
> In the attached version I have fixed this issue by restoring missingContrecPtr.
>
> To handle abortedRecPtr and missingContrecPtr newly added global
> variables through the commit # ff9f111bce24, we don't need to store
> them in the shared memory separately, instead, we need a flag that
> indicates a broken record found previously, at the end of recovery, so
> that we can overwrite contrecord.
>
> The missingContrecPtr is assigned to the EndOfLog, and we have handled
> EndOfLog previously in the 0004 patch, and the abortedRecPtr is the
> same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to
> ensure that the lastReplayedEndRecPtr value is the same as the
> abortedRecPtr, but I think that is not needed, we can go ahead and
> write an overwrite-contrecord starting at lastReplayedEndRecPtr.

I thought that it made sense to commit 0001 and 0002 at this point, so
I have done that. I think that the treatment of missingContrecPtr and
abortedRecPtr may need more thought yet, so at least for that reason I
don't think it's a good idea to proceed with 0004 yet. 0003 is just
code movement so I guess that can be committed whenever we're
confident that we know exactly which things we want to end up inside
XLogAcceptWrites().

I do have a few ideas after studying this a bit more:

- I wonder whether, in addition to moving a few things later as 0002
did, we also ought to think about moving one thing earlier,
specifically XLogReportParameters(). Right now, we have, I believe,
four things that write WAL at the end of recovery:
CreateOverwriteContrecordRecord(), UpdateFullPageWrites(),
PerformRecoveryXLogAction(), and XLogReportParameters(). As the code
is structured now, we do the first three of those things, and then do
a bunch of other stuff inside CleanupAfterArchiveRecovery() like
running recovery_end_command, and removing non-parent xlog files, and
archiving the partial segment, and then come back and do the fourth
one. Is there any good reason for that? If not, I think doing them all
together would be cleaner, and would propose to reverse the order of
CleanupAfterArchiveRecovery() and XLogReportParameters().

- If we did that, then I would further propose to adjust things so
that we remove the call to LocalSetXLogInsertAllowed() and the
assignment LocalXLogInsertAllowed = -1 from inside
CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
after UpdateFullPageWrites(), and the call to
LocalSetXLogInsertAllowed() just before XLogReportParameters().
Instead, just let the call to LocalSetXLogInsertAllowed() right before
CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
to be much point in flipping that switch off and on again, and the
fact that we have been doing so is in my view just evidence that
StartupXLOG() doesn't do a very good job of getting related code all
into one place.

- It seems really tempting to invent a fourth RecoveryState value that
indicates that we are done with REDO but not yet in production, and
maybe also to rename RecoveryState to something like WALState. I'm
thinking of something like WAL_STATE_CRASH_RECOVERY,
WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
WAL_STATE_PRODUCTION. Then, instead of having
LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
the startup process and the checkpointer are allowed to insert WAL
when the state is WAL_STATE_REDO_COMPLETE, but other processes only
once we reach WAL_STATE_PRODUCTION. We would set
WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
It's necessary to include the checkpointer, or at least I think it is,
because PerformRecoveryXLogAction() might call RequestCheckpoint(),
and that's got to work. If we did this, then I think it would also
solve another problem which the overall patch set has to address
somehow. Say that we eventually move responsibility for the
to-be-created XLogAcceptWrites() function from the startup process to
the checkpointer, as proposed. The checkpointer needs to know when to
call it ... and the answer with this change is simple: when we reach
WAL_STATE_REDO_COMPLETE, it's time!

But this idea is not completely problem-free. I spent some time poking
at it and I think it's a little hard to come up with a satisfying way
to code XLogInsertAllowed(). Right now that function calls
RecoveryInProgress(), and if RecoveryInProgress() decides that
recovery is no longer in progress, it calls InitXLOGAccess(). However,
that presumes that the only reason you'd call RecoveryInProgress() is
to figure out whether you should write WAL, which I don't think is
really true, and it also means that, when the wal state is
WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return
true in the checkpointer and startup process and false everywhere
else, which does not sound like a great idea. It seems fine to say
that xlog insertion is allowed in some processes but not others,
because not all processes are necessarily equally privileged, but
whether or not we're in recovery is supposed to be something about
which everyone agrees, so answering that question differently in
different processes doesn't seem nice. XLogInsertAllowed() could be
rewritten to check the state directly and make its own determination,
without relying on RecoveryInProgress(), and I think that might be the
right way to go here.

But that isn't entirely problem-free either, because there's a lot of
code that uses RecoveryInProgress() to answer the question "should I
write WAL?" and therefore it's not great if RecoveryInProgress() is
returning an answer that is inconsistent with XLogInsertAllowed().
MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this
kind of coding. It probably wouldn't break in practice right away,
because most of that code never runs in the startup process or the
checkpointer and would therefore never notice the difference in
behavior between those two functions, but if in the future we get the
read-only feature that this thread is supposed to be about, we'd have
problems. Not all RecoveryInProgress() calls have this sense - e.g.
sendDir() in basebackup.c is trying to figure out whether recovery
ended during the backup, not whether we can write WAL. But perhaps
this is a good time to go and replace RecoveryInProgress() checks that
are intending to decide whether or not it's OK to write WAL with
XLogInsertAllowed() checks (noting that the return value is reversed).
If we did that, then I think RecoveryInProgress() could also NOT call
InitXLOGAccess(), and that could be done only by XLogInsertAllowed(),
which seems like it might be better. But I haven't really tried to
code all of this up, so I'm not really sure how it all works out.

--
Robert Haas
EDB: http://www.enterprisedb.com
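The four-state idea above might look roughly like the sketch below. Nothing here is committed or proposed code: the WAL_STATE_* names come from the message, while ProcKind, wal_insert_allowed() and recovery_in_progress() are hypothetical. The sketch assumes that whether WAL insertion is allowed may depend on the calling process, and that the "in recovery" answer depends only on the shared state so that every process agrees on it. Treating WAL_STATE_REDO_COMPLETE as still "in recovery" is a further assumption that the message leaves open.

```c
#include <stdbool.h>

/* State names as suggested in the message; the enum itself is a sketch. */
typedef enum
{
	WAL_STATE_CRASH_RECOVERY,
	WAL_STATE_ARCHIVE_RECOVERY,
	WAL_STATE_REDO_COMPLETE,
	WAL_STATE_PRODUCTION
} WALState;

/* Hypothetical process classification, for illustration only. */
typedef enum
{
	PROC_STARTUP,
	PROC_CHECKPOINTER,
	PROC_BACKEND
} ProcKind;

/*
 * Decides whether WAL insertion is allowed by looking at the state
 * directly instead of going through a recovery-in-progress check.
 */
bool
wal_insert_allowed(WALState state, ProcKind proc)
{
	switch (state)
	{
		case WAL_STATE_PRODUCTION:
			return true;
		case WAL_STATE_REDO_COMPLETE:
			/* only the privileged processes may write end-of-recovery WAL */
			return proc == PROC_STARTUP || proc == PROC_CHECKPOINTER;
		case WAL_STATE_CRASH_RECOVERY:
		case WAL_STATE_ARCHIVE_RECOVERY:
			return false;
	}
	return false;
}

/*
 * Process-independent answer to "are we in recovery?". Assumption: the
 * REDO_COMPLETE state still counts as recovery for everyone.
 */
bool
recovery_in_progress(WALState state)
{
	return state != WAL_STATE_PRODUCTION;
}
```

With this split, callers that really mean "should I write WAL?" would ask wal_insert_allowed(), which is the replacement of RecoveryInProgress() checks that the message suggests, and the two answers can differ during WAL_STATE_REDO_COMPLETE without any process disagreeing about whether recovery is in progress.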
On Thu, Oct 14, 2021 at 11:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 12, 2021 at 8:18 AM Amul Sul <sulamul@gmail.com> wrote: > > In the attached version I have fixed this issue by restoring missingContrecPtr. > > > > To handle abortedRecPtr and missingContrecPtr newly added global > > variables thought the commit # ff9f111bce24, we don't need to store > > them in the shared memory separately, instead, we need a flag that > > indicates a broken record found previously, at the end of recovery, so > > that we can overwrite contrecord. > > > > The missingContrecPtr is assigned to the EndOfLog, and we have handled > > EndOfLog previously in the 0004 patch, and the abortedRecPtr is the > > same as the lastReplayedEndRecPtr, AFAICS. I have added an assert to > > ensure that the lastReplayedEndRecPtr value is the same as the > > abortedRecPtr, but I think that is not needed, we can go ahead and > > write an overwrite-contrecord starting at lastReplayedEndRecPtr. > > I thought that it made sense to commit 0001 and 0002 at this point, so > I have done that. I think that the treatment of missingContrecPtr and > abortedRecPtr may need more thought yet, so at least for that reason I > don't think it's a good idea to proceed with 0004 yet. 0003 is just > code movement so I guess that can be committed whenever we're > confident that we know exactly which things we want to end up inside > XLogAcceptWrites(). > Ok. > I do have a few ideas after studying this a bit more: > > - I wonder whether, in addition to moving a few things later as 0002 > did, we also ought to think about moving one thing earlier, > specifically XLogReportParameters(). Right now, we have, I believe, > four things that write WAL at the end of recovery: > CreateOverwriteContrecordRecord(), UpdateFullPageWrites(), > PerformRecoveryXLogAction(), and XLogReportParameters(). 
> As the code
> is structured now, we do the first three of those things, and then do
> a bunch of other stuff inside CleanupAfterArchiveRecovery() like
> running recovery_end_command, and removing non-parent xlog files, and
> archiving the partial segment, and then come back and do the fourth
> one. Is there any good reason for that? If not, I think doing them all
> together would be cleaner, and would propose to reverse the order of
> CleanupAfterArchiveRecovery() and XLogReportParameters().
>

Yes, that can be done.

> - If we did that, then I would further propose to adjust things so
> that we remove the call to LocalSetXLogInsertAllowed() and the
> assignment LocalXLogInsertAllowed = -1 from inside
> CreateEndOfRecoveryRecord(), the LocalXLogInsertAllowed = -1 from just
> after UpdateFullPageWrites(), and the call to
> LocalSetXLogInsertAllowed() just before XLogReportParameters().
> Instead, just let the call to LocalSetXLogInsertAllowed() right before
> CreateOverwriteContrecordRecord() remain in effect. There doesn't seem
> to be much point in flipping that switch off and on again, and the
> fact that we have been doing so is in my view just evidence that
> StartupXLOG() doesn't do a very good job of getting related code all
> into one place.
>

Currently there are three places that call LocalSetXLogInsertAllowed()
and then reset the LocalXLogInsertAllowed flag: StartupXLOG(),
CreateEndOfRecoveryRecord(), and CreateCheckPoint(). By doing the
aforementioned code rearrangement we can get rid of the repeated calls
from StartupXLOG() and can completely remove the need for it in
CreateEndOfRecoveryRecord(), since that gets called only from
StartupXLOG() directly. CreateCheckPoint() also gets called from
StartupXLOG(), but only when running in a standalone backend; in that
case we don't need to call LocalSetXLogInsertAllowed(), but if it is
running in the checkpointer process then we do.

I tried this in the attached version, but I'm a bit skeptical about the
changes that are needed for CreateCheckPoint(); those don't seem to be
clean. I am wondering if we could completely remove the need for the
end-of-recovery checkpoint as proposed in [1]. That would get rid of
the CHECKPOINT_END_OF_RECOVERY operation and the
LocalSetXLogInsertAllowed() requirement in CreateCheckPoint(), and
after that we would not expect a checkpoint operation in recovery. If
we could do that, then we would have LocalSetXLogInsertAllowed() in
only one place, i.e. in StartupXLOG() (...and in the future in
XLogAcceptWrites()) -- the code that runs only once in the lifetime of
the server -- and the kludge that the attached patch does for
CreateCheckPoint() would not be needed.

> - It seems really tempting to invent a fourth RecoveryState value that
> indicates that we are done with REDO but not yet in production, and
> maybe also to rename RecoveryState to something like WALState. I'm
> thinking of something like WAL_STATE_CRASH_RECOVERY,
> WAL_STATE_ARCHIVE_RECOVERY, WAL_STATE_REDO_COMPLETE, and
> WAL_STATE_PRODUCTION. Then, instead of having
> LocalSetXLogInsertAllowed(), we could teach XLogInsertAllowed() that
> the startup process and the checkpointer are allowed to insert WAL
> when the state is WAL_STATE_REDO_COMPLETE, but other processes only
> once we reach WAL_STATE_PRODUCTION. We would set
> WAL_STATE_REDO_COMPLETE where we now call LocalSetXLogInsertAllowed().
> It's necessary to include the checkpointer, or at least I think it is,
> because PerformRecoveryXLogAction() might call RequestCheckpoint(),
> and that's got to work. If we did this, then I think it would also
> solve another problem which the overall patch set has to address
> somehow. Say that we eventually move responsibility for the
> to-be-created XLogAcceptWrites() function from the startup process to
> the checkpointer, as proposed. The checkpointer needs to know when to
> call it ...
and the answer with this change is simple: when we reach > WAL_STATE_REDO_COMPLETE, it's time! > > But this idea is not completely problem-free. I spent some time poking > at it and I think it's a little hard to come up with a satisfying way > to code XLogInsertAllowed(). Right now that function calls > RecoveryInProgress(), and if RecoveryInProgress() decides that > recovery is no longer in progress, it calls InitXLOGAccess(). However, > that presumes that the only reason you'd call RecoveryInProgress() is > to figure out whether you should write WAL, which I don't think is > really true, and it also means that, when the wal state is > WAL_STATE_REDO_COMPLETE, RecoveryInProgress() would need to return > true in the checkpointer and startup process and false everywhere > else, which does not sound like a great idea. It seems fine to say > that xlog insertion is allowed in some processes but not others, > because not all processes are necessarily equally privileged, but > whether or not we're in recovery is supposed to be something about > which everyone agrees, so answering that question differently in > different processes doesn't seem nice. XLogInsertAllowed() could be > rewritten to check the state directly and make its own determination, > without relying on RecoveryInProgress(), and I think that might be the > right way to go here. > > But that isn't entirely problem-free either, because there's a lot of > code that uses RecoveryInProgress() to answer the question "should I > write WAL?" and therefore it's not great if RecoveryInProgress() is > returning an answer that is inconsistent with XLogInsertAllowed(). > MarkBufferDirtyHint() and heap_page_prune_opt() are examples of this > kind of coding. 
> It probably wouldn't break in practice right away, because most of that code never runs in the startup process or the checkpointer and would therefore never notice the difference in behavior between those two functions, but if in the future we get the read-only feature that this thread is supposed to be about, we'd have problems. Not all RecoveryInProgress() calls have this sense - e.g. sendDir() in basebackup.c is trying to figure out whether recovery ended during the backup, not whether we can write WAL. But perhaps this is a good time to go and replace RecoveryInProgress() checks that are intending to decide whether or not it's OK to write WAL with XLogInsertAllowed() checks (noting that the return value is reversed). If we did that, then I think RecoveryInProgress() could also NOT call InitXLOGAccess(), and that could be done only by XLogInsertAllowed(), which seems like it might be better. But I haven't really tried to code all of this up, so I'm not really sure how it all works out.

I agree that calling InitXLOGAccess() from RecoveryInProgress() is not good, but I am not sure about calling it from XLogInsertAllowed() either. Both are status-check functions, and the general expectation is that status-checking functions do not change or initialize system state. InitXLOGAccess() should be called from the very first WAL write operation, if needed; but if we don't want to do that, then I would prefer to call InitXLOGAccess() from XLogInsertAllowed() rather than from RecoveryInProgress().

As said before, if we were able to get rid of the end-of-recovery checkpoint [1], then we would not need separate handling in XLogInsertAllowed() for the checkpointer process, which would be much cleaner. For the startup process, we would force XLogInsertAllowed() to return true by calling LocalSetXLogInsertAllowed() for the time being, as we do right now.
Regards,
Amul

[1] "using an end-of-recovery record in all cases": https://postgr.es/m/CAAJ_b95xPx6oHRb5VEatGbp-cLsZApf_9GWGtbv9dsFKiV_VDQ@mail.gmail.com
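The four-state idea quoted above can be pictured with a small sketch. It is not code from any patch in this thread: the WAL_STATE_* names come from the proposal, while ProcKind, recovery_in_progress() and xlog_insert_allowed() are hypothetical stand-ins. It also assumes that "in recovery" gets one answer for every process, which is the property the proposal wants to keep.

```c
#include <assert.h>
#include <stdbool.h>

/* State names taken from the proposal; nothing like this exists in the tree. */
typedef enum WalState
{
    WAL_STATE_CRASH_RECOVERY,
    WAL_STATE_ARCHIVE_RECOVERY,
    WAL_STATE_REDO_COMPLETE,
    WAL_STATE_PRODUCTION
} WalState;

/* Hypothetical process classification, for illustration only. */
typedef enum ProcKind
{
    PROC_STARTUP,
    PROC_CHECKPOINTER,
    PROC_OTHER
} ProcKind;

/* Assumption: every process gets the same answer to "are we in recovery?". */
bool
recovery_in_progress(WalState state)
{
    return state != WAL_STATE_PRODUCTION;
}

/*
 * Decide from the state directly, without consulting recovery_in_progress():
 * the startup process and the checkpointer may insert WAL once redo is
 * complete, everyone else only in production.
 */
bool
xlog_insert_allowed(WalState state, ProcKind proc)
{
    switch (state)
    {
        case WAL_STATE_PRODUCTION:
            return true;
        case WAL_STATE_REDO_COMPLETE:
            return proc == PROC_STARTUP || proc == PROC_CHECKPOINTER;
        default:
            return false;
    }
}
```

With this shape, xlog_insert_allowed() may differ between processes while recovery_in_progress() does not, so callers that really mean "may I write WAL?" would have to ask the former.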
Attachment
On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:
> I tried this in the attached version, but I'm a bit skeptical with changes that are needed for CreateCheckPoint(), those don't seem to be clean.

Yeah, that doesn't look great. I don't think it's entirely correct, actually, because surely you want LocalXLogInsertAllowed = 0 to be executed even if !IsPostmasterEnvironment. It's only LocalXLogInsertAllowed = -1 that we would want to have depend on IsPostmasterEnvironment. But that's pretty ugly too: I guess the reason it has to be like that is that, if it does that unconditionally, it will overwrite the temporary value of 1 set by the caller, which will then cause problems when the caller tries to XLogReportParameters().

I think that problem goes away if we drive the decision off of shared state rather than a local variable, but I agree that it's otherwise a bit tricky to untangle. One idea might be to have LocalSetXLogInsertAllowed return the old value. Then we could use the same kind of coding we do when switching memory contexts, where we say:

    oldcontext = MemoryContextSwitchTo(something);
    // do stuff
    MemoryContextSwitchTo(oldcontext);

Here we could maybe do:

    oldxlallowed = LocalSetXLogInsertAllowed();
    // do stuff
    LocalXLogInsertAllowed = oldxlallowed;

That way, instead of CreateCheckPoint() knowing under what circumstances the caller might have changed the value, it only knows that some callers might have already changed the value. That seems better.

> I agree that calling InitXLOGAccess() from RecoveryInProgress() is not good, but I am not sure about calling it from XLogInsertAllowed() either, perhaps, both are status check function and general expectations might be that status checking functions are not going change and/or initialize the system state. InitXLOGAccess() should get called from the very first WAL write operation if needed, but if we don't want to do that, then I would prefer to call InitXLOGAccess() from XLogInsertAllowed() instead of RecoveryInProgress().

Well, that's a fair point, too, but it might not be safe to, say, move this to XLogBeginInsert(). Like, imagine that there's a hypothetical piece of code that looks like this:

    if (RecoveryInProgress())
        ereport(ERROR, errmsg("can't do that in recovery"));

    // do something here that depends on ThisTimeLineID or wal_segment_size or RedoRecPtr

    XLogBeginInsert();
    ....
    lsn = XLogInsert(...);

Such code would work correctly the way things are today, but if the InitXLOGAccess() call were deferred until XLogBeginInsert() time, then it would fail.

I was curious whether this is just a theoretical problem. It turns out that it's not. I wrote a couple of just-for-testing patches, which I attach here. The first one just adjusts things so that we'll fail an assertion if we try to make use of ThisTimeLineID before we've set it to a legal value. I had to exempt two places from these checks just for 'make check-world' to pass; these are shown in the patch, and one or both of them might be existing bugs -- or maybe not, I haven't looked too deeply. The second one then adjusts the patch to pretend that ThisTimeLineID is not necessarily valid just because we've called InitXLOGAccess() but that it is valid after XLogBeginInsert(). With that change, I find about a dozen places where, apparently, the early call to InitXLOGAccess() is critical to getting ThisTimeLineID adjusted in time. So apparently a change of this type is not entirely trivial. And this is just a quick test, and just for one of the three things that get initialized here.

On the other hand, just moving it to XLogInsertAllowed() isn't risk-free either and would likely require adjusting some of the same places I found with this test. So I guess if we want to do something like this we need more study.
-- Robert Haas EDB: http://www.enterprisedb.com
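The save-and-restore idea from the message above can be tried in isolation. The following is a standalone sketch, not the patch: it reuses the names LocalXLogInsertAllowed and LocalSetXLogInsertAllowed() from the discussion, but the variable here is only a stand-in for the backend-local flag, and write_with_wal_enabled() and caller_keeps_its_setting() are hypothetical helpers playing the roles of CreateCheckPoint() and StartupXLOG().

```c
#include <assert.h>

/*
 * Stand-in for the backend-local flag:
 *   1 = insertion allowed, 0 = insertion forbidden, -1 = consult shared state.
 */
static int LocalXLogInsertAllowed = -1;

/* Force "allowed" and hand back the previous value so the caller can restore it. */
int
LocalSetXLogInsertAllowed(void)
{
    int oldXLogAllowed = LocalXLogInsertAllowed;

    LocalXLogInsertAllowed = 1;
    return oldXLogAllowed;
}

/* Hypothetical callee that needs WAL insertion enabled only while it runs. */
int
write_with_wal_enabled(void)
{
    int oldXLogAllowed = LocalSetXLogInsertAllowed();
    int seen = LocalXLogInsertAllowed;  /* value in effect during the work */

    /* ... do stuff that inserts WAL ... */

    LocalXLogInsertAllowed = oldXLogAllowed;    /* restore, whatever it was */
    return seen;
}

/* Hypothetical caller that had already enabled insertion before calling. */
int
caller_keeps_its_setting(void)
{
    (void) LocalSetXLogInsertAllowed();
    (void) write_with_wal_enabled();
    return LocalXLogInsertAllowed;      /* still 1: the callee restored it */
}
```

The callee never needs to know whether its caller had already flipped the flag; it only puts back what it found, which is the property the message above is after.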
Attachment
On Tue, Oct 19, 2021 at 3:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 18, 2021 at 9:54 AM Amul Sul <sulamul@gmail.com> wrote:
> > I tried this in the attached version, but I'm a bit skeptical with changes that are needed for CreateCheckPoint(), those don't seem to be clean.
>
> Yeah, that doesn't look great. I don't think it's entirely correct, actually, because surely you want LocalXLogInsertAllowed = 0 to be executed even if !IsPostmasterEnvironment. It's only LocalXLogInsertAllowed = -1 that we would want to have depend on IsPostmasterEnvironment. But that's pretty ugly too: I guess the reason it has to be like that is that, if it does that unconditionally, it will overwrite the temporary value of 1 set by the caller, which will then cause problems when the caller tries to XLogReportParameters().
>
> I think that problem goes away if we drive the decision off of shared state rather than a local variable, but I agree that it's otherwise a bit tricky to untangle. One idea might be to have LocalSetXLogInsertAllowed return the old value. Then we could use the same kind of coding we do when switching memory contexts, where we say:
>
>     oldcontext = MemoryContextSwitchTo(something);
>     // do stuff
>     MemoryContextSwitchTo(oldcontext);
>
> Here we could maybe do:
>
>     oldxlallowed = LocalSetXLogInsertAllowed();
>     // do stuff
>     LocalXLogInsertAllowed = oldxlallowed;
>

OK, I did the same in the attached 0001 patch.

There is no harm in calling LocalSetXLogInsertAllowed() multiple times, but the problem I can see is that, with this patch, the user is allowed to call LocalSetXLogInsertAllowed() at a time when it is not supposed to be called, e.g. when LocalXLogInsertAllowed = 0, meaning WAL writes are explicitly disabled.

> That way, instead of CreateCheckPoint() knowing under what circumstances the caller might have changed the value, it only knows that some callers might have already changed the value. That seems better.
> > > I agree that calling InitXLOGAccess() from RecoveryInProgress() is not > > good, but I am not sure about calling it from XLogInsertAllowed() > > either, perhaps, both are status check function and general > > expectations might be that status checking functions are not going > > change and/or initialize the system state. InitXLOGAccess() should > > get called from the very first WAL write operation if needed, but if > > we don't want to do that, then I would prefer to call InitXLOGAccess() > > from XLogInsertAllowed() instead of RecoveryInProgress(). > > Well, that's a fair point, too, but it might not be safe to, say, move > this to XLogBeginInsert(). Like, imagine that there's a hypothetical > piece of code that looks like this: > > if (RecoveryInProgress()) > ereport(ERROR, errmsg("can't do that in recovery"))); > > // do something here that depends on ThisTimeLineID or > wal_segment_size or RedoRecPtr > > XLogBeginInsert(); > .... > lsn = XLogInsert(...); > > Such code would work correctly the way things are today, but if the > InitXLOGAccess() call were deferred until XLogBeginInsert() time, then > it would fail. > > I was curious whether this is just a theoretical problem. It turns out > that it's not. I wrote a couple of just-for-testing patches, which I > attach here. The first one just adjusts things so that we'll fail an > assertion if we try to make use of ThisTimeLineID before we've set it > to a legal value. I had to exempt two places from these checks just > for 'make check-world' to pass; these are shown in the patch, and one > or both of them might be existing bugs -- or maybe not, I haven't > looked too deeply. The second one then adjusts the patch to pretend > that ThisTimeLineID is not necessarily valid just because we've called > InitXLOGAccess() but that it is valid after XLogBeginInsert(). 
> With that change, I find about a dozen places where, apparently, the early call to InitXLOGAccess() is critical to getting ThisTimeLineID adjusted in time. So apparently a change of this type is not entirely trivial. And this is just a quick test, and just for one of the three things that get initialized here.
>
> On the other hand, just moving it to XLogInsertAllowed() isn't risk-free either and would likely require adjusting some of the same places I found with this test. So I guess if we want to do something like this we need more study.
>

Yeah, that requires a lot of energy and time -- I have not done anything related to this in the attached version.

Please have a look at the attached version, where the 0001 patch changes LocalSetXLogInsertAllowed() as described above. The 0002 patch moves XLogReportParameters() closer to the other WAL write operations and removes the unnecessary LocalSetXLogInsertAllowed() calls. 0003 is code movement that adds the XLogAcceptWrites() function, same as before, and the 0004 patch tries to remove the dependency. The 0004 patch could change depending on the decision made for the patch that I posted [1] to remove the abortedRecPtr global variable. For now, I have copied abortedRecPtr into shared memory.

Thanks!

[1] https://postgr.es/m/CAAJ_b94Y75ZwMim+gxxexVwf_yzO-dChof90ky0dB2GstspNjA@mail.gmail.com

Regards,
Amul
Attachment
On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote: > Ok, did the same in the attached 0001 patch. > > There is no harm in calling LocalSetXLogInsertAllowed() calling > multiple times, but the problem I can see is that with this patch user > is allowed to call LocalSetXLogInsertAllowed() at the time it is > supposed not to be called e.g. when LocalXLogInsertAllowed = 0; > WAL writes are explicitly disabled. I've pushed 0001 and 0002 but I reversed the order of them and made a few other edits. I don't really see the issue you mention here as a problem. There's only one place where we set LocalXLogInsertAllowed = 0, and I don't know that we'll ever have another one. -- Robert Haas EDB: http://www.enterprisedb.com
On 10/25/21, 7:50 AM, "Robert Haas" <robertmhaas@gmail.com> wrote: > I've pushed 0001 and 0002 but I reversed the order of them and made a > few other edits. My compiler is complaining about oldXLogAllowed possibly being used uninitialized in CreateCheckPoint(). AFAICT it can just be initially set to zero to silence this warning because it will, in fact, be initialized properly when it is used. Nathan
On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote: > My compiler is complaining about oldXLogAllowed possibly being used > uninitialized in CreateCheckPoint(). AFAICT it can just be initially > set to zero to silence this warning because it will, in fact, be > initialized properly when it is used. Hmm, I guess I could have foreseen that, had I been a little bit smarter than I am. I have committed a change to initialize it to 0 as you propose. -- Robert Haas EDB: http://www.enterprisedb.com
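For readers wondering why a compiler would warn here, a minimal sketch of the usual shape follows. It assumes that the assignment and the later use are guarded by the same condition, which is how the exchange above describes it; checkpoint_like() and its parameters are hypothetical and are not taken from CreateCheckPoint().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * "old" is assigned and read only when end_of_recovery is true, so it is
 * never read uninitialized. Some compilers cannot prove that the two
 * conditions match and warn anyway; initializing to 0 silences them
 * without changing behavior.
 */
int
checkpoint_like(bool end_of_recovery, int *flag)
{
    int old = 0;    /* initialized only to keep the compiler quiet */

    if (end_of_recovery)
    {
        old = *flag;    /* remember the caller's setting */
        *flag = 1;      /* temporarily allow insertion */
    }

    /* ... checkpoint work would happen here ... */

    if (end_of_recovery)
        *flag = old;    /* restore the caller's setting */

    return *flag;
}
```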
On 10/25/21, 1:33 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Mon, Oct 25, 2021 at 3:14 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> My compiler is complaining about oldXLogAllowed possibly being used >> uninitialized in CreateCheckPoint(). AFAICT it can just be initially >> set to zero to silence this warning because it will, in fact, be >> initialized properly when it is used. > > Hmm, I guess I could have foreseen that, had I been a little bit > smarter than I am. I have committed a change to initialize it to 0 as > you propose. Thanks! Nathan
On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote:
> > Ok, did the same in the attached 0001 patch.
> >
> > There is no harm in calling LocalSetXLogInsertAllowed() calling multiple times, but the problem I can see is that with this patch user is allowed to call LocalSetXLogInsertAllowed() at the time it is supposed not to be called e.g. when LocalXLogInsertAllowed = 0; WAL writes are explicitly disabled.
>
> I've pushed 0001 and 0002 but I reversed the order of them and made a few other edits.
>

Thank you!

I have rebased the remaining patches on top of the latest master head (commit # e63ce9e8d6a).

In addition, I changed 0002 so that it no longer includes the change from the previous version that tried to remove the arguments of CleanupAfterArchiveRecovery(). If we want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to replace the EndOfLogTLI and EndOfLog arguments, respectively, then we also need to consider the case where EndOfLog changes because an abort record exists. That can be decided only in XLogAcceptWrites(), before the shared memory value related to the abort record is cleared.

Regards,
Amul
Attachment
On Tue, Oct 26, 2021 at 4:29 PM Amul Sul <sulamul@gmail.com> wrote: > > On Mon, Oct 25, 2021 at 8:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Oct 25, 2021 at 3:05 AM Amul Sul <sulamul@gmail.com> wrote: > > > Ok, did the same in the attached 0001 patch. > > > > > > There is no harm in calling LocalSetXLogInsertAllowed() calling > > > multiple times, but the problem I can see is that with this patch user > > > is allowed to call LocalSetXLogInsertAllowed() at the time it is > > > supposed not to be called e.g. when LocalXLogInsertAllowed = 0; > > > WAL writes are explicitly disabled. > > > > I've pushed 0001 and 0002 but I reversed the order of them and made a > > few other edits. > > > > Thank you! > > I have rebased the remaining patches on top of the latest master head > (commit # e63ce9e8d6a). > > In addition to that, I did the additional changes to 0002 where I > haven't included the change that tries to remove arguments of > CleanupAfterArchiveRecovery() in the previous version. Because if we > want to use XLogCtl->replayEndTLI and XLogCtl->replayEndRecPtr to > replace EndOfLogTLI and EndOfLog arguments respectively, then we also > need to consider the case where EndOfLog is changing if the > abort-record does exist. That can be decided only in XLogAcceptWrite() > before the shared memory value related to abort-record is going to be > clear. > Attached is the rebased version of refactoring as well as the pg_prohibit_wal feature patches for the latest master head (commit # 39a3105678a). I was planning to attach the rebased version of isolation test patches that Mark has posted before[1], but some permutation tests are not stable, where expected errors get printed differently; therefore, I dropped that from the attachment, for now. Regards, Amul 1] https://postgr.es/m/9BA3BA57-6B7B-45CB-B8D9-6B5EB0104FFA@enterprisedb.com
Attachment
- v41-0007-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v41-0006-Documentation.patch
- v41-0004-Implement-wal-prohibit-state-using-global-barrie.patch
- v41-0003-Allow-RequestCheckpoint-call-from-checkpointer-p.patch
- v41-0005-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v41-0002-Remove-dependencies-on-startup-process-specifica.patch
- v41-0001-Create-XLogAcceptWrites-function-with-code-from-.patch
On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is the rebased version of refactoring as well as the pg_prohibit_wal feature patches for the latest master head (commit # 39a3105678a).

I spent a lot of time today studying 0002, and specifically the question of whether EndOfLog must be the same as XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as XLogCtl->replayEndTLI.

The answer to the former question is "no" because, if we don't enter redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do enter redo, then I think it has to be the same unless something very weird happens. EndOfLog gets set like this:

    XLogBeginRead(xlogreader, LastRec);
    record = ReadRecord(xlogreader, PANIC, false, replayTLI);
    EndOfLog = EndRecPtr;

In every case that exists in our regression tests, EndRecPtr is the same before these three lines of code as it is afterward. However, if you test with recovery_target=immediate, you can get it to be different, because in that case we drop out of the redo loop after calling recoveryStopsBefore() rather than after calling recoveryStopsAfter(). Similarly I'm fairly sure that if you use recovery_target_inclusive=off you can likewise get it to be different (though I discovered the hard way that recovery_target_inclusive=off is ignored when you use recovery_target_name). It seems like a really bad thing that neither recovery_target=immediate nor recovery_target_inclusive=off has any tests, and I think we ought to add some.

Anyway, in effect, these three lines of code have the effect of backing up the xlogreader by one record when we stop before rather than after a record that we're replaying. What that means is that EndOfLog is going to be the end+1 of the last record that we actually replayed. There might be one more record that we read but did not replay, and that record won't impact the value we end up with in EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last record that we actually replayed. To put that another way, there's no way to exit the main redo loop after we set XLogCtl->replayEndRecPtr and before we change LastRec. So in the cases where XLogCtl->replayEndRecPtr gets initialized at all, it can only be different from EndOfLog if something different happens when we re-read the last-replayed WAL record than what happened when we read it the first time. That seems unlikely, and would be unfortunate if it did happen. I am inclined to think that it might be better not to reread the record at all, though. As far as this patch goes, I think we need a solution that doesn't involve fetching EndOfLog from a variable that's only sometimes initialized and then not doing anything with it except in the cases where it was initialized.

As for EndOfLogTLI, I'm afraid I don't think that's the same thing as XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't think the regression tests contain any scenarios where we run recovery and the values end up different. However, I think that the code sets EndOfLogTLI to the TLI of the last WAL file that we read, and I think XLogCtl->replayEndTLI gets set to the timeline from which that WAL record originated. So imagine that we are looking for WAL that ought to be in 000000010000000000000003 but we don't find it; instead we find 000000020000000000000003 because our recovery target timeline is 2, or something that has 2 in its history. We will read the WAL for timeline 1 from this file which has timeline 2 in the file name. I think if recovery ends in this file before the timeline switch, these values will be different. I did not try to construct a test case for this today due to not having enough time, so it's possible that I'm wrong about this, but that's how it looks to me from the code.

-- Robert Haas EDB: http://www.enterprisedb.com
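The "backing up by one record" effect described above can be reproduced with a toy model. This is only a simulation under simplifying assumptions: records are plain (start, end) pairs, at least one record is replayed, and simulate_redo() with its Outcome struct is a hypothetical stand-in for the redo loop plus the three re-read lines, not PostgreSQL code.

```c
#include <assert.h>
#include <stddef.h>

/* A WAL record reduced to its start position and end+1 position. */
typedef struct Rec
{
    unsigned start;
    unsigned end;
} Rec;

typedef struct Outcome
{
    unsigned end_rec_ptr_before_reread; /* reader position on leaving the loop */
    unsigned end_of_log;                /* reader position after re-reading LastRec */
    unsigned replay_end_rec_ptr;        /* end+1 of the last replayed record */
} Outcome;

/*
 * Toy redo loop. With stop_before set, the record at stop_index is read but
 * not replayed; otherwise it is replayed and the loop stops right after it.
 * Assumes at least one record gets replayed.
 */
Outcome
simulate_redo(const Rec *recs, size_t nrecs, size_t stop_index, int stop_before)
{
    Outcome  out;
    unsigned end_rec_ptr = 0;
    unsigned replay_end = 0;
    size_t   last_rec = 0;
    size_t   i;

    for (i = 0; i < nrecs; i++)
    {
        end_rec_ptr = recs[i].end;      /* reading a record advances the reader */
        if (stop_before && i == stop_index)
            break;                      /* read, but never replayed */
        replay_end = recs[i].end;       /* "replay" the record */
        last_rec = i;
        if (!stop_before && i == stop_index)
            break;
    }

    out.end_rec_ptr_before_reread = end_rec_ptr;
    out.end_of_log = recs[last_rec].end;    /* effect of re-reading LastRec */
    out.replay_end_rec_ptr = replay_end;
    return out;
}
```

In this model the re-read only matters when the loop stops before a record: the reader has then moved one record too far, while the replayed end position already holds the value that ends up in EndOfLog.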
On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote: > > Attached is the rebased version of refactoring as well as the > > pg_prohibit_wal feature patches for the latest master head (commit # > > 39a3105678a). > > I spent a lot of time today studying 0002, and specifically the > question of whether EndOfLog must be the same as > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > XLogCtl->replayEndTLI. > > The answer to the former question is "no" because, if we don't enter > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > enter redo, then I think it has to be the same unless something very > weird happens. EndOfLog gets set like this: > > XLogBeginRead(xlogreader, LastRec); > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > EndOfLog = EndRecPtr; > > In every case that exists in our regression tests, EndRecPtr is the > same before these three lines of code as it is afterward. However, if > you test with recovery_target=immediate, you can get it to be > different, because in that case we drop out of the redo loop after > calling recoveryStopsBefore() rather than after calling > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > recovery_target_inclusive=off you can likewise get it to be different > (though I discovered the hard way that recovery_target_inclusive=off > is ignored when you use recovery_target_name). It seems like a really > bad thing that neither recovery_target=immediate nor > recovery_target_inclusive=off have any tests, and I think we ought to > add some. > recovery/t/003_recovery_targets.pl has test for recovery_target=immediate but not for recovery_target_inclusive=off, we can add that for recovery_target_lsn, recovery_target_time, and recovery_target_xid case only where it affects. 
> Anyway, in effect, these three lines of code have the effect of backing up the xlogreader by one record when we stop before rather than after a record that we're replaying. What that means is that EndOfLog is going to be the end+1 of the last record that we actually replayed. There might be one more record that we read but did not replay, and that record won't impact the value we end up with in EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last record that we actually replayed. To put that another way, there's no way to exit the main redo loop after we set XLogCtl->replayEndRecPtr and before we change LastRec. So in the cases where XLogCtl->replayEndRecPtr gets initialized at all, it can only be different from EndOfLog if something different happens when we re-read the last-replayed WAL record than what happened when we read it the first time. That seems unlikely, and would be unfortunate it if it did happen. I am inclined to think that it might be better not to reread the record at all, though.

There are two reasons the record is reread: first, the one you have just explained, where the redo loop drops out due to recoveryStopsBefore(); and second, that InRecovery is false.

In the former case, the redo while-loop reads a new record at the end, which in effect updates EndRecPtr; when we break out of the loop, we reach the place where we reread the record. There we read the record (i.e. LastRec) that precedes the one the redo loop last read, which correctly sets EndRecPtr. In the latter case, we definitely don't need any adjustment to EndRecPtr.

So technically one case needs the reread, but even that is not needed, because we have that value in XLogCtl->lastReplayedEndRecPtr.
I do agree that we do not need to reread the record, but EndOfLog and EndOfLogTLI should be set conditionally, something like:

    if (InRecovery)
    {
        EndOfLog = XLogCtl->lastReplayedEndRecPtr;
        EndOfLogTLI = XLogCtl->lastReplayedTLI;
    }
    else
    {
        EndOfLog = EndRecPtr;
        EndOfLogTLI = replayTLI;
    }

> As far as this patch goes, I think we need a solution that doesn't involve fetching EndOfLog from a variable that's only sometimes initialized and then not doing anything with it except in the cases where it was initialized.
>

Another reason could be that EndOfLog changes further in the following case:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

Now the only solution that I can think of is to copy EndOfLog (and so EndOfLogTLI) into shared memory.

> As for EndOfLogTLI, I'm afraid I don't think that's the same thing as XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't think the regression tests contain any scenarios where we run recovery and the values end up different. However, I think that the code sets EndOfLogTLI to the TLI of the last WAL file that we read, and I think XLogCtl->replayEndTLI gets set to the timeline from which that WAL record originated. So imagine that we are looking for WAL that ought to be in 000000010000000000000003 but we don't find it; instead we find 000000020000000000000003 because our recovery target timeline is 2, or something that has 2 in its history. We will read the WAL for timeline 1 from this file which has timeline 2 in the file name. I think if recovery ends in this file before the timeline switch, these values will be different.
> I did not try to construct a test case for this today due to not having enough time, so it's possible that I'm wrong about this, but that's how it looks to me from the code.
>

I am not sure I have understood this scenario, due to my lack of expertise in this area. Why would we fail to find the record that ought to be in 000000010000000000000003? Possibly WAL corruption, or that file is missing?

Regards,
Amul
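The two file names in this exchange differ only in their first eight hex digits, which carry the timeline ID. A small sketch of the naming scheme follows, assuming the default 16 MB segment size and the usual three-field "%08X%08X%08X" layout; wal_file_name(), tli_from_wal_file_name() and same_segment() are hypothetical helpers written for this illustration, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build a 24-character WAL segment file name from a timeline and segment number. */
void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno,
              uint64_t seg_size)
{
    /* Number of segments covered by one value of the middle field. */
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}

/* The timeline in the file name is just the first eight hex digits. */
uint32_t
tli_from_wal_file_name(const char *fname)
{
    unsigned tli = 0;

    if (sscanf(fname, "%8X", &tli) != 1)
        return 0;
    return (uint32_t) tli;
}

/* Two names cover the same stretch of WAL if everything after the TLI matches. */
int
same_segment(const char *a, const char *b)
{
    return strcmp(a + 8, b + 8) == 0;
}
```

So 000000010000000000000003 and 000000020000000000000003 name the same segment on timelines 1 and 2. The timeline in the file name says nothing by itself about which timeline an individual record inside the file was written on, which is the distinction being discussed above.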
On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote: > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote: > > > Attached is the rebased version of refactoring as well as the > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > 39a3105678a). > > > > I spent a lot of time today studying 0002, and specifically the > > question of whether EndOfLog must be the same as > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > XLogCtl->replayEndTLI. > > > > The answer to the former question is "no" because, if we don't enter > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > enter redo, then I think it has to be the same unless something very > > weird happens. EndOfLog gets set like this: > > > > XLogBeginRead(xlogreader, LastRec); > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > EndOfLog = EndRecPtr; > > > > In every case that exists in our regression tests, EndRecPtr is the > > same before these three lines of code as it is afterward. However, if > > you test with recovery_target=immediate, you can get it to be > > different, because in that case we drop out of the redo loop after > > calling recoveryStopsBefore() rather than after calling > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > recovery_target_inclusive=off you can likewise get it to be different > > (though I discovered the hard way that recovery_target_inclusive=off > > is ignored when you use recovery_target_name). It seems like a really > > bad thing that neither recovery_target=immediate nor > > recovery_target_inclusive=off have any tests, and I think we ought to > > add some. 
> > > > recovery/t/003_recovery_targets.pl has test for > recovery_target=immediate but not for recovery_target_inclusive=off, we > can add that for recovery_target_lsn, recovery_target_time, and > recovery_target_xid case only where it affects. > > > Anyway, in effect, these three lines of code have the effect of > > backing up the xlogreader by one record when we stop before rather > > than after a record that we're replaying. What that means is that > > EndOfLog is going to be the end+1 of the last record that we actually > > replayed. There might be one more record that we read but did not > > replay, and that record won't impact the value we end up with in > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > record that we actually replayed. To put that another way, there's no > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > and before we change LastRec. So in the cases where > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > different from EndOfLog if something different happens when we re-read > > the last-replayed WAL record than what happened when we read it the > > first time. That seems unlikely, and would be unfortunate it if it did > > happen. I am inclined to think that it might be better not to reread > > the record at all, though. > > There are two reasons that the record is reread; first, one that you > have just explained where the redo loop drops out due to > recoveryStopsBefore() and another one is that InRecovery is false. > > In the formal case at the end, redo while-loop does read a new record > which in effect updates EndRecPtr and when we breaks the loop, we do > reach the place where we do reread record -- where we do read the > record (i.e. LastRec) before the record that redo loop has read and > which correctly sets EndRecPtr. In the latter case, definitely, we > don't need any adjustment to EndRecPtr. 
> > So technically one case needs reread but that is also not needed, we > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > do not need to reread the record, but EndOfLog and EndOfLogTLI should > be set conditionally something like: > > if (InRecovery) > { > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > EndOfLogTLI = XLogCtl->lastReplayedTLI; > } > else > { > EndOfLog = EndRecPtr; > EndOfLogTLI = replayTLI; > } > > > As far as this patch goes, I think we need > > a solution that doesn't involve fetching EndOfLog from a variable > > that's only sometimes initialized and then not doing anything with it > > except in the cases where it was initialized. > > > > Another reason could be EndOfLog changes further in the following case: > > /* > * Actually, if WAL ended in an incomplete record, skip the parts that > * made it through and start writing after the portion that persisted. > * (It's critical to first write an OVERWRITE_CONTRECORD message, which > * we'll do as soon as we're open for writing new WAL.) > */ > if (!XLogRecPtrIsInvalid(missingContrecPtr)) > { > Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); > EndOfLog = missingContrecPtr; > } > > Now only solution that I can think is to copy EndOfLog (so > EndOfLogTLI) into shared memory. > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't > > think the regression tests contain any scenarios where we run recovery > > and the values end up different. However, I think that the code sets > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL > > record originated. So imagine that we are looking for WAL that ought > > to be in 000000010000000000000003 but we don't find it; instead we > > find 000000020000000000000003 because our recovery target timeline is > > 2, or something that has 2 in its history. 
We will read the WAL for
> > timeline 1 from this file which has timeline 2 in the file name. I
> > think if recovery ends in this file before the timeline switch, these
> > values will be different. I did not try to construct a test case for
> > this today due to not having enough time, so it's possible that I'm
> > wrong about this, but that's how it looks to me from the code.
> >
>
> I am not sure, I have understood this scenario due to lack of
> expertise in this area -- Why would the record we looking that ought
> to be in 000000010000000000000003 we don't find it? Possibly WAL
> corruption or that file is missing?
>

On further study of XLogPageRead(), WaitForWALToBecomeAvailable(), and
XLogFileReadAnyTLI(), I think I can see how there could be a case where
the record we are looking for belongs to TLI 1 but we open a file with
TLI 2 in its name. But I am wondering what is wrong with saying that the
record's TLI is 1 even if we read it from a file that has TLI 2, 3, or 4
in its file name -- that statement is still true, and the record should
still be accessible from the file name with TLI 1. Also, if we are going
to treat this record, which lies before the timeline switch point, as
the EndOfLog, then why should we worry about the later timeline switch,
when everything after EndOfLog is going to be useless to us anyway? We
might continue by switching TLI and/or writing WAL right after EndOfLog.
Correct me if I am missing something here.

Further, I still think replayEndTLI is set to the value we want for
EndOfLogTLI when we go through the redo loop. When the loop reads a
record and finds a change in the current replayTLI, it updates it as:

    if (newReplayTLI != replayTLI)
    {
        /* Check that it's OK to switch to this TLI */
        checkTimeLineSwitch(EndRecPtr, newReplayTLI,
                            prevReplayTLI, replayTLI);

        /* Following WAL records should be run with new TLI */
        replayTLI = newReplayTLI;
        switchedTLI = true;
    }

Then replayEndTLI gets updated. If we are going to skip the reread of
"LastRec" that we were discussing, then I think the following code that
fetches EndOfLogTLI is also not needed; XLogCtl->replayEndTLI (or
XLogCtl->lastReplayedTLI), or replayTLI when InRecovery is false, should
be enough, AFAICU.

    /*
     * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
     * the end-of-log. It could be different from the timeline that EndOfLog
     * nominally belongs to, if there was a timeline switch in that segment,
     * and we were reading the old WAL from a segment belonging to a higher
     * timeline.
     */
    EndOfLogTLI = xlogreader->seg.ws_tli;

Regards,
Amul
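For readers following the file names in this discussion: a WAL segment file name is 24 hex digits, 8 for the timeline ID, 8 for the high part of the segment number, and 8 for the low part. The C sketch below assumes the default 16 MB segment size, and the helper names wal_file_name and wal_file_parse are hypothetical (they are not PostgreSQL functions). It only illustrates that the TLI in a file name is a property of the file, separate from the timeline that an individual record inside the file belongs to.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: the default 16 MB WAL segment size. */
#define WAL_SEGMENT_SIZE    (16u * 1024 * 1024)
/* Segments addressed by one 32-bit "log" ID (256 for 16 MB segments). */
#define SEGMENTS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)
#define WAL_FNAME_LEN       24

/* Hypothetical helper: build the 24-hex-digit segment file name. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buflen > WAL_FNAME_LEN);
    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / SEGMENTS_PER_XLOGID),
             (unsigned) (segno % SEGMENTS_PER_XLOGID));
}

/* Hypothetical helper: parse a name back; returns 1 on success, else 0. */
static int
wal_file_parse(const char *fname, uint32_t *tli, uint64_t *segno)
{
    unsigned t, log, seg;

    if (strlen(fname) < WAL_FNAME_LEN)
        return 0;
    if (sscanf(fname, "%8X%8X%8X", &t, &log, &seg) != 3)
        return 0;
    *tli = t;
    *segno = (uint64_t) log * SEGMENTS_PER_XLOGID + seg;
    return 1;
}
```

With these helpers, timeline 2 and segment number 3 format as 000000020000000000000003, and parsing 00000005000000000000001F gives timeline 5 and segment number 31 (0x1F). Nothing in the name says which timeline each record in the file was written on.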
On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote: > > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote: > > > > Attached is the rebased version of refactoring as well as the > > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > > 39a3105678a). > > > > > > I spent a lot of time today studying 0002, and specifically the > > > question of whether EndOfLog must be the same as > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > > XLogCtl->replayEndTLI. > > > > > > The answer to the former question is "no" because, if we don't enter > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > > enter redo, then I think it has to be the same unless something very > > > weird happens. EndOfLog gets set like this: > > > > > > XLogBeginRead(xlogreader, LastRec); > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > > EndOfLog = EndRecPtr; > > > > > > In every case that exists in our regression tests, EndRecPtr is the > > > same before these three lines of code as it is afterward. However, if > > > you test with recovery_target=immediate, you can get it to be > > > different, because in that case we drop out of the redo loop after > > > calling recoveryStopsBefore() rather than after calling > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > > recovery_target_inclusive=off you can likewise get it to be different > > > (though I discovered the hard way that recovery_target_inclusive=off > > > is ignored when you use recovery_target_name). It seems like a really > > > bad thing that neither recovery_target=immediate nor > > > recovery_target_inclusive=off have any tests, and I think we ought to > > > add some. 
> > > > > > > recovery/t/003_recovery_targets.pl has test for > > recovery_target=immediate but not for recovery_target_inclusive=off, we > > can add that for recovery_target_lsn, recovery_target_time, and > > recovery_target_xid case only where it affects. > > > > > Anyway, in effect, these three lines of code have the effect of > > > backing up the xlogreader by one record when we stop before rather > > > than after a record that we're replaying. What that means is that > > > EndOfLog is going to be the end+1 of the last record that we actually > > > replayed. There might be one more record that we read but did not > > > replay, and that record won't impact the value we end up with in > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > > record that we actually replayed. To put that another way, there's no > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > > and before we change LastRec. So in the cases where > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > > different from EndOfLog if something different happens when we re-read > > > the last-replayed WAL record than what happened when we read it the > > > first time. That seems unlikely, and would be unfortunate it if it did > > > happen. I am inclined to think that it might be better not to reread > > > the record at all, though. > > > > There are two reasons that the record is reread; first, one that you > > have just explained where the redo loop drops out due to > > recoveryStopsBefore() and another one is that InRecovery is false. > > > > In the formal case at the end, redo while-loop does read a new record > > which in effect updates EndRecPtr and when we breaks the loop, we do > > reach the place where we do reread record -- where we do read the > > record (i.e. LastRec) before the record that redo loop has read and > > which correctly sets EndRecPtr. 
In the latter case, definitely, we > > don't need any adjustment to EndRecPtr. > > > > So technically one case needs reread but that is also not needed, we > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > > do not need to reread the record, but EndOfLog and EndOfLogTLI should > > be set conditionally something like: > > > > if (InRecovery) > > { > > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > > EndOfLogTLI = XLogCtl->lastReplayedTLI; > > } > > else > > { > > EndOfLog = EndRecPtr; > > EndOfLogTLI = replayTLI; > > } > > > > > As far as this patch goes, I think we need > > > a solution that doesn't involve fetching EndOfLog from a variable > > > that's only sometimes initialized and then not doing anything with it > > > except in the cases where it was initialized. > > > > > > > Another reason could be EndOfLog changes further in the following case: > > > > /* > > * Actually, if WAL ended in an incomplete record, skip the parts that > > * made it through and start writing after the portion that persisted. > > * (It's critical to first write an OVERWRITE_CONTRECORD message, which > > * we'll do as soon as we're open for writing new WAL.) > > */ > > if (!XLogRecPtrIsInvalid(missingContrecPtr)) > > { > > Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); > > EndOfLog = missingContrecPtr; > > } > > > > Now only solution that I can think is to copy EndOfLog (so > > EndOfLogTLI) into shared memory. > > > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't > > > think the regression tests contain any scenarios where we run recovery > > > and the values end up different. However, I think that the code sets > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL > > > record originated. 
So imagine that we are looking for WAL that ought > > > to be in 000000010000000000000003 but we don't find it; instead we > > > find 000000020000000000000003 because our recovery target timeline is > > > 2, or something that has 2 in its history. We will read the WAL for > > > timeline 1 from this file which has timeline 2 in the file name. I > > > think if recovery ends in this file before the timeline switch, these > > > values will be different. I did not try to construct a test case for > > > this today due to not having enough time, so it's possible that I'm > > > wrong about this, but that's how it looks to me from the code. > > > > > > > I am not sure, I have understood this scenario due to lack of > > expertise in this area -- Why would the record we looking that ought > > to be in 000000010000000000000003 we don't find it? Possibly WAL > > corruption or that file is missing? > > > > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and > XLogFileReadAnyTLI(), I think I could make a sense that there could be > a case where the record belong to TLI 1 we are looking for; we might > open the file with TLI 2. But, I am wondering what's wrong if we say > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4 > in its file name -- that statement is still true, and that record > should be still accessible from the filename with TLI 1. Also, if we > going to consider this reading record exists before the timeline > switch point as the EndOfLog then why should be worried about the > latter timeline switch which eventually everything after the EndOfLog > going to be useless for us. We might continue switching TLI and/or > writing the WAL right after EndOfLog, correct me if I am missing > something here. > > Further, I still think replayEndTLI has set to the correct value what > we looking for EndOfLogTLI when we go through the redo loop. 
When it
> read the record and finds a change in the current replayTLI then it
> updates that as:
>
>     if (newReplayTLI != replayTLI)
>     {
>         /* Check that it's OK to switch to this TLI */
>         checkTimeLineSwitch(EndRecPtr, newReplayTLI,
>                             prevReplayTLI, replayTLI);
>
>         /* Following WAL records should be run with new TLI */
>         replayTLI = newReplayTLI;
>         switchedTLI = true;
>     }
>
> Then replayEndTLI gets updated. If we going to skip the reread of
> "LastRec" that we were discussing, then I think the following code
> that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> should be enough, AFAICU.
>
>     /*
>      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>      * the end-of-log. It could be different from the timeline that EndOfLog
>      * nominally belongs to, if there was a timeline switch in that segment,
>      * and we were reading the old WAL from a segment belonging to a higher
>      * timeline.
>      */
>     EndOfLogTLI = xlogreader->seg.ws_tli;
>

I think I found the right case for this: the TLI fetch above is needed
when we restore from archived WAL files. In my trial, the archive
directory has the files below (kindly ignore the extra history files; I
performed a few more trials to be sure):

-rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
-rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
-rw-------. 1 amul amul      128 Nov 17 06:36 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
-rw-------. 1 amul amul      171 Nov 17 06:39 00000005.history
-rw-------. 1 amul amul      209 Nov 17 06:45 00000006.history
-rw-------. 1 amul amul      247 Nov 17 06:52 00000007.history

The timeline is switched in the 1F file, but the archiver has backed up
the older timeline's file and renamed it. While performing PITR using
these archived files, the .partial file seems to be skipped during the
restore. The file with the next timeline ID is selected to read the
records that belong to the previous timeline ID as well (i.e. 4 here,
all the records before the timeline switch point). Here are the files
inside the pg_wal directory after the restore; note that in the current
experiment I chose recovery_target_xid = <just before the timeline#5
switch point> and then recovery_target_action = 'promote':

-rw-------. 1 amul amul       85 Nov 17 07:33 00000003.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
-rw-------. 1 amul amul      128 Nov 17 07:33 00000004.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
-rw-------. 1 amul amul      171 Nov 17 07:33 00000005.history
-rw-------. 1 amul amul      209 Nov 17 07:33 00000006.history
-rw-------. 1 amul amul      247 Nov 17 07:33 00000007.history
-rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F

The last one is the new WAL file created in that cluster.

Regards,
Amul
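The file selection seen in this experiment can be modelled with a small C sketch. It is a deliberately simplified toy, not PostgreSQL's XLogFileReadAnyTLI(): it assumes the candidate timelines are tried newest first, that only an exact file-name match counts (so a ".partial" file is never picked), and that the segment lies in log ID 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: is `name` among the available segment files? */
static int
file_exists(const char *const *files, size_t nfiles, const char *name)
{
    for (size_t i = 0; i < nfiles; i++)
        if (strcmp(files[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Toy model: walk the timelines from newest to oldest and return the TLI
 * in the name of the first segment file that exists, or 0 if none does.
 * Assumption: the segment is in log ID 0, so the middle field is zero.
 */
static uint32_t
pick_segment_tli(const char *const *files, size_t nfiles,
                 const uint32_t *tlis, size_t ntlis, uint32_t seg)
{
    char name[32];

    assert(files != NULL && tlis != NULL);
    for (size_t i = 0; i < ntlis; i++)
    {
        snprintf(name, sizeof name, "%08X%08X%08X",
                 (unsigned) tlis[i], 0u, (unsigned) seg);
        if (file_exists(files, nfiles, name))
            return tlis[i];
    }
    return 0;
}
```

Given the archive contents listed above and the timelines 7, 6, 5, 4 tried in that order, this model picks the timeline-5 file for segment 1F and the timeline-4 file for segment 1E. That matches the observation that records written on timeline 4 are read from a file with 5 in its name.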
On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote: > > > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote: > > > > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote: > > > > > Attached is the rebased version of refactoring as well as the > > > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > > > 39a3105678a). > > > > > > > > I spent a lot of time today studying 0002, and specifically the > > > > question of whether EndOfLog must be the same as > > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > > > XLogCtl->replayEndTLI. > > > > > > > > The answer to the former question is "no" because, if we don't enter > > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > > > enter redo, then I think it has to be the same unless something very > > > > weird happens. EndOfLog gets set like this: > > > > > > > > XLogBeginRead(xlogreader, LastRec); > > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > > > EndOfLog = EndRecPtr; > > > > > > > > In every case that exists in our regression tests, EndRecPtr is the > > > > same before these three lines of code as it is afterward. However, if > > > > you test with recovery_target=immediate, you can get it to be > > > > different, because in that case we drop out of the redo loop after > > > > calling recoveryStopsBefore() rather than after calling > > > > recoveryStopsAfter(). Similarly I'm fairly sure that if you use > > > > recovery_target_inclusive=off you can likewise get it to be different > > > > (though I discovered the hard way that recovery_target_inclusive=off > > > > is ignored when you use recovery_target_name). 
It seems like a really > > > > bad thing that neither recovery_target=immediate nor > > > > recovery_target_inclusive=off have any tests, and I think we ought to > > > > add some. > > > > > > > > > > recovery/t/003_recovery_targets.pl has test for > > > recovery_target=immediate but not for recovery_target_inclusive=off, we > > > can add that for recovery_target_lsn, recovery_target_time, and > > > recovery_target_xid case only where it affects. > > > > > > > Anyway, in effect, these three lines of code have the effect of > > > > backing up the xlogreader by one record when we stop before rather > > > > than after a record that we're replaying. What that means is that > > > > EndOfLog is going to be the end+1 of the last record that we actually > > > > replayed. There might be one more record that we read but did not > > > > replay, and that record won't impact the value we end up with in > > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last > > > > record that we actually replayed. To put that another way, there's no > > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr > > > > and before we change LastRec. So in the cases where > > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be > > > > different from EndOfLog if something different happens when we re-read > > > > the last-replayed WAL record than what happened when we read it the > > > > first time. That seems unlikely, and would be unfortunate it if it did > > > > happen. I am inclined to think that it might be better not to reread > > > > the record at all, though. > > > > > > There are two reasons that the record is reread; first, one that you > > > have just explained where the redo loop drops out due to > > > recoveryStopsBefore() and another one is that InRecovery is false. 
> > > > > > In the formal case at the end, redo while-loop does read a new record > > > which in effect updates EndRecPtr and when we breaks the loop, we do > > > reach the place where we do reread record -- where we do read the > > > record (i.e. LastRec) before the record that redo loop has read and > > > which correctly sets EndRecPtr. In the latter case, definitely, we > > > don't need any adjustment to EndRecPtr. > > > > > > So technically one case needs reread but that is also not needed, we > > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we > > > do not need to reread the record, but EndOfLog and EndOfLogTLI should > > > be set conditionally something like: > > > > > > if (InRecovery) > > > { > > > EndOfLog = XLogCtl->lastReplayedEndRecPtr; > > > EndOfLogTLI = XLogCtl->lastReplayedTLI; > > > } > > > else > > > { > > > EndOfLog = EndRecPtr; > > > EndOfLogTLI = replayTLI; > > > } > > > > > > > As far as this patch goes, I think we need > > > > a solution that doesn't involve fetching EndOfLog from a variable > > > > that's only sometimes initialized and then not doing anything with it > > > > except in the cases where it was initialized. > > > > > > > > > > Another reason could be EndOfLog changes further in the following case: > > > > > > /* > > > * Actually, if WAL ended in an incomplete record, skip the parts that > > > * made it through and start writing after the portion that persisted. > > > * (It's critical to first write an OVERWRITE_CONTRECORD message, which > > > * we'll do as soon as we're open for writing new WAL.) > > > */ > > > if (!XLogRecPtrIsInvalid(missingContrecPtr)) > > > { > > > Assert(!XLogRecPtrIsInvalid(abortedRecPtr)); > > > EndOfLog = missingContrecPtr; > > > } > > > > > > Now only solution that I can think is to copy EndOfLog (so > > > EndOfLogTLI) into shared memory. > > > > > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as > > > > XLogCtl->replayEndTLI. 
Now, it's hard to be sure, because I don't > > > > think the regression tests contain any scenarios where we run recovery > > > > and the values end up different. However, I think that the code sets > > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think > > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL > > > > record originated. So imagine that we are looking for WAL that ought > > > > to be in 000000010000000000000003 but we don't find it; instead we > > > > find 000000020000000000000003 because our recovery target timeline is > > > > 2, or something that has 2 in its history. We will read the WAL for > > > > timeline 1 from this file which has timeline 2 in the file name. I > > > > think if recovery ends in this file before the timeline switch, these > > > > values will be different. I did not try to construct a test case for > > > > this today due to not having enough time, so it's possible that I'm > > > > wrong about this, but that's how it looks to me from the code. > > > > > > > > > > I am not sure, I have understood this scenario due to lack of > > > expertise in this area -- Why would the record we looking that ought > > > to be in 000000010000000000000003 we don't find it? Possibly WAL > > > corruption or that file is missing? > > > > > > > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and > > XLogFileReadAnyTLI(), I think I could make a sense that there could be > > a case where the record belong to TLI 1 we are looking for; we might > > open the file with TLI 2. But, I am wondering what's wrong if we say > > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4 > > in its file name -- that statement is still true, and that record > > should be still accessible from the filename with TLI 1. 
Also, if we > > going to consider this reading record exists before the timeline > > switch point as the EndOfLog then why should be worried about the > > latter timeline switch which eventually everything after the EndOfLog > > going to be useless for us. We might continue switching TLI and/or > > writing the WAL right after EndOfLog, correct me if I am missing > > something here. > > > > Further, I still think replayEndTLI has set to the correct value what > > we looking for EndOfLogTLI when we go through the redo loop. When it > > read the record and finds a change in the current replayTLI then it > > updates that as: > > > > if (newReplayTLI != replayTLI) > > { > > /* Check that it's OK to switch to this TLI */ > > checkTimeLineSwitch(EndRecPtr, newReplayTLI, > > prevReplayTLI, replayTLI); > > > > /* Following WAL records should be run with new TLI */ > > replayTLI = newReplayTLI; > > switchedTLI = true; > > } > > > > Then replayEndTLI gets updated. If we going to skip the reread of > > "LastRec" that we were discussing, then I think the following code > > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI > > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false) > > should be enough, AFAICU. > > > > /* > > * EndOfLogTLI is the TLI in the filename of the XLOG segment containing > > * the end-of-log. It could be different from the timeline that EndOfLog > > * nominally belongs to, if there was a timeline switch in that segment, > > * and we were reading the old WAL from a segment belonging to a higher > > * timeline. > > */ > > EndOfLogTLI = xlogreader->seg.ws_tli; > > > > I think I found the right case for this, above TLI fetch is needed in > the case where we do restore from the archived WAL files. In my trial, > the archive directory has files as below (Kindly ignore the extra > history file, I perform a few more trials to be sure): > > -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E > -rw-------. 
1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> -rw-------. 1 amul amul 128 Nov 17 06:36 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> -rw-------. 1 amul amul 171 Nov 17 06:39 00000005.history
> -rw-------. 1 amul amul 209 Nov 17 06:45 00000006.history
> -rw-------. 1 amul amul 247 Nov 17 06:52 00000007.history
>
> The timeline is switched in 1F file but the archiver has backup older
> timeline file and renamed it. While performing PITR using these
> archived files, the .partitial file seems to be skipped from the
> restore. The file with the next timeline id is selected to read the
> records that belong to the previous timeline id as well (i.e. 4 here,
> all the records before timeline switch point). Here is the files
> inside pg_wal directory after restore, note that in the current
> experiment, I chose recovery_target_xid = <just before the timeline#5
> switch point > and then recovery_target_action = 'promote':
>
> -rw-------. 1 amul amul 85 Nov 17 07:33 00000003.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> -rw-------. 1 amul amul 128 Nov 17 07:33 00000004.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> -rw-------. 1 amul amul 171 Nov 17 07:33 00000005.history
> -rw-------. 1 amul amul 209 Nov 17 07:33 00000006.history
> -rw-------. 1 amul amul 247 Nov 17 07:33 00000007.history
> -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
>
> The last one is the new WAL file created in that cluster.
>

With this experiment, I think it is clear that EndOfLogTLI can be
different from replayEndTLI or lastReplayedTLI, and we have no way to
make it available to other processes other than exporting it into
shared memory. Similarly, we have a bunch of options (e.g.
replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN, etc.) for
getting the EndOfLog value, but none of them is perfect or reliable.

Therefore, in the attached patch, I have exported EndOfLog and
EndOfLogTLI into shared memory. I have attached only the refactoring
patches, since a bunch of other work needs to be done on the main ASRO
patches, as I discussed with Robert off-list. Thanks.

Regards,
Amul
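The idea of computing the final values once and publishing them for other processes can be sketched as below. This is a toy, single-process model and not the attached patch: the struct SharedRecoveryEnd and the function publish_end_of_log are hypothetical names, the typedefs are stand-ins for PostgreSQL's own types, and the locking that real shared-memory access would need is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL types of the same names. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical shared area; in PostgreSQL this role is played by XLogCtl. */
typedef struct SharedRecoveryEnd
{
    XLogRecPtr endOfLog;     /* where new WAL will start */
    TimeLineID endOfLogTLI;  /* TLI in the name of the segment holding it */
} SharedRecoveryEnd;

/*
 * Compute the final end-of-log once and store it where others can read it.
 * If WAL ended in an incomplete record (missingContrecPtr is valid), that
 * pointer overrides the end of the last record read.
 */
static void
publish_end_of_log(SharedRecoveryEnd *shared, XLogRecPtr endRecPtr,
                   XLogRecPtr missingContrecPtr, TimeLineID segFileTLI)
{
    XLogRecPtr endOfLog = endRecPtr;

    assert(shared != NULL);
    if (missingContrecPtr != InvalidXLogRecPtr)
        endOfLog = missingContrecPtr;

    shared->endOfLog = endOfLog;
    shared->endOfLogTLI = segFileTLI;
}
```

The point of the sketch is that a reader of the shared struct sees the value after the missingContrecPtr adjustment and the file-name TLI, neither of which can be recovered from the replay position fields alone.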
Attachment
On Tue, Nov 23, 2021 at 7:23 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Nov 17, 2021 at 6:20 PM Amul Sul <sulamul@gmail.com> wrote: > > > > On Wed, Nov 17, 2021 at 4:07 PM Amul Sul <sulamul@gmail.com> wrote: > > > > > > On Wed, Nov 17, 2021 at 11:13 AM Amul Sul <sulamul@gmail.com> wrote: > > > > > > > > On Sat, Nov 13, 2021 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > > > > > On Mon, Nov 8, 2021 at 8:20 AM Amul Sul <sulamul@gmail.com> wrote: > > > > > > Attached is the rebased version of refactoring as well as the > > > > > > pg_prohibit_wal feature patches for the latest master head (commit # > > > > > > 39a3105678a). > > > > > > > > > > I spent a lot of time today studying 0002, and specifically the > > > > > question of whether EndOfLog must be the same as > > > > > XLogCtl->replayEndRecPtr and whether EndOfLogTLI must be the same as > > > > > XLogCtl->replayEndTLI. > > > > > > > > > > The answer to the former question is "no" because, if we don't enter > > > > > redo, XLogCtl->replayEndRecPtr won't be initialized at all. If we do > > > > > enter redo, then I think it has to be the same unless something very > > > > > weird happens. EndOfLog gets set like this: > > > > > > > > > > XLogBeginRead(xlogreader, LastRec); > > > > > record = ReadRecord(xlogreader, PANIC, false, replayTLI); > > > > > EndOfLog = EndRecPtr; > > > > > > > > > > In every case that exists in our regression tests, EndRecPtr is the > > > > > same before these three lines of code as it is afterward. However, if > > > > > you test with recovery_target=immediate, you can get it to be > > > > > different, because in that case we drop out of the redo loop after > > > > > calling recoveryStopsBefore() rather than after calling > > > > > recoveryStopsAfter(). 
Similarly I'm fairly sure that if you use
> > > > > recovery_target_inclusive=off you can likewise get it to be different
> > > > > (though I discovered the hard way that recovery_target_inclusive=off
> > > > > is ignored when you use recovery_target_name). It seems like a really
> > > > > bad thing that neither recovery_target=immediate nor
> > > > > recovery_target_inclusive=off have any tests, and I think we ought to
> > > > > add some.
> > > > >
> > > >
> > > > recovery/t/003_recovery_targets.pl has test for
> > > > recovery_target=immediate but not for recovery_target_inclusive=off, we
> > > > can add that for recovery_target_lsn, recovery_target_time, and
> > > > recovery_target_xid case only where it affects.
> > > >
> > > > > Anyway, in effect, these three lines of code have the effect of
> > > > > backing up the xlogreader by one record when we stop before rather
> > > > > than after a record that we're replaying. What that means is that
> > > > > EndOfLog is going to be the end+1 of the last record that we actually
> > > > > replayed. There might be one more record that we read but did not
> > > > > replay, and that record won't impact the value we end up with in
> > > > > EndOfLog. Now, XLogCtl->replayEndRecPtr is also that end+1 of the last
> > > > > record that we actually replayed. To put that another way, there's no
> > > > > way to exit the main redo loop after we set XLogCtl->replayEndRecPtr
> > > > > and before we change LastRec. So in the cases where
> > > > > XLogCtl->replayEndRecPtr gets initialized at all, it can only be
> > > > > different from EndOfLog if something different happens when we re-read
> > > > > the last-replayed WAL record than what happened when we read it the
> > > > > first time. That seems unlikely, and would be unfortunate it if it did
> > > > > happen. I am inclined to think that it might be better not to reread
> > > > > the record at all, though.
> > > >
> > > > There are two reasons that the record is reread; first, one that you
> > > > have just explained where the redo loop drops out due to
> > > > recoveryStopsBefore() and another one is that InRecovery is false.
> > > >
> > > > In the formal case at the end, redo while-loop does read a new record
> > > > which in effect updates EndRecPtr and when we breaks the loop, we do
> > > > reach the place where we do reread record -- where we do read the
> > > > record (i.e. LastRec) before the record that redo loop has read and
> > > > which correctly sets EndRecPtr. In the latter case, definitely, we
> > > > don't need any adjustment to EndRecPtr.
> > > >
> > > > So technically one case needs reread but that is also not needed, we
> > > > have that value in XLogCtl->lastReplayedEndRecPtr. I do agree that we
> > > > do not need to reread the record, but EndOfLog and EndOfLogTLI should
> > > > be set conditionally something like:
> > > >
> > > > if (InRecovery)
> > > > {
> > > >     EndOfLog = XLogCtl->lastReplayedEndRecPtr;
> > > >     EndOfLogTLI = XLogCtl->lastReplayedTLI;
> > > > }
> > > > else
> > > > {
> > > >     EndOfLog = EndRecPtr;
> > > >     EndOfLogTLI = replayTLI;
> > > > }
> > > >
> > > > > As far as this patch goes, I think we need
> > > > > a solution that doesn't involve fetching EndOfLog from a variable
> > > > > that's only sometimes initialized and then not doing anything with it
> > > > > except in the cases where it was initialized.
> > > > >
> > > >
> > > > Another reason could be EndOfLog changes further in the following case:
> > > >
> > > > /*
> > > >  * Actually, if WAL ended in an incomplete record, skip the parts that
> > > >  * made it through and start writing after the portion that persisted.
> > > >  * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> > > >  * we'll do as soon as we're open for writing new WAL.)
> > > >  */
> > > > if (!XLogRecPtrIsInvalid(missingContrecPtr))
> > > > {
> > > >     Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> > > >     EndOfLog = missingContrecPtr;
> > > > }
> > > >
> > > > Now only solution that I can think is to copy EndOfLog (so
> > > > EndOfLogTLI) into shared memory.
> > > >
> > > > > As for EndOfLogTLI, I'm afraid I don't think that's the same thing as
> > > > > XLogCtl->replayEndTLI. Now, it's hard to be sure, because I don't
> > > > > think the regression tests contain any scenarios where we run recovery
> > > > > and the values end up different. However, I think that the code sets
> > > > > EndOfLogTLI to the TLI of the last WAL file that we read, and I think
> > > > > XLogCtl->replayEndTLI gets set to the timeline from which that WAL
> > > > > record originated. So imagine that we are looking for WAL that ought
> > > > > to be in 000000010000000000000003 but we don't find it; instead we
> > > > > find 000000020000000000000003 because our recovery target timeline is
> > > > > 2, or something that has 2 in its history. We will read the WAL for
> > > > > timeline 1 from this file which has timeline 2 in the file name. I
> > > > > think if recovery ends in this file before the timeline switch, these
> > > > > values will be different. I did not try to construct a test case for
> > > > > this today due to not having enough time, so it's possible that I'm
> > > > > wrong about this, but that's how it looks to me from the code.
> > > > >
> > > >
> > > > I am not sure, I have understood this scenario due to lack of
> > > > expertise in this area -- Why would the record we looking that ought
> > > > to be in 000000010000000000000003 we don't find it? Possibly WAL
> > > > corruption or that file is missing?
> > > >
> > >
> > > On further study, XLogPageRead(), WaitForWALToBecomeAvailable(), and
> > > XLogFileReadAnyTLI(), I think I could make a sense that there could be
> > > a case where the record belong to TLI 1 we are looking for; we might
> > > open the file with TLI 2. But, I am wondering what's wrong if we say
> > > that TLI 1 for that record even if we read it from the file has TLI 2 or 3 or 4
> > > in its file name -- that statement is still true, and that record
> > > should be still accessible from the filename with TLI 1. Also, if we
> > > going to consider this reading record exists before the timeline
> > > switch point as the EndOfLog then why should be worried about the
> > > latter timeline switch which eventually everything after the EndOfLog
> > > going to be useless for us. We might continue switching TLI and/or
> > > writing the WAL right after EndOfLog, correct me if I am missing
> > > something here.
> > >
> > > Further, I still think replayEndTLI has set to the correct value what
> > > we looking for EndOfLogTLI when we go through the redo loop. When it
> > > read the record and finds a change in the current replayTLI then it
> > > updates that as:
> > >
> > > if (newReplayTLI != replayTLI)
> > > {
> > >     /* Check that it's OK to switch to this TLI */
> > >     checkTimeLineSwitch(EndRecPtr, newReplayTLI,
> > >                         prevReplayTLI, replayTLI);
> > >
> > >     /* Following WAL records should be run with new TLI */
> > >     replayTLI = newReplayTLI;
> > >     switchedTLI = true;
> > > }
> > >
> > > Then replayEndTLI gets updated. If we going to skip the reread of
> > > "LastRec" that we were discussing, then I think the following code
> > > that fetches the EndOfLogTLI is also not needed, XLogCtl->replayEndTLI
> > > (or XLogCtl->lastReplayedTLI) or replayTLI (when InRecovery is false)
> > > should be enough, AFAICU.
> > >
> > > /*
> > >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> > >  * the end-of-log. It could be different from the timeline that EndOfLog
> > >  * nominally belongs to, if there was a timeline switch in that segment,
> > >  * and we were reading the old WAL from a segment belonging to a higher
> > >  * timeline.
> > >  */
> > > EndOfLogTLI = xlogreader->seg.ws_tli;
> > >
> >
> > I think I found the right case for this, above TLI fetch is needed in
> > the case where we do restore from the archived WAL files. In my trial,
> > the archive directory has files as below (Kindly ignore the extra
> > history file, I perform a few more trials to be sure):
> >
> > -rw-------. 1 amul amul 16777216 Nov 17 06:36 00000004000000000000001E
> > -rw-------. 1 amul amul 16777216 Nov 17 06:39 00000004000000000000001F.partial
> > -rw-------. 1 amul amul 128 Nov 17 06:36 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 06:40 00000005000000000000001F
> > -rw-------. 1 amul amul 171 Nov 17 06:39 00000005.history
> > -rw-------. 1 amul amul 209 Nov 17 06:45 00000006.history
> > -rw-------. 1 amul amul 247 Nov 17 06:52 00000007.history
> >
> > The timeline is switched in 1F file but the archiver has backup older
> > timeline file and renamed it. While performing PITR using these
> > archived files, the .partitial file seems to be skipped from the
> > restore. The file with the next timeline id is selected to read the
> > records that belong to the previous timeline id as well (i.e. 4 here,
> > all the records before timeline switch point). Here is the files
> > inside pg_wal directory after restore, note that in the current
> > experiment, I chose recovery_target_xid = <just before the timeline#5
> > switch point > and then recovery_target_action = 'promote':
> >
> > -rw-------. 1 amul amul 85 Nov 17 07:33 00000003.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000004000000000000001E
> > -rw-------. 1 amul amul 128 Nov 17 07:33 00000004.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000005000000000000001F
> > -rw-------. 1 amul amul 171 Nov 17 07:33 00000005.history
> > -rw-------. 1 amul amul 209 Nov 17 07:33 00000006.history
> > -rw-------. 1 amul amul 247 Nov 17 07:33 00000007.history
> > -rw-------. 1 amul amul 16777216 Nov 17 07:33 00000008000000000000001F
> >
> > The last one is the new WAL file created in that cluster.
> >
>
> With this experiment, I think it is clear that the EndOfLogTLI can be
> different from the replayEndTLI or lastReplayedTLI, and we don't have
> any other option to get that into other processes other than exporting
> into shared memory. Similarly, we have bunch of option (e.g.
> replayEndRecPtr, lastReplayedEndRecPtr, lastSegSwitchLSN etc) to get
> EndOfLog value but those are not perfect and reliable options.
>
> Therefore, in the attached patch, I have exported EndOfLog and
> EndOfLogTLI into shared memory and attached only the refactoring
> patches since there a bunch of other work needs to be done on the main
> ASRO patches what I discussed with Robert off-list, thanks.
>

Attaching the rest of the patches.

To execute XLogAcceptWrites() -> PerformRecoveryXLogAction() in the
Checkpointer process, we should ideally perform a full checkpoint, but we
can't do that with the current PerformRecoveryXLogAction(): it calls
RequestCheckpoint() with the WAIT flag, which would make the Checkpointer
process wait forever on itself to finish the requested checkpoint. Bad!

One option is to change RequestCheckpoint() so that the Checkpointer
process calls CreateCheckPoint() directly, as we do for the
!IsPostmasterEnvironment case. The problem is that XLogWrite() running
inside the Checkpointer process can then reach CreateCheckPoint() and cause
the unexpected behaviour that I have noted previously[1]. Whether the
RequestCheckpoint() call from XLogWrite() is needed at all inside the
Checkpointer process needs a separate discussion.
For now, I have changed PerformRecoveryXLogAction() to call
CreateCheckPoint() directly when it runs in the Checkpointer process. In the
v41-0003 version I tried changing RequestCheckpoint() to avoid that, but
that change looked too ugly.

Another problem is a recursive call to XLogAcceptWrites() in the
Checkpointer process, caused by the aforesaid CreateCheckPoint() call from
PerformRecoveryXLogAction(). To avoid delays in processing WAL prohibit
state change requests, we have added ProcessWALProhibitStateChangeRequest()
calls in multiple places, so that the Checkpointer can check for and process
requests while performing a long-running checkpoint. When the Checkpointer
calls CreateCheckPoint() from PerformRecoveryXLogAction(), it can also hit
ProcessWALProhibitStateChangeRequest(), and since the XLogAcceptWrites()
operation has not completed yet, it tries to do it again. To avoid that, I
have added a flag that skips ProcessWALProhibitStateChangeRequest()
execution if that flag is set; see ProcessWALProhibitStateChangeRequest() in
the attached 0003 patch.

Note that both of the issues noted above boil down to CreateCheckPoint() and
the need for it. If we don't need to perform a full checkpoint in our case,
then we might not have the recursion issue. Instead, we could do
CreateEndOfRecoveryRecord() and then the full checkpoint, as
PerformRecoveryXLogAction() currently does for the promotion case, but not
having a full checkpoint might look scary. I tried that and it works fine
for me, but I am not very confident about it.

Regards,
Amul

[1] https://postgr.es/m/CAAJ_b97fPWU_yyOg97Y5AtSvx5mrg2cGyz260swz5x5iPKEM+g@mail.gmail.com
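A minimal sketch of the re-entrancy guard described above, assuming a single process-local flag. All names below are hypothetical stand-ins and are not taken from the 0003 patch; the real ProcessWALProhibitStateChangeRequest() and XLogAcceptWrites() do far more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
static bool accept_writes_in_progress = false;  /* the guard flag */
static int  accept_writes_runs = 0;     /* times the real work was done */
static int  skipped_requests = 0;       /* requests skipped by the guard */

static void process_state_change_request(void);

/* Stand-in for a long-running checkpoint that polls for pending requests. */
static void
create_checkpoint(void)
{
    process_state_change_request();
}

/* Stand-in for XLogAcceptWrites(): must never be entered recursively. */
static void
accept_writes(void)
{
    assert(!accept_writes_in_progress);
    accept_writes_in_progress = true;
    accept_writes_runs++;
    create_checkpoint();    /* may call back into request processing */
    accept_writes_in_progress = false;
}

/* Stand-in for ProcessWALProhibitStateChangeRequest(). */
static void
process_state_change_request(void)
{
    if (accept_writes_in_progress)
    {
        /* Already inside accept_writes(); do not start it again. */
        skipped_requests++;
        return;
    }
    accept_writes();
}
```

With the flag in place, a request that arrives while the checkpoint is running is simply skipped, so the work runs once instead of recursing.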
Attachment
- v43-0006-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v43-0002-Remove-dependencies-on-startup-process-specifica.patch
- v43-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v43-0005-Documentation.patch
- v43-0003-Implement-wal-prohibit-state-using-global-barrie.patch
- v43-0001-Create-XLogAcceptWrites-function-with-code-from-.patch
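For readers following the EndOfLogTLI experiment in the message above: a WAL segment file name is 24 hexadecimal digits, and the first 8 of them are the timeline ID. The sketch below uses a hypothetical helper, wal_filename_tli() (it is not a PostgreSQL function), to pull that value out of a file name. It only illustrates why the TLI in the file name (5 for 00000005000000000000001F) can differ from the timeline of the records replayed from that file (4 in the experiment).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the timeline ID from a WAL segment file
 * name. Returns false for anything that is not exactly 24 uppercase hex
 * digits, e.g. ".partial" or ".history" files.
 */
static bool
wal_filename_tli(const char *fname, unsigned int *tli)
{
    unsigned int log;
    unsigned int seg;

    assert(fname != NULL && tli != NULL);

    if (strlen(fname) != 24 ||
        strspn(fname, "0123456789ABCDEF") != 24)
        return false;

    /* 8 digits of timeline, then two 8-digit halves of the segment number */
    return sscanf(fname, "%08X%08X%08X", tli, &log, &seg) == 3;
}
```

A caller comparing this value against the timeline of the last replayed record would see the mismatch that motivates keeping EndOfLogTLI separately.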
Attaching the later version, which has a few additional changes that decide whether the Checkpointer process should halt or not in the WAL prohibited state. Those changes are yet to be confirmed and tested thoroughly. Thanks.

Regards,
Amul
Attachment
- v44-0002-Remove-dependencies-on-startup-process-specifica.patch
- v44-0005-Documentation.patch
- v44-0006-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v44-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v44-0003-Implement-wal-prohibit-state-using-global-barrie.patch
- v44-0001-Create-XLogAcceptWrites-function-with-code-from-.patch
Attached is the rebased version for the latest master head (#891624f0ec).

The 0001 and 0002 patches have changed a bit due to the xlog.c refactoring commit (#70e81861), needing a bit more thought about copying global variables into the right shared memory structure. Also, I made some changes to the 0003 patch to avoid re-entering XLogAcceptWrites(), as suggested in an offline discussion.

Regards,
Amul
Attachment
- v45-0005-Documentation.patch
- v45-0006-Test-Few-tap-tests-for-wal-prohibited-system.patch
- v45-0002-Remove-dependencies-on-startup-process-specifica.patch
- v45-0003-Implement-wal-prohibit-state-using-global-barrie.patch
- v45-0004-Error-or-Assert-before-START_CRIT_SECTION-for-WA.patch
- v45-0001-Create-XLogAcceptWrites-function-with-code-from-.patch
On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> > It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
> >
>
> Thanks, I am getting one more failure for the vacuumlazy.c. on the
> latest master head(d75288fb27b), I fixed that in attached version.

Thanks Amul! I haven't looked at the whole thread, I may be repeating
things here, please bear with me.

1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
think the patches still have ALTER SYSTEM READ ONLY references.

2) IIUC, the idea of this patch is not to generate any new WAL when
set as default_transaction_read_only and transaction_read_only can't
guarantee that?

3) IMO, the function name pg_prohibit_wal doesn't look good where it
also allows one to set WAL writes, how about the following functions -
pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
pg_allow_wal or pg_allow_wal_{generation, inserts} without any
arguments and if needed a common function
pg_set_wal_generation_state(read-only/read-write) something like that?

4) It looks like only the checkpointer is setting the WAL prohibit
state? Is there a strong reason for that? Why can't the backend take a
lock on prohibit state in shared memory and set it and let the
checkpointer read it and block itself from writing WAL?

5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
checkpointer? Why?

6) What happens for long-running or in-progress transactions if
someone prohibits WAL in the midst of them? Do these txns fail? Or do
we say that we will allow them to run to completion? Or do we fail
those txns at commit time? One might use this feature to say not let
server go out of disk space, but if we allow in-progress txns to
generate/write WAL, then how can one achieve that with this feature?
Say, I monitor my server in such a way that at 90% of disk space,
prohibit WAL to avoid server crash. But if this feature allows
in-progress txns to generate WAL, then the server may still crash?

7) What are the other use-cases (I can think of - to avoid out of disk
crashes, block/freeze writes to database when the server is
compromised) with this feature? Any usages during/before failover,
promotion or after it?

8) Is there a strong reason that we've picked up conditional variable
wal_prohibit_cv over mutex/lock for updating WALProhibit shared
memory?

9) Any tests that you are planning to add?

Regards,
Bharath Rupireddy.
On Sat, Apr 23, 2022 at 1:34 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Mar 15, 2021 at 12:56 PM Amul Sul <sulamul@gmail.com> wrote:
> > >
> > > It is a very minor change, so I rebased the patch. Please take a look, if that works for you.
> > >
> >
> > Thanks, I am getting one more failure for the vacuumlazy.c. on the
> > latest master head(d75288fb27b), I fixed that in attached version.
>
> Thanks Amul! I haven't looked at the whole thread, I may be repeating
> things here, please bear with me.
>

Np, thanks for looking into it.

> 1) Is the pg_prohibit_wal() only user sets the wal prohibit mode? Or
> do we still allow via 'ALTER SYSTEM READ ONLY/READ WRITE'? If not, I
> think the patches still have ALTER SYSTEM READ ONLY references.

Could you please point me to what those references are? I didn't find
any in the v45 version.

> 2) IIUC, the idea of this patch is not to generate any new WAL when
> set as default_transaction_read_only and transaction_read_only can't
> guarantee that?

No. WAL writes should be disabled completely; in other words,
XLogInsert() should be restricted.

> 3) IMO, the function name pg_prohibit_wal doesn't look good where it
> also allows one to set WAL writes, how about the following functions -
> pg_prohibit_wal or pg_disallow_wal_{generation, inserts} or
> pg_allow_wal or pg_allow_wal_{generation, inserts} without any
> arguments and if needed a common function
> pg_set_wal_generation_state(read-only/read-write) something like that?

There have been similar suggestions before too, but none of them has
been finalized yet. There are other, bigger challenges that need to be
handled first, so we can leave this for last.

> 4) It looks like only the checkpointer is setting the WAL prohibit
> state? Is there a strong reason for that? Why can't the backend take a
> lock on prohibit state in shared memory and set it and let the
> checkpointer read it and block itself from writing WAL?

Once a WAL prohibited state transition is initiated, it must be
completed; there is no fallback. What if the backend exits before the
transition is complete? By contrast, even if the checkpointer exits, it
will be restarted and will complete the state transition.

> 5) Is SIGUSR1 (which is multiplexed) being sent without a "reason" to
> checkpointer? Why?

We simply want to wake up the checkpointer process without asking for
specific work in the handler function. Another suitable choice would be
SIGINT; we can choose that too if needed.

> 6) What happens for long-running or in-progress transactions if
> someone prohibits WAL in the midst of them? Do these txns fail? Or do
> we say that we will allow them to run to completion? Or do we fail
> those txns at commit time? One might use this feature to say not let
> server go out of disk space, but if we allow in-progress txns to
> generate/write WAL, then how can one achieve that with this feature?
> Say, I monitor my server in such a way that at 90% of disk space,
> prohibit WAL to avoid server crash. But if this feature allows
> in-progress txns to generate WAL, then the server may still crash?

Read-only transactions will be allowed to continue. If such a
transaction tries to write, or if any other transaction has already
performed writes, then the session running that transaction will be
terminated -- the design is described in the first mail of this thread.

> 7) What are the other use-cases (I can think of - to avoid out of disk
> crashes, block/freeze writes to database when the server is
> compromised) with this feature? Any usages during/before failover,
> promotion or after it?

The important use case is failover, to avoid split-brain situations.

> 8) Is there a strong reason that we've picked up conditional variable
> wal_prohibit_cv over mutex/lock for updating WALProhibit shared
> memory?

I am not sure how that can be done using a mutex or lock.

> 9) Any tests that you are planning to add?

Yes, we can. I have added very sophisticated tests that cover most of my
code changes, but that is not enough for such critical code changes.
There is a lot of room for improvement and for adding more tests for
this module as well as for other parts, e.g. some missing coverage of
GIN, GiST, BRIN, and core features where this patch is adding checks.
Any help will be greatly appreciated.

Regards,
Amul
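The answer to question 6 can be summarized as a small decision function. This is only an illustration of the behaviour described in the reply, under the assumption that "has already written" and "tries to write later" are the only two inputs; the type and function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    TXN_CONTINUES,              /* read-only work keeps running */
    TXN_SESSION_TERMINATED      /* session is terminated */
} TxnFate;

/* Hypothetical description of a transaction in progress. */
typedef struct
{
    bool already_wrote;         /* performed writes before WAL was prohibited */
    bool attempts_write_later;  /* tries to write after WAL is prohibited */
} TxnProfile;

/* Fate of an in-progress transaction once WAL becomes prohibited. */
static TxnFate
fate_under_wal_prohibit(const TxnProfile *txn)
{
    assert(txn != NULL);

    if (txn->already_wrote || txn->attempts_write_later)
        return TXN_SESSION_TERMINATED;

    return TXN_CONTINUES;
}
```

Under this reading, no in-progress transaction keeps generating WAL after the state change, which is the concern raised in the disk-space example.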
On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:
> Attached is rebase version for the latest maste head(#891624f0ec).

Hi Amul,

I'm going through past CF triage emails today; I noticed that this
patch dropped out of the commitfest when you withdrew it in January,
but it hasn't been added back with the most recent patchset you
posted. Was that intended, or did you want to re-register it for
review?

--Jacob
Hi,

On Thu, Jul 28, 2022 at 4:05 AM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Fri, Apr 8, 2022 at 7:27 AM Amul Sul <sulamul@gmail.com> wrote:
> > Attached is rebase version for the latest maste head(#891624f0ec).
>
> Hi Amul,
>
> I'm going through past CF triage emails today; I noticed that this
> patch dropped out of the commitfest when you withdrew it in January,
> but it hasn't been added back with the most recent patchset you
> posted. Was that intended, or did you want to re-register it for
> review?
>

Yes, there is a plan to re-register it, but not anytime soon; that will
happen once we start to rework the design.

Regards,
Amul