Thread: [patch] demote
Hi,

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with future objectives similar to mine, I decided to share the humble patch I am playing with to step down an instance from primary to standby status. I'm still wondering about the coding style, but as the discussion about this kind of feature is rising, I'm sharing it at an early stage so it has a chance to be discussed. I'm opening a new thread to avoid disturbing Amul's.

The design of my patch is similar to the crash recovery code, without resetting the shared memory. It supports smart and fast demote. The only existing user interface currently is "pg_ctl [-m smart|fast] demote". An SQL admin function, e.g. pg_demote(), would be easy to add.

The main difference with Amul's patch is that all backends must be disconnected to proceed with the demote. Either we wait for them to disconnect (smart) or we kill them (fast). This makes life much easier from the code point of view, but much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin would kill the session itself, with no option to wait for the action to finish, as we do with pg_promote(). Keeping read-only sessions around could probably be achieved using a global barrier as Amul did, but without all the complexity related to prohibiting WAL writes.

There are still some open questions in the current patch. As I wrote, it's a humble patch, a proof of concept, a bit naive. Is it worth discussing and improving further, or am I missing something obvious in this design that leads to a dead end?

Thanks.

Regards,

[1] https://www.postgresql.org/message-id/flat/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw%40mail.gmail.com
Attachment
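For illustration, the pg_ctl interface described above would be invoked roughly like this (hypothetical syntax taken from the description in the mail; the data directory is a placeholder):

  # smart: wait for all sessions to disconnect, then demote to standby
  pg_ctl -D /path/to/pgdata -m smart demote

  # fast: terminate the remaining sessions instead of waiting
  pg_ctl -D /path/to/pgdata -m fast demote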
On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur > objectives than mine, I decided to share the humble patch I am playing with to > step down an instance from primary to standby status. Cool! This was vaguely on my hit list, but neither I nor any of my colleagues had gotten the time and energy to have a go at it. > Main difference with Amul's patch is that all backends must be disconnected to > process with the demote. Either we wait for them to disconnect (smart) or we > kill them (fast). This makes life much easier from the code point of view, but > much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin > would kill the session, with no options to wait for the action to finish, as we > do with pg_promote(). Keeping read only session around could probably be > achieved using global barrier as Amul did, but without all the complexity > related to WAL writes prohibition. > > There's still some questions in the current patch. As I wrote, it's an humble > patch, a proof of concept, a bit naive. > > Does it worth discussing it and improving it further or do I miss something > obvious in this design that leads to a dead end? I haven't looked at your code, but I think we should view the two efforts as complementing each other, not competing. With both patches in play, a clean switchover would look like this: - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it) to make the primary read only, killing off write transactions - next use pg_ctl promote to promote the standby - finally use pg_ctl demote (or whatever we decide to call it) to turn the read-only primary into a standby of the new primary I think this would be waaaaay better than what you have to do today, which as I mentioned in my reply to Tom on the other thread, is very complicated and error-prone. I think with the combination of that patch and this one (or some successor version of each) we could get to a point where the tooling to do a clean switchover is relatively easy to write and doesn't involve having to shut down the server completely at any point. If we can do it while also preserving connections, at least for read-only queries, that's a better user experience, but as Tom pointed out over there, there are real concerns about the complexity of these patches, so it may be that the approach you've taken of just killing everything is safer and thus a superior choice overall. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote: > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur > objectives than mine, I decided to share the humble patch I am playing with to > step down an instance from primary to standby status. To make sure we are on the same page: What your patch intends to do is to leave the server running, but switch from being a primary to replicating from another system. Correct? > Main difference with Amul's patch is that all backends must be disconnected to > process with the demote. Either we wait for them to disconnect (smart) or we > kill them (fast). This makes life much easier from the code point of view, but > much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin > would kill the session, with no options to wait for the action to finish, as we > do with pg_promote(). Keeping read only session around could probably be > achieved using global barrier as Amul did, but without all the complexity > related to WAL writes prohibition. FWIW just doing that for normal backends isn't sufficient, you also have to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due to hint bits, the checkpoint record, and some more). > There's still some questions in the current patch. As I wrote, it's an humble > patch, a proof of concept, a bit naive. > > Does it worth discussing it and improving it further or do I miss something > obvious in this design that leads to a dead end? I don't think there's a fundamental issue, but I think it needs to deal with a lot more things than it does right now. StartupXLOG doesn't currently deal correctly with subsystems that are already initialized. And your patch doesn't make it so as far as I can tell. Greetings, Andres Freund
On Wed, 17 Jun 2020 12:29:31 -0400 Robert Haas <robertmhaas@gmail.com> wrote: [...]

> > Main difference with Amul's patch is that all backends must be disconnected > > to process with the demote. Either we wait for them to disconnect (smart) > > or we kill them (fast). This makes life much easier from the code point of > > view, but much more naive as well. Eg. calling "SELECT pg_demote('fast')" > > as an admin would kill the session, with no options to wait for the action > > to finish, as we do with pg_promote(). Keeping read only session around > > could probably be achieved using global barrier as Amul did, but without > > all the complexity related to WAL writes prohibition. > > > > There's still some questions in the current patch. As I wrote, it's an > > humble patch, a proof of concept, a bit naive. > > > > Does it worth discussing it and improving it further or do I miss something > > obvious in this design that leads to a dead end? > > I haven't looked at your code, but I think we should view the two > efforts as complementing each other, not competing.

That was part of my feeling. I like the idea of keeping read-only backends around. But I'm not convinced by the "ALTER SYSTEM READ ONLY" feature on its own. At some expense, an admin can already make the system read-only from the application point of view, using:

  alter system set default_transaction_read_only TO on;
  select pg_reload_conf();

Current RW xacts will finish, but no new ones will be allowed.

> With both patches in play, a clean switchover would look like this: > > - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it) > to make the primary read only, killing off write transactions > - next use pg_ctl promote to promote the standby > - finally use pg_ctl demote (or whatever we decide to call it) to turn > the read-only primary into a standby of the new primary

I'm not sure how useful ALTER SYSTEM READ ONLY is, outside of the switchover scope. This seems like it should be included in the demote process itself. If we focus on user experience, my first original goal was:

* demote the primary
* promote a standby

Later down the path of various additional patches (keep read-only backends, add pg_demote(), etc.), we could extend the replication protocol so a switchover can be negotiated and controlled from the nodes themselves.

> I think this would be waaaaay better than what you have to do today, > which as I mentioned in my reply to Tom on the other thread, is very > complicated and error-prone.

Well, I agree, the current procedure to achieve a clean switchover is difficult from the user point of view. For the record, in PAF (a Pacemaker resource agent) we parse the pg_waldump output to check whether the designated standby to promote has received the shutdown checkpoint from the primary. If it has, we accept the promotion. Manually, I usually shut down the primary, checkpoint on the standby, compare the "REDO location" from both sides, then promote.

> I think with the combination of that > patch and this one (or some successor version of each) we could get to > a point where the tooling to do a clean switchover is relatively easy > to write and doesn't involve having to shut down the server completely > at any point.

That would be great, yes.
> If we can do it while also preserving connections, at > least for read-only queries, that's a better user experience, but as > Tom pointed out over there, there are real concerns about the > complexity of these patches, so it may be that the approach you've > taken of just killing everything is safer and thus a superior choice > overall.

As long as this approach doesn't close future doors to keeping read-only backends around, it might be a good first step.

Thank you!
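For reference, the manual switchover procedure mentioned above amounts to roughly the following sketch (hosts and data directory paths are placeholders, and the REDO comparison is simplified):

  # 1. cleanly stop the primary; a clean shutdown ends with a shutdown checkpoint
  pg_ctl -D /path/to/primary/pgdata -m fast stop

  # 2. force a restartpoint on the standby once it has replayed everything it received
  psql -h standby -c 'CHECKPOINT'

  # 3. compare the checkpoint REDO locations on both sides; they should match before promoting
  pg_controldata /path/to/primary/pgdata | grep 'REDO location'
  pg_controldata /path/to/standby/pgdata | grep 'REDO location'

  # 4. promote the standby
  pg_ctl -D /path/to/standby/pgdata promote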
On Wed, 17 Jun 2020 11:14:47 -0700 Andres Freund <andres@anarazel.de> wrote:

> Hi, > > On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote: > > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur > > objectives than mine, I decided to share the humble patch I am playing with > > to step down an instance from primary to standby status. > > To make sure we are on the same page: What your patch intends to do is > to leave the server running, but switch from being a primary to > replicating from another system. Correct?

Yes. The instance status goes back from "in production" to "in archive recovery". Of course, it will start replicating, depending on the archive_mode/command and primary_conninfo setup.

> > Main difference with Amul's patch is that all backends must be disconnected > > to process with the demote. Either we wait for them to disconnect (smart) > > or we kill them (fast). This makes life much easier from the code point of > > view, but much more naive as well. Eg. calling "SELECT pg_demote('fast')" > > as an admin would kill the session, with no options to wait for the action > > to finish, as we do with pg_promote(). Keeping read only session around > > could probably be achieved using global barrier as Amul did, but without > > all the complexity related to WAL writes prohibition. > > FWIW just doing that for normal backends isn't sufficient, you also have > to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due > to hint bits, the checkpoint record, and some more).

In fact, the patch relies on existing code paths in the state machine. The startup process is started when the code enters the PM_NO_CHILDREN state. This state is set once «These other guys should be dead already», as stated in the code:

  /* These other guys should be dead already */
  Assert(StartupPID == 0);
  Assert(WalReceiverPID == 0);
  Assert(BgWriterPID == 0);
  Assert(CheckpointerPID == 0);
  Assert(WalWriterPID == 0);
  Assert(AutoVacPID == 0);
  /* syslogger is not considered here */
  pmState = PM_NO_CHILDREN;

> > There's still some questions in the current patch. As I wrote, it's an > > humble patch, a proof of concept, a bit naive. > > > > Does it worth discussing it and improving it further or do I miss something > > obvious in this design that leads to a dead end? > > I don't think there's a fundamental issue, but I think it needs to deal > with a lot more things than it does right now. StartupXLOG doesn't > currently deal correctly with subsystems that are already > initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about bgwriter, checkpointer, etc., as far as I understand the current state machine, my patch actually deals with them.

Thank you for your feedback! I'll study how hard it would be to keep read-only backends around during the demote step.

Regards,
On 2020/06/18 1:29, Robert Haas wrote: > On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote: >> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar futur >> objectives than mine, I decided to share the humble patch I am playing with to >> step down an instance from primary to standby status. > > Cool! This was vaguely on my hit list, but neither I nor any of my > colleagues had gotten the time and energy to have a go at it. > >> Main difference with Amul's patch is that all backends must be disconnected to >> process with the demote. Either we wait for them to disconnect (smart) or we >> kill them (fast). This makes life much easier from the code point of view, but >> much more naive as well. Eg. calling "SELECT pg_demote('fast')" as an admin >> would kill the session, with no options to wait for the action to finish, as we >> do with pg_promote(). Keeping read only session around could probably be >> achieved using global barrier as Amul did, but without all the complexity >> related to WAL writes prohibition. >> >> There's still some questions in the current patch. As I wrote, it's an humble >> patch, a proof of concept, a bit naive. >> >> Does it worth discussing it and improving it further or do I miss something >> obvious in this design that leads to a dead end? > > I haven't looked at your code, but I think we should view the two > efforts as complementing each other, not competing. With both patches > in play, a clean switchover would look like this: > > - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it) > to make the primary read only, killing off write transactions > - next use pg_ctl promote to promote the standby > - finally use pg_ctl demote (or whatever we decide to call it) to turn > the read-only primary into a standby of the new primary

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY". What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed to the original standby.
4. Promote the original standby to be the new primary.
5. Change primary_conninfo in the demoted standby so that it establishes the replication connection with the new primary.

So it seems enough to implement the "demote" feature for a clean switchover.

Regards,

-- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
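A sketch of what that could look like from the operator's side, assuming the demote feature exists with a pg_ctl interface as discussed in this thread (the demote command is hypothetical, the rest uses existing tools; hosts and paths are placeholders):

  # 1. demote the old primary, node A (hypothetical command from this thread's patch)
  pg_ctl -D /path/to/nodeA/pgdata demote

  # 2./3. wait until node B (the original standby) has received all WAL available on node A
  psql -h nodeA -Atc 'SELECT pg_last_wal_replay_lsn()'
  psql -h nodeB -Atc 'SELECT pg_last_wal_receive_lsn()'   # loop until both LSNs match

  # 4. promote node B
  pg_ctl -D /path/to/nodeB/pgdata promote

  # 5. point node A at the new primary (primary_conninfo is reloadable since v13)
  psql -h nodeA -c "ALTER SYSTEM SET primary_conninfo = 'host=nodeB'"
  psql -h nodeA -c 'SELECT pg_reload_conf()'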
On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > At some expense, Admin can already set the system as readonly from the > application point of view, using: > > alter system set default_transaction_read_only TO on; > select pg_reload_conf(); > > Current RW xact will finish, but no other will be allowed.

That doesn't block all WAL generation, though:

rhaas=# alter system set default_transaction_read_only TO on;
ALTER SYSTEM
rhaas=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
rhaas=#

There's a bunch of other things it also doesn't block, too. If you're trying to switch to a new primary, you really want to stop WAL generation completely on the old one. Otherwise, you can't guarantee that the machine you're going to promote is completely caught up, which means you might lose some changes, and you might have to pg_rewind the old master.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY". > What about the following procedure? > > 1. Demote the primary to a standby. Then this demoted standby is read-only. > 2. The orignal standby automatically establishes the cascading replication > connection with the demoted standby. > 3. Wait for all the WAL records available in the demoted standby to be streamed > to the orignal standby. > 4. Promote the original standby to new primary. > 5. Change primary_conninfo in the demoted standby so that it establishes > the replication connection with new primary. > > So it seems enough to implement "demote" feature for a clean switchover. There's something to that idea. I think it somewhat depends on how invasive the various operations are. For example, I'm not really sure how feasible it is to demote without a full server restart that kicks out all sessions. If that is required, it's a significant disadvantage compared to ASRO. On the other hand, if a machine can be demoted just by kicking out R/W sessions, as ASRO currently does, then maybe there's not that much difference. Or maybe both designs are subject to improvement and we can do something even less invasive... One thing I think people are going to want to do is have the master go read-only if it loses communication to the rest of the network, to avoid or at least mitigate split-brain. However, such network interruptions are often transient, so it might not be uncommon to briefly go read-only due to a network blip, but then recover quickly and return to a read-write state. It doesn't seem to matter much whether that read-only state is a new kind of normal operation (like what ASRO would do) or whether we've actually returned to a recovery state (as demote would do) but the collateral effects of the state change do matter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 18 Jun 2020 11:15:02 -0400 Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote: > > At some expense, Admin can already set the system as readonly from the > > application point of view, using: > > > > alter system set default_transaction_read_only TO on; > > select pg_reload_conf(); > > > > Current RW xact will finish, but no other will be allowed. > > That doesn't block all WAL generation, though: > > rhaas=# alter system set default_transaction_read_only TO on; > ALTER SYSTEM > rhaas=# select pg_reload_conf(); > pg_reload_conf > ---------------- > t > (1 row) > rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts; > rhaas=#

Yes, this, and the fact that any user can easily switch transaction_read_only back to off. This was a terrible example. My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a required user feature, outside of the demote scope. It might be useful for the demote process, but only from the core point of view, without user interaction. It seems there's no other purpose from the admin standpoint.

> There's a bunch of other things it also doesn't block, too. If you're > trying to switch to a new primary, you really want to stop WAL > generation completely on the old one. Otherwise, you can't guarantee > that the machine you're going to promote is completely caught up, > which means you might lose some changes, and you might have to > pg_rewind the old master.

Yes, of course. I wasn't claiming transaction_read_only was useful in a switchover procedure; sorry for the confusion and the misleading comment.

Regards,
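To illustrate the point above about this being only an application-level convention: default_transaction_read_only is a USERSET parameter, so any role can override it, or switch an individual transaction back to read-write. A minimal demonstration, assuming the ALTER SYSTEM + pg_reload_conf() quoted above has been applied (write_check is just a throwaway table name):

  # despite the cluster-wide read-only default, any role can opt back into read-write
  psql -c "BEGIN; SET TRANSACTION READ WRITE; CREATE TABLE write_check (i int); DROP TABLE write_check; COMMIT;"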
On Thu, 18 Jun 2020 11:22:47 -0400 Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> > wrote: > > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY". > > What about the following procedure? > > > > 1. Demote the primary to a standby. Then this demoted standby is read-only. > > 2. The orignal standby automatically establishes the cascading replication > > connection with the demoted standby. > > 3. Wait for all the WAL records available in the demoted standby to be > > streamed to the orignal standby. > > 4. Promote the original standby to new primary. > > 5. Change primary_conninfo in the demoted standby so that it establishes > > the replication connection with new primary. > > > > So it seems enough to implement "demote" feature for a clean switchover. > > There's something to that idea. I think it somewhat depends on how > invasive the various operations are. For example, I'm not really sure > how feasible it is to demote without a full server restart that kicks > out all sessions. If that is required, it's a significant disadvantage > compared to ASRO. On the other hand, if a machine can be demoted just > by kicking out R/W sessions, as ASRO currently does, then maybe > there's not that much difference. Or maybe both designs are subject to > improvement and we can do something even less invasive...

Regarding improvements to the current demote patch, I was considering digging in the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion as we do on standby nodes
* this state could either wait for the end of xacts, or cancel/kill RW backends, in the same fashion as the current smart/fast stop modes do
* from this state, we might then roll back pending prepared xacts, stop other sub-processes, etc. (as the current patch does), and demote safely to PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is this something worth considering? Maybe the code will be so close to ASRO that it would just be a kind of fusion of both patches? I don't know, I haven't looked at the ASRO patch yet.

> One thing I think people are going to want to do is have the master go > read-only if it loses communication to the rest of the network, to > avoid or at least mitigate split-brain. However, such network > interruptions are often transient, so it might not be uncommon to > briefly go read-only due to a network blip, but then recover quickly > and return to a read-write state. It doesn't seem to matter much > whether that read-only state is a new kind of normal operation (like > what ASRO would do) or whether we've actually returned to a recovery > state (as demote would do) but the collateral effects of the state > change do matter.

Well, triggering such actions (demote or read-only) often comes from an external decision, hopefully relying on at least some quorum, and being able to escalate to watchdog or fencing is required. Most tools around will need to demote or fence. It seems dangerous to flip between read-only/read-write on a bunch of cluster nodes. It could quickly get messy, especially since a former primary with non-replicated data could automatically start replicating from a new primary without screaming...

Regards,
On Thu, Jun 18, 2020 at 11:56 AM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > Considering the current demote patch improvement. I was considering to digg in > the following direction: > > * add a new state in the state machine where all backends are idle > * this new state forbid any new writes, the same fashion we do on standby nodes > * this state could either wait for end of xact, or cancel/kill > RW backends, in the same fashion current smart/fast stop do > * from this state, we might then rollback pending prepared xact, stop other > sub-process etc (as the current patch does), and demote safely to > PM_RECOVERY or PM_HOT_STANDBY (depending on the setup). > > Is it something worth considering? > Maybe the code will be so close from ASRO, it would just be kind of a fusion of > both patch? I don't know, I didn't look at the ASRO patch yet. I don't think that the postmaster state machine is the interesting part of this problem. The tricky parts have to do with updating shared memory state, and with updating per-backend private state. For example, snapshots are taken in a different way during recovery than they are in normal operation, hence SnapshotData's takenDuringRecovery member. And I think that we allocate extra shared memory space for storing the data that those snapshots use if, and only if, the server starts up in recovery. So if the server goes backward from normal running into recovery, we might not have the space that we need in shared memory to store the extra data, and even if we had the space it might not be populated correctly, and the code that takes snapshots might not be written properly to handle multiple transitions between recovery and normal running, or even a single backward transition. In general, there's code scattered all throughout the system that assumes the recovery -> normal running transition is one-way. If we go back into recovery by killing off all backends and reinitializing shared memory, then we don't have to worry about that stuff. If we do anything less than that, we have to find all the code that relies on never reentering recovery and fix it all. Now it's also true that we have to do some other things, like restarting the startup process, and stopping things like autovacuum, and the postmaster may need to be involved in some of that. There's clearly some engineering work there, but I think it's substantially less than the amount of engineering work involved in fixing problems with shared memory contents and backend-local state. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020/06/19 0:22, Robert Haas wrote: > On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY". >> What about the following procedure? >> >> 1. Demote the primary to a standby. Then this demoted standby is read-only. >> 2. The orignal standby automatically establishes the cascading replication >> connection with the demoted standby. >> 3. Wait for all the WAL records available in the demoted standby to be streamed >> to the orignal standby. >> 4. Promote the original standby to new primary. >> 5. Change primary_conninfo in the demoted standby so that it establishes >> the replication connection with new primary. >> >> So it seems enough to implement "demote" feature for a clean switchover. > > There's something to that idea. I think it somewhat depends on how > invasive the various operations are. For example, I'm not really sure > how feasible it is to demote without a full server restart that kicks > out all sessions. If that is required, it's a significant disadvantage > compared to ASRO. Even with ASRO, the server restart is necessary and RO sessions are kicked out when demoting RO primary to a standby, i.e., during a clean switchover? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Hi, On 2020-06-18 12:16:27 +0200, Jehan-Guillaume de Rorthais wrote: > On Wed, 17 Jun 2020 11:14:47 -0700 > > I don't think there's a fundamental issue, but I think it needs to deal > > with a lot more things than it does right now. StartupXLOG doesn't > > currently deal correctly with subsystems that are already > > initialized. And your patch doesn't make it so as far as I can tell. > > If you are talking about bgwriter, checkpointer, etc, as far as I understand > the current state machine, my patch actually deal with them. I'm talking about subsystems like subtrans multixact etc not being ok with suddenly being initialized a second time. You cannot call StartupXLOG twice without making modifications to it. Greetings, Andres Freund
Hi, On 2020-06-18 21:41:45 +0900, Fujii Masao wrote: > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY". > What about the following procedure? > > 1. Demote the primary to a standby. Then this demoted standby is read-only. As far as I can tell this step includes ALTER SYSTEM READ ONLY. Sure you can choose not to expose ASRO to the user separately from demote, but that's a miniscule part of the complexity. Greetings, Andres Freund
On Thu, Jun 18, 2020 at 12:55 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Even with ASRO, the server restart is necessary and RO sessions are > kicked out when demoting RO primary to a standby, i.e., during a clean > switchover? The ASRO patch doesn't provide a way for a running server to be put back into recovery, so yes, that is required, unless some other patch fixes it so that it isn't. It would be better to find a way where we never need to kill off R/O sessions at all, and I think that would require all the machinery from the ASRO patch plus some more. If you want to allow sessions to survive a state transition like this - whether it's to a WAL-read-only state or all the way back to recovery - you need a way to prevent further WAL writes in those sessions. Most of the stuff that the ASRO patch does is concerned with that. So it doesn't seem likely to me that we can just throw all that code away, unless by chance somebody else has got a better version of the same thing already. To go back to recovery rather than just to a read-only state, I think you'd need to grapple with some additional issues that patch doesn't touch, like some of the snapshot-taking stuff, but I think you still need to solve all of the problems that it does deal with, unless you're OK with killing every session. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > ... To go back to recovery rather than just to a read-only > state, I think you'd need to grapple with some additional issues that > patch doesn't touch, like some of the snapshot-taking stuff, but I > think you still need to solve all of the problems that it does deal > with, unless you're OK with killing every session. It seems like this is the core decision that needs to be taken. If we're willing to have these state transitions include a server restart, then many things get simpler. If we're not, it's gonna cost us in code complexity and hence bugs. Maybe the usability gain is worth it, or maybe not. I think it would probably be worth the trouble to pursue both designs in parallel for awhile, so we can get a better handle on exactly how much complexity we're buying into with the more ambitious definition. regards, tom lane
Hi, On 2020-06-18 13:24:38 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > ... To go back to recovery rather than just to a read-only > > state, I think you'd need to grapple with some additional issues that > > patch doesn't touch, like some of the snapshot-taking stuff, but I > > think you still need to solve all of the problems that it does deal > > with, unless you're OK with killing every session. > > It seems like this is the core decision that needs to be taken. If > we're willing to have these state transitions include a server restart, > then many things get simpler. If we're not, it's gonna cost us in > code complexity and hence bugs. Maybe the usability gain is worth it, > or maybe not. > > I think it would probably be worth the trouble to pursue both designs in > parallel for awhile, so we can get a better handle on exactly how much > complexity we're buying into with the more ambitious definition. What I like about ALTER SYSTEM READ ONLY is that it would basically be a part of both a restart-based and a non-restart-based implementation. I don't really get why the demote in this thread is mentioned as an alternative - it pretty obviously has to include a large portion of ALTER SYSTEM READ ONLY. The only part that could really be skipped by going straight to demote is a way to make ASRO invocable directly. You can simplify a bit more by killing all user sessions, but at that point there's not that much upshot for having a no-restart version of demote in the first place. The demote patch in this thread doesn't even start to attack much of the real complexity around turning a primary into a standby. To me the complexity of a restartless demotion is likely worth it. But it doesn't seem feasible to get there in one large step. So adding individually usable sub-steps like ASRO makes sense imo. Greetings, Andres Freund
On Thu, Jun 18, 2020 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > It seems like this is the core decision that needs to be taken. If > we're willing to have these state transitions include a server restart, > then many things get simpler. If we're not, it's gonna cost us in > code complexity and hence bugs. Maybe the usability gain is worth it, > or maybe not. > > I think it would probably be worth the trouble to pursue both designs in > parallel for awhile, so we can get a better handle on exactly how much > complexity we're buying into with the more ambitious definition. I wouldn't vote to reject a patch that performed a demotion by doing a full server restart, because it's a useful incremental step, but I wouldn't be excited about spending a lot of time on it, either, because it's basically crippleware by design. Either you have to checkpoint before restarting, or you have to run recovery after restarting, and either of those can be extremely slow. You also break all the connections, which can produce application errors unless the applications are pretty robustly designed, and you lose the entire contents of shared_buffers, which makes things run very slowly even after the restart is completed, which can cause a lengthy slow period even after the system is nominally back up. All of those things are really bad, and AFAICT the first one is the worst by a considerable margin. It can take 20 minutes to checkpoint and even longer to run recovery, so doing this once per year already puts you outside of five-nines territory, and we release critical updates that cannot be applied without a server restart about four times per year. That means that if you perform updates by using a switchover -- a common practice -- and if you apply all of your updates in a timely fashion -- unfortunately, not such a common practice, but one we'd surely like to encourage -- you can't even achieve four nines if a switchover requires either a checkpoint or running recovery. And that's without accounting for any switchovers that you may need to perform for reasons unrelated to upgrades, and without accounting for any unplanned downtime. Not many people in 2020 are interested in running a system with three nines of availability, so I think it is clear that we need to do better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
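To put rough numbers on the figures above: five nines (99.999%) allows about 5.3 minutes of downtime per year and four nines (99.99%) about 53 minutes, so a single switchover that spends 20 minutes on a checkpoint or on recovery already blows the five-nines budget on its own, and four such switchovers per year (one per quarterly update) add up to roughly 80 minutes, more than the entire four-nines allowance.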
Hi,

Here is a summary of my work during the last few days on this demote approach.

Please find attached v2-0001-Demote-PoC.patch, with comments in the commit message and as FIXMEs in the code. The patch is not finished or bug-free yet, I'm still not very happy with the coding style, and it probably lacks some more code documentation, but a lot has changed since v1. It's still a PoC, shared to push the discussion a bit further after a few days of silence on my side.

The patch currently relies on a demote checkpoint. I understand the overhead of a forced checkpoint can be massive and cause major waits/downtime, but I keep this for a later step. Maybe we should be able to cancel a running checkpoint? Or leave it to its syncing work but discard the result without writing it to XLog?

I haven't had time to investigate Robert's concern about shared memory for snapshots during recovery.

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote" raises an assert if some prepared xacts exist. I suppose I will roll them back during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I suppose I could split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

Regards,
Attachment
Hello.

At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

> Here is a summary of my work during the last few days on this demote approach. > > Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the > commit message and as FIXME in code. > > The patch is not finished or bug-free yet, I'm still not very happy with the > coding style, it probably lack some more code documentation, but a lot has > changed since v1. It's still a PoC to push the discussion a bit further after > being myself silent for some days. > > The patch is currently relying on a demote checkpoint. I understand a forced > checkpoint overhead can be massive and cause major wait/downtime. But I keep > this for a later step. Maybe we should be able to cancel a running checkpoint? > Or leave it to its synching work but discard the result without wirting it to > XLog?

If we are going to dive so close to server shutdown, we can just utilize the restart-after-crash path, which we can assume to work reliably. The attached is a quite rough sketch of that, hijacking the smart shutdown path for convenience, but it seems to work. "pg_ctl -m s -W stop" lets the server demote.

> I hadn't time to investigate Robert's concern about shared memory for snapshot > during recovery.

The patch does all the required cleanup of resources, including shared memory, I believe. Is that enough if we don't need to keep any resources alive?

> The patch doesn't deal with prepared xact yet. Testing "start->demote->promote" > raise an assert if some prepared xact exist. I suppose I will rollback them > during demote in next patch version. > > I'm not sure how to divide this patch in multiple small independent steps. I > suppose I can split it like: > > 1. add demote checkpoint > 2. support demote: mostly postmaster, startup/xlog and checkpointer related > code > 3. cli using pg_ctl demote > > ...But I'm not sure it worth it.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..a4adf3e587 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2752,6 +2752,7 @@ SIGHUP_handler(SIGNAL_ARGS)
 /*
  * pmdie -- signal handler for processing various postmaster signals.
  */
+static bool demoting = false;
 static void
 pmdie(SIGNAL_ARGS)
 {
@@ -2774,59 +2775,17 @@ pmdie(SIGNAL_ARGS)
 		case SIGTERM:
 
 			/*
-			 * Smart Shutdown:
+			 * XXX: Hijacked as DEMOTE
 			 *
-			 * Wait for children to end their work, then shut down.
+			 * Runs fast shutdown, then restart as standby
 			 */
 			if (Shutdown >= SmartShutdown)
 				break;
 			Shutdown = SmartShutdown;
 			ereport(LOG,
-					(errmsg("received smart shutdown request")));
-
-			/* Report status */
-			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
-#ifdef USE_SYSTEMD
-			sd_notify(0, "STOPPING=1");
-#endif
-
-			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
-				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
-			{
-				/* autovac workers are told to shut down immediately */
-				/* and bgworkers too; does this need tweaking? */
-				SignalSomeChildren(SIGTERM,
-								   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-				/* and the autovac launcher too */
-				if (AutoVacPID != 0)
-					signal_child(AutoVacPID, SIGTERM);
-				/* and the bgwriter too */
-				if (BgWriterPID != 0)
-					signal_child(BgWriterPID, SIGTERM);
-				/* and the walwriter too */
-				if (WalWriterPID != 0)
-					signal_child(WalWriterPID, SIGTERM);
-
-				/*
-				 * If we're in recovery, we can't kill the startup process
-				 * right away, because at present doing so does not release
-				 * its locks.  We might want to change this in a future
-				 * release.  For the time being, the PM_WAIT_READONLY state
-				 * indicates that we're waiting for the regular (read only)
-				 * backends to die off; once they do, we'll kill the startup
-				 * and walreceiver processes.
-				 */
-				pmState = (pmState == PM_RUN) ?
-					PM_WAIT_BACKUP : PM_WAIT_READONLY;
-			}
-
-			/*
-			 * Now wait for online backup mode to end and backends to exit. If
-			 * that is already the case, PostmasterStateMachine will take the
-			 * next step.
-			 */
-			PostmasterStateMachine();
-			break;
+					(errmsg("received demote request")));
+			demoting = true;
+			/* FALL THROUGH */
 
 		case SIGINT:
 
@@ -2839,8 +2798,10 @@ pmdie(SIGNAL_ARGS)
 			if (Shutdown >= FastShutdown)
 				break;
 			Shutdown = FastShutdown;
-			ereport(LOG,
-					(errmsg("received fast shutdown request")));
+
+			if (!demoting)
+				ereport(LOG,
+						(errmsg("received fast shutdown request")));
 
 			/* Report status */
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
@@ -2887,6 +2848,13 @@ pmdie(SIGNAL_ARGS)
 				pmState = PM_WAIT_BACKENDS;
 			}
 
+			/* create standby signal file */
+			{
+				FILE	   *standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+
+				Assert (standby_file && !FreeFile(standby_file));
+			}
+
 			/*
 			 * Now wait for backends to exit.  If there are none,
 			 * PostmasterStateMachine will take the next step.
@@ -3958,7 +3926,7 @@ PostmasterStateMachine(void)
 	 * EOF on its input pipe, which happens when there are no more upstream
 	 * processes.
 	 */
-	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+	if (!demoting && Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
 	{
 		if (FatalError)
 		{
@@ -3996,13 +3964,23 @@ PostmasterStateMachine(void)
 		ExitPostmaster(1);
 
 	/*
-	 * If we need to recover from a crash, wait for all non-syslogger children
-	 * to exit, then reset shmem and StartupDataBase.
+	 * If we need to recover from a crash or demoting, wait for all
+	 * non-syslogger children to exit, then reset shmem and StartupDataBase.
 	 */
-	if (FatalError && pmState == PM_NO_CHILDREN)
+	if ((demoting || FatalError) && pmState == PM_NO_CHILDREN)
 	{
-		ereport(LOG,
-				(errmsg("all server processes terminated; reinitializing")));
+		if (demoting)
+			ereport(LOG,
+					(errmsg("all server processes terminated; starting as standby")));
+		else
+			ereport(LOG,
+					(errmsg("all server processes terminated; reinitializing")));
+
+		if (demoting)
+		{
+			Shutdown = NoShutdown;
+			demoting = false;
+		}
 
 		/* allow background workers to immediately restart */
 		ResetBackgroundWorkerCrashTimes();
Mmm. Fat finger.. At Fri, 26 Jun 2020 16:14:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Hello. > > If we are going to dive so close to server shutdown, we can just > utilize the restart-after-crash path, which we can assume to work > reliably. The attached is a quite rough sketch, hijacking smart > shutdown path for a convenience, of that but seems working. "pg_ctl > -m s -W stop" lets server demote. > > > I hadn't time to investigate Robert's concern about shared memory for snapshot > > during recovery. > > The patch does all required clenaup of resources including shared The path does all required clenaup of.. > memory, I believe. It's enough if we don't need to keep any resources > alive? > > > The patch doesn't deal with prepared xact yet. Testing "start->demote->promote" > > raise an assert if some prepared xact exist. I suppose I will rollback them > > during demote in next patch version. > > > > I'm not sure how to divide this patch in multiple small independent steps. I > > suppose I can split it like: > > > > 1. add demote checkpoint > > 2. support demote: mostly postmaster, startup/xlog and checkpointer related > > code > > 3. cli using pg_ctl demote > > > > ...But I'm not sure it worth it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, 26 Jun 2020 16:14:38 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Hello. > > At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote in > > Here is a summary of my work during the last few days on this demote > > approach. > > > > Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the > > commit message and as FIXME in code. > > > > The patch is not finished or bug-free yet, I'm still not very happy with the > > coding style, it probably lack some more code documentation, but a lot has > > changed since v1. It's still a PoC to push the discussion a bit further > > after being myself silent for some days. > > > > The patch is currently relying on a demote checkpoint. I understand a forced > > checkpoint overhead can be massive and cause major wait/downtime. But I keep > > this for a later step. Maybe we should be able to cancel a running > > checkpoint? Or leave it to its synching work but discard the result without > > wirting it to XLog? > > If we are going to dive so close to server shutdown, we can just > utilize the restart-after-crash path, which we can assume to work > reliably. The attached is a quite rough sketch, hijacking smart > shutdown path for a convenience, of that but seems working. "pg_ctl > -m s -W stop" lets server demote.

This was actually my very first toy PoC. However, resetting everything is far from the graceful demote I was seeking. Moreover, such a patch would not be able to evolve to, e.g., keep read-only backends around.

> > I hadn't time to investigate Robert's concern about shared memory for > > snapshot during recovery. > > The patch does all required clenaup of resources including shared > memory, I believe. It's enough if we don't need to keep any resources > alive?

Resetting everything might not be enough. If I understand Robert's concern correctly, it might actually need more shmem for hot standby xact snapshots. Or maybe some shmem initialized differently.

Regards,
Hi,

Here is a small activity summary since the last report.

On Thu, 25 Jun 2020 19:27:54 +0200 Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: [...]

> I hadn't time to investigate Robert's concern about shared memory for snapshot > during recovery.

I haven't had time to dig very far, but I suppose this might be related to the comment in ProcArrayShmemSize(). If I'm right, then it seems the space is already allocated as long as hot_standby is enabled. I realize it doesn't mean we are on the safe side of the fence, though. I still need to get a better understanding of this.

> The patch doesn't deal with prepared xact yet. Testing > "start->demote->promote" raise an assert if some prepared xact exist. I > suppose I will rollback them during demote in next patch version.

Rolling back all prepared transactions on demote seems easy. However, I realized there's no point in cancelling them. After the demote action, they might still be committed later on a promoted instance.

I am currently trying to clean shared memory for existing prepared transactions so they are handled by the startup process during recovery. I've been able to clean TwoPhaseState and the ProcArray. I'm now in the process of cleaning the remaining prepared xact locks.

Regards,
Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during demote/promote actions so they can be committed/rolled back later on a primary. See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This removes some useless code complexity, e.g. avoiding having to signal the postmaster from the checkpointer to keep going with the demotion.

Cascaded replication is now supported. WAL senders stay active during demotion but set their local "am_cascading_walsender = true". It has been a rough debugging session (thank you rr and tests!) on my side, but it might be worth it. I believe they should stay connected during the demote actions for future features, e.g. triggering a switchover over the replication protocol using an admin function.

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add some more tests in future versions.

I believe the patch is ready for some preliminary tests and advice or directions. On my todo:

* study how to only disconnect or cancel active RW backends
* ...then add pg_demote() admin function
* cancel running checkpoint for fast demote?
* user documentation
* Robert's concern about snapshots during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200 Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi, > > Here is a small activity summary since last report. > > On Thu, 25 Jun 2020 19:27:54 +0200 > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > [...] > > I hadn't time to investigate Robert's concern about shared memory for > > snapshot during recovery. > > I hadn't time to dig very far, but I suppose this might be related to the > comment in ProcArrayShmemSize(). If I'm right, then it seems the space is > already allocated as long as hot_standby is enabled. I realize it doesn't > means we are on the safe side of the fence though. I still have to have a > better understanding on this. > > > The patch doesn't deal with prepared xact yet. Testing > > "start->demote->promote" raise an assert if some prepared xact exist. I > > suppose I will rollback them during demote in next patch version. > > Rollback all prepared transaction on demote seems easy. However, I realized > there's no point to cancel them. After the demote action, they might still be > committed later on a promoted instance. > > I am currently trying to clean shared memory for existing prepared transaction > so they are handled by the startup process during recovery. > I've been able to clean TwoPhaseState and the ProcArray. I'm now in the > process to clean remaining prepared xact locks. > > Regards, > > -- Jehan-Guillaume de Rorthais Dalibo
Attachment
On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > > Hi, > > Another summary + patch + tests. > > This patch supports 2PC. The goal is to keep them safe during demote/promote > actions so they can be committed/rollbacked later on a primary. See tests. > I'm wondering, is it necessary to clear prepared transactions from shared memory? Can't we simply skip clearing and restoring prepared transactions while demoting? Regards, Amul
Hi, On 2020-07-14 17:26:37 +0530, Amul Sul wrote: > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais > <jgdr@dalibo.com> wrote: > > > > Hi, > > > > Another summary + patch + tests. > > > > This patch supports 2PC. The goal is to keep them safe during demote/promote > > actions so they can be committed/rollbacked later on a primary. See tests. > > > > Wondering is it necessary to clear prepared transactions from shared memory? > Can't simply skip clearing and restoring prepared transactions while demoting? Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store transaction state. So I don't think it'd be correct to leave them around in their previous state. We'd likely end up with incorrect snapshots if a demoted node later gets promoted... Greetings, Andres Freund
On Tue, 14 Jul 2020 12:49:51 -0700 Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2020-07-14 17:26:37 +0530, Amul Sul wrote: > > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais > > <jgdr@dalibo.com> wrote: > > > > > > Hi, > > > > > > Another summary + patch + tests. > > > > > > This patch supports 2PC. The goal is to keep them safe during > > > demote/promote actions so they can be committed/rollbacked later on a > > > primary. See tests. > > > > Wondering is it necessary to clear prepared transactions from shared memory? > > Can't simply skip clearing and restoring prepared transactions while > > demoting? > > Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store > transaction state. So I don't think it'd be correct to leave them around > in their previous state. We'd likely end up with incorrect snapshots > if a demoted node later gets promoted... Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to be cleared.
Hi,

Yet another summary + patch + tests.

Demote now keeps backends with no active xid alive. Smart mode keeps all backends: it waits for them to finish their xact and enter read-only. Fast mode terminates backends with an active xid and keeps all the other ones. Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1 (check recovery state) once demoted. During demote, no new session is allowed.

As backends with no active xid survive, a new SQL admin function "pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie(). The resulting refactoring makes the code much simpler and cleaner, with better isolation of actions from the code point of view.

Thanks to the refactoring, the patch now only adds one state to the state machine: PM_DEMOTING. A second one could be used to replace:

  /* Demoting: start the Startup Process */
  if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with e.g.:

  if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment might be good enough as well. The full state machine path for demote is:

  PM_DEMOTING                      /* wait for backends with an active xid to finish */
  PM_SHUTDOWN                      /* wait for the shutdown checkpoint and its various shutdown tasks */
  PM_SHUTDOWN && !CheckpointerPID  /* aka PM_DEMOTED: start the Startup process */
  PM_STARTUP

Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests, adding tests on backend behavior during demote and on the new function pg_demote().

On my todo:

* cancel running checkpoint for fast demote?
* forbid demote when a PITR backup is in progress
* user documentation
* Robert's concern about snapshots during hot standby
* anything else reported to me

Plus, I might be able to split the backend part and their signals out of patch 0002 into its own patch if it helps the review. It would apply after 0001 and before the actual 0002.

As there was no consensus and the discussions seemed to conclude that this patch set should keep growing to see where it goes, I wonder if/when I should add it to the commitfest. Advice? Opinion?

Regards,
Attachment
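Taking the signature described above at face value, calling the new admin function would look something like this (hypothetical until the patch lands; the 60-second timeout is arbitrary):

  # demote via SQL, terminating backends with an active xid and waiting up to 60s for completion
  psql -c "SELECT pg_demote(fast => true, wait => true, wait_seconds => 60)"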
Hi, Please find in attachment v5 of the patch set rebased on master after various conflicts. Regards, On Wed, 5 Aug 2020 00:04:53 +0200 Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > Demote now keeps backends with no active xid alive. Smart mode keeps all > backends: it waits for them to finish their xact and enter read-only. Fast > mode terminate backends wit an active xid and keeps all other ones. > Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1 > (check recovery state) once demoted. > During demote, no new session is allowed. > > As backends with no active xid survive, a new SQL admin function > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added. > > Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie(). > The resulting refactoring makes the code much simpler, cleaner, with better > isolation of actions from the code point of view. > > Thanks to the refactoring, the patch now only adds one state to the state > machine: PM_DEMOTING. A second one could be use to replace: > > /* Demoting: start the Startup Process */ > if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0) > > with eg.: > > if (pmState == PM_DEMOTED) > > I believe it might be a bit simpler to understand, but the existing comment > might be good enough as well. The full state machine path for demote is: > > PM_DEMOTING /* wait for active xid backend to finish */ > PM_SHUTDOWN /* wait for checkpoint shutdown and its > various shutdown tasks */ > PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */ > PM_STARTUP > > Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests, > adding tests on backend behaviors during demote and new function pg_demote(). > > On my todo: > > * cancel running checkpoint for fast demote ? > * forbid demote when PITR backup is in progress > * user documentation > * Robert's concern about snapshot during hot standby > * anything else reported to me > > Plus, I might be able to split the backend part and their signals of the patch > 0002 in its own patch if it helps the review. It would apply after 0001 and > before actual 0002. > > As there was no consensus and the discussions seemed to conclude this patch > set should keep growing to see were it goes, I wonder if/when I should add it > to the commitfest. Advice? Opinion?
Attachment
On Tue, 18 Aug 2020 17:41:31 +0200 Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi, > > Please find in attachment v5 of the patch set rebased on master after various > conflicts. > > Regards, > > On Wed, 5 Aug 2020 00:04:53 +0200 > Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote: > > > Demote now keeps backends with no active xid alive. Smart mode keeps all > > backends: it waits for them to finish their xact and enter read-only. Fast > > mode terminate backends wit an active xid and keeps all other ones. > > Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1 > > (check recovery state) once demoted. > > During demote, no new session is allowed. > > > > As backends with no active xid survive, a new SQL admin function > > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.

Just to keep the list informed, I found a race condition leading to backends trying to write to XLog after they processed the demote signal. E.g.:

  [postmaster]   LOG: all backends in read only
  [checkpointer] LOG: demoting
  [backend]      PANIC: cannot make new WAL entries during recovery
                 STATEMENT: UPDATE pgbench_accounts [...]

Because of this, the postmaster enters crash recovery while the demote sequence is in progress.

I have a couple of other subjects to deal with right now, but I plan to get back to this soon.

Regards,