Thread: [patch] demote

[patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
objectives to mine, I decided to share the humble patch I am playing with to
step down an instance from primary to standby status.

I'm still wondering about the coding style, but as the discussion about this
kind of feature is picking up, I'm sharing it at an early stage so it has a
chance to be discussed.

I'm opening a new discussion to avoid disturbing Amul's thread.

The design of my patch is similar to the crash recovery code, without resetting
the shared memory. It supports smart and fast demote. The only user interface
currently is "pg_ctl [-m smart|fast] demote". An SQL admin function,
e.g. pg_demote(), would be easy to add.

The main difference with Amul's patch is that all backends must be disconnected
to proceed with the demote. Either we wait for them to disconnect (smart) or we
kill them (fast). This makes life much easier from the code point of view, but
much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin
would kill the session, with no option to wait for the action to finish, as we
can do with pg_promote(). Keeping read-only sessions around could probably be
achieved using a global barrier as Amul did, but without all the complexity
related to prohibiting WAL writes.
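
For comparison, here is roughly what the existing promotion API and the
hypothetical demote call would look like (pg_demote() does not exist yet; its
shape here is only an assumption):

  -- available since PostgreSQL 12: returns once promotion completes or times out
  SELECT pg_promote(wait => true, wait_seconds => 60);

  -- hypothetical counterpart: with the current design the calling session is
  -- killed, so it cannot wait for the demote to finish
  SELECT pg_demote('fast');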

There are still some questions in the current patch. As I wrote, it's a humble
patch, a proof of concept, a bit naive.

Is it worth discussing and improving further, or am I missing something obvious
in this design that leads to a dead end?

Thanks.

Regards,

[1] https://www.postgresql.org/message-id/flat/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw%40mail.gmail.com

Attachment

Re: [patch] demote

From
Robert Haas
Date:
On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
> objectives to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

Cool! This was vaguely on my hit list, but neither I nor any of my
colleagues had gotten the time and energy to have a go at it.

> The main difference with Amul's patch is that all backends must be disconnected
> to proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no option to wait for the action to finish, as we
> can do with pg_promote(). Keeping read-only sessions around could probably be
> achieved using a global barrier as Amul did, but without all the complexity
> related to prohibiting WAL writes.
>
> There are still some questions in the current patch. As I wrote, it's a humble
> patch, a proof of concept, a bit naive.
>
> Is it worth discussing and improving further, or am I missing something obvious
> in this design that leads to a dead end?

I haven't looked at your code, but I think we should view the two
efforts as complementing each other, not competing. With both patches
in play, a clean switchover would look like this:

- first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
to make the primary read only, killing off write transactions
- next use pg_ctl promote to promote the standby
- finally use pg_ctl demote (or whatever we decide to call it) to turn
the read-only primary into a standby of the new primary

I think this would be waaaaay better than what you have to do today,
which as I mentioned in my reply to Tom on the other thread, is very
complicated and error-prone. I think with the combination of that
patch and this one (or some successor version of each) we could get to
a point where the tooling to do a clean switchover is relatively easy
to write and doesn't involve having to shut down the server completely
at any point. If we can do it while also preserving connections, at
least for read-only queries, that's a better user experience, but as
Tom pointed out over there, there are real concerns about the
complexity of these patches, so it may be that the approach you've
taken of just killing everything is safer and thus a superior choice
overall.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Andres Freund
Date:
Hi,

On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:
> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
> objectives to mine, I decided to share the humble patch I am playing with to
> step down an instance from primary to standby status.

To make sure we are on the same page: What your patch intends to do is
to leave the server running, but switch from being a primary to
replicating from another system. Correct?


> The main difference with Amul's patch is that all backends must be disconnected
> to proceed with the demote. Either we wait for them to disconnect (smart) or we
> kill them (fast). This makes life much easier from the code point of view, but
> much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin
> would kill the session, with no option to wait for the action to finish, as we
> can do with pg_promote(). Keeping read-only sessions around could probably be
> achieved using a global barrier as Amul did, but without all the complexity
> related to prohibiting WAL writes.

FWIW, just doing that for normal backends isn't sufficient; you also have
to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
to hint bits, the checkpoint record, and some more).


> There are still some questions in the current patch. As I wrote, it's a humble
> patch, a proof of concept, a bit naive.
> 
> Is it worth discussing and improving further, or am I missing something obvious
> in this design that leads to a dead end?

I don't think there's a fundamental issue, but I think it needs to deal
with a lot more things than it does right now. StartupXLOG doesn't
currently deal correctly with subsystems that are already
initialized. And your patch doesn't make it so as far as I can tell.

Greetings,

Andres Freund



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 17 Jun 2020 12:29:31 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

[...]
> > The main difference with Amul's patch is that all backends must be
> > disconnected to proceed with the demote. Either we wait for them to disconnect
> > (smart) or we kill them (fast). This makes life much easier from the code
> > point of view, but much more naive as well. E.g. calling "SELECT
> > pg_demote('fast')" as an admin would kill the session, with no option to wait
> > for the action to finish, as we can do with pg_promote(). Keeping read-only
> > sessions around could probably be achieved using a global barrier as Amul
> > did, but without all the complexity related to prohibiting WAL writes.
> >
> > There are still some questions in the current patch. As I wrote, it's a
> > humble patch, a proof of concept, a bit naive.
> >
> > Is it worth discussing and improving further, or am I missing something
> > obvious in this design that leads to a dead end?
> 
> I haven't looked at your code, but I think we should view the two
> efforts as complementing each other, not competing.

That was part of my feeling. I like the idea of keeping read-only backends
around. But I'm not convinced by the "ALTER SYSTEM READ ONLY" feature on its
own.

At some expense, an admin can already set the system as read-only from the
application point of view, using:

  alter system set default_transaction_read_only TO on;
  select pg_reload_conf();

Current RW xacts will finish, but no new ones will be allowed.
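
New sessions that then try to write fail with something like this (the table
name is just an example):

  postgres=> UPDATE t SET x = 1;
  ERROR:  cannot execute UPDATE in a read-only transaction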

> With both patches in play, a clean switchover would look like this:
> 
> - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
> to make the primary read only, killing off write transactions
> - next use pg_ctl promote to promote the standby
> - finally use pg_ctl demote (or whatever we decide to call it) to turn
> the read-only primary into a standby of the new primary

I'm not sure how useful ALTER SYSTEM READ ONLY is outside of the switchover
scope. This seems like it should be included in the demote process itself. If we
focus on user experience, my original goal was:

* demote the primary
* promote a standby

Later, down the path of various additional patches (keep read-only backends, add
pg_demote(), etc.), we could extend the replication protocol so a switchover can
be negotiated and controlled from the nodes themselves.

> I think this would be waaaaay better than what you have to do today,
> which as I mentioned in my reply to Tom on the other thread, is very
> complicated and error-prone.

Well, I agree, the current procedure to achieve a clean switchover is
difficult from the user's point of view.

For the record, in PAF (a Pacemaker resource agent) we parse the pg_waldump
output to check whether the designated standby to promote received the shutdown
checkpoint from the primary. If it has, we proceed with the promotion.

Manually, I usually shut down the primary, run a checkpoint on the standby,
compare the "REDO location" from both sides, then promote.

> I think with the combination of that
> patch and this one (or some successor version of each) we could get to
> a point where the tooling to do a clean switchover is relatively easy
> to write and doesn't involve having to shut down the server completely
> at any point.

That would be great yes.

> If we can do it while also preserving connections, at
> least for read-only queries, that's a better user experience, but as
> Tom pointed out over there, there are real concerns about the
> complexity of these patches, so it may be that the approach you've
> taken of just killing everything is safer and thus a superior choice
> overall.

As long as this approach doesn't close the door on keeping read-only backends
around later, it might be a good first step.

Thank you!



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Wed, 17 Jun 2020 11:14:47 -0700
Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2020-06-17 17:44:51 +0200, Jehan-Guillaume de Rorthais wrote:
> > As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
> > objectives to mine, I decided to share the humble patch I am playing with
> > to step down an instance from primary to standby status.
>
> To make sure we are on the same page: What your patch intends to do is
> to leave the server running, but switch from being a primary to
> replicating from another system. Correct?

Yes. The instance status is downgraded from "in production" to "in archive
recovery".

Of course, whether it actually starts replicating depends on the
archive_mode/command and primary_conninfo setup.

> > The main difference with Amul's patch is that all backends must be
> > disconnected to proceed with the demote. Either we wait for them to disconnect
> > (smart) or we kill them (fast). This makes life much easier from the code
> > point of view, but much more naive as well. E.g. calling "SELECT
> > pg_demote('fast')" as an admin would kill the session, with no option to wait
> > for the action to finish, as we can do with pg_promote(). Keeping read-only
> > sessions around could probably be achieved using a global barrier as Amul
> > did, but without all the complexity related to prohibiting WAL writes.
>
> FWIW just doing that for normal backends isn't sufficient, you also have
> to deal with bgwriter, checkpointer, ... triggering WAL writes (FPWs due
> to hint bits, the checkpoint record, and some more).

In fact, the patch relies on an existing code path in the state machine. The
startup process is started when the code enters the PM_NO_CHILDREN state. This
state is set once «these other guys should be dead already», as stated in the
code:

            /* These other guys should be dead already */
            Assert(StartupPID == 0);
            Assert(WalReceiverPID == 0);
            Assert(BgWriterPID == 0);
            Assert(CheckpointerPID == 0);
            Assert(WalWriterPID == 0);
            Assert(AutoVacPID == 0);
            /* syslogger is not considered here */
            pmState = PM_NO_CHILDREN;

> > There are still some questions in the current patch. As I wrote, it's a
> > humble patch, a proof of concept, a bit naive.
> >
> > Is it worth discussing and improving further, or am I missing something
> > obvious in this design that leads to a dead end?
>
> I don't think there's a fundamental issue, but I think it needs to deal
> with a lot more things than it does right now. StartupXLOG doesn't
> currently deal correctly with subsystems that are already
> initialized. And your patch doesn't make it so as far as I can tell.

If you are talking about bgwriter, checkpointer, etc., then as far as I
understand the current state machine, my patch actually deals with them.

Thank you for your feedback!

I'll study how hard it would be to keep read-only backends around during the
demote step.

Regards,



Re: [patch] demote

From
Fujii Masao
Date:

On 2020/06/18 1:29, Robert Haas wrote:
> On Wed, Jun 17, 2020 at 11:45 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
>> As Amul sent a patch about "ALTER SYSTEM READ ONLY"[1], with similar future
>> objectives to mine, I decided to share the humble patch I am playing with to
>> step down an instance from primary to standby status.
> 
> Cool! This was vaguely on my hit list, but neither I nor any of my
> colleagues had gotten the time and energy to have a go at it.
> 
>> The main difference with Amul's patch is that all backends must be disconnected
>> to proceed with the demote. Either we wait for them to disconnect (smart) or we
>> kill them (fast). This makes life much easier from the code point of view, but
>> much more naive as well. E.g. calling "SELECT pg_demote('fast')" as an admin
>> would kill the session, with no option to wait for the action to finish, as we
>> can do with pg_promote(). Keeping read-only sessions around could probably be
>> achieved using a global barrier as Amul did, but without all the complexity
>> related to prohibiting WAL writes.
>>
>> There are still some questions in the current patch. As I wrote, it's a humble
>> patch, a proof of concept, a bit naive.
>>
>> Is it worth discussing and improving further, or am I missing something obvious
>> in this design that leads to a dead end?
> 
> I haven't looked at your code, but I think we should view the two
> efforts as complementing each other, not competing. With both patches
> in play, a clean switchover would look like this:
> 
> - first use ALTER SYSTEM READ ONLY (or whatever we decide to call it)
> to make the primary read only, killing off write transactions
> - next use pg_ctl promote to promote the standby
> - finally use pg_ctl demote (or whatever we decide to call it) to turn
> the read-only primary into a standby of the new primary

ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
What about the following procedure?

1. Demote the primary to a standby. Then this demoted standby is read-only.
2. The original standby automatically establishes the cascading replication
    connection with the demoted standby.
3. Wait for all the WAL records available in the demoted standby to be streamed
    to the original standby.
4. Promote the original standby to new primary.
5. Change primary_conninfo in the demoted standby so that it establishes
    the replication connection with the new primary.

So it seems enough to implement the "demote" feature for a clean switchover.
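
For step 3, a rough check could be run on the demoted standby, comparing what
the downstream standby has replayed with what is locally available:

  -- on the demoted standby, for each downstream standby:
  SELECT application_name,
         replay_lsn = pg_last_wal_replay_lsn() AS caught_up
    FROM pg_stat_replication;
  -- promote the original standby once caught_up is true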

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] demote

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> At some expense, an admin can already set the system as read-only from the
> application point of view, using:
>
>   alter system set default_transaction_read_only TO on;
>   select pg_reload_conf();
>
> Current RW xacts will finish, but no new ones will be allowed.

That doesn't block all WAL generation, though:

rhaas=# alter system set default_transaction_read_only TO on;
ALTER SYSTEM
rhaas=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
rhaas=#

There's a bunch of other things it doesn't block, too. If you're
trying to switch to a new primary, you really want to stop WAL
generation completely on the old one. Otherwise, you can't guarantee
that the machine you're going to promote is completely caught up,
which means you might lose some changes, and you might have to
pg_rewind the old master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> What about the following procedure?
>
> 1. Demote the primary to a standby. Then this demoted standby is read-only.
> 2. The original standby automatically establishes the cascading replication
>     connection with the demoted standby.
> 3. Wait for all the WAL records available in the demoted standby to be streamed
>     to the original standby.
> 4. Promote the original standby to new primary.
> 5. Change primary_conninfo in the demoted standby so that it establishes
>     the replication connection with the new primary.
>
> So it seems enough to implement the "demote" feature for a clean switchover.

There's something to that idea. I think it somewhat depends on how
invasive the various operations are. For example, I'm not really sure
how feasible it is to demote without a full server restart that kicks
out all sessions. If that is required, it's a significant disadvantage
compared to ASRO. On the other hand, if a machine can be demoted just
by kicking out R/W sessions, as ASRO currently does, then maybe
there's not that much difference. Or maybe both designs are subject to
improvement and we can do something even less invasive...

One thing I think people are going to want to do is have the master go
read-only if it loses communication to the rest of the network, to
avoid or at least mitigate split-brain. However, such network
interruptions are often transient, so it might not be uncommon to
briefly go read-only due to a network blip, but then recover quickly
and return to a read-write state. It doesn't seem to matter much
whether that read-only state is a new kind of normal operation (like
what ASRO would do) or whether we've actually returned to a recovery
state (as demote would do), but the collateral effects of the state
change do matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Thu, 18 Jun 2020 11:15:02 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Jun 18, 2020 at 6:02 AM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> > At some expense, an admin can already set the system as read-only from the
> > application point of view, using:
> >
> >   alter system set default_transaction_read_only TO on;
> >   select pg_reload_conf();
> >
> > Current RW xacts will finish, but no new ones will be allowed.
> 
> That doesn't block all WAL generation, though:
> 
> rhaas=# alter system set default_transaction_read_only TO on;
> ALTER SYSTEM
> rhaas=# select pg_reload_conf();
>  pg_reload_conf
> ----------------
>  t
> (1 row)
> rhaas=# cluster pgbench_accounts_pkey on pgbench_accounts;
> rhaas=#

Yes, this, and the fact that any user can easily switch transaction_read_only
back to off. This was a terrible example.
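
E.g. any unprivileged session can simply do:

  SET default_transaction_read_only TO off;
  -- or, for a single transaction:
  BEGIN READ WRITE;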

My point was that ALTER SYSTEM READ ONLY as described here doesn't feel like a
required user feature, outside of the demote scope. It might be useful for the
demote process, but only from the core point of view, without user interaction.
It seems there's no other purpose from the admin standpoint.

> There's a bunch of other things it also doesn't block, too. If you're
> trying to switch to a new primary, you really want to stop WAL
> generation completely on the old one. Otherwise, you can't guarantee
> that the machine you're going to promote is completely caught up,
> which means you might lose some changes, and you might have to
> pg_rewind the old master.

Yes, of course. I wasn't claiming transaction_read_only was useful in a
switchover procedure; sorry for the confusion and the misleading comment.

Regards,



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Thu, 18 Jun 2020 11:22:47 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com>
> wrote:
> > ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> > What about the following procedure?
> >
> > 1. Demote the primary to a standby. Then this demoted standby is read-only.
> > 2. The original standby automatically establishes the cascading replication
> >     connection with the demoted standby.
> > 3. Wait for all the WAL records available in the demoted standby to be
> > streamed to the original standby.
> > 4. Promote the original standby to new primary.
> > 5. Change primary_conninfo in the demoted standby so that it establishes
> >     the replication connection with the new primary.
> >
> > So it seems enough to implement the "demote" feature for a clean switchover.
> 
> There's something to that idea. I think it somewhat depends on how
> invasive the various operations are. For example, I'm not really sure
> how feasible it is to demote without a full server restart that kicks
> out all sessions. If that is required, it's a significant disadvantage
> compared to ASRO. On the other hand, if a machine can be demoted just
> by kicking out R/W sessions, as ASRO currently does, then maybe
> there's not that much difference. Or maybe both designs are subject to
> improvement and we can do something even less invasive...

Regarding improvements to the current demote patch, I was considering digging
in the following direction:

* add a new state in the state machine where all backends are idle
* this new state forbids any new writes, in the same fashion as we do on
  standby nodes
* this state could either wait for the end of xacts, or cancel/kill
  RW backends, in the same fashion as the current smart/fast stops do
* from this state, we might then roll back pending prepared xacts, stop other
  sub-processes, etc. (as the current patch does), and demote safely to
  PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).

Is this something worth considering?
Maybe the code would end up so close to ASRO that it would just be a kind of
fusion of both patches? I don't know; I haven't looked at the ASRO patch yet.

> One thing I think people are going to want to do is have the master go
> read-only if it loses communication to the rest of the network, to
> avoid or at least mitigate split-brain. However, such network
> interruptions are often transient, so it might not be uncommon to
> briefly go read-only due to a network blip, but then recover quickly
> and return to a read-write state. It doesn't seem to matter much
> whether that read-only state is a new kind of normal operation (like
> what ASRO would do) or whether we've actually returned to a recovery
> state (as demote would do) but the collateral effects of the state
> change do matter.

Well, triggering such actions (demote or read-only) usually comes from an
external decision, hopefully relying on at least some quorum, and being able to
escalate to a watchdog or fencing is required.

Most tools around will need to demote or fence. It seems dangerous to flip
between read-only/read-write on a bunch of cluster nodes. It could quickly get
messy, especially since a former primary with non-replicated data could
automatically start replicating from a new primary without screaming...

Regards,



Re: [patch] demote

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 11:56 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
> Regarding improvements to the current demote patch, I was considering digging
> in the following direction:
>
> * add a new state in the state machine where all backends are idle
> * this new state forbids any new writes, in the same fashion as we do on
>   standby nodes
> * this state could either wait for the end of xacts, or cancel/kill
>   RW backends, in the same fashion as the current smart/fast stops do
> * from this state, we might then roll back pending prepared xacts, stop other
>   sub-processes, etc. (as the current patch does), and demote safely to
>   PM_RECOVERY or PM_HOT_STANDBY (depending on the setup).
>
> Is this something worth considering?
> Maybe the code would end up so close to ASRO that it would just be a kind of
> fusion of both patches? I don't know; I haven't looked at the ASRO patch yet.

I don't think that the postmaster state machine is the interesting
part of this problem. The tricky parts have to do with updating shared
memory state, and with updating per-backend private state. For
example, snapshots are taken in a different way during recovery than
they are in normal operation, hence SnapshotData's takenDuringRecovery
member. And I think that we allocate extra shared memory space for
storing the data that those snapshots use if, and only if, the server
starts up in recovery. So if the server goes backward from normal
running into recovery, we might not have the space that we need in
shared memory to store the extra data, and even if we had the space it
might not be populated correctly, and the code that takes snapshots
might not be written properly to handle multiple transitions between
recovery and normal running, or even a single backward transition.

In general, there's code scattered all throughout the system that
assumes the recovery -> normal running transition is one-way. If we go
back into recovery by killing off all backends and reinitializing
shared memory, then we don't have to worry about that stuff. If we do
anything less than that, we have to find all the code that relies on
never reentering recovery and fix it all. Now it's also true that we
have to do some other things, like restarting the startup process, and
stopping things like autovacuum, and the postmaster may need to be
involved in some of that. There's clearly some engineering work there,
but I think it's substantially less than the amount of engineering
work involved in fixing problems with shared memory contents and
backend-local state.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Fujii Masao
Date:

On 2020/06/19 0:22, Robert Haas wrote:
> On Thu, Jun 18, 2020 at 8:41 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
>> What about the following procedure?
>>
>> 1. Demote the primary to a standby. Then this demoted standby is read-only.
>> 2. The original standby automatically establishes the cascading replication
>>      connection with the demoted standby.
>> 3. Wait for all the WAL records available in the demoted standby to be streamed
>>      to the original standby.
>> 4. Promote the original standby to new primary.
>> 5. Change primary_conninfo in the demoted standby so that it establishes
>>      the replication connection with the new primary.
>>
>> So it seems enough to implement the "demote" feature for a clean switchover.
> 
> There's something to that idea. I think it somewhat depends on how
> invasive the various operations are. For example, I'm not really sure
> how feasible it is to demote without a full server restart that kicks
> out all sessions. If that is required, it's a significant disadvantage
> compared to ASRO.

Even with ASRO, isn't a server restart necessary, and aren't RO sessions
kicked out, when demoting the read-only primary to a standby, i.e., during a
clean switchover?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] demote

From
Andres Freund
Date:
Hi,

On 2020-06-18 12:16:27 +0200, Jehan-Guillaume de Rorthais wrote:
> On Wed, 17 Jun 2020 11:14:47 -0700
> > I don't think there's a fundamental issue, but I think it needs to deal
> > with a lot more things than it does right now. StartupXLOG doesn't
> > currently deal correctly with subsystems that are already
> > initialized. And your patch doesn't make it so as far as I can tell.
> 
> If you are talking about bgwriter, checkpointer, etc., then as far as I
> understand the current state machine, my patch actually deals with them.

I'm talking about subsystems like subtrans, multixact, etc. not being OK
with suddenly being initialized a second time. You cannot call
StartupXLOG twice without making modifications to it.

Greetings,

Andres Freund



Re: [patch] demote

From
Andres Freund
Date:
Hi,

On 2020-06-18 21:41:45 +0900, Fujii Masao wrote:
> ISTM that a clean switchover is possible without "ALTER SYSTEM READ ONLY".
> What about the following procedure?
> 
> 1. Demote the primary to a standby. Then this demoted standby is read-only.

As far as I can tell this step includes ALTER SYSTEM READ ONLY. Sure, you
can choose not to expose ASRO to the user separately from demote, but
that's a minuscule part of the complexity.

Greetings,

Andres Freund



Re: [patch] demote

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 12:55 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
> Even with ASRO, the server restart is necessary and RO sessions are
> kicked out when demoting RO primary to a standby, i.e., during a clean
> switchover?

The ASRO patch doesn't provide a way to put a running server
back into recovery, so yes, that is required, unless some other patch
fixes it so that it isn't. It would be better to find a way where
we never need to kill off R/O sessions at all, and I think that would
require all the machinery from the ASRO patch plus some more. If you
want to allow sessions to survive a state transition like this -
whether it's to a WAL-read-only state or all the way back to recovery
- you need a way to prevent further WAL writes in those sessions. Most
of the stuff that the ASRO patch does is concerned with that. So it
doesn't seem likely to me that we can just throw all that code away,
unless by chance somebody else has got a better version of the same
thing already. To go back to recovery rather than just to a read-only
state, I think you'd need to grapple with some additional issues that
patch doesn't touch, like some of the snapshot-taking stuff, but I
think you still need to solve all of the problems that it does deal
with, unless you're OK with killing every session.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... To go back to recovery rather than just to a read-only
> state, I think you'd need to grapple with some additional issues that
> patch doesn't touch, like some of the snapshot-taking stuff, but I
> think you still need to solve all of the problems that it does deal
> with, unless you're OK with killing every session.

It seems like this is the core decision that needs to be taken.  If
we're willing to have these state transitions include a server restart,
then many things get simpler.  If we're not, it's gonna cost us in
code complexity and hence bugs.  Maybe the usability gain is worth it,
or maybe not.

I think it would probably be worth the trouble to pursue both designs in
parallel for awhile, so we can get a better handle on exactly how much
complexity we're buying into with the more ambitious definition.

            regards, tom lane



Re: [patch] demote

From
Andres Freund
Date:
Hi,

On 2020-06-18 13:24:38 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > ... To go back to recovery rather than just to a read-only
> > state, I think you'd need to grapple with some additional issues that
> > patch doesn't touch, like some of the snapshot-taking stuff, but I
> > think you still need to solve all of the problems that it does deal
> > with, unless you're OK with killing every session.
> 
> It seems like this is the core decision that needs to be taken.  If
> we're willing to have these state transitions include a server restart,
> then many things get simpler.  If we're not, it's gonna cost us in
> code complexity and hence bugs.  Maybe the usability gain is worth it,
> or maybe not.
> 
> I think it would probably be worth the trouble to pursue both designs in
> parallel for awhile, so we can get a better handle on exactly how much
> complexity we're buying into with the more ambitious definition.

What I like about ALTER SYSTEM READ ONLY is that it would likely be a part
of both a restart-based and a non-restart-based implementation.

I don't really get why the demote in this thread is mentioned as an
alternative - it pretty obviously has to include a large portion of
ALTER SYSTEM READ ONLY.

The only part that could really be skipped by going straight to demote
is a way to make ASRO invocable directly. You can simplify a bit more by
killing all user sessions, but at that point there's not much upside to
having a no-restart version of demote in the first place.

The demote patch in this thread doesn't even start to attack much of the
real complexity around turning a primary into a standby.


To me the complexity of a restartless demotion is likely worth it. But
it doesn't seem feasible to get there in one large step. So adding
individually usable sub-steps like ASRO makes sense imo.


Greetings,

Andres Freund



Re: [patch] demote

From
Robert Haas
Date:
On Thu, Jun 18, 2020 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It seems like this is the core decision that needs to be taken.  If
> we're willing to have these state transitions include a server restart,
> then many things get simpler.  If we're not, it's gonna cost us in
> code complexity and hence bugs.  Maybe the usability gain is worth it,
> or maybe not.
>
> I think it would probably be worth the trouble to pursue both designs in
> parallel for awhile, so we can get a better handle on exactly how much
> complexity we're buying into with the more ambitious definition.

I wouldn't vote to reject a patch that performed a demotion by doing a
full server restart, because it's a useful incremental step, but I
wouldn't be excited about spending a lot of time on it, either,
because it's basically crippleware by design. Either you have to
checkpoint before restarting, or you have to run recovery after
restarting, and either of those can be extremely slow. You also break
all the connections, which can produce application errors unless the
applications are pretty robustly designed, and you lose the entire
contents of shared_buffers, which makes things run very slowly even
after the restart is completed, which can cause a lengthy slow period
even after the system is nominally back up. All of those things are
really bad, and AFAICT the first one is the worst by a considerable
margin. It can take 20 minutes to checkpoint and even longer to run
recovery, so doing this once per year already puts you outside of
five-nines territory, and we release critical updates that cannot be
applied without a server restart about four times per year. That means
that if you perform updates by using a switchover -- a common practice
-- and if you apply all of your updates in a timely fashion --
unfortunately, not such a common practice, but one we'd surely like to
encourage -- you can't even achieve four nines if a switchover
requires either a checkpoint or running recovery. And that's without
accounting for any switchovers that you may need to perform for
reasons unrelated to upgrades, and without accounting for any
unplanned downtime. Not many people in 2020 are interested in running
a system with three nines of availability, so I think it is clear that
we need to do better.
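
To put rough numbers on that: five nines allows about 5.3 minutes of downtime
per year (0.001% of 525,600 minutes) and four nines about 53 minutes, so a
single 20-minute checkpoint-or-recovery window already exceeds the five-nines
budget, and four such switchovers per year (~80 minutes) exceed even the
four-nines budget.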

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

Here is a summary of my work during the last few days on this demote approach.

Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
commit message and as FIXME in code.

The patch is not finished or bug-free yet, I'm still not very happy with the
coding style, and it probably lacks some more code documentation, but a lot has
changed since v1. It's still a PoC to push the discussion a bit further after
having been silent myself for some days.

The patch currently relies on a demote checkpoint. I understand a forced
checkpoint's overhead can be massive and cause major waits/downtime. But I keep
this for a later step. Maybe we should be able to cancel a running checkpoint?
Or leave it to its syncing work but discard the result without writing it to
XLog?

I haven't had time to investigate Robert's concern about shared memory for
snapshots during recovery.

The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
raises an assert if some prepared xacts exist. I suppose I will roll them back
during demote in the next patch version.

I'm not sure how to divide this patch into multiple small independent steps. I
suppose I can split it like:

1. add demote checkpoint
2. support demote: mostly postmaster, startup/xlog and checkpointer related
   code
3. cli using pg_ctl demote

...But I'm not sure it's worth it.

Regards,

Attachment

Re: [patch] demote

From
Kyotaro Horiguchi
Date:
Hello.

At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> Here is a summary of my work during the last few days on this demote approach.
> 
> Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
> commit message and as FIXME in code.
> 
> The patch is not finished or bug-free yet, I'm still not very happy with the
> coding style, and it probably lacks some more code documentation, but a lot has
> changed since v1. It's still a PoC to push the discussion a bit further after
> having been silent myself for some days.
> 
> The patch currently relies on a demote checkpoint. I understand a forced
> checkpoint's overhead can be massive and cause major waits/downtime. But I keep
> this for a later step. Maybe we should be able to cancel a running checkpoint?
> Or leave it to its syncing work but discard the result without writing it to
> XLog?

If we are going to dive so close to server shutdown, we can just
utilize the restart-after-crash path, which we can assume to work
reliably. The attached is a quite rough sketch of that, hijacking the smart
shutdown path for convenience, but it seems to work.  "pg_ctl
-m s -W stop" lets the server demote.

> I haven't had time to investigate Robert's concern about shared memory for
> snapshots during recovery.

The patch does all required cleanup of resources including shared
memory, I believe.  Is it enough if we don't need to keep any resources
alive?

> The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
> raises an assert if some prepared xacts exist. I suppose I will roll them back
> during demote in the next patch version.
> 
> I'm not sure how to divide this patch into multiple small independent steps. I
> suppose I can split it like:
> 
> 1. add demote checkpoint
> 2. support demote: mostly postmaster, startup/xlog and checkpointer related
>    code
> 3. cli using pg_ctl demote
> 
> ...But I'm not sure it's worth it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..a4adf3e587 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2752,6 +2752,7 @@ SIGHUP_handler(SIGNAL_ARGS)
 /*
  * pmdie -- signal handler for processing various postmaster signals.
  */
+static bool        demoting = false;
 static void
 pmdie(SIGNAL_ARGS)
 {
@@ -2774,59 +2775,17 @@ pmdie(SIGNAL_ARGS)
         case SIGTERM:
 
             /*
-             * Smart Shutdown:
+             * XXX: Hijacked as DEMOTE
              *
-             * Wait for children to end their work, then shut down.
+             * Runs fast shutdown, then restart as standby
              */
             if (Shutdown >= SmartShutdown)
                 break;
             Shutdown = SmartShutdown;
             ereport(LOG,
-                    (errmsg("received smart shutdown request")));
-
-            /* Report status */
-            AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
-#ifdef USE_SYSTEMD
-            sd_notify(0, "STOPPING=1");
-#endif
-
-            if (pmState == PM_RUN || pmState == PM_RECOVERY ||
-                pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
-            {
-                /* autovac workers are told to shut down immediately */
-                /* and bgworkers too; does this need tweaking? */
-                SignalSomeChildren(SIGTERM,
-                                   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-                /* and the autovac launcher too */
-                if (AutoVacPID != 0)
-                    signal_child(AutoVacPID, SIGTERM);
-                /* and the bgwriter too */
-                if (BgWriterPID != 0)
-                    signal_child(BgWriterPID, SIGTERM);
-                /* and the walwriter too */
-                if (WalWriterPID != 0)
-                    signal_child(WalWriterPID, SIGTERM);
-
-                /*
-                 * If we're in recovery, we can't kill the startup process
-                 * right away, because at present doing so does not release
-                 * its locks.  We might want to change this in a future
-                 * release.  For the time being, the PM_WAIT_READONLY state
-                 * indicates that we're waiting for the regular (read only)
-                 * backends to die off; once they do, we'll kill the startup
-                 * and walreceiver processes.
-                 */
-                pmState = (pmState == PM_RUN) ?
-                    PM_WAIT_BACKUP : PM_WAIT_READONLY;
-            }
-
-            /*
-             * Now wait for online backup mode to end and backends to exit. If
-             * that is already the case, PostmasterStateMachine will take the
-             * next step.
-             */
-            PostmasterStateMachine();
-            break;
+                    (errmsg("received demote request")));
+            demoting = true;
+            /* FALL THROUGH */
 
         case SIGINT:
 
@@ -2839,8 +2798,10 @@ pmdie(SIGNAL_ARGS)
             if (Shutdown >= FastShutdown)
                 break;
             Shutdown = FastShutdown;
-            ereport(LOG,
-                    (errmsg("received fast shutdown request")));
+
+            if (!demoting)
+                ereport(LOG,
+                        (errmsg("received fast shutdown request")));
 
             /* Report status */
             AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STOPPING);
@@ -2887,6 +2848,13 @@ pmdie(SIGNAL_ARGS)
                 pmState = PM_WAIT_BACKENDS;
             }
 
+            /* create standby signal file */
+            {
+                FILE *standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+
+                Assert (standby_file && !FreeFile(standby_file));
+            }
+
             /*
              * Now wait for backends to exit.  If there are none,
              * PostmasterStateMachine will take the next step.
@@ -3958,7 +3926,7 @@ PostmasterStateMachine(void)
      * EOF on its input pipe, which happens when there are no more upstream
      * processes.
      */
-    if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+    if (!demoting && Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
     {
         if (FatalError)
         {
@@ -3996,13 +3964,23 @@ PostmasterStateMachine(void)
         ExitPostmaster(1);
 
     /*
-     * If we need to recover from a crash, wait for all non-syslogger children
-     * to exit, then reset shmem and StartupDataBase.
+     * If we need to recover from a crash or demoting, wait for all
+     * non-syslogger children to exit, then reset shmem and StartupDataBase.
      */
-    if (FatalError && pmState == PM_NO_CHILDREN)
+    if ((demoting || FatalError) && pmState == PM_NO_CHILDREN)
     {
-        ereport(LOG,
-                (errmsg("all server processes terminated; reinitializing")));
+        if (demoting)
+            ereport(LOG,
+                    (errmsg("all server processes terminated; starting as standby")));
+        else
+            ereport(LOG,
+                    (errmsg("all server processes terminated; reinitializing")));
+
+        if (demoting)
+        {
+            Shutdown = NoShutdown;
+            demoting = false;
+        }
 
         /* allow background workers to immediately restart */
         ResetBackgroundWorkerCrashTimes();

Re: [patch] demote

From
Kyotaro Horiguchi
Date:
Mmm. Fat finger..

At Fri, 26 Jun 2020 16:14:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Hello.
> 
> If we are going to dive so close to server shutdown, we can just
> utilize the restart-after-crash path, which we can assume to work
> reliably. The attached is a quite rough sketch of that, hijacking the smart
> shutdown path for convenience, but it seems to work.  "pg_ctl
> -m s -W stop" lets the server demote.
> 
> > I haven't had time to investigate Robert's concern about shared memory for
> > snapshots during recovery.
> 
> The patch does all required cleanup of resources including shared

The path does all required cleanup of..

> memory, I believe.  Is it enough if we don't need to keep any resources
> alive?
> 
> > The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote"
> > raises an assert if some prepared xacts exist. I suppose I will roll them back
> > during demote in the next patch version.
> > 
> > I'm not sure how to divide this patch into multiple small independent steps. I
> > suppose I can split it like:
> > 
> > 1. add demote checkpoint
> > 2. support demote: mostly postmaster, startup/xlog and checkpointer related
> >    code
> > 3. cli using pg_ctl demote
> > 
> > ...But I'm not sure it's worth it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Fri, 26 Jun 2020 16:14:38 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Hello.
> 
> At Thu, 25 Jun 2020 19:27:54 +0200, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in 
> > Here is a summary of my work during the last few days on this demote
> > approach.
> > 
> > Please, find in attachment v2-0001-Demote-PoC.patch and the comments in the
> > commit message and as FIXME in code.
> > 
> > The patch is not finished or bug-free yet, I'm still not very happy with the
> > coding style, and it probably lacks some more code documentation, but a lot
> > has changed since v1. It's still a PoC to push the discussion a bit further
> > after having been silent myself for some days.
> > 
> > The patch currently relies on a demote checkpoint. I understand a forced
> > checkpoint's overhead can be massive and cause major waits/downtime. But I
> > keep this for a later step. Maybe we should be able to cancel a running
> > checkpoint? Or leave it to its syncing work but discard the result without
> > writing it to XLog?
> 
> If we are going to dive so close to server shutdown, we can just
> utilize the restart-after-crash path, which we can assume to work
> reliably. The attached is a quite rough sketch of that, hijacking the smart
> shutdown path for convenience, but it seems to work.  "pg_ctl
> -m s -W stop" lets the server demote.

This was actually my very first toy PoC.

However, resetting everything is far from the graceful demote I was seeking.
Moreover, such a patch would not be able to evolve to, e.g., keep read-only
backends around.

> > I haven't had time to investigate Robert's concern about shared memory for
> > snapshots during recovery.
> 
> The patch does all required cleanup of resources including shared
> memory, I believe.  Is it enough if we don't need to keep any resources
> alive?

Resetting everything might not be enough. If I understand Robert's concern
correctly, it might actually need more shmem for hot standby xact snapshot. Or
maybe some shmem init'ed differently.

Regards,



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

Here is a small activity summary since last report.

On Thu, 25 Jun 2020 19:27:54 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> I haven't had time to investigate Robert's concern about shared memory for
> snapshots during recovery.

I haven't had time to dig very far, but I suppose this might be related to the
comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
already allocated as long as hot_standby is enabled. I realize that doesn't mean
we are on the safe side of the fence, though. I still need to get a better
understanding of this.

> The patch doesn't deal with prepared xacts yet. Testing
> "start->demote->promote" raises an assert if some prepared xacts exist. I
> suppose I will roll them back during demote in the next patch version.

Rolling back all prepared transactions on demote seems easy. However, I realized
there's no point in cancelling them: after the demote action, they might still
be committed later on a promoted instance.
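
I.e., something along these lines should survive a demote/promote cycle (the
table and transaction names are just examples):

  -- on the primary, before the demote (requires max_prepared_transactions > 0):
  BEGIN;
  INSERT INTO t VALUES (1);
  PREPARE TRANSACTION 'demote_test';

  -- later, on whichever node has been promoted:
  COMMIT PREPARED 'demote_test';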

I am currently trying to clean the shared memory for existing prepared
transactions so they are handled by the startup process during recovery.
I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
process of cleaning the remaining prepared xact locks.

Regards,



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

Another summary + patch + tests.

This patch supports 2PC. The goal is to keep prepared transactions safe during
demote/promote actions so they can be committed/rolled back later on a primary.
See tests.

The checkpointer is now shut down after the demote shutdown checkpoint. This
removes some useless code complexity, e.g. having to signal the postmaster from
the checkpointer to keep going with the demotion.

Cascading replication is now supported. WAL senders stay active during
demotion but set their local "am_cascading_walsender = true". It has been a
rough debugging session on my side (thank you rr and tests!), but it might be
worth it. I believe they should stay connected during the demote actions for
future features, e.g. triggering a switchover over the replication protocol
using an admin function.

The first tests have been added in "recovery/t/021_promote-demote.pl". I'll add
some more tests in future versions.

I believe the patch is ready for some preliminary tests and advice or
directions.

On my todo:

* study how to only disconnect or cancel active RW backends
  * ...then add pg_demote() admin function
* cancel running checkpoint for fast demote ?
* user documentation
* Robert's concern about snapshot during hot standby
* some more coding style cleanup/refactoring
* anything else reported to me :)

Thanks,

On Fri, 3 Jul 2020 00:12:10 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi,
> 
> Here is a small activity summary since last report.
> 
> On Thu, 25 Jun 2020 19:27:54 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> [...]
> > I haven't had time to investigate Robert's concern about shared memory for
> > snapshots during recovery.
> 
> I haven't had time to dig very far, but I suppose this might be related to the
> comment in ProcArrayShmemSize(). If I'm right, then it seems the space is
> already allocated as long as hot_standby is enabled. I realize that doesn't
> mean we are on the safe side of the fence, though. I still need to get a
> better understanding of this.
> 
> > The patch doesn't deal with prepared xacts yet. Testing
> > "start->demote->promote" raises an assert if some prepared xacts exist. I
> > suppose I will roll them back during demote in the next patch version.
> 
> Rolling back all prepared transactions on demote seems easy. However, I realized
> there's no point in cancelling them: after the demote action, they might still
> be committed later on a promoted instance.
> 
> I am currently trying to clean the shared memory for existing prepared
> transactions so they are handled by the startup process during recovery.
> I've been able to clean TwoPhaseState and the ProcArray. I'm now in the
> process of cleaning the remaining prepared xact locks.
> 
> Regards,
> 
> 



-- 
Jehan-Guillaume de Rorthais
Dalibo

Attachment

Re: [patch] demote

From
Amul Sul
Date:
On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:
>
> Hi,
>
> Another summary + patch + tests.
>
> This patch supports 2PC. The goal is to keep prepared transactions safe during
> demote/promote actions so they can be committed/rolled back later on a primary.
> See tests.
>

I'm wondering, is it necessary to clear prepared transactions from shared memory?
Can't we simply skip clearing and restoring prepared transactions while demoting?

Regards,
Amul



Re: [patch] demote

From
Andres Freund
Date:
Hi,

On 2020-07-14 17:26:37 +0530, Amul Sul wrote:
> On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote:
> >
> > Hi,
> >
> > Another summary + patch + tests.
> >
> > This patch supports 2PC. The goal is to keep prepared transactions safe during
> > demote/promote actions so they can be committed/rolled back later on a
> > primary. See tests.
> >
> 
> I'm wondering, is it necessary to clear prepared transactions from shared memory?
> Can't we simply skip clearing and restoring prepared transactions while demoting?

Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
transaction state. So I don't think it'd be correct to leave them around
in their previous state. We'd likely end up with incorrect snapshots
if a demoted node later gets promoted...

Greetings,

Andres Freund



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Tue, 14 Jul 2020 12:49:51 -0700
Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2020-07-14 17:26:37 +0530, Amul Sul wrote:
> > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote:  
> > >
> > > Hi,
> > >
> > > Another summary + patch + tests.
> > >
> > > This patch supports 2PC. The goal is to keep prepared transactions safe
> > > during demote/promote actions so they can be committed/rolled back later
> > > on a primary. See tests.
> > 
> > I'm wondering, is it necessary to clear prepared transactions from shared
> > memory? Can't we simply skip clearing and restoring prepared transactions
> > while demoting?
> 
> Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
> transaction state. So I don't think it'd be correct to leave them around
> in their previous state. We'd likely end up with incorrect snapshots
> if a demoted node later gets promoted...

Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to
be cleared.



Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

Yet another summary + patch + tests.

Demote now keeps backends with no active xid alive. Smart mode keeps all
backends: it waits for them to finish their xacts and enter read-only. Fast
mode terminates backends with an active xid and keeps all the other ones.
Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
(check recovery state) once demoted.
During demote, no new sessions are allowed.
As backends with no active xid survive, a new SQL admin function
"pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Demote now relies on SIGUSR1 instead of hijacking SIGTERM/SIGINT and pmdie().
The resulting refactoring makes the code much simpler and cleaner, with better
isolation of actions from the code point of view.

Thanks to the refactoring, the patch now only adds one state to the state
machine: PM_DEMOTING. A second one could be used to replace:

    /* Demoting: start the Startup Process */
    if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)

with eg.:

    if (pmState == PM_DEMOTED)

I believe it might be a bit simpler to understand, but the existing comment
might be good enough as well. The full state machine path for demote is:

 PM_DEMOTING  /* wait for active xid backend to finish */
 PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
                 various shutdown tasks */
 PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
 PM_STARTUP

Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests, adding
tests on backend behavior during demote and on the new function pg_demote().

On my todo:

* cancel running checkpoint for fast demote ?
* forbid demote when PITR backup is in progress
* user documentation
* Robert's concern about snapshot during hot standby
* anything else reported to me

Plus, I might be able to split the backend part and their signals out of patch
0002 into its own patch if it helps the review. It would apply after 0001 and
before the actual 0002.

As there was no consensus and the discussions seemed to conclude this patch set
should keep growing to see where it goes, I wonder if/when I should add it to
the commitfest. Advice? Opinions?

Regards,

Attachment

Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
Hi,

Please find in attachment v5 of the patch set rebased on master after various
conflicts.

Regards,

On Wed, 5 Aug 2020 00:04:53 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Demote now keeps backends with no active xid alive. Smart mode keeps all
> backends: it waits for them to finish their xacts and enter read-only. Fast
> mode terminates backends with an active xid and keeps all the other ones.
> Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> (check recovery state) once demoted.
> During demote, no new sessions are allowed.
> 
> As backends with no active xid survive, a new SQL admin function
> "pg_demote(fast bool, wait bool, wait_seconds int)" has been added.
> 
> Demote now relies on SIGUSR1 instead of hijacking SIGTERM/SIGINT and pmdie().
> The resulting refactoring makes the code much simpler and cleaner, with better
> isolation of actions from the code point of view.
> 
> Thanks to the refactoring, the patch now only adds one state to the state
> machine: PM_DEMOTING. A second one could be used to replace:
> 
>     /* Demoting: start the Startup Process */
>     if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0)
> 
> with eg.:
> 
>     if (pmState == PM_DEMOTED)
> 
> I believe it might be a bit simpler to understand, but the existing comment
> might be good enough as well. The full state machine path for demote is:
> 
>  PM_DEMOTING  /* wait for active xid backend to finish */
>  PM_SHUTDOWN  /* wait for checkpoint shutdown and its 
>                  various shutdown tasks */
>  PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */
>  PM_STARTUP
> 
> Tests in "recovery/t/021_promote-demote.pl" grow from 13 to 24 tests, adding
> tests on backend behavior during demote and on the new function pg_demote().
> 
> On my todo:
> 
> * cancel running checkpoint for fast demote ?
> * forbid demote when PITR backup is in progress
> * user documentation
> * Robert's concern about snapshot during hot standby
> * anything else reported to me
> 
> Plus, I might be able to split the backend part and their signals out of patch
> 0002 into its own patch if it helps the review. It would apply after 0001 and
> before the actual 0002.
> 
> As there was no consensus and the discussions seemed to conclude this patch
> set should keep growing to see where it goes, I wonder if/when I should add it
> to the commitfest. Advice? Opinions?

Attachment

Re: [patch] demote

From
Jehan-Guillaume de Rorthais
Date:
On Tue, 18 Aug 2020 17:41:31 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi,
> 
> Please find in attachment v5 of the patch set rebased on master after various
> conflicts.
> 
> Regards,
> 
> On Wed, 5 Aug 2020 00:04:53 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> 
> > Demote now keeps backends with no active xid alive. Smart mode keeps all
> > backends: it waits for them to finish their xacts and enter read-only. Fast
> > mode terminates backends with an active xid and keeps all the other ones.
> > Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> > (check recovery state) once demoted.
> > During demote, no new sessions are allowed.
> > 
> > As backends with no active xid survive, a new SQL admin function
> > "pg_demote(fast bool, wait bool, wait_seconds int)" has been added.

Just to keep the list informed, I found a race condition leading to backends
trying to write to XLog after they processed the demote signal. E.g.:

  [postmaster]     LOG:  all backends in read only
  [checkpointer]   LOG:  demoting
  [backend]      PANIC:  cannot make new WAL entries during recovery
             STATEMENT:  UPDATE pgbench_accounts [...]

Because of this, the postmaster enters crash recovery while the demote
procedure is in progress.

I have a couple of other subjects to handle right now, but I plan to get back to it soon.

Regards,