Thread: Restoring normal master-slave roles after replication failure

Restoring normal master-slave roles after replication failure

From

Alex Balashov

Date:

01 April 2015, 17:22:45

Hello all,

I'm terribly sorry if this question has been asked before in a number of
different forms; I wasn't able to find quite the nuance I was looking
for in the archives.

We're running PostgreSQL 9.4 with streaming replication + hot_standby +
  the newfangled physical replication slots. On the master, the
wal_archive level is 'hot_standby', and on the slave, it's 'archive'[1].

As I have understood the conventional wisdom from scouring the
documentation and list archives, if the master fails and we
trigger-promote the slave to master and then, after fixing the master
server, we need to restore it to the master role and restore the slave
server back down to the hot_standby role, we need to reinitialise
replication. That is, we need to do a brand new base backup of the slave
back to the master.

My question is:

Isn't there a better way that lends itself to some degree of automation?
We have a ~500 GB database and while we can do a base backup, it's
neither pretty nor quick. Can we not configure $OLD_SLAVE as a master
(with WAL senders, replication role and all), put $OLD_MASTER in
recovery mode and have it play out the accumulated WALs in reverse from
$NEW_MASTER, then shut down $NEW_MASTER, reconfigure both replicas to
their appropriate roles, and bring them back up like before?

If this is possible, does it depend on setting wal_archive to
'hot_standby' level on the slave during the course of normal operation,
too? Shouldn't 'archive' be sufficient to restore from? What exactly
does the 'hot_standby' WAL level contain that 'archive' doesn't?

If I'm wrong, is there no other way to do this except to re-do a whole
base backup back to the master? How do people with very large databases
deal with this? Just grin and bear it?

Many thanks in advance, and I apologise again if this has previously
been answered in some canonical way.

-- Alex

[1] This is as it should be, right? We want to have backup WALs in case
we experience a total physical meltdown of the master and need to do
PITR, and/or to archive them to WAL-E from the slave if the slave were
promoted to master.

--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States

Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

Re: Restoring normal master-slave roles after replication failure

From

"Gilberto Castillo"

Date:

01 April 2015, 17:30:25



Hello Alex,

Have you seen pgpool or rpmgr-3?


>
> Hello all,
>
> I'm terribly sorry if this question has been asked before in a number of
> different forms; I wasn't able to find quite the nuance I was looking
> for in the archives.
>
> We're running PostgreSQL 9.4 with streaming replication + hot_standby +
>   the newfangled physical replication slots. On the master, the
> wal_archive level is 'hot_standby', and on the slave, it's 'archive'[1].
>
> As I have understood the conventional wisdom from scouring the
> documentation and list archives, if the master fails and we
> trigger-promote the slave to master and then, after fixing the master
> server, we need to restore it to the master role and restore the slave
> server back down to the hot_standby role, we need to reinitialise
> replication. That is, we need to do a brand new base backup of the slave
> back to the master.
>
> My question is:
>
> Isn't there a better way that lends itself to some degree of automation?
> We have a ~500 GB database and while we can do a base backup, it's
> neither pretty nor quick. Can we not configure $OLD_SLAVE as a master
> (with WAL senders, replication role and all), put $OLD_MASTER in
> recovery mode and have it play out the accumulated WALs in reverse from
> $NEW_MASTER, then shut down $NEW_MASTER, reconfigure both replicas to
> their appropriate roles, and bring them back up like before?
>
> If this is possible, does it depend on setting wal_archive to
> 'hot_standby' level on the slave during the course of normal operation,
> too? Shouldn't 'archive' be sufficient to restore from? What exactly
> does the 'hot_standby' WAL level contain that 'archive' doesn't?
>
> If I'm wrong, is there no other way to do this except to re-do a whole
> base backup back to the master? How do people with very large databases
> deal with this? Just grin and bear it?
>
> Many thanks in advance, and I apologise again if this has previously
> been answered in some canonical way.
>
> -- Alex
>
> [1] This is as it should be, right? We want to have backup WALs in case
> we experience a total physical meltdown of the master and need to do
> PITR, and/or to archive them to WAL-E from the slave if the slave were
> promoted to master.
>
> --
> Alex Balashov | Principal | Evariste Systems LLC
> 303 Perimeter Center North, Suite 300
> Atlanta, GA 30346
> United States
>
> Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
> Web: http://www.evaristesys.com/, http://www.csrpswitch.com/
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin---
> This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE
> running at host imx2.etecsa.cu
> Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>
>


Saludos,
Gilberto Castillo
ETECSA, La Habana, Cuba
---
This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE running at host imx3.etecsa.cu
Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>

Re: Restoring normal master-slave roles after replication failure

From

Alex Balashov

Date:

01 April 2015, 17:31:59

Hi Gilberto,

On 04/01/2015 02:29 PM, Gilberto Castillo wrote:

> Have you seen pgpool or rpmgr-3?

I have, but prefer not to use these tools if possible as it would
complicate the setup.

I'm not so much looking for front-end simplicity as I am looking for a
way to avoid having to do a base backup back to the master. I am fine
with having to reconfigure master and slave roles manually.

--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States

Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

Re: Restoring normal master-slave roles after replication failure

From

"Gilberto Castillo"

Date:

01 April 2015, 17:41:23


>
> Hi Gilberto,
>
> On 04/01/2015 02:29 PM, Gilberto Castillo wrote:
>
>> Have you seen pgpool or rpmgr-3?
>
> I have, but prefer not to use these tools if possible as it would
> complicate the setup.
>
> I'm not so much looking for front-end simplicity as I am looking for a
> way to avoid having to do a base backup back to the master. I am fine
> with having to reconfigure master and slave roles manually.

I commented, and I used pgpool ensures failover  a transparent manner to
end users.
Now use rpmgr-3, which allows me the same; but without the complicated
motaje of pgpool.

But if you want to do manuel, you define archive_command = "hot_stanby" on
both nodes


Saludos,
Gilberto Castillo
ETECSA, La Habana, Cuba
---
This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE running at host imx3.etecsa.cu
Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>

Re: Restoring normal master-slave roles after replication failure

From

Alex Balashov

Date:

01 April 2015, 17:43:09

On 04/01/2015 02:40 PM, Gilberto Castillo wrote:

> But if you want to do manuel, you define archive_command = "hot_stanby" on
> both nodes

Are you implying that my hypothesis is correct, i.e. that we can play
the WALs on the slave out in reverse onto the slave?

--
Alex Balashov | Principal | Evariste Systems LLC
303 Perimeter Center North, Suite 300
Atlanta, GA 30346
United States

Tel: +1-800-250-5920 (toll-free) / +1-678-954-0671 (direct)
Web: http://www.evaristesys.com/, http://www.csrpswitch.com/

Re: Restoring normal master-slave roles after replication failure

From

"Gilberto Castillo"

Date:

01 April 2015, 17:49:55


>
> On 04/01/2015 02:40 PM, Gilberto Castillo wrote:
>
>> But if you want to do manuel, you define archive_command = "hot_stanby"
>> on
>> both nodes
>
> Are you implying that my hypothesis is correct, i.e. that we can play
> the WALs on the slave out in reverse onto the slave?


I say who pg_recovery defines whether the node is a slave or not


Saludos,
Gilberto Castillo
ETECSA, La Habana, Cuba
---
This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE running at host imx3.etecsa.cu
Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>