Thread: can we avoid pg_basebackup on planned switches?

can we avoid pg_basebackup on planned switches?

From
Ben Chobot
Date:
We make heavy use of streaming replication on PG 9.1 and it's been great for us. We do have one issue with it, though, and that's when we switch master nodes - currently, the documentation says that you must run pg_basebackup on your old master to turn it into a slave. That makes sense when the old master had crashed, but it seems that in the case of a planned switch, we could do better. Here's what we tried that seemed to work... are we shooting ourselves in the foot?

1. Cleanly shut down the current master.
2. Pick a slave, turn it into the new master.
3. Copy the new pg_xlog history file over to the old master.
4. On any other slaves (many of our clusters are 3 nodes), we already have "recovery_target_timeline=latest" and wal archiving, so they should already be working as slaves of the new master.
5. Set up recovery.conf on the old master to be like the other slaves.
6. Start up the old master.

Have we just avoided running pg_basebackup, or have we just given ourselves data corruption? Because we're using wal archiving, can we simplify and leave out step 3?
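For reference, step 5's recovery.conf on the demoted master would be a sketch along these lines (the host name, user, and archive path are hypothetical stand-ins for whatever the other slaves already use):

```ini
standby_mode = 'on'
primary_conninfo = 'host=new-master port=5432 user=replicator'   # hypothetical connection details
restore_command = 'cp /path/to/archive/%f %p'                    # hypothetical archive location
recovery_target_timeline = 'latest'
```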

Re: can we avoid pg_basebackup on planned switches?

From
Ben Chobot
Date:
Anybody?

On Jul 27, 2012, at 10:00 AM, Ben Chobot wrote:

> We make heavy use of streaming replication on PG 9.1 and it's been great for us. We do have one issue with it, though, and that's when we switch master nodes - currently, the documentation says that you must run pg_basebackup on your old master to turn it into a slave. That makes sense when the old master had crashed, but it seems that in the case of a planned switch, we could do better. Here's what we tried that seemed to work... are we shooting ourselves in the foot?
>
> 1. Cleanly shut down the current master.
> 2. Pick a slave, turn it into the new master.
> 3. Copy the new pg_xlog history file over to the old master.
> 4. On any other slaves (many of our clusters are 3 nodes), we already have "recovery_target_timeline=latest" and wal archiving, so they should already be working as slaves of the new master.
> 5. Set up recovery.conf on the old master to be like the other slaves.
> 6. Start up the old master.
>
> Have we just avoided running pg_basebackup, or have we just given ourselves data corruption? Because we're using wal archiving, can we simplify and leave out step 3?

Re: can we avoid pg_basebackup on planned switches?

From
Fujii Masao
Date:
On Sat, Jul 28, 2012 at 2:00 AM, Ben Chobot <bench@silentmedia.com> wrote:
> We make heavy use of streaming replication on PG 9.1 and it's been great for
> us. We do have one issue with it, though, and that's when we switch master
> nodes - currently, the documentation says that you must run pg_basebackup on
> your old master to turn it into a slave. That makes sense when the old
> master had crashed, but it seems that in the case of a planned switch, we
> could do better. Here's what we tried that seemed to work... are we shooting
> ourselves in the foot?
>
> 1. Cleanly shut down the current master.
> 2. Pick a slave, turn it into the new master.

Before promoting the standby, you have to confirm that all WAL files the old master generated have been shipped to the standby you'll promote, because the standby might terminate replication before receiving all of them. Note that there is no clean way to confirm this. For example, you need to execute CHECKPOINT on the standby, run pg_controldata on both the old master and the standby, and check whether their latest checkpoint locations are the same. You might think to compare the latest checkpoint location on the old master with pg_last_xlog_replay_location() on the standby, but the former indicates the *starting* location of the last WAL record (i.e., the shutdown checkpoint record), while the latter indicates its *ending* location. So you should not compare them without taking that mismatch into account.

If the standby failed to receive some WAL files, you need to manually copy them from the old master's pg_xlog to the standby.
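The comparison described above can be sketched roughly as follows (a minimal illustration, not code from the thread; the pg_controldata excerpts are hypothetical samples, and in practice you would capture the real output of `pg_controldata $PGDATA` on each node):

```python
# Sketch: compare the "Latest checkpoint location" reported by pg_controldata
# on the old master and on the standby you plan to promote. The sample outputs
# below are hypothetical; real output comes from running pg_controldata.

def parse_lsn(lsn: str) -> int:
    """Convert an LSN like '0/3000020' into a comparable integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def latest_checkpoint_location(controldata_output: str) -> int:
    """Extract the 'Latest checkpoint location' value from pg_controldata output."""
    for line in controldata_output.splitlines():
        if line.startswith("Latest checkpoint location:"):
            return parse_lsn(line.split(":", 1)[1].strip())
    raise ValueError("no 'Latest checkpoint location' line found")

# Hypothetical excerpts (run CHECKPOINT on the standby first, so its control
# file reflects the latest replayed checkpoint):
master_out = "Latest checkpoint location:           0/3000020\n"
standby_out = "Latest checkpoint location:           0/3000020\n"

if latest_checkpoint_location(master_out) == latest_checkpoint_location(standby_out):
    print("checkpoint locations match; safe to promote")
else:
    print("standby is missing WAL; copy remaining segments from the old master")
```

Converting the two halves of the LSN to an integer avoids the pitfalls of comparing the strings lexically when the high word differs in length.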

> 3. Copy the new pg_xlog history file over to the old master.
> 4. On any other slaves (many of our clusters are 3 nodes), we already have
> "recovery_target_timeline=latest" and wal archiving, so they should already
> be working as slaves of the new master.
> 5. Set up recovery.conf on the old master to be like the other slaves.
> 6. Start up the old master.
>
> Have we just avoided running pg_basebackup, or have we just given ourselves
> data corruption?


If you change your operations as described above, I think you can avoid pg_basebackup on a planned switch. I've not tested these operations myself, so please test them carefully before applying them to your system.

> Because we're using wal archiving, can we simplify and
> leave out step 3?

Yes.

Regards,

--
Fujii Masao

Re: can we avoid pg_basebackup on planned switches?

From
Ben Chobot
Date:
On Aug 5, 2012, at 11:12 AM, Fujii Masao wrote:

> On Sat, Jul 28, 2012 at 2:00 AM, Ben Chobot <bench@silentmedia.com> wrote:
>> We make heavy use of streaming replication on PG 9.1 and it's been great for
>> us. We do have one issue with it, though, and that's when we switch master
>> nodes - currently, the documentation says that you must run pg_basebackup on
>> your old master to turn it into a slave. That makes sense when the old
>> master had crashed, but it seems that in the case of a planned switch, we
>> could do better. Here's what we tried that seemed to work... are we shooting
>> ourselves in the foot?
>>
>> 1. Cleanly shut down the current master.
>> 2. Pick a slave, turn it into the new master.
>
> Before promoting the standby, you have to confirm that all WAL files the
> old master generated have been shipped to the standby you'll promote,
> because the standby might terminate replication before receiving all of
> them. Note that there is no clean way to confirm this. For example, you
> need to execute CHECKPOINT on the standby, run pg_controldata on both the
> old master and the standby, and check whether their latest checkpoint
> locations are the same. You might think to compare the latest checkpoint
> location on the old master with pg_last_xlog_replay_location() on the
> standby, but the former indicates the *starting* location of the last WAL
> record (i.e., the shutdown checkpoint record), while the latter indicates
> its *ending* location. So you should not compare them without taking that
> mismatch into account.
>
> If the standby failed to receive some WAL files, you need to manually copy
> them from the old master's pg_xlog to the standby.

Oh, I would have thought that doing a clean shutdown of the old master (step 1) would have made sure that all the unstreamed WAL records would be flushed to any connected slaves as part of the master shutting down. In retrospect, I don't remember reading that anywhere, so I must have made that up because I wanted it to be that way. Is it wishful thinking?

Re: can we avoid pg_basebackup on planned switches?

From
Fujii Masao
Date:
On Mon, Aug 6, 2012 at 3:29 AM, Ben Chobot <bench@silentmedia.com> wrote:
>
> On Aug 5, 2012, at 11:12 AM, Fujii Masao wrote:
>
>> On Sat, Jul 28, 2012 at 2:00 AM, Ben Chobot <bench@silentmedia.com> wrote:
>>> We make heavy use of streaming replication on PG 9.1 and it's been great for
>>> us. We do have one issue with it, though, and that's when we switch master
>>> nodes - currently, the documentation says that you must run pg_basebackup on
>>> your old master to turn it into a slave. That makes sense when the old
>>> master had crashed, but it seems that in the case of a planned switch, we
>>> could do better. Here's what we tried that seemed to work... are we shooting
>>> ourselves in the foot?
>>>
>>> 1. Cleanly shut down the current master.
>>> 2. Pick a slave, turn it into the new master.
>>
>> Before promoting the standby, you have to confirm that all WAL files the
>> old master generated have been shipped to the standby you'll promote,
>> because the standby might terminate replication before receiving all of
>> them. Note that there is no clean way to confirm this. For example, you
>> need to execute CHECKPOINT on the standby, run pg_controldata on both the
>> old master and the standby, and check whether their latest checkpoint
>> locations are the same. You might think to compare the latest checkpoint
>> location on the old master with pg_last_xlog_replay_location() on the
>> standby, but the former indicates the *starting* location of the last WAL
>> record (i.e., the shutdown checkpoint record), while the latter indicates
>> its *ending* location. So you should not compare them without taking that
>> mismatch into account.
>>
>> If the standby failed to receive some WAL files, you need to manually copy
>> them from the old master's pg_xlog to the standby.
>
> Oh, I would have thought that doing a clean shutdown of the old master
> (step 1) would have made sure that all the unstreamed WAL records would be
> flushed to any connected slaves as part of the master shutting down. In
> retrospect, I don't remember reading that anywhere, so I must have made
> that up because I wanted it to be that way. Is it wishful thinking?

When a clean shutdown is requested, the master sends all WAL records to the standby, but it doesn't wait for the standby to receive them. So there is no guarantee that all WAL records have been flushed to the standby. The walreceiver process in the standby might detect the termination of the replication connection and exit before receiving all WAL records. Unfortunately I've encountered that case a few times.

Regards,

--
Fujii Masao

Re: can we avoid pg_basebackup on planned switches?

From
Ben Chobot
Date:
On Aug 7, 2012, at 9:32 AM, Fujii Masao wrote:

> On Mon, Aug 6, 2012 at 3:29 AM, Ben Chobot <bench@silentmedia.com> wrote:
>>
>> Oh, I would have though that doing a clean shutdown of the old master (step 1) would have made sure that all the
unstreamedwal records would be flushed to any connected slaves as part of the master shutting down. In retrospect, I
don'tremember reading that anywhere, so I must have made that up because I wanted it to be that way. Is it wishful
thinking?
>
> When a clean shutdown is requested, the master sends all WAL records to
> the standby, but it doesn't wait for the standby to receive them. So there
> is no guarantee that all WAL records have been flushed to the standby. The
> walreceiver process in the standby might detect the termination of the
> replication connection and exit before receiving all WAL records.
> Unfortunately I've encountered that case a few times.


Oh, I see. Well, that's unfortunate. Thanks for the help though! It shouldn't be too hard to script up what you suggest.
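When scripting that check, it also helps to know which WAL segment file a given checkpoint location lives in, so you know what to copy from the old master's pg_xlog if the locations differ. For the default 16MB segments on 9.1, the file name can be derived from the timeline and LSN like this (a sketch, not code from the thread):

```python
# Sketch: derive the pg_xlog segment file name that contains a given LSN,
# assuming the default 16MB (0x1000000) WAL segment size used by PG 9.1.

WAL_SEG_SIZE = 0x1000000  # 16MB

def wal_segment_name(timeline: int, lsn: str) -> str:
    """Return the pg_xlog file name containing an LSN such as '0/3000020'."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    # File name is timeline, log id, and segment number, each 8 hex digits.
    return "%08X%08X%08X" % (timeline, hi, lo // WAL_SEG_SIZE)

# For example, on timeline 1, LSN 0/3000020 falls in segment:
print(wal_segment_name(1, "0/3000020"))  # 000000010000000000000003
```

This mirrors what the server-side pg_xlogfile_name() function computes, but works offline against a stopped old master.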

Re: can we avoid pg_basebackup on planned switches?

From
Sergey Konoplev
Date:
On Sun, Aug 5, 2012 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Have we just avoided running pg_basebackup, or have we just given ourselves
>> data corruption?
>
> If you change your operations in the above-mentioned way, I think you can
> avoid pg_basebackup on the planned switch. I've not tested your operations.
> So please test them carefully before applying them to your system.

That is really encouraging.

So are there any thoughts on how to make sure that, after performing the changed process, there is no data corruption on the new replica (ex-master)?

P.S. BTW, don't we need to CHECKPOINT the old replica after copying all the WAL files from the stopped master and before promoting it (the old replica) to a new master?

>
>> Because we're using wal archiving, can we simplify and
>> leave out step 3?
>
> Yes.
>
> Regards,
>
> --
> Fujii Masao

--
Sergey Konoplev

a database architect, software developer at PostgreSQL-Consulting.com
http://www.postgresql-consulting.com

Jabber: gray.ru@gmail.com Skype: gray-hemp Phone: +79160686204