Re: Failback to old master - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Failback to old master
Date
Msg-id CA+TgmoYDRgOBKY5L4rnpJTNfdE8YJf3dLf76o9t7h5qv=U=TJw@mail.gmail.com
In response to Failback to old master  ("Maeldron T." <maeldron@gmail.com>)
Responses Re: Failback to old master
List pgsql-hackers
On Wed, Oct 29, 2014 at 6:21 AM, Maeldron T. <maeldron@gmail.com> wrote:
> I swear I have read a couple of old threads. Yet I am not sure if it is safe to
> failback to the old master in case of async replication without base backup.
>
> Considering:
> I have the latest 9.3 server
> A: master
> B: slave
> B is actively connected to A
>
> I shut down A manually with -m fast (it's the default FreeBSD init script
> setting)
> I remove the recovery.conf from B
> I restart B
> I create a recovery.conf on A
> I start A
> I see nothing wrong in the logs
> I go for a lunch
> I shut down B
> I remove the recovery.conf on A
> I restart A
> I restore the recovery.conf on B
> I start B
> I see nothing wrong in the logs and I see that replication is working
>
> Can I say that my data is safe in this case?
>
> If the answer is yes, is it safe to do this if there was a power outage on A
> instead of manual shutdown? Considering that the log says nothing wrong. (Of
> course if it complains I'd do base backup from B).

The threshold question here is whether the original master might have
written (and thus, perhaps, applied) write-ahead log records that were
not replayed on the slave.  If A crashed, that is definitely possible,
so this is definitely not safe.  If A was shut down cleanly, then
streaming replication *should* take everything up through the shutdown
checkpoint and replicate those to the standby, which *should* replay
them.  If all goes according to plan, I think this will work.

I'm not sure we really have enough safeties to make this robust,
though: for example, at the point when the shutdown checkpoint is
written, I believe that the master is no longer accepting new
connections - so if the connection to the slave is broken before the
shutdown checkpoint record is replicated, then it's not safe any more,
but how will we detect that?  And, if you remove recovery.conf on the
slave, it will abort replay and enter normal running as soon as it
reaches what it thinks is end-of-WAL, with no cross-check to make sure
that's really the same point that the master was actually at.  So
it strikes me that it might be quite difficult to really have
confidence that nothing will go wrong.
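
As a rough illustration of the missing cross-check, here is a minimal
sketch of what one could do by hand before promoting B.  It assumes you
have noted "Latest checkpoint location" from pg_controldata on the
cleanly stopped A, and the result of pg_last_xlog_replay_location() on
B (the 9.3 names).  The helper names parse_lsn and
standby_has_shutdown_checkpoint are made up for this example, and the
comparison is only a necessary condition, not proof that failback is
safe.

```python
def parse_lsn(text):
    """Convert an LSN such as '16/B374D848' to an integer byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of the WAL position.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def standby_has_shutdown_checkpoint(master_checkpoint_lsn, standby_replay_lsn):
    """Return True if the standby replayed past the master's last checkpoint.

    master_checkpoint_lsn: start of the shutdown checkpoint record on the
        old master (hypothetical input, read from pg_controldata).
    standby_replay_lsn: last replayed position on the standby
        (hypothetical input, read before promotion).

    The checkpoint record has nonzero length, so a standby that replayed
    it must report a position strictly greater than the record's start.
    """
    return parse_lsn(standby_replay_lsn) > parse_lsn(master_checkpoint_lsn)


if __name__ == "__main__":
    # Example values only; substitute the positions from your own servers.
    print(standby_has_shutdown_checkpoint("16/B374D848", "16/B374D8B0"))
```

If this returns False, the standby never saw the shutdown checkpoint,
and a fresh base backup from B looks like the only safe choice.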

I'm definitely not the expert in this area on this mailing list, so
I'm curious what others think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


