Re: Failback with log shipping - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Failback with log shipping
Msg-id 4C002C49.4070501@enterprisedb.com
In response to Re: Failback with log shipping  (Dimitri Fontaine <dfontaine@hi-media.com>)
List pgsql-hackers

On 28/05/10 22:20, Dimitri Fontaine wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Not shipped before the first failover you mean? No, if any WAL records were
>> created in the old master that were not shipped to the standby before
>> failover, the corresponding changes to the data files might've been flushed
>> to disk already, and you can't undo those by not replaying the WAL record on
>> restart.
>
> Ah yes you need to fail between when (WAL is written and not sent) and
> CHECKPOINT for this to be possible.

A checkpoint only guarantees that everything before it has been flushed 
to disk. It doesn't guarantee that nothing is flushed to disk before the 
checkpoint happens. If there's a checkpoint that hasn't been shipped to 
the standby, you're certainly hosed, but if there is no checkpoint you 
don't know whether the data files have changed or not.

> But automatic testing of the
> situation (is the data already safe in PGDATA) might still be possible?

Hmm, so the situation is this:
            D - E - crash!
          /
A - B - C
          \
            d - f - g - h

The letters represent WAL records. C is the last WAL record that was 
shipped to the standby, D & E are WAL records that were generated in the 
old master before the crash but never sent to the standby, and d-h are 
WAL records created in the standby after failover.

I guess you could read the WAL in the old master and compare it with the 
WAL from the standby to figure out where the failover happened (C), and 
then scan all the data pages involved in records D - E, checking that 
the LSNs on the data pages touched by those records are earlier than C. 
That's a bit laborious, and requires knowledge of all different kinds of 
WAL records to figure out which data pages they touch, but seems 
possible in theory.
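
The check described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not code that reads real PostgreSQL WAL: `WalRecord`, `find_fork_point` and `pages_flushed_after_fork` are hypothetical names, LSNs are plain integers, each record is assumed to touch exactly one page, and the on-disk page LSNs are handed in as a dictionary instead of being read from the data files.

```python
from typing import Dict, List, NamedTuple, Optional


class WalRecord(NamedTuple):
    lsn: int      # position of the record in the WAL stream (toy: a plain integer)
    page: str     # the single data page this record touches (toy assumption)
    payload: str  # stand-in for the record contents


def find_fork_point(old: List[WalRecord], new: List[WalRecord]) -> Optional[int]:
    """Return the LSN of the last record both histories share (C), or None."""
    fork = None
    for a, b in zip(old, new):
        if a != b:
            break
        fork = a.lsn
    return fork


def pages_flushed_after_fork(
    old: List[WalRecord], fork_lsn: int, page_lsns: Dict[str, int]
) -> List[str]:
    """Pages touched by unshipped records whose on-disk LSN is past the fork."""
    touched = {r.page for r in old if r.lsn > fork_lsn}
    return sorted(p for p in touched if page_lsns.get(p, 0) > fork_lsn)


# Old master: A, B, C were shipped; D and E were not.
old_master = [
    WalRecord(10, "p1", "A"), WalRecord(20, "p2", "B"), WalRecord(30, "p1", "C"),
    WalRecord(40, "p3", "D"), WalRecord(50, "p2", "E"),
]
# Standby: same history up to C, then its own records after failover.
standby = [
    WalRecord(10, "p1", "A"), WalRecord(20, "p2", "B"), WalRecord(30, "p1", "C"),
    WalRecord(40, "p4", "d"), WalRecord(50, "p1", "f"),
]

fork = find_fork_point(old_master, standby)
# Page LSNs as found on the old master's disk (made-up values).
unsafe = pages_flushed_after_fork(old_master, fork, {"p1": 30, "p2": 20, "p3": 40})
print(fork, unsafe)
```

An empty list from `pages_flushed_after_fork` would mean none of the changes made by D and E reached disk in this model; any page in the list means the old master's data files have diverged. The hard part in practice, knowing which pages each kind of WAL record touches, is exactly what the toy `page` field glosses over.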

>>> How easy is it to script that? It seems all we need is the current XID
>>> of the slave at the end of recovery. It should be in the log, maybe it's
>>> easy enough to expose it at the SQL level…
>>
>> XID doesn't help at all, LSN more likely, but I feel that I don't fully
>> understand what you're saying.
>
> Sorry I was unclear, I was thinking in terms of recovery.conf file and
> either recovery_target_xid or recovery_target_time. The idea being that
> if the old-master didn't CHECKPOINT the changes that the slave missed,
> then we can do crash recovery and choose to stop before that point, then
> apply WALs from the new master.

Ah, I see. No, you don't want to use a recovery target, that would end 
the recovery and start the server. You just need to make sure to use 
WALs from the new master instead of the old one when both exist.
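
As a rough sketch of that preference rule, the helper below looks for a WAL segment in a list of directories in priority order and returns the first copy it finds. Everything here is an assumption made for illustration: `pick_wal_segment` is a hypothetical helper, the directories are whatever locations hold the new master's and the old master's WAL, and the model assumes the same file name can exist in both places. It ignores timelines entirely.

```python
from pathlib import Path
from typing import Optional, Sequence


def pick_wal_segment(name: str, sources: Sequence[Path]) -> Optional[Path]:
    """Return the first existing copy of `name`, searching `sources` in order.

    Listing the new master's WAL directory first means its copy wins
    whenever both the new and the old master have a file with that name.
    """
    for directory in sources:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

A caller would pass the new master's WAL location first and the old master's second, for example `pick_wal_segment(segment_name, [new_master_wal, old_master_wal])`, where both paths are placeholders. A wrapper like this could sit behind whatever command the recovering server uses to fetch WAL segments.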

> So you're saying controlled failover could possibly skip base backup to
> reuse old master as new slave, and I'm asking if by some luck (crash
> happened before CHECKPOINT) and some recovery.conf setup we could get to
> the same situation in case of hard failure. That would allow completely
> automatic switchover / failover with no need to resync.

Yeah, that would be nice. In practice, I think you would get lucky more 
often than not, because whenever you modify and dirty a page, writing a 
WAL record, the usage count on the buffer is incremented and it won't be 
evicted from the buffer cache for a while.
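
To illustrate why a recently dirtied page tends to stay in memory, here is a toy clock-sweep eviction loop. It is a simplified model and not PostgreSQL's buffer manager: the `Buffer` class, `touch`, `clock_sweep` and the `MAX_USAGE` cap are all assumptions of this sketch, and it ignores pinning, the background writer and checkpoints, any of which can still write a dirty page out.

```python
from dataclasses import dataclass
from typing import List

MAX_USAGE = 5  # cap on the usage count; the value is an assumption of this toy model


@dataclass
class Buffer:
    page: str
    usage: int = 0
    dirty: bool = False


def touch(buf: Buffer, dirty: bool = False) -> None:
    """Access a buffer: bump its usage count and optionally mark it dirty."""
    buf.usage = min(buf.usage + 1, MAX_USAGE)
    buf.dirty = buf.dirty or dirty


def clock_sweep(pool: List[Buffer], hand: int = 0) -> int:
    """Return the index of the eviction victim.

    The hand skips any buffer with a non-zero usage count, decrementing
    the count as it passes, and stops at the first buffer whose count is 0.
    """
    while True:
        buf = pool[hand]
        if buf.usage == 0:
            return hand
        buf.usage -= 1
        hand = (hand + 1) % len(pool)


pool = [Buffer("p1"), Buffer("p2"), Buffer("p3")]
touch(pool[0], dirty=True)     # p1 was just modified and WAL-logged
victim = clock_sweep(pool, 0)  # the sweep passes over p1 and picks an untouched buffer
print(pool[victim].page)
```

In this model the freshly modified page loses one usage point but is not chosen as the victim, so its changes stay in memory for at least one more sweep. That is the sense in which you would "get lucky": the unshipped changes often never reach the data files before the crash.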

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

