Re: Synchronous Log Shipping Replication - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Synchronous Log Shipping Replication
Date
Msg-id 1221034238.9487.9.camel@huvostro
In response to Re: Synchronous Log Shipping Replication  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Wed, 2008-09-10 at 08:15 +0100, Simon Riggs wrote:

> Any working solution needs to work for all required phases. If you did
> it this way, you'd never catch up at all.
> 
> When you first make the copy, it will be made at time X. The point of
> consistency will be sometime later and requires WAL data to make it
> consistent. So you would need to do a PITR to get it to the point of
> consistency. While you've been doing that, the primary server has moved
> on and now there is a gap between primary and standby. You *must*
> provide a facility to allow the standby to catch up with the primary.
> Only sending *current* WAL is not a solution, and not acceptable.
> 
> So there must be mechanisms for sending past *and* current WAL data to
> the standby, and an exact and careful mechanism for switching between
> the two modes when the time is right. Replication is only synchronous
> *after* the change in mode.
> 
> So the protocol needs to be something like:
> 
> 1. Standby contacts primary and says it would like to catch up, but is
> currently at point X (which is a point at, or after the first consistent
> stopping point in WAL after standby has performed its own crash
> recovery, if any was required).
> 2. primary initiates data transfer of old data to standby, starting at
> point X
> 3. standby tells primary where it has got to periodically
> 4. at some point primary decides primary and standby are close enough
> that it can now begin streaming "current WAL" (which is always the WAL
> up to wal_buffers behind the current WAL insertion point).
> 
> Bear in mind that unless wal_buffers > 16MB the final catchup will
> *always* be less than one WAL file, so external file based mechanisms
> alone could never be enough. So you would need wal_buffers >= 2000 to
> make an external catch up facility even work at all.
> 
> This also probably means that receipt of WAL data on the standby cannot
> be achieved by placing it in wal_buffers. So we probably need to write
> it directly to the WAL files, then rely on the filesystem cache on the
> standby to buffer the data for use by ReadRecord.

And this catchup may need to be done repeatedly, in case of network
failure.
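
Something like this on the standby side (just a sketch to illustrate the
handshake and the retry-on-failure loop; every name here is invented, none
of it is existing code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t wal_lsn_t;

    /* Hypothetical stand-ins for the real machinery. */
    static wal_lsn_t first_consistent_point(void)  { return 1234; }
    static bool      connect_to_primary(void)      { return true; }
    static bool      request_wal_from(wal_lsn_t x) { (void) x; return true; }
    static wal_lsn_t receive_apply_and_ack(void)   { return 0; }

    int
    main(void)
    {
        wal_lsn_t x = first_consistent_point();   /* step 1: standby's point X */

        /* The whole sequence may have to be repeated after a network failure. */
        for (int attempt = 0; attempt < 3; attempt++)
        {
            if (!connect_to_primary() || !request_wal_from(x))
                continue;                          /* reconnect and try again */

            /* steps 2-4: receive old WAL and ack progress, until the primary
             * switches to streaming current WAL (or the connection drops). */
            wal_lsn_t applied;
            while ((applied = receive_apply_and_ack()) != 0)
                x = applied;                       /* remember how far we got */
        }
        return 0;
    }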

I don't think that the slave automatically becoming a master when it
detects a network failure (as suggested elsewhere in this thread) is an
acceptable solution, as it will more often than not result in two masters.

A better solution would be:

1. The slave just keeps waiting for new WAL records and confirming their
receipt, storage to disk, and application.

2. The master is in one of at least two states:
2.1 - Catchup - async mode, where it is sending old WAL files and records to the slave
2.2 - Sync Replication - sync mode, where COMMIT does not return before
confirmation from the WALSender.

The initial mode is Catchup, which is promoted to Sync Replication when the
delay of WAL application is reasonably small.
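
On the master side the WALSender state machine could look roughly like this
(again only a sketch with invented names and thresholds, not actual code):

    #include <stdint.h>

    typedef uint64_t wal_lsn_t;

    typedef enum
    {
        WALSND_CATCHUP,           /* async: shipping old WAL */
        WALSND_SYNC_REPLICATION   /* sync: COMMIT waits for the standby's ack */
    } WalSenderState;

    #define MAX_SYNC_LAG ((wal_lsn_t) 16 * 1024 * 1024)   /* e.g. one segment */

    /* Stubs standing in for the real server internals. */
    static wal_lsn_t current_insert_lsn(void)    { return 100 * 1024 * 1024; }
    static wal_lsn_t standby_confirmed_lsn(void) { return  99 * 1024 * 1024; }
    static void      send_archived_wal(void)     { /* ship old WAL files */ }
    static void      send_current_wal(void)      { /* stream from insert point */ }

    int
    main(void)
    {
        WalSenderState state = WALSND_CATCHUP;

        for (int i = 0; i < 5; i++)     /* a few iterations for illustration */
        {
            if (state == WALSND_CATCHUP)
            {
                /* Ship old WAL until the standby is "close enough" ... */
                send_archived_wal();

                /* ... then promote to synchronous streaming. */
                if (current_insert_lsn() - standby_confirmed_lsn() < MAX_SYNC_LAG)
                    state = WALSND_SYNC_REPLICATION;
            }
            else
                send_current_wal();
        }
        return 0;
    }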

When the master detects a network outage (== delay bigger than acceptable),
it will either just send a NOTICE to all clients and fall back to Catchup,
or raise an ERROR (and still fall back to Catchup).

This is the point where external HA / Heartbeat etc. solutions would
intervene and decide what to do.
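
The commit path in Sync Replication mode could then look like this, with
the same caveat that every name (and the policy knob) is made up:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { MODE_CATCHUP, MODE_SYNC_REPLICATION } ReplMode;

    static ReplMode repl_mode = MODE_SYNC_REPLICATION;
    static bool     sync_commit_raises_error = false;   /* NOTICE vs ERROR policy */

    /* Hypothetical: wait for the standby's ack, return false on timeout. */
    static bool
    wait_for_standby_ack(int timeout_ms)
    {
        (void) timeout_ms;
        return false;                    /* pretend the network went away */
    }

    static void
    finish_sync_commit(void)
    {
        if (repl_mode == MODE_SYNC_REPLICATION && !wait_for_standby_ack(5000))
        {
            repl_mode = MODE_CATCHUP;    /* fall back to async catchup */

            if (sync_commit_raises_error)
                fprintf(stderr, "ERROR: standby did not confirm commit\n");
            else
                fprintf(stderr, "NOTICE: standby lost, falling back to catchup\n");
        }
    }

    int
    main(void)
    {
        finish_sync_commit();
        return 0;
    }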

-----------------
Hannu