On Fri, 2010-01-15 at 20:11 +0200, Heikki Linnakangas wrote:
> The states we have at the moment in standby are:
>
> 1. Archive recovery. Standby fetches WAL files from archive using
> restore_command. When a file is not found in archive, we switch to state 2
>
> 2. Streaming replication. Standby connects (and reconnects if the
> connection is lost for any reason) to the primary, starts streaming, and
> applies WAL as it arrives. We stay in this state until trigger file is
> found or server is shut down.
> The states with my suggested ReadRecord/FetchRecord refactoring, the
> code I have in the replication-xlogrefactor branch in my git repo, are:
>
> 1. Initial archive recovery. Standby fetches WAL files from archive
> using restore_command. When a file is not found in archive, we start
> walreceiver and switch to state 2
>
> 2. Retrying to restore from archive. When the connection to primary is
> established and replication is started, we switch to state 3
>
> 3. Streaming replication. Connection to primary is established, and WAL
> is applied as it arrives. When the connection is dropped, we go back to
> state 2
>
> Although the the state transitions between 2 and 3 are a bit fuzzy in
> that version; walreceiver runs concurrently, trying to reconnect, while
> startup process retries restoring from archive. Fujii-san's suggestion
> to have walreceiver stop while startup process retries restoring from
> archive (or have walreceiver run restore_command in approach #2) would
> make that clearer.
The one-way state transitions between 1->2 in both cases seem to make
this a little more complex, rather than more simple.
If the connection did drop then WAL will be in the archive, so the path
for data is archive->primary->standby. There already needs to be a
network path between archive and standby, so why not drop back from
state 3 -> 1 rather than from 3 -> 2? That way we could have just 2
states on each side, rather than 3.
-- Simon Riggs www.2ndQuadrant.com