Switching timeline over streaming replication - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Switching timeline over streaming replication
Msg-id 504F737E.1040103@iki.fi
List pgsql-hackers
I've been working on the often-requested feature to handle timeline
changes over streaming replication. At the moment, if you kill the
master and promote a standby server, and you have another standby server
that you'd like to keep following the new master server, you need a WAL
archive in addition to streaming replication to make it cross the
timeline change. Streaming replication will just error out. Having a WAL
archive is usually a good idea in complex replication scenarios anyway,
but it would be good to not require it.

Attached is a WIP patch for that. It needs cleanup, but works.

Protocol changes
----------------

When we invented the COPY-both mode, we left out any description of how
to get out of that mode, simply stating that both ends "may then send
CopyData messages until the connection is terminated". The patch makes
it possible to revert back to regular processing, by sending a CopyDone
message, like in normal Copy-in or Copy-out mode. Either end can take
the initiative and send CopyDone, and after doing that may not send any
more CopyData messages. When both ends have sent a CopyDone message, and
received a CopyDone message from the other end, the connection is out of
Copy-mode, and the server finishes the command with a CommandComplete
message.

Another way to think of it is that when the server sends a CopyDone
message, the connection switches from Copy-both to Copy-in mode. And if
the client sends a CopyDone message first, the connection goes from
Copy-both to Copy-out mode, until the server ends the streaming from its
end.
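
To make the mode transitions concrete, here is a minimal sketch of the
state machine described above. It is an illustration only, not code
from the patch; the names CopyState and next_state are made up for this
example.

```python
from enum import Enum


class CopyState(Enum):
    COPY_BOTH = "copy-both"  # both ends may send CopyData
    COPY_IN = "copy-in"      # server sent CopyDone; only the client still sends
    COPY_OUT = "copy-out"    # client sent CopyDone; only the server still sends
    IDLE = "idle"            # both ends sent CopyDone; CommandComplete follows


def next_state(state, sender):
    """Return the state after `sender` ("server" or "client") sends CopyDone."""
    if state is CopyState.COPY_BOTH:
        return CopyState.COPY_IN if sender == "server" else CopyState.COPY_OUT
    if state is CopyState.COPY_IN and sender == "client":
        return CopyState.IDLE
    if state is CopyState.COPY_OUT and sender == "server":
        return CopyState.IDLE
    # An end that already sent CopyDone must stay quiet until the other ends too.
    raise ValueError(f"{sender} may not send CopyDone in state {state.value}")
```

Either order of CopyDone messages ends in the same idle state, which is
the point where the server would send CommandComplete.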

New replication command: TIMELINE_HISTORY
-----------------------------------------

To switch recovery target timeline, a standby needs the timeline history
file (e.g. 00000002.history) of the new timeline. The patch adds a new
command to the set of commands accepted by walsender, to transmit a
given timeline history file from master to slave.
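
For reference, the history file name is derived from the timeline ID
written as eight uppercase hex digits. A small sketch of that naming,
where the helper name timeline_history_filename is hypothetical:

```python
def timeline_history_filename(tli: int) -> str:
    """Build the history file name for a timeline ID, e.g. 2 -> 00000002.history."""
    if tli < 1:
        raise ValueError("timeline IDs start at 1")
    # Eight hex digits, zero-padded, followed by the .history suffix.
    return f"{tli:08X}.history"
```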

Walsender changes to stream a particular timeline
-------------------------------------------------

The walsender now keeps track of exactly which timeline it is currently
streaming; it's not necessarily the latest one anymore. The
START_REPLICATION command is extended with a TIMELINE option that the
client can use to request streaming from a particular timeline. If the
client asks for a timeline that's not the current one, but is part of
the server's history, the walsender knows to read from the WAL file that
contains it. Also, the walsender knows where the server's
history branched off from that timeline, and will only stream WAL up to
that point. When that point is reached, it ends the streaming (with a
CopyDone message), and prepares to accept a new replication command.
Typically, the walreceiver will then ask to start streaming from the
next timeline.
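
The decision of how far to stream can be sketched roughly as follows.
This is a simplified model and not the walsender code: the history is
assumed to be a list of (timeline, switchpoint) pairs, WAL positions are
plain integers, and the name stream_limit is invented for the example.

```python
from typing import Optional, Sequence, Tuple

# Each entry: (timeline ID, WAL position where the server's history left it).
History = Sequence[Tuple[int, int]]


def stream_limit(requested_tli: int, current_tli: int,
                 history: History) -> Optional[int]:
    """Return the position at which streaming of requested_tli must stop.

    None means the requested timeline is the server's current one, so
    there is no fixed end point and streaming continues indefinitely.
    """
    if requested_tli == current_tli:
        return None
    for tli, switchpoint in history:
        if tli == requested_tli:
            # Stream only up to the branch point, then send CopyDone.
            return switchpoint
    raise LookupError(f"timeline {requested_tli} is not in the server's history")
```

For a server on timeline 3 whose history says timeline 1 ended at
0x3000000, a request for timeline 1 would stream up to 0x3000000 and then
end, after which the client would typically ask for timeline 2.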

Walreceiver changes
-------------------

Previously, when the timeline reported by the server didn't match the
current timeline in the standby, walreceiver simply errored out. Now, it
requests any missing timeline history files using the new
TIMELINE_HISTORY command, and then tries to start replication from the
standby's current timeline, even if that's older than the master's.

When the end of the old timeline is reached, walreceiver sets a state
variable in shared memory to indicate that, pings the startup
process, and waits for new orders from it. The startup
process can set receiveStart and timeline in shared memory and ping the
walreceiver again, to get the walreceiver to restart streaming from the
new starting point [1]. Before the startup process does that, it will
scan pg_xlog for new timeline history files if
recovery_target_timeline='latest'. It will find any new history files
the walreceiver stored there, and switch over to the latest timeline
just as it does with a WAL archive.
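
As a rough illustration of the scan, the sketch below picks the newest
timeline from a list of file names such as a pg_xlog directory listing.
It is a simplification and not how the startup process is implemented:
the function name newest_timeline is hypothetical, and a real
implementation would also have to check that the chosen timeline is a
descendant of the current one.

```python
import re
from typing import Iterable

# History files are named with eight uppercase hex digits plus ".history".
_HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def newest_timeline(filenames: Iterable[str], current_tli: int) -> int:
    """Return the highest timeline ID that has a history file, or current_tli."""
    newest = current_tli
    for name in filenames:
        match = _HISTORY_RE.match(name)
        if match:
            newest = max(newest, int(match.group(1), 16))
    return newest
```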


Some parts of this patch are just refactoring that probably make sense
regardless of the new functionality. For example, I split off the
timeline history file related functions to a new file, timeline.c.
That's not very much code, but it's fairly isolated, and xlog.c is
massive, so I feel that anything that we can move off from xlog.c is a
good thing. I also moved off the two functions RestoreArchivedFile() and
ExecuteRecoveryCommand(), to a separate file. Those are also not much
code, but are fairly isolated. If no-one objects to those changes, and
the general direction this work is going in, I'm going to split off
those refactorings into separate patches and commit them separately.

I also made the timeline history file a bit more detailed: instead of
recording just the WAL segment where the timeline was changed, it now
records the exact XLogRecPtr. That was required for the walsender to
know the switchpoint, without having to parse the XLOG records (it reads
and parses the history file, instead).
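
To show why the exact pointer carries more information than the segment
alone, here is a small sketch. It assumes the pointer is written in the
usual hi/lo hex text form (e.g. 0/3000060) and that WAL segments have
the default 16 MB size; the exact layout of a history file line in the
patch is not shown here, and the helper names are made up.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size; an assumption here


def parse_lsn(text: str) -> int:
    """Convert a textual WAL position like '0/3000060' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_number(lsn: int) -> int:
    """Return the number of the WAL segment that contains the position."""
    return lsn // WAL_SEGMENT_SIZE
```

With these assumptions, 0/3000060 falls in segment 3, but the segment
number alone cannot say that the switch happened 0x60 bytes into it.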


[1] Initially, I tried to do this by simply letting walreceiver die and
have the startup process launch a new walreceiver process that would
reconnect, but it turned out to be hard to rapidly disconnect and
connect, because the postmaster, which forks the walreceiver process,
does not always have the same idea of whether the walreceiver is active
as the startup process does. It would eventually be ok, thanks to
timeouts, but would require polling. But not having to disconnect seems
nicer, anyway.

- Heikki
