Re: [COMMITTERS] pgsql: Allow a streaming replication standby to follow a timeline switc - Mailing list pgsql-general
From | hubert depesz lubaczewski |
---|---|
Subject | Re: [COMMITTERS] pgsql: Allow a streaming replication standby to follow a timeline switc |
Date | |
Msg-id | 20121215150610.GA21431@depesz.com |
Responses | Re: [COMMITTERS] pgsql: Allow a streaming replication standby to follow a timeline switc |
List | pgsql-general |
I might be missing something, but what exactly does that commit give us?

I mean - we were able, previously, to make slave switch to new master - as Phil Sorber described in here:
http://philsorber.blogspot.com/2012/03/what-to-do-when-your-timeline-isnt.html

After some talk on IRC, I understood that this patch will make it possible to switch to new master in plain SR replication, with no WAL archive (because if you have wal archive, you can use the method Phil described, which basically "just works").

So I did setup three machines: master and two slaves. Master had 2 IPs - its own, and a floating one. Both slaves were connecting to the floating one, and recovery.conf looked like:

---------
standby_mode = 'on'
primary_conninfo = 'port=5920 user=replication host=172.28.173.253'
trigger_file = '/tmp/finish.replication'
recovery_target_timeline='latest'
---------

After I verified that replication works to both slaves, I did failover one of the slaves, shut down master, and did ip takeover of floating ip to the slave that did takeover.

new master did work nicely, but the 2nd slave, which was supposed to switch to new master - didn't switch automatically. In logs I saw:

2012-12-15 15:59:58.495 CET @ 24288 FATAL: could not send data to WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2012-12-15 16:00:13.518 CET @ 24304 LOG: fetching timeline history file for timeline 2 from primary server
2012-12-15 16:00:13.522 CET @ 24304 FATAL: could not start WAL streaming: ERROR: requested starting point 0/F000000 on timeline 1 is not in this server's history
        DETAIL: This server's history forked from timeline 1 at 0/E002D40
2012-12-15 16:00:13.523 CET @ 24287 LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/F000000
2012-12-15 16:00:18.535 CET @ 24309 FATAL: could not start WAL streaming: ERROR: requested starting point 0/F000000 on timeline 1 is not in this server's history
        DETAIL: This server's history forked from timeline 1 at 0/E002D40
2012-12-15 16:00:18.536 CET @ 24287 LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/F000000
2012-12-15 16:00:23.549 CET @ 24323 FATAL: could not start WAL streaming: ERROR: requested starting point 0/F000000 on timeline 1 is not in this server's history
        DETAIL: This server's history forked from timeline 1 at 0/E002D40
2012-12-15 16:00:23.550 CET @ 24287 LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/F000000
2012-12-15 16:00:28.564 CET @ 24327 FATAL: could not start WAL streaming: ERROR: requested starting point 0/F000000 on timeline 1 is not in this server's history
        DETAIL: This server's history forked from timeline 1 at 0/E002D40
2012-12-15 16:00:28.565 CET @ 24287 LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/F000000
2012-12-15 16:00:32.927 CET @ 24282 LOG: received smart shutdown request
2012-12-15 16:00:33.132 CET @ 24289 LOG: restartpoint complete: wrote 4427 buffers (27.0%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=63.750 s, sync=0.095 s, total=63.943 s; sync files=2, longest=0.093 s, average=0.047 s
2012-12-15 16:00:33.132 CET @ 24289 LOG: recovery restart point at 0/DD376B8
2012-12-15 16:00:33.132 CET @ 24289 DETAIL: last completed transaction was at log time 2012-12-15 15:59:43.993318+01
2012-12-15 16:00:33.133 CET @ 24289 LOG: shutting down
2012-12-15 16:00:33.140 CET @ 24289 LOG: database system is shut down

I did shutdown the slave, and restarted it, but it fails to start, and instead it just shows in log:

2012-12-15 16:00:38.130 CET @ 24348 LOG: database system was shut down in recovery at 2012-12-15 16:00:33 CET
2012-12-15 16:00:38.130 CET @ 24348 FATAL: requested timeline 2 does not contain minimum recovery point 0/F000000 on timeline 1
2012-12-15 16:00:38.130 CET @ 24343 LOG: startup process (PID 24348) exited with exit code 1
2012-12-15 16:00:38.130 CET @ 24343 LOG: aborting startup due to startup process failure

So, what is the problem, and how can I get it working? Or, if wal archive is necessary - what is the point of this commit? What am I missing in here?

Best regards,

depesz

On Thu, Dec 13, 2012 at 05:19:36PM +0000, Heikki Linnakangas wrote:
> Allow a streaming replication standby to follow a timeline switch.
>
> Before this patch, streaming replication would refuse to start replicating
> if the timeline in the primary doesn't exactly match the standby. The
> situation where it doesn't match is when you have a master, and two
> standbys, and you promote one of the standbys to become new master.
> Promoting bumps up the timeline ID, and after that bump, the other standby
> would refuse to continue.
>
> There's significantly more timeline related logic in streaming replication
> now. First of all, when a standby connects to primary, it will ask the
> primary for any timeline history files that are missing from the standby.
> The missing files are sent using a new replication command TIMELINE_HISTORY,
> and stored in standby's pg_xlog directory. Using the timeline history files,
> the standby can follow the latest timeline present in the primary
> (recovery_target_timeline='latest'), just as it can follow new timelines
> appearing in an archive directory.
>
> START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
> timeline to stream WAL from. This allows the standby to request the primary
> to send over WAL that precedes the promotion. The replication protocol is
> changed slightly (in a backwards-compatible way although there's little hope
> of streaming replication working across major versions anyway), to allow
> replication to stop when the end of timeline reached, putting the walsender
> back into accepting a replication command.
>
> Many thanks to Amit Kapila for testing and reviewing various versions of
> this patch.
>
> Branch
> ------
> master
>
> Details
> -------
> http://git.postgresql.org/pg/commitdiff/abfd192b1b5ba5216ac4b1f31dcd553106304b19
>
> Modified Files
> --------------
> doc/src/sgml/high-availability.sgml         |   7 +-
> doc/src/sgml/protocol.sgml                  |  77 +++-
> src/backend/access/transam/timeline.c       |  83 +++
> src/backend/access/transam/xlog.c           |  55 +-
> src/backend/access/transam/xlogfuncs.c      |   4 +-
> src/backend/postmaster/postmaster.c         |  21 -
> src/backend/postmaster/startup.c            |   2 +
> src/backend/replication/basebackup.c        |  21 +-
> .../libpqwalreceiver/libpqwalreceiver.c     | 202 +++++--
> src/backend/replication/repl_gram.y         |  47 ++-
> src/backend/replication/repl_scanner.l      |  10 +
> src/backend/replication/walreceiver.c       | 472 +++++++++++++---
> src/backend/replication/walreceiverfuncs.c  |  90 +++-
> src/backend/replication/walsender.c         | 587 +++++++++++++++----
> src/include/access/timeline.h               |   1 +
> src/include/access/xlog.h                   |   4 +-
> src/include/nodes/nodes.h                   |   1 +
> src/include/nodes/replnodes.h               |  12 +
> src/include/replication/walreceiver.h       |  52 ++-
> src/include/replication/walsender.h         |   1 -
> src/include/replication/walsender_private.h |   2 +-
> src/interfaces/libpq/fe-exec.c              |   8 +-
> src/interfaces/libpq/fe-protocol3.c         |   7 +-
> 23 files changed, 1396 insertions(+), 370 deletions(-)
>
>
> --
> Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-committers

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
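
A side note on the two WAL locations quoted in the errors: a location such as 0/F000000 is printed as the high and low 32-bit halves of a 64-bit position, both in hexadecimal. The sketch below is only an illustration under that reading; it compares the standby's requested starting point with the fork point reported in the DETAIL line. The helper name parse_lsn is hypothetical and not part of any PostgreSQL API.

```python
def parse_lsn(text: str) -> int:
    """Convert a WAL location like '0/F000000' into a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the log messages in the mail.
fork_point = parse_lsn("0/E002D40")     # where timeline 2 forked off timeline 1
standby_point = parse_lsn("0/F000000")  # starting point the standby requested on timeline 1

# How far past the fork point the standby already is on the old timeline.
overshoot = standby_point - fork_point

print(standby_point > fork_point)  # True
print(overshoot)                   # 16765632
```

Read this way, the standby's position on timeline 1 is roughly 16 MB past the point where timeline 2 branched off, which is what the "forked off ... before current recovery point" message states. The logs alone do not establish why the standby ended up ahead of the fork point.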