Re: BUG #5011: Standby recovery unable to follow timeline change - Mailing list pgsql-bugs

From	Heikki Linnakangas
Subject	Re: BUG #5011: Standby recovery unable to follow timeline change
Date	August 26, 2009 05:42:49
Msg-id	4A94F518.3050001@enterprisedb.com
In response to	BUG #5011: Standby recovery unable to follow timeline change ("James Bardin" <jbardin@bu.edu>)
Responses	Re: BUG #5011: Standby recovery unable to follow timeline change Re: BUG #5011: Standby recovery unable to follow timeline change
List	pgsql-bugs

James Bardin wrote:
> I'm working on a system where the master and standby servers are expected to
> be able to swap roles repeatedly. The first failover works fine, but the
> ex-master, now standby, can't recover using the shipped logs.
>
> Using recovery_target_timeline='latest' finds the new history file, and
> pg_standby looks good until recovery is attempted. Then we log errors like:
>
> LOG:  unexpected timeline ID 0 in log file 0, segment 1, offset 0
> LOG:  invalid primary checkpoint record
>
> and any changes made after the first failover are lost.
>
> Is this currently possible, or do I have to send a full file-level backup to
> sync the ex-master server with the new master?

That should work. (Note that you do need to restore the ex-master from
the old base backup; you can't just copy recovery.conf to the old
master's data directory.)

I can reproduce that; it's clearly a bug. Thanks for the report!

Our last-minute changes in 8.4 to allow a checkpoint record to be created,
while forbidding other WAL insertions, missed that CreateCheckPoint()
calls AdvanceXLInsertBuffer(), which requires a valid ThisTimeLineID to
be set. We need to initialize ThisTimeLineID before we call
AdvanceXLInsertBuffer().

I wonder if we should add an XLogInsertAllowed() cross-check to
AdvanceXLInsertBuffer() to catch this kind of bug in the future.
Writing a new empty WAL page is more or less the same thing as writing a
new WAL record. OTOH, all other AdvanceXLInsertBuffer() calls are from
XLogInsert(), which already checks that, and those calls are in quite
performance-critical paths.

Attached is a straightforward fix which initializes ThisTimeLineID
before the AdvanceXLInsertBuffer() call. Barring objections, I'll commit
that.

BTW, I'm not sure if the AdvanceXLInsertBuffer() call is really
necessary there. It's just to round up the redo-pointer in the
checkpoint record to where the next WAL record will be, but ISTM the end
location of the last record would work just as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc6be16..88dc987 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6455,6 +6455,8 @@ CreateCheckPoint(int flags)
     freespace = INSERT_FREESPACE(Insert);
     if (freespace < SizeOfXLogRecord)
     {
+        /* AdvanceXLInsertBuffer() needs a valid ThisTimeLineID */
+        InitXLOGAccess();
         (void) AdvanceXLInsertBuffer(false);
         /* OK to ignore update return flag, since we will do flush anyway */
         freespace = INSERT_FREESPACE(Insert);
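
Editor's sketch (not part of the original message or patch): the toy C model below illustrates the failure described above, assuming a new WAL page header is simply stamped with whatever the current timeline ID is. Every name prefixed with toy_ or Toy is hypothetical and is not PostgreSQL code. The checked variant tests the timeline ID directly, which is a simpler stand-in for the XLogInsertAllowed() cross-check discussed in the message, not an implementation of it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are real PostgreSQL symbols. */
typedef uint32_t ToyTimeLineID;

typedef struct ToyPageHeader
{
    ToyTimeLineID tli;          /* timeline ID stamped into the page header */
} ToyPageHeader;

/* 0 plays the role of "timeline ID not yet initialized". */
static ToyTimeLineID toy_this_tli = 0;

/* Models the effect of the fix: set the timeline ID before any page is written. */
static void
toy_init_xlog_access(ToyTimeLineID recovered_tli)
{
    toy_this_tli = recovered_tli;
}

/* Unchecked path: stamps the new page with the current value, valid or not. */
static ToyPageHeader
toy_advance_buffer(void)
{
    ToyPageHeader hdr = { toy_this_tli };

    return hdr;
}

/* Checked variant: refuses to produce a page while the timeline ID is unset. */
static bool
toy_advance_buffer_checked(ToyPageHeader *out)
{
    if (toy_this_tli == 0)
        return false;
    *out = toy_advance_buffer();
    return true;
}
```

In this model, calling toy_advance_buffer() before toy_init_xlog_access() yields a header with timeline ID 0, mirroring the "unexpected timeline ID 0" log line in the report, while the checked variant fails early instead of writing a bad page.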
