Thread: BUG #4854: Problems with replaying WAL files on Warm Standby

BUG #4854: Problems with replaying WAL files on Warm Standby

From
"Keith Pierno"
Date:
The following bug has been logged online:

Bug reference:      4854
Logged by:          Keith Pierno
Email address:      kpierno@lulu.com
PostgreSQL version: 8.3.6
Operating system:   Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
PPC64
Description:        Problems with replaying WAL files on Warm Standby
Details:

We experienced a failure of our primary database server, after fixing the
hardware issue I attempted to turn the old primary into a warm standby
slave. I used a hotbackup from the new primary server and restored the files
to the slave. I then removed all references to old WAL files and used rsync
to copy the appropriate WAL files from our archive location. During the log
playback I received the following errors.

2009-06-15 13:25:11 EDT <::> LOG:  restored log file
"00000004000007460000007F" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file
"000000040000074600000080" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file
"000000040000074600000081" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file
"000000040000074600000082" from archive
2009-06-15 13:25:12 EDT <::> LOG:  restored log file
"000000040000074600000083" from archive
2009-06-15 13:25:12 EDT <::> LOG:  restored log file
"000000040000074600000084" from archive
2009-06-15 13:25:12 EDT <::> PANIC:  unexpected timeline ID 6 (after 4) in
checkpoint record
2009-06-15 13:25:12 EDT <::> CONTEXT:  xlog redo checkpoint: redo
746/84FFF010; tli 6; xid 0/50577196; oid 31395572; multi 6542; offset 13124;
shutdown
2009-06-15 13:25:13 EDT <::> LOG:  startup process (PID 27836) was
terminated by signal 6: Aborted
2009-06-15 13:25:13 EDT <::> LOG:  aborting startup due to startup process
failure

This backup was created at 0205 EDT and the WAL in question was from 0743
EDT.

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Tom Lane
Date:
"Keith Pierno" <kpierno@lulu.com> writes:
> 2009-06-15 13:25:12 EDT <::> PANIC:  unexpected timeline ID 6 (after 4) in
> checkpoint record

Hmm.  It's complaining because it didn't find timeline 6 mentioned in
the timeline history file it read (if any).  Maybe you forgot to archive
or restore the NNNNNNNN.history file?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Tom Lane
Date:
Keith Pierno <kpierno@lulu.com> writes:
> Thanks for the prompt response. If such a file is needed for recovery
> it was never created by postgres. The current archiving process creates
> uses rsync to archive the WAL files to a shared archive area. In the
> past and on my other cluster we do not see .history files on the
> primary server and have been able to recover without them.

The .history file would have been created on the slave at the time it
became new master (and started its own timeline).

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Keith Pierno
Date:
Thanks for the prompt response. If such a file is needed for recovery
it was never created by postgres. The current archiving process creates
uses rsync to archive the WAL files to a shared archive area. In the
past and on my other cluster we do not see .history files on the
primary server and have been able to recover without them. If it helps
I can send a copy of the WAL files and my postgres.conf files.

Keith

Tom Lane wrote:

  "Keith Pierno" <kpierno@lulu.com> writes:


    2009-06-15 13:25:12 EDT <::> PANIC:  unexpected timeline ID 6 (after 4) in
checkpoint record



Hmm.  It's complaining because it didn't find timeline 6 mentioned in
the timeline history file it read (if any).  Maybe you forgot to archive
or restore the NNNNNNNN.history file?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Keith Pierno
Date:
I was able to find the file, which was from the date of the failure
(Tuesday June 9th). My question is why the backup from this morning
would all of a sudden require a file from the 9th even though 5 hours
of logs were able to be applied? I ended up doing another hotbackup and
restoring the database from this afternoon and this seems to be working
fine. I am just worried that another WAL file will throw the same error
and I will be without a warm standby.

Thanks again,

Keith

Tom Lane wrote:

  Keith Pierno <kpierno@lulu.com> writes:


    Thanks for the prompt response. If such a file is needed for recovery
it was never created by postgres. The current archiving process creates
uses rsync to archive the WAL files to a shared archive area. In the
past and on my other cluster we do not see .history files on the
primary server and have been able to recover without them.



The .history file would have been created on the slave at the time it
became new master (and started its own timeline).

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Tom Lane
Date:
Keith Pierno <kpierno@lulu.com> writes:
> I was able to find the file, which was from the date of the failure
> (Tuesday June 9th). My question is why the backup from this morning
> would all of a sudden require a file from the 9th even though 5 hours
> of logs were able to be applied?

It was apparently busy applying the logs that were from the old primary.
When it got up to the point where the failover occurred, kaboom :-(.
It would have had trouble in any case there because the subsequent log
files are presumably numbered 00000006... instead of 00000004...
Were you getting those shipped over properly from the new primary?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Keith Pierno
Date:
The backup used was from well after the failover time which is why I
was concerned. Interestingly enough the logs are still all prefixed
with 00000004... That just makes this problem extremely bizarre. 

Current log listing:
[postgres@db01pri pgwalarchives]# ls -ltr
total 82028
-rw-------  1 postgres postgres      250 Jun 15 02:05
0000000400000745000000DB.00F290D8.backup
-rw-------  1 postgres postgres      250 Jun 15 13:28
000000040000074700000065.00D48908.backup
-rw-------  1 postgres postgres 16777216 Jun 15 15:58
0000000400000747000000E4
-rw-------  1 postgres postgres 16777216 Jun 15 15:59
0000000400000747000000E5
-rw-------  1 postgres postgres 16777216 Jun 15 16:02
0000000400000747000000E6
-rw-------  1 postgres postgres 16777216 Jun 15 16:04
0000000400000747000000E7
-rw-------  1 postgres postgres 16777216 Jun 15 16:07
0000000400000747000000E8
[postgres@db01pri pgwalarchives]#

Keith

Tom Lane wrote:

  Keith Pierno <kpierno@lulu.com> writes:


    I was able to find the file, which was from the date of the failure
(Tuesday June 9th). My question is why the backup from this morning
would all of a sudden require a file from the 9th even though 5 hours
of logs were able to be applied?



It was apparently busy applying the logs that were from the old primary.
When it got up to the point where the failover occurred, kaboom :-(.
It would have had trouble in any case there because the subsequent log
files are presumably numbered 00000006... instead of 00000004...
Were you getting those shipped over properly from the new primary?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Tom Lane
Date:
Keith Pierno <kpierno@lulu.com> writes:
> The backup used was from well after the failover time which is why I
> was concerned. Interestingly enough the logs are still all prefixed
> with 00000004... That just makes this problem extremely bizarre.

Hmm, that *is* weird.  It seems like the new primary must have reverted
its decision to go from timeline 4 to timeline 6.  (Which in itself is
a bit odd; why not timeline 5?)

Can you give us an exact sequence of events on the slave server/new
primary around the time of the failover?  Also, what was in the .history
file when you found it, and are there any other history files?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Keith Pierno
Date:
The timeline for the events all dates MM/DD/YYYY

    06/09/2009 1310 EDT - Hardware fault on primary database server
db01pri
    06/09/2009 1325 EDT - Failover to warm standby db01sec
    06/12/2009 1615 EDT - db01pri server fixed and OS booted
    06/15/2009 1115 EDT - started recovery of hotbackup from 06/15/2009
0205 EDT from db01sec onto db01pri
    06/15/2009 1320 EDT - Attempted to start postgres on db01pri in
warm standby mode
    06/15/2009 1325 EDT - Failure to apply WAL log errors with
"unexpected timeline ID"
    06/15/2009 1340 EDT - Started a new hotbackup on db01sec
    06/15/2009 1545 EDT - Started recovery hotbackup from 06/15/2009
1340 to db01pri
    06/15/2000 1430 EDT - db01pri recovered and running in warm standby

Here is the contents of the pg_xlog directory and the 00000004.history
file:

[postgres@db01pri ~]$  cat 00000004.history
1    0000000100000736000000A1    before transaction 0 at 1999-12-31
19:00:00-05
[postgres@db01pri ~]$  ls -l
total 98468
-rw-------  1 postgres postgres       74 Jul 10  2008 00000002.history
-rw-------  1 postgres postgres       74 Jun  9 13:29 00000003.history
-rw-------  1 postgres postgres 16777216 Jun 16 08:45
0000000400000749000000C9
-rw-------  1 postgres postgres 16777216 Jun 16 08:46
0000000400000749000000CA
-rw-------  1 postgres postgres 16777216 Jun 16 08:47
0000000400000749000000CB
-rw-------  1 postgres postgres       74 Jun  9 13:33 00000004.history
drwxr-xr-x  2 postgres postgres    32768 Jun 16 08:46 archive_status
-rw-------  1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17243
-rw-------  1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17244
-rw-------  1 postgres postgres 16777216 Jun  9 13:52 xlogtemp.17397
[postgres@db01pri ~]$

Thanks again,

Keith

Tom Lane wrote:

  Keith Pierno <kpierno@lulu.com> writes:


    The backup used was from well after the failover time which is why I
was concerned. Interestingly enough the logs are still all prefixed
with 00000004... That just makes this problem extremely bizarre.



Hmm, that *is* weird.  It seems like the new primary must have reverted
its decision to go from timeline 4 to timeline 6.  (Which in itself is
a bit odd; why not timeline 5?)

Can you give us an exact sequence of events on the slave server/new
primary around the time of the failover?  Also, what was in the .history
file when you found it, and are there any other history files?

            regards, tom lane

Re: BUG #4854: Problems with replaying WAL files on Warm Standby

From
Tom Lane
Date:
Keith Pierno <kpierno@lulu.com> writes:
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>

BTW, please do not post all-HTML mail to this list.

> The timeline for the events all dates MM/DD/YYYY
>
>     06/09/2009 1310 EDT - Hardware fault on primary database server db01pri
>     06/09/2009 1325 EDT - Failover to warm standby db01sec

Hmm, how exactly was the failover managed?

> [postgres@db01pri ~]$  ls -l
> total 98468
> -rw-------  1 postgres postgres       74 Jul 10  2008 00000002.history
> -rw-------  1 postgres postgres       74 Jun  9 13:29 00000003.history
> -rw-------  1 postgres postgres 16777216 Jun 16 08:45 0000000400000749000000C9
> -rw-------  1 postgres postgres 16777216 Jun 16 08:46 0000000400000749000000CA
> -rw-------  1 postgres postgres 16777216 Jun 16 08:47 0000000400000749000000CB
> -rw-------  1 postgres postgres       74 Jun  9 13:33 00000004.history
> drwxr-xr-x  2 postgres postgres    32768 Jun 16 08:46 archive_status
> -rw-------  1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17243
> -rw-------  1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17244
> -rw-------  1 postgres postgres 16777216 Jun  9 13:52 xlogtemp.17397

The xlogtemp files are a bit suspicious; they shouldn't be there.  The
timestamps suggest that something odd was going on right after the
failover.  Do you have the slave's postmaster log from that time
interval, and if so, does it show anything out of the ordinary?

            regards, tom lane