Thread: WAL ends before end time of backup dump

WAL ends before end time of backup dump

From
Jeff Davis
Date:
Version: 8.1.4

I am having a problem restoring one of my base backups. I took a
successful backup of the production DB already since this one, and this
is just a routine test, so it's fortunately not an emergency.

I think that I either have a corrupted base backup or corrupted WAL
segments, or maybe I hit some strange bug.

When I try to restore, I point recovery.conf to the full set of archived
WAL segments, and get the following result:

[snip]
LOG:  restored log file "00000001000000170000002B.004A3CAC.backup" from
archive
LOG:  restored log file "00000001000000170000002B" from archive
LOG:  checkpoint record is at 17/2B4CDC58
LOG:  redo record is at 17/2B4A3CAC; undo record is at 0/0; shutdown
FALSE
LOG:  next transaction ID: 41438715; next OID: 42280
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  automatic recovery in progress
LOG:  redo starts at 17/2B4A3CAC
LOG:  record with zero length at 17/2B6EACC8
LOG:  redo done at 17/2B6EAC84
LOG:  restored log file "00000001000000170000002B" from archive
PANIC:  WAL ends before end time of backup dump
LOG:  startup process (PID 88979) was terminated by signal 6
LOG:  aborting startup due to startup process failure

If I restore from the earlier base backup, which grinds slowly through a
week's worth of WAL segments, it stops at segment
00000001000000170000002B, like so:

[ snip ]
LOG:  restored log file "00000001000000170000002B" from archive
LOG:  record with zero length at 17/2B6EACC8
LOG:  redo done at 17/2B6EAC84
LOG:  restored log file "00000001000000170000002B" from archive
LOG:  archive recovery complete
LOG:  database system is ready
LOG:  transaction ID wrap limit is 1094453440, limited by database
"postgres"

If I restore from a later backup, everything works fine.

The thing that stands out to me about the base backup that doesn't work
is that it took several WAL segments to complete. Here's the .backup
file for the base backup that fails:

$ cat wal/00000001000000170000002B.004A3CAC.backup
START WAL LOCATION: 17/2B4A3CAC (file 00000001000000170000002B)
STOP WAL LOCATION: 17/397B7D64 (file 000000010000001700000039)
CHECKPOINT LOCATION: 17/2B4CDC58
START TIME: 2006-11-05 01:00:01 PST
LABEL: 20061105010001.27375.tar.gz
STOP TIME: 2006-11-05 01:14:03 PST

I noticed the 8.2beta3 included a fix for WAL replay, is that related?
Can someone link to the thread about that bug? I can't test newer
versions of postgres because all my other backups seem to work.

Basically, I'd just like to know what happened to prevent it in the
future. I am archiving to an NFS mount, I don't know whether that
carries a risk of corruption or not.

Regards,
    Jeff Davis


Re: WAL ends before end time of backup dump

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> LOG:  restored log file "00000001000000170000002B" from archive
> LOG:  record with zero length at 17/2B6EACC8
> LOG:  redo done at 17/2B6EAC84

It looks to me like you archived this log file before it was fully
written.  You should take a close look at your archiving procedures.

            regards, tom lane

Re: WAL ends before end time of backup dump

From
Jeff Davis
Date:
On Tue, 2006-11-07 at 11:15 -0800, Jeff Davis wrote:
> Version: 8.1.4
>

[ snip ]

> OG:  restored log file "00000001000000170000002B" from archive
> LOG:  record with zero length at 17/2B6EACC8
> LOG:  redo done at 17/2B6EAC84
> LOG:  restored log file "00000001000000170000002B" from archive
> LOG:  archive recovery complete
> LOG:  database system is ready
> LOG:  transaction ID wrap limit is 1094453440, limited by database
> "postgres"
>

From the 8.1.4 source I see that:

  /*
   * Currently, xl_len == 0 must be bad data, but that might not be true

Does that mean that I must have a corrupt WAL segment file
00000001000000170000002B?

Is it possible this was caused by a bug? Would an already known bug
possibly cause this problem?

I am sending the transaction logs to an NFS mount. I consider it to be
reliable storage, so I can't see how that would have been corrupted
unless it was sent that way by PostgreSQL.

Regards,
    Jeff Davis



Re: WAL ends before end time of backup dump

From
Jeff Davis
Date:
On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > LOG:  restored log file "00000001000000170000002B" from archive
> > LOG:  record with zero length at 17/2B6EACC8
> > LOG:  redo done at 17/2B6EAC84
>
> It looks to me like you archived this log file before it was fully
> written.  You should take a close look at your archiving procedures.
>

Hmm... the only way the file gets there is archive command. I didn't do
any manual moves of the segments, certainly not last Sunday at 1am
(which is when this segment was put there).

From postgresql.conf:
archive_command = '/usr/local/pgsql/bin/archive_command.bash "%p" "%f"'

and:
-----------
$ cat /usr/local/pgsql/bin/archive_command.bash
#!/usr/local/bin/bash

# $1 is the full path to the file to archive
# $2 is the filename

WAL=$1
WAL_NAME=$2
WAL_ARCHIVE=/path/to/archive

echo test ! -f $WAL_ARCHIVE/$WAL_NAME && gzip -c $WAL >
$WAL_ARCHIVE/$WAL_NAME.gz
------------

Any obvious holes in there? Is gzip doing something weird? It worked
with all my other WAL segments, and I've tested the backups multiple
times.

Also, the WAL segment is exactly the right size, although I suppose
there's no way for it not to be.

Regards,
    Jeff Davis


Re: WAL ends before end time of backup dump

From
"Woody Woodring"
Date:
Which Sunday?  Could you have been bitten by some DST time shift?  There
were 2 1ams a couple of weeks ago.

Woody
IGLASS Networks

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Davis
Sent: Tuesday, November 07, 2006 5:38 PM
To: Tom Lane
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] WAL ends before end time of backup dump

On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > LOG:  restored log file "00000001000000170000002B" from archive
> > LOG:  record with zero length at 17/2B6EACC8
> > LOG:  redo done at 17/2B6EAC84
>
> It looks to me like you archived this log file before it was fully
> written.  You should take a close look at your archiving procedures.
>

Hmm... the only way the file gets there is archive command. I didn't do any
manual moves of the segments, certainly not last Sunday at 1am (which is
when this segment was put there).

From postgresql.conf:
archive_command = '/usr/local/pgsql/bin/archive_command.bash "%p" "%f"'

and:
-----------
$ cat /usr/local/pgsql/bin/archive_command.bash
#!/usr/local/bin/bash

# $1 is the full path to the file to archive # $2 is the filename

WAL=$1
WAL_NAME=$2
WAL_ARCHIVE=/path/to/archive

echo test ! -f $WAL_ARCHIVE/$WAL_NAME && gzip -c $WAL >
$WAL_ARCHIVE/$WAL_NAME.gz
------------

Any obvious holes in there? Is gzip doing something weird? It worked with
all my other WAL segments, and I've tested the backups multiple times.

Also, the WAL segment is exactly the right size, although I suppose there's
no way for it not to be.

Regards,
    Jeff Davis


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster


Re: WAL ends before end time of backup dump

From
Jeff Davis
Date:
On Wed, 2006-11-08 at 09:39 -0500, Woody Woodring wrote:
> Which Sunday?  Could you have been bitten by some DST time shift?  There
> were 2 1ams a couple of weeks ago.
>

Hah, interesting idea, but it wasn't that the basebackup didn't exist,
and it wasn't that I had two of them close together. I had one base
backup, and apparently one of the logs archived at that time is corrupt
(or perhaps archived too soon).

The log file has bytes throughout, but I suppose that doesn't tell me
anything because of the recycling.

It would be handy if there was a way to check whether archived log files
are good or not without trying to go through recovery.

I still really don't know what caused this. I think it may have
something to do with a lot of activity happening during my base backup.
I am sending syslog data to the database, and I imagine a lot of syslog
data is generated by other cron scripts at that time.

Regards,
    Jeff Davis


Re: [solved] WAL ends before end time of backup dump

From
Jeff Davis
Date:
On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > LOG:  restored log file "00000001000000170000002B" from archive
> > LOG:  record with zero length at 17/2B6EACC8
> > LOG:  redo done at 17/2B6EAC84
>
> It looks to me like you archived this log file before it was fully
> written.  You should take a close look at your archiving procedures.

Ok, I found the problem. Completely my mistake, because I was running
through the test without being careful.

On the first recovery attempt, I left the postmaster.conf, including the
archive_command setting, in the data dir along with some archives in
pg_xlog. When I started recovery, it overwrote some of my archived
segments.

The archive script isn't supposed to overwrite existing files, but I had
a typo in the check and didn't realize it because in the normal case
everything was archiving correctly.

Would it make sense to throw an error if there's a recovery.conf and
files already exist in pg_xlog?

Thanks for narrowing down the problem for me.

Regards,
    Jeff Davis


Re: [solved] WAL ends before end time of backup dump

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Would it make sense to throw an error if there's a recovery.conf and
> files already exist in pg_xlog?

No; that's actually a feature in some scenarios (eg, your last few
segments didn't get archived yet).

There's no substitute for debugging your archiver script ;-)

            regards, tom lane