Thread: WAL archive stopping too early

WAL archive stopping too early

From
Andy Shellam
Date:
I'm trying to set up WAL archiving on my server, and replicate my live database to a hot-spare server using online backups.
However, the postmaster on the hot-spare is stopping too early during the recovery.

In my archived xlog folder on the live server, I have the following xlogs (note the backup history file):

-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:14 000000010000000900000019
-rw-------  1 postgresql datahandler      273 2006-02-16 15:14 000000010000000900000019.0035C2F4.backup
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:14 00000001000000090000001A
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001B
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001C
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001D
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001E
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:25 00000001000000090000001F
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:35 000000010000000900000020
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:38 000000010000000900000021
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:44 000000010000000900000022
-rw-------  1 postgresql datahandler 16777216 2006-02-16 15:46 000000010000000900000023

I have replicated this folder to the xlog_archive folder on the hot-spare.  I have taken a tar of the entire $PGDATA on the live server (during a start_backup and stop_backup), and un-tarred it on the hot-spare (changing the owner/permissions as appropriate.)  Note the pg_xlog was not included in the tar, so I've created the correct structure on the spare (as a symlink to a separate physical disk.)

In my recovery.conf, I've added the single recovery_command and changed the owner to my superuser (postgresql.)

Then when I start the postmaster on the spare, here's what follows in the log:

2006-02-17 08:59:40 GMT LOG:  starting archive recovery
2006-02-17 08:59:40 GMT LOG:  restore_command = "cp /mndata/logs/xlog_archive/%f "%p""
cp: cannot stat `/mndata/logs/xlog_archive/00000001.history': No such file or directory
2006-02-17 08:59:40 GMT LOG:  restored log file "000000010000000900000019.0035C2F4.backup" from archive
2006-02-17 08:59:40 GMT LOG:  restored log file "000000010000000900000019" from archive
2006-02-17 08:59:40 GMT LOG:  checkpoint record is at 9/1935C2F4
2006-02-17 08:59:40 GMT LOG:  redo record is at 9/1935C2F4; undo record is at 0/0; shutdown FALSE
2006-02-17 08:59:40 GMT LOG:  next transaction ID: 795500; next OID: 37184
2006-02-17 08:59:40 GMT LOG:  next MultiXactId: 1; next MultiXactOffset: 0
2006-02-17 08:59:40 GMT LOG:  automatic recovery in progress
2006-02-17 08:59:40 GMT LOG:  redo starts at 9/1935C338
2006-02-17 08:59:42 GMT LOG:  restored log file "00000001000000090000001A" from archive
2006-02-17 08:59:44 GMT LOG:  restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG:  record with zero length at 9/1BEB2430
2006-02-17 09:00:09 GMT LOG:  redo done at 9/1BEB23EC
2006-02-17 09:00:09 GMT LOG:  restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG:  archive recovery complete
2006-02-17 09:00:20 GMT LOG:  database system is ready
2006-02-17 09:00:20 GMT LOG:  transaction ID wrap limit is 1074536488, limited by database "dspam"

Note there are plenty more files ready to run in the archive, and the contents of the backup label only mention 1A as the final WAL file of the backup, so it is starting to read the next WAL file, but stopping for some reason.  Is it because of the "record with zero length" that's causing it to fail?  Also why is it restoring 1B from the archive twice?  There is a committed transaction in the live database that is not appearing in the hot-spare after the restore, which is how I know it's not completing correctly.

Here's the backup label file contents:

START WAL LOCATION: 9/1935C2F4 (file 000000010000000900000019)
STOP WAL LOCATION: 9/1AC3DCF8 (file 00000001000000090000001A)
CHECKPOINT LOCATION: 9/1935C2F4
START TIME: 2006-02-16 11:39:57 GMT
LABEL: /perth/backup/db/base_20060216.tar.gz
STOP TIME: 2006-02-16 12:13:49 GMT

Can anyone shed any light?

Many thanks

Andy Shellam
the mail network - server support

Re: WAL archive stopping too early

From
Tom Lane
Date:
Andy Shellam <andy.shellam@mailnetwork.co.uk> writes:
> Note there are plenty more files ready to run in the archive, and the
> contents of the backup label only mention 1A as the final WAL file of
> the backup, so it is starting to read the next WAL file, but stopping
> for some reason.  Is it because of the "record with zero length" that's
> causing it to fail?

Yes --- that indicates the end of archive.  It sounds to me like file 1B
was archived while still only partly filled.  Check the process you're
using for deciding when to copy WAL files.

> Also why is it restoring 1B from the archive twice?

I think that's expected behavior --- it has something to do with wanting
to make the copy of 1B into the new writable front WAL segment.  You can
dig through xlog.c if you want the details.

            regards, tom lane

Re: WAL archive stopping too early

From
Andy Shellam
Date:
Hi Tom,

Many thanks for the reply.  I dug through the log files and found that the archiving had been failing, so it stands a good chance I'd pulled 1B out of the xlog directory ahead of time.
Over the course of the weekend (once I've upgraded to v8.1.3) I'll try the exercise again and see what happens.

As for digging through xlog.c, I'm not a C programmer so I'll stay right out of the way - leave that to you guys!!

Regards

Andy

Tom Lane wrote:
Andy Shellam <andy.shellam@mailnetwork.co.uk> writes: 
Note there are plenty more files ready to run in the archive, and the 
contents of the backup label only mention 1A as the final WAL file of 
the backup, so it is starting to read the next WAL file, but stopping 
for some reason.  Is it because of the "record with zero length" that's 
causing it to fail?   
Yes --- that indicates the end of archive.  It sounds to me like file 1B
was archived while still only partly filled.  Check the process you're
using for deciding when to copy WAL files.
 
Also why is it restoring 1B from the archive twice?   
I think that's expected behavior --- it has something to do with wanting
to make the copy of 1B into the new writable front WAL segment.  You can
dig through xlog.c if you want the details.
		regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

!DSPAM:43f5e383306306701047853!

 

--
the mail network - an alternative in a standardised worldAndy Shellam
the mail network (Server Support)

0 845 838 0879  Phone
www.mailnetwork.co.uk  Web
andy.shellam@mailnetwork.co.uk  E-mail
Attachment

Re: WAL archive stopping too early

From
Alvaro Herrera
Date:
Andy Shellam wrote:

Hi,

> As for digging through xlog.c, I'm not a C programmer so I'll stay right
> out of the way - leave that to you guys!!

You don't need to be a C programmer -- just read the code commentary.
It's easy to follow most of the time.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.