Thread: WAL archive stopping too early
I'm trying to set up WAL archiving on my server, and replicate my live database to a hot-spare server using online backups.
However, the postmaster on the hot-spare is stopping too early during the recovery.
In my archived xlog folder on the live server, I have the following xlogs (note the backup history file):
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14 000000010000000900000019
-rw------- 1 postgresql datahandler 273 2006-02-16 15:14 000000010000000900000019.0035C2F4.backup
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14 00000001000000090000001A
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001B
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001C
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001D
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001E
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:25 00000001000000090000001F
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:35 000000010000000900000020
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:38 000000010000000900000021
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:44 000000010000000900000022
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:46 000000010000000900000023
I have replicated this folder to the xlog_archive folder on the hot-spare. I have taken a tar of the entire $PGDATA on the live server (during a start_backup and stop_backup), and un-tarred it on the hot-spare (changing the owner/permissions as appropriate.) Note the pg_xlog was not included in the tar, so I've created the correct structure on the spare (as a symlink to a separate physical disk.)
In my recovery.conf, I've added the single recovery_command and changed the owner to my superuser (postgresql.)
Then when I start the postmaster on the spare, here's what follows in the log:
2006-02-17 08:59:40 GMT LOG: starting archive recovery
2006-02-17 08:59:40 GMT LOG: restore_command = "cp /mndata/logs/xlog_archive/%f "%p""
cp: cannot stat `/mndata/logs/xlog_archive/00000001.history': No such file or directory
2006-02-17 08:59:40 GMT LOG: restored log file "000000010000000900000019.0035C2F4.backup" from archive
2006-02-17 08:59:40 GMT LOG: restored log file "000000010000000900000019" from archive
2006-02-17 08:59:40 GMT LOG: checkpoint record is at 9/1935C2F4
2006-02-17 08:59:40 GMT LOG: redo record is at 9/1935C2F4; undo record is at 0/0; shutdown FALSE
2006-02-17 08:59:40 GMT LOG: next transaction ID: 795500; next OID: 37184
2006-02-17 08:59:40 GMT LOG: next MultiXactId: 1; next MultiXactOffset: 0
2006-02-17 08:59:40 GMT LOG: automatic recovery in progress
2006-02-17 08:59:40 GMT LOG: redo starts at 9/1935C338
2006-02-17 08:59:42 GMT LOG: restored log file "00000001000000090000001A" from archive
2006-02-17 08:59:44 GMT LOG: restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: record with zero length at 9/1BEB2430
2006-02-17 09:00:09 GMT LOG: redo done at 9/1BEB23EC
2006-02-17 09:00:09 GMT LOG: restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: archive recovery complete
2006-02-17 09:00:20 GMT LOG: database system is ready
2006-02-17 09:00:20 GMT LOG: transaction ID wrap limit is 1074536488, limited by database "dspam"
Note there are plenty more files ready to run in the archive, and the contents of the backup label only mention 1A as the final WAL file of the backup, so it is starting to read the next WAL file, but stopping for some reason. Is it because of the "record with zero length" that's causing it to fail? Also why is it restoring 1B from the archive twice? There is a committed transaction in the live database that is not appearing in the hot-spare after the restore, which is how I know it's not completing correctly.
Here's the backup label file contents:
START WAL LOCATION: 9/1935C2F4 (file 000000010000000900000019)
STOP WAL LOCATION: 9/1AC3DCF8 (file 00000001000000090000001A)
CHECKPOINT LOCATION: 9/1935C2F4
START TIME: 2006-02-16 11:39:57 GMT
LABEL: /perth/backup/db/base_20060216.tar.gz
STOP TIME: 2006-02-16 12:13:49 GMT
Can anyone shed any light?
Many thanks
Andy Shellam
the mail network - server support
However, the postmaster on the hot-spare is stopping too early during the recovery.
In my archived xlog folder on the live server, I have the following xlogs (note the backup history file):
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14 000000010000000900000019
-rw------- 1 postgresql datahandler 273 2006-02-16 15:14 000000010000000900000019.0035C2F4.backup
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:14 00000001000000090000001A
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001B
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001C
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001D
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:50 00000001000000090000001E
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:25 00000001000000090000001F
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:35 000000010000000900000020
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:38 000000010000000900000021
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:44 000000010000000900000022
-rw------- 1 postgresql datahandler 16777216 2006-02-16 15:46 000000010000000900000023
I have replicated this folder to the xlog_archive folder on the hot-spare. I have taken a tar of the entire $PGDATA on the live server (during a start_backup and stop_backup), and un-tarred it on the hot-spare (changing the owner/permissions as appropriate.) Note the pg_xlog was not included in the tar, so I've created the correct structure on the spare (as a symlink to a separate physical disk.)
In my recovery.conf, I've added the single recovery_command and changed the owner to my superuser (postgresql.)
Then when I start the postmaster on the spare, here's what follows in the log:
2006-02-17 08:59:40 GMT LOG: starting archive recovery
2006-02-17 08:59:40 GMT LOG: restore_command = "cp /mndata/logs/xlog_archive/%f "%p""
cp: cannot stat `/mndata/logs/xlog_archive/00000001.history': No such file or directory
2006-02-17 08:59:40 GMT LOG: restored log file "000000010000000900000019.0035C2F4.backup" from archive
2006-02-17 08:59:40 GMT LOG: restored log file "000000010000000900000019" from archive
2006-02-17 08:59:40 GMT LOG: checkpoint record is at 9/1935C2F4
2006-02-17 08:59:40 GMT LOG: redo record is at 9/1935C2F4; undo record is at 0/0; shutdown FALSE
2006-02-17 08:59:40 GMT LOG: next transaction ID: 795500; next OID: 37184
2006-02-17 08:59:40 GMT LOG: next MultiXactId: 1; next MultiXactOffset: 0
2006-02-17 08:59:40 GMT LOG: automatic recovery in progress
2006-02-17 08:59:40 GMT LOG: redo starts at 9/1935C338
2006-02-17 08:59:42 GMT LOG: restored log file "00000001000000090000001A" from archive
2006-02-17 08:59:44 GMT LOG: restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: record with zero length at 9/1BEB2430
2006-02-17 09:00:09 GMT LOG: redo done at 9/1BEB23EC
2006-02-17 09:00:09 GMT LOG: restored log file "00000001000000090000001B" from archive
2006-02-17 09:00:09 GMT LOG: archive recovery complete
2006-02-17 09:00:20 GMT LOG: database system is ready
2006-02-17 09:00:20 GMT LOG: transaction ID wrap limit is 1074536488, limited by database "dspam"
Note there are plenty more files ready to run in the archive, and the contents of the backup label only mention 1A as the final WAL file of the backup, so it is starting to read the next WAL file, but stopping for some reason. Is it because of the "record with zero length" that's causing it to fail? Also why is it restoring 1B from the archive twice? There is a committed transaction in the live database that is not appearing in the hot-spare after the restore, which is how I know it's not completing correctly.
Here's the backup label file contents:
START WAL LOCATION: 9/1935C2F4 (file 000000010000000900000019)
STOP WAL LOCATION: 9/1AC3DCF8 (file 00000001000000090000001A)
CHECKPOINT LOCATION: 9/1935C2F4
START TIME: 2006-02-16 11:39:57 GMT
LABEL: /perth/backup/db/base_20060216.tar.gz
STOP TIME: 2006-02-16 12:13:49 GMT
Can anyone shed any light?
Many thanks
Andy Shellam
the mail network - server support
Andy Shellam <andy.shellam@mailnetwork.co.uk> writes: > Note there are plenty more files ready to run in the archive, and the > contents of the backup label only mention 1A as the final WAL file of > the backup, so it is starting to read the next WAL file, but stopping > for some reason. Is it because of the "record with zero length" that's > causing it to fail? Yes --- that indicates the end of archive. It sounds to me like file 1B was archived while still only partly filled. Check the process you're using for deciding when to copy WAL files. > Also why is it restoring 1B from the archive twice? I think that's expected behavior --- it has something to do with wanting to make the copy of 1B into the new writable front WAL segment. You can dig through xlog.c if you want the details. regards, tom lane
Hi Tom,
Many thanks for the reply. I dug through the log files and found that the archiving had been failing, so it stands a good chance I'd pulled 1B out of the xlog directory ahead of time.
Over the course of the weekend (once I've upgraded to v8.1.3) I'll try the exercise again and see what happens.
As for digging through xlog.c, I'm not a C programmer so I'll stay right out of the way - leave that to you guys!!
Regards
Andy
Tom Lane wrote:
Many thanks for the reply. I dug through the log files and found that the archiving had been failing, so it stands a good chance I'd pulled 1B out of the xlog directory ahead of time.
Over the course of the weekend (once I've upgraded to v8.1.3) I'll try the exercise again and see what happens.
As for digging through xlog.c, I'm not a C programmer so I'll stay right out of the way - leave that to you guys!!
Regards
Andy
Tom Lane wrote:
Andy Shellam <andy.shellam@mailnetwork.co.uk> writes:Note there are plenty more files ready to run in the archive, and the contents of the backup label only mention 1A as the final WAL file of the backup, so it is starting to read the next WAL file, but stopping for some reason. Is it because of the "record with zero length" that's causing it to fail?Yes --- that indicates the end of archive. It sounds to me like file 1B was archived while still only partly filled. Check the process you're using for deciding when to copy WAL files.Also why is it restoring 1B from the archive twice?I think that's expected behavior --- it has something to do with wanting to make the copy of 1B into the new writable front WAL segment. You can dig through xlog.c if you want the details. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster !DSPAM:43f5e383306306701047853!
--
Andy Shellam the mail network (Server Support) | ||
0 845 838 0879 | ||
www.mailnetwork.co.uk | ||
andy.shellam@mailnetwork.co.uk |
Attachment
Andy Shellam wrote: Hi, > As for digging through xlog.c, I'm not a C programmer so I'll stay right > out of the way - leave that to you guys!! You don't need to be a C programmer -- just read the code commentary. It's easy to follow most of the time. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.