Thread: WAL recovery
Hi,
I'm trying to get a WAL recovery system set up so I have a hot-spare database server standing by should my first one fail.
The idea is that every night, the WAL logs for that day will be shipped from the main server to the standby, and the standby will replay them so it stays up to date.
Every week a full backup will be taken of the live system, and stored off-site.
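(To give an idea of what I mean by "shipped": the nightly step would be roughly the following, run from cron on the main server - the standby hostname is just a placeholder.)

#!/bin/sh
# Push the day's archived WAL segments to the standby's transfer area.
rsync -a /mndata/archive/xlog_archive/ standby:/mndata/archive/xlog_archive/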
So far I've got it working so that:
- My full, base backup from yesterday has been loaded onto the spare
- The WAL logs up to 2PM today have been shipped and replayed onto the spare - all OK to here
However, whenever I try to ship more logs and replay them, I get the following error on the last file:
2006-02-22 15:50:00 GMT LOG: starting archive recovery
2006-02-22 15:50:00 GMT LOG: restore_command = "cp /mndata/archive/xlog_archive/%f %p"
cp: cannot stat `/mndata/archive/xlog_archive/00000001.history': No such file or directory
2006-02-22 15:50:00 GMT LOG: restored log file "0000000100000000000000D9" from archive
2006-02-22 15:50:00 GMT LOG: invalid record length at 0/D9FFDB84
2006-02-22 15:50:00 GMT LOG: invalid primary checkpoint record
2006-02-22 15:50:00 GMT LOG: restored log file "0000000100000000000000D9" from archive
2006-02-22 15:50:00 GMT LOG: restored log file "0000000100000000000000DA" from archive
2006-02-22 15:50:00 GMT LOG: invalid resource manager ID in secondary checkpoint record
2006-02-22 15:50:00 GMT PANIC: could not locate a valid checkpoint record
2006-02-22 15:50:00 GMT LOG: startup process (PID 20792) was terminated by signal 6
2006-02-22 15:50:00 GMT LOG: aborting startup due to startup process failure
2006-02-22 15:50:00 GMT LOG: logger shutting down
However, if I delete my PG data directory, restore the same base backup from yesterday, and begin recovery, it recovers right up to and including the last log file, the one on which the previous roll-forward attempt fails.
The log files were fully archived off the live server to begin with, so I can't see that they've changed or anything.
Is this scenario possible - that you can keep rolling forward over log files as long as necessary, or do you always have to start from a base backup? Nothing is changing on the spare, it's literally a sitting duck.
Thanks
Andy
On Wed, 22 Feb 2006, Andy Shellam wrote:

> However, if I delete my PG data directory, restore the same base backup
> from yesterday, and begin recovery, it recovers right up to and including
> the last log file, the one on which the previous roll-forward attempt fails.
> The log files were fully archived off the live server to begin with, so I
> can't see that they've changed or anything.
>
> Is this scenario possible - that you can keep rolling forward over log
> files as long as necessary, or do you always have to start from a base
> backup? Nothing is changing on the spare, it's literally a sitting duck.

You must always start with a base backup when you begin the restore, otherwise the system does not know where to begin the recovery. What I have done in the past on a similar system is keep the secondary DB shut down and make base backups as often as possible, so the recovery time is suitably short when the secondary is needed. It's also a good idea to do test recoveries on occasion to make sure you don't have any corrupt WAL files. If you use rsync, you can make the base backup pretty quickly, depending on how much of your DB changes.

--
Jeff Frost, Owner <jeff@frostconsultingllc.com>
Frost Consulting, LLC http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954
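P.S. For illustration only, the rsync-based base backup I'm describing would go something like this - the data directory path, standby hostname and connection options are just placeholders for your own setup:

#!/bin/sh
# Bracket the copy with pg_start_backup/pg_stop_backup so the result is a
# valid starting point for WAL replay, then push it to the standby.
psql -U postgres -d postgres -c "SELECT pg_start_backup('weekly_base');"
rsync -a --delete --exclude=pg_xlog/ /var/lib/pgsql/data/ standby:/var/lib/pgsql/data/
psql -U postgres -d postgres -c "SELECT pg_stop_backup();"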
Andy,

FWIW, I /think/ the replication scheme you are trying to carry out is implemented in Command Prompt's "Mammoth PostgreSQL Replicator" product. From what I understand, their system will independently copy over and replay "TransactionLog" files (translate: WAL files?) on your slave databases, in a continuous stream or in a batch. They describe a few other features in their product that might make the commercial option worthwhile to try.

CG

--- Jeff Frost <jeff@frostconsultingllc.com> wrote:

> You must always start with a base backup when you begin the restore,
> otherwise the system does not know where to begin the recovery. What I
> have done in the past on a similar system is keep the secondary DB shut
> down and make base backups as often as possible, so the recovery time is
> suitably short when the secondary is needed. It's also a good idea to do
> test recoveries on occasion to make sure you don't have any corrupt WAL
> files. If you use rsync, you can make the base backup pretty quickly,
> depending on how much of your DB changes.
CG wrote:

> Andy,
>
> FWIW, I /think/ the replication scheme you are trying to carry out is
> implemented in Command Prompt's "Mammoth PostgreSQL Replicator" product.
> From what I understand, their system will independently copy over and
> replay "TransactionLog" files (translate: WAL files?) on your slave
> databases, in a continuous stream or in a batch.

This is almost true :). We don't use the WAL, but the other items are correct.

Sincerely,

Joshua D. Drake

> They describe a few other features in their product that might make the
> commercial option worthwhile to try.
>
> CG

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/
On Wed, 2006-02-22 at 16:26 +0000, Andy Shellam wrote:

> Is this scenario possible - that you can keep rolling forward over log
> files as long as necessary, or do you always have to start from a base
> backup? Nothing is changing on the spare, it's literally a sitting duck.

You'll need a restore_command that is a script which sits in a wait loop while the file it is asked for is not yet available, until some other condition occurs, such as notification of switchover (manual or otherwise). When that condition occurs the script should return a non-zero error condition. There should be no recovery_target* settings.

Best Regards, Simon Riggs
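P.S. The recovery.conf hookup is just the usual restore_command line pointing at your script, for example (the script name and path are only placeholders):

restore_command = '/usr/local/bin/wait_for_wal.sh %f %p'

%f and %p are expanded by the server to the file name requested and the path to copy it to, exactly as with the plain cp command you have now - and no recovery_target* lines.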
Hi Simon,
Many thanks for this suggestion - this sounds ideal, actually. My thinking is that I would write a shell script that gets called (via restore_command in recovery.conf) for each file requested ...
a) If the log file requested exists, copy it and exit with 0 status
b) If the file doesn't exist, check if there is a "bring_online" flag file set in the log transfer area ...
... if so, exit with non-zero so the server can be brought online.
... if not, sleep for 5 or 10 minutes - within this time the next log may or may not have arrived
c) After 5 or 10 minutes, repeat at step b
So basically, if I need the server brought up, I can either manually touch bring_online in the log transfer folder and wait for the next 5/10-minute update, or have my monitoring system automatically touch that file when the connection to the primary database is broken, and switch the database server DNS to the standby server.
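Roughly, the script I have in mind would look something like the following - a completely untested first sketch; the script name is made up and the sleep is set to 5 minutes:

#!/bin/sh
# Called from recovery.conf as:  restore_command = '/path/to/fetch_wal.sh %f %p'
# $1 = name of the WAL file the server wants, $2 = path to copy it to.
ARCHIVE=/mndata/archive/xlog_archive
WAL_FILE="$1"
TARGET="$2"

# History/backup-label files may legitimately not be in the archive (see the
# harmless 00000001.history error above) - fail straight away rather than wait.
case "$WAL_FILE" in
    *.history|*.backup)
        [ -f "$ARCHIVE/$WAL_FILE" ] || exit 1
        ;;
esac

while true; do
    # (a) If the requested log file exists, copy it and exit with 0 status.
    if [ -f "$ARCHIVE/$WAL_FILE" ]; then
        cp "$ARCHIVE/$WAL_FILE" "$TARGET"
        exit $?
    fi
    # (b) If the bring_online flag file is set, exit non-zero so recovery
    #     stops and the server can be brought online.
    if [ -f "$ARCHIVE/bring_online" ]; then
        exit 1
    fi
    # (c) Otherwise sleep and try again - the next log may arrive meanwhile.
    sleep 300
done

The monitoring side then only needs to touch /mndata/archive/xlog_archive/bring_online (and flip the DNS over) when it decides the primary is gone.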
Sounds good!?!
Cheers for the idea Simon, now to get coding...!
Andy
Simon Riggs wrote:
> On Wed, 2006-02-22 at 16:26 +0000, Andy Shellam wrote:
>> Is this scenario possible - that you can keep rolling forward over log
>> files as long as necessary, or do you always have to start from a base
>> backup? Nothing is changing on the spare, it's literally a sitting duck.
>
> You'll need a restore_command that is a script which sits in a wait loop
> while the file it is asked for is not yet available, until some other
> condition occurs, such as notification of switchover (manual or otherwise).
> When that condition occurs the script should return a non-zero error
> condition. There should be no recovery_target* settings.
>
> Best Regards, Simon Riggs
--
Andy Shellam
the mail network (Server Support)
0845 838 0879
www.mailnetwork.co.uk
andy.shellam@mailnetwork.co.uk
On Thu, 2006-02-23 at 16:18 +0000, Andy Shellam wrote:

> Cheers for the idea Simon, now to get coding...!

S'OK ... it was designed that way from the start.

Best Regards, Simon Riggs