Thread: WAL recovery

WAL recovery

From
Andy Shellam
Date:
Hi,

I'm trying to get a WAL recovery system set up so I have a hot-spare database server standing by should my first one fail.
The idea is that every night, over night, the WAL logs for that day will be shipped from the main server to the standby, and the standby will replay them so it is up to date.

Every week a full backup will be taken of the live system, and stored off-site.

So far I've got it working so that:

- My full, base backup from yesterday has been loaded onto the spare
- The WAL logs up to 2PM today have been shipped and replayed onto the spare - all OK to here

However, whenever I try to ship more logs and play them, I get the following error in the final file:

2006-02-22 15:50:00 GMT LOG:  starting archive recovery
2006-02-22 15:50:00 GMT LOG:  restore_command = "cp /mndata/archive/xlog_archive/%f %p"
cp: cannot stat `/mndata/archive/xlog_archive/00000001.history': No such file or directory
2006-02-22 15:50:00 GMT LOG:  restored log file "0000000100000000000000D9" from archive
2006-02-22 15:50:00 GMT LOG:  invalid record length at 0/D9FFDB84
2006-02-22 15:50:00 GMT LOG:  invalid primary checkpoint record
2006-02-22 15:50:00 GMT LOG:  restored log file "0000000100000000000000D9" from archive
2006-02-22 15:50:00 GMT LOG:  restored log file "0000000100000000000000DA" from archive
2006-02-22 15:50:00 GMT LOG:  invalid resource manager ID in secondary checkpoint record
2006-02-22 15:50:00 GMT PANIC:  could not locate a valid checkpoint record
2006-02-22 15:50:00 GMT LOG:  startup process (PID 20792) was terminated by signal 6
2006-02-22 15:50:00 GMT LOG:  aborting startup due to startup process failure
2006-02-22 15:50:00 GMT LOG:  logger shutting down

However, if I delete my PG data directory, restore the same base backup from yesterday, and begin recovery, it recovers right up until the last log file, which the previous roll-forward attempt fails.
The log files were fully archived off the live server to begin with so I can't see it's that they've changed or anything.

Is this scenario possible - that you can keep rolling forward over log files as long as necessary, or do you always have to start from a base backup?  Nothing is changing on the spare, it's literally a sitting duck.

Thanks

Andy

Re: WAL recovery

From
Jeff Frost
Date:
On Wed, 22 Feb 2006, Andy Shellam wrote:

> However, if I delete my PG data directory, restore the same base backup from
> yesterday, and begin recovery, it recovers right up until the last log file,
> which the previous roll-forward attempt fails.
> The log files were fully archived off the live server to begin with so I
> can't see it's that they've changed or anything.
>
> Is this scenario possible - that you can keep rolling forward over log files
> as long as necessary, or do you always have to start from a base backup?
> Nothing is changing on the spare, it's literally a sitting duck.

You must always start with a base backup when you begin the restore otherwise
the system does not know where to begin the recovery.  What I have done in the
past on a similar system is keep the secondary DB shutdown and make base
backups as often as possible so the recovery time is suitably short when the
secondary is needed.  Also a good idea to do test recovers on occassion to
make sure you don't have any corrupt WAL files.  If you use rsync, then you
can make the base backup pretty quickly depending on how much of your DB
changes.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: WAL recovery

From
CG
Date:
Andy,

FWIW, I /think/ the replication scheme you are trying to carry out is
implemented in Command Prompt's "Mammoth PostgreSQL Replicator" product. From
what I understand, their system will independently copy over and replay
"TransactionLog" files (translate: WAL files?) on your slave databases in a
continuous stream or in a batch. They describe a few other features in their
product that might make the commercial option worthwhile to try.

CG

--- Jeff Frost <jeff@frostconsultingllc.com> wrote:

> On Wed, 22 Feb 2006, Andy Shellam wrote:
>
> > However, if I delete my PG data directory, restore the same base backup
> from
> > yesterday, and begin recovery, it recovers right up until the last log
> file,
> > which the previous roll-forward attempt fails.
> > The log files were fully archived off the live server to begin with so I
> > can't see it's that they've changed or anything.
> >
> > Is this scenario possible - that you can keep rolling forward over log
> files
> > as long as necessary, or do you always have to start from a base backup?
> > Nothing is changing on the spare, it's literally a sitting duck.
>
> You must always start with a base backup when you begin the restore otherwise
>
> the system does not know where to begin the recovery.  What I have done in
> the
> past on a similar system is keep the secondary DB shutdown and make base
> backups as often as possible so the recovery time is suitably short when the
> secondary is needed.  Also a good idea to do test recovers on occassion to
> make sure you don't have any corrupt WAL files.  If you use rsync, then you
> can make the base backup pretty quickly depending on how much of your DB
> changes.
>
> --
> Jeff Frost, Owner     <jeff@frostconsultingllc.com>
> Frost Consulting, LLC     http://www.frostconsultingllc.com/
> Phone: 650-780-7908    FAX: 650-649-1954
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

Re: WAL recovery

From
"Joshua D. Drake"
Date:
CG wrote:
> Andy,
>
> FWIW, I /think/ the replication scheme you are trying to carry out is
> implemented in Command Prompt's "Mammoth PostgreSQL Replicator" product. From
> what I understand, their system will independently copy over and replay
> "TransactionLog" files (translate: WAL files?) on your slave databases in a
> continuous stream or in a batch.
This is almost true :). We don't use the WAL but other the other items
are correct.

Sincerely,

Joshua D. Drake
>  They describe a few other features in their
> product that might make the commercial option worthwhile to try.
>
> CG
>


--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/


Re: WAL recovery

From
Simon Riggs
Date:
On Wed, 2006-02-22 at 16:26 +0000, Andy Shellam wrote:

> Is this scenario possible - that you can keep rolling forward over log
> files as long as necessary, or do you always have to start from a base
> backup?  Nothing is changing on the spare, it's literally a sitting
> duck.

You'll need a restore_command that is a script that sits in a wait loop
when the file it is asked for is not available yet, or other conditions
have occurred such as notification of switchover (manually or
otherwise). When those conditions occur the script should return a
non-zero error condition.

There should be no recovery_target* settings.

Best Regards, Simon Riggs



Re: WAL recovery

From
Andy Shellam
Date:
Hi Simon,

Many thanks for this suggestion - this sounds ideal actually.  My thoughts on this are, I would write a shell script that gets called for each file requested in the recovery.conf ...

a) If the log file requested exists, copy it and exit with 0 status
b) If the file doesn't exist, check if there is a "bring_online" flag file set in the log transfer area ...
... if so, exit with non-zero so the server can be bought online.
... if not, sleep for 5 or 10 minutes - within this time the next log may or may not have arrived
c) After 5 or 10 minutes, repeat at step b

So basically if I need the server bringing up, I can either manually touch bring_online in the log transfer folder and wait for the next 5/10 minute update, or have my monitoring system automatically touch that file when the connection to the primary database is broken, and switch the database server DNS to the standby server.

Sounds good!?!

Cheers for the idea Simon, now to get coding...!

Andy

Simon Riggs wrote:
On Wed, 2006-02-22 at 16:26 +0000, Andy Shellam wrote:
 
Is this scenario possible - that you can keep rolling forward over log
files as long as necessary, or do you always have to start from a base
backup?  Nothing is changing on the spare, it's literally a sitting
duck.   
You'll need a restore_command that is a script that sits in a wait loop
when the file it is asked for is not available yet, or other conditions
have occurred such as notification of switchover (manually or
otherwise). When those conditions occur the script should return a
non-zero error condition.

There should be no recovery_target* settings.

Best Regards, Simon Riggs



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
              http://archives.postgresql.org

!DSPAM:43fdc5ac12881700010740!

 

--
the mail network - an alternative in a standardised worldAndy Shellam
the mail network (Server Support)

0 845 838 0879  Phone
www.mailnetwork.co.uk  Web
andy.shellam@mailnetwork.co.uk  E-mail
Attachment

Re: WAL recovery

From
Simon Riggs
Date:
On Thu, 2006-02-23 at 16:18 +0000, Andy Shellam wrote:

> Cheers for the idea Simon, now to get coding...!

S'OK ... it was designed that way from the start.

Best Regards, Simon Riggs