Thread: problem with WAL files

problem with WAL files

From
Romain Thouvenin
Date:
I'm currently testing PITR features of 8.0 release. I followed the steps
given by the documentation and thus deleted the WAL files in pg_xlog
after having restored the database.
But when I put the recovery.conf file and try to start postgres, it
fails. The log file says :

2005-06-30 17:57:50 LOG:  database system was shut down at 2005-06-30
17:57:26 Paris, Madrid
2005-06-30 17:57:50 LOG:  starting archive recovery
2005-06-30 17:57:50 LOG:  restore_command = "copy E:\\temp\\archive\\%f
"%p""
2005-06-30 17:57:50 LOG:  recovery_target_time = 2005-07-01 00:45:00+02
2005-06-30 17:57:50 LOG:  could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/00000001000000000000000B" (log file 0,
segment 11): No such file or directory
2005-06-30 17:57:50 LOG:  invalid primary checkpoint record
2005-06-30 17:57:50 LOG:  could not open file "C:/Program
Files/PostgreSQL/8.0/data/pg_xlog/00000001000000000000000B" (log file 0,
segment 11): No such file or directory
2005-06-30 17:57:50 LOG:  invalid secondary checkpoint record
2005-06-30 17:57:50 PANIC:  could not locate a valid checkpoint record

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.
2005-06-30 17:57:50 LOG:  startup process (PID 356) was terminated by
signal 3
2005-06-30 17:57:50 LOG:  aborting startup due to startup process failure
2005-06-30 17:57:51 LOG:  logger shutting down

What's the problem ?
I have an idea, but can't solve it :
When I want to restore the database, the "psql -U postgres essai <
save.txt" command fails because it seems that psql read the password in
save.txt (it don't let me type it) and fails to connect. So I first
connected and then restored with the \i command. Maybe this way of
proceeding alter WAL files in a bad way ?

I tryed many ways of restoring and deleting/putting WAL files, but all
failed. Can somebody help me ?

Thanks.

Romain Thouvenin


Re: problem with WAL files

From
Jeff Frost
Date:
Romain,

This is a problem I ran into during testing as well.  It seems that your base
backup must also include the in use wal file.  IN your case:
00000001000000000000000B.  So, I think the restore documentation is slightly
misleading. In my experience, what you really need to do is remove all wal
files from your pg_xlog directory, then put in the in use wal file from when
you did your base backup (00000001000000000000000B in your case), put in the
recovery.conf and then start postgresql.  This procedure works well for me,
but I didn't quite grasp that initially when reading the documentation on how
to backup/restore with PITR.

On Thu, 30 Jun 2005, Romain Thouvenin wrote:

> But when I put the recovery.conf file and try to start postgres, it fails.
> The log file says :
>
> 2005-06-30 17:57:50 LOG:  could not open file "C:/Program
> Files/PostgreSQL/8.0/data/pg_xlog/00000001000000000000000B" (log file 0,
> segment 11): No such file or directory

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: problem with WAL files

From
Tom Lane
Date:
Romain Thouvenin <rthouvenin@sfwan.com> writes:
> I'm currently testing PITR features of 8.0 release. I followed the steps
> given by the documentation and thus deleted the WAL files in pg_xlog
> after having restored the database.

But you have the WAL files in your archive area, right?

> 2005-06-30 17:57:50 LOG:  restore_command = "copy E:\\temp\\archive\\%f
> "%p""
> 2005-06-30 17:57:50 LOG:  recovery_target_time = 2005-07-01 00:45:00+02
> 2005-06-30 17:57:50 LOG:  could not open file "C:/Program
> Files/PostgreSQL/8.0/data/pg_xlog/00000001000000000000000B" (log file 0,
> segment 11): No such file or directory

Apparently it wasn't able to get the 00000001000000000000000B segment
from the archive either.  The displayed restore_command looks a bit
bogus to me: too many \'s.  Maybe you need to correct that?

            regards, tom lane

Re: problem with WAL files

From
Jeff Frost
Date:
On Thu, 30 Jun 2005, Tom Lane wrote:

> Apparently it wasn't able to get the 00000001000000000000000B segment
> from the archive either.  The displayed restore_command looks a bit
> bogus to me: too many \'s.  Maybe you need to correct that?

Tom,

When I tested with our system, it would not attempt to retrieve that first WAL
via the restore command.  All the ones after that it was happy to do it for,
but if that one did not exist, it told me to get lost.  That was on 8.0.3.  I
did not test any other versions.

Putting it in the pg_xlog dir made everything go smoothly though.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: problem with WAL files

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> When I tested with our system, it would not attempt to retrieve that first WAL
> via the restore command.  All the ones after that it was happy to do it for,
> but if that one did not exist, it told me to get lost.

Would you be more specific please?  Looking at the code, I can't see a
reason that the first one would be treated differently from others.
What *exactly* did you get?

            regards, tom lane

Re: problem with WAL files

From
Jeff Frost
Date:
On Thu, 30 Jun 2005, Tom Lane wrote:

> Would you be more specific please?  Looking at the code, I can't see a
> reason that the first one would be treated differently from others.
> What *exactly* did you get?

Unfortunately, I do not have my output saved, but here's basically what
happened:

I made a base backup as such:

* psql to start backup
* rsync pgdata dir with --exclude pg_xlog
* psql to stop backup

Copy this to new server's pgdata dir, put in recovery.conf and start postgres.
Postgres reports that it could not find wal archive (which would be the in-use
wal during the base backup) and exits fatal.  Copying the wal archive to the
archive dir received the same warning; however, copying it to the pg_xlog dir
allowed the restore to begin and the archived wal files after that happily
started replaying.

I'll have to retest that particular scenario and verify that something else
was not causing the problem when I have a chance now that I have a fully
functional and scripted out base backup and recover solution; however, it did
not give me any of the errors I see when my walrestore.sh script cannot access
an archived wal, so I would believe it did not even try to get it out of the
archive.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: problem with WAL files

From
Tom Lane
Date:
Jeff Frost <jeff@frostconsultingllc.com> writes:
> I'll have to retest that particular scenario and verify that something
> else was not causing the problem when I have a chance now that I have
> a fully functional and scripted out base backup and recover solution;
> however, it did not give me any of the errors I see when my
> walrestore.sh script cannot access an archived wal, so I would believe
> it did not even try to get it out of the archive.

As far as I can see from the code, it *should* try to get it out of
the archive ... and I'm quite sure we tested that case during 8.0
development ... so I suspect there is some other contributing factor in
your configuration.  If you could put together a self-contained test
case I'd be happy to dig into it, because it sounds like a bug to me.

            regards, tom lane

Re: problem with WAL files

From
Jeff Frost
Date:
On Thu, 30 Jun 2005, Tom Lane wrote:

> Jeff Frost <jeff@frostconsultingllc.com> writes:
>> I'll have to retest that particular scenario and verify that something
>> else was not causing the problem when I have a chance now that I have
>> a fully functional and scripted out base backup and recover solution;
>> however, it did not give me any of the errors I see when my
>> walrestore.sh script cannot access an archived wal, so I would believe
>> it did not even try to get it out of the archive.
>

Tom,

Finally got a chance to give this another full test and discovered that the
WAL files work fine in the archive dir, but I uncovered a different problem.
What had happened was that when I rsynced the walarchive and walrestore
scripts over to the server, rsync made them mode 644 as they were in my cvs
sandbox and thus not executable.  When postgres went to call the script, I
guess it stat'd the walrestore, saw that it was not executable and quietly
reported that it could not find the WAL file in the pg_xlog dir.  Would
probably be handy to throw a warning that the restore command was not
executable, no?  Sorry for the false alarm before.

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: problem with WAL files

From
Bruce Momjian
Date:
Jeff Frost wrote:
> On Thu, 30 Jun 2005, Tom Lane wrote:
>
> > Jeff Frost <jeff@frostconsultingllc.com> writes:
> >> I'll have to retest that particular scenario and verify that something
> >> else was not causing the problem when I have a chance now that I have
> >> a fully functional and scripted out base backup and recover solution;
> >> however, it did not give me any of the errors I see when my
> >> walrestore.sh script cannot access an archived wal, so I would believe
> >> it did not even try to get it out of the archive.
> >
>
> Tom,
>
> Finally got a chance to give this another full test and discovered that the
> WAL files work fine in the archive dir, but I uncovered a different problem.
> What had happened was that when I rsynced the walarchive and walrestore
> scripts over to the server, rsync made them mode 644 as they were in my cvs
> sandbox and thus not executable.  When postgres went to call the script, I
> guess it stat'd the walrestore, saw that it was not executable and quietly
> reported that it could not find the WAL file in the pg_xlog dir.  Would
> probably be handy to throw a warning that the restore command was not
> executable, no?  Sorry for the false alarm before.

We basically run the command you give us via system(), so it is hard for
us to know exactly why it failed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: problem with WAL files

From
Jeff Frost
Date:
On Wed, 6 Jul 2005, Bruce Momjian wrote:

> We basically run the command you give us via system(), so it is hard for
> us to know exactly why it failed.

Understandable..but it would at least be nice to know that it failed.  It does
not appear to log anything at this point.


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: problem with WAL files

From
Bruce Momjian
Date:
Jeff Frost wrote:
> On Wed, 6 Jul 2005, Bruce Momjian wrote:
>
> > We basically run the command you give us via system(), so it is hard for
> > us to know exactly why it failed.
>
> Understandable..but it would at least be nice to know that it failed.  It does
> not appear to log anything at this point.

That is strange.  You said it reported the file did not exist.  When the
command fails, we assume it doesn't exist.  Not sure what else we can
do.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: problem with WAL files

From
Jeff Frost
Date:
On Wed, 6 Jul 2005, Bruce Momjian wrote:

> That is strange.  You said it reported the file did not exist.  When the
> command fails, we assume it doesn't exist.  Not sure what else we can
> do.

I'm sorry, I should have copied and pasted the logs.  It did not report a
failure to copy from the archive dir, it reported that the wal file did not
exist in pg_xlog.  Here's a snip from the log:


Jul  5 22:14:06 db1 postgres[26299]: [1-1] LOG:  database system was interrupted
  at 2005-07-05 23:13:54 PDT
Jul  5 22:14:06 db1 postgres[26299]: [2-1] LOG:  starting archive recovery
Jul  5 22:14:06 db1 postgres[26299]: [3-1] LOG:  restore_command = "/usr/local/b
in/walrestore.sh %f %p"
Jul  5 22:14:06 db1 postgres[26299]: [4-1] LOG:  could not open file "/usr/local
/pgsql/data/pg_xlog/00000001000000060000005F" (log file 6, segment 95): No such
file or
Jul  5 22:14:06 db1 postgres[26299]: [4-2]  directory
Jul  5 22:14:06 db1 postgres[26299]: [5-1] LOG:  invalid checkpoint record
Jul  5 22:14:06 db1 postgres[26299]: [6-1] PANIC:  could not locate required
checkpoint record


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954