Thread: PITR and Compressed WALS

PITR and Compressed WALS

From
Brian Wipf
Date:
We have two PostgreSQL 8.2.4 servers. On one database, WALs are
archived with a simple script that gzips and transfers them to an NFS
file server. The other database is in perpetual recovery mode,
ungizipping and processing the WALs as they appear and become
complete on the file server. This has been running fine for the past
few days. As soon as the gzipped WAL appears in the archived WAL
directory, I see an entry in the logs that the file has been restored.

Last night, I brought the database out of its perpetual recovery
mode. Here are the lines from the log when this was done:
[2007-10-01 23:43:03 MDT] LOG:  restored log file
"000000010000046600000060" from archive
[2007-10-01 23:45:50 MDT] LOG:  could not open file "pg_xlog/
000000010000046600000061" (log file 1126, segment 97): No such file
or directory
[2007-10-01 23:45:50 MDT] LOG:  redo done at 466/60000070

Which is all fine, since 000000010000046600000060.gz was the last
archived WAL file. The next entry in the log follows:

[2007-10-01 23:45:50 MDT] PANIC:  could not open file "pg_xlog/
000000010000046600000060" (log file 1126, segment 96): No such file
or directory
[2007-10-01 23:45:51 MDT] LOG:  startup process (PID 27624) was
terminated by signal 6
[2007-10-01 23:45:51 MDT] LOG:  aborting startup due to startup
process failure
[2007-10-01 23:45:51 MDT] LOG:  logger shutting down

And the database would not start up. The issue appears to be that the
restore_command script itself ungzips the WAL to its destination %p,
and the WAL is left in the archive directory as
000000010000046600000060.gz. By simply ungzipping the last few WALs
manually in the archive directory, the database replayed them and
started up successfully.

I'm not sure if this should be listed as another caveat on the PITR
recovery page but in the very least I wanted to post to the list so
that others attempting to archive and recover compressed WALs may be
aware of a potential issue.

Brian Wipf
<brian@clickspace.com>


Re: PITR and Compressed WALS

From
Tom Lane
Date:
Brian Wipf <brian@clickspace.com> writes:
> Last night, I brought the database out of its perpetual recovery
> mode. Here are the lines from the log when this was done:
> [2007-10-01 23:43:03 MDT] LOG:  restored log file
> "000000010000046600000060" from archive
> [2007-10-01 23:45:50 MDT] LOG:  could not open file "pg_xlog/
> 000000010000046600000061" (log file 1126, segment 97): No such file
> or directory
> [2007-10-01 23:45:50 MDT] LOG:  redo done at 466/60000070

> Which is all fine, since 000000010000046600000060.gz was the last
> archived WAL file. The next entry in the log follows:

> [2007-10-01 23:45:50 MDT] PANIC:  could not open file "pg_xlog/
> 000000010000046600000060" (log file 1126, segment 96): No such file
> or directory
> [2007-10-01 23:45:51 MDT] LOG:  startup process (PID 27624) was
> terminated by signal 6
> [2007-10-01 23:45:51 MDT] LOG:  aborting startup due to startup
> process failure
> [2007-10-01 23:45:51 MDT] LOG:  logger shutting down

> And the database would not start up. The issue appears to be that the
> restore_command script itself ungzips the WAL to its destination %p,
> and the WAL is left in the archive directory as
> 000000010000046600000060.gz. By simply ungzipping the last few WALs
> manually in the archive directory, the database replayed them and
> started up successfully.

What this sounds like to me is a problem in your recovery procedures.
What exactly did you do to "bring the database out of recovery mode"?
It is expected that it would ask the restore_command script to fetch the
last WAL segment a second time, and I don't understand why that didn't
Just Work.

            regards, tom lane

Re: PITR and Compressed WALS

From
Brian Wipf
Date:
On 3-Oct-07, at 12:38 PM, Tom Lane wrote:
> What this sounds like to me is a problem in your recovery procedures.
> What exactly did you do to "bring the database out of recovery mode"?

The script looked for a trigger file and once found, aborts.
Unfortunately, it would abort without doing the requested copy. I
fixed the bug now. Thanks for your input Tom.

Brian Wipf
ClickSpace Interactive Inc.
<brian@clickspace.com>