Thread: Minor bug/inconveniance with restore from backup, using PITR base backup and archived wal files

I've recently been testing our backup/restore procedures, and discovered
a minor inconvenience.

I emptied out the data directory(on a test-box), and restored it from a
backup. I made sure that pg_xlog and pg_xlog/archive_status was empty.
I then set up the recovery.conf file in the root of the data directory,
and pointed restore_command in the direction of the archived wal files.

After starting postgresql, the cluster started restoring from the
archive logs, and everything looks fine, and doing what it should. The
problem came at the end of the recovery, after the log line:

[2008-09-23 15:33:14.764 CEST] [pgtest01] [:] [] [18393] [] LOG: archive
recovery complete

followed immediately after by this line:

  [2008-09-24 13:04:52.168 CEST] pgtest01] [:] [] [13324] [] PANIC:
could not open file "pg_xlog/00000002000000030000009F" (log file 3,
segment 159): Permission denied

and, the cluster shutting down.
After a bit of looking around, and with some help from the fine people
in #postgresql on freenode, I think I figured out what was going on.
The last wal archive file was 00000001000000030000009F, and after
finishing recovery, postgresql created the file 00000002000000030000009F
(ie. 00000002 instead of 00000001) in pg_xlog. The wal-files were
archived read-only, and this file permission seemed to be carried over
to the new file created by postgresql in pg_xlog, causing the cluster to
fall over and die. Changing permissions on the last wal-file in the
archive directory to read/write made this problem go away. Moving the
last wal file from the archive folder to pg_xlog/ or, if I had partial
archive log files in pg_xlog, also made this problem go away,
irrespective of whether the file was chmod read only, or read/write.
Simply chmoding the wal-file in pg_xlog/, and trying to restart the
cluster did not work, and I found no other way to recover from this,
other than to start over again from the beginning.


--
Tommy Gildseth

Tommy Gildseth wrote:
> I've recently been testing our backup/restore procedures, and discovered
> a minor inconvenience.

Running 8.2.9 btw


--
Tommy Gildseth

Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
> ... problem came at the end of the recovery, after the log line:

> [2008-09-23 15:33:14.764 CEST] [pgtest01] [:] [] [18393] [] LOG: archive
> recovery complete

> followed immediately after by this line:

>   [2008-09-24 13:04:52.168 CEST] pgtest01] [:] [] [13324] [] PANIC:
> could not open file "pg_xlog/00000002000000030000009F" (log file 3,
> segment 159): Permission denied

> and, the cluster shutting down.
> After a bit of looking around, and with some help from the fine people
> in #postgresql on freenode, I think I figured out what was going on.
> The last wal archive file was 00000001000000030000009F, and after
> finishing recovery, postgresql created the file 00000002000000030000009F
> (ie. 00000002 instead of 00000001) in pg_xlog.

It's customary for PG to "create" new XLOG segments by recycling old
ones.

> The wal-files were
> archived read-only, and this file permission seemed to be carried over
> to the new file created by postgresql in pg_xlog, causing the cluster to
> fall over and die.

I would say that the bug is in your restore script: it should have made
sure that the files it copies into the xlog directory are given the
right ownership/permissions.

            regards, tom lane

Tom Lane wrote:
> Tommy Gildseth <tommy.gildseth@usit.uio.no> writes:
>> After a bit of looking around, and with some help from the fine people
>> in #postgresql on freenode, I think I figured out what was going on.
>> The last wal archive file was 00000001000000030000009F, and after
>> finishing recovery, postgresql created the file 00000002000000030000009F
>> (ie. 00000002 instead of 00000001) in pg_xlog.
>
> It's customary for PG to "create" new XLOG segments by recycling old
> ones.
>
>> The wal-files were
>> archived read-only, and this file permission seemed to be carried over
>> to the new file created by postgresql in pg_xlog, causing the cluster to
>> fall over and die.
>
> I would say that the bug is in your restore script: it should have made
> sure that the files it copies into the xlog directory are given the
> right ownership/permissions.


Well, the restore command(script) is simply copied from the suggestion
in the manual (restore_command = 'cp /path/to/my/archived/wal/files/%f
"%p"'). In my opinion, it's not very obvious that the last wal file
needs read/write permissions set, and it's certainly not documented
anywhere on
http://www.postgresql.org/docs/current/static/continuous-archiving.html
that I can see.
There's also the matter of the inconsistency that postgresql knows to
recycle *and* chmod the file if it's originally located in pg_xlog/
folder, but not if it's originally located in the wal files archive
folder. I guess it's more of a gotcha than a bug per se.

--
Tommy Gildseth