Thread: BUG #2285: Can not access database after successful PITR - file naming problems

BUG #2285: Can not access database after successful PITR - file naming problems

From
"Nicholas Vinen"
Date:
The following bug has been logged online:

Bug reference:      2285
Logged by:          Nicholas Vinen
Email address:      hb@pg.x256.org
PostgreSQL version: 8.1.3
Operating system:   Gentoo Linux x86
Description:        Can not access database after successful PITR - file
naming problems
Details:

For debugging/testing purposes, I have taken to performing a PITR on a test
machine from the production database back-ups. I restore a snap-shot of the
production server which was taken before the time I am interested in
debugging, then use the PITR method to get the database to the point I want
to test at. I can then "rewind" the database and test again by restoring
again.

This used to work (either before 8.0 or before 8.1, I'm not sure). I also
have this back-up in case the production database server dies, but can't
really test restoring it on the production machine except when data is lost,
otherwise I will interrupt service. The machines are almost identical, the
only real difference is that one is a Pentium 3 and one is a Pentium 4, so I
don't see why that should matter. They are both running virtually identical
Linux installations.

Now, when I try to use PITR to restore the database on my test server, the
PITR succeeds as normal, but I can not access the database with an error
like:

psql: FATAL:  could not open relation 1663/16385/605464: No such file or
directory

This happens when I attempt to connect to my database. I can connect to some
of the others (such as the template databases) fine.

Interestingly, the 605464 file was in the back-up snapshot, but the process
of performing the PITR seems to rename the file, but it's still looking
under the old name. This is fully reproducible, so if I am not providing
sufficient information here, just let me know what you need in order to fix
this.

Here is a log of what I have done up to the error:

rt2 ~ # cd /var/lib/postgresql/data
rt2 data # rm -rf *
rt2 data # gzip -cd /backup/Helpdesk/Database/Snapshots/Weekly/2006-06\ \
Sun\ 12\ Feb.io.gz | cpio -i
3229320 blocks
rt2 data # ls -al base/16385/605464
-rw-------  1 postgres postgres 40960 Feb 27 02:57 base/16385/605464
rt2 data # rm pg_xlog/*
rm: cannot remove `pg_xlog/archive_status': Is a directory
rt2 data # cp /data/postgresql/recovery.conf
/data/postgresql/postgresql.conf .
rt2 data # chown postgres *
rt2 data # /etc/init.d/postgresql start
 * Starting PostgreSQL ...
rt2 data # tail -f /var/log/postgres/current
Feb 27 03:06:12 [postgres] [1-1] LOG:  could not create IPv6 socket: Address
family not supported by protocol
Feb 27 03:06:12 [postgres] [2-1] LOG:  database system was interrupted at
2006-02-12 01:30:02 PST
Feb 27 03:06:12 [postgres] [3-1] LOG:  starting archive recovery
Feb 27 03:06:12 [postgres] [4-1] LOG:  restore_command = "gzip -cd
/backup/Helpdesk/Database/TransactionLog/"%f">"%p""
Feb 27 03:06:12 [postgres] [5-1] LOG:  recovery_target_time = 2006-02-17
08:20:00-08
Feb 27 03:06:12 [postgres] [6-1] LOG:  restored log file
"000000010000000B00000002.00A9FA60.backup" from archive
Feb 27 03:06:14 [postgres] [7-1] LOG:  restored log file
"000000010000000B00000002" from archive
Feb 27 03:06:14 [postgres] [8-1] LOG:  checkpoint record is at B/2A9FA60
Feb 27 03:06:14 [postgres] [9-1] LOG:  redo record is at B/2A9FA60; undo
record is at 0/0; shutdown FALSE
Feb 27 03:06:14 [postgres] [10-1] LOG:  next transaction ID: 34066581; next
OID: 611376
Feb 27 03:06:14 [postgres] [11-1] LOG:  next MultiXactId: 633; next
MultiXactOffset: 1265
Feb 27 03:06:14 [postgres] [12-1] LOG:  automatic recovery in progress
Feb 27 03:06:14 [postgres] [13-1] LOG:  redo starts at B/2A9FAA4
Feb 27 03:06:27 [postgres] [14-1] LOG:  restored log file
"000000010000000B00000003" from archive
Feb 27 03:06:38 [postgres] [15-1] LOG:  restored log file
"000000010000000B00000004" from archive
Feb 27 03:06:41 [postgres] [16-1] LOG:  restored log file
"000000010000000B00000005" from archive
....
Feb 27 03:47:13 [postgres] [682-1] LOG:  restored log file
"000000010000000D000000A1" from archive
Feb 27 03:47:17 [postgres] [683-1] LOG:  restored log file
"000000010000000D000000A2" from archive
Feb 27 03:47:21 [postgres] [684-1] LOG:  restored log file
"000000010000000D000000A3" from archive
Feb 27 03:47:21 [postgres] [685-1] LOG:  recovery stopping before commit of
transaction 42586328, time 2006-02-17 08:20:01 PST
Feb 27 03:47:21 [postgres] [686-1] LOG:  redo done at D/A35794EC
Feb 27 03:47:21 [postgres] [687-1] LOG:  selected new timeline ID: 2
Feb 27 03:47:22 [postgres] [688-1] LOG:  archive recovery complete
Feb 27 03:47:32 [postgres] [689-1] LOG:  database system is ready
Feb 27 03:47:32 [postgres] [690-1] LOG:  transaction ID wrap limit is
1103292637, limited by database "postgres"
Feb 27 05:39:54 [postgres] [2-1] LOG:  invalid server process ID -1 (new
error in 8.1.3)
rt2 data # psql -U postgres rt3
psql: FATAL:  could not open relation 1663/16385/605464: No such file or
directory
rt2 data # ls -al base/16385/605464
rt2 data #

recovery.conf:
restore_command = 'gzip -cd
/backup/Helpdesk/Database/TransactionLog/"%f">"%p"'
recovery_target_time = '2006-02-17 16:20:00+0';

I believe this should work. I checked the PITR limitations and I don't
believe I've broken any of them. I also belive this should work between
seperate machines of the same architecture. It certainly has in the past.
Please let me know how to proceed.

By the way, I can access *some* of my databases (the smaller ones), just not
the main one. I suspect perhaps it's due to an ALTER TABLE or similar
command I committed on the main database inbetween the snapshot and the
restore point, which is not being handled properly during the transation log
replay.


Thanks,

Nicholas
"Nicholas Vinen" <hb@x256.com> writes:
> Now, when I try to use PITR to restore the database on my test server, the
> PITR succeeds as normal, but I can not access the database with an error
> like:

> psql: FATAL:  could not open relation 1663/16385/605464: No such file or
> directory

This seems pretty ugly :-(.  Unless you've got some odd stuff in
~/.psqlrc, I would think that psql startup could only cause accesses to
system catalogs, but 605464 is far beyond the normal relfilenode numbers
assigned to any system catalogs or indexes.  AFAICS the only way that a
new relfilenode would be assigned is during TRUNCATE, CLUSTER or
REINDEX ... have you done any of those on system catalogs?

            regards, tom lane