Thread: BUG #2285: Can not access database after successful PITR - file naming problems
BUG #2285: Can not access database after successful PITR - file naming problems
From
"Nicholas Vinen"
Date:
The following bug has been logged online: Bug reference: 2285 Logged by: Nicholas Vinen Email address: hb@pg.x256.org PostgreSQL version: 8.1.3 Operating system: Gentoo Linux x86 Description: Can not access database after successful PITR - file naming problems Details: For debugging/testing purposes, I have taken to performing a PITR on a test machine from the production database back-ups. I restore a snap-shot of the production server which was taken before the time I am interested in debugging, then use the PITR method to get the database to the point I want to test at. I can then "rewind" the database and test again by restoring again. This used to work (either before 8.0 or before 8.1, I'm not sure). I also have this back-up in case the production database server dies, but can't really test restoring it on the production machine except when data is lost, otherwise I will interrupt service. The machines are almost identical, the only real difference is that one is a Pentium 3 and one is a Pentium 4, so I don't see why that should matter. They are both running virtually identical Linux installations. Now, when I try to use PITR to restore the database on my test server, the PITR succeeds as normal, but I can not access the database with an error like: psql: FATAL: could not open relation 1663/16385/605464: No such file or directory This happens when I attempt to connect to my database. I can connect to some of the others (such as the template databases) fine. Interestingly, the 605464 file was in the back-up snapshot, but the process of performing the PITR seems to rename the file, but it's still looking under the old name. This is fully reproducible, so if I am not providing sufficient information here, just let me know what you need in order to fix this. Here is a log of what I have done up to the error: rt2 ~ # cd /var/lib/postgresql/data rt2 data # rm -rf * rt2 data # gzip -cd /backup/Helpdesk/Database/Snapshots/Weekly/2006-06\ \ Sun\ 12\ Feb.io.gz | cpio -i 3229320 blocks rt2 data # ls -al base/16385/605464 -rw------- 1 postgres postgres 40960 Feb 27 02:57 base/16385/605464 rt2 data # rm pg_xlog/* rm: cannot remove `pg_xlog/archive_status': Is a directory rt2 data # cp /data/postgresql/recovery.conf /data/postgresql/postgresql.conf . rt2 data # chown postgres * rt2 data # /etc/init.d/postgresql start * Starting PostgreSQL ... rt2 data # tail -f /var/log/postgres/current Feb 27 03:06:12 [postgres] [1-1] LOG: could not create IPv6 socket: Address family not supported by protocol Feb 27 03:06:12 [postgres] [2-1] LOG: database system was interrupted at 2006-02-12 01:30:02 PST Feb 27 03:06:12 [postgres] [3-1] LOG: starting archive recovery Feb 27 03:06:12 [postgres] [4-1] LOG: restore_command = "gzip -cd /backup/Helpdesk/Database/TransactionLog/"%f">"%p"" Feb 27 03:06:12 [postgres] [5-1] LOG: recovery_target_time = 2006-02-17 08:20:00-08 Feb 27 03:06:12 [postgres] [6-1] LOG: restored log file "000000010000000B00000002.00A9FA60.backup" from archive Feb 27 03:06:14 [postgres] [7-1] LOG: restored log file "000000010000000B00000002" from archive Feb 27 03:06:14 [postgres] [8-1] LOG: checkpoint record is at B/2A9FA60 Feb 27 03:06:14 [postgres] [9-1] LOG: redo record is at B/2A9FA60; undo record is at 0/0; shutdown FALSE Feb 27 03:06:14 [postgres] [10-1] LOG: next transaction ID: 34066581; next OID: 611376 Feb 27 03:06:14 [postgres] [11-1] LOG: next MultiXactId: 633; next MultiXactOffset: 1265 Feb 27 03:06:14 [postgres] [12-1] LOG: automatic recovery in progress Feb 27 03:06:14 [postgres] [13-1] LOG: redo starts at B/2A9FAA4 Feb 27 03:06:27 [postgres] [14-1] LOG: restored log file "000000010000000B00000003" from archive Feb 27 03:06:38 [postgres] [15-1] LOG: restored log file "000000010000000B00000004" from archive Feb 27 03:06:41 [postgres] [16-1] LOG: restored log file "000000010000000B00000005" from archive .... Feb 27 03:47:13 [postgres] [682-1] LOG: restored log file "000000010000000D000000A1" from archive Feb 27 03:47:17 [postgres] [683-1] LOG: restored log file "000000010000000D000000A2" from archive Feb 27 03:47:21 [postgres] [684-1] LOG: restored log file "000000010000000D000000A3" from archive Feb 27 03:47:21 [postgres] [685-1] LOG: recovery stopping before commit of transaction 42586328, time 2006-02-17 08:20:01 PST Feb 27 03:47:21 [postgres] [686-1] LOG: redo done at D/A35794EC Feb 27 03:47:21 [postgres] [687-1] LOG: selected new timeline ID: 2 Feb 27 03:47:22 [postgres] [688-1] LOG: archive recovery complete Feb 27 03:47:32 [postgres] [689-1] LOG: database system is ready Feb 27 03:47:32 [postgres] [690-1] LOG: transaction ID wrap limit is 1103292637, limited by database "postgres" Feb 27 05:39:54 [postgres] [2-1] LOG: invalid server process ID -1 (new error in 8.1.3) rt2 data # psql -U postgres rt3 psql: FATAL: could not open relation 1663/16385/605464: No such file or directory rt2 data # ls -al base/16385/605464 rt2 data # recovery.conf: restore_command = 'gzip -cd /backup/Helpdesk/Database/TransactionLog/"%f">"%p"' recovery_target_time = '2006-02-17 16:20:00+0'; I believe this should work. I checked the PITR limitations and I don't believe I've broken any of them. I also belive this should work between seperate machines of the same architecture. It certainly has in the past. Please let me know how to proceed. By the way, I can access *some* of my databases (the smaller ones), just not the main one. I suspect perhaps it's due to an ALTER TABLE or similar command I committed on the main database inbetween the snapshot and the restore point, which is not being handled properly during the transation log replay. Thanks, Nicholas
Re: BUG #2285: Can not access database after successful PITR - file naming problems
From
Tom Lane
Date:
"Nicholas Vinen" <hb@x256.com> writes: > Now, when I try to use PITR to restore the database on my test server, the > PITR succeeds as normal, but I can not access the database with an error > like: > psql: FATAL: could not open relation 1663/16385/605464: No such file or > directory This seems pretty ugly :-(. Unless you've got some odd stuff in ~/.psqlrc, I would think that psql startup could only cause accesses to system catalogs, but 605464 is far beyond the normal relfilenode numbers assigned to any system catalogs or indexes. AFAICS the only way that a new relfilenode would be assigned is during TRUNCATE, CLUSTER or REINDEX ... have you done any of those on system catalogs? regards, tom lane