BUG #2285: Can not access database after successful PITR - file naming problems - Mailing list pgsql-bugs

From: Nicholas Vinen
Subject: BUG #2285: Can not access database after successful PITR - file naming problems
Msg-id: 20060227134626.E4695F0B03@svr2.postgresql.org
List: pgsql-bugs

The following bug has been logged online:

Bug reference:      2285
Logged by:          Nicholas Vinen
Email address:      hb@pg.x256.org
PostgreSQL version: 8.1.3
Operating system:   Gentoo Linux x86
Description:        Can not access database after successful PITR - file
naming problems
Details:

For debugging/testing purposes, I have taken to performing a PITR on a test
machine from the production database backups. I restore a snapshot of the
production server that was taken before the time I am interested in
debugging, then use the PITR method to bring the database to the point I want
to test at. I can then "rewind" the database and test again by restoring
again.

This used to work (before either 8.0 or 8.1, I'm not sure which). I also
keep this backup in case the production database server dies, but I can't
really test restoring it on the production machine unless data has been lost,
because doing so would interrupt service. The two machines are almost
identical: the only real difference is that one is a Pentium 3 and the other
is a Pentium 4, and I don't see why that should matter. Both are running
virtually identical Linux installations.

Now, when I try to use PITR to restore the database on my test server, the
PITR succeeds as normal, but I cannot access the database. Connecting fails
with an error like:

psql: FATAL:  could not open relation 1663/16385/605464: No such file or
directory

This happens when I attempt to connect to my database. I can connect to some
of the others (such as the template databases) fine.
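
For readers unfamiliar with the three numbers in the error message, they
read as tablespace OID / database OID / relfilenode. The sketch below shows
how such a triple maps to a file under the data directory. It assumes the
8.1-era on-disk layout as I understand it; relation_path and
parse_relation_triple are hypothetical helper names, not PostgreSQL APIs.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of PostgreSQL.
DEFAULT_TABLESPACE_OID = 1663  # pg_default
GLOBAL_TABLESPACE_OID = 1664   # pg_global


def parse_relation_triple(text):
    """Split 'spc/db/rel' (as printed in the error message) into three ints."""
    spc_oid, db_oid, relfilenode = (int(part) for part in text.split("/"))
    return spc_oid, db_oid, relfilenode


def relation_path(spc_oid, db_oid, relfilenode):
    """Return the relation file path relative to the data directory.

    Assumes the layout used around PostgreSQL 8.1 (no per-version
    subdirectory inside user tablespaces).
    """
    if spc_oid == GLOBAL_TABLESPACE_OID:
        return f"global/{relfilenode}"
    if spc_oid == DEFAULT_TABLESPACE_OID:
        return f"base/{db_oid}/{relfilenode}"
    return f"pg_tblspc/{spc_oid}/{db_oid}/{relfilenode}"


print(relation_path(*parse_relation_triple("1663/16385/605464")))
# prints: base/16385/605464
```

This is why base/16385/605464 is the file to look for in the data directory.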

Interestingly, the 605464 file was present in the backup snapshot, but the
process of performing the PITR seems to rename the file while the server
still looks for it under the old name. This is fully reproducible, so if I am
not providing sufficient information here, just let me know what you need in
order to fix this.
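
One way to gather more evidence about which files change during recovery is
to compare the directory listing of the database directory right after
unpacking the snapshot with the listing after recovery completes. This is
only a minimal sketch: list_files and diff_listings are hypothetical helper
names, and the listings in the example are made up for illustration (only
605464 comes from the report).

```python
import os


def list_files(directory):
    """Return the set of file names directly inside a directory."""
    return set(os.listdir(directory))


def diff_listings(before, after):
    """Return (missing, new): names only in `before`, names only in `after`."""
    return sorted(before - after), sorted(after - before)


# Made-up listings for illustration. In practice, call
# list_files("base/16385") once after unpacking the snapshot and
# once more after archive recovery has finished.
before = {"1259", "605464"}
after = {"1259", "611380"}

missing, new = diff_listings(before, after)
print("missing after recovery:", missing)
print("new after recovery:", new)
```

Files that appear under "missing" without a matching catalog update would be
consistent with the error reported here.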

Here is a log of what I have done up to the error:

rt2 ~ # cd /var/lib/postgresql/data
rt2 data # rm -rf *
rt2 data # gzip -cd /backup/Helpdesk/Database/Snapshots/Weekly/2006-06\ \
Sun\ 12\ Feb.io.gz | cpio -i
3229320 blocks
rt2 data # ls -al base/16385/605464
-rw-------  1 postgres postgres 40960 Feb 27 02:57 base/16385/605464
rt2 data # rm pg_xlog/*
rm: cannot remove `pg_xlog/archive_status': Is a directory
rt2 data # cp /data/postgresql/recovery.conf
/data/postgresql/postgresql.conf .
rt2 data # chown postgres *
rt2 data # /etc/init.d/postgresql start
 * Starting PostgreSQL ...
rt2 data # tail -f /var/log/postgres/current
Feb 27 03:06:12 [postgres] [1-1] LOG:  could not create IPv6 socket: Address
family not supported by protocol
Feb 27 03:06:12 [postgres] [2-1] LOG:  database system was interrupted at
2006-02-12 01:30:02 PST
Feb 27 03:06:12 [postgres] [3-1] LOG:  starting archive recovery
Feb 27 03:06:12 [postgres] [4-1] LOG:  restore_command = "gzip -cd
/backup/Helpdesk/Database/TransactionLog/"%f">"%p""
Feb 27 03:06:12 [postgres] [5-1] LOG:  recovery_target_time = 2006-02-17
08:20:00-08
Feb 27 03:06:12 [postgres] [6-1] LOG:  restored log file
"000000010000000B00000002.00A9FA60.backup" from archive
Feb 27 03:06:14 [postgres] [7-1] LOG:  restored log file
"000000010000000B00000002" from archive
Feb 27 03:06:14 [postgres] [8-1] LOG:  checkpoint record is at B/2A9FA60
Feb 27 03:06:14 [postgres] [9-1] LOG:  redo record is at B/2A9FA60; undo
record is at 0/0; shutdown FALSE
Feb 27 03:06:14 [postgres] [10-1] LOG:  next transaction ID: 34066581; next
OID: 611376
Feb 27 03:06:14 [postgres] [11-1] LOG:  next MultiXactId: 633; next
MultiXactOffset: 1265
Feb 27 03:06:14 [postgres] [12-1] LOG:  automatic recovery in progress
Feb 27 03:06:14 [postgres] [13-1] LOG:  redo starts at B/2A9FAA4
Feb 27 03:06:27 [postgres] [14-1] LOG:  restored log file
"000000010000000B00000003" from archive
Feb 27 03:06:38 [postgres] [15-1] LOG:  restored log file
"000000010000000B00000004" from archive
Feb 27 03:06:41 [postgres] [16-1] LOG:  restored log file
"000000010000000B00000005" from archive
....
Feb 27 03:47:13 [postgres] [682-1] LOG:  restored log file
"000000010000000D000000A1" from archive
Feb 27 03:47:17 [postgres] [683-1] LOG:  restored log file
"000000010000000D000000A2" from archive
Feb 27 03:47:21 [postgres] [684-1] LOG:  restored log file
"000000010000000D000000A3" from archive
Feb 27 03:47:21 [postgres] [685-1] LOG:  recovery stopping before commit of
transaction 42586328, time 2006-02-17 08:20:01 PST
Feb 27 03:47:21 [postgres] [686-1] LOG:  redo done at D/A35794EC
Feb 27 03:47:21 [postgres] [687-1] LOG:  selected new timeline ID: 2
Feb 27 03:47:22 [postgres] [688-1] LOG:  archive recovery complete
Feb 27 03:47:32 [postgres] [689-1] LOG:  database system is ready
Feb 27 03:47:32 [postgres] [690-1] LOG:  transaction ID wrap limit is
1103292637, limited by database "postgres"
Feb 27 05:39:54 [postgres] [2-1] LOG:  invalid server process ID -1 (new
error in 8.1.3)
rt2 data # psql -U postgres rt3
psql: FATAL:  could not open relation 1663/16385/605464: No such file or
directory
rt2 data # ls -al base/16385/605464
rt2 data #
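
As a cross-check of the log above, the WAL locations it prints can be mapped
to the archived segment file names it restores. The sketch below assumes the
default 16 MB segment size and the 8.1-era naming scheme (timeline, log id
and segment number, each as 8 hex digits); wal_segment_name and
backup_history_name are hypothetical helper names.

```python
# Assumption: default WAL segment size of 16 MB (not confirmed for this build).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def split_location(location):
    """Split a location such as 'B/2A9FA60' into (log id, byte offset)."""
    log_id_hex, offset_hex = location.split("/")
    return int(log_id_hex, 16), int(offset_hex, 16)


def wal_segment_name(timeline, location):
    """Name of the WAL segment file that contains the given location."""
    log_id, offset = split_location(location)
    segment = offset // WAL_SEGMENT_SIZE
    return f"{timeline:08X}{log_id:08X}{segment:08X}"


def backup_history_name(timeline, location):
    """Name of the backup history file for a backup starting at `location`."""
    _, offset = split_location(location)
    offset_in_segment = offset % WAL_SEGMENT_SIZE
    return f"{wal_segment_name(timeline, location)}.{offset_in_segment:08X}.backup"


print(wal_segment_name(1, "B/2A9FA60"))     # 000000010000000B00000002
print(backup_history_name(1, "B/2A9FA60"))  # 000000010000000B00000002.00A9FA60.backup
print(wal_segment_name(1, "D/A35794EC"))    # 000000010000000D000000A3
```

These names match the first and last files the log reports as restored from
the archive, so the replay covered the range from the checkpoint to the
"redo done" location.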

recovery.conf:
restore_command = 'gzip -cd
/backup/Helpdesk/Database/TransactionLog/"%f">"%p"'
recovery_target_time = '2006-02-17 16:20:00+0';
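
The target time in recovery.conf is written in UTC while the server log
prints it in PST, so it is worth confirming that both denote the same
instant. A minimal sketch, assuming PST is a fixed UTC-8 offset (variable
names are mine):

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST treated as a fixed UTC-8 offset.
PST = timezone(timedelta(hours=-8), "PST")

# recovery_target_time from recovery.conf: '2006-02-17 16:20:00+0'
target_in_conf = datetime(2006, 2, 17, 16, 20, 0, tzinfo=timezone.utc)

# recovery_target_time as echoed in the server log: 2006-02-17 08:20:00-08
target_in_log = datetime(2006, 2, 17, 8, 20, 0, tzinfo=PST)

# Commit time of the transaction recovery stopped before: 08:20:01 PST
stop_commit = datetime(2006, 2, 17, 8, 20, 1, tzinfo=PST)

print(target_in_conf == target_in_log)  # True
print(stop_commit - target_in_conf)     # 0:00:01
```

So recovery stopped at the first commit after the requested time, which is
the expected behaviour for a time-based target.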

I believe this should work. I checked the PITR limitations and I don't
believe I've broken any of them. I also believe this should work between
separate machines of the same architecture. It certainly has in the past.
Please let me know how to proceed.

By the way, I can access *some* of my databases (the smaller ones), just not
the main one. I suspect this is due to an ALTER TABLE or similar command I
committed on the main database between the snapshot and the restore point,
which is not being handled properly during the transaction log replay.


Thanks,

Nicholas
