Anyone,
I would not describe my situation as desprate (yet), but it is painful,
and I am looking for some information and a degree of confidence. I have
yet to communicate w/ our sysadmin about backup/restore.
We are using version 8.0 beta 3, on Red Hat Linux, to develop a product
that we plan to released in several months.
The problem began when I tried to make some table modifications. A
CREATE statement hung. I canceled. I then noticed many backend
processes hung "idle in transaction". I had everyone exit their
database activities, but several hung transactions still remained. I
then did a "pg_ctl stop" followed by a "pg_ctl start". Here is the log
file activity:
(I got lots of these. Probably people exiting their clients)
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection
(Then this; probably after "pg_ctl stop")
LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
(Now this after "pg_ctl start")
LOG: could not create IPv6 socket: Address family not supported by
protocol
LOG: database system was shut down at 2004-12-21 16:30:25 EST
LOG: checkpoint record is at 0/278B3430
LOG: redo record is at 0/278B3430; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 884736; next OID: 306834
PANIC: could not access status of transaction 884736
DETAIL: could not read from file
"/usr/local/pgsql8b3/data/pg_clog/0000" at offset 221184: Success
LOG: startup process (PID 17774) was terminated by signal 6
LOG: aborting startup due to startup process failure
It turns out that "/usr/local/pgsql8b3/data/pg_clog/0000" exists, is
read/writable by postgres, and is exactly 221184 bytes long. So I
started pursuing recovery documents. Came across pg_resetxlog. First
I tried without the force option. It complained that it would not work.
Then, using the -f option; that changed nothing. Then I began using
the -f, -x and -l options: as follows:
First some useful information for you:
$ ls -l /usr/local/pgsql8b3/data/pg_clog/
-rw------- 1 postgres postgres 221184 Dec 21 16:12 0000 and
$ls -l /usr/local/pgsql8b3/data/pg_xlog
-rw------- 1 postgres postgres 16777216 Dec 22 06:16
000000010000000100000068
drwx------ 2 postgres postgres 4096 Sep 28 04:57 archive_status
Like the doc describes, I tried:
$ postgres pg_resetxlog -f -x 0x100000 -l 0x1,0x1,0x65
/usr/local/pgsql8b3/data
(Now I get this in the log)
LOG: could not create IPv6 socket: Address family not supported by
protocol
LOG: database system was shut down at 2004-12-21 20:11:12 EST
LOG: checkpoint record is at 1/6400001C
LOG: redo record is at 1/6400001C; undo record is at 1/6400001C;
shutdown TRUE
LOG: next transaction ID: 1048576; next OID: 306834
PANIC: could not access status of transaction 1048576
DETAIL: could not open file "/usr/local/pgsql8b3/data/pg_clog/0001": No
such file or directory
LOG: startup process (PID 304) was terminated by signal 6
LOG: aborting startup due to startup process failure
Which is correct, there is no file:
"/usr/local/pgsql8b3/data/pg_clog/0001".
Also, this is the output of pg_controldata after running pg_resetxlog:
pg_control version number: 74
Catalog version number: 200408031
Database system identifier: 4708838347263258567
Database cluster state: in production
pg_control last modified: Wed 22 Dec 2004 06:17:55 AM EST
Current log file ID: 1
Next log file segment: 105
Latest checkpoint location: 1/6800001C
Prior checkpoint location: 0/0
Latest checkpoint's REDO location: 1/6800001C
Latest checkpoint's UNDO location: 1/6800001C
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 1048576
Latest checkpoint's NextOID: 306834
Time of latest checkpoint: Wed 22 Dec 2004 06:16:27 AM EST
Database block size: 8192
Blocks per segment of large relation: 131072
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8
Can someone help me make sense of all this?
Thanks,
David