Thread: Can't Restart ver 8.0b3

Can't Restart ver 8.0b3

From

"Dave Hartwig"

Date:

22 December 2004, 11:33:47

Anyone,

I would not describe my situation as desprate (yet), but it is painful,
and I am looking for some information and a degree of confidence. I have
yet to communicate w/ our sysadmin about backup/restore.

We are using version 8.0 beta 3, on Red Hat Linux, to develop a product
that we plan to released in several months.

The problem began when I tried to make some table modifications.  A
CREATE statement hung. I canceled.  I then noticed many backend
processes hung "idle in transaction".  I had everyone exit their
database activities, but several hung transactions still remained.  I
then did a "pg_ctl stop" followed by a "pg_ctl start".  Here is the log
file activity:
   (I got lots of these.  Probably people exiting their clients)
LOG:  could not receive data from client: Connection reset by peer
LOG:  unexpected EOF on client connection
LOG:  could not receive data from client: Connection reset by peer
LOG:  unexpected EOF on client connection
   (Then this; probably after "pg_ctl stop")
LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down
   (Now this after "pg_ctl start")
LOG:  could not create IPv6 socket: Address family not supported by
protocol
LOG:  database system was shut down at 2004-12-21 16:30:25 EST
LOG:  checkpoint record is at 0/278B3430
LOG:  redo record is at 0/278B3430; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 884736; next OID: 306834
PANIC:  could not access status of transaction 884736
DETAIL:  could not read from file
"/usr/local/pgsql8b3/data/pg_clog/0000" at offset 221184: Success
LOG:  startup process (PID 17774) was terminated by signal 6
LOG:  aborting startup due to startup process failure

It turns out that "/usr/local/pgsql8b3/data/pg_clog/0000" exists, is
read/writable by postgres, and is exactly 221184 bytes long.  So I
started pursuing recovery documents.  Came across pg_resetxlog.   First
I tried without the force option.  It complained that it would not work.
Then, using the -f option; that changed nothing.   Then I began using
the -f, -x and -l options: as follows:

First some useful information for you:

$ ls -l /usr/local/pgsql8b3/data/pg_clog/
-rw-------    1 postgres postgres   221184 Dec 21 16:12 0000    and
$ls -l /usr/local/pgsql8b3/data/pg_xlog
-rw-------    1 postgres postgres 16777216 Dec 22 06:16
000000010000000100000068
drwx------    2 postgres postgres     4096 Sep 28 04:57 archive_status

Like the doc describes, I tried:
$ postgres pg_resetxlog -f -x 0x100000 -l 0x1,0x1,0x65
/usr/local/pgsql8b3/data
    (Now I get this in the log)
LOG:  could not create IPv6 socket: Address family not supported by
protocol
LOG:  database system was shut down at 2004-12-21 20:11:12 EST
LOG:  checkpoint record is at 1/6400001C
LOG:  redo record is at 1/6400001C; undo record is at 1/6400001C;
shutdown TRUE
LOG:  next transaction ID: 1048576; next OID: 306834
PANIC:  could not access status of transaction 1048576
DETAIL:  could not open file "/usr/local/pgsql8b3/data/pg_clog/0001": No
such file or directory
LOG:  startup process (PID 304) was terminated by signal 6
LOG:  aborting startup due to startup process failure

Which is correct, there is no file:
"/usr/local/pgsql8b3/data/pg_clog/0001".

Also, this is the output of pg_controldata after running pg_resetxlog:

pg_control version number:            74
Catalog version number:               200408031
Database system identifier:           4708838347263258567
Database cluster state:               in production
pg_control last modified:             Wed 22 Dec 2004 06:17:55 AM EST
Current log file ID:                  1
Next log file segment:                105
Latest checkpoint location:           1/6800001C
Prior checkpoint location:            0/0
Latest checkpoint's REDO location:    1/6800001C
Latest checkpoint's UNDO location:    1/6800001C
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          1048576
Latest checkpoint's NextOID:          306834
Time of latest checkpoint:            Wed 22 Dec 2004 06:16:27 AM EST
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8


Can someone help me make sense of all this?

Thanks,
David

Re: Can't Restart ver 8.0b3

From

Tom Lane

Date:

22 December 2004, 16:32:49

"Dave Hartwig" <daveh@Routescape.com> writes:
> LOG:  next transaction ID: 884736; next OID: 306834
> PANIC:  could not access status of transaction 884736
> DETAIL:  could not read from file
> "/usr/local/pgsql8b3/data/pg_clog/0000" at offset 221184: Success
> LOG:  startup process (PID 17774) was terminated by signal 6
> LOG:  aborting startup due to startup process failure

Hmm ... do we have a problem when the next XID is exactly at a page
boundary?  I'll look into that.

> LOG:  next transaction ID: 1048576; next OID: 306834
> PANIC:  could not access status of transaction 1048576
> DETAIL:  could not open file "/usr/local/pgsql8b3/data/pg_clog/0001": No
> such file or directory

IIRC, pg_resetxlog doesn't have any provision to create new pg_clog
segments.  Which is probably an oversight, but it's easy enough to
do it by hand.  Do something likedd bs=8k count=1 </dev/zero >/usr/local/pgsql8b3/data/pg_clog/0001
and everything should be fine.
        regards, tom lane

Re: Can't Restart ver 8.0b3

From

"Dave Hartwig"

Date:

22 December 2004, 18:26:11

Bingo!  Thanks Tom.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, December 22, 2004 11:33 AM
To: Dave Hartwig
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Can't Restart ver 8.0b3

"Dave Hartwig" <daveh@Routescape.com> writes:
> LOG:  next transaction ID: 884736; next OID: 306834
> PANIC:  could not access status of transaction 884736
> DETAIL:  could not read from file
> "/usr/local/pgsql8b3/data/pg_clog/0000" at offset 221184: Success
> LOG:  startup process (PID 17774) was terminated by signal 6
> LOG:  aborting startup due to startup process failure

Hmm ... do we have a problem when the next XID is exactly at a page
boundary?  I'll look into that.

> LOG:  next transaction ID: 1048576; next OID: 306834
> PANIC:  could not access status of transaction 1048576
> DETAIL:  could not open file "/usr/local/pgsql8b3/data/pg_clog/0001":
No
> such file or directory

IIRC, pg_resetxlog doesn't have any provision to create new pg_clog
segments.  Which is probably an oversight, but it's easy enough to
do it by hand.  Do something likedd bs=8k count=1 </dev/zero
>/usr/local/pgsql8b3/data/pg_clog/0001
and everything should be fine.
        regards, tom lane

Re: Can't Restart ver 8.0b3

From

Tom Lane

Date:

22 December 2004, 18:50:54

I wrote:
> "Dave Hartwig" <daveh@Routescape.com> writes:
>> LOG:  next transaction ID: 884736; next OID: 306834
>> PANIC:  could not access status of transaction 884736
>> DETAIL:  could not read from file
>> "/usr/local/pgsql8b3/data/pg_clog/0000" at offset 221184: Success
>> LOG:  startup process (PID 17774) was terminated by signal 6
>> LOG:  aborting startup due to startup process failure

> Hmm ... do we have a problem when the next XID is exactly at a page
> boundary?  I'll look into that.

Indeed, this is a bug I introduced into 8.0 awhile back :-(.  Many
thanks for the trouble report.

> IIRC, pg_resetxlog doesn't have any provision to create new pg_clog
> segments.  Which is probably an oversight, but it's easy enough to
> do it by hand.  Do something like
>     dd bs=8k count=1 </dev/zero >/usr/local/pgsql8b3/data/pg_clog/0001
> and everything should be fine.

There isn't any need to change pg_resetxlog, since the postmaster
can handle creating new segments at need.  But I'd modified StartupCLOG
in a way that assumed the current clog page already exists, which is
wrong in this boundary case.  Fixed for RC3.  In the meantime, creating
an all-zeroes page as per the above suggestion should get you out of
trouble.
        regards, tom lane