Re: Hosed PostGreSQL Installation - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Hosed PostGreSQL Installation
Date
Msg-id 8533.1032621224@sss.pgh.pa.us
Whole thread Raw
In response to Hosed PostGreSQL Installation  ("Pete St. Onge" <pete@seul.org>)
Responses Re: Hosed PostGreSQL Installation  ("Pete St. Onge" <pete@seul.org>)
List pgsql-hackers
"Pete St. Onge" <pete@seul.org> writes:
> As a result of some disk errors on another drive, an admin in our group
> brought down the server hosting our pgsql databases with a kill -KILL
> after having gone to runlevel 1 and finding the postmaster process still
> running. No surprise, our installation was hosed in the process. 

That should not have been a catastrophic mistake in any version >= 7.1.
I suspect you had disk problems or other problems.

> Klamath suggested that I run pg_controldata:

> ...
> Latest checkpoint's StartUpID:        21
> Latest checkpoint's NextXID:          615
> Latest checkpoint's NextOID:          18720

These numbers are suspiciously small for an installation that's been
in production awhile.  I suspect you have not told us the whole story;
in particular I suspect you already tried "pg_resetxlog -f", which was
probably not a good idea.

> If I look into the pg_xlog directory, I see this:

> -rw-------    1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
> -rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E

Yeah, your xlog positions should be a great deal higher than they are,
if segment 2/7E was previously in use.

It is likely that you can recover (with some uncertainty about integrity
of recent transactions) if you proceed as follows:

1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release (you
can't use 7.1's pg_resetxlog because it doesn't offer the switches
you'll need).  Compile it *against your 7.1 headers*.  It should compile
except you'll have to remove this change:

***************
*** 853,858 ****
--- 394,403 ----     page->xlp_magic = XLOG_PAGE_MAGIC;     page->xlp_info = 0;     page->xlp_sui =
ControlFile.checkPointCopy.ThisStartUpID;
+     page->xlp_pageaddr.xlogid =
+         ControlFile.checkPointCopy.redo.xlogid;
+     page->xlp_pageaddr.xrecoff =
+         ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogPHD;     record = (XLogRecord *) ((char *) page +
SizeOfXLogPHD);    record->xl_prev.xlogid = 0;     record->xl_prev.xrecoff = 0;
 

Test it using its -n switch to make sure it reports sane values.

2. Run the hacked-up pg_resetxlog like this:
pg_resetxlog -l 2 127 -x 1000000000 $PGDATA

(the -l position is next beyond what we see in pg_xlog, the 1-billion
XID is just a guess at something past where you were.  Actually, can
you give us the size of pg_log, ie, $PGDATA/global/1269?  That would
allow computing a correct next-XID to use.  Figure 4 XIDs per byte,
thus if pg_log is 1 million bytes you need -x at least 4 million.)

3. The postmaster should start now.

4. *Immediately* attempt to do a pg_dumpall.  Do not pass GO, do not
collect $200, do not let in any interactive clients until you've done
it. (I'd suggest tweaking pg_hba.conf to disable all logins but your
own.)

5. If pg_dumpall succeeds and produces sane-looking output, then you've
survived.  initdb, reload the dump file, re-open for business, go have
a beer.  (Recommended: install 7.2.2 and reload into that, not 7.1.*.)
You will probably still need to check for partially-applied recent
transactions, but for the most part you should be OK.

6. If pg_dumpall fails then let us know what the symptoms are, and we'll
see if we can figure out a workaround for whatever the corruption is.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Shridhar Daithankar"
Date:
Subject: Re: Improving speed of copy
Next
From: Tom Lane
Date:
Subject: NUMERIC's transcendental functions