Re: Hosed PostGreSQL Installation - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Hosed PostGreSQL Installation |
Date | |
Msg-id | 8533.1032621224@sss.pgh.pa.us Whole thread Raw |
In response to | Hosed PostGreSQL Installation ("Pete St. Onge" <pete@seul.org>) |
Responses |
Re: Hosed PostGreSQL Installation
|
List | pgsql-hackers |
"Pete St. Onge" <pete@seul.org> writes: > As a result of some disk errors on another drive, an admin in our group > brought down the server hosting our pgsql databases with a kill -KILL > after having gone to runlevel 1 and finding the postmaster process still > running. No surprise, our installation was hosed in the process. That should not have been a catastrophic mistake in any version >= 7.1. I suspect you had disk problems or other problems. > Klamath suggested that I run pg_controldata: > ... > Latest checkpoint's StartUpID: 21 > Latest checkpoint's NextXID: 615 > Latest checkpoint's NextOID: 18720 These numbers are suspiciously small for an installation that's been in production awhile. I suspect you have not told us the whole story; in particular I suspect you already tried "pg_resetxlog -f", which was probably not a good idea. > If I look into the pg_xlog directory, I see this: > -rw------- 1 postgres postgres 16777216 Sep 20 23:13 0000000000000002 > -rw------- 1 postgres postgres 16777216 Sep 19 22:09 000000020000007E Yeah, your xlog positions should be a great deal higher than they are, if segment 2/7E was previously in use. It is likely that you can recover (with some uncertainty about integrity of recent transactions) if you proceed as follows: 1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release (you can't use 7.1's pg_resetxlog because it doesn't offer the switches you'll need). Compile it *against your 7.1 headers*. It should compile except you'll have to remove this change: *************** *** 853,858 **** --- 394,403 ---- page->xlp_magic = XLOG_PAGE_MAGIC; page->xlp_info = 0; page->xlp_sui = ControlFile.checkPointCopy.ThisStartUpID; + page->xlp_pageaddr.xlogid = + ControlFile.checkPointCopy.redo.xlogid; + page->xlp_pageaddr.xrecoff = + ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogPHD; record = (XLogRecord *) ((char *) page + SizeOfXLogPHD); record->xl_prev.xlogid = 0; record->xl_prev.xrecoff = 0; Test it using its -n switch to make sure it reports sane values. 2. Run the hacked-up pg_resetxlog like this: pg_resetxlog -l 2 127 -x 1000000000 $PGDATA (the -l position is next beyond what we see in pg_xlog, the 1-billion XID is just a guess at something past where you were. Actually, can you give us the size of pg_log, ie, $PGDATA/global/1269? That would allow computing a correct next-XID to use. Figure 4 XIDs per byte, thus if pg_log is 1 million bytes you need -x at least 4 million.) 3. The postmaster should start now. 4. *Immediately* attempt to do a pg_dumpall. Do not pass GO, do not collect $200, do not let in any interactive clients until you've done it. (I'd suggest tweaking pg_hba.conf to disable all logins but your own.) 5. If pg_dumpall succeeds and produces sane-looking output, then you've survived. initdb, reload the dump file, re-open for business, go have a beer. (Recommended: install 7.2.2 and reload into that, not 7.1.*.) You will probably still need to check for partially-applied recent transactions, but for the most part you should be OK. 6. If pg_dumpall fails then let us know what the symptoms are, and we'll see if we can figure out a workaround for whatever the corruption is. regards, tom lane
pgsql-hackers by date: