Thread: Problems starting up postgres
Hello, I have quite strange problem. Postgres refuses to start. This is 7.1.2. Actually this is Aug 15 snapshot of REL7_1_STABLE branch. This is what should be 7.1.2. Any ideas how to repair data? This is quite urgent. The system is live, and now stopped. Sep 5 08:42:30 mx postgres[5341]: [1] DEBUG: database system shutdown was interrupted at 2001-09-05 08:26:25 EDT Sep 5 08:42:30 mx postgres[5341]: [2] DEBUG: CheckPoint record at (23, 2431142676) Sep 5 08:42:30 mx postgres[5341]: [3] DEBUG: Redo record at (23, 2431142676); Undo record at (0, 0); Shutdown TRUE Sep 5 08:42:30 mx postgres[5341]: [4] DEBUG: NextTransactionId: 13932307; NextOid: 9127687 Sep 5 08:42:30 mx postgres[5341]: [5] DEBUG: database system was not properly shut down; automatic recovery in progress... Sep 5 08:42:30 mx postgres[5341]: [6] DEBUG: redo starts at (23, 2431142740) Sep 5 08:42:30 mx postgres[5341]: [7] DEBUG: ReadRecord: record with zero len at (23, 2432317444) Sep 5 08:42:30 mx postgres[5341]: [8] DEBUG: redo done at (23, 2432317408) Sep 5 08:42:30 mx postgres[5341]: [9] FATAL 2: XLogFlush: request is not satisfied Sep 5 08:44:42 mx postgres[5441]: [1] DEBUG: database system shutdown was interrupted at 2001-09-05 08:42:30 EDT Sep 5 08:44:42 mx postgres[5441]: [2] DEBUG: CheckPoint record at (23, 2431142676) -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
Denis Perchine <dyp@perchine.com> writes: > Sep 5 08:42:30 mx postgres[5341]: [9] FATAL 2: XLogFlush: request is not satisfied Hmm. I think you must be running into some kind of logic bug (boundary condition maybe?) in XLogFlush. Could you add some debugging printouts, along the line of *** src/backend/access/transam/xlog.c~ Wed Sep 5 12:18:07 2001 --- src/backend/access/transam/xlog.c Wed Sep 5 12:20:17 2001 *************** *** 1266,1272 **** XLogWrite(WriteRqst); S_UNLOCK(&(XLogCtl->logwrt_lck)); if (XLByteLT(LogwrtResult.Flush,record)) ! elog(STOP, "XLogFlush: request is not satisfied"); break; } S_LOCK_SLEEP(&(XLogCtl->logwrt_lck),spins++, XLOG_LOCK_TIMEOUT); --- 1266,1274 ---- XLogWrite(WriteRqst); S_UNLOCK(&(XLogCtl->logwrt_lck)); if (XLByteLT(LogwrtResult.Flush,record)) ! elog(STOP, "XLogFlush: request (%u, %u) is not satisfied --- flushed to (%u, %u)", ! record.xlogid, record.xrecoff, ! LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff); break; } S_LOCK_SLEEP(&(XLogCtl->logwrt_lck),spins++, XLOG_LOCK_TIMEOUT); (this patch is for current sources, line numbers are probably different in 7.1.*) BTW, how did you get into this state --- did you have a system crash? regards, tom lane
On Wednesday 05 September 2001 23:23, you wrote: > Denis Perchine <dyp@perchine.com> writes: > Hmm. I think you must be running into some kind of logic bug (boundary > condition maybe?) in XLogFlush. Could you add some debugging printouts, > along the line of Sep 6 02:09:28 mx postgres[13468]: [2] DEBUG: CheckPoint record at (23, 2431142676) Sep 6 02:09:28 mx postgres[13468]: [3] DEBUG: Redo record at (23, 2431142676); Undo record at (0, 0); Shutdown TRUE Sep 6 02:09:28 mx postgres[13468]: [4] DEBUG: NextTransactionId: 13932307; NextOid: 9127687 Sep 6 02:09:28 mx postgres[13468]: [5] DEBUG: database system was not properly shut down; automatic recovery in progress... Sep 6 02:09:29 mx postgres[13468]: [6] DEBUG: redo starts at (23, 2431142740) Sep 6 02:09:30 mx postgres[13468]: [7] DEBUG: ReadRecord: record with zero len at (23, 2432317444) Sep 6 02:09:30 mx postgres[13468]: [8] DEBUG: redo done at (23, 2432317408) Sep 6 02:09:30 mx postgres[13468]: [9] FATAL 2: XLogFlush: request(1494286336, 786458) is not satisfied -- flushed to (23, 2432317444) > BTW, how did you get into this state --- did you have a system crash? Yes. I was forced to fsck. -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
Denis Perchine <dyp@perchine.com> writes: > Sep 6 02:09:30 mx postgres[13468]: [9] FATAL 2: XLogFlush: request(1494286336, 786458) is not satisfied -- > flushed to (23, 2432317444) Yeek. Looks like you have a page somewhere in the database with a bogus LSN value (xlog pointer) ... and, most likely, other corruption as well. >> BTW, how did you get into this state --- did you have a system crash? > Yes. I was forced to fsck. Okay. As a temporary recovery measure, I'd suggest reducing that particular elog from STOP to DEBUG level. That will let you start up and run the database. You'll need to look through your tables and try to figure out which one(s) have lost data. It might be interesting to try to figure out just which page has the bad LSN value --- that might give us a clue why the WAL did not provide protection against this failure. Unfortunately XLogFlush doesn't have any idea who its caller is, so the only way I can think of to check that directly is to set a breakpoint at this error report and look at the call stack. Vadim, what do you think of reducing this elog from STOP to a notice on a permanent basis? ISTM we saw cases during 7.1 beta where this STOP prevented people from recovering, so I'm thinking it does more harm than good to overall system reliability. regards, tom lane
On Thursday 06 September 2001 20:49, Tom Lane wrote: > Denis Perchine <dyp@perchine.com> writes: > Okay. As a temporary recovery measure, I'd suggest reducing that > particular elog from STOP to DEBUG level. That will let you start up > and run the database. You'll need to look through your tables and try > to figure out which one(s) have lost data. It might be interesting to > try to figure out just which page has the bad LSN value --- that might > give us a clue why the WAL did not provide protection against this > failure. Unfortunately XLogFlush doesn't have any idea who its caller > is, so the only way I can think of to check that directly is to set a > breakpoint at this error report and look at the call stack. OK. I will do this tomorrow. I have no space, and I forced to tgz, untgz database. -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
> > Sep 6 02:09:30 mx postgres[13468]: [9] FATAL 2: > > XLogFlush: request(1494286336, 786458) is not satisfied -- > > flushed to (23, 2432317444) First note that Denis could just restart with wal_debug = 1 to see bad request, without code change. (We should ask ppl to set wal_debug ON in the case of any WAL problem...) Denis, could you provide us with debug output? > Yeek. Looks like you have a page somewhere in the database > with a bogus LSN value (xlog pointer) ... and, most likely, > other corruption as well. We got error during checkpoint, when backend flushes pages changed by REDO (and *only those pages*). So, that page X (with bad LSN) was "recovered" from log. We didn't see CRC errors, so log is Ok, physically. We should know what is the X page (by setting breakpoint as suggested by Tom) and than look into debug output to see where we got bad LSN. Maybe it comes from restored pages or from checkpoint LSN, due to errors in XLogCtl initialization, but for sure it looks like bug in WAL code. > Vadim, what do you think of reducing this elog from STOP to a notice > on a permanent basis? ISTM we saw cases during 7.1 beta where this And increase probability that ppl will just miss/ignore NOTICE and bug in WAL will continue to harm others? > STOP prevented people from recovering, so I'm thinking it does more And we fixed bug in WAL that time... > harm than good to overall system reliability. No reliability having bugs in WAL code, so I object. But I'd move check into XLogWrite code to STOP if flush request is beyond write point. Denis, please help us to fix this bug. Some GDB-ing probably will be required. If you have not enough time/disk resources but able to give us copy of data-dir, it would be great (I have RedHat 7.? and Solaris 2.6 hosts, Tom ?). In any case debug output is the first thing I'd like to see. If it's big please send it to Tom and me only. And of course you can contact with me in Russian -:) Vadim