Thread: Problems starting up postgres

Problems starting up postgres

From

Denis Perchine

Date:

06 September 2001, 02:12:45

Hello, 
I have quite strange problem. Postgres refuses to start. 
This is 7.1.2. Actually this is Aug 15 snapshot of REL7_1_STABLE branch. 
This is what should be 7.1.2. Any ideas how to repair data? 
This is quite urgent. The system is live, and now stopped. 
Sep  5 08:42:30 mx postgres[5341]: [1] DEBUG:  database system shutdown was interrupted at 2001-09-05 08:26:25 EDT 
Sep  5 08:42:30 mx postgres[5341]: [2] DEBUG:  CheckPoint record at (23, 2431142676) 
Sep  5 08:42:30 mx postgres[5341]: [3] DEBUG:  Redo record at (23, 2431142676); Undo record at (0, 0); Shutdown TRUE 
Sep  5 08:42:30 mx postgres[5341]: [4] DEBUG:  NextTransactionId: 13932307; NextOid: 9127687 
Sep  5 08:42:30 mx postgres[5341]: [5] DEBUG:  database system was not properly shut down; automatic recovery in
progress...
 
Sep  5 08:42:30 mx postgres[5341]: [6] DEBUG:  redo starts at (23, 2431142740) 
Sep  5 08:42:30 mx postgres[5341]: [7] DEBUG:  ReadRecord: record with zero len at (23, 2432317444) 
Sep  5 08:42:30 mx postgres[5341]: [8] DEBUG:  redo done at (23, 2432317408) 
Sep  5 08:42:30 mx postgres[5341]: [9] FATAL 2:  XLogFlush: request is not satisfied 
Sep  5 08:44:42 mx postgres[5441]: [1] DEBUG:  database system shutdown was interrupted at 2001-09-05 08:42:30 EDT 
Sep  5 08:44:42 mx postgres[5441]: [2] DEBUG:  CheckPoint record at (23, 2431142676) 
--  
Sincerely Yours, 
Denis Perchine 
---------------------------------- 
E-Mail: dyp@perchine.com 
HomePage: http://www.perchine.com/dyp/ 
FidoNet: 2:5000/120.5 
----------------------------------

Re: Problems starting up postgres

From

Tom Lane

Date:

06 September 2001, 02:27:44

Denis Perchine <dyp@perchine.com> writes:
> Sep  5 08:42:30 mx postgres[5341]: [9] FATAL 2:  XLogFlush: request is not satisfied 

Hmm.  I think you must be running into some kind of logic bug (boundary
condition maybe?) in XLogFlush.  Could you add some debugging printouts,
along the line of

*** src/backend/access/transam/xlog.c~    Wed Sep  5 12:18:07 2001
--- src/backend/access/transam/xlog.c    Wed Sep  5 12:20:17 2001
***************
*** 1266,1272 ****             XLogWrite(WriteRqst);             S_UNLOCK(&(XLogCtl->logwrt_lck));             if
(XLByteLT(LogwrtResult.Flush,record))
 
!                 elog(STOP, "XLogFlush: request is not satisfied");             break;         }
S_LOCK_SLEEP(&(XLogCtl->logwrt_lck),spins++, XLOG_LOCK_TIMEOUT);
 
--- 1266,1274 ----             XLogWrite(WriteRqst);             S_UNLOCK(&(XLogCtl->logwrt_lck));             if
(XLByteLT(LogwrtResult.Flush,record))
 
!                 elog(STOP, "XLogFlush: request (%u, %u) is not satisfied --- flushed to (%u, %u)",
!                      record.xlogid, record.xrecoff,
!                      LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);             break;         }
S_LOCK_SLEEP(&(XLogCtl->logwrt_lck),spins++, XLOG_LOCK_TIMEOUT);
 


(this patch is for current sources, line numbers are probably different
in 7.1.*)

BTW, how did you get into this state --- did you have a system crash?
        regards, tom lane

Re: Problems starting up postgres

From

Denis Perchine

Date:

06 September 2001, 08:08:35

On Wednesday 05 September 2001 23:23, you wrote: 
> Denis Perchine <dyp@perchine.com> writes: 
> Hmm.  I think you must be running into some kind of logic bug (boundary 
> condition maybe?) in XLogFlush.  Could you add some debugging printouts, 
> along the line of 
Sep  6 02:09:28 mx postgres[13468]: [2] DEBUG:  CheckPoint record at (23, 2431142676) 
Sep  6 02:09:28 mx postgres[13468]: [3] DEBUG:  Redo record at (23, 2431142676); Undo record at (0, 0); Shutdown TRUE 
Sep  6 02:09:28 mx postgres[13468]: [4] DEBUG:  NextTransactionId: 13932307; NextOid: 9127687 
Sep  6 02:09:28 mx postgres[13468]: [5] DEBUG:  database system was not properly shut down; automatic recovery in
progress...
 
Sep  6 02:09:29 mx postgres[13468]: [6] DEBUG:  redo starts at (23, 2431142740) 
Sep  6 02:09:30 mx postgres[13468]: [7] DEBUG:  ReadRecord: record with zero len at (23, 2432317444) 
Sep  6 02:09:30 mx postgres[13468]: [8] DEBUG:  redo done at (23, 2432317408) 
Sep  6 02:09:30 mx postgres[13468]: [9] FATAL 2:  XLogFlush: request(1494286336, 786458) is not satisfied -- 
flushed to (23, 2432317444) 
> BTW, how did you get into this state --- did you have a system crash? 
Yes. I was forced to fsck. 
--  
Sincerely Yours, 
Denis Perchine 
---------------------------------- 
E-Mail: dyp@perchine.com 
HomePage: http://www.perchine.com/dyp/ 
FidoNet: 2:5000/120.5 
----------------------------------

Re: Problems starting up postgres

From

Tom Lane

Date:

06 September 2001, 12:50:23

Denis Perchine <dyp@perchine.com> writes:
> Sep  6 02:09:30 mx postgres[13468]: [9] FATAL 2:  XLogFlush: request(1494286336, 786458) is not satisfied -- 
> flushed to (23, 2432317444) 

Yeek.  Looks like you have a page somewhere in the database with a bogus
LSN value (xlog pointer) ... and, most likely, other corruption as well.
>> BTW, how did you get into this state --- did you have a system crash? 
> Yes. I was forced to fsck. 

Okay.  As a temporary recovery measure, I'd suggest reducing that
particular elog from STOP to DEBUG level.  That will let you start up
and run the database.  You'll need to look through your tables and try
to figure out which one(s) have lost data.  It might be interesting to
try to figure out just which page has the bad LSN value --- that might
give us a clue why the WAL did not provide protection against this
failure.  Unfortunately XLogFlush doesn't have any idea who its caller
is, so the only way I can think of to check that directly is to set a
breakpoint at this error report and look at the call stack.

Vadim, what do you think of reducing this elog from STOP to a notice
on a permanent basis?  ISTM we saw cases during 7.1 beta where this
STOP prevented people from recovering, so I'm thinking it does more
harm than good to overall system reliability.
        regards, tom lane

Re: Problems starting up postgres

From

Denis Perchine

Date:

06 September 2001, 13:06:33

On Thursday 06 September 2001 20:49, Tom Lane wrote: 
> Denis Perchine <dyp@perchine.com> writes: 
> Okay.  As a temporary recovery measure, I'd suggest reducing that 
> particular elog from STOP to DEBUG level.  That will let you start up 
> and run the database.  You'll need to look through your tables and try 
> to figure out which one(s) have lost data.  It might be interesting to 
> try to figure out just which page has the bad LSN value --- that might 
> give us a clue why the WAL did not provide protection against this 
> failure.  Unfortunately XLogFlush doesn't have any idea who its caller 
> is, so the only way I can think of to check that directly is to set a 
> breakpoint at this error report and look at the call stack. 
OK. I will do this tomorrow. I have no space, and I forced to tgz, untgz 
database. 
--  
Sincerely Yours, 
Denis Perchine 
---------------------------------- 
E-Mail: dyp@perchine.com 
HomePage: http://www.perchine.com/dyp/ 
FidoNet: 2:5000/120.5 
----------------------------------

Re: Problems starting up postgres

From

"Mikheev, Vadim"

Date:

06 September 2001, 16:35:33

> > Sep  6 02:09:30 mx postgres[13468]: [9] FATAL 2:
> > XLogFlush: request(1494286336, 786458) is not satisfied -- 
> > flushed to (23, 2432317444) 

First note that Denis could just restart with wal_debug = 1
to see bad request, without code change. (We should ask ppl
to set wal_debug ON in the case of any WAL problem...)
Denis, could you provide us with debug output?

> Yeek.  Looks like you have a page somewhere in the database
> with a bogus LSN value (xlog pointer) ... and, most likely,
> other corruption as well.

We got error during checkpoint, when backend flushes pages
changed by REDO (and *only those pages*). So, that page X (with
bad LSN) was "recovered" from log. We didn't see CRC errors,
so log is Ok, physically. We should know what is the X page
(by setting breakpoint as suggested by Tom) and than look
into debug output to see where we got bad LSN.
Maybe it comes from restored pages or from checkpoint LSN,
due to errors in XLogCtl initialization, but for sure it looks
like bug in WAL code.

> Vadim, what do you think of reducing this elog from STOP to a notice
> on a permanent basis?  ISTM we saw cases during 7.1 beta where this

And increase probability that ppl will just miss/ignore NOTICE
and bug in WAL will continue to harm others?

> STOP prevented people from recovering, so I'm thinking it does more

And we fixed bug in WAL that time...

> harm than good to overall system reliability.

No reliability having bugs in WAL code, so I object. But I'd move
check into XLogWrite code to STOP if flush request is beyond write
point.

Denis, please help us to fix this bug. Some GDB-ing probably will be
required. If you have not enough time/disk resources but able to
give us copy of data-dir, it would be great (I have RedHat 7.? and
Solaris 2.6 hosts, Tom ?). In any case debug output is the first
thing I'd like to see. If it's big please send it to Tom and me only.
And of course you can contact with me in Russian -:)

Vadim