Thread: WAL does not recover gracefully from out-of-disk-space
With current sources: DEBUG: copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES DEBUG: copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES FATAL 2: copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001 Terminating any active server processes... Server processes were terminated at Fri Feb 23 21:53:19 2001 Reinitializing shared memory and semaphores DEBUG: starting up DEBUG: database system was interrupted at 2001-02-23 21:53:11 DEBUG: CheckPoint record at (0, 21075456) DEBUG: Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE DEBUG: NextTransactionId: 4296; NextOid: 145211 DEBUG: database system was not properly shut down; automatic recovery in progress... DEBUG: redo starts at (0, 21075520) The Data Base System is starting up DEBUG: open(logfile 0 seg 0) failed: No such file or directory TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441) !(readOff > 0) (0) [Bad file descriptor] postmaster: Startup proc 3179 exited with status 134 - abort Regardless of whether this particular behavior is fixable, this brings up something that I think we *must* do before 7.1 release: create a utility that blows away a corrupted logfile to allow the system to restart with whatever is in the datafiles. Otherwise, there is no recovery technique for WAL restart failures, short of initdb and restore from last backup. I'd rather be able to get at data of questionable up-to-dateness than not have any chance of recovery at all. regards, tom lane
Tom Lane wrote: > > With current sources: > > DEBUG: copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES > DEBUG: copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES > FATAL 2: copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device > Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001 > Terminating any active server processes... > Server processes were terminated at Fri Feb 23 21:53:19 2001 > Reinitializing shared memory and semaphores > DEBUG: starting up > DEBUG: database system was interrupted at 2001-02-23 21:53:11 > DEBUG: CheckPoint record at (0, 21075456) > DEBUG: Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE > DEBUG: NextTransactionId: 4296; NextOid: 145211 > DEBUG: database system was not properly shut down; automatic recovery in progress... > DEBUG: redo starts at (0, 21075520) > The Data Base System is starting up > DEBUG: open(logfile 0 seg 0) failed: No such file or directory > TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441) > !(readOff > 0) (0) [Bad file descriptor] > postmaster: Startup proc 3179 exited with status 134 - abort > > Regardless of whether this particular behavior is fixable, this brings > up something that I think we *must* do before 7.1 release: create a > utility that blows away a corrupted logfile to allow the system to > restart with whatever is in the datafiles. Otherwise, there is no > recovery technique for WAL restart failures, short of initdb and > restore from last backup. I'd rather be able to get at data of > questionable up-to-dateness than not have any chance of recovery > at all. > I've asked 2 or 3 times how to recover from recovery failure but got no answer. We should some recipi for the failure before 7.1 release. Regards, Hiroshi Inoue
>> Regardless of whether this particular behavior is fixable, >> this brings up something that I think we *must* do before >> 7.1 release: create a utility that blows away a corrupted >> logfile to allow the system to restart with whatever is in >> the datafiles. Otherwise, there is no recovery technique >> for WAL restart failures, short of initdb and restore from >> last backup. I'd rather be able to get at data of >> questionable up-to-dateness than not have any chance of >> recovery at all. > > I've asked 2 or 3 times how to recover from recovery failure > but got no answer. We should some recipi for the failure > before 7.1 release. And I answered 2 or 3 times with fixes for each reported recovery failure -:) (And asked to help with testing...) Seems to me that "fixing" is the only "answer" you would get asking the same question to Oracle, Informix or any other system with transaction log. Does anybody know how "big boys" deal with this issue? Vadim ----------------------------------------------- FREE! The World's Best Email Address @email.com Reserve your name now at http://www.email.com
Was the following bug already fixed ? Regards, Hiroshi Inoue Tom Lane wrote: > > With current sources: > > DEBUG: copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES > DEBUG: copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES > FATAL 2: copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device > Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001 > Terminating any active server processes... > Server processes were terminated at Fri Feb 23 21:53:19 2001 > Reinitializing shared memory and semaphores > DEBUG: starting up > DEBUG: database system was interrupted at 2001-02-23 21:53:11 > DEBUG: CheckPoint record at (0, 21075456) > DEBUG: Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE > DEBUG: NextTransactionId: 4296; NextOid: 145211 > DEBUG: database system was not properly shut down; automatic recovery in progress... > DEBUG: redo starts at (0, 21075520) > The Data Base System is starting up > DEBUG: open(logfile 0 seg 0) failed: No such file or directory > TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441) > !(readOff > 0) (0) [Bad file descriptor] > postmaster: Startup proc 3179 exited with status 134 - abort
> Was the following bug already fixed ? I was going to ask same Q. I see that seek+write was changed to write-s in XLogFileInit (that was induced by subj, right?), but what about problem itself? > > DEBUG: redo starts at (0, 21075520) > > The Data Base System is starting up > > DEBUG: open(logfile 0 seg 0) failed: No such file or directory ^^^^^^^^^^^^^^^ redo started in seg 1 and shouldn't try to read seg 0... BTW, were performance tests run after seek+write --> write-s change? Write-s were not obviously faster to me, that's why I've used seek+write, but never tested that area -:( Vadim
Hiroshi Inoue <Inoue@tpf.co.jp> writes: > Was the following bug already fixed ? Dunno. I've changed the WAL ReadRecord code so that it fails soft (no Asserts or elog(STOP)s) for all failure cases, so the particular crash mode exhibited here should be gone. But I'm not sure why the code appears to be trying to open the wrong log segment, as Vadim comments. That bug might still be there. Need to try to reproduce the problem with new code. regards, tom lane
"Vadim Mikheev" <vmikheev@sectorbase.com> writes: > I see that seek+write was changed to write-s in XLogFileInit > (that was induced by subj, right?), but what about problem > itself? > BTW, were performance tests run after seek+write --> write-s > change? That change was for safety, not for performance. It might be a performance win on systems that support fdatasync properly (because it lets us use fdatasync), otherwise it's probably not a performance win. But we need it regardless --- if you didn't want a fully-allocated WAL file, why'd you bother with the original seek-and-write-1-byte code? regards, tom lane