Thread: WAL does not recover gracefully from out-of-disk-space

WAL does not recover gracefully from out-of-disk-space

From

Tom Lane

Date:

23 February 2001, 22:11:25

With current sources:

DEBUG:  copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES
DEBUG:  copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES
FATAL 2:  copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device
Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001
Terminating any active server processes...
Server processes were terminated at Fri Feb 23 21:53:19 2001
Reinitializing shared memory and semaphores
DEBUG:  starting up
DEBUG:  database system was interrupted at 2001-02-23 21:53:11
DEBUG:  CheckPoint record at (0, 21075456)
DEBUG:  Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE
DEBUG:  NextTransactionId: 4296; NextOid: 145211
DEBUG:  database system was not properly shut down; automatic recovery in progress...
DEBUG:  redo starts at (0, 21075520)
The Data Base System is starting up
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441)
!(readOff > 0) (0) [Bad file descriptor]
postmaster: Startup proc 3179 exited with status 134 - abort


Regardless of whether this particular behavior is fixable, this brings
up something that I think we *must* do before 7.1 release: create a
utility that blows away a corrupted logfile to allow the system to
restart with whatever is in the datafiles.  Otherwise, there is no
recovery technique for WAL restart failures, short of initdb and
restore from last backup.  I'd rather be able to get at data of
questionable up-to-dateness than not have any chance of recovery
at all.
        regards, tom lane

Re: WAL does not recover gracefully from out-of-disk-space

From

Hiroshi Inoue

Date:

24 February 2001, 03:43:16

Tom Lane wrote:
> 
> With current sources:
> 
> DEBUG:  copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES
> DEBUG:  copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES
> FATAL 2:  copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device
> Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001
> Terminating any active server processes...
> Server processes were terminated at Fri Feb 23 21:53:19 2001
> Reinitializing shared memory and semaphores
> DEBUG:  starting up
> DEBUG:  database system was interrupted at 2001-02-23 21:53:11
> DEBUG:  CheckPoint record at (0, 21075456)
> DEBUG:  Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE
> DEBUG:  NextTransactionId: 4296; NextOid: 145211
> DEBUG:  database system was not properly shut down; automatic recovery in progress...
> DEBUG:  redo starts at (0, 21075520)
> The Data Base System is starting up
> DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
> TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441)
> !(readOff > 0) (0) [Bad file descriptor]
> postmaster: Startup proc 3179 exited with status 134 - abort
> 
> Regardless of whether this particular behavior is fixable, this brings
> up something that I think we *must* do before 7.1 release: create a
> utility that blows away a corrupted logfile to allow the system to
> restart with whatever is in the datafiles.  Otherwise, there is no
> recovery technique for WAL restart failures, short of initdb and
> restore from last backup.  I'd rather be able to get at data of
> questionable up-to-dateness than not have any chance of recovery
> at all.
> 

I've asked 2 or 3 times how to recover from recovery failure but
got no answer. We should some recipi for the failure before 7.1
release.

Regards,
Hiroshi Inoue

Re: WAL does not recover gracefully from out-of-disk-space

From

Vadim Mikheev

Date:

27 February 2001, 08:00:47

>> Regardless of whether this particular behavior is fixable,
>> this brings up something that I think we *must* do before
>> 7.1 release: create a utility that blows away a corrupted
>> logfile to allow the system to restart with whatever is in
>> the datafiles.  Otherwise, there is no recovery technique
>> for WAL restart failures, short of initdb and restore from
>> last backup.  I'd rather be able to get at data of
>> questionable up-to-dateness than not have any chance of
>> recovery at all.
>
> I've asked 2 or 3 times how to recover from recovery failure
> but got no answer. We should some recipi for the failure
> before 7.1 release.

And I answered 2 or 3 times with fixes for each reported
recovery failure -:) (And asked to help with testing...)

Seems to me that "fixing" is the only "answer" you would
get asking the same question to Oracle, Informix or any
other system with transaction log. Does anybody know how
"big boys" deal with this issue?

Vadim

-----------------------------------------------
FREE! The World's Best Email Address @email.com
Reserve your name now at http://www.email.com

Re: WAL does not recover gracefully from out-of-disk-space

From

Hiroshi Inoue

Date:

07 March 2001, 23:14:45

Was the following bug already fixed ?

Regards,
Hiroshi Inoue

Tom Lane wrote:
> 
> With current sources:
> 
> DEBUG:  copy: line 629980, XLogWrite: new log file created - try to increase WAL_FILES
> DEBUG:  copy: line 694890, XLogWrite: new log file created - try to increase WAL_FILES
> FATAL 2:  copy: line 759383, ZeroFill(logfile 0 seg 13) failed: No space left on device
> Server process (pid 3178) exited with status 512 at Fri Feb 23 21:53:19 2001
> Terminating any active server processes...
> Server processes were terminated at Fri Feb 23 21:53:19 2001
> Reinitializing shared memory and semaphores
> DEBUG:  starting up
> DEBUG:  database system was interrupted at 2001-02-23 21:53:11
> DEBUG:  CheckPoint record at (0, 21075456)
> DEBUG:  Redo record at (0, 21075456); Undo record at (0, 0); Shutdown TRUE
> DEBUG:  NextTransactionId: 4296; NextOid: 145211
> DEBUG:  database system was not properly shut down; automatic recovery in progress...
> DEBUG:  redo starts at (0, 21075520)
> The Data Base System is starting up
> DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
> TRAP: Failed Assertion("!(readOff > 0):", File: "xlog.c", Line: 1441)
> !(readOff > 0) (0) [Bad file descriptor]
> postmaster: Startup proc 3179 exited with status 134 - abort

Re: WAL does not recover gracefully from out-of-disk-space

From

"Vadim Mikheev"

Date:

08 March 2001, 07:38:58

> Was the following bug already fixed ?

I was going to ask same Q.

I see that seek+write was changed to write-s in XLogFileInit
(that was induced by subj, right?), but what about problem
itself?

> > DEBUG:  redo starts at (0, 21075520)
> > The Data Base System is starting up
> > DEBUG:  open(logfile 0 seg 0) failed: No such file or directory                ^^^^^^^^^^^^^^^
redo started in seg 1 and shouldn't try to read seg 0...

BTW, were performance tests run after seek+write --> write-s
change? Write-s were not obviously faster to me, that's why I've
used seek+write, but never tested that area -:(

Vadim

Re: WAL does not recover gracefully from out-of-disk-space

From

Tom Lane

Date:

08 March 2001, 11:36:07

Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> Was the following bug already fixed ?

Dunno.  I've changed the WAL ReadRecord code so that it fails soft (no
Asserts or elog(STOP)s) for all failure cases, so the particular crash
mode exhibited here should be gone.  But I'm not sure why the code
appears to be trying to open the wrong log segment, as Vadim comments.
That bug might still be there.  Need to try to reproduce the problem
with new code.
        regards, tom lane

Re: WAL does not recover gracefully from out-of-disk-space

From

Tom Lane

Date:

08 March 2001, 11:59:30

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> I see that seek+write was changed to write-s in XLogFileInit
> (that was induced by subj, right?), but what about problem
> itself?

> BTW, were performance tests run after seek+write --> write-s
> change?

That change was for safety, not for performance.  It might be a
performance win on systems that support fdatasync properly (because it
lets us use fdatasync), otherwise it's probably not a performance win.
But we need it regardless --- if you didn't want a fully-allocated WAL
file, why'd you bother with the original seek-and-write-1-byte code?
        regards, tom lane