Thread: BUG #5176: database recovery produces infinite loop

BUG #5176: database recovery produces infinite loop

From
"Helge Milde"
Date:
The following bug has been logged online:

Bug reference:      5176
Logged by:          Helge Milde
Email address:      helge@monsternett.no
PostgreSQL version: 8.1.18
Operating system:   Debian 2.6.18-6-686
Description:        database recovery produces infinite loop
Details:

Our database recently crashed, and since then, we haven't been able to start
it again.

Here's the postmaster log with debug level 5:

--- log start ---
LOG:  could not load root certificate file "root.crt": No SSL error
reported
DETAIL:  Will not verify client certificates.
DEBUG:  invoking IpcMemoryCreate(size=10469376)
DEBUG:  max_safe_fds = 985, usable_fds = 1000, already_open = 5
LOG:  database system was interrupted while in recovery at 2009-11-10
13:07:47 CET
HINT:  This probably means that some data is corrupted and you will have to
use the last backup for recovery.
LOG:  checkpoint record is at 8/931B8558
LOG:  redo record is at 8/931B3544; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 566148231; next OID: 25210
LOG:  next MultiXactId: 18; next MultiXactOffset: 37
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 8/931B3544
--- log end ---

Nothing else of interest follows.


It seems the recovery process is failing; doing a strace on the 'postgres:
startup process' pid, I see this:

-- strace log start ---
Process 29888 attached - interrupt to quit
close(14)                               = 0
open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
_llseek(14, 245760, [245760], SEEK_SET) = 0
write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
= 8192
fsync(14)                               = 0
close(14)                               = 0
open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
_llseek(14, 245760, [245760], SEEK_SET) = 0
write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
= 8192
fsync(14)                               = 0
close(14)                               = 0
-- strace log end ---

This repeats infinetly.

We do have some database dumps, but they are not very recent, so any help on
fixing this would be much appreciated.

Thanks,
Helge Milde

Re: BUG #5176: database recovery produces infinite loop

From
Tom Lane
Date:
"Helge Milde" <helge@monsternett.no> writes:
> It seems the recovery process is failing; doing a strace on the 'postgres:
> startup process' pid, I see this:

> -- strace log start ---
> Process 29888 attached - interrupt to quit
> close(14)                               = 0
> open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
> _llseek(14, 245760, [245760], SEEK_SET) = 0
> write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
> = 8192
> fsync(14)                               = 0
> close(14)                               = 0
> open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
> _llseek(14, 245760, [245760], SEEK_SET) = 0
> write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
> = 8192
> fsync(14)                               = 0
> close(14)                               = 0
> -- strace log end ---

> This repeats infinetly.

Hmm, can you attach to that process with gdb and collect a few stack
traces to show where it's looping?  This isn't a symptom we've seen
before, AFAIR.

            regards, tom lane

Re: BUG #5176: database recovery produces infinite loop

From
Helge Milde
Date:
I'm sorry, but the issue has been fixed now; we've upgraded to postgres-8.3, so I can't reproduce this anymore..
I'm not completely sure how it got fixed (it wasn't me), but I can ask the person who fixed it and get back to you
tomorrow;perhaps that might help narrow the field a little bit. 

On Tue, Nov 10, 2009 at 09:31:06AM -0500, Tom Lane wrote:
>"Helge Milde" <helge@monsternett.no> writes:
>> It seems the recovery process is failing; doing a strace on the 'postgres:
>> startup process' pid, I see this:
>
>> -- strace log start ---
>> Process 29888 attached - interrupt to quit
>> close(14)                               = 0
>> open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
>> _llseek(14, 245760, [245760], SEEK_SET) = 0
>> write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
>> = 8192
>> fsync(14)                               = 0
>> close(14)                               = 0
>> open("pg_clog/021B", O_RDWR|O_CREAT|O_LARGEFILE, 0600) = 14
>> _llseek(14, 245760, [245760], SEEK_SET) = 0
>> write(14, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192)
>> = 8192
>> fsync(14)                               = 0
>> close(14)                               = 0
>> -- strace log end ---
>
>> This repeats infinetly.
>
>Hmm, can you attach to that process with gdb and collect a few stack
>traces to show where it's looping?  This isn't a symptom we've seen
>before, AFAIR.
>
>            regards, tom lane

--
Helge Milde, 69701808
www.monsternett.no