Thread: redo error?

redo error?

From
"Christopher Kings-Lynne"
Date:
Hi guys,

My postgres totally messed up again for some reason (there were like 3
postmasters running, other weirdness).

I noticed this as it was starting up again:

2003-01-07 18:01:34 DEBUG:  ReadRecord: unexpected pageaddr 16/F2794000 in
log file 22, segment 249, offset 7946240
2003-01-07 18:01:34 DEBUG:  redo done at 16/F9791664

It also logged that it was killed with signal 9, although I didn't kill it!
Is there something weird going on here?

Postgres 7.2.3

Chris



Re: redo error?

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> My postgres totally messed up again for some reason (there were like 3
> postmasters running, other weirdness).
> I noticed this as it was starting up again:
> 2003-01-07 18:01:34 DEBUG:  ReadRecord: unexpected pageaddr 16/F2794000 in
> log file 22, segment 249, offset 7946240
> 2003-01-07 18:01:34 DEBUG:  redo done at 16/F9791664

This is probably OK --- I believe it just suggests that an XLOG page
header is not what was expected, which is an unsurprising case after a
crash.  The system should recover anyway.  (If you were running with
fsync off, then more paranoia might be appropriate.)
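
For illustration, the numbers in the quoted log line can be cross-checked with a little arithmetic. This is a minimal sketch, assuming the default 16 MB WAL segment size and that the address is printed as log file and byte offset in hexadecimal; the function and variable names are hypothetical and not part of PostgreSQL.

```python
# Assumption: default 16 MB WAL segments (not stated in the thread).
XLOG_SEG_SIZE = 16 * 1024 * 1024


def expected_pageaddr(log_file, segment, offset):
    """Format a WAL position as hex 'logfile/byteoffset', as in the log line."""
    byte_offset = segment * XLOG_SEG_SIZE + offset
    return f"{log_file:X}/{byte_offset:X}"


def parse_pageaddr(text):
    """Split a hex 'logfile/byteoffset' string into two integers."""
    log_file, byte_offset = text.split("/")
    return int(log_file, 16), int(byte_offset, 16)


# Values copied from the log message in this thread.
expected = expected_pageaddr(22, 249, 7946240)
found = "16/F2794000"

_, expected_off = parse_pageaddr(expected)
_, found_off = parse_pageaddr(found)
segments_behind = (expected_off - found_off) // XLOG_SEG_SIZE

print(expected)         # 16/F9794000
print(segments_behind)  # 7
```

Under those assumptions, the page read at that position should have been stamped 16/F9794000 but carried an address seven segments older, and "redo done at 16/F9791664" falls just before it. That would be consistent with recovery simply reaching the end of the valid log, though that reading is an inference from the numbers rather than something stated in the thread.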

> It also logged that it was killed with signal 9, although I didn't kill it!
> Is there something weird going on here?

Is this Linux?  The Linux kernel seems to think that killing
randomly-chosen processes with SIGKILL is an appropriate response to
running out of memory.  I cannot offhand think of a more brain-dead
behavior in any OS living or dead, but that's what it does.
        regards, tom lane
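
As a side note on the "killed with signal 9" log entry: a supervising process can see which signal ended a child, but not who sent it, which is why the log alone cannot distinguish a manual kill from one issued by the kernel. The sketch below shows this on a POSIX system using Python's `subprocess` module; it is only an illustration, not how the postmaster is implemented, and `describe_exit` is a hypothetical helper.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a Popen return code into a log-style message (POSIX semantics)."""
    if returncode < 0:
        # A negative return code means the child was terminated by that signal.
        return f"terminated by signal {-returncode}"
    return f"exited with code {returncode}"


# Start a child that would otherwise sleep, then kill it the way an
# out-of-memory handler might: with SIGKILL, which cannot be caught.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
child.send_signal(signal.SIGKILL)
child.wait()

message = describe_exit(child.returncode)
print(message)
```

The parent learns only the signal number from the exit status; identifying the sender needs other evidence, such as kernel log messages.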


Re: redo error?

From
"Christopher Kings-Lynne"
Date:
> > It also logged that it was killed with signal 9, although I didn't kill it!
> > Is there something weird going on here?
>
> Is this Linux?  The Linux kernel seems to think that killing
> randomly-chosen processes with SIGKILL is an appropriate response to
> running out of memory.  I cannot offhand think of a more brain-dead
> behavior in any OS living or dead, but that's what it does.

No, FreeBSD.  It does the same thing as Linux.

I think what happened is that the postmaster got confused by lots of kill
requests from the kernel, so I ended up with 3 of them running.

But then I killed them all manually, ipcclean'd and restarted the postmaster
cleanly.  Then, a few minutes later, I saw that message.  However, I might be
getting mixed up about the order of events, so it is probably either me or
the kernel doing it.

Chris



Re: redo error?

From
Greg Copeland
Date:
On Tue, 2003-01-07 at 22:58, Tom Lane wrote:

> > It also logged that it was killed with signal 9, although I didn't kill it!
> > Is there something weird going on here?
> 
> Is this Linux?  The Linux kernel seems to think that killing
> randomly-chosen processes with SIGKILL is an appropriate response to
> running out of memory.  I cannot offhand think of a more brain-dead
> behavior in any OS living or dead, but that's what it does.

Just FYI, I believe the 2.6.x series of kernels will rectify this
situation.


-- 
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting