Thread: "recovery mode"

"recovery mode"

From
"Steve Wolfe"
Date:
   What exactly is "recovery mode"?  Today, the backend went into recovery
mode, and simply wouldn't do anything.  Not using any CPU, and would not go
away even with a kill -9.  I ended up having to reboot the machine to get
the database working again....

steve



Re: "recovery mode"

From
Tom Lane
Date:
"Steve Wolfe" <steve@iboats.com> writes:
>    What exactly is "recovery mode"?  Today, the backend went into recovery
> mode, and simply wouldn't do anything.  Not using any CPU, and would not go
> away even with a kill -9.  I ended up having to reboot the machine to get
> the database working again....

I don't think recovery mode actually does much in 7.0.* --- I think it's
just a stub (Vadim might know better though).  In 7.1 it means the thing
is replaying the WAL log after a crash.  In any case it shouldn't
create a lockup condition like that.

The only cases I've ever heard of where a user process couldn't be
killed with kill -9 are where it's stuck in a kernel call (and the
kill response is being held off till the end of the kernel call).
Any such situation is arguably a kernel bug, of course, but that's
not a lot of comfort.
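
One way to check for the stuck-in-a-kernel-call case is sketched below. It assumes a Linux system (the thread never names the OS), where the third field of /proc/<pid>/stat holds a one-letter process state and "D" means uninterruptible sleep. A process in that state does not act on a signal, even kill -9, until the kernel call returns. The helper names (parse_state, is_uninterruptible, read_stat) are made up for illustration and are not part of PostgreSQL.

```python
def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(stat_line):
    """True if the process is in state 'D' (uninterruptible sleep)."""
    # In state 'D' a pending SIGKILL is held off until the kernel call ends.
    return parse_state(stat_line) == "D"


def read_stat(pid):
    """Read the raw stat line for a PID (Linux only; assumes /proc is mounted)."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read()
```

For a live process, is_uninterruptible(read_stat(pid)) would report whether it is currently in that state. A single reading is only a snapshot: a healthy process can pass through "D" briefly during ordinary disk I/O, so it is a process that stays there that suggests a kernel or hardware problem.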

Exactly which process were you sending kill -9 to, anyway?  There should
have been a postmaster and one backend running the recovery-mode code.
If the postmaster was responding to connection requests with an error
message, then I would not say that it was locked up.

            regards, tom lane

Re: "recovery mode"

From
"Steve Wolfe"
Date:
> I don't think recovery mode actually does much in 7.0.* --- I think it's
> just a stub (Vadim might know better though).  In 7.1 it means the thing
> is replaying the WAL log after a crash.  In any case it shouldn't
> create a lockup condition like that.
>
> The only cases I've ever heard of where a user process couldn't be
> killed with kill -9 are where it's stuck in a kernel call (and the
> kill response is being held off till the end of the kernel call).
> Any such situation is arguably a kernel bug, of course, but that's
> not a lot of comfort.
>
> Exactly which process were you sending kill -9 to, anyway?  There should
> have been a postmaster and one backend running the recovery-mode code.
> If the postmaster was responding to connection requests with an error
> message, then I would not say that it was locked up.

  I believe it was a backend that I tried to kill -9.  I knew it wasn't a
good thing to do, but I had to get it running again.  It's amazing how bold
you get when you hear an entire department mumbling about "Why isn't the
site working?". : )

   Anyway, I think the problem wasn't in postgres.  I rebooted the machine,
and it worked - for about ten minutes.  Then it froze, with the kernel
crapping out.   I rebooted it, and it lasted about three minutes until the
same thing happened.  After the next reboot, it didn't even get through the
fsck before it did it again.

    I looked at the CPU temps.  One of the four was warmer than it should be,
but still within acceptable limits (40 C).  So, I shut it down and reseated
the RAM chassis, the DIMMs, the CPUs, and the expansion cards.  When it came
up, I compiled and installed a newer kernel (I guess there was some good in
the crashes), and then it worked fine.  Because of the symptoms, I imagine
that it was a flaky connection.   Odd, considering that everything except
the DIMMs (including the CPUs) is literally screwed to the motherboard!

steve