Thanks very much, Tom, for your input -
The guys at AMCC are suggesting that the firmware on the controller
card crashed, causing the card to basically stop all I/O operations.
That would explain why Postgres could not recover and re-read WAL:
/dev/sdc and /dev/sdd were inaccessible at the time.
I think this puzzle is mostly solved - all we need to do now is
figure out what the heck happened on the controller card!
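For what it's worth, a client can at least tell that particular
failure apart and back off instead of failing hard.  Rough sketch of
what I mean (assumes psycopg2; the connection string, timeout, and
retry interval are just placeholders):

    import time
    import psycopg2

    DSN = "dbname=mydb host=dbhost"  # placeholder connection string

    def connect_with_retry(dsn, max_wait=300, interval=10):
        """Keep retrying while the server reports it is in recovery
        mode, but give up after max_wait seconds instead of waiting
        forever (Postgres itself will not time out on stuck disk I/O)."""
        deadline = time.time() + max_wait
        while True:
            try:
                return psycopg2.connect(dsn)
            except psycopg2.OperationalError as e:
                if "recovery mode" in str(e) and time.time() < deadline:
                    time.sleep(interval)  # server still replaying WAL
                    continue
                raise

    conn = connect_with_retry(DSN)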
Thanks,
Alex Turner
On Thu, 10 Mar 2005 23:09:07 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alex Turner <armtuk@gmail.com> writes:
> > Well - I am sort of trying to piece together exactly what happened.
> > Here's what I know.
>
> > Around 02:52 I get messages in my syslog stating that there were
> > problems writing to a controller channel:
> > [ various hardware errors snipped ]
>
> > At around 07:30 all connections were failing giving the error:
> > InternalError: FATAL: the database system is in recovery mode
>
> I think what happened here is that Postgres got a write error on WAL,
> which would probably cause a PANIC, and then the ensuing database reboot
> got hung up trying to re-read WAL. Client connection requests would be
> refused with messages like the above until the recovery process
> completed. The fact that this was still going on 4+ hours later shows
> that Postgres is *not* timing out on stuck disk operations ... very much
> the reverse in fact.
>
> You'd be best off to take the matter up with some kernel hackers.
> If there's anything to be done to improve the behavior, it's at
> the kernel device driver level.
>
> regards, tom lane
>
>