Re: Corruption of files in PostgreSQL - Mailing list pgsql-general

From Greg Smith
Subject Re: Corruption of files in PostgreSQL
Msg-id Pine.GSO.4.64.0706051006120.28129@westnet.com
In response to Re: Corruption of files in PostgreSQL  ("Paolo Bizzarri" <pibizza@gmail.com>)
Responses Re: Corruption of files in PostgreSQL  (Scott Marlowe <smarlowe@g2switchworks.com>)
List pgsql-general
On Tue, 5 Jun 2007, Paolo Bizzarri wrote:

> On 6/4/07, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>> http://lwn.net/Articles/215868/
>> documents a bug in the 2.6 linux kernel that can result in corrupted
>> files if there are a lot of processes accessing it at once.
>
> in fact, we were using a 2.6.12 kernel. Can this be a problem?

That particular problem appears to be specific to newer kernels so I
wouldn't think it's related to your issue.

Tracking down random crashes of the sort you're reporting is hard.  As
Scott rightly suggested, the source of the problem could easily be any
number of hardware components or low-level software like the kernel.  The
tests required to really certify that a server is suitable for production
use can take several days to run.  The normal approach here
would be to move this application+data to another system and see if the
problem is still there; that lets you rule out all the hardware at once.
That would also take care of something else you should be thinking
about--making absolutely sure you can back up and restore your data, and that the
corruption you're seeing isn't causing information to be lost in your
database.
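
As a rough sketch of the backup check described above, assuming the
standard pg_dump and pg_restore client tools are installed and on the
PATH: take a custom-format dump, then ask pg_restore to list its contents.
The database name and dump path in the usage note are hypothetical
placeholders, and the helper names are made up for illustration.  A dump
that fails partway through is often the first hint that some table cannot
be read cleanly.

```python
import subprocess


def dump_command(dbname, dump_path):
    """Build a pg_dump command that writes a custom-format archive."""
    return ["pg_dump", "--format=custom", "--file=" + dump_path, dbname]


def list_command(dump_path):
    """Build a pg_restore command that only lists the archive contents."""
    return ["pg_restore", "--list", dump_path]


def verify_backup(dbname, dump_path):
    """Return True if the dump completes and the archive can be listed.

    This only shows that the archive is readable.  Restoring it into a
    scratch database on another machine is still the real test.
    """
    for cmd in (dump_command(dbname, dump_path), list_command(dump_path)):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Show which step failed and what the tool reported.
            print("FAILED:", " ".join(cmd))
            print(result.stderr)
            return False
    return True
```

Calling verify_backup("mydb", "/tmp/mydb.dump") would run both steps
and report the first one that fails.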

The general flow of figuring out the cause of random problems goes
something like this:

1) Check for memory errors.  http://www.memtest86.com/ is a good tool for
PCs.  That will need to run for many hours.

2) Run the manufacturer's disk utilities to see if any of your disks are
going bad.  You might be able to do this using Linux's SMART tools instead
without even taking the server down; if you're not using those already you
should look into that.  http://www.linuxjournal.com/article/6983 is a good
intro here.

3) Boot another version of Linux and run some low-level disk tests there.
A live CD/DVD like Knoppix or Ubuntu is the easiest way to do that.

4) If everything above passes, upgrade to the kernel version used on the
live CD/DVD and see if the problem goes away.

You can try skipping right to #4 here and playing with the kernel first,
but understand that if your underlying hardware has issues, that may cause
more corruption (with possible data loss) rather than less.
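
For step 2, a minimal sketch of checking SMART health from a script,
assuming smartmontools is installed and that "smartctl -H" prints one of
the two status lines matched below (the ATA and SCSI forms as I understand
them; treat the exact wording as an assumption and check it against your
own output).  The function names are made up for illustration.

```python
import subprocess


def smart_health_ok(report):
    """Return True if smartctl -H output reports a passing health status.

    Assumed formats: ATA disks print
    "SMART overall-health self-assessment test result: PASSED" and
    SCSI disks print "SMART Health Status: OK".
    """
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("SMART overall-health self-assessment test result:"):
            return line.endswith("PASSED")
        if line.startswith("SMART Health Status:"):
            return line.endswith("OK")
    # No recognizable status line: treat the disk as not verified.
    return False


def check_disk(device):
    """Run smartctl -H on one device (this usually requires root)."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_ok(result.stdout)
```

Something like check_disk("/dev/sda") (the device name is a placeholder)
could be run from cron.  A passing status does not prove the disk is
healthy, so this supplements the manufacturer's tests rather than
replacing them.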

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
