Thread: Detecting database corruption

Detecting database corruption

From
Jack Orenstein
Date:
My company is developing a PostgreSQL 7.4 application.  We don't want
our customers to have to manage the database. If the database ever
becomes corrupt, we have procedures for restoring the database from
backup data. The question we've been wondering about is how to detect
a corrupt database. False positives are acceptable (database isn't
really corrupt but we think it is); false negatives are not, (database
is corrupt but we can't tell).

- Are any cases of corruption detected in the normal operation of
PostgreSQL?

- If there are, then how are applications notified when corruption
is detected?

- What symptoms should prompt us to suspect and check for corruption?

- Are there any tools we can run to determine whether a database is
corrupt?

Jack Orenstein


Re: Detecting database corruption

From
Andrew Sullivan
Date:
On Tue, Jan 13, 2004 at 06:06:03PM -0500, Jack Orenstein wrote:
> - Are any cases of corruption detected in the normal operation of
> PostgreSQL?

If this means, "Does the database usually check for corruption?" the
answer is, "Not as a matter of course."  If this means, "Will I know
if the database is corrupt?" the answer is, "Probably."  When
database corruption occurs, it's serious enough that things stop
working pretty fast.  Note that this is an infrequent enough
occurrence that it's not part of the standard toolkit of DBAs; each
case tends to be unusual.  It seems usually to be caused by faulty
hardware.

> - If there are, then how are applications notified when corruption
> is detected?

They get errors.

> - What symptoms should prompt us to suspect and check for corruption?

Refusal to start.  Random crashes of the server.  Strange errors
during tuple access, VACUUM operations, or pg_dump.

> - Are there any tools we can run to determine whether a database is
> corrupt?

This depends on the case.  There is a utility called pgfsck floating
around; it's dangerous, and for use where things are really totally
hosed up so badly that you can't recover.

The real question is, what have you been using that makes database
corruption such a grave concern?  If I had to worry that much about
Postgres database corruption, I'd use something else.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Those who know history get to watch other people repeat it.
        --Bruce Sterling

Re: Detecting database corruption

From
Jack Orenstein
Date:
Andrew Sullivan wrote:
 > On Tue, Jan 13, 2004 at 06:06:03PM -0500, Jack Orenstein wrote:
 >
 >>- Are any cases of corruption detected in the normal operation of
 >>PostgreSQL?
 >
 >
 > If this means, "Does the database usually check for corruption?" the
 > answer is, "Not as a matter of course."

Do you mean that this happens in a few select situations? Or that
there are configuration flags that can be used to enable such checks?

 > ...
 >
 >>- Are there any tools we can run to determine whether a database is
 >>corrupt?
 >
 >
 > This depends on the case.  There is a utility called pgfsck floating
 > around; it's dangerous, and for use where things are really totally
 > hosed up so badly that you can't recover.
 >
 > The real question is, what have you been using that makes database
 > corruption such a grave concern?  If I had to worry that much about
 > Postgres database corruption, I'd use something else.

The database is part of our product and we'd like it to be as
unobtrusive as possible. Our product has an admin console, but our
expectation is that it will be ignored most of the time. In general,
we'd rather kick off a rebuild of the database more often that might
really be necessary, if this meant that we could minimize the
involvement of an admin.

Database corruption is a concern for two reasons. First, if it ever
does occur, we have to be able to deal with the situation gracefully,
even if that means nothing beyond a clean shutdown of the
application. Second, we are struggling with the IDE vs. fsync issue,
that has come up on this mailing list. We definitely have to support
IDE drives, and we're trying to determine how to balance performance
against other concerns. If we do end up leaving IDE caching enabled,
then my understanding is that corruption is a real possibility, (or
have I drawn the wrong conclusion on this point?)

Jack Orenstein


Re: Detecting database corruption

From
"Joshua D. Drake"
Date:
> Do you mean that this happens in a few select situations? Or that
> there are configuration flags that can be used to enable such checks?
>
It is a few select situations and frankly I haven't had database corruption
because of PostgreSQL since the 7.1 days. I have had however database
corruption due to bad memory, and bad Linux kernels/filesystems.


> >>- Are there any tools we can run to determine whether a database is
> >>corrupt?

Typically PostgreSQL will tell you via an error message pointing to a
relation
id. Also if you perform regular vacuums if vacuum fails it will
typically tell you
where.


>
> > The real question is, what have you been using that makes database
> > corruption such a grave concern?  If I had to worry that much about
> > Postgres database corruption, I'd use something else.
>
> even if that means nothing beyond a clean shutdown of the
> application. Second, we are struggling with the IDE vs. fsync issue,
> that has come up on this mailing list. We definitely have to support
> IDE drives, and we're trying to determine how to balance performance
> against other concerns. If we do end up leaving IDE caching enabled,
> then my understanding is that corruption is a real possibility, (or
> have I drawn the wrong conclusion on this point?)
>
I think (at least personally) that you are placing a little too much
emphasis on
this problem. We have successfully done power plug tests over and over
and over with IDE drives and not had the issue come about.

Of course this entirely depends on many things, but that is what a UPS is
for.

Sincerely,

Joshua D. Drake


> Jack Orenstein
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster



--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Re: Detecting database corruption

From
Andrew Sullivan
Date:
On Mon, Jan 19, 2004 at 02:45:27PM -0500, Jack Orenstein wrote:
> > If this means, "Does the database usually check for corruption?" the
> > answer is, "Not as a matter of course."
>
> Do you mean that this happens in a few select situations? Or that
> there are configuration flags that can be used to enable such checks?

There have been occasional reports of such corruption, but it seems
always to come down to bad hardware.  There are no flags to check for
this as a part of regular operations, although you'd certainly get an
error if you tried to retrieve bad data.

> Database corruption is a concern for two reasons. First, if it ever
> does occur, we have to be able to deal with the situation gracefully,
> even if that means nothing beyond a clean shutdown of the
> application.

In the cases where people experience it, what usually shows up is
some sort of inability to access data that is supposed to be in a
place on the disk, but turns out not to be.  You get error messages
about missing tuples, mangled data, or a core dump.  I think in such
cases you probably would indeed want to shut down your application.

> Second, we are struggling with the IDE vs. fsync issue,
> that has come up on this mailing list. We definitely have to support
> IDE drives, and we're trying to determine how to balance performance
> against other concerns. If we do end up leaving IDE caching enabled,
> then my understanding is that corruption is a real possibility, (or
> have I drawn the wrong conclusion on this point?)

This is a different problem.  My best advice is, "get a UPS with a
brain."  A UPS which will keep your system up for 10 minutes and
which will shut it down as soon as the battery kicks in is pretty
cheap.  That and some regular testing and maintenance of it is likely
to prevent most problematic cases you might run into here.

Most fsync worries actually have to do with losing data rather than
data corruption: fsync is called when a transaction commits, and if
the hardware is lying about whether the bits are actually on the
disk, you might lose some things you think are committed.  You can
apparently tolerate some data loss anyway, so in this case it's not
too big a deal.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Music is no business of mine.
        --Marge Simpson