Thread: database corruption

database corruption

From

Ian Westmacott

Date:

15 April 2005, 17:55:37

For several weeks now we have been experiencing fairly
severe database corruption upon clean reboot.  It is very
repeatable, and the corruption is of the following forms:

ERROR:  could not access status of transaction foo
DETAIL:  could not open file "bar": No such file or directory

ERROR:  invalid page header in block foo of relation "bar"

ERROR:  uninitialized page in block foo of relation "bar"


At first, we believed this was related to XFS, and have
been pursuing investigations along those lines.  However,
we have now experienced the exact same problem with JFS.

Here are some details:

- Postgres 7.4.2
- 2.6.6 kernel.org kernel
- dedicated database partition
- repeatable with XFS and JFS (have not seen on ext3)
- repeatable with and without Linux software RAID 0
- repeatable with IDE and SATA
- repeatable with and without fsync, and with fdatasync
- repeatable on multiple systems


I have two questions:

- any known reason why this might be occurring?  (we must
  have something wrong, for this high rate of severe
  error).

- if I don't care about losing data, and am not interested
  in trying to recover anything, how can I arrange for
  Postgres to proceed normally?  I know about
  zero_damaged_pages, but this doesn't help with missing
  transaction files and such.  Is there any way to get
  Postgres to chuck anything bad and proceed?

Thanks,

    --Ian

Re: database corruption

From

Chris Travers

Date:

15 April 2005, 22:29:41

Hi Ian;

I think it is important to figure out why this is happening.  I would
not want to run any production databases on systems that were failing
like this.

I am trying to figure out what are the likely causes of the errors...

1)  Any other computers suffer random application crashes, power downs,
etc. in your building?
2)  I take it there are no Raid controllers involved?
3)  RAM is non-ECC?
4)  Are the systems on UPS's?

If I could make a wild (and probably wrong) guess, I would wonder if
something external to the system (like electrical supply) was
introducing glitches into memory, causing bad data to be written.  I am
only mentioning it because I have implicated electrical supply in other
cases where rare computer failurres weer affecting many systems...

Ian Westmacott wrote:

>For several weeks now we have been experiencing fairly
>severe database corruption upon clean reboot.  It is very
>repeatable, and the corruption is of the following forms:
>
>ERROR:  could not access status of transaction foo
>DETAIL:  could not open file "bar": No such file or directory
>
>ERROR:  invalid page header in block foo of relation "bar"
>
>ERROR:  uninitialized page in block foo of relation "bar"
>
>
>At first, we believed this was related to XFS, and have
>been pursuing investigations along those lines.  However,
>we have now experienced the exact same problem with JFS.
>
>Here are some details:
>
>- Postgres 7.4.2
>- 2.6.6 kernel.org kernel
>- dedicated database partition
>- repeatable with XFS and JFS (have not seen on ext3)
>- repeatable with and without Linux software RAID 0
>- repeatable with IDE and SATA
>- repeatable with and without fsync, and with fdatasync
>- repeatable on multiple systems
>
>
>I have two questions:
>
>- any known reason why this might be occurring?  (we must
>  have something wrong, for this high rate of severe
>  error).
>
>- if I don't care about losing data, and am not interested
>  in trying to recover anything, how can I arrange for
>  Postgres to proceed normally?  I know about
>  zero_damaged_pages, but this doesn't help with missing
>  transaction files and such.  Is there any way to get
>  Postgres to chuck anything bad and proceed?
>
>Thanks,
>
>    --Ian
>
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>
>
>

Re: database corruption

From

"Ian Westmacott"

Date:

16 April 2005, 01:00:41

Hi Chris,

> I think it is important to figure out why this is happening.  I would
> not want to run any production databases on systems that were failing
> like this.

You and me both :)  (in our application though, it is not a total
disaster to lose the last 5 minutes of transactions, it is a disaster
if the database is unusable when it comes up)

> 1)  Any other computers suffer random application crashes, power downs,
> etc. in your building?

No, but more importantly, we have seen this failure happen in different
buildings (in different cities), on same spec but different hardware
(at least three motherboards and power supplies, six disks).  That's why
it really feels like a bug or configuration error.

> 2)  I take it there are no Raid controllers involved?

No.  But we get this error with and without software RAID, FWIW.

> 3)  RAM is non-ECC?

I'll have to double-check, but I think it is.

> 4)  Are the systems on UPS's?

Yes.

> If I could make a wild (and probably wrong) guess, I would wonder if
> something external to the system (like electrical supply) was
> introducing glitches into memory, causing bad data to be written.  I am
> only mentioning it because I have implicated electrical supply in other
> cases where rare computer failurres weer affecting many systems...

I would tend to agree, but this occurs on multiple systems in
multiple locations (but, oddly enough, we are having trouble reproducing
it in our lab).  And we have run memtest.

However, it is true that all the systems on which
this has been seen have the same spec power supply/UPS.  I would think
though, that this could cause error at any time -- all
of these failures occur after reboot (that is, no corruption, reboot,
immediate corruption).  I have stopped/started Postgres
while the application is running, without corruption.  (smells like
a dirty buffer not being written to disk, which is why we focused on
the filesystem).

Here are some further details:

- 865PE/G Neo2-P (MS-6728) ATX motherboard (and similar for IDE)
- 2x 512MB/400MHz DIMM RAM
- Intel Pentium 4/3.2GHz/1MB/800MHz CPU (hyperthreading enabled)
- 2x WD 250GB/7200RPM/8MB/SATA-150 on ICH5 SATA ports (also tested
  similar IDE drives), writethrough
- XFS and JFS (not seen on ext3, but not fully tested)
- either software RAID 0 on both drives, or one drive alone without RAID
- SuSE 9.1
- 2.6.6 kernel
- Postgres 7.4.2
- 300 TPS against DB containing 5-50GB data, no more than a dozen
  concurrent connections.
- fsync (or not) and fdatasync
- Postgres may be taken down (via init script) with connections open to
  it (in fact the application may aggressively try to re-establish the
  connection as it goes down).
- we have put syncs, sleeps and large dd to the disk in the shutdown
  scripts, none of which work.


At this point, I'm really looking for fresh ideas.  Thanks,

    --Ian

Re: database corruption

From

Amadeus Zilfinski

Date:

16 April 2005, 13:38:04

,snipped.
> - SuSE 9.1
> - 2.6.6 kernel
> - Postgres 7.4.2
> - 300 TPS against DB containing 5-50GB data, no more than a dozen
>   concurrent connections.
> - fsync (or not) and fdatasync

I remember a problem that was fixed in the 2.6.9 kernel concerning XFS
corruption (shutdowns I think were the worst). Also introduced some JFS
stuff, but I don't run JFS so I didn't really pay much attention to
that.

Can you try a later (the latest?) vanilla kernel?

AmadeusZ.-

Re: database corruption

From

"Ian Westmacott"

Date:

17 April 2005, 22:35:06

Thanks for the tip.  Later kernel versions have unrelated problems for
us, but we'll take a look at the filesystem mods and see if we can
backpatch them.

    --Ian


> I remember a problem that was fixed in the 2.6.9 kernel concerning XFS
> corruption (shutdowns I think were the worst). Also introduced some JFS
> stuff, but I don't run JFS so I didn't really pay much attention to
> that.
>
> Can you try a later (the latest?) vanilla kernel?

Re: database corruption

From

"Ian Westmacott"

Date:

17 April 2005, 22:43:01

> In your previous emails, you stated that these errors were seen on
> multiple systems.  Multiple systems, configured identically, with
> diverse motherboards/hardware or always identical hardware except for
> the sata/IDE drives?
>
> I ask, because I noted that you are using the Neo 2 ATX motherboard that
> has the "dynamic overclocking" feature.  This can cause grave
> instability in systems when it set to an aggressive setting.  This
> motherboard is not compatible with low quality ram either.   This
> motherboard also is prone to heat problems, and requires extra cooling.
> If you have multiple systems failing, using this motherboard, then that
> particular model could be the culprit.

We have seen this using at least one other motherboard besides the Neo 2
ATX.  And (fortunately or unfortunately) we don't use the dynamic
overclocking feature.  Do you know how the RAM incompatibility is
manifest?

Thanks,

    --Ian