Re: database corruption - Mailing list pgsql-admin

From Chris Travers
Subject Re: database corruption
Date
Msg-id 42606A69.9010102@travelamericas.com
In response to database corruption  (Ian Westmacott <ianw@intellivid.com>)
Hi Ian;

I think it is important to figure out why this is happening.  I would
not want to run any production databases on systems that were failing
like this.

I am trying to figure out the likely causes of the errors:

1)  Do any other computers in your building suffer random application
crashes, power-downs, etc.?
2)  I take it there are no RAID controllers involved?
3)  Is the RAM non-ECC?
4)  Are the systems on UPSes?

If I could make a wild (and probably wrong) guess, I would wonder if
something external to the system (like electrical supply) was
introducing glitches into memory, causing bad data to be written.  I am
only mentioning it because I have implicated electrical supply in other
cases where rare computer failures were affecting many systems...
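
One way to check whether bad data is reaching the disk independently of
Postgres would be to write a known pattern to the database partition,
fsync it, do the clean reboot, and then compare. The following is only a
rough sketch under my own assumptions: the function names, the seed, and
the 8192-byte block size (chosen to match the default Postgres page size)
are all illustrative, and this is not a Postgres tool.

```python
import hashlib
import os

BLOCK_SIZE = 8192  # illustrative; matches the default PostgreSQL page size


def make_block(seed: bytes, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Return a deterministic block derived from the seed and block index."""
    digest = hashlib.sha256(seed + index.to_bytes(8, "big")).digest()
    reps = size // len(digest) + 1
    return (digest * reps)[:size]


def write_pattern(path: str, n_blocks: int, seed: bytes = b"probe") -> None:
    """Write n_blocks deterministic blocks and ask the OS to flush them to disk."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(make_block(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, n_blocks: int, seed: bytes = b"probe") -> list:
    """Return the indices of blocks that are missing or differ from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(BLOCK_SIZE) != make_block(seed, i):
                bad.append(i)
    return bad
```

The idea would be to call write_pattern() on a file in the affected
partition before the reboot and verify_pattern() with the same arguments
afterwards. An empty list means every block came back intact. Any indices
returned would suggest the damage happens below Postgres (filesystem,
kernel, or hardware).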

Ian Westmacott wrote:

>For several weeks now we have been experiencing fairly
>severe database corruption upon clean reboot.  It is very
>repeatable, and the corruption is of the following forms:
>
>ERROR:  could not access status of transaction foo
>DETAIL:  could not open file "bar": No such file or directory
>
>ERROR:  invalid page header in block foo of relation "bar"
>
>ERROR:  uninitialized page in block foo of relation "bar"
>
>
>At first, we believed this was related to XFS, and have
>been pursuing investigations along those lines.  However,
>we have now experienced the exact same problem with JFS.
>
>Here are some details:
>
>- Postgres 7.4.2
>- 2.6.6 kernel.org kernel
>- dedicated database partition
>- repeatable with XFS and JFS (have not seen on ext3)
>- repeatable with and without Linux software RAID 0
>- repeatable with IDE and SATA
>- repeatable with and without fsync, and with fdatasync
>- repeatable on multiple systems
>
>
>I have two questions:
>
>- any known reason why this might be occurring?  (we must
>  have something wrong, for this high rate of severe
>  error).
>
>- if I don't care about losing data, and am not interested
>  in trying to recover anything, how can I arrange for
>  Postgres to proceed normally?  I know about
>  zero_damaged_pages, but this doesn't help with missing
>  transaction files and such.  Is there any way to get
>  Postgres to chuck anything bad and proceed?
>
>Thanks,
>
>    --Ian

