Thread: database corruption
For several weeks now we have been experiencing fairly severe database corruption upon clean reboot. It is very repeatable, and the corruption is of the following forms: ERROR: could not access status of transaction foo DETAIL: could not open file "bar": No such file or directory ERROR: invalid page header in block foo of relation "bar" ERROR: uninitialized page in block foo of relation "bar" At first, we believed this was related to XFS, and have been pursuing investigations along those lines. However, we have now experienced the exact same problem with JFS. Here are some details: - Postgres 7.4.2 - 2.6.6 kernel.org kernel - dedicated database partition - repeatable with XFS and JFS (have not seen on ext3) - repeatable with and without Linux software RAID 0 - repeatable with IDE and SATA - repeatable with and without fsync, and with fdatasync - repeatable on multiple systems I have two questions: - any known reason why this might be occurring? (we must have something wrong, for this high rate of severe error). - if I don't care about losing data, and am not interested in trying to recover anything, how can I arrange for Postgres to proceed normally? I know about zero_damaged_pages, but this doesn't help with missing transaction files and such. Is there any way to get Postgres to chuck anything bad and proceed? Thanks, --Ian
Hi Ian; I think it is important to figure out why this is happening. I would not want to run any production databases on systems that were failing like this. I am trying to figure out what are the likely causes of the errors... 1) Any other computers suffer random application crashes, power downs, etc. in your building? 2) I take it there are no Raid controllers involved? 3) RAM is non-ECC? 4) Are the systems on UPS's? If I could make a wild (and probably wrong) guess, I would wonder if something external to the system (like electrical supply) was introducing glitches into memory, causing bad data to be written. I am only mentioning it because I have implicated electrical supply in other cases where rare computer failurres weer affecting many systems... Ian Westmacott wrote: >For several weeks now we have been experiencing fairly >severe database corruption upon clean reboot. It is very >repeatable, and the corruption is of the following forms: > >ERROR: could not access status of transaction foo >DETAIL: could not open file "bar": No such file or directory > >ERROR: invalid page header in block foo of relation "bar" > >ERROR: uninitialized page in block foo of relation "bar" > > >At first, we believed this was related to XFS, and have >been pursuing investigations along those lines. However, >we have now experienced the exact same problem with JFS. > >Here are some details: > >- Postgres 7.4.2 >- 2.6.6 kernel.org kernel >- dedicated database partition >- repeatable with XFS and JFS (have not seen on ext3) >- repeatable with and without Linux software RAID 0 >- repeatable with IDE and SATA >- repeatable with and without fsync, and with fdatasync >- repeatable on multiple systems > > >I have two questions: > >- any known reason why this might be occurring? (we must > have something wrong, for this high rate of severe > error). > >- if I don't care about losing data, and am not interested > in trying to recover anything, how can I arrange for > Postgres to proceed normally? I know about > zero_damaged_pages, but this doesn't help with missing > transaction files and such. Is there any way to get > Postgres to chuck anything bad and proceed? > >Thanks, > > --Ian > > > >---------------------------(end of broadcast)--------------------------- >TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > > > >
Hi Chris, > I think it is important to figure out why this is happening. I would > not want to run any production databases on systems that were failing > like this. You and me both :) (in our application though, it is not a total disaster to lose the last 5 minutes of transactions, it is a disaster if the database is unusable when it comes up) > 1) Any other computers suffer random application crashes, power downs, > etc. in your building? No, but more importantly, we have seen this failure happen in different buildings (in different cities), on same spec but different hardware (at least three motherboards and power supplies, six disks). That's why it really feels like a bug or configuration error. > 2) I take it there are no Raid controllers involved? No. But we get this error with and without software RAID, FWIW. > 3) RAM is non-ECC? I'll have to double-check, but I think it is. > 4) Are the systems on UPS's? Yes. > If I could make a wild (and probably wrong) guess, I would wonder if > something external to the system (like electrical supply) was > introducing glitches into memory, causing bad data to be written. I am > only mentioning it because I have implicated electrical supply in other > cases where rare computer failurres weer affecting many systems... I would tend to agree, but this occurs on multiple systems in multiple locations (but, oddly enough, we are having trouble reproducing it in our lab). And we have run memtest. However, it is true that all the systems on which this has been seen have the same spec power supply/UPS. I would think though, that this could cause error at any time -- all of these failures occur after reboot (that is, no corruption, reboot, immediate corruption). I have stopped/started Postgres while the application is running, without corruption. (smells like a dirty buffer not being written to disk, which is why we focused on the filesystem). Here are some further details: - 865PE/G Neo2-P (MS-6728) ATX motherboard (and similar for IDE) - 2x 512MB/400MHz DIMM RAM - Intel Pentium 4/3.2GHz/1MB/800MHz CPU (hyperthreading enabled) - 2x WD 250GB/7200RPM/8MB/SATA-150 on ICH5 SATA ports (also tested similar IDE drives), writethrough - XFS and JFS (not seen on ext3, but not fully tested) - either software RAID 0 on both drives, or one drive alone without RAID - SuSE 9.1 - 2.6.6 kernel - Postgres 7.4.2 - 300 TPS against DB containing 5-50GB data, no more than a dozen concurrent connections. - fsync (or not) and fdatasync - Postgres may be taken down (via init script) with connections open to it (in fact the application may aggressively try to re-establish the connection as it goes down). - we have put syncs, sleeps and large dd to the disk in the shutdown scripts, none of which work. At this point, I'm really looking for fresh ideas. Thanks, --Ian
,snipped. > - SuSE 9.1 > - 2.6.6 kernel > - Postgres 7.4.2 > - 300 TPS against DB containing 5-50GB data, no more than a dozen > concurrent connections. > - fsync (or not) and fdatasync I remember a problem that was fixed in the 2.6.9 kernel concerning XFS corruption (shutdowns I think were the worst). Also introduced some JFS stuff, but I don't run JFS so I didn't really pay much attention to that. Can you try a later (the latest?) vanilla kernel? AmadeusZ.-
Thanks for the tip. Later kernel versions have unrelated problems for us, but we'll take a look at the filesystem mods and see if we can backpatch them. --Ian > I remember a problem that was fixed in the 2.6.9 kernel concerning XFS > corruption (shutdowns I think were the worst). Also introduced some JFS > stuff, but I don't run JFS so I didn't really pay much attention to > that. > > Can you try a later (the latest?) vanilla kernel?
> In your previous emails, you stated that these errors were seen on > multiple systems. Multiple systems, configured identically, with > diverse motherboards/hardware or always identical hardware except for > the sata/IDE drives? > > I ask, because I noted that you are using the Neo 2 ATX motherboard that > has the "dynamic overclocking" feature. This can cause grave > instability in systems when it set to an aggressive setting. This > motherboard is not compatible with low quality ram either. This > motherboard also is prone to heat problems, and requires extra cooling. > If you have multiple systems failing, using this motherboard, then that > particular model could be the culprit. We have seen this using at least one other motherboard besides the Neo 2 ATX. And (fortunately or unfortunately) we don't use the dynamic overclocking feature. Do you know how the RAM incompatibility is manifest? Thanks, --Ian