On Wed, 2005-02-16 at 07:14, Alban Hertroys wrote:
> Marco Colombo wrote:
> > On Wed, 16 Feb 2005, Andrew Hall wrote:
> >
> >> fsync is on for all these boxes. Our customers run their own hardware
> >> with many different specification of hardware in use. Many of our
> >> customers don't have UPS, although their power is probably pretty
> >> reliable (normal city based utilities), but of course I can't
> >> guarantee they don't get an outage once in a while with a thunderstorm
> >> etc.
> >
> >
> > I see. Well I can't help much, then, I don't run PG on XFS. I suggest
> > testing
> > on a different FS, to exclude XFS problems. But with fsync on, the FS has
> > very little to do with reliability, unless it _lies_ about fsync(). Any
> > FS should return from fsync only after data is on disc, journal or not
> > (there might be issues with meta-data, but it's hardly a problem with PG).
> >
> > It's more likely the hardware (IDE disks) lies about data being on plate.
> > But again that's only in case of sudden poweroffs.
>
> Do you happen to have the same type disks in all these systems? That
> could point to a disk cache "problem" (f.e. the disks lying about having
> written data from the cache to disk).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks w/o write caching and/or in synchronous mode
> (contrary to "async").
I was wondering if this problem had ever shown up on a machine that
HADN'T lost power abrubtly or not. IFF the only machines that
experience corruption have lost power beforehand sometime, then I would
look towards either the drives, controller or file system or somewhere
in there.
I know there are write modes in ext3 that will allow corruption on power
loss (I think it's writeback). I know little of XFS in a production
environment, as I run ext3, warts and all.