Thread: (Again) Datacorruption using 7.4.2 on XFS/raid1

(Again) Datacorruption using 7.4.2 on XFS/raid1

From
"Florian G. Pflug"
Date:
Hi

We have again experienced data-corruption using 7.4.2 on an XFS Filesystem
on top of a software-raid (md) raid-1.

After a server crash last night (It was a rather strange crash - The machine
was still pingable, but no login was possible, and postgres and apache
didn't respond to requests any more) we hard-reset the machine. It came up
again nicely, but a few hours later the following errors occurred when trying
to access certain tables. (Those tables are updated heavily - each day about
2 million tuples are inserted, and the old versions of those tuples
deleted).

ERROR:  could not access status of transaction 34048
DETAIL:  could not open file "/var/lib/postgres/data/pg_clog/0000": No such
file or directory
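
For readers puzzling over the file name in that error: the sketch below shows how a transaction ID is commonly understood to map onto a pg_clog segment file. The constants are assumptions taken from PostgreSQL's default build settings (8 kB pages, 2 status bits per transaction, 32 pages per segment); they are not stated in this thread, and `clog_location` is a hypothetical helper name, not a PostgreSQL function.

```python
# Assumed constants (PostgreSQL defaults, not confirmed for this installation).
BLCKSZ = 8192                  # page size in bytes
CLOG_XACTS_PER_BYTE = 4        # 2 status bits per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE   # 32768 transactions
SLRU_PAGES_PER_SEGMENT = 32    # pages per pg_clog segment file


def clog_location(xid):
    """Return (segment file name, page within segment, byte within page)."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    page_in_segment = page % SLRU_PAGES_PER_SEGMENT
    byte_in_page = (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_BYTE
    return "%04X" % segment, page_in_segment, byte_in_page


# Transaction 34048 from the error message above.
print(clog_location(34048))
```

Under these assumptions, transaction 34048 falls in segment file 0000, which matches the file named in the DETAIL line.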

While reading linux-kernel today, I stumbled upon a description of a rather
strange XFS behaviour. It seems to zero a block if the block was updated,
and the corresponding metadata-update was flushed to disk, but not the data
itself.
It does not happen if the file is fsynced() after the update - but I was
wondering what would happen if the machine crashed between the write() and
the fsync().
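
To make the window in question concrete, here is a minimal sketch of the write()-then-fsync() sequence. It is an illustration only and not how postgres itself is implemented; `durable_write` is a hypothetical helper name.

```python
import os


def durable_write(path, data):
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may write fewer bytes than asked
            view = view[written:]
        # Until fsync() returns, the data may exist only in the OS page
        # cache. A crash at this point is the window asked about above.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller can treat the data as being on stable storage only once `durable_write` has returned without raising; anything written but not yet fsynced has no such guarantee.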

The lkml thread about this can be found here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html

Could this XFS behaviour cause the postgres problems we are seeing?

greetings, Florian Pflug

Re: (Again) Datacorruption using 7.4.2 on XFS/raid1

From
Brian Hirt
Date:
FYI, I have seen Linux software RAID fail to detect failed drives and cause
filesystem corruption on many occasions.  I would recommend staying
away from it.  Maybe what you describe is a problem with PG, but I
doubt it.


On Jul 12, 2004, at 12:31 PM, Florian G. Pflug wrote:

> Hi
>
> We have again experienced data-corruption using 7.4.2 on an XFS Filesystem
> on top of a software-raid (md) raid-1.
>
> After a server crash last night (It was a rather strange crash - The machine
> was still pingable, but no login was possible, and postgres and apache
> didn't respond to requests any more) we hard-reset the machine. It came up
> again nicely, but a few hours later the following errors occurred when trying
> to access certain tables. (Those tables are updated heavily - each day about
> 2 million tuples are inserted, and the old versions of those tuples
> deleted).
>
> ERROR:  could not access status of transaction 34048
> DETAIL:  could not open file "/var/lib/postgres/data/pg_clog/0000": No such
> file or directory
>
> While reading linux-kernel today, I stumbled upon a description of a rather
> strange XFS behaviour. It seems to zero a block if the block was updated,
> and the corresponding metadata-update was flushed to disk, but not the data
> itself.
> It does not happen if the file is fsynced() after the update - but I was
> wondering what would happen if the machine crashed between the write() and
> the fsync().
>
> The lkml thread about this can be found here:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html
>
> Could this XFS behaviour cause the postgres problems we are seeing?
>
> greetings, Florian Pflug


Re: (Again) Datacorruption using 7.4.2 on XFS/raid1

From
Ian Barwick
Date:
On Mon, 12 Jul 2004 20:31:15 +0200, Florian G. Pflug <fgp@phlo.org> wrote:
> Hi
>
> We have again experienced data-corruption using 7.4.2 on an XFS Filesystem
> on top of a software-raid (md) raid-1.
>
> After a server crash last night (It was a rather strange crash - The machine
> was still pingable, but no login was possible, and postgres and apache
> didn't respond to requests any more) we hard-reset the machine. It came up
> again nicely, but a few hours later the following errors occurred when trying
> to access certain tables. (Those tables are updated heavily - each day about
> 2 million tuples are inserted, and the old versions of those tuples
> deleted).
>
> ERROR:  could not access status of transaction 34048
> DETAIL:  could not open file "/var/lib/postgres/data/pg_clog/0000": No such
> file or directory

You don't say what kind of disks you are using. Sounds very much like
hardware problems though.

I had a PostgreSQL installation on a pair of IDE disks with software
RAID1 / Ext3 die very nastily with similar error messages. Turned out
that one of the disks was very defective and the RAID wasn't handling
it.

On the other hand - after copying the files from the good disk,
PostgreSQL started with barely a complaint and I couldn't detect any
corruption.

Ian Barwick

Re: (Again) Datacorruption using 7.4.2 on XFS/raid1

From
"Florian G. Pflug"
Date:
On Mon, Jul 12, 2004 at 01:22:02PM -0600, Brian Hirt wrote:
> FYI, I have seen Linux software RAID fail to detect failed drives and cause
> filesystem corruption on many occasions.  I would recommend staying
> away from it.  Maybe what you describe is a problem with PG, but I
> doubt it.

Hi

I was under the impression that this only applies to the ataraid-drivers
(Those drivers for promise and hpt raid-controllers that don't really
provide hardware raid, but do provide a BIOS that is capable of booting from
raid1 and raid0 arrays) - well, I guess I'll have to figure out some way to
test the software raid.
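
One possible first check, sketched under assumptions: on Linux, /proc/mdstat prints a member map such as [UU] for each md array, with "_" marking a missing member. The helper name `degraded_arrays` and the sample text are made up for illustration. Note that this only reports members the md driver has already dropped; it would not reveal a drive that returns bad data without being marked as failed, which is the case described earlier in the thread.

```python
import re

# Made-up sample in the usual /proc/mdstat layout, for illustration only.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      39078016 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map shows a missing disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)   # remember which array we are in
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):  # "_" marks a missing member
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(SAMPLE))
```

On a real machine the same function could be fed the contents of /proc/mdstat instead of the sample text.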

greetings, Florian Pflug