On Fri, 2006-07-28 at 22:30, Merlin Moncure wrote:
> On 7/28/06, Arnaud Lesauvage <thewild@freesurf.fr> wrote:
> > Csaba Nagy wrote:
> > > I found that PITR using WAL shipping is not protecting against all
> > > failure scenarios... it sure will help if the primary machine's hardware
> > > fails, but in one case it was useless for us: the primary had a Linux
> > > kernel with buggy XFS code (that's what I think it was, because we never
> > > found out for sure) and we did use XFS for the data partition, and at
> > > one point it started to get corruptions at the data page level. The
> > > corruption was promptly transferred to the standby, and therefore it was
> > > also unusable... we had to recover from a backup, with the related
> > > downtime. Not good for business...
> > >
> > OK, but corruption at the data page level is a very unlikely
> > event, isn't it?

It's not... it just happened to me again, strangely this time on a Slony
replica. It might be that the hardware/OS/FS combination we use is the
problem, or it might be that Postgres has some problem with that
combination (I would rule out Slony being able to produce such things).
But it did happen, and I can't rule out that it will happen again. This
time I hope I'll be able to investigate more closely.

> Yes, and that is not a PITR problem, that is a data corruption
> problem. I am very suspicious that Slony-style replication would
> provide any sort of defense against replicating from a machine which
> is changing bytes from a to b, etc. I think the best defense against
> *that* sort of problem would be synchronous replication via pgpool.

When it happened to us, it was a few blocks in some tables, and I
suspect it was an OS/FS bug. In that case Slony would not propagate the
error: it might propagate bad data, but not the page error itself. So it
might not protect against bad data, but I would be able to switch over
and have a working system immediately, instead of recovering from
yesterday's backup after 8 hours of downtime. So instead of losing a
day's worth of data and having 8 hours of downtime, I'll have 1 minute
of downtime and a few bad entries in the DB... for the kind of
application we have here, that is definitely a better scenario.

Cheers,
Csaba.