streaming replication breaks horribly if master crashes - Mailing list pgsql-hackers

On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> But that change would cause the problem that Robert pointed out.
>> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php
>
> Presumably this means that if synchronous_commit = off on primary that
> SR in 9.0 will no longer work correctly if the primary crashes?

I spent some time investigating this today and have come to the
conclusion that streaming replication is really, really broken in the
face of potential crashes on the master.  Using a copy of VMware
parallels provided by $EMPLOYER, I set up two Fedora 12 virtual
machines on my MacBook in a master/slave configuration.  Then I
crashed the master repeatedly using 'echo b > /proc/sysrq-trigger',
which causes an immediate reboot (without syncing the disks, closing
network connections, etc.) while running pgbench or other stuff
against it.

The first problem I noticed is that the slave never seems to realize
that the master has gone away.  Every time I crashed the master, I had
to kill the wal receiver process on the slave to get it to reconnect;
otherwise it just sat there waiting, either forever or at least for
longer than I was willing to wait.

More seriously, I was able to demonstrate that the problem linked in
the thread above is real: if the master crashes after streaming WAL
that it hasn't yet fsync'd, then on recovery the slave's xlog position
is ahead of the master's.  So far I've only been able to reproduce this
with fsync=off, but I believe it's possible with fsync=on as well;
fsync=off just makes it more likely.  After the most recent crash, the
master thought
pg_current_xlog_location() was 1/86CD4000; the slave thought
pg_last_xlog_receive_location() was 1/8733C000.  After reconnecting to
the master, the slave then thought that
pg_last_xlog_receive_location() was 1/87000000.  The slave didn't
think this was a problem yet, though.  When I then restarted a pgbench
run against the master, the slave pretty quickly started spewing an
endless stream of messages complaining of "LOG: invalid record length
at 1/8733A828".
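
As a rough illustration of the arithmetic behind those positions (a
sketch, not part of the original test setup), the locations quoted
above can be parsed and compared as follows.  The helper names
parse_xlog_location and segment_of are made up for this example, and
the 16 MB segment size is an assumption (the default build setting).

```python
# Illustrative only: these helpers are hypothetical, not PostgreSQL APIs.
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size (16 MB)


def parse_xlog_location(text):
    """Split 'xlogid/xrecoff' (both hexadecimal) into an integer tuple."""
    xlogid, xrecoff = text.split("/")
    return int(xlogid, 16), int(xrecoff, 16)


def segment_of(location):
    """Return (xlogid, segment number within that xlogid)."""
    xlogid, xrecoff = location
    return xlogid, xrecoff // SEGMENT_SIZE


# Values quoted in the message above.
master = parse_xlog_location("1/86CD4000")   # pg_current_xlog_location()
slave = parse_xlog_location("1/8733C000")    # pg_last_xlog_receive_location()
restart = parse_xlog_location("1/87000000")  # slave position after reconnect

# Tuples compare field by field, so this orders by xlogid, then offset.
slave_is_ahead = slave > master

# Both positions share xlogid 1, so the gap is a plain offset difference.
gap_bytes = slave[1] - master[1]

# The slave is one segment past the master, and the position it reported
# after reconnecting sits exactly on a segment boundary.
crossed_boundary = segment_of(slave) != segment_of(master)
restart_on_boundary = restart[1] % SEGMENT_SIZE == 0
```

With these inputs the slave is ahead of the master by 0x668000 bytes
and is in the segment after the master's, which matches the rewind to
1/87000000 described above.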

So, obviously at this point my slave database is corrupted beyond
repair due to nothing more than an unexpected crash on the master.
That's bad.  What is worse is that the system only detected the
corruption because the slave had crossed an xlog segment boundary
which the master had not crossed.  Had it been otherwise, when the
slave rewound to the beginning of the current segment, it would have
had no trouble getting back in sync with the master - but it would
have done this after having replayed WAL that, from the master's point
of view, doesn't exist.  In other words, the database on the slave
would be silently corrupted.
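
The distinction drawn above can be sketched as a small check.  This
only models the behavior as described in this message (detection
happens only when the slave has crossed a segment boundary the master
has not); it is not the server's actual logic.  The function name
divergence_outcome, the 16 MB segment size, and the second example
position are assumptions made for illustration.

```python
# Illustrative model of the scenario described above, not server code.
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size (16 MB)


def parse_xlog_location(text):
    """Split 'xlogid/xrecoff' (both hexadecimal) into an integer tuple."""
    xlogid, xrecoff = text.split("/")
    return int(xlogid, 16), int(xrecoff, 16)


def divergence_outcome(master_text, slave_text):
    """Classify a post-crash pair of positions per the description above."""
    master = parse_xlog_location(master_text)
    slave = parse_xlog_location(slave_text)
    if slave <= master:
        return "not ahead"  # slave holds no WAL the master lacks
    master_seg = (master[0], master[1] // SEGMENT_SIZE)
    slave_seg = (slave[0], slave[1] // SEGMENT_SIZE)
    if slave_seg == master_seg:
        # Rewinding to the segment start lets the slave resync quietly,
        # even though it already replayed WAL the master never kept.
        return "silent"
    # Slave is in a later segment than the master: the case reported here.
    return "detected"


# Positions from this report: slave crossed into segment 0x87.
reported = divergence_outcome("1/86CD4000", "1/8733C000")

# Hypothetical position: slave ahead but still inside segment 0x86.
hypothetical = divergence_outcome("1/86CD4000", "1/86F00000")
```

Here reported comes out as "detected" and hypothetical as "silent",
the second being the case where nothing would have been logged.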

I don't know what to do about this, but I'm pretty sure we can't ship it as-is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

