On Mon, Jun 14, 2010 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jun 14, 2010 at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That's a different question altogether ;-). I assume you're not
>>> satisfied by the change Heikki committed a couple hours ago?
>>> It will at least try to do something to recover.
>
>> Yeah, I'm not satisfied by that. It's an improvement in the technical
>> sense - it replaces an infinite retry that spins at top speed with a
>> slower retry that won't flog your CPU quite so badly, but the chances
>> that it will actually succeed in correcting the underlying problem
>> seem infinitesimal.
>
> I'm not sure about that. walreceiver will refetch from the start of the
> current WAL page, so there's at least some chance of getting a good copy
> when we didn't have one before.

The testing that I have been doing while we've been discussing this
reveals that you are correct. I set up an HS/SR master and slave
(running on the same machine), ran pgbench on the master, and then
started randomly sending SIGSEGV to one of the master's backends. It
seems that complaints about the WAL are possible on both master and
slave. Here are a couple from the slave:

LOG: unexpected pageaddr 0/89B7A000 in log file 0, segment 152, offset 12034048
WARNING: there is no contrecord flag in log file 0, segment 136, offset 2523136
LOG: invalid magic number 0000 in log file 0, segment 136, offset 2531328

The slave reconnects and then things get better. So I think your idea
of retrying once and then panicking is probably best.

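To make the retry-once-then-panic policy concrete, here is a rough
sketch in Python. It is not PostgreSQL code: WalPanic,
read_record_with_one_retry, read_record and refetch are all
hypothetical names invented for illustration, and the real logic
would live in the C recovery code.

```python
class WalPanic(Exception):
    """Hypothetical stand-in for a PANIC-level error during recovery."""


def read_record_with_one_retry(read_record, refetch):
    """Read one WAL record, retrying exactly once after a refetch.

    read_record: callable returning a record, or None if the record is invalid.
    refetch: callable that re-obtains the data, e.g. by reconnecting and
             re-streaming from the start of the current WAL page.
    Both callables are assumptions of this sketch, not real PostgreSQL APIs.
    """
    record = read_record()
    if record is not None:
        return record

    # First failure: the copy we have may simply be bad, so fetch it again.
    refetch()

    record = read_record()
    if record is None:
        # Second failure: stop retrying instead of looping forever.
        raise WalPanic("WAL record still invalid after one retry")
    return record
```

A transient bad page (like the ones in the slave log above) would be
recovered by the single refetch, while a persistently bad record
raises WalPanic instead of retrying indefinitely.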
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company