Re: warning message in standby - Mailing list pgsql-hackers

From Robert Haas
Subject Re: warning message in standby
Msg-id AANLkTimzwrEKk7HfREaoGw6TjrzOOiG7cDn80W-aNwPp@mail.gmail.com
In response to Re: warning message in standby  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: warning message in standby
List pgsql-hackers
On Tue, Jun 29, 2010 at 6:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 29, 2010 at 3:55 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On the other hand, I like immediate-panicking. And I don't want the standby
>>> to retry reconnecting the master infinitely.
>>
>> On second thought, the peremptory PANIC is not good for HA system. If the
>> master unfortunately has written an invalid record because of its crash,
>> the standby would exit with PANIC before performing a failover.
>
> I don't think that should ever happen.  The master only streams WAL
> that it has fsync'd.  Presumably there's no reason for the master to
> ever fsync a partial WAL record (which is usually how a corrupt record
> gets into the stream).
>
>> So when an invalid record is found in streamed WAL file, we should keep
>> the standby running and leave the decision whether the standby retries to
>> connect to the master forever or shuts down right now, up to the user
>> (actually, it may be a clusterware)?
>
> Well, if we want to leave it up to the user/clusterware, the current
> code is possibly adequate, although there are many different log
> messages that could signal this situation, so coding it up might not
> be too trivial.

So here's a patch that seems to implement the behavior I'm thinking of:
if we repeatedly retrieve the same WAL record from the master and
never succeed in replaying it, then give up.

It seems we don't have 100% consensus on this, but I thought posting
the patch might inspire some further thoughts.  I'm really
uncomfortable with the idea that if the standby gets out of sync with
the master, we'll just do this forever:

FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638
FATAL:  terminating walreceiver process due to administrator command
LOG:  streaming replication successfully connected to primary
LOG:  invalid record length at 0/313FB638

...with this patch, following the above, you get:

FATAL:  invalid record in WAL stream
HINT:  Take a new base backup, or remove recovery.conf and restart in read-write mode.
LOG:  startup process (PID 6126) exited with exit code 1
LOG:  terminating any other active server processes

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment