On 10/06/10 17:38, Tom Lane wrote:
> Robert Haas<robertmhaas@gmail.com> writes:
>> On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
>>> When an error is found in the WAL streamed from the master, a warning
>>> message is repeated without interval forever in the standby. This
>>> consumes CPU load very much, and would interfere with read-only queries.
>>> To fix this problem, we should add a sleep into emode_for_corrupt_record()
>>> or somewhere? Or we should stop walreceiver and retry to read WAL from
>>> pg_xlog or the archive?
>
>> I ran into this problem at one point, too, but was in the middle of
>> trying to investigate a different bug and didn't have time to track
>> down what was causing it.
>
>> I think the basic question here is - if there's an error in the WAL,
>> how do we expect to EVER recover? Even if we can read from the
>> archive or pg_xlog, presumably it's the same WAL - why should we be
>> any more successful the second time?
>
> What "warning message" are we talking about? All the error cases I can
> think of in WAL-application are ERROR, or likely even PANIC.

We're talking about a corrupt record (incorrect CRC, incorrect backlink,
etc.), not errors within redo functions. During crash recovery, a
corrupt record means you've reached the end of WAL. In standby mode, when
streaming WAL from the master, that shouldn't happen, and it's not clear
what to do if it does. PANIC is not a good idea, at least if the server
uses hot standby, because that only makes the situation worse from an
availability point of view. So we log the error as a WARNING and keep
retrying. It's unlikely that the problem will just go away, but we keep
retrying anyway in the hope that it does. However, it seems that we're
too aggressive with the retries, since nothing delays the next attempt.
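
To make the "less aggressive" idea concrete, here is a rough sketch of a
retry loop that sleeps between attempts, with the delay doubling up to a
cap. This is not PostgreSQL code: the constants, next_retry_delay_ms(),
retry_read_record() and both callback types are hypothetical names made
up for illustration, and whether a fixed sleep or a backoff is the right
choice is an open question.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limits for illustration; not PostgreSQL settings. */
#define RETRY_DELAY_MIN_MS 100
#define RETRY_DELAY_MAX_MS 5000

/* Hypothetical callbacks: try to read one record, and sleep for ms. */
typedef bool (*read_record_fn) (void *arg);
typedef void (*sleep_fn) (int ms, void *arg);

/*
 * Delay before the next retry: starts at the minimum, doubles for each
 * consecutive failure, and never exceeds the maximum.
 */
int
next_retry_delay_ms(int consecutive_failures)
{
    int         delay = RETRY_DELAY_MIN_MS;

    for (int i = 0; i < consecutive_failures && delay < RETRY_DELAY_MAX_MS; i++)
        delay *= 2;

    return (delay > RETRY_DELAY_MAX_MS) ? RETRY_DELAY_MAX_MS : delay;
}

/*
 * Keep retrying until the record reads cleanly or max_attempts is used up.
 * Returns the number of failed attempts before success, or -1 on giving up.
 */
int
retry_read_record(read_record_fn try_read, sleep_fn do_sleep,
                  void *arg, int max_attempts)
{
    for (int failures = 0; failures < max_attempts; failures++)
    {
        int         delay;

        if (try_read(arg))
            return failures;

        /* Warn once per attempt, then wait instead of spinning. */
        delay = next_retry_delay_ms(failures);
        fprintf(stderr, "WARNING: invalid record, retrying in %d ms\n", delay);
        do_sleep(delay, arg);
    }

    return -1;
}
```

With these made-up limits, a standby stuck on a bad record would log at
most one warning every five seconds once the cap is reached, rather
than spinning on the CPU.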

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com