Re: BUG #15567: Wal receiver process restart failed when a damagedwal record arrived at standby. - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #15567: Wal receiver process restart failed when a damagedwal record arrived at standby.
Date
Msg-id 20190109071411.GA23045@paquier.xyz
Whole thread Raw
In response to BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.  ("lichuancheng@highgo.com" <lichuancheng@highgo.com>)
List pgsql-bugs
On Thu, Jan 03, 2019 at 05:50:48PM +0800, lichuancheng@highgo.com wrote:
>> I read the code about startup process and walreciver process.
>> When a damaged wal record arrived:
>> 1.The startup process shutdown the walreciver process use function ShutdownWalRcv().
>> 2.The startup process start the walreciver process use funtion RequestXLogStreaming() by signal
PMSIGNAL_START_WALRECEIVER.
>> 3.The startup process try to read wal record.
>>  And now reciver process does not startup completely,so the startup process read the damaged wal record another
times.
>>  And startup process set walrcv->walRcvState = WALRCV_STOPPED use function ShutdownWalRcv().
>> 4.go to 2.
>> That's the infinite cycle.
>
> And now i do a patch in mail attachment to solve the bug.
> I test it on pg10.5 and work well.

So the take of the patch is to make sure that the WAL receiver has
been correctly started if the startup process requested one before
trying to replay more WAL records.  Now it seems to me that you do not
close all the gaps here.  The WAL receiver could finish starting, but
still not have the time to stream up to the point where the startup
process could consume things safely.  The patch makes somewhat sure
that the gap is closed by applying a timeout of 1s, still in
environments with more latency it would make the failure still
happen.  Wouldn't it be better instead to make sure that the startup
process does not try to use once again the same record where the
failure happened before the WAL receiver confirms that it has flushed
something newer than the LSN position which was known as bad?

One problem may be that upon restart we reset receivedUpto, and that
visibly we may want to be more careful with its handling to avoid
successive failures so as the startup process does not bump
repeatitively into the same bogus record.  Looking at the code, the
computation looks quite safe though (I need to double-check to be 100%
sure), so isn't the origin of the problem just that the WAL segment
from the primary is corrupted, and that what gets streamed is broken
at its origin point, preventing the standby to move on?  You may want
to double-check that as well..

As a side note, the patch is incorrect for a couple of things though:
1) Formatting of comments and indentation.
2) Please be careful of whitespaces.
3) Usually no more than 72-80 characters per line.
4) Why a wait phase of 1s?
5) And much worse, the WAL receiver status cannot be looked at without
taking its mutex lock first, so this makes the logic look at something
which may be concurrently updated.  Oops.
6) Still the patch is incorrect per the reasons above ;)
--
Michael

Attachment

pgsql-bugs by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Is temporary functions feature official/supported? Found someissues with it.
Next
From: Michael Paquier
Date:
Subject: Re: Is temporary functions feature official/supported? Found someissues with it.