Thread: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From

"lichuancheng@highgo.com"

Date:

03 January 2019, 12:50:48

hello

I have analysis of the causes of bug when report it.

>[Code review]
>I read the code about startup process and walreciver process.
>When a damaged wal record arrived:
>1.The startup process shutdown the walreciver process use function ShutdownWalRcv().
>2.The startup process start the walreciver process use funtion RequestXLogStreaming() by signal PMSIGNAL_START_WALRECEIVER.
>3.The startup process try to read wal record.
> And now reciver process does not startup completely,so the startup process read the damaged wal record another times.
> And startup process set walrcv->walRcvState = WALRCV_STOPPED use function ShutdownWalRcv().
>4.go to 2.
>That's the infinite cycle.

And now i do a patch in mail attachment to solve the bug.

I test it on pg10.5 and work well.

-----

movead

Attachment

bug15567_lichuancheng_20190103.patch

Re: BUG #15567: Wal receiver process restart failed when a damagedwal record arrived at standby.

From

Michael Paquier

Date:

09 January 2019, 10:14:11

On Thu, Jan 03, 2019 at 05:50:48PM +0800, lichuancheng@highgo.com wrote:
>> I read the code about startup process and walreciver process.
>> When a damaged wal record arrived:
>> 1.The startup process shutdown the walreciver process use function ShutdownWalRcv().
>> 2.The startup process start the walreciver process use funtion RequestXLogStreaming() by signal
PMSIGNAL_START_WALRECEIVER.
>> 3.The startup process try to read wal record.
>>  And now reciver process does not startup completely,so the startup process read the damaged wal record another
times.
>>  And startup process set walrcv->walRcvState = WALRCV_STOPPED use function ShutdownWalRcv().
>> 4.go to 2.
>> That's the infinite cycle.
>
> And now i do a patch in mail attachment to solve the bug.
> I test it on pg10.5 and work well.

So the take of the patch is to make sure that the WAL receiver has
been correctly started if the startup process requested one before
trying to replay more WAL records.  Now it seems to me that you do not
close all the gaps here.  The WAL receiver could finish starting, but
still not have the time to stream up to the point where the startup
process could consume things safely.  The patch makes somewhat sure
that the gap is closed by applying a timeout of 1s, still in
environments with more latency it would make the failure still
happen.  Wouldn't it be better instead to make sure that the startup
process does not try to use once again the same record where the
failure happened before the WAL receiver confirms that it has flushed
something newer than the LSN position which was known as bad?

One problem may be that upon restart we reset receivedUpto, and that
visibly we may want to be more careful with its handling to avoid
successive failures so as the startup process does not bump
repeatitively into the same bogus record.  Looking at the code, the
computation looks quite safe though (I need to double-check to be 100%
sure), so isn't the origin of the problem just that the WAL segment
from the primary is corrupted, and that what gets streamed is broken
at its origin point, preventing the standby to move on?  You may want
to double-check that as well..

As a side note, the patch is incorrect for a couple of things though:
1) Formatting of comments and indentation.
2) Please be careful of whitespaces.
3) Usually no more than 72-80 characters per line.
4) Why a wait phase of 1s?
5) And much worse, the WAL receiver status cannot be looked at without
taking its mutex lock first, so this makes the logic look at something
which may be concurrently updated.  Oops.
6) Still the patch is incorrect per the reasons above ;)
--
Michael

Attachment

signature.asc

Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From

"lichuancheng@highgo.com"

Date:

09 January 2019, 11:28:51

>So the take of the patch is to make sure that the WAL receiver has
>been correctly started if the startup process requested one before
>trying to replay more WAL records. Now it seems to me that you do not
>close all the gaps here. The WAL receiver could finish starting, but
>still not have the time to stream up to the point where the startup
>process could consume things safely. The patch makes somewhat sure
>that the gap is closed by applying a timeout of 1s, still in
>environments with more latency it would make the failure still
>happen. Wouldn't it be better instead to make sure that the startup
>process does not try to use once again the same record where the
>failure happened before the WAL receiver confirms that it has flushed
>something newer than the LSN position which was known as bad?

>One problem may be that upon restart we reset receivedUpto, and that
>visibly we may want to be more careful with its handling to avoid
>successive failures so as the startup process does not bump
>repeatitively into the same bogus record. Looking at the code, the
>computation looks quite safe though (I need to double-check to be 100%
>sure), so isn't the origin of the problem just that the WAL segment
>from the primary is corrupted, and that what gets streamed is broken
>at its origin point, preventing the standby to move on? You may want
>to double-check that as well..

I known what you describe.
Nomally,when the reciver process restart,the code reset redo LSN to the start of the
wal segment. But now the question is that the reciver process can not restart
sucessful,so correct wal record nerver arrived. so the startup process is in a
infinite cycle to start and end the reciver process.
So the way to solve the problem is give some time to reciver process to
restart. As long as it restart sucessful,it will apply wal record at the start of the
wal segment from master and the correct wal record will arrived. so i do not
think what you say 'still in environments with more latency it would make the
failure still happen' is right.So i think the patch can solve the problem after
modify as side note blow.

I have make sure that the the broken wal record is broked during stream,
it's well in master and other slave.

>As a side note, the patch is incorrect for a couple of things though:
>1) Formatting of comments and indentation.
>2) Please be careful of whitespaces.
>3) Usually no more than 72-80 characters per line.
>4) Why a wait phase of 1s?
I think 1s is enought for reciver process to restart. Maybe it need a discuss.

>5) And much worse, the WAL receiver status cannot be looked at without
>taking its mutex lock first, so this makes the logic look at something
>which may be concurrently updated. Oops.
I am sorry.

>6) Still the patch is incorrect per the reasons above ;)
I am new here, thanks for your explanation in side note.

-----
movead