Thread: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From
"lichuancheng@highgo.com"
Date:
hello

I have analysis of the causes of bug when report it.

>[Code review]
>I read the code about startup process and walreciver process.
>When a damaged wal record arrived:
>1.The startup process shutdown the walreciver process use function ShutdownWalRcv().
>2.The startup process start the walreciver process use funtion RequestXLogStreaming() by signal PMSIGNAL_START_WALRECEIVER.
>3.The startup process try to read wal record. 
> And now reciver process does not startup completely,so the startup process read the damaged wal record another times.
> And startup process set walrcv->walRcvState = WALRCV_STOPPED use function ShutdownWalRcv().
>4.go to 2.
>That's the infinite cycle.


And now i do a patch in mail attachment to solve the bug.  
I test it on pg10.5 and work well.

-----
movead
Attachment
On Thu, Jan 03, 2019 at 05:50:48PM +0800, lichuancheng@highgo.com wrote:
>> I read the code about startup process and walreciver process.
>> When a damaged wal record arrived:
>> 1.The startup process shutdown the walreciver process use function ShutdownWalRcv().
>> 2.The startup process start the walreciver process use funtion RequestXLogStreaming() by signal
PMSIGNAL_START_WALRECEIVER.
>> 3.The startup process try to read wal record.
>>  And now reciver process does not startup completely,so the startup process read the damaged wal record another
times.
>>  And startup process set walrcv->walRcvState = WALRCV_STOPPED use function ShutdownWalRcv().
>> 4.go to 2.
>> That's the infinite cycle.
>
> And now i do a patch in mail attachment to solve the bug.
> I test it on pg10.5 and work well.

So the take of the patch is to make sure that the WAL receiver has
been correctly started if the startup process requested one before
trying to replay more WAL records.  Now it seems to me that you do not
close all the gaps here.  The WAL receiver could finish starting, but
still not have the time to stream up to the point where the startup
process could consume things safely.  The patch makes somewhat sure
that the gap is closed by applying a timeout of 1s, still in
environments with more latency it would make the failure still
happen.  Wouldn't it be better instead to make sure that the startup
process does not try to use once again the same record where the
failure happened before the WAL receiver confirms that it has flushed
something newer than the LSN position which was known as bad?

One problem may be that upon restart we reset receivedUpto, and that
visibly we may want to be more careful with its handling to avoid
successive failures so as the startup process does not bump
repeatitively into the same bogus record.  Looking at the code, the
computation looks quite safe though (I need to double-check to be 100%
sure), so isn't the origin of the problem just that the WAL segment
from the primary is corrupted, and that what gets streamed is broken
at its origin point, preventing the standby to move on?  You may want
to double-check that as well..

As a side note, the patch is incorrect for a couple of things though:
1) Formatting of comments and indentation.
2) Please be careful of whitespaces.
3) Usually no more than 72-80 characters per line.
4) Why a wait phase of 1s?
5) And much worse, the WAL receiver status cannot be looked at without
taking its mutex lock first, so this makes the logic look at something
which may be concurrently updated.  Oops.
6) Still the patch is incorrect per the reasons above ;)
--
Michael

Attachment

Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From
"lichuancheng@highgo.com"
Date:

>So the take of the patch is to make sure that the WAL receiver has
>been correctly started if the startup process requested one before
>trying to replay more WAL records.  Now it seems to me that you do not
>close all the gaps here.  The WAL receiver could finish starting, but
>still not have the time to stream up to the point where the startup
>process could consume things safely.  The patch makes somewhat sure
>that the gap is closed by applying a timeout of 1s, still in
>environments with more latency it would make the failure still
>happen.  Wouldn't it be better instead to make sure that the startup
>process does not try to use once again the same record where the
>failure happened before the WAL receiver confirms that it has flushed
>something newer than the LSN position which was known as bad?
 
>One problem may be that upon restart we reset receivedUpto, and that
>visibly we may want to be more careful with its handling to avoid
>successive failures so as the startup process does not bump
>repeatitively into the same bogus record.  Looking at the code, the
>computation looks quite safe though (I need to double-check to be 100%
>sure), so isn't the origin of the problem just that the WAL segment
>from the primary is corrupted, and that what gets streamed is broken
>at its origin point, preventing the standby to move on?  You may want
>to double-check that as well..

I known what you describe.
Nomally,when the reciver process restart,the code reset redo LSN to the start of the 
wal segment. But now the question is that the reciver process can not restart
sucessful,so correct wal record nerver arrived. so the startup process is in a
infinite cycle to start and end the reciver process. 
So the way to solve the problem  is give some time to reciver process to 
restart. As long as it restart sucessful,it will apply wal record at the start of the 
wal segment from master and the correct wal record will arrived. so i do not 
think what you say 'still in environments with more latency it would make the 
failure still happen' is right.So i think the patch can solve the problem after
 modify as  side note blow.

I have make sure that the the broken wal record is broked during stream,
it's well in master and other slave.

 
>As a side note, the patch is incorrect for a couple of things though:
>1) Formatting of comments and indentation.
>2) Please be careful of whitespaces.
>3) Usually no more than 72-80 characters per line.
>4) Why a wait phase of 1s?
I think 1s is enought for reciver process to restart. Maybe it need a discuss.

>5) And much worse, the WAL receiver status cannot be looked at without
>taking its mutex lock first, so this makes the logic look at something
>which may be concurrently updated.  Oops.
I am sorry.

>6) Still the patch is incorrect per the reasons above ;)
I am new here, thanks for your explanation in side note.

-----
movead


Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From
"lichuancheng@highgo.com"
Date:

Thanks Michael Paquier for your explanation.
I have modified the patch  in mail attachment,wish your reply.


-----
movead

Attachment
hello

 A change to prevent dead-cycle by function 'HandleStartupProcInterrupts()'

-----
movead

Attachment