Re: BUG #15567: Wal receiver process restart failed when a damagedwal record arrived at standby. - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #15567: Wal receiver process restart failed when a damagedwal record arrived at standby.
Date
Msg-id CAK-MWwTYxLo9M5tzGToZeNj6YLKCcBhNETnGTbLexfj53RsW-w@mail.gmail.com
Whole thread Raw
In response to BUG #15567: Wal receiver process restart failed when a damaged walrecord arrived at standby.  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs


On Fri, Dec 28, 2018 at 8:10 PM PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      15567
Logged by:          movead
Email address:      lichuancheng@highgo.com
PostgreSQL version: 10.5
Operating system:   centos7
Description:       

[Problem description]
We use pg9.5.7 with replica and i find that if some damaged wal record
arrived at standby then the startup process on standby try to restart
walreceiver process but never sucess and log below Infinite cycle on
standby.If i try to restart database now,it sucess.
(I try it on pg10.5,the same result.)
-----------------------------------------------------------------------------------------------------------------
2018-12-15 10:38:00.302 CST,,,30928,,5c1422d5.78d0,2,,2018-12-15 05:38:29
CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator
command",,,,,,,,,""
2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,8,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,9,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,10,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:05.308 CST,,,5546,,5c1276c1.15aa,11,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:10.314 CST,,,5546,,5c1276c1.15aa,12,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:15.319 CST,,,5546,,5c1276c1.15aa,13,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:20.325 CST,,,5546,,5c1276c1.15aa,14,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:25.331 CST,,,5546,,5c1276c1.15aa,15,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:30.336 CST,,,5546,,5c1276c1.15aa,16,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:35.342 CST,,,5546,,5c1276c1.15aa,17,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:40.348 CST,,,5546,,5c1276c1.15aa,18,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:45.353 CST,,,5546,,5c1276c1.15aa,19,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:50.359 CST,,,5546,,5c1276c1.15aa,20,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:38:55.364 CST,,,5546,,5c1276c1.15aa,21,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
2018-12-15 10:39:00.370 CST,,,5546,,5c1276c1.15aa,22,,2018-12-13 23:12:01
CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
5A9/11E0A90",,,,,,,,,""
-----------------------------------------------------------------------------------------------------------------


[Code review]
I read the code about startup process and walreciver process.

When a damaged wal record arrived:
1.The startup process shutdown the walreciver process use function
ShutdownWalRcv().
2.The startup process start the walreciver process use funtion
RequestXLogStreaming() by signal PMSIGNAL_START_WALRECEIVER.
3.The startup process try to read wal record.
        And now reciver process does not startup completely,so the startup process
read the damaged wal record another times.
        And startup process set walrcv->walRcvState = WALRCV_STOPPED use function
ShutdownWalRcv().
4.go to 2.
That's the infinite cycle.

[Question]
I wonder that if it is a bug or you just don't want restart walreciver
process.


It look like the same issue as I reported in BUG #15151.
(unfortunately I didn't have time to explore it further then).

You have good description of what's going on under hood.


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #15568: pg_dump error
Next
From: "lichuancheng@highgo.com"
Date:
Subject: Re: Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.