Thread: Should walsender check correctness of WAL records?

Should walsender check correctness of WAL records?

From
Konstantin Knizhnik
Date:
Hi hackers,

While investigating one of our customers' support cases, I found that
walsender does not compute the CRC of WAL records and sends them to
replicas without any checks.
As a result, a damaged WAL record causes errors on all replicas:

         LOG: incorrect resource manager data checksum in record at 5FB9/D199F7D8
         FATAL: terminating walreceiver process due to administrator command

I wonder whether it would be better to detect this problem earlier, on
the master.
We could try to recover the damaged WAL record (it is not always
possible, but...), or at least not advance replication slots, so that
the DBA can restore the corrupted WAL segment from the archive and
resume replication.

Right now the only choice is to rebuild the replicas from a base backup,
which may take a significant amount of time for a large database.
During that time the master is not protected from failures.

Or is the extra overhead of computing the CRC in walsender assumed to be
too high?

Sorry if this question has already been discussed; I failed to find it
in the archives.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




RE: Should walsender check correctness of WAL records?

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Investigating one of customer's support cases I found out that walsender
> is not calculating WAL records CRC and send them to replicas without any
> checks.
> As a result damaged WAL record causes errors on all replicas:

IIUC, walsender tries hard to send WAL as fast as possible to reduce
replication lag and transaction response time, so it doesn't try to
peek at each WAL record.  I think that's good.

In any case, WAL can get corrupted during transmission, and while being
written and read on the standby.  So, the standby needs to check the
WAL record CRC.


Regards
Takayuki Tsunakawa




Re: Should walsender check correctness of WAL records?

From
Michael Paquier
Date:
On Fri, Oct 02, 2020 at 12:16:25AM +0000, tsunakawa.takay@fujitsu.com wrote:
> IIUC, walsender tries hard to send WAL as fast as possible to reduce
> replication lag and transaction response time, so it doesn't try to
> peek each WAL record.  I think it's good.

CRC calculation is unlikely to be the bottleneck here, no?  I would
assume that the extra lseek() calls needed to look at the record
data would be more harmful.

> In any case, the WAL can get corrupt during transmission, and
> writing and reading on the standby.  So, the standby needs to check
> the WAL record CRC.

Yep.  However, I would worry much more about the case of cold
archives.  In my experience, a WAL segment is more likely to get
corrupted because it sat on a disk that itself got corrupted.
Transmission is a one-time, short operation.  Cold archives
could stay on disk for weeks before getting reused in WAL replay.
--
Michael


RE: Should walsender check correctness of WAL records?

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Michael Paquier <michael@paquier.xyz>
> CRC calculation would unlikely be the bottleneck here, no?  I would assume
> that the extra lseek() calls needed to look after the record data to be more
> harmful.

Maybe, although I'm not sure lseek() is necessary.  I simply thought
walsender was designed to just read and send WAL without caring about
other things, for maximal speed.


> Yep.  However, I would worry much more about the case of cold archives.  In
> my experience, there are higher risks to get a WAL segment corrupted because
> it was on disk and that this disk got corrupted.  Transmission is a one-time
> short operation. Cold archives could stay on disk for weeks before getting
> reused in WAL replay.

Yes, I think cold archives should be checked regularly.  pg_verifybackup and pg_waldump can be used for it, can't they?


Regards
Takayuki Tsunakawa






Re: Should walsender check correctness of WAL records?

From
Konstantin Knizhnik
Date:

On 02.10.2020 3:28, Michael Paquier wrote:
> On Fri, Oct 02, 2020 at 12:16:25AM +0000, tsunakawa.takay@fujitsu.com wrote:
>> IIUC, walsender tries hard to send WAL as fast as possible to reduce
>> replication lag and transaction response time, so it doesn't try to
>> peek each WAL record.  I think it's good.
> CRC calculation would unlikely be the bottleneck here, no?  I would
> assume that the extra lseek() calls needed to look after the record
> data to be more harmful.
When would we need to perform any lseeks?
wal-sender and wal-receiver deal just with raw sequences of bytes.
They do not try to split the input stream into WAL records.
If we have to process the input data using the WAL reader, then I am
afraid it will itself add quite noticeable overhead.
Using the standard WAL reader seems very inefficient in this case,
because it unpacks the WAL records.
We do not need that: the only thing required is to extract the WAL
record length from the header and calculate the CRC.
The main difficulty is that a WAL record can span several pages, so we
need to accumulate the checksum somewhere
and seek backward to the beginning of the record once we find a CRC
mismatch.
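
To make the "length from the header plus accumulated checksum" idea
concrete, here is a minimal Python sketch. It is not PostgreSQL code and
does not use the real XLogRecord or page-header layout: the 8-byte
header (total length, CRC of the payload), `make_record`,
`StreamChecker` and `first_corrupt_offset` are all hypothetical names
invented for illustration. It only shows that a checker can consume
arbitrarily split chunks, carry the running CRC-32C across chunk
boundaries, and report the start offset of the first damaged record
without unpacking record contents.

```python
import struct

def _make_crc32c_table():
    # Reflected CRC-32C (Castagnoli) polynomial.
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_crc32c_table()

def crc32c_update(crc, data):
    """Feed bytes into a running CRC state (before the final XOR)."""
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc

def crc32c(data):
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF

# ASSUMPTION: simplified framing, NOT the real XLogRecord layout.
# Header = total record length (uint32) + CRC-32C of the payload (uint32).
HEADER = struct.Struct("<II")

def make_record(payload):
    return HEADER.pack(HEADER.size + len(payload), crc32c(payload)) + payload

class StreamChecker:
    """Verifies record CRCs over a byte stream delivered in arbitrary chunks."""

    def __init__(self):
        self.offset = 0          # absolute offset of the next unread byte
        self.record_start = 0    # offset where the current record began
        self.header = b""        # header bytes collected so far
        self.payload_left = 0    # payload bytes still expected
        self.crc = 0xFFFFFFFF    # running CRC state for the current record
        self.expected_crc = 0

    def feed(self, chunk):
        """Return the start offset of the first bad record, or None."""
        i, n = 0, len(chunk)
        while True:
            if len(self.header) < HEADER.size:
                if i == n:
                    return None
                if not self.header:
                    self.record_start = self.offset
                take = min(HEADER.size - len(self.header), n - i)
                self.header += chunk[i:i + take]
                i += take
                self.offset += take
                if len(self.header) < HEADER.size:
                    return None  # the header itself is split across chunks
                total_len, self.expected_crc = HEADER.unpack(self.header)
                if total_len < HEADER.size:
                    return self.record_start  # implausible length field
                self.payload_left = total_len - HEADER.size
                self.crc = 0xFFFFFFFF
            take = min(self.payload_left, n - i)
            self.crc = crc32c_update(self.crc, chunk[i:i + take])
            i += take
            self.offset += take
            self.payload_left -= take
            if self.payload_left > 0:
                return None  # chunk exhausted in the middle of a record
            if (self.crc ^ 0xFFFFFFFF) != self.expected_crc:
                return self.record_start
            self.header = b""  # record verified, expect the next header

def first_corrupt_offset(stream, chunk_size):
    checker = StreamChecker()
    for start in range(0, len(stream), chunk_size):
        bad = checker.feed(stream[start:start + chunk_size])
        if bad is not None:
            return bad
    return None
```

The checker keeps only a few integers of state per stream, which is the
"accumulate the checksum somewhere" part; `record_start` is what a
sender would need in order to go back to the beginning of the damaged
record. Whether the real cost in walsender would be comparably small is
exactly the open question here.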


>> In any case, the WAL can get corrupt during transmission, and
>> writing and reading on the standby.  So, the standby needs to check
>> the WAL record CRC.
> Yep.  However, I would worry much more about the case of cold
> archives.  In my experience, there are higher risks to get a WAL
> segment corrupted because it was on disk and that this disk got
> corrupted.  Transmission is a one-time short operation. Cold archives
> could stay on disk for weeks before getting reused in WAL replay.
> --
> Michael

So right now neither wal-sender nor wal-receiver checks the CRC.
We check records only when applying them.
But that seems to be too late for correct recovery.

Since wal-sender adjusts the replication slot position according to the
flush position on the replica,
by the time we detect a corrupted record the restart LSN may already
have been set past this record.
Even if we perform WAL archiving and, fortunately, the archive contains
a correct (not corrupted) copy of the WAL segment,
we will have to copy this WAL segment not only to the master but also to
all replicas.
Is that acceptable?
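
As an illustration of the "do not advance the slot past a damaged
record" idea, here is a toy Python model. It is only a sketch under
stated assumptions: `SlotModel`, `report_corruption` and
`clear_corruption` are hypothetical names, and nothing here reflects how
replication slots are actually implemented. The only real convention
used is the textual LSN format, two hexadecimal numbers separated by a
slash (high 32 bits / low 32 bits), as in the log message
5FB9/D199F7D8.

```python
from dataclasses import dataclass
from typing import Optional

def parse_lsn(text):
    """Convert 'HI/LO' hexadecimal LSN text into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def format_lsn(lsn):
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"

@dataclass
class SlotModel:
    """Hypothetical model of a slot that refuses to pass a known-bad record."""
    restart_lsn: int = 0
    first_bad_lsn: Optional[int] = None  # set by an assumed sender-side check

    def report_corruption(self, record_start_lsn):
        # Remember the earliest damaged record seen so far.
        if self.first_bad_lsn is None or record_start_lsn < self.first_bad_lsn:
            self.first_bad_lsn = record_start_lsn

    def clear_corruption(self):
        # Called once the segment has been restored, e.g. from the archive.
        self.first_bad_lsn = None

    def advance(self, replica_flush_lsn):
        # Follow the replica's flush position, but never move past the
        # start of a record that is known to be damaged.
        target = replica_flush_lsn
        if self.first_bad_lsn is not None:
            target = min(target, self.first_bad_lsn)
        # The position never moves backward.
        self.restart_lsn = max(self.restart_lsn, target)
        return self.restart_lsn
```

Note that the clamp only helps if the damage is reported before the
position has moved past the record: once `restart_lsn` is already beyond
it, this model cannot bring it back, which is the situation described
above.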


So I am not sure whether earlier detection of a CRC mismatch can help us
recover from this error.
And isn't the price for it too high?

I wonder what other actions we could perform on the master or on the
replica to handle this situation.
For example, if we detect record corruption in wal-sender and the
corrupted record contains an FPW (full-page write),
we could try to replace the image of the buffer in the record with the
current page image.
But that is only possible if the page has not been changed since this
WAL record was created.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company