Thread: Should walsender check correctness of WAL records?
Hi hackers,

While investigating a customer's support case, I found that walsender does not calculate the CRC of WAL records and sends them to replicas without any checks. As a result, a damaged WAL record causes errors on all replicas:

LOG: incorrect resource manager data checksum in record at 5FB9/D199F7D8
FATAL: terminating walreceiver process due to administrator command

I wonder whether it would be better to detect this problem earlier, on the master? We could try to recover the damaged WAL record (it is not always possible, but...), or at least not advance replication slots, so that the DBA can restore the corrupted WAL segment from the archive and resume replication. Right now the only choice is to restore the replicas from a base backup, which may take a significant amount of time for a large database. During this time the master is not protected from failures.

Or is the extra overhead of computing the CRC in walsender assumed to be too high?

Sorry if this question was already discussed; I failed to find it in the archive.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Investigating one of customer's support cases I found out that walsender
> is not calculating WAL records CRC and send them to replicas without any
> checks.
> As a result damaged WAL record causes errors on all replicas:

IIUC, walsender tries hard to send WAL as fast as possible to reduce replication lag and transaction response time, so it doesn't try to peek at each WAL record. I think that's good.

In any case, WAL can get corrupted during transmission, and while being written and read on the standby. So the standby needs to check the WAL record CRC.

Regards
Takayuki Tsunakawa
On Fri, Oct 02, 2020 at 12:16:25AM +0000, tsunakawa.takay@fujitsu.com wrote:
> IIUC, walsender tries hard to send WAL as fast as possible to reduce
> replication lag and transaction response time, so it doesn't try to
> peek each WAL record. I think it's good.

CRC calculation is unlikely to be the bottleneck here, no? I would assume that the extra lseek() calls needed to locate the record data would be more harmful.

> In any case, the WAL can get corrupt during transmission, and
> writing and reading on the standby. So, the standby needs to check
> the WAL record CRC.

Yep. However, I would worry much more about the case of cold archives. In my experience, a WAL segment is more likely to get corrupted because it sat on a disk that got corrupted. Transmission is a one-time, short operation. Cold archives could stay on disk for weeks before being reused in WAL replay.
--
Michael
From: Michael Paquier <michael@paquier.xyz>
> CRC calculation would unlikely be the bottleneck here, no? I would assume
> that the extra lseek() calls needed to look after the record data to be more
> harmful.

Maybe, although I'm not sure lseek() is necessary. I simply thought walsender was designed to just read and send WAL, without caring about anything else, for maximal speed.

> Yep. However, I would worry much more about the case of cold archives. In
> my experience, there are higher risks to get a WAL segment corrupted because
> it was on disk and that this disk got corrupted. Transmission is a one-time
> short operation. Cold archives could stay on disk for weeks before getting
> reused in WAL replay.

Yes, I think cold archives should be checked regularly. pg_verifybackup and pg_waldump can be used for that, can't they?

Regards
Takayuki Tsunakawa
On 02.10.2020 3:28, Michael Paquier wrote:
> On Fri, Oct 02, 2020 at 12:16:25AM +0000, tsunakawa.takay@fujitsu.com wrote:
>> IIUC, walsender tries hard to send WAL as fast as possible to reduce
>> replication lag and transaction response time, so it doesn't try to
>> peek each WAL record. I think it's good.
> CRC calculation would unlikely be the bottleneck here, no? I would
> assume that the extra lseek() calls needed to look after the record
> data to be more harmful.

Why would we need to perform any lseeks? wal-sender and wal-receiver deal just with raw sequences of bytes. They do not try to split the input stream into WAL records. If we have to process the input data using the WAL reader, then I am afraid that by itself will add quite noticeable overhead. Using the standard WAL reader seems very inefficient in this case, because it unpacks WAL records. We do not need that: the only required thing is to extract the WAL record length from the header and calculate the CRC. The main difficulty is that a WAL record can span several pages, so we need to accumulate the checksum somewhere and seek backward to the beginning of the record once we find a CRC mismatch.

>> In any case, the WAL can get corrupt during transmission, and
>> writing and reading on the standby. So, the standby needs to check
>> the WAL record CRC.
> Yep. However, I would worry much more about the case of cold
> archives. In my experience, there are higher risks to get a WAL
> segment corrupted because it was on disk and that this disk got
> corrupted. Transmission is a one-time short operation. Cold archives
> could stay on disk for weeks before getting reused in WAL replay.
> --
> Michael

So right now neither wal-sender nor wal-receiver checks the CRC. We check records only when applying them. But that seems to be too late for correct recovery.

Since wal-sender advances the replication slot position according to the flush position on the replica, by the time we detect a corrupted record the restart LSN may already have been set past this record. Even if we perform WAL archiving and, fortunately, the archive contains a correct (not corrupted) copy of the WAL segment, we will have to copy this segment not only to the master but also to all replicas. Is that acceptable? So I am not sure whether earlier detection of a CRC mismatch can help us recover from this error. And isn't the price for it too high?

I wonder what other actions we can perform on the master or on the replica to handle this situation? For example, if we detect record corruption in wal-sender and the corrupted record contains an FPW, we can try to replace the buffer image in the record with the current page image. But that is only possible if the page has not changed since this WAL record was created.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
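The lightweight check discussed in this message (read only the record length from the header, then accumulate the CRC while the record arrives in arbitrary chunks, remembering where the record started) could be sketched roughly as follows. This is only an illustration under stated assumptions: the 8-byte header (little-endian total length followed by a CRC-32C of the payload) is a hypothetical, simplified layout and not PostgreSQL's actual XLogRecord format, page headers are ignored, and `StreamChecker`, `checker_init` and `checker_feed` are made-up names, not existing PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header: uint32 total length (LE), uint32 CRC-32C of payload (LE). */
#define HDR_SIZE 8

typedef struct
{
    unsigned char hdr[HDR_SIZE]; /* header bytes collected so far */
    size_t   hdr_have;           /* number of valid bytes in hdr */
    uint32_t remaining;          /* payload bytes still expected */
    uint32_t crc;                /* running CRC of the current record */
    uint64_t pos;                /* stream offset of the next byte */
    uint64_t rec_start;          /* stream offset where current record began */
} StreamChecker;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

void checker_init(StreamChecker *c)
{
    memset(c, 0, sizeof(*c));
}

/*
 * Feed the next chunk of the raw byte stream. Returns 0 if every record
 * completed so far is valid, or -1 on a bad length or CRC mismatch, in
 * which case *bad_offset is the start of the offending record.
 */
int checker_feed(StreamChecker *c, const unsigned char *buf, size_t len,
                 uint64_t *bad_offset)
{
    size_t i = 0;

    assert(c != NULL && bad_offset != NULL);
    while (i < len)
    {
        size_t avail = len - i;

        if (c->hdr_have < HDR_SIZE)
        {
            size_t n = HDR_SIZE - c->hdr_have;

            if (c->hdr_have == 0)
                c->rec_start = c->pos;      /* remember where the record begins */
            if (n > avail)
                n = avail;
            memcpy(c->hdr + c->hdr_have, buf + i, n);
            c->hdr_have += n; i += n; c->pos += n;
            if (c->hdr_have < HDR_SIZE)
                break;                      /* header continues in next chunk */
            if (read_le32(c->hdr) < HDR_SIZE)
            {
                *bad_offset = c->rec_start; /* implausible record length */
                return -1;
            }
            c->remaining = read_le32(c->hdr) - HDR_SIZE;
            c->crc = 0xFFFFFFFFu;
        }
        else
        {
            size_t n = c->remaining < avail ? c->remaining : avail;

            c->crc = crc32c_update(c->crc, buf + i, n);
            c->remaining -= (uint32_t) n; i += n; c->pos += n;
        }

        if (c->hdr_have == HDR_SIZE && c->remaining == 0)
        {
            if ((c->crc ^ 0xFFFFFFFFu) != read_le32(c->hdr + 4))
            {
                *bad_offset = c->rec_start;
                return -1;
            }
            c->hdr_have = 0;                /* ready for the next record */
        }
    }
    return 0;
}
```

A sender-side caller would run each chunk through `checker_feed` before sending it and, on a -1 result, stop at `bad_offset` rather than letting the stream (and hence the slot position) move past the damaged record. Whether this overhead is acceptable is exactly the open question in the thread.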