Online checksums verification in the backend - Mailing list pgsql-hackers
From | Julien Rouhaud
---|---
Subject | Online checksums verification in the backend
Date |
Msg-id | CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
Responses | Re: Online checksums verification in the backend
| Re: Online checksums verification in the backend
List | pgsql-hackers
Hi,

This topic has been discussed several times, with the most recent discussions found at [1] and [2]. Based on those discussions, my understanding is that the current approach in BASE_BACKUP has too many drawbacks and we should instead do this check in the backend. I've been working on such an approach at VMware, and I'm submitting it here to discuss the approach and rationale, and hopefully have such a feature integrated.

First, this was originally developed as an extension, which means that the check is performed using an SRF. That's maybe not the best approach, as a transaction has to be kept open for the total processing time. This can be mitigated by checking each relation independently, but that's still not ideal. Maybe using some utility command (as part of VACUUM or a new CHECK command, for instance) would be a better approach.

This brings the second consideration: how to report the list of corrupted blocks to end users. As I said, this is for now returned via the SRF, but that is clearly not ideal and should rather be made available more globally. One use of this information could be block-level recovery. I'm Cc-ing Sawada-san, as I know he's working on this and mentioned to me that he had ideas about passing the list of corrupted blocks through the stats collector.

Finally, the read and locking considerations. I tried to cover these extensively in the comments, but here are some details on how I tried to make the check safe while keeping the overhead as low as possible.

The first thing is that this only does buffered reads, without any attempt to discard the OS cache. Therefore, any discrepancy between the OS cache and the disk cannot be detected unless you take other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held, meaning that it can't get deleted or truncated. The total number of blocks for the given fork is retrieved first, so any new block will be ignored. Such new blocks are considered out of scope, as they are written after the start of the check.

Each time a buffer is checked, the target buffer mapping partition lock is acquired in shared mode to prevent concurrent eviction. If the buffer is found in shared buffers, it's pinned and released immediately, just to get its state. If the buffer is found dirty, no check is performed, as it'll be written to disk by the checkpointer, or during recovery in case of an unclean shutdown. Otherwise, an IO lock is held while the buffer is read into a private buffer. The IO lock and buffer mapping lock are then released, and the check is performed.

If the buffer is not found in shared buffers, the buffer mapping partition lock is released immediately and the block is read from disk. It's therefore possible to get a false positive here, as the block could be concurrently read, modified and partially written to disk. So if an error is detected in this case, the check is restarted from scratch, and if the buffer is still not found in shared buffers, the read is done while still holding the buffer mapping partition lock, to make sure that it can't get concurrently loaded and modified. This is an optimistic approach to avoid performance overhead, assuming that there shouldn't be many positives and that the false-positive window is very narrow.

The check consists of simply comparing the stored and computed checksums, with an additional check that the page is really new (using PageIsVerified) if it's found to be PageIsNew().
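For illustration only, here is a rough, simplified sketch of that per-block flow, assuming the patch's internals follow the description above; check_one_block(), check_one_page(), buffer_is_dirty(), read_block_while_io_locked() and read_block_from_disk() are placeholder names, not functions from the attached patch:

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/lwlock.h"
#include "utils/rel.h"

/* Placeholder helpers, not from the attached patch: */
static bool buffer_is_dirty(int buf_id);
static void read_block_while_io_locked(Relation rel, ForkNumber forknum,
                                       BlockNumber blkno, char *pagebuf);
static void read_block_from_disk(Relation rel, ForkNumber forknum,
                                 BlockNumber blkno, char *pagebuf);

/* The check itself: stored vs computed checksum, plus the new-page case. */
static bool
check_one_page(Page page, BlockNumber blkno)
{
    if (PageIsNew(page))
        return PageIsVerified(page, blkno);    /* really all-zeroes? */

    return pg_checksum_page((char *) page, blkno) ==
        ((PageHeader) page)->pd_checksum;
}

/*
 * Verify one block of the given fork against a private copy, while making
 * sure it can't be evicted or torn under us.  The retry after a failure on
 * a block that was not in shared buffers is left to the caller.
 */
static bool
check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                char *pagebuf)        /* backend-local buffer, BLCKSZ bytes */
{
    BufferTag   tag;
    uint32      hash;
    LWLock     *partLock;
    int         buf_id;

    INIT_BUFFERTAG(tag, relation->rd_node, forknum, blkno);
    hash = BufTableHashCode(&tag);
    partLock = BufMappingPartitionLock(hash);

    /* Prevent concurrent eviction while we inspect the buffer mapping. */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);

    if (buf_id >= 0)
    {
        /*
         * Found in shared buffers: pin it just long enough to look at its
         * state.  A dirty buffer is skipped, since the checkpointer (or
         * recovery) will rewrite it anyway.
         */
        if (buffer_is_dirty(buf_id))
        {
            LWLockRelease(partLock);
            return true;        /* nothing to verify */
        }

        /*
         * Clean buffer: read the on-disk version into the private buffer
         * while holding the IO lock, so a concurrent write can't tear it.
         */
        read_block_while_io_locked(relation, forknum, blkno, pagebuf);
        LWLockRelease(partLock);
    }
    else
    {
        /*
         * Not in shared buffers: release the mapping lock and read from
         * disk without further locking.  This can yield a false positive
         * if the block is concurrently loaded, modified and partially
         * written back; the caller then retries, reading while still
         * holding the mapping lock.
         */
        LWLockRelease(partLock);
        read_block_from_disk(relation, forknum, blkno, pagebuf);
    }

    /* All locks are released before the actual verification. */
    return check_one_page((Page) pagebuf, blkno);
}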
Since this is done after releasing all locks, we could definitely add more checks without causing much overhead, like pd_lower/pd_upper sanity. I prefer to keep the check simple for now and rather focus on the general approach.

Finally, I also reused the vacuum costing GUCs (for simplicity) and approach to add some throttling.

I'm attaching a patch that adds a new pg_check_relation() SQL function to perform a check of one or all relations, and some simple regression tests.

[1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de
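As an illustration of the throttling part above, a minimal sketch of what reusing the vacuum cost GUCs could look like, assuming the patch does something along these lines (check_delay_point() is a made-up name, not taken from the patch):

#include "postgres.h"

#include "miscadmin.h"        /* VacuumCostActive, VacuumCostBalance, ... */

/*
 * Minimal throttling sketch reusing the existing vacuum cost GUCs.
 * check_delay_point() is a made-up name; the backend's vacuum_delay_point()
 * is more careful (interrupt handling, capping the sleep time).  Callers
 * would add VacuumCostPageHit / VacuumCostPageMiss to VacuumCostBalance for
 * each block they process.
 */
static void
check_delay_point(void)
{
    if (VacuumCostActive && VacuumCostBalance >= VacuumCostLimit)
    {
        /* VacuumCostDelay is in milliseconds, pg_usleep() takes microseconds */
        pg_usleep((long) (VacuumCostDelay * 1000));
        VacuumCostBalance = 0;
    }
}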