Re: Online checksums verification in the backend - Mailing list pgsql-hackers
| From | Julien Rouhaud |
|---|---|
| Subject | Re: Online checksums verification in the backend |
| Date | |
| Msg-id | CAOBaU_ag_cgJMEw-d=UhaJT=8TW0UAgZREzAFz-tesmJsQCv+A@mail.gmail.com |
| In response to | Re: Online checksums verification in the backend (Masahiko Sawada <sawada.mshk@gmail.com>) |
| Responses | Re: Online checksums verification in the backend |
| List | pgsql-hackers |
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > This brings the second consideration: how to report the list of
> > corrupted blocks to end users. As I said, this is for now returned via
> > the SRF, but this is clearly not ideal and should rather be made
> > available more globally. One usage of this information could be block
> > level recovery. I'm Cc-ing Sawada-san, as I know he's working on this
> > and mentioned to me that he had ideas on passing the list of corrupted
> > blocks using the stats collector.
>
> Yes, the list of corrupted pages is necessary for single page recovery.
> Apart from single page recovery, I think it's also helpful for DBAs if
> they can find the corrupted blocks in the server logs and in a system
> view.
>
> I've also tried to report corrupted pages to the stats collector while
> researching single page recovery in PostgreSQL, but one problem is that
> the statistics in the stats collector are cleared by crash recovery. I
> want the information about block corruption to survive even when the
> server goes down.

Yes, having the list of corrupted blocks survive a crash-and-restart
cycle, and also be available after a clean shutdown, is definitely
important.

> And we might want to add checksums to the permanent file holding the
> information about database corruption. The correctness of this
> information is important, because we can fix a database by restoring
> some tables from a logical backup, or by doing a reindex etc., as long
> as we have non-broken information about the database corruption.

Agreed.

> > Finally, the read and locking considerations. I tried to cover that
> > extensively in the comments, but here are some details on how I tried
> > to make the check safe while keeping the overhead as low as possible.
> > The first thing is that this only does buffered reads, without any
> > attempt to discard the OS cache. Therefore, any discrepancy between
> > the OS cache and the disk cannot be detected unless you take other
> > actions, such as sync / drop_caches on GNU/Linux.
> >
> > An access share lock on the currently checked relation is held,
> > meaning that it can't get deleted/truncated. The total number of
> > blocks for the given fork is retrieved first, so any new block will be
> > ignored. Such new blocks are considered out of scope, as they are
> > written after the start of the check.
> >
> > Each time a buffer is being checked, the target buffer mapping
> > partition lock is acquired in shared mode, to prevent concurrent
> > eviction. If the buffer is found in shared buffers, it's pinned and
> > released immediately, just to get the state.
>
> I wonder if there is a possibility that blocks can be corrupted on disk
> even though they are loaded in shared buffers. ISTM the above method
> cannot detect such corruption. Reading and checking blocks fast is
> attractive, but I think it's also important to check blocks precisely,
> without overlooking anything.

It can definitely happen, and it's the usual doomsday scenario: the
database works fine for months, then postgres is restarted, say for a
minor version upgrade, and boom: the most popular blocks, which were
constantly used in read-only mode, were corrupted on disk but never
evicted from shared buffers, and you have a major outage. I have
unfortunately witnessed that too many times. This is especially bad, as
in this kind of scenario you typically discover the corruption once all
your backups only contain the corrupted blocks.
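For reference, the SRF-based reporting mentioned at the top of this message would conventionally be built on a tuplestore. Below is a minimal sketch of that pattern, assuming a hypothetical function name (pg_check_relation_sketch) and a column layout of one row per corrupted block; none of these names come from the actual patch.

```c
/*
 * Hypothetical sketch only: the function name and result columns are
 * illustrative, not the patch's actual definitions.  This is the
 * conventional tuplestore pattern for a set-returning C function.
 */
#include "postgres.h"

#include "common/relpath.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "utils/tuplestore.h"

PG_FUNCTION_INFO_V1(pg_check_relation_sketch);

Datum
pg_check_relation_sketch(PG_FUNCTION_ARGS)
{
	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
	TupleDesc	tupdesc;
	Tuplestorestate *tupstore;
	MemoryContext oldcontext;

	if (rsinfo == NULL || !(rsinfo->allowedModes & SFRM_Materialize))
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("materialize mode required, but it is not allowed in this context")));

	/* The row type comes from the function's SQL-level declaration. */
	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	/* Materialize results in the per-query memory context. */
	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
	tupstore = tuplestore_begin_heap(true, false, work_mem);
	rsinfo->returnMode = SFRM_Materialize;
	rsinfo->setResult = tupstore;
	rsinfo->setDesc = tupdesc;
	MemoryContextSwitchTo(oldcontext);

	/*
	 * For each corrupted block found by the check, emit one row:
	 * (forknum int, blkno bigint, expected int, found int).
	 * The values below are placeholders.
	 */
	{
		Datum		values[4];
		bool		nulls[4] = {false, false, false, false};

		values[0] = Int32GetDatum(MAIN_FORKNUM);
		values[1] = Int64GetDatum((int64) 42);	/* corrupted block number */
		values[2] = Int32GetDatum(0x1234);		/* expected checksum */
		values[3] = Int32GetDatum(0x5678);		/* checksum found on disk */
		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
	}

	return (Datum) 0;
}
```

At the SQL level, such a function would be declared as returning a set of a matching composite type, so the corrupted blocks could be inspected with a plain SELECT.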
Note that in the approach I'm suggesting, I do verify blocks that are loaded in shared buffers; I only ignore the dirty blocks, as they'll be written out by the checkpointer, or by the recovery process in case of an unclean shutdown. A buffer pin isn't necessary to avoid a torn page read: an IO lock also guarantees that, and causes less overhead. The included TAP test should also detect the corruption of a present-in-shared-buffers, non-dirty block. It could however be improved, e.g. by calling pg_prewarm to make sure the block is indeed in shared_buffers, and by repeating the test after a clean restart to make sure it also hits the not-in-shared-buffers case.
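To make that concrete, here is a minimal sketch of such a per-block check, written against the bufmgr internals of this era (pre-v16 BufferTag API). The function name check_one_block() is hypothetical, not the patch's; for brevity the sketch omits error handling and the IO-lock handling discussed above, which the real approach relies on to rule out torn page reads between the buffer lookup and the disk read. It also assumes data checksums are enabled on the cluster.

```c
/* Hypothetical sketch, not the actual patch code. */
#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/smgr.h"
#include "utils/rel.h"

static bool
check_one_block(Relation rel, ForkNumber forknum, BlockNumber blkno)
{
	PGAlignedBlock buf;
	BufferTag	tag;
	uint32		hash;
	LWLock	   *partLock;
	int			buf_id;
	bool		dirty = false;

	INIT_BUFFERTAG(tag, rel->rd_node, forknum, blkno);
	hash = BufTableHashCode(&tag);
	partLock = BufMappingPartitionLock(hash);

	/* Shared partition lock prevents concurrent eviction of the buffer. */
	LWLockAcquire(partLock, LW_SHARED);
	buf_id = BufTableLookup(&tag, hash);
	if (buf_id >= 0)
	{
		/* Hold the buffer header spinlock just long enough to get state. */
		BufferDesc *bufdesc = GetBufferDescriptor(buf_id);
		uint32		state = LockBufHdr(bufdesc);

		dirty = (state & BM_DIRTY) != 0;
		UnlockBufHdr(bufdesc, state);
	}
	LWLockRelease(partLock);

	/*
	 * Ignore dirty blocks: the checkpointer (or recovery) will rewrite
	 * them, so the current on-disk image is not authoritative.
	 */
	if (dirty)
		return true;

	/* Buffered read; no attempt is made to bypass the OS cache. */
	RelationOpenSmgr(rel);		/* RelationGetSmgr() on PG 15+ */
	smgrread(rel->rd_smgr, forknum, blkno, buf.data);

	/* All-zero (new) pages carry no checksum. */
	if (PageIsNew((Page) buf.data))
		return true;

	return pg_checksum_page(buf.data, blkno) ==
		((PageHeader) buf.data)->pd_checksum;
}
```

Skipping dirty buffers is what keeps the check cheap while staying correct: a dirty buffer's on-disk image is stale and about to be rewritten, whereas a clean buffer must match the on-disk page, which is exactly the case the doomsday scenario above would otherwise miss.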