Re: Online checksums verification in the backend - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: Online checksums verification in the backend
Msg-id: CA+fd4k5BBPfsPVOZ0JrH5G8h1aWL=UmUWimRL+jKovp5BTfspg@mail.gmail.com
In response to: Re: Online checksums verification in the backend (Julien Rouhaud <rjuju123@gmail.com>)
List: pgsql-hackers
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > This brings the second consideration: how to report the list of corrupted
> > > blocks to end users.  As I said, this is for now returned via the SRF,
> > > but this is clearly not ideal and should rather be made available more
> > > globally.  One usage of this information could be block-level recovery.
> > > I'm Cc-ing Sawada-san, as I know he's working on this and mentioned to
> > > me that he had ideas on passing the list of corrupted blocks using the
> > > stats collector.
> >
> > Yes, the list of corrupted pages is necessary for single page recovery.
> > Apart from single page recovery, I think it's also helpful for DBAs if
> > they can find the corrupted blocks in the server logs and in a system
> > view.
> >
> > I also tried to report corrupted pages to the stats collector while
> > researching single page recovery in PostgreSQL, but one problem is that
> > the statistics in the stats collector are cleared during crash recovery.
> > I want the information about block corruption to survive even when the
> > server goes down.
>
> Yes, having the list of corrupted blocks survive a crash-and-restart
> cycle, and also be available after a clean shutdown, is definitely
> important.
>
> > And we might want to add checksums to the permanent file holding the
> > information about database corruption.  The correctness of that
> > information is important, because we can fix a database by restoring
> > some tables from a logical backup or by doing a reindex etc. as long as
> > we have non-broken information about the corruption.
>
> Agreed.
>
> > > Finally, the read and locking considerations.  I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while keeping the overhead as low as possible.
> > > The first thing is that this is only doing buffered reads, without any
> > > attempt to discard the OS cache.  Therefore, any discrepancy between
> > > the OS cache and the disk cannot be detected unless you take other
> > > actions, such as sync / drop_caches on GNU/Linux.
> > >
> > > An access share lock on the currently checked relation is held, meaning
> > > that it can't get deleted/truncated.  The total number of blocks for
> > > the given fork is retrieved first, so any new block will be ignored.
> > > Such new blocks are considered out of scope, as they are written after
> > > the start of the check.
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction.  If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get its state.
> >
> > I wonder if there is a possibility that blocks on disk are corrupted
> > even though they are loaded in shared buffers.  ISTM the above method
> > cannot detect such corruption.  Reading and checking blocks fast is
> > attractive, but I thought it's also important to check blocks precisely
> > without overlooking anything.
>
> It can definitely happen, and it's the usual doomsday scenario: the
> database works fine for months, then postgres is restarted, say for a
> minor version upgrade, and then boom: the most popular blocks, which were
> constantly used read-only, were corrupted on disk but never evicted from
> shared buffers, and you have a major outage.  I have unfortunately
> witnessed that too many times.  This is especially bad, as in this kind
> of scenario you typically discover the corruption once all backups
> contain only the corrupted blocks.
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers; I only ignore the dirty blocks, as they'll be
> written by the checkpointer, or by the recovery process in case of an
> unclean shutdown.  A buffer pin isn't necessary to avoid a torn page
> read; an IO lock also guarantees that, and causes less overhead.  The
> included TAP test should also detect the corruption of a
> present-in-shared-buffers-non-dirty block.  It could however be improved,
> e.g. by calling pg_prewarm to make sure that the block is indeed in
> shared_buffers, and also by running the same test after a clean restart
> to make sure that it hits the not-in-shared-buffers case.

So it reads blocks from disk even if they are loaded in shared buffers.
Now I understand.  Thanks!

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services