
From: Julien Rouhaud
Subject: Re: Online checksums verification in the backend
Date:
Msg-id: CAOBaU_ag_cgJMEw-d=UhaJT=8TW0UAgZREzAFz-tesmJsQCv+A@mail.gmail.com
In response to: Re: Online checksums verification in the backend (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > This brings the second consideration: how to report the list of
> > corrupted blocks to end users.  As I said, this is for now returned
> > via the SRF, but this is clearly not ideal and should rather be made
> > available more globally.  One usage of this information could be
> > block level recovery.  I'm Cc-ing Sawada-san, as I know he's working
> > on this and mentioned to me that he had ideas on passing the list of
> > corrupted blocks using the stats collector.
>
> Yes, the list of corrupted pages is necessary for single page
> recovery. Apart from single page recovery, I think it's helpful for
> DBAs if they can find the corrupted blocks in the server logs and in a
> system view.
>
> I've also tried reporting corrupted pages to the stats collector
> while researching single page recovery in PostgreSQL, but one problem
> is that the statistics in the stats collector are cleared during
> crash recovery. I want the information about block corruption to
> survive even when the server goes down.

Yes, having the list of corrupted blocks survive a crash-and-restart
cycle, and also remain available after a clean shutdown, is definitely
important.

> And we might want to add checksums to the permanent file holding the
> information about database corruption. The correctness of this
> information is important, because we can fix a database by restoring
> some tables from a logical backup or by reindexing etc., as long as
> we have non-broken information about the corruption.

Agreed

> > Finally, the read and locking considerations.  I tried to cover that
> > extensively in the comments, but here are some details on how I tried
> > to make the check safe while trying to keep the overhead as low as
> > possible.  First thing is that this is only doing buffered reads,
> > without any attempt to discard OS cache.  Therefore, any discrepancy
> > between the OS cache and the disk cannot be detected unless you do
> > other actions, such as sync / drop_caches on GNU/Linux.
> >
> > An access share lock on the currently checked relation is held,
> > meaning that it can't get deleted/truncated.  The total number of
> > blocks for the given fork is retrieved first, so any new block will be
> > ignored.  Such new blocks are considered out of scope as being written
> > after the start of the check.
> >
> > Each time a buffer is being checked, the target buffer mapping
> > partition lock is acquired in shared mode, to prevent concurrent
> > eviction.  If the buffer is found in shared buffers, it's pinned and
> > released immediately, just to get the state.
>
> I wonder if there is a possibility that blocks on disk can be
> corrupted even if they are loaded into shared buffers. ISTM the above
> method cannot detect such corruption. Reading and checking blocks
> quickly is attractive, but I think it's also important to check
> blocks thoroughly without overlooking anything.

It can definitely happen, and it's the usual doomsday scenario: the
database works fine for months, then postgres is restarted, say for a
minor version upgrade, and then boom, the most popular blocks that are
constantly used read-only were corrupted on disk but never evicted
from shared buffers, and you have a major outage.  I have
unfortunately witnessed that too many times.  This is especially bad
because in this kind of scenario you typically discover the corruption
once every backup contains only the corrupted blocks.

Note that in the approach I'm suggesting, I do verify blocks that are
loaded in shared buffers; I only skip the dirty blocks, as they'll be
written out by the checkpointer, or by the recovery process in case of
an unclean shutdown.  A buffer pin isn't necessary to avoid a torn
page read: an IO lock also guarantees that, with less overhead.  The
included TAP test should also detect the corruption of a
present-in-shared-buffers-but-not-dirty block.  It could however be
improved, e.g. by calling pg_prewarm to make sure that the block is
indeed in shared_buffers, and by repeating the test after a clean
restart to make sure that it hits the not-in-shared-buffers case.
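
To make that sequence concrete, here's a minimal sketch of what
checking a single block could look like (only an illustration in
backend style, not code from the patch: check_one_block() is a
hypothetical helper, and the IO-lock handling against torn reads,
error handling and the actual corruption reporting are left out):

#include "postgres.h"

#include "storage/buf_internals.h"
#include "storage/bufpage.h"
#include "storage/lwlock.h"
#include "storage/smgr.h"

/*
 * Check the checksum of one block of the given relation fork, skipping
 * it when a dirty copy sits in shared buffers (the dirty version will
 * be written out at the next checkpoint, or by recovery).
 */
static bool
check_one_block(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
{
    BufferTag   tag;
    uint32      hash;
    LWLock     *partLock;
    int         buf_id;
    bool        dirty = false;
    PGAlignedBlock buf;         /* aligned scratch block for the read */

    INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forknum, blkno);
    hash = BufTableHashCode(&tag);
    partLock = BufMappingPartitionLock(hash);

    /*
     * Hold the mapping partition lock in shared mode so the buffer
     * can't be evicted or replaced while we look at its state.
     */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hash);
    if (buf_id >= 0)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
        uint32      state = LockBufHdr(bufHdr);

        dirty = (state & BM_DIRTY) != 0;
        UnlockBufHdr(bufHdr, state);
    }
    LWLockRelease(partLock);

    /* Dirty in shared buffers: the on-disk version is irrelevant. */
    if (dirty)
        return true;

    /* Buffered read from disk, then checksum verification. */
    smgrread(smgr, forknum, blkno, buf.data);
    return PageIsVerified((Page) buf.data, blkno);
}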


