Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Online checksums verification in the backend
Date
Msg-id CA+fd4k5BBPfsPVOZ0JrH5G8h1aWL=UmUWimRL+jKovp5BTfspg@mail.gmail.com
In response to Re: Online checksums verification in the backend  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > This brings the second consideration: how to report the list of
> > > corrupted blocks to end users.  As I said, this is for now returned
> > > via the SRF, but that is clearly not ideal and it should rather be
> > > made available more globally.  One usage of this information could
> > > be block level recovery.  I'm Cc-ing Sawada-san, as I know he's
> > > working on this and mentioned to me that he had ideas on passing
> > > the list of corrupted blocks using the stats collector.
> >
> > Yes, the list of corrupted pages is necessary for single page
> > recovery. Apart from single page recovery, I think it would also be
> > helpful for DBAs if they could find the corrupted blocks in the
> > server logs and in a system view.
> >
> > While researching single page recovery in PostgreSQL, I also tried
> > reporting corrupted pages to the stats collector, but one problem is
> > that the statistics in the stats collector are cleared during crash
> > recovery. I want the block corruption information to survive even
> > when the server goes down.
>
> Yes, having the list of corrupted blocks surviving a crash-and-restart
> cycle, and also available after a clean shutdown is definitely
> important.
>
> > And we might want to add checksums to the permanent file that holds
> > the database corruption information. The correctness of this
> > information is important, because we can repair a database by
> > restoring some tables from a logical backup, by reindexing, etc., as
> > long as the corruption information itself is not broken.
>
> Agreed
>
> > > Finally, the read and locking considerations.  I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while trying to keep the overhead as low as
> > > possible.  First thing is that this is only doing buffered reads,
> > > without any attempt to discard OS cache.  Therefore, any discrepancy
> > > between the OS cache and the disk cannot be detected unless you do
> > > other actions, such as sync / drop_caches on GNU/Linux.
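
(As a side note for the archives, the sync / drop_caches action
mentioned above looks like this on GNU/Linux. It needs root, and it
drops all clean page cache entries system-wide, so expect a temporary
I/O hit on everything running on the host, not just PostgreSQL:)

```shell
# Flush dirty pages to disk first, then drop clean page cache entries.
sync
echo 3 > /proc/sys/vm/drop_caches   # 1 = pagecache, 2 = dentries/inodes, 3 = both
```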
> > >
> > > An access share lock on the currently checked relation is held,
> > > meaning that it can't get deleted/truncated.  The total number of
> > > blocks for the given fork is retrieved first, so any new block will be
> > > ignored.  Such new blocks are considered out of scope as being written
> > > after the start of the check.
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction.  If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get the state.
> >
> > I wonder if there is a possibility that blocks on disk are corrupted
> > even though they are loaded in shared buffers. ISTM the above method
> > cannot detect such corruption. Reading and checking blocks quickly is
> > attractive, but I think it's also important to check blocks
> > thoroughly, without overlooking anything.
>
> It can definitely happen, and it's the usual doomsday scenario: the
> database works fine for months, then postgres is restarted, say for a
> minor version upgrade, and boom: the most popular blocks, constantly
> accessed read-only, were corrupted on disk but never evicted from
> shared buffers, and you have a major outage.  Unfortunately, I have
> witnessed that too many times.  This is especially bad because in this
> kind of scenario you typically discover the corruption once every
> backup contains only the corrupted blocks.
>
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers; I only ignore the dirty blocks, as they'll
> be written by the checkpointer or the recovery process in case of an
> unclean shutdown.  A buffer pin isn't necessary to avoid a torn page
> read: an IO lock also guarantees that, with less overhead.  The
> included TAP test should also detect the corruption of a
> present-in-shared-buffers-but-not-dirty block.  It could however be
> improved, e.g. by calling pg_prewarm to make sure that the block is
> indeed in shared_buffers, and by running the same test after a clean
> restart to make sure that it hits the not-in-shared-buffers case.

It reads blocks from disk even if they are loaded in shared buffers.
Now I understand. Thanks!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


