Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Re: Online checksums verification in the backend
Msg-id 20200316082122.3wr4swqvpe7j5cg5@nol
In response to Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Thanks for the review, Michael!

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:
> > The cfbot reported a build failure, so here's a rebased v2 which also contains
> > the pg_stat_report_failure() call and extra log info.
>
> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> + *   block is read from disk without taking any lock.  If an error is detected,
> + *   the read block will be discarded and retrieved again while holding the
> + *   LWLock.  This is because an error due to concurrent write is possible but
> + *   very unlikely, so it's better to have an optimistic approach to limit
> + *   locking overhead
> This can be risky with false positives, no?


Do you mean a high probability of false positives in the first pass, so that
the recheck (which can't produce any false positive) would run frequently,
rather than the second check itself being able to produce false positives?


> With a large amount of
> shared buffer eviction you actually increase the risk of torn page
> reads.  Instead of a logic relying on partition mapping locks, which
> could be unwise on performance grounds, did you consider different
> approaches?  For example a kind of pre-emptive lock on the page in
> storage to prevent any shared buffer operation to happen while the
> block is read from storage, that would act like a barrier.


Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page.  Unless I'm
mistaken, it can only happen if all of these steps happen concurrently with
the block read performed just after releasing the LWLock:

- postgres reads the same block into shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems unlikely enough that the optimistic
lock-less approach looks like a very good tradeoff.
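
To make that concrete, here's a minimal sketch of the strategy as I understand
it, reusing the existing buffer-mapping and smgr APIs; check_one_block() and
page_is_valid() are illustrative names of mine, not necessarily what the patch
will end up with:

    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* true if the buffer contents look valid for block blkno */
    static bool
    page_is_valid(char *page, BlockNumber blkno)
    {
        return PageIsNew((Page) page) ||
            pg_checksum_page(page, blkno) == ((PageHeader) page)->pd_checksum;
    }

    /*
     * Optimistic per-block check: probe the buffer mapping under the
     * partition LWLock, read from disk without any lock held, and only
     * re-read under the lock if the first read looks corrupted.
     */
    static bool
    check_one_block(Relation rel, ForkNumber forknum, BlockNumber blkno)
    {
        PGAlignedBlock buf;
        BufferTag   tag;
        uint32      hash;
        LWLock     *partLock;

        INIT_BUFFERTAG(tag, rel->rd_node, forknum, blkno);
        hash = BufTableHashCode(&tag);
        partLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partLock, LW_SHARED);
        if (BufTableLookup(&tag, hash) >= 0)
        {
            /* in shared_buffers: handled by the other rules */
            LWLockRelease(partLock);
            return true;
        }
        LWLockRelease(partLock);

        /* optimistic read with no lock held: a torn read is possible */
        RelationOpenSmgr(rel);
        smgrread(rel->rd_smgr, forknum, blkno, buf.data);

        if (page_is_valid(buf.data, blkno))
            return true;

        /*
         * Possible false positive due to a concurrent write: re-read while
         * holding the partition lock, which prevents the block from being
         * loaded (and thus dirtied and written) concurrently, and trust
         * this second result.
         */
        LWLockAcquire(partLock, LW_SHARED);
        if (BufTableLookup(&tag, hash) >= 0)
        {
            /* loaded concurrently after all, nothing to check here */
            LWLockRelease(partLock);
            return true;
        }
        smgrread(rel->rd_smgr, forknum, blkno, buf.data);
        LWLockRelease(partLock);

        return page_is_valid(buf.data, blkno);
    }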


>
> + * Vacuum's GUCs are used to avoid consuming too much resources while running
> + * this tool.
> Shouldn't this involve separate GUCs instead of the VACUUM ones?


We could, but the access patterns looked so similar that reusing them seemed
like a good way to avoid adding two new GUCs and keep the configuration
simple.  Unless there are objections, I'll add separate ones in the next
version.
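
For context, the reuse currently amounts to little more than driving the
existing cost-based vacuum machinery; a minimal sketch, where charging a
page-miss cost per checked block is my choice of illustration:

    #include "commands/vacuum.h"
    #include "miscadmin.h"

    /*
     * Called once per checked block: charge the same cost as a vacuum page
     * miss and let vacuum_delay_point() nap once the balance exceeds
     * vacuum_cost_limit.  This assumes the caller enabled the machinery
     * the way vacuum does (VacuumCostActive = VacuumCostDelay > 0).
     */
    static void
    check_delay_point(void)
    {
        if (!VacuumCostActive)
            return;

        VacuumCostBalance += VacuumCostPageMiss;
        vacuum_delay_point();
    }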

> I guess that this leads to the fact that this function may be better as
> a contrib module, with the addition of some better-suited APIs in core
> (see paragraph above).


Below?


>
> +Run
> +    make check
> +or
> +    make installcheck
> Why is installcheck mentioned here?


Oops, copy/pasto error from the original contrib module this stuff was
initially implemented as; will fix.

>
> I don't think that it is appropriate to place the SQL-callable part in
> the existing checksum.c.  I would suggest instead a new file, say
> checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
> for checksums.


Agreed.

>
> -SUBDIRS = perl regress isolation modules authentication recovery
>  subscription
> +SUBDIRS = perl regress isolation modules authentication check_relation \
> +     recovery subscription
> It seems to me that this test would be a good fit for
> src/test/modules/test_misc/.


WFM.

>
> +static void
> +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
> +                         ForkNumber forknum)
> Per the argument of bloat, I think that I would remove
> check_all_relation() as this function could take a very long time to
> run, and just make the SQL function strict.


No objection.

>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Wouldn't it actually be a good thing to check that the page on storage
> is fine in this case?  This depends on the system settings and the
> checkpoint frequency, but checkpoint_timeout can be extended up to 1
> day.  And plenty of things could happen to the storage in one day,
> including a base backup that includes a corrupted page on storage,
> that this function would not be able to detect.


How could that lead to data corruption?  If postgres crashes before the
checkpoint completes, the block will be overwritten during recovery, and if a
base backup is taken the block will also be overwritten while replaying all
the required WAL.  Detecting corrupted blocks in those cases would have the
merit of warning about possibly broken hardware sooner, but it would also make
the check more expensive, as the odds of preventing postgres from evicting a
dirty block are much higher.  Maybe an additional GUC for that?

For the record, when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relations were sometimes
wrongly reported as corrupted.  I didn't try to investigate further, though.


> + *   we detect if a block is in shared_buffers or not.  See get_buffer()
> + *   comments for more details about the locking strategy.
> get_buffer() does not exist in your patch, check_get_buffer() does.


Oops, will fix.


>

> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> [...]
> + * To avoid torn page and possible false postives when reading data, and
> Typos.
>
> +   if (!DataChecksumsEnabled())
> +       elog(ERROR, "Data checksums are not enabled");
> Note that elog() is for the class of errors which are never expected,
> and here a caller of pg_check_relation() with checksums disabled can
> trigger that.  So you need to call ereport() with
> ERRCODE_FEATURE_NOT_SUPPORTED.


Indeed, will fix.
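
I.e. something like:

    if (!DataChecksumsEnabled())
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("data checksums are not enabled")));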


>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Not sure that the indentation is going to react well on that part of
> the patch, perhaps it would be better to add some "/*-------" at the
> beginning and end of the comment block to tell pgindent to ignore this
> part?


Ok.  Although I think only the beginning comment is needed?
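I.e. with the block written as:

    /*----------
     * - if a block is dirty in shared_buffers, it's ignored as it'll be
     *   flushed to disk either before the end of the next checkpoint or
     *   during recovery in case of unsafe shutdown
     *----------
     */

As far as I know, pgindent only looks at the opening "/*----" to decide to
leave the block alone, so the trailing dashes are just the usual convention.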

>
> Based on the feedback gathered on this thread, I guess that you should
> have a SRF returning the list of broken blocks, as well as NOTICE
> messages.


The current patch has an SRF and a WARNING message; do you want an additional
NOTICE message, or should the existing one be downgraded?

> Another thing to consider is the addition of a range
> argument to only check a certain portion of the blocks, say one
> segment file at a time, etc.  Fine by me to not include in the first
> flavor of the patch.


Ok!
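
FWIW, a range variant could be as simple as plugging two extra arguments into
the per-block check sketched earlier; purely illustrative, the argument
handling and names are hypothetical:

    /*
     * Check blocks in [startblk, endblk] of the given fork, clamping to
     * the current relation size and throttling as we go.
     */
    static void
    check_relation_range(Relation rel, ForkNumber forknum,
                         BlockNumber startblk, BlockNumber endblk)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);
        BlockNumber blkno;

        if (nblocks == 0)
            return;
        if (endblk >= nblocks)
            endblk = nblocks - 1;

        for (blkno = startblk; blkno <= endblk; blkno++)
        {
            if (!check_one_block(rel, forknum, blkno))
                ereport(WARNING,
                        (errmsg("invalid page in block %u of relation \"%s\"",
                                blkno, RelationGetRelationName(rel))));
            check_delay_point();
        }
    }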


> The patch needs documentation.


I'll try to add some.


