Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Online checksums verification in the backend
Date
Msg-id CA+fd4k4B8QVxhQa8=5pf7SuDcB0vBhqzOn0c59FxVa99CF0ptQ@mail.gmail.com
Whole thread Raw
In response to Re: Online checksums verification in the backend  (Julien Rouhaud <rjuju123@gmail.com>)
Responses Re: Online checksums verification in the backend  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:
> > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > > >> With a large amount of
> > > >> shared buffer eviction you actually increase the risk of torn page
> > > >> reads.  Instead of a logic relying on partition mapping locks, which
> > > >> could be unwise on performance grounds, did you consider different
> > > >> approaches?  For example a kind of pre-emptive lock on the page in
> > > >> storage to prevent any shared buffer operation to happen while the
> > > >> block is read from storage, that would act like a barrier.
> > > >
> > > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > > think that there's a high probability of hitting a torn page.  Unless I'm
> > > > mistaken it can only happen if all those steps happen concurrently to doing the
> > > > block read just after releasing the LWLock:
> > > >
> > > > - postgres read the same block in shared_buffers (including all the locking)
> > > > - dirties it
> > > > - writes part of the page
> > > >
> > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > > approach seems like a very good tradeoff.
> > >
> > > Having false reports in this area could be very confusing for the
> > > user.  That's for example possible now with checksum verification and
> > > base backups.
> >
> >
> > I agree, however this shouldn't be the case here, as the block will be
> > rechecked while holding proper lock the 2nd time in case of possible false
> > positive before being reported as corrupted.  So the only downside is to check
> > twice a corrupted block that's not found in shared buffers (or concurrently
> > loaded/modified/half flushed).  As the number of corrupted or concurrently
> > loaded/modified/half flushed blocks should usually be close to zero, it seems
> > worthwhile to have a lockless check first for performance reason.
>
>
> I just noticed some dumb mistakes while adding the new GUCs.  v5 attached to
> fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+       </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

2.
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or the specified fork of a given relation and verify their checksum.  It
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  You must be a member of the
+    <literal>pg_read_all_stats</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>

* I think we need a description about possible values for 'fork'
(i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork'
is omitted.

* Do we need to explain about checksum cost-based delay here?

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations() which iterates all relations on the database to
do checksum checks. But now that we don't have that function and
pg_check_relation processes one relation. Can we just do
relation_open() here?

5.
I think we need to check if the relation is a temp relation. I'm not
sure it's worth to check checksums of temp relations but at least we
need not to check other's temp relations.

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issue.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion..
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but it then
+ * alloc'ed in shared_buffers, modified and flushed concurrently when we
+ * later try to read it, leading to false positive due to torn page.  Caller
+ * can read first the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+                bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating we found
a buffer to check, instead of having '*checkit'. That way, we can
simplify the following code:

+       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                        &checkit, &found_in_sb);
+
+       if (!checkit)
+           continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                          &found_in_sb))
+     continue;

8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't by any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affect PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

* The patch doesn't use checksum_cost_page_hit at all.

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
 int            max_parallel_workers = 8;
 int            MaxBackends = 0;

+int            ChecksumCostPageHit = 1;    /* GUC parameters for
checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for
checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c since these parameters are used only
in checksum.c and it does I/O my itself.

10.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could do pgstat_report_checksum_failure() and emit WARNING
twice for the same page since PageIsVerified() also does the same.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM
Next
From: Guancheng Luo
Date:
Subject: Re: [PATCH] Check operator when creating unique index on partitiontable