Online checksums verification in the backend

From
Julien Rouhaud
Date:
Hi,

This topic was discussed several times, with the most recent
discussions found at [1] and [2].  Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend.  I've
been working on such an approach at VMware, and I'm submitting it here
to discuss the approach and rationales, and hopefully have such a
feature integrated.

First, this was originally developed as an extension.  It means that
the check is performed using an SRF.  That's maybe not the best
approach, as a transaction has to be kept open for the total
processing time.  That can be mitigated by checking each relation
independently, but it's still not ideal.  Maybe using some utility
command (as part of VACUUM or a new CHECK command for instance) would
be a better approach.

This brings the second consideration: how to report the list of
corrupted blocks to end users.  As I said, this is for now returned
via the SRF, but this is clearly not ideal and should rather be made
available more globally.  One usage of this information could be block
level recovery.  I'm Cc-ing Sawada-san, as I know he's working on this
and mentioned to me that he had ideas on passing the list of corrupted
blocks using the stats collector.

Finally, the read and locking considerations.  I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible.  First thing is that this is only doing buffered reads,
without any attempt to discard OS cache.  Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated.  The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored.  Such new blocks are considered out of scope as being written
after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction.  If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.  If the buffer is found
dirty, no check is performed as it'll be written to disk by the
checkpointer, or during recovery in case of unclean shutdown.
Otherwise, an IO lock is held while the buffer is being read into a
private buffer.  The IO lock and buffer mapping lock are released, and then
the check is performed.
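In rough C terms, the decision for one block could be modeled as below.  This
is a self-contained, single-threaded sketch, not the patch code: the struct,
enum and function names are invented for illustration, and the locking
primitives are elided into comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative snapshot of a shared buffer lookup result. */
typedef struct
{
    bool    in_cache;           /* block present in shared buffers? */
    bool    dirty;              /* BM_DIRTY set? */
} BufferSnapshot;

typedef enum
{
    CHECK_SKIP_DIRTY,           /* dirty: checkpointer/recovery rewrites it */
    CHECK_READ_UNDER_IOLOCK,    /* clean and cached: read disk under IO lock */
    CHECK_READ_OPTIMISTIC       /* not cached: read disk without any lock */
} CheckAction;

/*
 * Decide how to read one block, mirroring the flow described above.
 * In the real flow, the buffer mapping partition lock is held in
 * shared mode around this lookup so the buffer can't be evicted.
 */
static CheckAction
classify_block(const BufferSnapshot *buf)
{
    if (buf->in_cache)
    {
        if (buf->dirty)
            return CHECK_SKIP_DIRTY;
        /* clean in cache: hold the IO lock, read from disk, then verify */
        return CHECK_READ_UNDER_IOLOCK;
    }
    /* not in cache: release the mapping lock and read optimistically */
    return CHECK_READ_OPTIMISTIC;
}
```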

If the buffer is not found in shared buffers, the buffer mapping
partition lock is released immediately and the block is read from
disk.  It's therefore possible to get a false positive here, as the
block could be concurrently read, modified and partially written to
disk.  So, if an error is detected in this case, the check is
restarted from scratch and if the buffer is still not found in shared
buffers, the read will be done while still holding the buffer mapping
partition lock to make sure that it can't get concurrently loaded and
modified.  This is an optimistic approach to avoid performance
overhead, assuming that there shouldn't be a lot of positives, and
that the window for false positives is very narrow.
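That retry logic amounts to something like the following self-contained
simulation.  The "torn" flag and all names here are invented to illustrate
"verify without the lock first, and only re-read under the mapping lock if
the first attempt fails"; they are not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of a block on disk: the "torn" flag simulates a
 * concurrent partial write racing with our first, unlocked read. */
typedef struct
{
    bool    torn_on_unlocked_read;
    bool    actually_corrupted;
} DiskBlock;

/* Stand-in for one read-and-checksum-verify attempt. */
static bool
read_and_verify(const DiskBlock *blk, bool holding_mapping_lock)
{
    if (!holding_mapping_lock && blk->torn_on_unlocked_read)
        return false;           /* a torn read looks like a bad checksum */
    return !blk->actually_corrupted;
}

/*
 * Optimistic strategy from the text: try without the buffer mapping
 * partition lock; on failure, retry while holding it so the block
 * cannot be concurrently loaded and modified under us.  Only blocks
 * that fail the locked re-read are reported as corrupted.
 */
static bool
check_block(const DiskBlock *blk, int *locked_retries)
{
    if (read_and_verify(blk, false))
        return true;            /* common case: no locking needed */
    (*locked_retries)++;
    return read_and_verify(blk, true);
}
```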

The check itself simply compares the stored and computed checksums,
with an additional check that the page really is new (using
PageIsVerified) when PageIsNew() reports it as such.  Since this is
done after releasing all locks, we could definitely add more checks
without causing much overhead, like pd_lower/pd_upper sanity.  I
prefer to keep the check simple for now and rather focus on the
general approach.
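As a rough, self-contained sketch of that check (a toy position-weighted sum
stands in for the real pg_checksum_page(), and a plain all-zeroes test stands
in for PageIsNew()/PageIsVerified(); sizes and names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 128          /* real PostgreSQL blocks are 8192 bytes */

typedef struct
{
    uint16_t    pd_checksum;    /* stored checksum, as in PageHeaderData */
    uint8_t     data[TOY_BLCKSZ - sizeof(uint16_t)];
} ToyPage;

/* Toy stand-in for pg_checksum_page(); it mixes in the block number,
 * like the real algorithm, so blocks written to the wrong place are
 * detected too. */
static uint16_t
toy_checksum(const ToyPage *page, uint32_t blkno)
{
    uint32_t    sum = blkno;

    for (size_t i = 0; i < sizeof(page->data); i++)
        sum += (uint32_t) (i + 1) * page->data[i];
    return (uint16_t) (sum % 65535 + 1);    /* never 0, as in PostgreSQL */
}

/* Stand-in for PageIsNew()/PageIsVerified(): a never-initialized page
 * is all zeroes and must not be reported as corrupted. */
static bool
toy_page_is_new(const ToyPage *page)
{
    const uint8_t *p = (const uint8_t *) page;

    for (size_t i = 0; i < sizeof(ToyPage); i++)
        if (p[i] != 0)
            return false;
    return true;
}

/* The check described above: new pages pass, otherwise the stored
 * checksum must match the computed one. */
static bool
toy_page_verify(const ToyPage *page, uint32_t blkno)
{
    if (toy_page_is_new(page))
        return true;
    return page->pd_checksum == toy_checksum(page, blkno);
}
```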

Finally, I also reused the vacuum costing GUCs (for simplicity) and
their approach to add some throttling.
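That vacuum-cost style throttling boils down to a sketch like the one below.
The GUC names are borrowed from VACUUM, and the accounting shown is a
simplification of what vacuum_cost_limit/vacuum_cost_delay actually do;
the sleep itself is elided into a comment so the sketch stays deterministic.

```c
#include <assert.h>

/* Illustrative equivalents of the vacuum costing GUCs. */
static int  vacuum_cost_limit = 200;    /* cost to accumulate before napping */
static int  cost_balance = 0;           /* cost accumulated so far */
static int  naps_taken = 0;             /* how many times we slept */

/*
 * Called after each page read: accumulate its cost and, once the
 * limit is reached, nap for vacuum_cost_delay and reset the balance,
 * throttling the checker's I/O rate.
 */
static void
check_cost_delay(int page_cost)
{
    cost_balance += page_cost;
    if (cost_balance >= vacuum_cost_limit)
    {
        /* pg_usleep(vacuum_cost_delay * 1000L); */
        naps_taken++;
        cost_balance = 0;
    }
}
```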

I'm attaching a patch that adds a new pg_check_relation() SQL function
to perform a check of one or all relations, and some simple regression
tests.

[1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de

Attachment

Re: Online checksums verification in the backend

From
Robert Haas
Date:
On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> This topic was discussed several times, with the most recent
> discussions found at [1] and [2].  Based on those discussions, my
> understanding is that the current approach in BASE_BACKUP has too many
> drawbacks and we should instead do this check in the backend.

Good idea.

> This brings the second consideration: how to report the list corrupted
> blocks to end users.  As I said this is for now returned via the SRF,
> but this is clearly not ideal and should rather be made available more
> globally.

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally." I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

I don't really know what's best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> > This brings the second consideration: how to report the list corrupted
> > blocks to end users.  As I said this is for now returned via the SRF,
> > but this is clearly not ideal and should rather be made available more
> > globally.
>
> Some people might prefer notices, because you can get those while the
> thing is still running, rather than a result set, which you will only
> see when the query finishes. Other people might prefer an SRF, because
> they want to have the data in structured form so that they can
> postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller.  I think that emitting a log/notice entry should always be
done on top of whatever other communication facility we're using.

> I guess one
> idea would be to provide a way to kick this off in the background via
> a background worker or similar and then have it put the results in a
> table. But that might fail if there are checksum errors in the
> catalogs themselves.

Yes, that's a concern.  We could maintain a list in (dynamic) shared
memory with a simple SQL wrapper to read the data, but that would be
lost on a crash/restart.  Or use
pgstat_report_checksum_failures_in_db(), modifying it to take a
relfilenode, blocknum and forknum and append that to some flat file,
hoping that it won't get corrupted either.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Some people might prefer notices, because you can get those while the
>> thing is still running, rather than a result set, which you will only
>> see when the query finishes. Other people might prefer an SRF, because
>> they want to have the data in structured form so that they can
>> postprocess it. Not sure what you mean by "more globally."
>
> I meant having the results available system-wide, not only to the
> caller.  I think that emitting a log/notice level should always be
> done on top on whatever other communication facility we're using.

The problem with notices and logs is that they tend to be ignored.
Now I don't see any problem either in adding something into the logs
which can be found later for parsing, on top of an SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep.  I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
make a report visible at database level in the catalogs, so that it is
possible to use that as a cheap high-level warning.  The details of
the failures could always be dug out of the logs or the result of the
function itself after finding out that something is wrong in
pg_stat_database.

>> I guess one
>> idea would be to provide a way to kick this off in the background via
>> a background worker or similar and then have it put the results in a
>> table. But that might fail if there are checksum errors in the
>> catalogs themselves.
>
> Yes that's a concern.  We could maintain a list in (dynamic) shared
> memory with a simple SQL wrapper to read the data, but that would be
> lost with a crash/restart.  Or use
> pgstat_report_checksum_failures_in_db(), modifying it to get an
> relfilenode, bocknum and forknum and append that to some flat files,
> hoping that it won't get corrupted either.

If a lot of blocks are corrupted, that could bloat things.  Hence some
retention policies would be necessary, and that's tricky to define and
configure properly.  I'd tend to be in the school of just logging the
information and being done with it, because that's simple and because
you won't need to worry about any more configuration.  Doing the work
in the background is still separate from a SQL-callable function
though, no?  In this case you need a connection to a database to allow
the checksum verification to happen on a relfilenode based on the
relation to check, also because you want the thing to be safe
concurrently (background work here would be a combo of a bgworker
triggering dynamic children working on one database, not necessarily
something that needs to be in core).
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >> Some people might prefer notices, because you can get those while the
> >> thing is still running, rather than a result set, which you will only
> >> see when the query finishes. Other people might prefer an SRF, because
> >> they want to have the data in structured form so that they can
> >> postprocess it. Not sure what you mean by "more globally."
> >
> > I meant having the results available system-wide, not only to the
> > caller.  I think that emitting a log/notice level should always be
> > done on top on whatever other communication facility we're using.
>
> The problem of notice and logs is that they tend to be ignored.  Now I
> don't see no problems either in adding something into the logs which
> can be found later on for parsing on top of a SRF returned by the
> caller which includes all the corruption details, say with pgbadger
> or your friendly neighborhood grep.  I think that any backend function
> should also make sure to call pgstat_report_checksum_failure() to
> report a report visible at database-level in the catalogs, so as it is
> possible to use that as a cheap high-level warning.  The details of
> the failures could always be dug from the logs or the result of the
> function itself after finding out that something is wrong in
> pg_stat_database.

I agree that adding extra information in the logs and calling
pgstat_report_checksum_failure is a must do, and I changed that
locally.  However, I doubt that the logs are the right place to find
the details of corrupted blocks.  There's no guarantee that the file
will be accessible to the DBA, nor that the content won't get
truncated by the time it's needed.  I really think that corruption is
important enough to justify a more specific location.

> >> I guess one
> >> idea would be to provide a way to kick this off in the background via
> >> a background worker or similar and then have it put the results in a
> >> table. But that might fail if there are checksum errors in the
> >> catalogs themselves.
> >
> > Yes that's a concern.  We could maintain a list in (dynamic) shared
> > memory with a simple SQL wrapper to read the data, but that would be
> > lost with a crash/restart.  Or use
> > pgstat_report_checksum_failures_in_db(), modifying it to get an
> > relfilenode, bocknum and forknum and append that to some flat files,
> > hoping that it won't get corrupted either.
>
> If a lot of blocks are corrupted, that could bloat things.  Hence some
> retention policies would be necessary, and that's tricky to define and
> configure properly.  I'd tend to be in the school of just logging the
> information and be done with it, because that's simple and because you
> won't need to worry about any more configuration.

If the number of corrupted blocks becomes high enough to excessively
bloat things, it's likely that the instance is doomed anyway, so I'm
not especially concerned about it.



Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> This topic was discussed several times, with the most recent
> discussions found at [1] and [2].  Based on those discussions, my
> understanding is that the current approach in BASE_BACKUP has too many
> drawbacks and we should instead do this check in the backend.  I've
> been working using such approach at VMware, and I'm submitting it here
> to discuss the approach and rationales, and hopefully have such a
> feature integrated.

Thank you for working on this!

>
> First, this was originally developed as an extension.  It means that
> the check is performed using an SRF.  That's maybe not the best
> approach, as a transaction has be kept for the total processing time.
> It can be leveraged by checking each relation independently, but
> that's still not ideal.  Maybe using some utility commands (as part of
> VACUUM or a new CHECK command for instance) would be a better
> approach.
>
> This brings the second consideration: how to report the list corrupted
> blocks to end users.  As I said this is for now returned via the SRF,
> but this is clearly not ideal and should rather be made available more
> globally.  One usage of this information could be block level
> recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> mentioned me that he had ideas on passing the list of corrupted blocks
> using the stat collector.

Yes, the list of corrupted pages is necessary for single page
recovery.  Apart from single page recovery, I think it's helpful for
DBAs if they can find the corrupted blocks in the server logs and in a
system view.

I've also tried to report corrupted pages to the stats collector while
researching single page recovery in PostgreSQL, but one problem is
that the statistics in the stats collector are cleared during crash
recovery.  I want the information about block corruption to survive
even when the server goes down.  And we might want to add checksums to
the permanent file holding the information about database corruption.
The correctness of this information is important, because we can fix
a database by restoring some tables from a logical backup or by doing
a reindex etc., as long as we have non-broken information about the
database corruption.

>
> Finally, the read and locking considerations.  I tried to cover that
> extensively in the comments, but here are some details on how I tried
> to make the check safe while trying to keep the overhead as low as
> possible.  First thing is that this is only doing buffered reads,
> without any attempt to discard OS cache.  Therefore, any discrepancy
> between the OS cache and the disk cannot be detected unless you do
> other actions, such as sync / drop_caches on GNU/Linux.
>
> An access share lock on the currently checked relation is held,
> meaning that it can't get deleted/truncated.  The total number of
> blocks for the given fork is retrieved first, so any new block will be
> ignored.  Such new blocks are considered out of scope as being written
> after the start of the check.
>
> Each time a buffer is being checked, the target buffer mapping
> partition lock is acquired in shared mode, to prevent concurrent
> eviction.  If the buffer is found in shared buffers, it's pinned and
> released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk can be
corrupted even if they are loaded in shared buffers.  ISTM the above
method cannot detect such corruption.  Reading and checking blocks
fast is attractive, but I think it's also important to check blocks
precisely, without overlooking anything.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > This brings the second consideration: how to report the list corrupted
> > blocks to end users.  As I said this is for now returned via the SRF,
> > but this is clearly not ideal and should rather be made available more
> > globally.  One usage of this information could be block level
> > recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> > mentioned me that he had ideas on passing the list of corrupted blocks
> > using the stat collector.
>
> Yes it's necessary the list of corrupted pages for single page
> recovery. Apart from single page recovery I think it's helpful for DBA
> if they can find the corrupted blocks in the server logs and on a
> system view.
>
> I've also tried to report corrupted pages to the stats collector
> during I researching single page recovery in PostgreSQL but one
> problem is that the statistics in the stats collector is cleared when
> crash recovery. I want the information of block corruption to survive
> even when the server down.

Yes, having the list of corrupted blocks surviving a crash-and-restart
cycle, and also available after a clean shutdown is definitely
important.

> And we might want to add checksums to the
> permanent file having information of database corruption. The
> correctness of these information would be important because we can fix
> a database by restoring some tables from a logical backup or by doing
> reindex etc as long as we have a non-broken information of database
> corruption.

Agreed

> > Finally, the read and locking considerations.  I tried to cover that
> > extensively in the comments, but here are some details on how I tried
> > to make the check safe while trying to keep the overhead as low as
> > possible.  First thing is that this is only doing buffered reads,
> > without any attempt to discard OS cache.  Therefore, any discrepancy
> > between the OS cache and the disk cannot be detected unless you do
> > other actions, such as sync / drop_caches on GNU/Linux.
> >
> > An access share lock on the currently checked relation is held,
> > meaning that it can't get deleted/truncated.  The total number of
> > blocks for the given fork is retrieved first, so any new block will be
> > ignored.  Such new blocks are considered out of scope as being written
> > after the start of the check.
> >
> > Each time a buffer is being checked, the target buffer mapping
> > partition lock is acquired in shared mode, to prevent concurrent
> > eviction.  If the buffer is found in shared buffers, it's pinned and
> > released immediately, just to get the state.
>
> I wonder if there is possibility that blocks on disk can be corrupted
> even if these are loaded to the shared buffer. ISTM the above method
> cannot detect such corruption. Reading and checking blocks fast is
> attractive but I thought it's also important to check blocks precisely
> without overlooking.

It can definitely happen, and it's the usual doomsday scenario: the
database is working fine for months, then postgres is restarted, say
for a minor version upgrade, and then boom, the most popular blocks
that are constantly used read-only were corrupted on disk but never
evicted from shared buffers, and you have a major outage.  I have
unfortunately witnessed that too many times.  This is especially bad
as, in this kind of scenario, you typically discover the corruption
once all backups contain only the corrupted blocks.

Note that in the approach I'm suggesting, I do verify blocks that are
loaded in shared buffers, I only ignore the dirty blocks, as they'll
be written by the checkpointer or recovery process in case of unclean
shutdown.  A buffer pin isn't necessary to avoid torn page reads; an
IO lock also guarantees that and causes less overhead.  The included TAP
test should also detect the corruption of a
present-in-shared-buffers-non-dirty block.  It could however be
improved eg. by calling pg_prewarm to make sure that it's indeed in
shared_buffers, and also do the same test after a clean restart to
make sure that it's hitting the not-in-shared-buffers case.



Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > This brings the second consideration: how to report the list corrupted
> > > blocks to end users.  As I said this is for now returned via the SRF,
> > > but this is clearly not ideal and should rather be made available more
> > > globally.  One usage of this information could be block level
> > > recovery.  I'm Cc-ing Sawada-san, as I know he's working on this and
> > > mentioned me that he had ideas on passing the list of corrupted blocks
> > > using the stat collector.
> >
> > Yes it's necessary the list of corrupted pages for single page
> > recovery. Apart from single page recovery I think it's helpful for DBA
> > if they can find the corrupted blocks in the server logs and on a
> > system view.
> >
> > I've also tried to report corrupted pages to the stats collector
> > during I researching single page recovery in PostgreSQL but one
> > problem is that the statistics in the stats collector is cleared when
> > crash recovery. I want the information of block corruption to survive
> > even when the server down.
>
> Yes, having the list of corrupted blocks surviving a crash-and-restart
> cycle, and also available after a clean shutdown is definitely
> important.
>
> > And we might want to add checksums to the
> > permanent file having information of database corruption. The
> > correctness of these information would be important because we can fix
> > a database by restoring some tables from a logical backup or by doing
> > reindex etc as long as we have a non-broken information of database
> > corruption.
>
> Agreed
>
> > > Finally, the read and locking considerations.  I tried to cover that
> > > extensively in the comments, but here are some details on how I tried
> > > to make the check safe while trying to keep the overhead as low as
> > > possible.  First thing is that this is only doing buffered reads,
> > > without any attempt to discard OS cache.  Therefore, any discrepancy
> > > between the OS cache and the disk cannot be detected unless you do
> > > other actions, such as sync / drop_caches on GNU/Linux.
> > >
> > > An access share lock on the currently checked relation is held,
> > > meaning that it can't get deleted/truncated.  The total number of
> > > blocks for the given fork is retrieved first, so any new block will be
> > > ignored.  Such new blocks are considered out of scope as being written
> > > after the start of the check.
> > >
> > > Each time a buffer is being checked, the target buffer mapping
> > > partition lock is acquired in shared mode, to prevent concurrent
> > > eviction.  If the buffer is found in shared buffers, it's pinned and
> > > released immediately, just to get the state.
> >
> > I wonder if there is possibility that blocks on disk can be corrupted
> > even if these are loaded to the shared buffer. ISTM the above method
> > cannot detect such corruption. Reading and checking blocks fast is
> > attractive but I thought it's also important to check blocks precisely
> > without overlooking.
>
> It can definitely happen, and it's the usual doomsday scenario:
> database is working fine for months, then postgres is restarted say
> for a minor version upgrade and then boom the most populars blocks
> that are constantly used in read only were corrupted on disk but never
> evicted from shared buffers, and you have a major outage.  I have
> witnessed that unfortunately too many times.  This is especially bad
> as in this kind of scenario, you typically discover the corruption
> once all backup only contains the corrupted blocks.
>
> Note that in the approach I'm suggesting, I do verify blocks that are
> loaded in shared buffers, I only ignore the dirty blocks, as they'll
> be written by the checkpointer or recovery process in case of unclean
> shutdown.  A bufferpin isn't necessary to avoid torn page read, an IO
> lock also guarantees that and causes less overhead.  The included TAP
> test should also detect the corruption of a
> present-in-shared-buffers-non-dirty block.  It could however be
> improved eg. by calling pg_prewarm to make sure that it's indeed in
> shared_buffers, and also do the same test after a clean restart to
> make sure that it's hitting the not-in-shared-buffers case.

It reads blocks from disk even if they are loaded in shared buffers.
Now I understand. Thanks!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:
> On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >> Some people might prefer notices, because you can get those while the
> > >> thing is still running, rather than a result set, which you will only
> > >> see when the query finishes. Other people might prefer an SRF, because
> > >> they want to have the data in structured form so that they can
> > >> postprocess it. Not sure what you mean by "more globally."
> > >
> > > I meant having the results available system-wide, not only to the
> > > caller.  I think that emitting a log/notice level should always be
> > > done on top on whatever other communication facility we're using.
> >
> > The problem of notice and logs is that they tend to be ignored.  Now I
> > don't see no problems either in adding something into the logs which
> > can be found later on for parsing on top of a SRF returned by the
> > caller which includes all the corruption details, say with pgbadger
> > or your friendly neighborhood grep.  I think that any backend function
> > should also make sure to call pgstat_report_checksum_failure() to
> > report a report visible at database-level in the catalogs, so as it is
> > possible to use that as a cheap high-level warning.  The details of
> > the failures could always be dug from the logs or the result of the
> > function itself after finding out that something is wrong in
> > pg_stat_database.
>
> I agree that adding extra information in the logs and calling
> pgstat_report_checksum_failure is a must do, and I changed that
> locally.  However, I doubt that the logs is the right place to find
> the details of corrupted blocks.  There's no guarantee that the file
> will be accessible to the DBA, nor that the content won't get
> truncated by the time it's needed.  I really think that corruption is
> important enough to justify more specific location.


The cfbot reported a build failure, so here's a rebased v2 which also contains
the pg_stat_report_failure() call and extra log info.

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:
> The cfbot reported a build failure, so here's a rebased v2 which also contains
> the pg_stat_report_failure() call and extra log info.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
This can be risky with false positives, no?  With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads.  Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches?  For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

+ * Vacuum's GUCs are used to avoid consuming too much resources while running
+ * this tool.
Shouldn't this involve separate GUCs instead of the VACUUM ones?  I
guess that this leads to the fact that this function may be better as
a contrib module, with the addition of some better-suited APIs in core
(see paragraph above).

+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

I don't think that it is appropriate to place the SQL-callable part in
the existing checksum.c.  I would suggest instead a new file, say
checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
for checksums.

-SUBDIRS = perl regress isolation modules authentication recovery
 subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+     recovery subscription
It seems to me that this test would be a good fit for
src/test/modules/test_misc/.

+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)
Per the argument of bloat, I think that I would remove
check_all_relations() as this function could take a very long time to
run, and just make the SQL function strict.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Wouldn't it actually be a good thing to check that the page on storage
is fine in this case?  This depends on the system settings and the
checkpoint frequency, but checkpoint_timeout can be extended up to 1
day.  And plenty of things could happen to the storage in one day,
including a base backup that includes a corrupted page on storage,
that this function would not be able to detect.

+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
Yeah, I think that you are right here to check the page on storage
anyway.

+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
get_buffer() does not exist in your patch, check_get_buffer() does.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
[...]
+ * To avoid torn page and possible false postives when reading data, and
Typos.

+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");
Note that elog() is for the class of errors which are never expected,
and here a caller of pg_check_relation() with checksums disabled can
trigger that.  So you need to call ereport() with
ERRCODE_FEATURE_NOT_SUPPORTED.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages.  Another thing to consider is the addition of a range
argument to only check a certain portion of the blocks, say one
segment file at a time, etc.  Fine by me to not include in the first
flavor of the patch.

The patch needs documentation.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Wed, 11 Mar 2020 at 16:18, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:
> > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > >> Some people might prefer notices, because you can get those while the
> > > >> thing is still running, rather than a result set, which you will only
> > > >> see when the query finishes. Other people might prefer an SRF, because
> > > >> they want to have the data in structured form so that they can
> > > >> postprocess it. Not sure what you mean by "more globally."
> > > >
> > > > I meant having the results available system-wide, not only to the
> > > > caller.  I think that emitting a log/notice level should always be
> > > > done on top on whatever other communication facility we're using.
> > >
> > > The problem of notice and logs is that they tend to be ignored.  Now I
> > > don't see no problems either in adding something into the logs which
> > > can be found later on for parsing on top of a SRF returned by the
> > > caller which includes all the corruption details, say with pgbadger
> > > or your friendly neighborhood grep.  I think that any backend function
> > > should also make sure to call pgstat_report_checksum_failure() to
> > > report a report visible at database-level in the catalogs, so as it is
> > > possible to use that as a cheap high-level warning.  The details of
> > > the failures could always be dug from the logs or the result of the
> > > function itself after finding out that something is wrong in
> > > pg_stat_database.
> >
> > I agree that adding extra information in the logs and calling
> > pgstat_report_checksum_failure is a must do, and I changed that
> > locally.  However, I doubt that the logs is the right place to find
> > the details of corrupted blocks.  There's no guarantee that the file
> > will be accessible to the DBA, nor that the content won't get
> > truncated by the time it's needed.  I really think that corruption is
> > important enough to justify more specific location.
>
>
> The cfbot reported a build failure, so here's a rebased v2 which also contains
> the pg_stat_report_failure() call and extra log info.

In addition to comments from Michael-san, here are my comments:

1.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));
+
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");

I think it's better to reverse the order of the above checks.

2.
+#define CRF_COLS           5   /* Number of output arguments in the SRF */

Should it be SRF_COLS?

3.
+static void
+check_delay_point(void)
+{
+   /* Always check for interrupts */
+   CHECK_FOR_INTERRUPTS();
+
+   /* Nap if appropriate */
+   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+   {
+       int         msec;
+
+       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+       if (msec > VacuumCostDelay * 4)
+           msec = VacuumCostDelay * 4;
+
+       pg_usleep(msec * 1000L);
+
+       VacuumCostBalance = 0;
+
+       /* Might have gotten an interrupt while sleeping */
+       CHECK_FOR_INTERRUPTS();
+   }
+}

Even if we use vacuum delay for this function, I think we need to set
VacuumDelayActive and return if it's false, or it's better to just
return if VacuumCostDelay == 0.

4.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)

I also agree with Michael-san to remove this function. Instead we can
check all relations by:

select pg_check_relation(oid) from pg_class;

6.
Other typos

s/dirted/dirtied/
s/explictly/explicitly/

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
Thanks for the review Michael!

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:
> > The cfbot reported a build failure, so here's a rebased v2 which also contains
> > the pg_stat_report_failure() call and extra log info.
>
> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> + *   block is read from disk without taking any lock.  If an error is detected,
> + *   the read block will be discarded and retrieved again while holding the
> + *   LWLock.  This is because an error due to concurrent write is possible but
> + *   very unlikely, so it's better to have an optimistic approach to limit
> + *   locking overhead
> This can be risky with false positives, no?


Do you mean a high probability of false positives in the first pass, so that
the recheck (which can't produce false positives) would run frequently, rather
than that the second check can lead to false positives?


> With a large amount of
> shared buffer eviction you actually increase the risk of torn page
> reads.  Instead of a logic relying on partition mapping locks, which
> could be unwise on performance grounds, did you consider different
> approaches?  For example a kind of pre-emptive lock on the page in
> storage to prevent any shared buffer operation to happen while the
> block is read from storage, that would act like a barrier.


Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page.  Unless I'm
mistaken, it can only happen if all of these steps happen concurrently with
the block read done just after releasing the LWLock:

- postgres reads the same block into shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach looks like a very good tradeoff.


>
> + * Vacuum's GUCs are used to avoid consuming too much resources while running
> + * this tool.
> Shouldn't this involve separate GUCs instead of the VACUUM ones?


We could, but the access pattern looked so similar that reusing the vacuum
GUCs seemed like a good way to keep the configuration simple, avoiding two new
GUCs.  Unless there are objections I'll add dedicated GUCs in the next version.

> I guess that this leads to the fact that this function may be better as
> a contrib module, with the addition of some better-suited APIs in core
> (see paragraph above).


Below?


>
> +Run
> +    make check
> +or
> +    make installcheck
> Why is installcheck mentioned here?


Oops, a copy/paste error from the original contrib module this stuff was
initially implemented as; will fix.

>
> I don't think that it is appropriate to place the SQL-callable part in
> the existing checksum.c.  I would suggest instead a new file, say
> checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
> for checksums.


Agreed.

>
> -SUBDIRS = perl regress isolation modules authentication recovery
>  subscription
> +SUBDIRS = perl regress isolation modules authentication check_relation \
> +     recovery subscription
> It seems to me that this test would be a good fit for
> src/test/modules/test_misc/.


WFM.

>
> +static void
> +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
> +                         ForkNumber forknum)
> Per the argument of bloat, I think that I would remove
> check_all_relations() as this function could take a very long time to
> run, and just make the SQL function strict.


No objection.

>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Wouldn't it actually be a good thing to check that the page on storage
> is fine in this case?  This depends on the system settings and the
> checkpoint frequency, but checkpoint_timeout can be extended up to 1
> day.  And plenty of things could happen to the storage in one day,
> including a base backup that includes a corrupted page on storage,
> that this function would not be able to detect.


How could that lead to data corruption?  If postgres crashes before the
checkpoint completes, the block will be overwritten during recovery, and if a
base backup is taken the block will also be overwritten while replaying all the
required WAL.  Detecting corrupted blocks in those cases would have the merit
of possibly warning about broken hardware sooner, but it would also make the
check more expensive, as the odds of having to prevent postgres from evicting
a dirty block are much higher.  Maybe an additional GUC for that?

For the record when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relation were sometimes
wrongly reported as corrupted.  I didn't try to investigate more though.


> + *   we detect if a block is in shared_buffers or not.  See get_buffer()
> + *   comments for more details about the locking strategy.
> get_buffer() does not exist in your patch, check_get_buffer() does.


Oops, will fix.


>

> + * - if a block is not found in shared_buffers, the LWLock is relased and the
> [...]
> + * To avoid torn page and possible false postives when reading data, and
> Typos.
>
> +   if (!DataChecksumsEnabled())
> +       elog(ERROR, "Data checksums are not enabled");
> Note that elog() is for the class of errors which are never expected,
> and here a caller of pg_check_relation() with checksums disabled can
> trigger that.  So you need to call ereport() with
> ERRCODE_FEATURE_NOT_SUPPORTED.


Indeed, will fix.


>
> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> + *   disk either before the end of the next checkpoint or during recovery in
> + *   case of unsafe shutdown
> Not sure that the indentation is going to react well on that part of
> the patch, perhaps it would be better to add some "/*-------" at the
> beginning and end of the comment block to tell pgindent to ignore this
> part?


Ok.  Although I think only the beginning comment is needed?

>
> Based on the feedback gathered on this thread, I guess that you should
> have a SRF returning the list of broken blocks, as well as NOTICE
> messages.


The current patch has an SRF and a WARNING message, do you want an additional
NOTICE message or downgrade the existing one?

> Another thing to consider is the addition of a range
> argument to only check a certain portion of the blocks, say one
> segment file at a time, etc.  Fine by me to not include in the first
> flavor of the patch.


Ok!


> The patch needs documentation.


I'll try to add some.



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
>
> In addition to comments from Michael-san, here are my comments:
>
> 1.
> +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                errmsg("only superuser or a member of the
> pg_read_server_files role may use this function")));


Good point!  I'll fix it.


> +
> +   if (!DataChecksumsEnabled())
> +       elog(ERROR, "Data checksums are not enabled");
>
> I think it's better to reverse the order of the above checks.


Indeed.


>
> 2.
> +#define CRF_COLS           5   /* Number of output arguments in the SRF */
>
> Should it be SRF_COLS?


Oops, will fix.


>
> 3.
> +static void
> +check_delay_point(void)
> +{
> +   /* Always check for interrupts */
> +   CHECK_FOR_INTERRUPTS();
> +
> +   /* Nap if appropriate */
> +   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
> +   {
> +       int         msec;
> +
> +       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
> +       if (msec > VacuumCostDelay * 4)
> +           msec = VacuumCostDelay * 4;
> +
> +       pg_usleep(msec * 1000L);
> +
> +       VacuumCostBalance = 0;
> +
> +       /* Might have gotten an interrupt while sleeping */
> +       CHECK_FOR_INTERRUPTS();
> +   }
> +}
>
> Even if we use vacuum delay for this function, I think we need to set
> VacuumDelayActive and return if it's false, or it's better to just
> return if VacuumCostDelay == 0.


Good point, I'll fix that.


>
> 4.
> +static void
> +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
> +                         ForkNumber forknum)
>
> I also agree with Michael-san to remove this function. Instead we can
> check all relations by:
>
> select pg_check_relation(oid) from pg_class;


Sure, but ideally we should do that from a client program (e.g. pg_checksums)
that wouldn't keep a transaction open for the whole execution.


> 6.
> Other typos
>
> s/dirted/dirtied/
> s/explictly/explicitly/


Will fix, thanks!



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:
> On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
> >
> > In addition to comments from Michael-san, here are my comments:

Thanks both for the reviews.  I'm attaching a v3 with all comments addressed,
except:

> It seems to me that this test would be a good fit for
> src/test/modules/test_misc/.


AFAICT this directory is explicitly documented as holding tests for various
extensions, and for now this is a core function, so I didn't move it.


> +Run
> +    make check
> +or
> +    make installcheck
> Why is installcheck mentioned here?


This is actually already used in multiple other test READMEs.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Mar 16, 2020 at 02:15:27PM +0100, Julien Rouhaud wrote:
> On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:
> > >
> > > In addition to comments from Michael-san, here are my comments:
>
> Thanks both for the reviews.  I'm attaching a v3 with all comments addressed,
> except:
>
> > It seems to me that this test would be a good fit for
> > src/test/modules/test_misc/.
>
>
> AFAICT this is explicitly documented as tests for various extensions, and for
> now it's a core function, so I didn't move it.
>
>
> > +Run
> > +    make check
> > +or
> > +    make installcheck
> > Why is installcheck mentioned here?
>
>
> This is actually already used in multiple other test readme.


Sorry I forgot to update the regression tests.  v4 attached.

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
>> With a large amount of
>> shared buffer eviction you actually increase the risk of torn page
>> reads.  Instead of a logic relying on partition mapping locks, which
>> could be unwise on performance grounds, did you consider different
>> approaches?  For example a kind of pre-emptive lock on the page in
>> storage to prevent any shared buffer operation to happen while the
>> block is read from storage, that would act like a barrier.
>
> Even with a workload having a large shared_buffers eviction pattern, I don't
> think that there's a high probability of hitting a torn page.  Unless I'm
> mistaken, it can only happen if all of these steps happen concurrently with
> the block read done just after releasing the LWLock:
>
> - postgres reads the same block into shared_buffers (including all the locking)
> - dirties it
> - writes part of the page
>
> It's certainly possible, but it seems so unlikely that the optimistic lock-less
> approach looks like a very good tradeoff.

Having false reports in this area could be very confusing for the
user.  That's for example possible now with checksum verification and
base backups.

>> I guess that this leads to the fact that this function may be better as
>> a contrib module, with the addition of some better-suited APIs in core
>> (see paragraph above).
>
> Below?

Above.  This thought more precisely:
>> For example a kind of pre-emptive lock on the page in
>> storage to prevent any shared buffer operation to happen while the
>> block is read from storage, that would act like a barrier.

> For the record when I first tested that feature I did try to check dirty
> blocks, and it seemed that dirty blocks of shared relation were sometimes
> wrongly reported as corrupted.  I didn't try to investigate more though.

Hmm.  It would be good to look at that, correct verification of shared
relations matters.

>> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
>> + *   disk either before the end of the next checkpoint or during recovery in
>> + *   case of unsafe shutdown
>> Not sure that the indentation is going to react well on that part of
>> the patch, perhaps it would be better to add some "/*-------" at the
>> beginning and end of the comment block to tell pgindent to ignore this
>> part?
>
> Ok.  Although I think only the beginning comment is needed?

From src/tools/pgindent/README:
"pgindent will reflow any comment block that's not at the left margin.
If this messes up manual formatting that ought to be preserved,
protect the comment block with some dashes:"

        /*----------
         * Text here will not be touched by pgindent.
         *----------
         */

>> Based on the feedback gathered on this thread, I guess that you should
>> have a SRF returning the list of broken blocks, as well as NOTICE
>> messages.
>
> The current patch has an SRF and a WARNING message, do you want an additional
> NOTICE message or downgrade the existing one?

Right, not sure which one is better, for zero_damaged_pages a WARNING
is used.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> >> With a large amount of
> >> shared buffer eviction you actually increase the risk of torn page
> >> reads.  Instead of a logic relying on partition mapping locks, which
> >> could be unwise on performance grounds, did you consider different
> >> approaches?  For example a kind of pre-emptive lock on the page in
> >> storage to prevent any shared buffer operation to happen while the
> >> block is read from storage, that would act like a barrier.
> >
> > Even with a workload having a large shared_buffers eviction pattern, I don't
> > think that there's a high probability of hitting a torn page.  Unless I'm
> > mistaken, it can only happen if all of these steps happen concurrently with
> > the block read done just after releasing the LWLock:
> >
> > - postgres reads the same block into shared_buffers (including all the locking)
> > - dirties it
> > - writes part of the page
> >
> > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > approach looks like a very good tradeoff.
>
> Having false reports in this area could be very confusing for the
> user.  That's for example possible now with checksum verification and
> base backups.


I agree, however this shouldn't be the case here, as the block will be
rechecked a second time while holding the proper lock, in case of a possible
false positive, before being reported as corrupted.  So the only downside is
checking twice any corrupted block that's not found in shared buffers (or is
concurrently loaded/modified/half flushed).  As the number of corrupted or
concurrently loaded/modified/half-flushed blocks should usually be close to
zero, it seems worthwhile to have a lockless check first for performance
reasons.


> > For the record when I first tested that feature I did try to check dirty
> > blocks, and it seemed that dirty blocks of shared relation were sometimes
> > wrongly reported as corrupted.  I didn't try to investigate more though.
>
> Hmm.  It would be good to look at that, correct verification of shared
> relations matters.


I'll try to investigate, but non-dirty blocks of shared relations can be
checked, and that check works as intended.


>
> >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
> >> + *   disk either before the end of the next checkpoint or during recovery in
> >> + *   case of unsafe shutdown
> >> Not sure that the indentation is going to react well on that part of
> >> the patch, perhaps it would be better to add some "/*-------" at the
> >> beginning and end of the comment block to tell pgindent to ignore this
> >> part?
> >
> > Ok.  Although I think only the beginning comment is needed?
>
> From src/tools/pgindent/README:
> "pgindent will reflow any comment block that's not at the left margin.
> If this messes up manual formatting that ought to be preserved,
> protect the comment block with some dashes:"
>
>         /*----------
>          * Text here will not be touched by pgindent.
>          *----------
>          */


For instance the block comment in gen_partprune_steps_internal() disagrees.
Anyway, I added both, as all the nearby code does that for overall function
comments.



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> >> Based on the feedback gathered on this thread, I guess that you should
> >> have a SRF returning the list of broken blocks, as well as NOTICE
> >> messages.
> >
> > The current patch has an SRF and a WARNING message, do you want an additional
> > NOTICE message or downgrade the existing one?
>
> Right, not sure which one is better, for zero_damaged_pages a WARNING
> is used.


Sorry, I forgot to answer that.  IMHO a WARNING is better here, as we're
talking about data corruption.  Also, a WARNING will be reported to both the
client and the server logs, which sounds like a good thing.



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:
> On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > >> With a large amount of
> > >> shared buffer eviction you actually increase the risk of torn page
> > >> reads.  Instead of a logic relying on partition mapping locks, which
> > >> could be unwise on performance grounds, did you consider different
> > >> approaches?  For example a kind of pre-emptive lock on the page in
> > >> storage to prevent any shared buffer operation to happen while the
> > >> block is read from storage, that would act like a barrier.
> > >
> > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > think that there's a high probability of hitting a torn page.  Unless I'm
> > > mistaken, it can only happen if all of these steps happen concurrently with
> > > the block read done just after releasing the LWLock:
> > >
> > > - postgres reads the same block into shared_buffers (including all the locking)
> > > - dirties it
> > > - writes part of the page
> > >
> > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > approach looks like a very good tradeoff.
> >
> > Having false reports in this area could be very confusing for the
> > user.  That's for example possible now with checksum verification and
> > base backups.
>
>
> I agree, however this shouldn't be the case here, as the block will be
> rechecked a second time while holding the proper lock, in case of a possible
> false positive, before being reported as corrupted.  So the only downside is
> checking twice any corrupted block that's not found in shared buffers (or is
> concurrently loaded/modified/half flushed).  As the number of corrupted or
> concurrently loaded/modified/half-flushed blocks should usually be close to
> zero, it seems worthwhile to have a lockless check first for performance
> reasons.


I just noticed some dumb mistakes while adding the new GUCs.  v5 attached to
fix that, no other changes.

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:
> > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > > >> With a large amount of
> > > >> shared buffer eviction you actually increase the risk of torn page
> > > >> reads.  Instead of a logic relying on partition mapping locks, which
> > > >> could be unwise on performance grounds, did you consider different
> > > >> approaches?  For example a kind of pre-emptive lock on the page in
> > > >> storage to prevent any shared buffer operation to happen while the
> > > >> block is read from storage, that would act like a barrier.
> > > >
> > > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > > think that there's a high probability of hitting a torn page.  Unless I'm
> > > > mistaken, it can only happen if all of these steps happen concurrently with
> > > > the block read done just after releasing the LWLock:
> > > >
> > > > - postgres reads the same block into shared_buffers (including all the locking)
> > > > - dirties it
> > > > - writes part of the page
> > > >
> > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > > approach looks like a very good tradeoff.
> > >
> > > Having false reports in this area could be very confusing for the
> > > user.  That's for example possible now with checksum verification and
> > > base backups.
> >
> >
> > I agree, however this shouldn't be the case here, as the block will be
> > rechecked a second time while holding the proper lock, in case of a possible
> > false positive, before being reported as corrupted.  So the only downside is
> > checking twice any corrupted block that's not found in shared buffers (or is
> > concurrently loaded/modified/half flushed).  As the number of corrupted or
> > concurrently loaded/modified/half-flushed blocks should usually be close to
> > zero, it seems worthwhile to have a lockless check first for performance
> > reasons.
>
>
> I just noticed some dumb mistakes while adding the new GUCs.  v5 attached to
> fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+       </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

2.
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or the specified fork of a given relation and verify their checksum.  It
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  You must be a member of the
+    <literal>pg_read_all_stats</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>

* I think we need a description of the possible values for 'fork'
(e.g., 'main', 'vm', 'fsm' and 'init'), and of the behavior when 'fork'
is omitted.

* Do we need to explain the checksum cost-based delay here?

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations(), which iterated over all relations in the database to
do checksum checks.  But now that we don't have that function and
pg_check_relation() processes a single relation, can we just do
relation_open() here?

5.
I think we need to check if the relation is a temp relation.  I'm not
sure it's worth checking checksums of temp relations, but at least we
should not check other backends' temp relations.

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issue.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion..
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but it then
+ * alloc'ed in shared_buffers, modified and flushed concurrently when we
+ * later try to read it, leading to false positive due to torn page.  Caller
+ * can read first the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+                bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating we found
a buffer to check, instead of having '*checkit'. That way, we can
simplify the following code:

+       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                        &checkit, &found_in_sb);
+
+       if (!checkit)
+           continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                          &found_in_sb))
+     continue;

8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't by any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affect PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic doesn't seem to work correctly.

* I don't understand why we cannot compute a checksum in the case
where a page looks like a new page but is actually corrupted. Could
you please elaborate on that?

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description of the four newly added GUC parameters in the docs.

* We need to put new GUC parameters into postgresql.conf.sample as well.

* The patch doesn't use checksum_cost_page_hit at all.

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
 int            max_parallel_workers = 8;
 int            MaxBackends = 0;

+int            ChecksumCostPageHit = 1;    /* GUC parameters for checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c, since these parameters are used only
in checksum.c and it does the I/O by itself?

10.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could end up calling pgstat_report_checksum_failure() and
emitting the WARNING twice for the same page, since PageIsVerified()
also does the same.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:
> On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > v5 attached
> 
> Thank you for updating the patch. I have some comments:

Thanks a lot for the review!

> 1.
> +       <entry>
> +        <literal><function>pg_check_relation(<parameter>relation</parameter>
> <type>oid</type>, <parameter>fork</parameter>
> <type>text</type>)</function></literal>
> +       </entry>
> 
> Looking at the declaration of pg_check_relation, 'relation' and 'fork'
> are optional arguments. So I think the above is not correct. But as I
> commented below, 'relation' should not be optional, so maybe the above
> line could be:
> 
> +        <literal><function>pg_check_relation(<parameter>relation</parameter>
> <type>oid</type>[, <parameter>fork</parameter>
> <type>text</type>])</function></literal>

Yes I missed that when making relation mandatory.  Fixed.

> 2.
> +   <indexterm>
> +    <primary>pg_check_relation</primary>
> +   </indexterm>
> +   <para>
> +    <function>pg_check_relation</function> iterates over all the blocks of all
> +    or the specified fork of a given relation and verify their checksum.  It
> +    returns the list of blocks for which the found checksum doesn't match the
> +    expected one.  You must be a member of the
> +    <literal>pg_read_all_stats</literal> role to use this function.  It can
> +    only be used if data checksums are enabled.  See <xref
> +    linkend="app-initdb-data-checksums"/> for more information.
> +   </para>
> 
> * I think we need a description about possible values for 'fork'
> (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork'
> is omitted.

Done.

> * Do we need to explain about checksum cost-based delay here?

It's probably better in config.sgml, near the vacuum cost-based delay
documentation, so done that way with a link referencing that part.

> 3.
> +CREATE OR REPLACE FUNCTION pg_check_relation(
> +  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
> +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> +  OUT expected_checksum integer, OUT found_checksum integer)
> +  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
> +  PARALLEL RESTRICTED;
> 
> Now that pg_check_relation doesn't accept NULL as 'relation', I think
> we need to make 'relation' a mandatory argument.

Correct, fixed.

> 4.
> +   /* Check if the relation (still) exists */
> +   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
> +   {
> +       /*
> +        * We use a standard relation_open() to acquire the initial lock.  It
> +        * means that this will block until the lock is acquired, or will
> +        * raise an ERROR if lock_timeout has been set.  If caller wants to
> +        * check multiple tables while relying on a maximum wait time, it
> +        * should process tables one by one instead of relying on a global
> +        * processing with the main SRF.
> +        */
> +       relation = relation_open(relid, AccessShareLock);
> +   }
> 
> IIUC the above was necessary because we used to have
> check_all_relations() which iterates all relations on the database to
> do checksum checks. But now that we don't have that function and
> pg_check_relation processes one relation. Can we just do
> relation_open() here?

Ah yes, I missed that comment.  I think only the comment needed to be updated
to remove the parts related to the NULL-relation call.  I ended up removing
the whole comment, since the locking and lock_timeout behavior is inherent to
relation_open() and there's no need to document it any further now that we
always check only one relation at a time.

> 5.
> I think we need to check if the relation is a temp relation. I'm not
> sure it's worth to check checksums of temp relations but at least we
> need not to check other's temp relations.

Good point.  I think it's still worthwhile to check the backend's own temp
relations, although typical usage should be a bgworker/cron job doing the
check, so there shouldn't be any.
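To make the intended guard concrete, here is a minimal standalone sketch; the struct and helper are mocks for illustration only, and a real backend implementation would use the RELATION_IS_OTHER_TEMP() macro from utils/rel.h instead:

```c
#include <assert.h>
#include <stdbool.h>

/* Mocked stand-in for the pg_class persistence byte. */
#define RELPERSISTENCE_TEMP 't'

typedef struct MockRelation
{
    char relpersistence;    /* 'p', 'u' or 't', as in pg_class */
    bool is_our_temp;       /* does the temp relation belong to this backend? */
} MockRelation;

/*
 * Return true if this backend may verify the relation's checksums.
 * Another backend's temp relation must be skipped: its pages may still
 * sit in that backend's local buffers, so an on-disk check could report
 * a false positive.
 */
bool
checkable_relation(const MockRelation *rel)
{
    if (rel->relpersistence == RELPERSISTENCE_TEMP && !rel->is_our_temp)
        return false;
    return true;
}
```

In the real function, rejecting such relations with an ereport(ERROR, ...) before reading any block would match how other maintenance functions behave.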

> 6.
> +/*
> + * Safely read the wanted buffer from disk, dealing with possible concurrency
> + * issue.  Note that if a buffer is found dirty in shared_buffers, no read will
> + * be performed and the caller will be informed that no check should be done.
> + * We can safely ignore such buffers as they'll be written before next
> + * checkpoint's completion..
> [...] 
> + */
> 
> I think the above comment also needs some "/*-------" at the beginning and end.

Fixed.

> 7.
> +static void
> +check_get_buffer(Relation relation, ForkNumber forknum,
> +                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
> +                bool *found_in_sb)
> +{
> 
> Maybe we can make check_get_buffer() return a bool indicating we found
> a buffer to check, instead of having '*checkit'. That way, we can
> simplify the following code:
> 
> +       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
> +                        &checkit, &found_in_sb);
> +
> +       if (!checkit)
> +           continue;
> 
> to something like:
> 
> + if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
> +                          &found_in_sb))
> +     continue;

Changed.

> 8.
> +       if (PageIsVerified(buffer, blkno))
> +       {
> +           /*
> +            * If the page is really new, there won't by any checksum to be
> +            * computed or expected.
> +            */
> +           *chk_expected = *chk_found = NoComputedChecksum;
> +           return true;
> +       }
> +       else
> +       {
> +           /*
> +            * There's a corruption, but since this affect PageIsNew, we
> +            * can't compute a checksum, so set NoComputedChecksum for the
> +            * expected checksum.
> +            */
> +           *chk_expected = NoComputedChecksum;
> +           *chk_found = hdr->pd_checksum;
> +       }
> +       return false;
> 
> * I think the 'else' is not necessary here.

AFAICT it is; see below.

> * Setting *chk_expected and *chk_found seems useless when we return
> true. The caller doesn't use them.

Indeed, fixed.

> * Should we forcibly overwrite ignore_checksum_failure to off in
> pg_check_relation()? Otherwise, this logic seems not work fine.
> 
> * I don't understand why we cannot compute a checksum in case where a
> page looks like a new page but is actually corrupted. Could you please
> elaborate on that?

PageIsVerified has a different behavior depending on whether the page looks
new or not.  If the page looks new, it only checks that it's indeed a new
page; otherwise it tries to verify the checksum.

Also, pg_check_page() has an assert to make sure that the page isn't (and
doesn't look) new.

So it seems to me that the 'else' is required to properly distinguish a real
from a fake PageIsNew, and to try to compute checksums only when required.
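For readers following along, the branch being debated can be sketched standalone; the page struct and verification results are mocked, and only the control flow mirrors the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NoComputedChecksum 0

typedef struct MockPage
{
    uint16_t pd_checksum;
    bool     looks_new;       /* stand-in for PageIsNew() */
    bool     truly_all_zero;  /* what PageIsVerified() checks on a new-looking page */
    bool     checksum_ok;     /* result of a real checksum comparison otherwise */
} MockPage;

/*
 * Returns true when the page is fine, false on corruption.  A page that
 * merely *looks* new carries no computable checksum: PageIsVerified()
 * only confirms it is a true all-zero page, so on failure the expected
 * checksum is reported as NoComputedChecksum.
 */
bool
check_buffer_sketch(const MockPage *page,
                    uint16_t *chk_expected, uint16_t *chk_found)
{
    if (page->looks_new)
    {
        if (page->truly_all_zero)
            return true;                      /* genuinely new page */

        *chk_expected = NoComputedChecksum;   /* corruption, nothing to expect */
        *chk_found = page->pd_checksum;
        return false;
    }

    if (page->checksum_ok)
        return true;

    *chk_expected = 1234;                     /* placeholder for the recomputed value */
    *chk_found = page->pd_checksum;
    return false;
}
```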

> 8.
> +   {
> +       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
> +           gettext_noop("Checksum cost for a page found in the buffer cache."),
> +           NULL
> +       },
> +       &ChecksumCostPageHit,
> +       1, 0, 10000,
> +       NULL, NULL, NULL
> +   },
> 
> * There is no description about the newly added four GUC parameters in the doc.
> 
> * We need to put new GUC parameters into postgresql.conf.sample as well.

Fixed both.

> * The patch doesn't use checksum_cost_page_hit at all.

Indeed, I also realized that while working on previous issues.  I removed it
and renamed checksum_cost_page_miss to checksum_cost_page.
>
> 9.
> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> index eb19644419..37f63e747c 100644
> --- a/src/backend/utils/init/globals.c
> +++ b/src/backend/utils/init/globals.c
> @@ -134,6 +134,14 @@ int            max_worker_processes = 8;
>  int            max_parallel_workers = 8;
>  int            MaxBackends = 0;
> 
> +int            ChecksumCostPageHit = 1;    /* GUC parameters for
> checksum check */
> +int            ChecksumCostPageMiss = 10;
> +int            ChecksumCostLimit = 200;
> +double     ChecksumCostDelay = 0;
> +
> +int            ChecksumCostBalance = 0;    /* working state for
> checksums check */
> +bool       ChecksumCostActive = false;
> 
> Can we declare them in checksum.c since these parameters are used only
> in checksum.c and it does I/O my itself.

The GUC parameters would still need to be global, so for consistency I kept all
the variables in globals.c.
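As a side note for readers, the vacuum-style throttling these variables drive can be sketched standalone; the names follow the patch, while the actual pg_usleep() call is replaced by a counter so the accounting is visible (this is a simplified assumption of the mechanism, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>

/* GUC parameters (defaults as in the patch). */
int    ChecksumCostPage  = 10;
int    ChecksumCostLimit = 200;
double ChecksumCostDelay = 1.0;   /* ms; 0 disables throttling */

/* Working state. */
int  ChecksumCostBalance = 0;
bool ChecksumCostActive  = true;

int  sleeps_requested = 0;        /* test hook standing in for pg_usleep() */

/*
 * Called once per page read: accumulate the page cost and, when the
 * balance reaches the limit, sleep and reset, mirroring what
 * vacuum_delay_point() does for vacuum costs.
 */
void
checksum_delay_point(void)
{
    if (!ChecksumCostActive || ChecksumCostDelay <= 0)
        return;

    ChecksumCostBalance += ChecksumCostPage;
    if (ChecksumCostBalance >= ChecksumCostLimit)
    {
        sleeps_requested++;       /* pg_usleep(ChecksumCostDelay * 1000) */
        ChecksumCostBalance = 0;
    }
}
```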
>
> 10.
> +       /* Report the failure to the stat collector and the logs. */
> +       pgstat_report_checksum_failure();
> +       ereport(WARNING,
> +               (errcode(ERRCODE_DATA_CORRUPTED),
> +                errmsg("invalid page in block %u of relation %s",
> +                       blkno,
> +                       relpath(relation->rd_smgr->smgr_rnode, forknum))));
> 
> I think we could do pgstat_report_checksum_failure() and emit WARNING
> twice for the same page since PageIsVerified() also does the same.

As mentioned before, this patch only calls PageIsVerified() if the buffer
looks new, and in this case PageIsVerified() only verifies that it's a true
all-zero page and won't try to verify the checksum, so there's no possibility
of a duplicated report.  I modified the comments to document all the
interactions and expectations.

v6 attached.


Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:
> > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > v5 attached
> >
> > Thank you for updating the patch. I have some comments:
>
> Thanks a lot for the review!

Thank you for updating the patch!

> > 4.
> > +   /* Check if the relation (still) exists */
> > +   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
> > +   {
> > +       /*
> > +        * We use a standard relation_open() to acquire the initial lock.  It
> > +        * means that this will block until the lock is acquired, or will
> > +        * raise an ERROR if lock_timeout has been set.  If caller wants to
> > +        * check multiple tables while relying on a maximum wait time, it
> > +        * should process tables one by one instead of relying on a global
> > +        * processing with the main SRF.
> > +        */
> > +       relation = relation_open(relid, AccessShareLock);
> > +   }
> >
> > IIUC the above was necessary because we used to have
> > check_all_relations() which iterates all relations on the database to
> > do checksum checks. But now that we don't have that function and
> > pg_check_relation processes one relation. Can we just do
> > relation_open() here?
>
> Ah yes, I missed that comment.  I think only the comment needed to be updated
> to remove the parts related to the NULL-relation call.  I ended up removing
> the whole comment, since the locking and lock_timeout behavior is inherent to
> relation_open() and there's no need to document it any further now that we
> always check only one relation at a time.

The current patch still checks SearchSysCacheExists1() before
relation_open. Why do we need to call SearchSysCacheExists1() here? I
think if the given relation doesn't exist, relation_open() will raise
an error "could not open relation with OID %u".

+   /* Open the relation if it exists.  */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       relation = relation_open(relid, AccessShareLock);
+   }


> > 8.
> > +       if (PageIsVerified(buffer, blkno))
> > +       {
> > +           /*
> > +            * If the page is really new, there won't by any checksum to be
> > +            * computed or expected.
> > +            */
> > +           *chk_expected = *chk_found = NoComputedChecksum;
> > +           return true;
> > +       }
> > +       else
> > +       {
> > +           /*
> > +            * There's a corruption, but since this affect PageIsNew, we
> > +            * can't compute a checksum, so set NoComputedChecksum for the
> > +            * expected checksum.
> > +            */
> > +           *chk_expected = NoComputedChecksum;
> > +           *chk_found = hdr->pd_checksum;
> > +       }
> > +       return false;
> >
> > * I think the 'else' is not necessary here.
>
> AFAICT it's, see below.
>
> > * Setting *chk_expected and *chk_found seems useless when we return
> > true. The caller doesn't use them.
>
> Indeed, fixed.

The patch still sets values to both?

+       if (PageIsVerified(buffer, blkno))
+       {
+           /* No corruption. */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }

>
> > * Should we forcibly overwrite ignore_checksum_failure to off in
> > pg_check_relation()? Otherwise, this logic seems not work fine.
> >
> > * I don't understand why we cannot compute a checksum in case where a
> > page looks like a new page but is actually corrupted. Could you please
> > elaborate on that?
>
> PageIsVerified has a different behavior depending on whether the page looks new
> or not.  If the page looks like new, it only checks that it's indeed a new
> page, and otherwise try to verify the checksum.
>
> Also, pg_check_page() has an assert to make sure that the page isn't (or don't
> look like) new.
>
> So it seems to me that the 'else' is required to properly detect a real or fake
> PageIsNew, and try to compute checksums only when required.

Thank you for your explanation! I understand.

I thought we can arrange the code to something like:

if (PageIsNew(hdr))
{
    if (PageIsVerified(buffer, blkno))
    {
        *chk_expected = *chk_found = NoComputedChecksum;
        return true;
    }

    *chk_expected = NoComputedChecksum;
    *chk_found = hdr->pd_checksum;
    return false;
}

But since it's not a critical problem you can ignore it.

>
> > 8.
> > +   {
> > +       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
> > +           gettext_noop("Checksum cost for a page found in the buffer cache."),
> > +           NULL
> > +       },
> > +       &ChecksumCostPageHit,
> > +       1, 0, 10000,
> > +       NULL, NULL, NULL
> > +   },
> >
> > * There is no description about the newly added four GUC parameters in the doc.
> >
> > * We need to put new GUC parameters into postgresql.conf.sample as well.
>
> Fixed both.
>
> > * The patch doesn't use checksum_cost_page_hit at all.
>
> Indeed, I also realized that while working on previous issues.  I removed it
> and renamed checksum_cost_page_miss to checksum_cost_page.

Perhaps we can use checksum_cost_page_hit when we find the page in
shared buffers but it's marked dirty?

> >
> > 9.
> > diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> > index eb19644419..37f63e747c 100644
> > --- a/src/backend/utils/init/globals.c
> > +++ b/src/backend/utils/init/globals.c
> > @@ -134,6 +134,14 @@ int            max_worker_processes = 8;
> >  int            max_parallel_workers = 8;
> >  int            MaxBackends = 0;
> >
> > +int            ChecksumCostPageHit = 1;    /* GUC parameters for
> > checksum check */
> > +int            ChecksumCostPageMiss = 10;
> > +int            ChecksumCostLimit = 200;
> > +double     ChecksumCostDelay = 0;
> > +
> > +int            ChecksumCostBalance = 0;    /* working state for
> > checksums check */
> > +bool       ChecksumCostActive = false;
> >
> > Can we declare them in checksum.c since these parameters are used only
> > in checksum.c and it does I/O my itself.
>
> The GUC parameters would still need to be global, so for consistency I kept all
> the variables in globals.c.

Okay.

> >
> > 10.
> > +       /* Report the failure to the stat collector and the logs. */
> > +       pgstat_report_checksum_failure();
> > +       ereport(WARNING,
> > +               (errcode(ERRCODE_DATA_CORRUPTED),
> > +                errmsg("invalid page in block %u of relation %s",
> > +                       blkno,
> > +                       relpath(relation->rd_smgr->smgr_rnode, forknum))));
> >
> > I think we could do pgstat_report_checksum_failure() and emit WARNING
> > twice for the same page since PageIsVerified() also does the same.
>
> As mentioned before, in this patch I only calls PageIsVerified() if the buffer
> looks like new, and in this case PageIsVerified() only verify that it's a true
> all-zero-page, and won't try to verify the checksum, so there's no possibility
> of duplicated report.  I modified the comments to document all the interactions
> and expectations.

You're right. Thank you for the explanation!

I've read the latest patch and here is random comments:

1.
+       /*
+        * Add a page miss cost, as we're always reading outside the shared
+        * buffers.
+        */
+       /* Add a page cost. */
+       ChecksumCostBalance += ChecksumCostPage;

There are duplicate comments.

2.
+       /* Dirty pages are ignored as they'll be flushed soon. */
+       if (buf_state & BM_DIRTY)
+           checkit = false;

Should we also check whether the buffer has BM_TAG_VALID here? I
thought there might be a possibility that BufTableLookup() returns a
buf_id whose buffer tag is not valid, for example when a previous read
failed after inserting the buffer tag into the buffer table.
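The state test being suggested can be illustrated with plain bit flags; the flag values below are mocks, the real constants live in buf_internals.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock flag bits standing in for the bufmgr state bits. */
#define BM_DIRTY      (1u << 0)
#define BM_TAG_VALID  (1u << 1)

/*
 * Decide whether a buffer found via BufTableLookup() should have its
 * on-disk page checksum-verified: skip dirty buffers (they'll be
 * rewritten before the next checkpoint completes) and entries whose
 * tag isn't valid yet (e.g. a previous read failed midway).
 */
bool
buffer_needs_check(uint32_t buf_state)
{
    if (!(buf_state & BM_TAG_VALID))
        return false;
    if (buf_state & BM_DIRTY)
        return false;
    return true;
}
```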

3.
+   /* Add a page cost. */
+   ChecksumCostBalance += ChecksumCostPage;
+
+   return checkit;
+}

check_get_buffer() seems slightly complex to me, but when we reach the
end of the function we always return true. Similarly, in the case
where we read the block while holding a partition lock, we always
return true as well. Is my understanding right? If so, it might be
better to put in some assertions.

4.
@@ -10825,6 +10825,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

Why is pg_check_relation() not a strict function? I think proisstrict
can be 't' for this function, and then we can drop the check for
whether the first argument is NULL.

5.
+       memset(values, 0, sizeof(values));
+       memset(nulls, 0, sizeof(nulls));

I think we can do the memset right before setting values into them,
that is, after checking (!found_in_sb && !force_lock).

6.
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+                  uint16 *chk_found)
+{
+   PageHeader  hdr = (PageHeader) buffer;
+
+   Assert(chk_expected && chk_found);
+
+   if (PageIsNew(hdr))
+   {
+       /*
+        * Check if the page is really new or if there's a corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(buffer, blkno))

How about using Page instead of PageHeader? Looking at other code,
ISTM we usually pass Page to both PageIsNew() and PageIsVerified().

7.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>.
+       </entry>

+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames =>
'{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

The function argument data types don't match in the doc and function
declaretion. relation is 'oid' in the doc but is 'regclass' in the
function declaretion.

8.
+#define SRF_COLS           5   /* Number of output arguments in the SRF */

Looking at similar built-in functions that return a set of records,
they use a more specific name for the number of returned columns, such
as PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about
PG_CHECK_RELATION_COLS?

check_relation_fork() seems to depend quite heavily on
pg_check_relation(), because the returned tuplestore is specified by
pg_check_relation(). It's just an idea, but to improve reusability,
how about moving check_relation_fork() to checksumfuncs.c? That is, in
checksumfuncs.c, while iterating over all blocks we call a new
function in checksum.c, say check_one_block(), which has the following
part and is responsible for getting and checking the specified block,
returning a boolean indicating whether the block has corruption, along
with chk_found and chk_expected:

        /*
         * To avoid too much overhead, the buffer will be first read without
         * the locks that would guarantee the lack of false positive, as such
         * events should be quite rare.
         */
Retry:
        if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
                              &found_in_sb))
            continue;

        if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
            continue;

        /*
         * If we get a failure and the buffer wasn't found in shared buffers,
         * reread the buffer with suitable lock to avoid false positive.  See
         * check_get_buffer for more details.
         */
        if (!found_in_sb && !force_lock)
        {
            force_lock = true;
            goto Retry;
        }

A new function in checksumfuncs.c, or pg_check_relation() itself,
would be responsible for storing the result in the tuplestore. That
way, check_one_block() would be useful whenever we want to check
whether a particular block is corrupted, with low overhead.
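A possible shape for such a helper, with hypothetical names and mocked I/O (block 7 plays the corrupted page, and the two mock_* functions stand in for check_get_buffer/check_buffer), could be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NoComputedChecksum 0

/* Mocked page source: pretend block 7 is corrupted on disk. */
bool
mock_check_get_buffer(uint32_t blkno, bool force_lock, bool *found_in_sb)
{
    (void) blkno;
    (void) force_lock;
    *found_in_sb = false;       /* read from disk, not from shared buffers */
    return true;                /* the buffer is checkable */
}

bool
mock_check_buffer(uint32_t blkno, uint16_t *expected, uint16_t *found)
{
    if (blkno == 7)
    {
        *expected = 0xBEEF;
        *found = 0xDEAD;
        return false;           /* checksum mismatch */
    }
    return true;
}

/*
 * Hypothetical check_one_block(): fetch and verify one block, retrying
 * once under the buffer-mapping lock when a failure could be a false
 * positive caused by a concurrent flush.  Returns true when the block
 * is fine (or not checkable), false on confirmed corruption.
 */
bool
check_one_block(uint32_t blkno, uint16_t *chk_expected, uint16_t *chk_found)
{
    bool force_lock = false;
    bool found_in_sb;

retry:
    if (!mock_check_get_buffer(blkno, force_lock, &found_in_sb))
        return true;            /* e.g. dirty in shared buffers: skip */

    if (mock_check_buffer(blkno, chk_expected, chk_found))
        return true;

    if (!found_in_sb && !force_lock)
    {
        force_lock = true;      /* reread under lock to rule out a torn page */
        goto retry;
    }
    return false;
}
```

The SRF wrapper in checksumfuncs.c would then loop over the blocks and push only the failing ones into the tuplestore.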

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:
> On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> The current patch still checks SearchSysCacheExists1() before
> relation_open. Why do we need to call SearchSysCacheExists1() here? I
> think if the given relation doesn't exist, relation_open() will raise
> an error "could not open relation with OID %u".
> 
> +   /* Open the relation if it exists.  */
> +   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
> +   {
> +       relation = relation_open(relid, AccessShareLock);
> +   }

Oops yes sorry about that.  Fixed.


> > > 8.
> > > +       if (PageIsVerified(buffer, blkno))
> > > +       {
> > > +           /*
> > > +            * If the page is really new, there won't by any checksum to be
> > > +            * computed or expected.
> > > +            */
> > > +           *chk_expected = *chk_found = NoComputedChecksum;
> > > +           return true;
> > > +       }
> > > +       else
> > > +       {
> > > +           /*
> > > +            * There's a corruption, but since this affect PageIsNew, we
> > > +            * can't compute a checksum, so set NoComputedChecksum for the
> > > +            * expected checksum.
> > > +            */
> > > +           *chk_expected = NoComputedChecksum;
> > > +           *chk_found = hdr->pd_checksum;
> > > +       }
> > > +       return false;
> > >
> > > * I think the 'else' is not necessary here.
> >
> > AFAICT it's, see below.
> >
> > > * Setting *chk_expected and *chk_found seems useless when we return
> > > true. The caller doesn't use them.
> >
> > Indeed, fixed.
> 
> The patch still sets values to both?
> 
> +       if (PageIsVerified(buffer, blkno))
> +       {
> +           /* No corruption. */
> +           *chk_expected = *chk_found = NoComputedChecksum;
> +           return true;
> +       }


Sorry again, fixed.


> > > * Should we forcibly overwrite ignore_checksum_failure to off in
> > > pg_check_relation()? Otherwise, this logic seems not work fine.
> > >
> > > * I don't understand why we cannot compute a checksum in case where a
> > > page looks like a new page but is actually corrupted. Could you please
> > > elaborate on that?
> >
> > PageIsVerified has a different behavior depending on whether the page looks new
> > or not.  If the page looks like new, it only checks that it's indeed a new
> > page, and otherwise try to verify the checksum.
> >
> > Also, pg_check_page() has an assert to make sure that the page isn't (or don't
> > look like) new.
> >
> > So it seems to me that the 'else' is required to properly detect a real or fake
> > PageIsNew, and try to compute checksums only when required.
> 
> Thank you for your explanation! I understand.
> 
> I thought we can arrange the code to something like:
> 
> if (PageIsNew(hdr))
> {
>     if (PageIsVerified(hdr))
>     {
>         *chk_expected = *chk_found = NoComputedChecksum;
>         return true;
>     }
> 
>     *chk_expected = NoComputedChecksum;
>     *chk_found = hdr->pd_checksum;
>     return false;
> }
> 
> But since it's not a critical problem you can ignore it.


I like it, so done!


> > > 8.
> > > +   {
> > > +       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
> > > +           gettext_noop("Checksum cost for a page found in the buffer cache."),
> > > +           NULL
> > > +       },
> > > +       &ChecksumCostPageHit,
> > > +       1, 0, 10000,
> > > +       NULL, NULL, NULL
> > > +   },
> > >
> > > * There is no description about the newly added four GUC parameters in the doc.
> > >
> > > * We need to put new GUC parameters into postgresql.conf.sample as well.
> >
> > Fixed both.
> >
> > > * The patch doesn't use checksum_cost_page_hit at all.
> >
> > Indeed, I also realized that while working on previous issues.  I removed it
> > and renamed checksum_cost_page_miss to checksum_cost_page.
> 
> Perhaps we can use checksum_cost_page_hit when we found the page in
> the shared buffer but it's marked as dirty?


The thing is that when the buffer is dirty, we won't do any additional check,
thus not adding any overhead.  What may be needed here is to account for the
locking overhead (in all cases), so that execution can be throttled even when
all (or almost all) the buffers are dirty and in shared buffers.  I don't know
how much of an issue that can be, but if it's something to be fixed, then
page_hit doesn't look like the right answer.


> I've read the latest patch and here is random comments:
> 
> 1.
> +       /*
> +        * Add a page miss cost, as we're always reading outside the shared
> +        * buffers.
> +        */
> +       /* Add a page cost. */
> +       ChecksumCostBalance += ChecksumCostPage;
> 
> There are duplicate comments.

Fixed.


> 2.
> +       /* Dirty pages are ignored as they'll be flushed soon. */
> +       if (buf_state & BM_DIRTY)
> +           checkit = false;
> 
> Should we check the buffer if it has BM_TAG_VALID as well here? I
> thought there might be a possibility that BufTableLookup() returns a
> buf_Id but its buffer tag is not valid for example when the previous
> read failed after inserting the buffer tag to the buffer table.


Good point, fixed.


> 3.
> +   /* Add a page cost. */
> +   ChecksumCostBalance += ChecksumCostPage;
> +
> +   return checkit;
> +}
> 
> The check_get_buffer() seems to be slightly complex to me but when we
> reached the end of this function we always return true. Similarly, in
> the case where we read the block while holding a partition lock we
> always return true as well. Is my understanding right? If so, it might
> be better to put some assertions.


Yes, it's a little bit complex.  I used this approach to avoid having to
release the locks all over the place, but maybe it doesn't really improve
things.  I added the asserts and comments anyway as suggested, thanks.


> 4.
> @@ -10825,6 +10825,14 @@
>    proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
>    proargnames => '{tablespace,name,size,modification}',
>    prosrc => 'pg_ls_tmpdir_1arg' },
> +{ oid => '9147', descr => 'check data integrity for one or all relations',
> +  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
> +  prorows => '20', proretset => 't', proparallel => 'r',
> +  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
> +  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
> +  proargmodes => '{i,i,o,o,o,o,o}',
> +  proargnames =>
> '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
> +  prosrc => 'pg_check_relation' },
> 
> Why is the pg_check_relation() is not a strict function? I think
> prostrict can be 'true' for this function and we can drop checking if
> the first argument is NULL.


That's because the fork is still optional.  While this could be made mandatory
without much trouble, I think we'll eventually want to add a way to check only
a subset of a fork, so it didn't seem worth changing that now.


> 5.
> +       memset(values, 0, sizeof(values));
> +       memset(nulls, 0, sizeof(nulls));
> 
> I think we can do memset right before setting values to them, that is,
> after checking (!found_in_sb && !force_lock).


Indeed, done!


> 6.
> +static bool
> +check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
> +                  uint16 *chk_found)
> +{
> +   PageHeader  hdr = (PageHeader) buffer;
> +
> +   Assert(chk_expected && chk_found);
> +
> +   if (PageIsNew(hdr))
> +   {
> +       /*
> +        * Check if the page is really new or if there's a corruption that
> +        * affected PageIsNew detection.  Note that PageIsVerified won't try to
> +        * detect checksum corruption in this case, so there's no risk of
> +        * duplicated corruption report.
> +        */
> +       if (PageIsVerified(buffer, blkno))
> 
> How about using Page instead of PageHeader? Looking at other codes,
> ISTM we usually pass Page to both PageIsNew() and PageIsVerified().


Agreed, done.


> 7.
> +       <entry>
> +        <literal><function>pg_check_relation(<parameter>relation</parameter>
> <type>oid</type>[, <parameter>fork</parameter>
> <type>text</type>])</function></literal>.
> +       </entry>
> 
> +{ oid => '9147', descr => 'check data integrity for one or all relations',
> +  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
> +  prorows => '20', proretset => 't', proparallel => 'r',
> +  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
> +  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
> +  proargmodes => '{i,i,o,o,o,o,o}',
> +  proargnames =>
> '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
> +  prosrc => 'pg_check_relation' },
> 
> The function argument data types don't match in the doc and function
> declaretion. relation is 'oid' in the doc but is 'regclass' in the
> function declaretion.


Fixed.


> 8.
> +#define SRF_COLS           5   /* Number of output arguments in the SRF */
> 
> Looking at similar built-in functions that return set of records they
> use a more specific name for the number of returned columns such as
> PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about
> PG_CHECK_RELATION_COLS?
> 
> check_relation_fork() seems to quite depends on pg_check_relation()
> because the returned tuplestore is specified by pg_check_relation().
> It's just an idea but to improve reusability, how about moving
> check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
> while iterating all blocks we call a new function in checksum.c, say
> check_one_block() function, which has the following part and is
> responsible for getting, checking the specified block and returning a
> boolean indicating whether the block has corruption or not, along with
> chk_found and chk_expected:
> 
>         /*
>          * To avoid too much overhead, the buffer will be first read without
>          * the locks that would guarantee the lack of false positive, as such
>          * events should be quite rare.
>          */
> Retry:
>         if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
>                               &found_in_sb))
>             continue;
> 
>         if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
>             continue;
> 
>         /*
>          * If we get a failure and the buffer wasn't found in shared buffers,
>          * reread the buffer with suitable lock to avoid false positive.  See
>          * check_get_buffer for more details.
>          */
>         if (!found_in_sb && !force_lock)
>         {
>             force_lock = true;
>             goto Retry;
>         }
> 
> A new function in checksumfuncs.c or pg_check_relation will be
> responsible for storing the result to the tuplestore. That way,
> check_one_block() will be useful for other use when we want to check
> if the particular block has corruption with low overhead.


Yes, I agree that passing the tuplestore isn't an ideal approach and some
refactoring should probably happen.  One thing is that this wouldn't be
"check_one_block()" but "check_one_block_on_disk()" (which could also read from
the OS cache).  I'm not sure how useful it is on its own.  It also raises some
concerns about the throttling.  I didn't change that for now, but I hope
there'll be some other feedback about it.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote:
> On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:
> > 
> > check_relation_fork() seems to quite depends on pg_check_relation()
> > because the returned tuplestore is specified by pg_check_relation().
> > It's just an idea but to improve reusability, how about moving
> > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
> > while iterating all blocks we call a new function in checksum.c, say
> > check_one_block() function, which has the following part and is
> > responsible for getting, checking the specified block and returning a
> > boolean indicating whether the block has corruption or not, along with
> > chk_found and chk_expected:
> > 
> >         /*
> >          * To avoid too much overhead, the buffer will be first read without
> >          * the locks that would guarantee the lack of false positive, as such
> >          * events should be quite rare.
> >          */
> > Retry:
> >         if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
> >                               &found_in_sb))
> >             continue;
> > 
> >         if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
> >             continue;
> > 
> >         /*
> >          * If we get a failure and the buffer wasn't found in shared buffers,
> >          * reread the buffer with suitable lock to avoid false positive.  See
> >          * check_get_buffer for more details.
> >          */
> >         if (!found_in_sb && !force_lock)
> >         {
> >             force_lock = true;
> >             goto Retry;
> >         }
> > 
> > A new function in checksumfuncs.c or pg_check_relation will be
> > responsible for storing the result to the tuplestore. That way,
> > check_one_block() will be useful for other use when we want to check
> > if the particular block has corruption with low overhead.
> 
> 
> Yes, I agree that passing the tuplestore isn't an ideal approach and some
> refactoring should probably happen.  One thing is that this wouldn't be
> "check_one_block()" but "check_one_block_on_disk()" (which could also be from
> the OS cache).  I'm not sure how useful it's in itself.  It also raises some
> concerns about the throttling.  I didn't change that for now, but I hope
> there'll be some other feedback about it.
> 


I had some time this morning, so I did the suggested refactoring as it seems
like a way cleaner interface.  I also kept the suggested check_one_block().

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Sat, 4 Apr 2020 at 18:04, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote:
> > On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:
> > >
> > > check_relation_fork() seems to quite depends on pg_check_relation()
> > > because the returned tuplestore is specified by pg_check_relation().
> > > It's just an idea but to improve reusability, how about moving
> > > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
> > > while iterating all blocks we call a new function in checksum.c, say
> > > check_one_block() function, which has the following part and is
> > > responsible for getting, checking the specified block and returning a
> > > boolean indicating whether the block has corruption or not, along with
> > > chk_found and chk_expected:
> > >
> > >         /*
> > >          * To avoid too much overhead, the buffer will be first read without
> > >          * the locks that would guarantee the lack of false positive, as such
> > >          * events should be quite rare.
> > >          */
> > > Retry:
> > >         if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
> > >                               &found_in_sb))
> > >             continue;
> > >
> > >         if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
> > >             continue;
> > >
> > >         /*
> > >          * If we get a failure and the buffer wasn't found in shared buffers,
> > >          * reread the buffer with suitable lock to avoid false positive.  See
> > >          * check_get_buffer for more details.
> > >          */
> > >         if (!found_in_sb && !force_lock)
> > >         {
> > >             force_lock = true;
> > >             goto Retry;
> > >         }
> > >
> > > A new function in checksumfuncs.c or pg_check_relation will be
> > > responsible for storing the result to the tuplestore. That way,
> > > check_one_block() will be useful for other use when we want to check
> > > if the particular block has corruption with low overhead.
> >
> >
> > Yes, I agree that passing the tuplestore isn't an ideal approach and some
> > refactoring should probably happen.  One thing is that this wouldn't be
> > "check_one_block()" but "check_one_block_on_disk()" (which could also be from
> > the OS cache).  I'm not sure how useful it's in itself.  It also raises some
> > concerns about the throttling.  I didn't change that for now, but I hope
> > there'll be some other feedback about it.
> >
>
>
> I had some time this morning, so I did the suggested refactoring as it seems
> like a way cleaner interface.  I also kept the suggested check_one_block().

Thank you for updating the patch! The patch looks good to me. Here are
some random comments mostly about cosmetic changes.

1.
I think we can have two separate SQL functions:
pg_check_relation(regclass, text) and pg_check_relation(regclass),
instead of setting NULL by default to the second argument.

2.
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrred even if a version exists in

s/smgrred/smgrread/

3.
+       /* The buffer will have to check checked. */
+       Assert(checkit);

Should it be "The buffer will have to be checked"?

4.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));

Looking at the definition of the pg_read_server_files role, this role
seems to be for operations that could read non-database files such as
csv files. Therefore, currently this role is used by file_fdw and COPY
command. I personally think pg_stat_scan_tables would be more
appropriate for this function but I'm not sure.

5.
+   /* Set cost-based vacuum delay */
+   ChecksumCostActive = (ChecksumCostDelay > 0);
+   ChecksumCostBalance = 0;

s/vacuum/checksum verification/

6.
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think it's better to show the relation name instead of the relation path here.

7.
+       ereport(ERROR,
+               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                errmsg("relation \"%s\" does not have storage to be checked",
+                quote_qualified_identifier(
+                    get_namespace_name(get_rel_namespace(relid)),
+                    get_rel_name(relid)))));

Looking at other similar error messages we don't show qualified
relation name but the relation name gotten by
RelationGetRelationName(relation). Can we do that here as well for
consistency?

8.
+   if (!(rsinfo->allowedModes & SFRM_Materialize))
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("materialize mode required, but it is not " \
+                       "allowed in this context")));

I think it's better to have this error message in one line for easy grepping.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:
> 
> Thank you for updating the patch! The patch looks good to me. Here are
> some random comments mostly about cosmetic changes.
> 

Thanks a lot for the review!

> 
> 1.
> I think we can have two separate SQL functions:
> pg_check_relation(regclass, text) and pg_check_relation(regclass),
> instead of setting NULL by default to the second argument.
> 

I'm fine with it, so implemented this way with the required documentation
changes.

> 
> 2.
> + * Check data sanity for a specific block in the given fork of the given
> + * relation, always retrieved locally with smgrred even if a version exists in
> 
> s/smgrred/smgrread/

Fixed.

> 
> 3.
> +       /* The buffer will have to check checked. */
> +       Assert(checkit);
> 
> Should it be "The buffer will have to be checked"?
> 

Oops indeed, fixed.

> 
> 4.
> +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +                errmsg("only superuser or a member of the
> pg_read_server_files role may use this function")));
> 
> Looking at the definition of pg_stat_read_server_files role, this role
> seems to be for operations that could read non-database files such as
> csv files. Therefore, currently this role is used by file_fdw and COPY
> command. I personally think pg_stat_scan_tables would be more
> appropriate for this function but I'm not sure.
> 

That's a very good point, especially since the documentation of this default
role is quite relevant for those functions:

"Execute monitoring functions that may take ACCESS SHARE locks on tables,
potentially for a long time."

So changed!

> 
> 5.
> +   /* Set cost-based vacuum delay */
> +   ChecksumCostActive = (ChecksumCostDelay > 0);
> +   ChecksumCostBalance = 0;
> 
> s/vacuum/checksum verification/
> 

Fixed.

> 
> 6.
> +       ereport(WARNING,
> +               (errcode(ERRCODE_DATA_CORRUPTED),
> +                errmsg("invalid page in block %u of relation %s",
> +                       blkno,
> +                       relpath(relation->rd_smgr->smgr_rnode, forknum))));
> 
> I think it's better to show the relation name instead of the relation path here.
> 

I'm using the same pattern here as what ReadBuffer_common() displays when a
corrupted block is read.  I think it's better to keep the same format for both,
so any existing log analyzer will keep working with these new functions.

> 
> 7.
> +       ereport(ERROR,
> +               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> +                errmsg("relation \"%s\" does not have storage to be checked",
> +                quote_qualified_identifier(
> +                    get_namespace_name(get_rel_namespace(relid)),
> +                    get_rel_name(relid)))));
> 
> Looking at other similar error messages we don't show qualified
> relation name but the relation name gotten by
> RelationGetRelationName(relation). Can we do that here as well for
> consistency?
> 

Indeed, fixed.

> 
> 8.
> +   if (!(rsinfo->allowedModes & SFRM_Materialize))
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("materialize mode required, but it is not " \
> +                       "allowed in this context")));
> 
> I think it's better to have this error message in one line for easy grepping.

Fixed.

I also fixed a missing leading tab in the Perl TAP tests.

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:
> >
> > Thank you for updating the patch! The patch looks good to me. Here are
> > some random comments mostly about cosmetic changes.
> >
>
> Thanks a lot for the review!

Thank you for updating the patch.

>
> >
> > 1.
> > I think we can have two separate SQL functions:
> > pg_check_relation(regclass, text) and pg_check_relation(regclass),
> > instead of setting NULL by default to the second argument.
> >
>
> I'm fine with it, so implemented this way with the required documentation
> changes.

Why do we need two rows in the doc? For instance, replication slot
functions have some optional arguments but there is only one row in
the doc. So I think we don't need to change the doc from the previous
version patch.

And I think these are not necessary as we already defined in
include/catalog/pg_proc.dat:

+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;

>
> >
> > 2.
> > + * Check data sanity for a specific block in the given fork of the given
> > + * relation, always retrieved locally with smgrred even if a version exists in
> >
> > s/smgrred/smgrread/
>
> Fixed.
>
> >
> > 3.
> > +       /* The buffer will have to check checked. */
> > +       Assert(checkit);
> >
> > Should it be "The buffer will have to be checked"?
> >
>
> Oops indeed, fixed.
>
> >
> > 4.
> > +   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
> > +       ereport(ERROR,
> > +               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > +                errmsg("only superuser or a member of the
> > pg_read_server_files role may use this function")));
> >
> > Looking at the definition of pg_stat_read_server_files role, this role
> > seems to be for operations that could read non-database files such as
> > csv files. Therefore, currently this role is used by file_fdw and COPY
> > command. I personally think pg_stat_scan_tables would be more
> > appropriate for this function but I'm not sure.
> >
>
> That's a very good point, especially since the documentation of this default
> role is quite relevant for those functions:
>
> "Execute monitoring functions that may take ACCESS SHARE locks on tables,
> potentially for a long time."
>
> So changed!
>
> >
> > 5.
> > +   /* Set cost-based vacuum delay */
> > +   ChecksumCostActive = (ChecksumCostDelay > 0);
> > +   ChecksumCostBalance = 0;
> >
> > s/vacuum/checksum verification/
> >
>
> Fixed.
>
> >
> > 6.
> > +       ereport(WARNING,
> > +               (errcode(ERRCODE_DATA_CORRUPTED),
> > +                errmsg("invalid page in block %u of relation %s",
> > +                       blkno,
> > +                       relpath(relation->rd_smgr->smgr_rnode, forknum))));
> >
> > I think it's better to show the relation name instead of the relation path here.
> >
>
> I'm here using the same pattern as what ReadBuffer_common() would display if a
> corrupted block is read.  I think it's better to keep the format for both, so
> any existing log analyzer will keep working with those new functions.

Ok, I agree with you.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> 
> Why do we need two rows in the doc? For instance, replication slot
> functions have some optional arguments but there is only one row in
> the doc. So I think we don't need to change the doc from the previous
> version patch.
> 

I thought that if we document the function as pg_check_relation(regclass [,
fork]) users could think that the 2nd argument is optional, so that
pg_check_relation('something', NULL) could be a valid alias for the 1-argument
form, which it isn't.  After checking, I see that e.g. current_setting has the
same semantics and is documented the way you suggest, so fixed back to previous
version.

> And I think these are not necessary as we already defined in
> include/catalog/pg_proc.dat:
> 
> +CREATE OR REPLACE FUNCTION pg_check_relation(
> +  IN relation regclass,
> +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> +  OUT expected_checksum integer, OUT found_checksum integer)
> +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
> +  PARALLEL RESTRICTED;
> +
> +CREATE OR REPLACE FUNCTION pg_check_relation(
> +  IN relation regclass, IN fork text,
> +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> +  OUT expected_checksum integer, OUT found_checksum integer)
> +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> +  AS 'pg_check_relation_fork'
> +  PARALLEL RESTRICTED;
> 

Oh right this isn't required since there's no default value anymore, fixed.

v9 attached.

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> >
> > Why do we need two rows in the doc? For instance, replication slot
> > functions have some optional arguments but there is only one row in
> > the doc. So I think we don't need to change the doc from the previous
> > version patch.
> >
>
> I thought that if we document the function as pg_check_relation(regclass [,
> fork]) users could think that the 2nd argument is optional, so that
> pg_check_relation('something', NULL) could be a valid alias for the 1-argument
> form, which it isn't.  After checking, I see that e.g. current_setting has the
> same semantics and is documented the way you suggest, so fixed back to previous
> version.
>
> > And I think these are not necessary as we already defined in
> > include/catalog/pg_proc.dat:
> >
> > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > +  IN relation regclass,
> > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > +  OUT expected_checksum integer, OUT found_checksum integer)
> > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
> > +  PARALLEL RESTRICTED;
> > +
> > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > +  IN relation regclass, IN fork text,
> > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > +  OUT expected_checksum integer, OUT found_checksum integer)
> > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> > +  AS 'pg_check_relation_fork'
> > +  PARALLEL RESTRICTED;
> >
>
> Oh right this isn't required since there's no default value anymore, fixed.
>
> v9 attached.

Thank you for updating the patch! The patch looks good to me.

I've marked this patch as Ready for Committer. I hope this patch will
get committed to PG13.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote:
> On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:
> > >
> > > Why do we need two rows in the doc? For instance, replication slot
> > > functions have some optional arguments but there is only one row in
> > > the doc. So I think we don't need to change the doc from the previous
> > > version patch.
> > >
> >
> > I thought that if we document the function as pg_check_relation(regclass [,
> > fork]) users could think that the 2nd argument is optional, so that
> > pg_check_relation('something', NULL) could be a valid alias for the 1-argument
> > form, which it isn't.  After checking, I see that e.g. current_setting has the
> > same semantics and is documented the way you suggest, so fixed back to previous
> > version.
> >
> > > And I think these are not necessary as we already defined in
> > > include/catalog/pg_proc.dat:
> > >
> > > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > > +  IN relation regclass,
> > > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > > +  OUT expected_checksum integer, OUT found_checksum integer)
> > > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
> > > +  PARALLEL RESTRICTED;
> > > +
> > > +CREATE OR REPLACE FUNCTION pg_check_relation(
> > > +  IN relation regclass, IN fork text,
> > > +  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
> > > +  OUT expected_checksum integer, OUT found_checksum integer)
> > > +  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
> > > +  AS 'pg_check_relation_fork'
> > > +  PARALLEL RESTRICTED;
> > >
> >
> > Oh right this isn't required since there's no default value anymore, fixed.
> >
> > v9 attached.
> 
> Thank you for updating the patch! The patch looks good to me.
> 
> I've marked this patch as Ready for Committer. I hope this patch will
> get committed to PG13.
> 
Thanks a lot!



Re: Online checksums verification in the backend

From
Daniel Gustafsson
Date:
> On 5 Apr 2020, at 13:17, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote:

>> Thank you for updating the patch! The patch looks good to me.
>>
>> I've marked this patch as Ready for Committer. I hope this patch will
>> get committed to PG13.

> Thanks a lot!

This patch has been through quite a thorough review, and skimming the thread
all concerns raised have been addressed.  It still applies, and tests are green
in the CF Patchtester.  The feature in itself certainly gets my +1 for
inclusion; it seems a good addition.

Is any committer who has taken part in the thread (or anyone else for that
matter) interested in seeing this to some form of closure in this CF?

cheers ./daniel


Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Sep 7, 2020 at 8:59 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> +#include "postgres.h"
> +
> +#include "access/tupdesc.h"
> +#include "common/relpath.h"
>  #include "storage/block.h"
> +#include "utils/relcache.h"
> +#include "utils/tuplestore.h"
> [...]
> +extern bool check_one_block(Relation relation, ForkNumber forknum,
> +                           BlockNumber blkno, uint16 *chk_expected,
> +                           uint16 *chk_found);
> I don't think that it is a good idea to add this much to checksum.h
> as these are includes coming mainly from the backend.  Note that
> pg_checksum_page() is a function designed to be also available for
> frontend tools, with checksum.h something that can be included in
> frontends.  This would mean that we could move all the buffer lookup
> APIs directly to checksumfuncs.c, or move that into a separate file
> closer to the location.

Did you mean creating a new checksumfuncs.c file? I don't find any
such file in the current tree.

> + * A zero checksum can never be computed, see pg_checksum_page() */
> +#define NoComputedChecksum 0
> Wouldn't it be better to rename that something like
> InvalidPageChecksum, and make use of it in pg_checksum_page()?  It
> would be more consistent with the naming of transaction IDs, OIDs or
> even XLogRecPtr.  And that could be a separate patch.

It seems quite ambiguous, as checksum validity usually has a different
meaning.  And in the code added here, the meaning isn't that the
checksum is invalid but that there's no checksum, as it cannot be
computed due to PageIsNew().

> +++ b/src/test/check_relation/t/01_checksums_check.pl
> @@ -0,0 +1,276 @@
> +use strict;
> +use warnings;
> It could be better to move that to src/test/modules/, so as it could
> be picked more easily by MSVC scripts in the future.  Note that if we
> apply the normal naming convention here this should be named
> 001_checksum_check.pl.
>
> +subdir = src/test/check_relation
> +top_builddir = ../../..
> +include $(top_builddir)/src/Makefile.global
> Let's use a Makefile shaped in a way similar to modules/test_misc that
> makes use of TAP_TESTS = 1.  There is the infra, let's rely on it for
> the regression tests.

Will fix.

> +       pg_usleep(msec * 1000L);
> Could it be possible to add a wait event here?  It would be nice to be
> able to monitor that in pg_stat_activity.

Sure, I missed that as this was first implemented as an extension.

> +if (exists $ENV{MY_PG_REGRESS})
> +{
> +   $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
> +}
> What is MY_PG_REGRESS for?  A remnant from an external makefile
> perhaps?

Indeed.

> +   /*
> +    * If we get a failure and the buffer wasn't found in shared buffers,
> +    * reread the buffer with suitable lock to avoid false positive. See
> +    * check_get_buffer for more details.
> +    */
> +   if (!found_in_sb && !force_lock)
> +   {
> +       force_lock = true;
> +       goto Retry;
> +   }
> As designed, we have a risk of false positives with a torn page in the
> first loop when trying to look for a given buffer as we would try to
> use smgrread() without a partition lock.  This stresses me a bit, and
> false positives could scare users easily.  Could we consider first a
> safer approach where we don't do that, and just read the page while
> holding the partition lock?  OK, that would be more expensive, but at
> least that's safe in any case.  My memory of this patch is a bit
> fuzzy, but this is itching me and this is the heart of the problem
> dealt with here :)

I'm not sure I understand.  Unless I missed something this approach
*cannot* raise a false positive.  What it does is force a 2nd check
with stronger lock *to make sure it's actually a corruption*, so we
don't raise false positives.  The only report that can happen in this
1st loop is if smgrread raises an error, which AFAICT can only happen
(at least with mdread) if the whole block couldn't be read, which is a
sign of a very bad problem.  This should clearly be reported, as this
cannot be caused by the locking heuristics used here.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:
> Did you mean creating a new checksumfuncs.c file? I don't find any
> such file in the current tree.

Your patch adds checksumfuncs.c, so the subroutines grabbing a given
block could just be moved there.

> I'm not sure I understand.  Unless I missed something this approach
> *cannot* raise a false positive.  What it does is force a 2nd check
> with stronger lock *to make sure it's actually a corruption*, so we
> don't raise false positive.  The only report that can happen in this
> 1st loop is if smgrread raises an error, which AFAICT can only happen
> (at least with mdread) if the whole block couldn't be read, which is a
> sign of a very bad problem.  This should clearly be reported, as this
> cannot be caused by the locking heuristics used here.

We don't know how much this optimization matters though?  Could it be
possible to get an idea of that?  For example, take the case of one
relation with a fixed size in a read-only workload and a read-write
workload (as long as autovacuum and updates make the number of
relation blocks rather constant for the read-write case), doing a
checksum verification in parallel of multiple clients working on the
relation concurrently.  Assuming that the relation is fully in the OS
cache, we could get an idea of the impact with multiple
(shared_buffers / relation size) rates to make the eviction more
aggressive?  The buffer partition locks, knowing that
NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
seems to me that it would be good to see if we have a difference.
What do you think?
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Masahiko Sawada
Date:
On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote:
> > On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:
> >> Small language fixes in comments and user-facing docs.
> >
> > Thanks a lot!  I just fixed a small issue (see below), PFA updated v10.
>
> Sawada-san, you are registered as a reviewer of this patch.  Are you
> planning to look at it?  If you are busy lately, that's fine as well
> (congrats!).

Thanks!

>  In this case it could be better to unregister from the
> CF app for this entry.

Well, I sent review comments on this patch and Julien addressed them
all, so I marked this as Ready for Committer since I didn't have
further comments at that time, and I was waiting for the committer
review.  I'll look at this patch again, but should I remove my name
from the reviewers after that if I have no further comments?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Tue, Sep 08, 2020 at 11:36:45AM +0900, Masahiko Sawada wrote:
> On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote:
>>  In this case it could be better to unregister from the
>> CF app for this entry.
>
> Well, I sent review comments on this patch and Julien fixed all
> comments. So I’d marked this as Ready for Committer since I didn't
> have further comments at that time, and I was waiting for the
> committer review. I'll look at this patch again but should I remove my
> name from the reviewer after that if no comments?

Ah, sorry, I somewhat missed the previous status of the patch.
Perhaps that's an overdose of CF.  Keeping your name as reviewer is
fine I guess.  I have begun looking at the patch and spotted some
issues, so let's see where we go from here.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Sep 07, 2020 at 05:50:38PM +0900, Michael Paquier wrote:
> On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:
> > Did you mean creating a new checksumfuncs.c file? I don't find any
> > such file in the current tree.
> 
> Your patch adds checksumfuncs.c, so the subroutines grabbing a given
> block could just be moved there.
> 

Sorry, I was in the middle of a rebase for another patch and missed the new
files added in this one.  I added a new checksumfuncs.h for the required
include that should not be seen by client code.  I kept checksumfuncs.c and
checksums.c so that the SQL-visible declarations are separated from the rest of
the implementation, as this is what we already do elsewhere I think.  If that's
a problem I'll change and put everything in checksumfuncs.[ch].

I also moved the TAP tests to src/test/modules and renamed the file with a
3-digit prefix.  For the record I initially copied src/test/modules/brin, and
this is apparently the only subdir that has a 2-digit pattern.

I also added a new WAIT_EVENT_CHECK_DELAY wait event.

> > I'm not sure I understand.  Unless I missed something this approach
> > *cannot* raise a false positive.  What it does is force a 2nd check
> > with stronger lock *to make sure it's actually a corruption*, so we
> > don't raise false positive.  The only report that can happen in this
> > 1st loop is if smgrread raises an error, which AFAICT can only happen
> > (at least with mdread) if the whole block couldn't be read, which is a
> > sign of a very bad problem.  This should clearly be reported, as this
> > cannot be caused by the locking heuristics used here.
> 
> We don't know how much this optimization matters though?  Could it be
> possible to get an idea of that?  For example, take the case of one
> relation with a fixed size in a read-only workload and a read-write
> workload (as long as autovacuum and updates make the number of
> relation blocks rather constant for the read-write case), doing a
> checksum verification in parallel of multiple clients working on the
> relation concurrently.  Assuming that the relation is fully in the OS
> cache, we could get an idea of the impact with multiple
> (shared_buffers / relation size) rates to make the eviction more
> aggressive?  The buffer partition locks, knowing that
> NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
> seems to me that it would be good to see if we have a difference.
> What do you think?

I assumed that the odds of having to check the buffer twice were low enough,
and that avoiding holding a bufmapping lock while doing some IO was an
uncontroversial enough optimisation, but maybe that's only wishful thinking.

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time.  In the meantime I'm attaching v11 of the patch that should
address all other comments.

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote:
> I assumed that the odds of having to check the buffer twice were so low, and
> avoiding to keep a bufmapping lock while doing some IO was an uncontroversial
> enough optimisation, but maybe that's only wishful thinking.

Perhaps it is worth it, so it would be good to make sure of it and see
if that's actually worth the extra complication.  This also depends on
whether the page is in the OS cache when it is not in shared buffers,
meaning that smgrread() is needed to fetch the page to check.  I would
be more curious to see if there is an actual difference when the page
is in the OS cache.

> I'll do some benchmarking and see if I can get some figures, but it'll probably
> take some time.  In the meantime I'm attaching v11 of the patch that should
> address all other comments.

Thanks.

Another thing that was itching me is the introduction of 3 GUCs with
one new category for the sake of two SQL functions.  For VACUUM we
have many things relying on the GUC delays, with autovacuum and manual
vacuum.  Perhaps it would make sense to have these if we have some day
a background worker doing checksum verifications, still that could
perfectly be in contrib/, and that would be kind of hard to tune as
well.  The patch enabling checksums on-the-fly could also be a reason
good enough.  Another thing we could consider is to pass down those
parameters as function arguments, at the cost of not being able to
reload them.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> Another thing that was itching me is the introduction of 3 GUCs with
> one new category for the sake of two SQL functions.  For VACUUM we
> have many things relying on the GUC delays, with autovacuum and manual
> vacuum.  Perhaps it would make sense to have these if we have some day
> a background worker doing checksum verifications, still that could
> perfectly be in contrib/, and that would be kind of hard to tune as
> well.  The patch enabling checksums on-the-fly could also be a reason
> good enough.  Another thing we could consider is to pass down those
> parameters as function arguments, at the cost of not being able to
> reload them.

I'm not terribly happy with adding that for now, but it's quite clear
that there'll eventually be a lot of new stuff added that will benefit
from either the category or the GUC.  FTR once we reach an agreement
on how to do this check (I'm wondering if it'll stay an SQL function
or become a plain backend command, in which case GUCs would be
mandatory), I'll also be happy to work on a background worker to help
people running the check regularly.  So in my opinion it's better to
add them now so we won't have to change the SQL function definition
later when other facilities rely on them.



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> 
> I'll do some benchmarking and see if I can get some figures, but it'll probably
> take some time.  In the meantime I'm attaching v11 of the patch that should
> address all other comments.

I just realized that I forgot to update one of the Makefile when moving the TAP
test folder.  v12 attached should fix that.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> > 
> > I'll do some benchmarking and see if I can get some figures, but it'll probably
> > take some time.  In the meantime I'm attaching v11 of the patch that should
> > address all other comments.
> 
> I just realized that I forgot to update one of the Makefile when moving the TAP
> test folder.  v12 attached should fix that.


And the cfbot just reported a new error for the Windows build.  Attached v13
should fix that.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote:
> On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:
> > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> > > 
> > > I'll do some benchmarking and see if I can get some figures, but it'll probably
> > > take some time.  In the meantime I'm attaching v11 of the patch that should
> > > address all other comments.
> > 
> > I just realized that I forgot to update one of the Makefile when moving the TAP
> > test folder.  v12 attached should fix that.
> 
> 
> And the cfbot just reported a new error for Windows build.  Attached v13 should
> fix that.


I did some benchmarking using the following environment:

- 16MB shared buffers
- 490MB table (10M rows)
- synchronized_seqscan to off
- table in OS cache

I don't have a big machine so I went with a very small shared_buffers and a
small table, to make sure that all data is in the OS cache while the table is
more than an order of magnitude bigger than shared_buffers, to simulate a
plausible environment.

I used a simple read-only query that performs a sequential scan of the table (a
plain SELECT * FROM table), run using 10 concurrent connections, with 5 runs of
700 seconds each.  I did that with no other activity, then with the original
pg_check_relation function running under \watch .1, and finally with a modified
version of that function without the optimisation, also under \watch .1.

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

This is supposed to be a relatively fair benchmark, as all the data are cached
on the OS side, so the IO done while holding the bufmapping lock isn't too long,
but we can see that we already get a non-negligible benefit from this
optimisation.  Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity?  This would probably only make the
unoptimized version perform even worse.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:
> The TPS is obviously overall extremely bad, but I can see that the submitted
> version added an overhead of ~3.9% (average of 5 runs), while the version
> without the optimisation added an overhead of ~6.57%.

Okay, so that stands as a difference.  I am planning to run some
benchmarks on my end as well, and see if I can see a clear
difference.

> This is supposed to be a relatively fair benchmark as all the data are cached
> on the OS side, so IO done while holding the bufmapping lock aren't too long,
> but we can see that we already get a non negligible benefit from this
> optimisation.  Should I do additional benchmarking, like dropping the OS cache
> and/or adding some write activity?  This would probably only make the
> unoptimized version perform even worse.

It would be also interesting to see the case where the pages are not
in the OS cache and see how bad it can get.  For the read-write case,
I am not sure as we may have some different overhead that hide the
noise.  Also, did you run your tests with the functions scanning at
full speed, with (ChecksumCostDelay < 0) so as there is no throttling?
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Sep 11, 2020 at 9:34 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:
> > The TPS is obviously overall extremely bad, but I can see that the submitted
> > version added an overhead of ~3.9% (average of 5 runs), while the version
> > without the optimisation added an overhead of ~6.57%.
>
> Okay, so that stands as a difference.  I am planning to run some
> benchmarks on my end as well, and see if I can see a clear
> difference.

Thanks!

> > This is supposed to be a relatively fair benchmark as all the data are cached
> > on the OS side, so IO done while holding the bufmapping lock aren't too long,
> > but we can see that we already get a non negligible benefit from this
> > optimisation.  Should I do additional benchmarking, like dropping the OS cache
> > and/or adding some write activity?  This would probably only make the
> > unoptimized version perform even worse.
>
> It would be also interesting to see the case where the pages are not
> in the OS cache and see how bad it can get.  For the read-write case,
> I am not sure as we may have some different overhead that hide the
> noise.  Also, did you run your tests with the functions scanning at
> full speed, with (ChecksumCostDelay < 0) so as there is no throttling?

I used all default settings, but by default checksum_cost_delay is 0
so there shouldn't be any throttling.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> Thanks!

I got some numbers out of my pocket, using the following base
configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB' # also tested with 30MB and 60MB
checksum_cost_delay = 0 # default in patch

And for the test I have used pgbench initialized at a scale of 250, to
have close to 3.5GB of data, so as it gives us a sort of 90% buffer
eviction, with all the data cached in the OS (we may want to look as
well at the case where the OS cache does not hold all the relation
pages).  I have also run some tests with 30MB and 60MB of shared
buffers, for similar results.

I also applied some prewarming on the cluster:
create extension pg_prewarm;
select pg_prewarm(oid) from pg_class where oid > 16000;

Then, I have done five runs using that:
pgbench -S -M prepared -c 64/128/256 -n -T 60
In parallel of that, I got this stuff running in parallel all the
time:
select pg_check_relation('pgbench_accounts');
\watch 0.1

Here are some TPS numbers with the execution time of pg_check_relation.
In the five runs, I removed the highest and lowest ones, then took an
average of the remaining three.  I have also tested two modes: with
and without the optimization, that requires a one-liner in checksum.c
as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                 uint16 *chk_expected, uint16 *chk_found)
 {
     char        buffer[BLCKSZ];
-    bool        force_lock = false;
+    bool        force_lock = true;
     bool        found_in_sb;

Within parentheses is the amount of time taken by pg_relation_check()
for a single check.  This is not completely exact and I saw some
variations; this is just to give an impression:
Conns               64              128                 256
force_lock=true     60590 (7~8s)    55652 (10.2~10.5s)  46812 (9~10s)
force_lock=false    58637 (5s)      54131 (6~7s)        37091 (1.1~1.2s)

For connections higher than 128, I was kind of surprised to see
pg_relation_check being more aggressive without the optimization, with
much less contention on the buffer mapping LWLock actually, but that's
an impression from looking at pg_stat_activity.

Looking at the wait events for each run, I saw many more hiccups with
the buffer mapping LWLock when forcing the lock rather than not, still
I was able to see some contention when also not forcing the lock.  Not
surprising as this rejects a bunch of pages from shared buffers.

> I used all default settings, but by default checksum_cost_delay is 0
> so there shouldn't be any throttling.

Thanks, so did I.  From what I can see, there could also be benefits
in not using the optimization by default, as the relation check then
applies some natural throttling by making the checks actually slower
(there is a link between the individual runtime of pg_relation_check
and the TPS).  So it also seems to me that the throttling parameters
would be beneficial, but there is also a point in not including any
throttling in a first version if the optimization to go full speed is
not around.  Using three new GUCs for those function calls is still
too much IMO, so there is also the argument to move all this stuff
into a new contrib/ module, and have a bgworker implementation as part
of it, as it would need shared_preload_libraries anyway.

Also, I have been putting some thought into the APIs able to fetch a
buffer without going through the shared buffers.  Neither checksum.c
nor checksumfuncs.c looks like the right place: not checksum.c,
because it should be a place that those APIs depend on, containing
only the most-internal logic for the checksum algorithm and
computation; and not checksumfuncs.c, because we need to think about
the case of a background worker as well (that could spawn a set of
dynamic workers connecting to different databases able to do checksum
verifications?).  It would be good to keep the handling of the buffer
mapping lock as well as the calls to smgrread() in a single place.
ReadBuffer_common() is a natural place for that, though it means for
our use case the addition of three new options:
- Being able to pass down directly a buffer pointer to save the page
contents.
- Being able to not verify directly a page, leaving the verification
to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where we
actually read the page from disk if not yet in shared buffers, except
that we fill in the contents of the page using the pointer given by
the caller.  That's different than the use of local buffers, as we
don't need this much amount of complications like temporary tables of
course for per-page checks.

Another idea would be to let ReadBuffer_common just do the check by
itself, with a different mode, a kind of RBM_VALIDATE, where we
return a verification state of the page that can be consumed by
callers.

This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode, with
new fields in pgBufferUsage, all in a single place
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Wed, Sep 16, 2020 at 11:46 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> > Thanks!
>
> I got some numbers out of my pocket, using the following base
> configuration:
> [...]
>
> Within parenthesis is the amount of time taken by pg_relation_check()
> for a single check.  This is not completely exact and I saw some
> variations, just to give an impression:
> Conns               64              128                 256
> force_lock=true     60590 (7~8s)    55652 (10.2~10.5s)  46812 (9~10s)
> force_lock=false    58637 (5s)      54131 (6~7s)        37091 (1.1~1.2s)
>
> For connections higher than 128, I was kind of surprised to see
> pg_relation_check being more aggressive without the optimization, with
> much less contention on the buffer mapping LWLock actually, but that's
> an impression from looking at pg_stat_activity.
>
> Looking at the wait events for each run, I saw much more hiccups with
> the buffer mapping LWLock when forcing the lock rather than not, still
> I was able to see some contention when also not forcing the lock.  Not
> surprising as this rejects a bunch of pages from shared buffers.
>
> > I used all default settings, but by default checksum_cost_delay is 0
> > so there shouldn't be any throttling.
>
> Thanks, so did I.  From what I can see, there could be as well
> benefits in not using the optimization by default so as the relation
> check applies some natural throttling by making the checks actually
> slower (there is a link between the individual runtime of
> pg_relation_check and the TPS).

Thanks a lot for the tests!  I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster.  Do you have
any idea why that's the case?

>  So it also seems to me that the
> throttling parameters would be beneficial, but it looks to me that
> there is as well a point to not include any throttling in a first
> version if the optimization to go full speed is not around.  Using
> three new GUCs for those function calls is still too much IMO

I'm assuming that you prefer to remove both the optimization and the
throttling part?  I'll do that in the next version unless there are
objections.

>, so
> there is also the argument to move all this stuff into a new contrib/
> module, and have a bgworker implementation as part of it as it would
> need shared_preload_libraries anyway.
>
> Also, I have been putting some thoughts into the APIs able to fetch a
> buffer without going through the shared buffers.  And neither
> checksum.c, because it should be a place that those APIs depends on
> and include only the most-internal logic for checksum algorithm and
> computation, nor checksumfuncs.c, because we need to think about the
> case of a background worker as well (that could spawn a set of dynamic
> workers connecting to different databases able to do checksum
> verifications?).  It would be good to keep the handling of the buffer
> mapping lock as well as the calls to smgrread() into a single place.
> ReadBuffer_common() is a natural place for that, though it means for
> our use case the addition of three new options:
> - Being able to pass down directly a buffer pointer to save the page
> contents.
> - Being able to not verify directly a page, leaving the verification
> to the caller upthread.
> - Addition of a new mode, that I am calling here RBM_PRIVATE, where we
> actually read the page from disk if not yet in shared buffers, except
> that we fill in the contents of the page using the pointer given by
> the caller.  That's different than the use of local buffers, as we
> don't need this much amount of complications like temporary tables of
> course for per-page checks.
>
> Another idea would be to actually just let ReadBuffer_common just do
> the check by itself, with a different mode like a kind of
> RBM_VALIDATE, where we just return a verification state of the page
> that can be consumed by callers.
>
> This also comes with some more advantages:
> - Tracking of reads from disk with track_io_timing.
> - Addition of some private stats dedicated to this private mode, with
> new fields in pgBufferUsage, all in a single place

I agree that putting the code near ReadBuffer_common() would be a
good idea.  However, that means that I can't move all the code to
contrib/.  I'm wondering what you'd like to see going there.  I can
see some value in also having the SQL functions available in core
rather than in contrib, e.g. if you need to quickly check a relation
on a standby, without having to create the extension on the primary
node first.

Then, I'm a bit worried about adding this code to ReadBuffer_common.
What this code does is quite different, and I'm afraid it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:
> Thanks a lot for the tests!  I'm not surprised that forcing the lock
> will slow down the pg_check_relation() execution, but I'm a bit
> surprised that holding the buffer mapping lock longer in a workload
> that has a lot of evictions actually makes things faster.  Do you have
> any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

> I'm assuming that you prefer to remove both the optimization and the
> throttling part?  I'll do that with the next version unless there's
> objections.

Yeah, any tests I have done tend to show that.  It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

> I agree that putting the code nearby ReadBuffer_common() would be a
> good idea.  However, that means that I can't move all the code to
> contrib/  I'm wondering what you'd like to see going there.  I can see
> some values in also having the SQL functions available in core rather
> than contrib, e.g. if you need to quickly check a relation on a
> standby, so without requiring to create the extension on the primary
> node first.

Good point.  This could make the user experience worse.

>  Then, I'm a bit worried about adding this code in ReadBuffer_common.
> What this code does is quite different, and I'm afraid that it'll make
> ReadBuffer_common more complex than needed, which  is maybe not a good
> idea for something as critical as this function.
>
> What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough that we may not really need an extra mode or
more options for a final logic that is actually different from what a
buffer read does: we just want to know whether a page has a valid
checksum or not.  An idea I got here would be to add a new, separate
function doing the page check directly in bufmgr.c, but perhaps that's
what you meant.  For now only the prefetch routine and
ReadBuffer_common use partition locks, but getting that done in the
same file looks like a good compromise to me.  It would also be
possible to keep the BLCKSZ buffer used to check the page directly in
this routine, so that any caller willing to do a check doesn't need to
worry about any allocation.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:
> > Thanks a lot for the tests!  I'm not surprised that forcing the lock
> > will slow down the pg_check_relation() execution, but I'm a bit
> > surprised that holding the buffer mapping lock longer in a workload
> > that has a lot of evictions actually makes things faster.  Do you have
> > any idea why that's the case?
>
> That's still a bit unclear to me, but I have not spent much time
> thinking about this particular point either.
>
> > I'm assuming that you prefer to remove both the optimization and the
> > throttling part?  I'll do that with the next version unless there's
> > objections.
>
> Yeah, any tests I have done tends to show that.  It would be good to
> also check some perf profiles here, at least for the process running
> the relation check in a loop.
>
> > I agree that putting the code nearby ReadBuffer_common() would be a
> > good idea.  However, that means that I can't move all the code to
> > contrib/  I'm wondering what you'd like to see going there.  I can see
> > some values in also having the SQL functions available in core rather
> > than contrib, e.g. if you need to quickly check a relation on a
> > standby, so without requiring to create the extension on the primary
> > node first.
>
> Good point.  This could make the user experience worse.
>
> >  Then, I'm a bit worried about adding this code in ReadBuffer_common.
> > What this code does is quite different, and I'm afraid that it'll make
> > ReadBuffer_common more complex than needed, which  is maybe not a good
> > idea for something as critical as this function.
> >
> > What do you think?
>
> Yeah, I have been looking at ReadBuffer_common() and it is true that
> it is complicated enough so we may not really need an extra mode or
> more options, for a final logic that is actually different than what a
> buffer read does: we just want to know if a page has a valid checksum
> or not.  An idea that I got here would be to add a new, separate
> function to do the page check directly in bufmgr.c, but that's what
> you mean.  Now only the prefetch routine and ReadBuffer_common use
> partition locks, but getting that done in the same file looks like a
> good compromise to me.  It would be also possible to keep the BLCKSZ
> buffer used to check the page directly in this routine, so as any
> caller willing to do a check don't need to worry about any
> allocation.

I made all the suggested modifications in attached v14:

- moved the C code in bufmgr.c nearby ReadBuffer
- removed the GUC and throttling options
- removed the dubious optimization

All documentation and comments are updated to reflect those changes.
I also split the commit in two, one for the backend infrastructure and
one for the SQL wrappers.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:
> > > Thanks a lot for the tests!  I'm not surprised that forcing the lock
> > > will slow down the pg_check_relation() execution, but I'm a bit
> > > surprised that holding the buffer mapping lock longer in a workload
> > > that has a lot of evictions actually makes things faster.  Do you have
> > > any idea why that's the case?
> >
> > That's still a bit unclear to me, but I have not spent much time
> > thinking about this particular point either.
> >
> > > I'm assuming that you prefer to remove both the optimization and the
> > > throttling part?  I'll do that with the next version unless there's
> > > objections.
> >
> > Yeah, any tests I have done tends to show that.  It would be good to
> > also check some perf profiles here, at least for the process running
> > the relation check in a loop.
> >
> > > I agree that putting the code nearby ReadBuffer_common() would be a
> > > good idea.  However, that means that I can't move all the code to
> > > contrib/  I'm wondering what you'd like to see going there.  I can see
> > > some values in also having the SQL functions available in core rather
> > > than contrib, e.g. if you need to quickly check a relation on a
> > > standby, so without requiring to create the extension on the primary
> > > node first.
> >
> > Good point.  This could make the user experience worse.
> >
> > >  Then, I'm a bit worried about adding this code in ReadBuffer_common.
> > > What this code does is quite different, and I'm afraid that it'll make
> > > ReadBuffer_common more complex than needed, which  is maybe not a good
> > > idea for something as critical as this function.
> > >
> > > What do you think?
> >
> > Yeah, I have been looking at ReadBuffer_common() and it is true that
> > it is complicated enough so we may not really need an extra mode or
> > more options, for a final logic that is actually different than what a
> > buffer read does: we just want to know if a page has a valid checksum
> > or not.  An idea that I got here would be to add a new, separate
> > function to do the page check directly in bufmgr.c, but that's what
> > you mean.  Now only the prefetch routine and ReadBuffer_common use
> > partition locks, but getting that done in the same file looks like a
> > good compromise to me.  It would be also possible to keep the BLCKSZ
> > buffer used to check the page directly in this routine, so as any
> > caller willing to do a check don't need to worry about any
> > allocation.
>
> I made all the suggested modifications in attached v14:
>
> - moved the C code in bufmgr.c nearby ReadBuffer
> - removed the GUC and throttling options
> - removed the dubious optimization
>
> All documentation and comments are updated to reflect those changes.
> I also split the commit in two, one for the backend infrastructure and
> one for the SQL wrappers.

And I did miss a reference in the sgml documentation, fixed in v15.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:
> > > > Thanks a lot for the tests!  I'm not surprised that forcing the lock
> > > > will slow down the pg_check_relation() execution, but I'm a bit
> > > > surprised that holding the buffer mapping lock longer in a workload
> > > > that has a lot of evictions actually makes things faster.  Do you have
> > > > any idea why that's the case?
> > >
> > > That's still a bit unclear to me, but I have not spent much time
> > > thinking about this particular point either.
> > >
> > > > I'm assuming that you prefer to remove both the optimization and the
> > > > throttling part?  I'll do that with the next version unless there's
> > > > objections.
> > >
> > > Yeah, any tests I have done tends to show that.  It would be good to
> > > also check some perf profiles here, at least for the process running
> > > the relation check in a loop.
> > >
> > > > I agree that putting the code nearby ReadBuffer_common() would be a
> > > > good idea.  However, that means that I can't move all the code to
> > > > contrib/  I'm wondering what you'd like to see going there.  I can see
> > > > some values in also having the SQL functions available in core rather
> > > > than contrib, e.g. if you need to quickly check a relation on a
> > > > standby, so without requiring to create the extension on the primary
> > > > node first.
> > >
> > > Good point.  This could make the user experience worse.
> > >
> > > >  Then, I'm a bit worried about adding this code in ReadBuffer_common.
> > > > What this code does is quite different, and I'm afraid that it'll make
> > > > ReadBuffer_common more complex than needed, which  is maybe not a good
> > > > idea for something as critical as this function.
> > > >
> > > > What do you think?
> > >
> > > Yeah, I have been looking at ReadBuffer_common() and it is true that
> > > it is complicated enough so we may not really need an extra mode or
> > > more options, for a final logic that is actually different than what a
> > > buffer read does: we just want to know if a page has a valid checksum
> > > or not.  An idea that I got here would be to add a new, separate
> > > function to do the page check directly in bufmgr.c, but that's what
> > > you mean.  Now only the prefetch routine and ReadBuffer_common use
> > > partition locks, but getting that done in the same file looks like a
> > > good compromise to me.  It would be also possible to keep the BLCKSZ
> > > buffer used to check the page directly in this routine, so as any
> > > caller willing to do a check don't need to worry about any
> > > allocation.
> >
> > I made all the suggested modifications in attached v14:
> >
> > - moved the C code in bufmgr.c nearby ReadBuffer
> > - removed the GUC and throttling options
> > - removed the dubious optimization
> >
> > All documentation and comments are updated to reflect those changes.
> > I also split the commit in two, one for the backend infrastructure and
> > one for the SQL wrappers.
>
> And I did miss a reference in the sgml documentation, fixed in v15.

I forgot to add the modified file in the previous attachment, sorry
for the noise.

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Oct 16, 2020 at 8:59 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >
> > On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > >
> > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:
> > > > > Thanks a lot for the tests!  I'm not surprised that forcing the lock
> > > > > will slow down the pg_check_relation() execution, but I'm a bit
> > > > > surprised that holding the buffer mapping lock longer in a workload
> > > > > that has a lot of evictions actually makes things faster.  Do you have
> > > > > any idea why that's the case?
> > > >
> > > > That's still a bit unclear to me, but I have not spent much time
> > > > thinking about this particular point either.
> > > >
> > > > > I'm assuming that you prefer to remove both the optimization and the
> > > > > throttling part?  I'll do that with the next version unless there's
> > > > > objections.
> > > >
> > > > Yeah, any tests I have done tends to show that.  It would be good to
> > > > also check some perf profiles here, at least for the process running
> > > > the relation check in a loop.
> > > >
> > > > > I agree that putting the code nearby ReadBuffer_common() would be a
> > > > > good idea.  However, that means that I can't move all the code to
> > > > > contrib/  I'm wondering what you'd like to see going there.  I can see
> > > > > some values in also having the SQL functions available in core rather
> > > > > than contrib, e.g. if you need to quickly check a relation on a
> > > > > standby, so without requiring to create the extension on the primary
> > > > > node first.
> > > >
> > > > Good point.  This could make the user experience worse.
> > > >
> > > > >  Then, I'm a bit worried about adding this code in ReadBuffer_common.
> > > > > What this code does is quite different, and I'm afraid that it'll make
> > > > > ReadBuffer_common more complex than needed, which  is maybe not a good
> > > > > idea for something as critical as this function.
> > > > >
> > > > > What do you think?
> > > >
> > > > Yeah, I have been looking at ReadBuffer_common() and it is true that
> > > > it is complicated enough so we may not really need an extra mode or
> > > > more options, for a final logic that is actually different than what a
> > > > buffer read does: we just want to know if a page has a valid checksum
> > > > or not.  An idea that I got here would be to add a new, separate
> > > > function to do the page check directly in bufmgr.c, but that's what
> > > > you mean.  Now only the prefetch routine and ReadBuffer_common use
> > > > partition locks, but getting that done in the same file looks like a
> > > > good compromise to me.  It would be also possible to keep the BLCKSZ
> > > > buffer used to check the page directly in this routine, so as any
> > > > caller willing to do a check don't need to worry about any
> > > > allocation.
> > >
> > > I made all the suggested modifications in attached v14:
> > >
> > > - moved the C code in bufmgr.c nearby ReadBuffer
> > > - removed the GUC and throttling options
> > > - removed the dubious optimization
> > >
> > > All documentation and comments are updated to reflect those changes.
> > > I also split the commit in two, one for the backend infrastructure and
> > > one for the SQL wrappers.
> >
> > And I did miss a reference in the sgml documentation, fixed in v15.
>
> I forgot to add the modified file in the previous attachment, sorry
> for the noise.

And Michael just told me that I also missed adding one of the C files
while splitting the patch into two.

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote:
> And Michael just told me that I also missed adding one of the C files
> while splitting the patch into two.

+   if (PageIsNew(page))
+   {
+       /*
+        * Check if the page is really new or if there's corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(page, blkno))
+       {
+           /* No corruption. */
+           return true;
+       }
Please note that this part of your patch overlaps with a proposal for
a bug fix related to zero-only pages with the checksum verification of
base backups:
https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru

Your patch is trying to adapt itself to the existing logic we have in
PageIsVerified() so as you don't get a duplicated report, as does the
base backup part.  Note that I cannot find in the wild any open code
making use of PageIsVerified(), but I'd like to believe that it is
rather handy to use for table AMs at least (?), so if we can avoid any
useless ABI breakage, it may be better to have a new
PageIsVerifiedExtended() able to take additional options, one to
report to pgstat and one to generate this elog(WARNING).  And then
this patch could just make use of it?
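The flag-driven variant sketched above can be illustrated with a small standalone mock.  The PIV_LOG_WARNING / PIV_REPORT_STAT names follow the description in this message, but the page layout, the checksum and the reporting hooks are toys for illustration, not the backend implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Flag bits for the hypothetical PageIsVerifiedExtended(); names are
 * illustrative, following the proposal above.
 */
#define PIV_LOG_WARNING  (1 << 0)   /* emit a WARNING on failure */
#define PIV_REPORT_STAT  (1 << 1)   /* report the failure to pgstat */

#define MOCK_BLCKSZ 128

/* Toy page: a stored checksum followed by payload bytes. */
typedef struct MockPage
{
    uint16_t    pd_checksum;
    char        data[MOCK_BLCKSZ - sizeof(uint16_t)];
} MockPage;

/* Stand-in for pg_checksum_page(): checksum over the payload only. */
static uint16_t
mock_checksum(const MockPage *page)
{
    uint32_t    sum = 0;

    for (size_t i = 0; i < sizeof(page->data); i++)
        sum += (unsigned char) page->data[i];
    return (uint16_t) sum;
}

/* Stand-in for pgstat_report_checksum_failure(). */
static int  n_stat_reports = 0;

/*
 * Same verification as the non-extended version, but the caller
 * chooses through flags whether a failure is logged and/or reported
 * to the stats collector.
 */
static bool
PageIsVerifiedExtended(const MockPage *page, uint32_t blkno, int flags)
{
    if (mock_checksum(page) == page->pd_checksum)
        return true;

    if (flags & PIV_LOG_WARNING)
        fprintf(stderr, "WARNING: page verification failed, block %u\n",
                (unsigned) blkno);
    if (flags & PIV_REPORT_STAT)
        n_stat_reports++;
    return false;
}
```

A caller wanting a silent check, as this patch does, would pass 0 for the flags; base backups could pass both bits and keep today's behavior.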

+       /*
+        * There's corruption, but since this affects PageIsNew, we
+        * can't compute a checksum, so set NoComputedChecksum for the
+        * expected checksum.
+        */
+       *chk_expected = NoComputedChecksum;
+       *chk_found = hdr->pd_checksum;
+       return false;
[...]
+       /*
+        * This can happen if corruption makes the block appears as
+        * PageIsNew() but isn't a new page.
+        */
+       if (chk_expected == NoComputedChecksum)
+           nulls[i++] = true;
+       else
+           values[i++] = UInt16GetDatum(chk_expected);
Somewhat related to the first point, NoComputedChecksum exists
because, as the current patch is shaped, we need to report an existing
checksum to the user even for the zero-only case.  PageIsVerified() is
not that flexible so we could change it to report a status depending
on the error faced (checksum, header or zero-only) on top of getting a
checksum.  Now, I am not completely sure either that it is worth the
complication to return in the SRF of the check function the expected
checksum.  So, wouldn't it be better to just rely on PageIsVerified()
(well it's rather-soon-to-be extended flavor) for the checksum check,
the header sanity check and the zero-only check?  My point is to keep
a single entry point for all the page sanity checks, so as base
backups, your patch, and the buffer manager apply the same things.
Base backups got that partially wrong because the base backup code
wants to keep control of the number of failures and the error
reports.  Your patch actually wishes to report a failure, but you want
to add more context with the fork name and such.  Another option we
could use here is to add an error context so as PageIsVerified()
reports the WARNING, but the SQL function provides more context with
the block number and the relation involved in the check.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote:
> > And Michael just told me that I also missed adding one of the C files
> > while splitting the patch into two.
>
> +   if (PageIsNew(page))
> +   {
> +       /*
> +        * Check if the page is really new or if there's corruption that
> +        * affected PageIsNew detection.  Note that PageIsVerified won't try to
> +        * detect checksum corruption in this case, so there's no risk of
> +        * duplicated corruption report.
> +        */
> +       if (PageIsVerified(page, blkno))
> +       {
> +           /* No corruption. */
> +           return true;
> +       }
> Please note that this part of your patch overlaps with a proposal for
> a bug fix related to zero-only pages with the checksum verification of
> base backups:
> https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru
>
> Your patch is trying to adapt itself to the existing logic we have in
> PageIsVerified() so as you don't get a duplicated report, as does the
> base backup part.  Note that I cannot find in the wild any open code
> making use of PageIsVerified(), but I'd like to believe that it is
> rather handy to use for table AMs at least (?), so if we can avoid any
> useless ABI breakage, it may be better to have a new
> PageIsVerifiedExtended() able to take additional options, one to
> report to pgstat and one to generate this elog(WARNING).  And then
> this patch could just make use of it?

Indeed, that would be great.

> +       /*
> +        * There's corruption, but since this affects PageIsNew, we
> +        * can't compute a checksum, so set NoComputedChecksum for the
> +        * expected checksum.
> +        */
> +       *chk_expected = NoComputedChecksum;
> +       *chk_found = hdr->pd_checksum;
> +       return false;
> [...]
> +       /*
> +        * This can happen if corruption makes the block appears as
> +        * PageIsNew() but isn't a new page.
> +        */
> +       if (chk_expected == NoComputedChecksum)
> +           nulls[i++] = true;
> +       else
> +           values[i++] = UInt16GetDatum(chk_expected);
> Somewhat related to the first point, NoComputedChecksum exists
> because, as the current patch is shaped, we need to report an existing
> checksum to the user even for the zero-only case.


I'm not sure that I understand your point.  The current patch only
returns something to users when there's a corruption.  If by
"zero-only case" you mean "page corrupted in a way that PageIsNew()
returns true while not being all zero", then it's a corrupted page and
then obviously yes it needs to be returned to users.
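The case under discussion can be sketched standalone.  In the real backend PageIsNew() only looks at pd_upper in the page header and the zero-only scan covers a full BLCKSZ buffer; the names and layout here are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

#define MOCK_BLCKSZ 64

/*
 * Stand-in for PageIsNew(): the backend checks pd_upper == 0; here
 * the first byte of the buffer plays that role.
 */
static bool
mock_page_is_new(const char *page)
{
    return page[0] == 0;
}

/*
 * The zero-only test applied to pages claiming to be new: a genuinely
 * new page must be all zeroes, so any nonzero byte means corruption
 * made the header look new.
 */
static bool
new_page_is_all_zero(const char *page)
{
    for (size_t i = 0; i < MOCK_BLCKSZ; i++)
        if (page[i] != 0)
            return false;
    return true;
}
```

A page passing the first test but failing the second is exactly the "looks new but isn't all zero" corruption that the SRF must report.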

>  PageIsVerified() is
> not that flexible so we could change it to report a status depending
> on the error faced (checksum, header or zero-only) on top of getting a
> checksum.  Now, I am not completely sure either that it is worth the
> complication to return in the SRF of the check function the expected
> checksum.

It seemed to me that it could be something useful to get with this
kind of tool.  You may be able to recover a corrupted page from
backup/WAL if the checksum itself wasn't corrupted so that you know
what to look for.  There would be a lot of caveats and low level work,
but if you're desperate enough that may save you a bit of time.

>  So, wouldn't it be better to just rely on PageIsVerified()
> (well it's rather-soon-to-be extended flavor) for the checksum check,
> the header sanity check and the zero-only check?  My point is to keep
> a single entry point for all the page sanity checks, so as base
> backups, your patch, and the buffer manager apply the same things.
> Base backups got that partially wrong because the base backup code
> wants to keep control of the number of failures and the error
> reports.

I'm fine with it.

>  Your patch actually wishes to report a failure, but you want
> to add more context with the fork name and such.  Another option we
> could use here is to add an error context so as PageIsVerified()
> reports the WARNING, but the SQL function provides more context with
> the block number and the relation involved in the check.

Also, returning actual data rather than a bunch of warnings is way
easier to process for client code.  And as mentioned previously having
an API that returns a list of corrupted blocks could be useful for a
single-page recovery feature.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote:
> On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:
>> Somewhat related to the first point, NoComputedChecksum exists
>> because, as the current patch is shaped, we need to report an existing
>> checksum to the user even for the zero-only case.
>
> I'm not sure that I understand your point.  The current patch only
> returns something to users when there's a corruption.  If by
> "zero-only case" you mean "page corrupted in a way that PageIsNew()
> returns true while not being all zero", then it's a corrupted page and
> then obviously yes it needs to be returned to users.

Sorry for the confusion, this previous paragraph was confusing.  I
meant that the reason why NoComputedChecksum exists is that we give up
on attempting to calculate the checksum if we detect that the page is
new, but failed the zero-only test, and that we want the users to know
about this special case by setting this expected checksum to NULL for
the SRF.

>>  So, wouldn't it be better to just rely on PageIsVerified()
>> (well it's rather-soon-to-be extended flavor) for the checksum check,
>> the header sanity check and the zero-only check?  My point is to keep
>> a single entry point for all the page sanity checks, so as base
>> backups, your patch, and the buffer manager apply the same things.
>> Base backups got that partially wrong because the base backup code
>> wants to keep control of the number of failures and the error
>> reports.
>
> I'm fine with it.

Thanks.

>>  Your patch actually wishes to report a failure, but you want
>> to add more context with the fork name and such.  Another option we
>> could use here is to add an error context so as PageIsVerified()
>> reports the WARNING, but the SQL function provides more context with
>> the block number and the relation involved in the check.
>
> Also, returning actual data rather than a bunch of warnings is way
> easier to process for client code.  And as mentioned previously having
> an API that returns a list of corrupted blocks could be useful for a
> single-page recovery feature.

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks.  For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed.  Once the user knows that both do
not match, he/she knows that there is a problem.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote:
> No issues with reporting the block number and the fork type in the SRF
> at least of course as this is helpful to detect the position of the
> broken blocks.  For the checksum found in the header and the one
> calculated (named expected in the patch), I am less sure how to put a
> clear definition on top of that but we could always consider that
> later and extend the SRF as needed.  Once the user knows that both do
> not match, he/she knows that there is a problem.

So, I have reviewed your patch set, and heavily reworked the logic to
be more consistent on many things, resulting in a largely simplified
patch without sacrificing its usefulness:
- The logic of the core routine of bufmgr.c is unchanged.  I have
simplified it a bit though by merging the subroutines that were part
of the patch.  SMgrRelation is used as argument instead of a
Relation.  That's more consistent with the surroundings.  The initial
read of a page without locks is still on the table as an extra
optimization, though I am not completely sure if this should be part
of CheckBuffer() or not.  I also thought about the previous benchmarks
and I think that not using the most-optimized version improved performance,
because it slowed the successive runs of the SQL functions, reducing
the back-pressure on the partition locks (we held one of them at a
time for a longer period, meaning that the other 127 ran free for
a longer time).  Please note that this part still referred to a
"module", which was incorrect.
- Removal of the expected and found checksums from the SRF of the
function.  Based on my recent business with the page checks for base
backups, I have arrived at the conclusion that the SRF should return
data that we can absolutely trust, and the minimum I think we have to
trust here is if a given page is thought as safe or not, considering
all the sanity checks done by PageIsVerified() as the main entry
point for everything.  This has led to a bit of confusion with the
addition of NoComputedChecksum for a page that was empty as of the
initial version of the patch, so it happens that we don't need it anymore.
- Removal of the dependency with checksums for this feature.  While
simplifying the code, I have noticed that this feature can also be
beneficial for clusters that do not have data checksums, as
PageIsVerified() is perfectly able to run some page header checks and
the zero-page case.  That's of course less useful than having the
checksums, but there is no need to add a dependency here.  The file
for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.
- The function is changed to return no tuples if the relkind is not
supported, and the same applies for temporary relations.  That's more
consistent with other system functions like the ones in charge of
partition information, and this makes full scans of pg_class much
easier to work with.  Temporary tables were not handled correctly
anyway as these are in local buffers, but the use case of this
function in this case is not really obvious to me.
- Having the forknum in the SRF is kind of confusing, as the user
would need to map a number with the physical on-disk name.  Instead, I
have made the choice to return the *path* of the corrupted file with a
block number.  This way, an operator can know immediately where a
problem comes from.  The patch does not append the segment number, and
I am not sure if we should actually do that, but adding it is
straight-forward as we have the block number.  There is a dependency
with table AMs here as well, as this goes down to fd.c, explaining why
I have not added it for now.
- I really don't know what kind of default ACL should apply for such
functions, but I am sure that SCAN_TABLES is not what we are looking
for here, and there is nothing preventing us from having a safe
default from the start of times, so I moved the function to be
superuser-only by default, and GRANT can be used to allow its
execution to other roles.  We could relax that in the future, of
course, this can be discussed separately.
- The WARNING report for each block found as corrupted gains an error
context, as a result of a switch to PageIsVerified(), giving a user
all the information needed in the logs on top of the result in the
SRF.  That's useful as well if combined with CHECK_FOR_INTERRUPTS(),
and I got to wonder if we should have some progress report for this
stuff, though that's a separate discussion.
- The function is renamed to something less generic,
pg_relation_check_pages(), and I have reduced the number of functions
from two to one, where the user can specify the fork name with a new
option.  The default of NULL means that all the forks of a relation
are checked.
- The TAP tests are rather bulky.  I have moved all the non-corruption
test cases into a new SQL test file.  That's useful for people willing
to do some basic sanity checks with a non-default table AM.  At least
it provides a minimum coverage.  I have not completely finished its
review, but I have done some work.  Doing some debugging of
corrupt_and_test_block() was proving to be rather difficult as the
same test names are assigned multiple times.  I am tempted to move
this test suite to src/test/recovery/ instead.
- Reworked the docs and some comments.
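The fork selection and path reporting described above can be sketched in a standalone mock.  The fork names and the base/dboid/relfilenode layout match the on-disk conventions, but the helpers themselves are illustrative, not the patch's code (which goes through relpath()):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Fork names as they appear on disk; the ordering mirrors the
 * backend's common/relpath.h, everything else is a simplified mock.
 */
enum
{
    MAIN_FORKNUM, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, INIT_FORKNUM, N_FORKS
};
static const char *const forkNames[N_FORKS] = {"main", "fsm", "vm", "init"};

/* A NULL fork argument means "check every fork of the relation". */
static bool
fork_requested(const char *arg, int forknum)
{
    return arg == NULL || strcmp(arg, forkNames[forknum]) == 0;
}

/*
 * Path reported in the SRF for a corrupted block, simplified to a
 * permanent relation in the default tablespace.  No segment number
 * is appended, matching the behavior described above.
 */
static void
corrupted_block_path(char *out, size_t outsz,
                     unsigned dboid, unsigned relfilenode, int forknum)
{
    if (forknum == MAIN_FORKNUM)
        snprintf(out, outsz, "base/%u/%u", dboid, relfilenode);
    else
        snprintf(out, outsz, "base/%u/%u_%s", dboid, relfilenode,
                 forkNames[forknum]);
}
```

An operator reading the SRF then gets, say, a path plus a block number and can locate the broken file directly, without mapping a fork number by hand.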

That's quite a lot of changes, and I think that most of the C code,
the main tests in src/test/regress/ and the docs are getting in a
rather committable state.  The TAP tests emulating corruptions still
need a closer lookup (say, check_pg_stat_database_nb_error() should
have an error prefix at least).  The portions in bufmgr.c and the rest
should of course be split into two separate commits, that can easily
be done.  And the code needs an indentation run and a catalog bump.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Oct 23, 2020 at 3:28 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote:
> > No issues with reporting the block number and the fork type in the SRF
> > at least of course as this is helpful to detect the position of the
> > broken blocks.  For the checksum found in the header and the one
> > calculated (named expected in the patch), I am less sure how to put a
> > clear definition on top of that but we could always consider that
> > later and extend the SRF as needed.  Once the user knows that both do
> > not match, he/she knows that there is a problem.
>
> So, I have reviewed your patch set, and heavily reworked the logic to
> be more consistent on many things, resulting in a largely simplified
> patch without sacrificing its usefulness:

Thanks!

> - Removal of the dependency with checksums for this feature.  While
> simplifying the code, I have noticed that this feature can also be
> beneficial for clusters that do not have data checksums, as
> PageIsVerified() is perfectly able to run some page header checks and
> the zero-page case.  That's of course less useful than having the
> checksums, but there is no need to add a dependency here.  The file
> for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.

I agree.  However I'm assuming that this refactor is relying on the
not yet committed patch (in the nearby thread dealing with base backup
checksums check) to also refactor PageIsVerified?  As all the logic
you removed was done to avoid spamming a lot of warnings when calling
the function.

> - The function is changed to return no tuples if the relkind is not
> supported, and the same applies for temporary relations.  That's more
> consistent with other system functions like the ones in charge of
> partition information, and this makes full scans of pg_class much
> easier to work with.  Temporary tables were not handled correctly
> anyway as these are in local buffers, but the use case of this
> function in this case is not really obvious to me.

Agreed

> - Having the forknum in the SRF is kind of confusing, as the user
> would need to map a number with the physical on-disk name.  Instead, I
> have made the choice to return the *path* of the corrupted file with a
> block number.  This way, an operator can know immediately where a
> problem comes from.  The patch does not append the segment number, and
> I am not sure if we should actually do that, but adding it is
> straight-forward as we have the block number.  There is a dependency
> with table AMs here as well, as this goes down to fd.c, explaining why
> I have not added it for now.

That's a clear improvement, thanks!

> - I really don't know what kind of default ACL should apply for such
> functions, but I am sure that SCAN_TABLES is not what we are looking
> for here, and there is nothing preventing us from having a safe
> default from the start of times, so I moved the function to be
> superuser-only by default, and GRANT can be used to allow its
> execution to other roles.  We could relax that in the future, of
> course, this can be discussed separately.

I don't have a strong opinion here, SCAN_TABLES was maybe not ideal.
No objections.

> - The WARNING report for each block found as corrupted gains an error
> context, as a result of a switch to PageIsVerified(), giving a user
> all the information needed in the logs on top of the result in the
> SRF.  That's useful as well if combined with CHECK_FOR_INTERRUPTS(),
> and I got to wonder if we should have some progress report for this
> stuff, though that's a separate discussion.

Mmm, is it really an improvement to report warnings during this
function execution?  Note also that PageIsVerified as-is won't report
a warning if a page is found as PageIsNew() but isn't actually all
zero, while still being reported as corrupted by the SRF.

Have you also considered that it's possible to execute
pg_relation_check_pages with ignore_checksum_failure = on?  That's
evidently a bad idea, but doing so would report some of the data
corruption as warnings while still not reporting anything in the SRF.

Having some progress report would be nice to have, but +1 to have a
separate discussion for that.

> - The function is renamed to something less generic,
> pg_relation_check_pages(), and I have reduced the number of functions
> from two to one, where the user can specify the fork name with a new
> option.  The default of NULL means that all the forks of a relation
> are checked.

Ok.

> - The TAP tests are rather bulky.  I have moved all the non-corruption
> test cases into a new SQL test file.  That's useful for people willing
> to do some basic sanity checks with a non-default table AM.  At least
> it provides a minimum coverage.

Agreed



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:
> I agree.  However I'm assuming that this refactor is relying on the
> not yet committed patch (in the nearby thread dealing with base backup
> checksums check) to also refactor PageIsVerified?  As all the logic
> you removed was done to avoid spamming a lot of warnings when calling
> the function.

Yeah, it should use a refactored version, but I was as well in the
mood of looking at version based on what we have now on HEAD.  Even if
I am not completely clear where the patch for page verification and
base backups will go, I was thinking as well to do the refactoring
introducing PageIsVerifiedExtended() first, before considering the
next steps for this thread.  It seems to me that the path where we
generate no WARNINGs at all makes the whole experience more consistent
for the user with this function.

> Mmm, is it really an improvement to report warnings during this
> function execution?  Note also that PageIsVerified as-is won't report
> a warning if a page is found as PageIsNew() but isn't actually all
> zero, while still being reported as corrupted by the SRF.

Yep, joining the point of above to just have no WARNINGs at all.

> Have you also considered that it's possible to execute
> pg_relation_check_pages with ignore_checksum_failure = on?  That's
> evidently a bad idea, but doing so would report some of the data
> corruption as warnings while still not reporting anything in the SRF.

Yeah, I thought about that as well, but I did not see a strong
argument against preventing this behavior either, even if it sounds
a bit strange.  We could always tune that later depending on the
feedback.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:
> On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:
>> Mmm, is it really an improvement to report warnings during this
>> function execution?  Note also that PageIsVerified as-is won't report
>> a warning if a page is found as PageIsNew() but isn't actually all
>> zero, while still being reported as corrupted by the SRF.
>
> Yep, joining the point of above to just have no WARNINGs at all.

Now that we have d401c57, I got to consider this one more, and opted
for not generating a WARNING for now.  Hence, PageIsVerifiedExtended()
has its WARNING disabled here, but we still report a checksum failure
in it.

I have spent some time reviewing the tests, as I felt they were
bulky.  In the reworked version attached, I have reduced the number of
tests by half, without reducing the coverage, mainly:
- Removed all the stderr and the return code tests, as we always
expected the commands to succeed, and safe_psql() can do the job
already.
- Merged the queries using pg_relation_check_pages into a single
routine, with the expected output (set of broken pages returned in the
SRF) in the arguments.
- Added some prefixes to the tests, to generate unique test names.
That makes debugging easier.
- The query on pg_stat_database is run once at the beginning, once at
the end with the number of checksum failures correctly updated.
- Added comments to document all the routines, and renamed some of
them mostly for consistency.
- Skipped system relations in the scan of pg_class, as they made the
test more costly for nothing.
- I ran some tests on Windows, just-in-case.

I have also added a SearchSysCacheExists1() to double-check if the
relation is missing before opening it, added a
CHECK_FOR_INTERRUPTS() within the main check loop (where the error
context is really helpful), indented the code, bumped the catalogs
(mostly a self-reminder), etc.

So, what do you think?
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Tue, Oct 27, 2020 at 3:07 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:
> > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:
> >> Mmm, is it really an improvement to report warnings during this
> >> function execution?  Note also that PageIsVerified as-is won't report
> >> a warning if a page is found as PageIsNew() but isn't actually all
> >> zero, while still being reported as corrupted by the SRF.
> >
> > Yep, joining the point of above to just have no WARNINGs at all.
>
> Now that we have d401c57, I got to consider this one more, and opted
> for not generating a WARNING for now.  Hence, PageIsVerifiedExtended()
> has its WARNING disabled here, but we still report a checksum failure
> in it.

Great, that's also what I had in mind.

> I have spent some time reviewing the tests, as I felt they were
> bulky.  In the reworked version attached, I have reduced the number of
> tests by half, without reducing the coverage, mainly:
> - Removed all the stderr and the return code tests, as we always
> expected the commands to succeed, and safe_psql() can do the job
> already.
> - Merged the queries using pg_relation_check_pages into a single
> routine, with the expected output (set of broken pages returned in the
> SRF) in the arguments.
> - Added some prefixes to the tests, to generate unique test names.
> That makes debugging easier.
> - The query on pg_stat_database is run once at the beginning, once at
> the end with the number of checksum failures correctly updated.
> - Added comments to document all the routines, and renamed some of
> them mostly for consistency.
> - Skipped system relations in the scan of pg_class, as they made the
> test more costly for nothing.
> - I ran some tests on Windows, just-in-case.
>
> I have also added a SearchSysCacheExists1() to double-check if the
> relation is missing before opening it, added a
> CHECK_FOR_INTERRUPTS() within the main check loop (where the error
> context is really helpful), indented the code, bumped the catalogs
> (mostly a self-reminder), etc.
>
> So, what do you think?

I think it's also worth noting that the IOLock is now acquired just
before getting the buffer state, and released after the read (or after
finding that the buffer is dirty).  This is consistent with how it's
done elsewhere, so I'm fine.

Other than that I'm quite happy with the changes you made, thanks a lot!



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote:
> I think it's also worth noting that the IOLock is now acquired just
> before getting the buffer state, and released after the read (or after
> finding that the buffer is dirty).  This is consistent with how it's
> done elsewhere, so I'm fine.

Consistency is the point.  This API should be safe to use by design.
I have done some extra performance tests similar to what I did
upthread, and this version showed similar numbers.

> Other than that I'm quite happy with the changes you made, thanks a lot!

Thanks for confirming.  I have gone through the whole set today,
split the thing into two commits and applied them.  We had
buildfarm member florican complain about a mistake in one of the
GetDatum() calls that I took care of already, and there is nothing
else on my radar.
--
Michael

Attachment

RE: Online checksums verification in the backend

From
"Shinoda, Noriyoshi (PN Japan A&PS Delivery)"
Date:
Hi,

I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few
blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend
terminated abnormally. The attached file is the operation log.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Michael Paquier [mailto:michael@paquier.xyz]
Sent: Wednesday, October 28, 2020 2:09 PM
To: Julien Rouhaud <rjuju123@gmail.com>
Cc: Justin Pryzby <pryzby@telsasoft.com>; Masahiko Sawada <masahiko.sawada@2ndquadrant.com>; Robert Haas
<robertmhaas@gmail.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Masahiko Sawada
<sawada.mshk@gmail.com>
Subject: Re: Online checksums verification in the backend

On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote:
> I think it's also worth noting that the IOLock is now acquired just
> before getting the buffer state, and released after the read (or after
> finding that the buffer is dirty).  This is consistent with how it's
> done elsewhere, so I'm fine.

Consistency is the point.  This API should be safe to use by design.
I have done some extra performance tests similar to what I did upthread, and this version showed similar numbers.

> Other than that I'm quite happy with the changes you made, thanks a lot!

Thanks for confirming.  I have gone through the whole set today, split the thing into two commits and applied them.
We had buildfarm member florican complain about a mistake in one of the
GetDatum() calls that I took care of already, and there is nothing else on my radar.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
Hello,

On Thu, Oct 29, 2020 at 7:52 AM Shinoda, Noriyoshi (PN Japan A&PS
Delivery) <noriyoshi.shinoda@hpe.com> wrote:
>
> Hi,
>
> I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few
> blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend
> terminated abnormally. The attached file is the operation log.

Thanks for the report!

As far as I can see the issue is that the pfree(path) in
check_relation_fork() should be outside the for loop.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Thu, Oct 29, 2020 at 08:12:42AM +0800, Julien Rouhaud wrote:
> As far as I can see the issue is that the pfree(path) in
> check_relation_fork() should be outside the for loop.

Yes, this would be triggered if more than one page is found as broken
in a single SRF.  Fixed, thanks Shinoda-san.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

On 2020-10-28 14:08:52 +0900, Michael Paquier wrote:
> Thanks for confirming.  I have gone through the whole set today,
> splitted the thing into two commits and applied them.  We had
> buildfarm member florican complain about a mistake in one of the
> GetDatum() calls that I took care of already, and there is nothing
> else on my radar.

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk?  Seriously?


    /* see if the block is in the buffer pool or not */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&buf_tag, buf_hash);
    if (buf_id >= 0)
    {
        uint32        buf_state;

        /*
         * Found it.  Now, retrieve its state to know what to do with it, and
         * release the pin immediately.  We do so to limit overhead as much as
         * possible.  We keep the shared LWLock on the target buffer mapping
         * partition for now, so this buffer cannot be evicted, and we acquire
         * an I/O Lock on the buffer as we may need to read its contents from
         * disk.
         */
        bufdesc = GetBufferDescriptor(buf_id);

        LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
        buf_state = LockBufHdr(bufdesc);
        UnlockBufHdr(bufdesc, buf_state);

        /* If the page is dirty or invalid, skip it */
        if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
        {
            LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
            LWLockRelease(partLock);
            return true;
        }

        /* Read the buffer from disk, with the I/O lock still held */
        smgrread(smgr, forknum, blkno, buffer);
        LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
    }
    else
    {
        /*
         * Simply read the buffer.  There's no risk of modification on it as
         * we are holding the buffer pool partition mapping lock.
         */
        smgrread(smgr, forknum, blkno, buffer);
    }


The justification in the in-shared-buffers case seems to completely
mis-judge costs too:
         * Found it.  Now, retrieve its state to know what to do with it, and
         * release the pin immediately.  We do so to limit overhead as much as
         * possible.  We keep the shared LWLock on the target buffer mapping
         * partition for now, so this buffer cannot be evicted, and we acquire
         * an I/O Lock on the buffer as we may need to read its contents from
         * disk.
a pin is cheap. Holding the partition lock is not.


Also, using char[BLCKSZ] as a buffer isn't ok. This should use
PGAlignedBlock:
/*
 * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
 * holding a page buffer, if that page might be accessed as a page and not
 * just a string of bytes.  Otherwise the variable might be under-aligned,
 * causing problems on alignment-picky hardware.  (In some places, we use
 * this to declare buffers even though we only pass them to read() and
 * write(), because copying to/from aligned buffers is usually faster than
 * using unaligned buffers.)  We include both "double" and "int64" in the
 * union to ensure that the compiler knows the value must be MAXALIGN'ed
 * (cf. configure's computation of MAXIMUM_ALIGNOF).
 */
typedef union PGAlignedBlock


I think this needs to be quickly reworked or reverted.


Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:
>         LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
>         buf_state = LockBufHdr(bufdesc);
>         UnlockBufHdr(bufdesc, buf_state);
> 
>         /* If the page is dirty or invalid, skip it */
>         if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)

This is weird as well. What is this supposed to do? Just locking and
unlocking a buffer header doesn't do squat? There's no guarantee that
the flags haven't changed by this point, so you could just as well not
acquire the buffer header lock.

Also, why are pages without a valid tag ignored? I can follow the
argument for skipping it in the DIRTY case, but that doesn't apply for
BM_TAG_VALID?

Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk?  Seriously?

Also, uh, I don't think the locking of the buffer table provides you
with the full guarantees CheckBuffer() seems to assume:

 * Check the state of a buffer without loading it into the shared buffers. To
 * avoid torn pages and possible false positives when reading data, a shared
 * LWLock is taken on the target buffer pool partition mapping, and we check
 * if the page is in shared buffers or not.  An I/O lock is taken on the block
 * to prevent any concurrent activity from happening.

this doesn't actually prevent all concurrent write IO, unless you hold
an appropriate lock on the relation. There's a few places that use
smgrwrite()/smgrextend() to write out data bypassing shared buffers.

Maybe that isn't a problem for the uses CheckBuffer() is envisioned
for, but that'd need a pretty detailed explanation as to when it's safe
to use CheckBuffer() for which blocks.

Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
Hi,

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk?  Seriously?

The initial implementation had a different approach, reading the buffer once
without holding the buffer mapping lock (which could lead to some false
positive in some unlikely scenario), and only if a corruption is detected the
read is done once again *while holding the buffer mapping lock* to ensure it's
not a false positive.  Some benchmarking showed that the performance was worse,
so we dropped that optimisation.  Should we go back to something like that or
do you have a better way to ensure a consistent read of a buffer which isn't in
shared buffers?

> a pin is cheap. Holding the partition lock is not.

> The justification in the in-shared-buffers case seems to completely
> mis-judge costs too:
>                  * Found it.  Now, retrieve its state to know what to do with it, and
>                  * release the pin immediately.  We do so to limit overhead as much as
>                  * possible.  We keep the shared LWLock on the target buffer mapping
>                  * partition for now, so this buffer cannot be evicted, and we acquire
>                  * an I/O Lock on the buffer as we may need to read its contents from
>                  * disk.
> a pin is cheap. Holding the partition lock is not.

I clearly did a poor job in that case.  Will fix.

> Also, using char[BLCKSZ] as a buffer isn't ok. This should use
> PGAlignedBlock:

I wasn't aware of it, I will fix.

> >               LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
> >               buf_state = LockBufHdr(bufdesc);
> >               UnlockBufHdr(bufdesc, buf_state);
> >
> >               /* If the page is dirty or invalid, skip it */
> >               if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
>
> This is weird as well. What is this supposed to do? Just locking and
> unlocking a buffer header doesn't do squat? There's no guarantee that
> the flags haven't changed by this point, so you could just as well not
> acquire the buffer header lock.

This is using the same approach as e.g. WaitIO() to get the state.  I agree
that the state can change after the buffer header lock has been released, but
I think that's something out of scope.  The only guarantee that we can give is
that the database (or subset of relations checked) was healthy at the time the
check was started, provided that your cluster survive the checkpoint happening
after the check ended.  I don't see how we can do better than that.

> Also, why are pages without a valid tag ignored? I can follow the
> argument for skipping it in the DIRTY case, but that doesn't apply for
> BM_TAG_VALID?

AFAICT pages that aren't BM_TAG_VALID are newly allocated pages.
Those aren't entirely initialized yet, and they'll eventually be
written and flushed.

> Also, uh, I don't think the locking of the buffer table provides you
> with the full guarantees CheckBuffer() seems to assume:
>
>  * Check the state of a buffer without loading it into the shared buffers. To
>  * avoid torn pages and possible false positives when reading data, a shared
>  * LWLock is taken on the target buffer pool partition mapping, and we check
>  * if the page is in shared buffers or not.  An I/O lock is taken on the block
>  * to prevent any concurrent activity from happening.
>
> this doesn't actually prevent all concurrent write IO, unless you hold
> an appropriate lock on the relation. There's a few places that use
> smgrwrite()/smgrextend() to write out data bypassing shared buffers.
>
> Maybe that isn't a problem for the uses CheckBuffer() is envisioned
> for, but that'd need a pretty detailed explanation as to when it's safe
> to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
relations, during creation.  Those relations shouldn't be visible to the caller
snapshot, so it should be safe.  I can add a comment for that if I'm not
mistaken.

For concurrent smgrextend(), we read the relation size at the beginning of the
function, so we shouldn't read newly allocated blocks.  But you're right that
it's still possible to get the size that includes a newly allocated block
that can be concurrently written.  We can avoid that by holding a
LOCKTAG_RELATION_EXTEND lock when reading the relation size.  Would that be ok?



Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:
> On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
> > The code does IO while holding the buffer mapping lock. That seems
> > *entirely* unacceptable to me. That basically locks 1/128 of shared
> > buffers against concurrent mapping changes, while reading data that is
> > likely not to be on disk?  Seriously?
>
> The initial implementation had a different approach, reading the buffer once
> without holding the buffer mapping lock (which could lead to some false
> positive in some unlikely scenario), and only if a corruption is detected the
> read is done once again *while holding the buffer mapping lock* to ensure it's
> not a false positive.  Some benchmarking showed that the performance was worse,
> so we dropped that optimisation.  Should we go back to something like that or
> do you have a better way to ensure a consistent read of a buffer which isn't in
> shared buffers?

I suspect that you're gonna need something quite different than what the
function is doing right now. Not because such a method will be faster in
isolation, but because there's a chance to have it correct and not have
a significant performance impact onto the rest of the system.

I've not thought about it in detail yet. I suspect you'll need to
ensure there is a valid entry in the buffer mapping table for the buffer
you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
you're going to prevent concurrent IO from starting until your part is
done.


> > Also, why are pages without a valid tag ignored? I can follow the
> > argument for skipping it in the DIRTY case, but that doesn't apply for
> > BM_TAG_VALID?
>
> AFAICT pages that aren't BM_TAG_VALID are newly allocated pages.
> Those aren't entirely initialized yet, and they'll eventually be
> written and flushed.

When a page is being read there's a period when the buffer is without
BM_TAG_VALID. It's quite possible that the locking prevents this case
from being reachable - but in that case you shouldn't just accept it as
something to be skipped...


> > Also, uh, I don't think the locking of the buffer table provides you
> > with the full guarantees CheckBuffer() seems to assume:
> >
> >  * Check the state of a buffer without loading it into the shared buffers. To
> >  * avoid torn pages and possible false positives when reading data, a shared
> >  * LWLock is taken on the target buffer pool partition mapping, and we check
> >  * if the page is in shared buffers or not.  An I/O lock is taken on the block
> >  * to prevent any concurrent activity from happening.
> >
> > this doesn't actually prevent all concurrent write IO, unless you hold
> > an appropriate lock on the relation. There's a few places that use
> > smgrwrite()/smgrextend() to write out data bypassing shared buffers.
> >
> > Maybe that isn't a problem for the uses CheckBuffer() is envisioned
> > for, but that'd need a pretty detailed explanation as to when it's safe
> > to use CheckBuffer() for which blocks.
>
> AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
> relations, during creation.

That may be the case in core right now, but for one, there
definitely are extensions going through smgrwrite() without using the
buffer pool. Essentially, what you are saying is that the introduction
of CheckBuffer() altered what smgrwrite() is allowed to be used for,
without having discussed or documented that.

Before this an AM/extension could just use smgrwrite() to write data not
in shared buffers, as long as a locking scheme is used that prevents
multiple backends from doing that at the same time (trivially:
AccessExclusiveLock).


> Those relations shouldn't be visible to the caller
> snapshot, so it should be safe.  I can add a comment for that if I'm not
> mistaken.

There's no comment warning that you shouldn't use CheckBuffer() to check
every buffer in shared buffers, or every relfilenode on disk. The latter
would be quite a reasonable thing, given it'd avoid needing to connect
to every database etc.


> For concurrent smgrextend(), we read the relation size at the beginning of the
> function, so we shouldn't read newly allocated blocks.  But you're right that
> it's still possible to get the size that includes a newly allocated block
> that can be concurrently written.  We can avoid that be holding a
> LOCKTAG_RELATION_EXTEND lock when reading the relation size.  Would that be ok?

That could possibly work - but currently CheckBuffer() doesn't get a
relation, nor do the comments explain that it has to be a relation
in the current database or anything.

I hadn't yet looked at the caller - I just started looking at
CheckBuffer() because it caused compilation failures after rebasing
my aio branch onto master (there's no IO locks anymore).



Looking at the caller:
- This is not a concurrency safe pattern:

    /* Check if relation exists. leaving if there is no such relation */
    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
        return;

    relation = relation_open(relid, AccessShareLock);

  there's a pretty obvious time-to-check-time-to-use danger here.

- pg_relation_check_pages()'s docs say "valid enough to safely be loaded
  into the server's shared buffers". I think that's overpromising by a
  lot. It sounds like it verifies that the page cannot cause a crash or
  such when accessed - but it obviously does no such thing.

- Why does check_one_relation() *silently* ignore when it's being
  passed a temporary table, or a relkind without storage?

- I don't think it's good that check_one_relation() releases relation
  locks after access, but I know that others think that's fine (I think
  it's only fine for catalog relations).

- I realize permission to pg_relation_check_pages() is not granted to
  non-superusers by default, but shouldn't it still perform relation
  access checks?

- why does check_relation_fork() pstrdup the path?

Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Julien Rouhaud
Date:
On Fri, Oct 30, 2020 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:
> > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
> > > The code does IO while holding the buffer mapping lock. That seems
> > > *entirely* unacceptable to me. That basically locks 1/128 of shared
> > > buffers against concurrent mapping changes, while reading data that is
> > > likely not to be on disk?  Seriously?
> >
> > The initial implementation had a different approach, reading the buffer once
> > without holding the buffer mapping lock (which could lead to some false
> > positive in some unlikely scenario), and only if a corruption is detected the
> > read is done once again *while holding the buffer mapping lock* to ensure it's
> > not a false positive.  Some benchmarking showed that the performance was worse,
> > so we dropped that optimisation.  Should we go back to something like that or
> > do you have a better way to ensure a consistent read of a buffer which isn't in
> > shared buffers?
>
> I suspect that you're gonna need something quite different than what the
> function is doing right now. Not because such a method will be faster in
> isolation, but because there's a chance to have it correct and not have
> a significant performance impact onto the rest of the system.
>
> I've not thought about it in detail yet. I suspect you'll need to
> ensure there is a valid entry in the buffer mapping table for the buffer
> you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
> you're going to prevent concurrent IO from starting until your part is
> done.

So I'm assuming that the previous optimization, which avoided doing
an IO while holding a buffer mapping lock almost every time, isn't an
option?  In that case, I don't see any other option than reverting the
patch and discussing a new approach.



Re: Online checksums verification in the backend

From
Andres Freund
Date:
On 2020-10-30 11:58:13 +0800, Julien Rouhaud wrote:
> So I'm assuming that the previous optimization, which avoided doing
> an IO while holding a buffer mapping lock almost every time, isn't an
> option?  In that case, I don't see any other option than reverting the
> patch and discussing a new approach.

I think it's pretty much *never* OK to do IO while holding a buffer
mapping lock. You're locking a significant fraction of shared buffers
over IO. That's just not OK.  Don't think there's any place doing so
currently either.



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:
> I think it's pretty much *never* OK to do IO while holding a buffer
> mapping lock. You're locking a significant fraction of shared buffers
> over IO. That's just not OK.  Don't think there's any place doing so
> currently either.

There is no place doing that on HEAD.

This specific point was mentioned in the first message of this thread,
7th paragraph.  That was a long thread, so it is easy to miss:
https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

I am wondering what you have in mind regarding the use of
BM_IO_IN_PROGRESS or a similar flag.  Wouldn't that imply some
consequences for other existing buffers in the table, like a possible
eviction?  I'd like to think that we should not do any manipulation of
the buffer tables in this case.  Hence, in order to prevent a
concurrent activity to load in shared buffers the page currently
checked on disk, I got to think that we would need something new here,
like a filtering hash table that would be checked each time a backend
tries to insert an entry into the buffer tables.  That's something I
was wondering here:
https://www.postgresql.org/message-id/20200316030638.GA2331@paquier.xyz
I called that a preemptive lock, but you could also call that a
discard filter or a virtual pin, just something to mean that a page
locked this way cannot be loaded into the shared buffers.  I'd like to
think that this should not touch the existing buffer table, but it
would impact the performance when attempting to insert an entry in the
tables, as anything would need to be pre-checked.

Assuming that we could make this thing work without holding the
partition lock, and assuming that we only hold a share lock on the
relation, we have two cases:
1) If the buffer is in shared buffers, we have the APIs to solve that
by using a pin, unlock the partition, and then do the I/O.  (Still
that's unsafe with the smgrwrite() argument?)
2) If the buffer is not in shared buffers, we don't have what it takes
to solve the problem yet.  But even if we solve this problem, we will
never really be sure that this is entirely safe, as per the argument
with concurrent smgrwrite() calls.  Current in-core code assumes that
this can happen only for init forks of unlogged relations which would
not be visible yet in the backend doing a page check, still it can be
really easy to break this assumption with any new code added by a new
feature.
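
In toy form, the safe ordering for case 1 above looks like the
following (the lock is modeled as a plain flag and every name is a
stand-in, not a PostgreSQL API):

```c
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of case 1: look up the buffer under the mapping partition
 * lock, pin it, release the lock, and only then do the I/O.
 */
#define TOY_PAGE_SIZE 64

typedef struct ToyBuffer
{
    int  refcount;              /* pin count: nonzero blocks eviction */
    char data[TOY_PAGE_SIZE];
} ToyBuffer;

static bool partition_locked = false;

static ToyBuffer *
lookup_and_pin(ToyBuffer *buf)
{
    partition_locked = true;    /* LWLockAcquire(partLock, LW_SHARED) */
    buf->refcount++;            /* pin: the buffer can no longer go away */
    partition_locked = false;   /* LWLockRelease(partLock), before any I/O */
    return buf;
}

static bool
read_pinned(const ToyBuffer *buf, char *dest)
{
    /*
     * The point of the ordering: no I/O ever happens while the
     * mapping lock is held, so other lookups are never blocked
     * behind a disk read.
     */
    if (partition_locked || buf->refcount <= 0)
        return false;
    memcpy(dest, buf->data, TOY_PAGE_SIZE);
    return true;
}
```
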

These arguments boil down to reducing the scope of CheckBuffer() as
follows:
- Use an AEL on the relation, pass down a Relation instead of
SMgrRelation, and add on the way an assertion to make sure that the
caller holds an AEL on the relation.  I wanted to study the
possibility of using that stuff for base backups, but if you bring the
concurrent
smgrwrite() calls into the set of possibilities this shuts down the
whole study from the start.
- It is still useful IMO to check whether a page is in shared buffers:
if it is dirty we just skip it and rely on the next checkpoint to
flush it.  It is also useful to check whether the on-disk data is good
when the page is not dirty, as the page could have gone rogue on disk
while the system was up for weeks.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

I'm a bit limited writing - one handed for a while following an injury
on Friday...

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:
> > I think its pretty much *never* OK to do IO while holding a buffer
> > mapping lock. You're locking a significant fraction of shared buffers
> > over IO. That's just not OK.  Don't think there's any place doing so
> > currently either.
> 
> There is no place doing that on HEAD.

Err?

    /* see if the block is in the buffer pool or not */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&buf_tag, buf_hash);
    if (buf_id >= 0)
    {
...
        /* Read the buffer from disk, with the I/O lock still held */
        smgrread(smgr, forknum, blkno, buffer);
        LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
    }
    else
    {
        /*
         * Simply read the buffer.  There's no risk of modification on it as
         * we are holding the buffer pool partition mapping lock.
         */
        smgrread(smgr, forknum, blkno, buffer);
    }

    /* buffer lookup done, so now do its check */
    LWLockRelease(partLock);

How is this not doing IO while holding a buffer mapping lock?


> This specific point was mentioned in the first message of this thread,
> 7th paragraph.  That was a long thread, so it is easy to miss:
> https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

The code clearly doesn't implement it that way.


> I am wondering what you have in mind regarding the use of
> BM_IO_IN_PROGRESS or a similar flag.  Wouldn't that imply some
> consequences for other existing buffers in the table, like a possible
> eviction?

You'd need exactly one empty buffer for that - it can be reused for the
next to-be-checked buffer.


> I'd like to think that we should not do any manipulation of
> the buffer tables in this case.

Why? It's the way we lock buffers - why is this so special that we
need to do it differently?


> Hence, in order to prevent a
> concurrent activity to load in shared buffers the page currently
> checked on disk, I got to think that we would need something new here,
> like a filtering hash table that would be checked each time a backend
> tries to insert an entry into the buffer tables.

That's going to slow down everything a bit - the mapping already is a
bottleneck.


> 1) If the buffer is in shared buffers, we have the APIs to solve that
> by using a pin, unlock the partition, and then do the I/O.  (Still
> that's unsafe with the smgrwrite() argument?)

That's why you need an appropriate relation lock... Something
CheckBuffer didn't bother to document. It's a restriction, but one we
probably can live with.


> 2) If the buffer is not in shared buffers, we don't have what it takes
> to solve the problem yet.

We do. Set up enough state for the case to be otherwise the same as
the in-s_b case.

> But even if we solve this problem, we will
> never really be sure that this is entirely safe, as per the argument
> with concurrent smgrwrite() calls.  Current in-core code assumes that
> this can happen only for init forks of unlogged relations which would
> not be visible yet in the backend doing a page check, still it can be
> really easy to break this assumption with any new code added by a new
> feature.

It also happens in a few other cases than just init forks. But
visibility & relation locking can take care of that. But you need to
document that. If the locking allows concurrent readers - and
especially concurrent writers, then you can't really use smgrwrite()
for anything but relation extension.

Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote:
> I'm a bit limited writing - one handed for a while following an injury
> on Friday...

Oops.  Take care.

> On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> > There is no place doing that on HEAD.
>
> Err?
>
>     /* see if the block is in the buffer pool or not */
>     LWLockAcquire(partLock, LW_SHARED);
>     buf_id = BufTableLookup(&buf_tag, buf_hash);
>
> [...]
>
> How is this not doing IO while holding a buffer mapping lock?

Well, other than the one we are discussing of course :)

>
>
> > This specific point was mentioned in the first message of this thread,
> > 7th paragraph.  That was a long thread, so it is easy to miss:
> > https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
>
> The code clearly doesn't implement it that way.
>
>
> > I am wondering what you have in mind regarding the use of
> > BM_IO_IN_PROGRESS or a similar flag.  Wouldn't that imply some
> > consequences for other existing buffers in the table, like a possible
> > eviction?
>
> You'd need exactly one empty buffer for that - it can be reused for the
> next to-be-checked buffer.
>
>
> > I'd like to think that we should not do any manipulation of
> > the buffer tables in this case.
>
> Why? It's the way we lock buffers - why is this so special that we
> need to do it differently?
>
>
> > Hence, in order to prevent concurrent activity from loading into
> > shared buffers the page currently being checked on disk, I have
> > come to think that we would need something new here, like a
> > filtering hash table that would be checked each time a backend
> > tries to insert an entry into the buffer tables.
>
> That's going to slow down everything a bit - the mapping already is a
> bottleneck.
>
>
> > 1) If the buffer is in shared buffers, we have the APIs to solve that
> > by using a pin, unlock the partition, and then do the I/O.  (Still
> > that's unsafe with the smgrwrite() argument?)
>
> That's why you need an appropriate relation lock... Something
> CheckBuffer didn't bother to document. It's a restriction, but one
> we probably can live with.
>
>
> > 2) If the buffer is not in shared buffers, we don't have what it takes
> > to solve the problem yet.
>
> We do. Set up enough state for the case to be otherwise the same as
> the in-s_b case.
>
> > But even if we solve this problem, we will
> > never really be sure that this is entirely safe, as per the argument
> > with concurrent smgrwrite() calls.  Current in-core code assumes that
> > this can happen only for init forks of unlogged relations which would
> > not be visible yet in the backend doing a page check, still it can be
> > really easy to break this assumption with any new code added by a new
> > feature.
>
> It also happens in a few other cases than just init forks. But
> visibility & relation locking can take care of that. But you need to
> document that. If the locking allows concurrent readers - and
> especially concurrent writers, then you can't really use smgrwrite()
> for anything but relation extension.

--
Michael

Attachment

Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi

On 2020-11-02 10:45:00 +0900, Michael Paquier wrote:
> On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote:
> > I'm a bit limited writing - one handed for a while following an injury
> > on Friday...
> 
> Oops.  Take care.

Thanks!


> > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> > > There is no place doing that on HEAD.
> > 
> > Err?
> > How is this not doing IO while holding a buffer mapping lock?
> 
> Well, other than the one we are discussing of course :)

I am not following? Were you just confirming that it's not a thing we do?


Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Sun, Nov 01, 2020 at 05:50:06PM -0800, Andres Freund wrote:
> On 2020-11-02 10:45:00 +0900, Michael Paquier wrote:
> > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> > > > There is no place doing that on HEAD.
> > >
> > > Err?
> > > How is this not doing IO while holding a buffer mapping lock?
> >
> > Well, other than the one we are discussing of course :)
>
> I am not following? Were you just confirming that it's not a thing we do?

I meant that this is not done in any place other than the one
introduced by c780a7a.  So we have one place where it happens, and
no places before c780a7a.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Robert Haas
Date:
On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> I think this needs to be quickly reworked or reverted.

I don't like this patch, either. In addition to what Andres mentioned,
CheckBuffer() is a completely-special case mechanism which can't be
reused by anything else. In particular, the amcheck for heap stuff
which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
would really like a way to examine a buffer without risking an error
if PageIsVerified() should happen to fail, but this patch is of
absolutely no use in getting there, because CheckBuffer() doesn't give
the caller any way to access the contents of the buffer. It can only
do the checks that it knows how to do, and that's it. That doesn't
seem like a good design.

I don't like the fact that CheckBuffer() silently skips dirty buffers,
either. The comment should really say that it checks the state of a
buffer without loading it into shared buffers, except sometimes it
doesn't actually check it. That doesn't seem like the behavior users
really want, and it's not clear that there is any really good reason
for it. If the buffer is in shared buffers, we could take a share-lock
on the buffer and copy the contents of the page as it exists on disk,
and then still check it.

It feels really confusing to me that the user-exposed function here is
called pg_relation_check_pages(). How is the user supposed to
understand the difference between what this function does and what the
new verify_heapam() in amcheck does? The answer is that the latter
does far more extensive checks, but this isn't obvious from the SGML
documentation, which says only that the blocks are "verified," as if
an end-user can reasonably be expected to know what that means. It
seems likely to lead users to the belief that if this function passes,
they are in good shape, which is extremely far from being true. Just
look at what PageIsVerified() checks compared to what verify_heapam()
checks.

In fact, I would argue that this functionality ought to live in
amcheck rather than core, though there could usefully be enabling
functions in core.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums verification in the backend

From
Andres Freund
Date:
Hi,

On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
> On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> > I think this needs to be quickly reworked or reverted.

I think it's fairly clear by now that revert is appropriate for now.


> I don't like this patch, either. In addition to what Andres mentioned,
> CheckBuffer() is a completely-special case mechanism which can't be
> reused by anything else. In particular, the amcheck for heap stuff
> which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
> would really like a way to examine a buffer without risking an error
> if PageIsVerified() should happen to fail, but this patch is of
> absolutely no use in getting there, because CheckBuffer() doesn't give
> the caller any way to access the contents of the buffer. It can only
> do the checks that it knows how to do, and that's it. That doesn't
> seem like a good design.

Wouldn't this be better served by having a ReadBufferExtended() flag,
preventing erroring out and zeroing the buffer? I'm not sure that
handling both the case where the buffer contents need to be valid and
the one where it doesn't will make for a good API.


> I don't like the fact that CheckBuffer() silently skips dirty buffers,
> either. The comment should really say that it checks the state of a
> buffer without loading it into shared buffers, except sometimes it
> doesn't actually check it.

Yea, I don't see a good reason for that either. There are reasons for
dirty buffers that aren't WAL-logged - so if the on-disk page is
broken, a standby taken outside pg_basebackup would possibly still end
up with a corrupt on-disk page. Similar with a crash restart.


> If the buffer is in shared buffers, we could take a share-lock
> on the buffer and copy the contents of the page as it exists on disk,
> and then still check it.

Don't think we need a share lock. That still allows the buffer to be
written out (and thus a torn read). What we need is to set
BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
that. And then unset that again, without unsetting
BM_DIRTY/BM_JUST_DIRTIED.
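
As a rough single-process sketch of that sequence (flag values and
names here are made up; the real backend would go through the
StartBufferIO()/TerminateBufferIO() machinery and atomic buffer
state, not plain assignments):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Made-up stand-ins for the bufmgr state bits. */
#define FAKE_BM_DIRTY          (1u << 0)
#define FAKE_BM_JUST_DIRTIED   (1u << 1)
#define FAKE_BM_IO_IN_PROGRESS (1u << 2)

#define FAKE_PAGE_SIZE 8192

typedef struct FakeBufferDesc
{
    uint32_t flags;
    char     page[FAKE_PAGE_SIZE];
} FakeBufferDesc;

/*
 * Take a torn-read-safe copy of a buffer: claim its I/O by setting
 * the in-progress bit (only one backend may hold it, so no concurrent
 * write-out can tear the copy), copy the page, then release the bit
 * without touching the dirty bits.
 */
static bool
copy_page_stable(FakeBufferDesc *buf, char *dest)
{
    if (buf->flags & FAKE_BM_IO_IN_PROGRESS)
        return false;           /* someone else is doing I/O: retry later */

    buf->flags |= FAKE_BM_IO_IN_PROGRESS;
    memcpy(dest, buf->page, FAKE_PAGE_SIZE);
    buf->flags &= ~FAKE_BM_IO_IN_PROGRESS;  /* dirty bits left as-is */
    return true;
}
```
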


> It feels really confusing to me that the user-exposed function here is
> called pg_relation_check_pages(). How is the user supposed to
> understand the difference between what this function does and what the
> new verify_heapam() in amcheck does? The answer is that the latter
> does far more extensive checks, but this isn't obvious from the SGML
> documentation, which says only that the blocks are "verified," as if
> an end-user can reasonably be expected to know what that means. It
> seems likely to lead users to the belief that if this function passes,
> they are in good shape, which is extremely far from being true. Just
> look at what PageIsVerified() checks compared to what verify_heapam()
> checks.

Yea I had similar thoughts, it should just be called
pg_checksum_verify_relation() or something.


> In fact, I would argue that this functionality ought to live in
> amcheck rather than core, though there could usefully be enabling
> functions in core.

I'm not really convinced by this though. It's not really AM-specific -
it works for all types of relations with storage; but I don't really
object either...

Greetings,

Andres Freund



Re: Online checksums verification in the backend

From
Magnus Hagander
Date:
On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
> > It feels really confusing to me that the user-exposed function here is
> > called pg_relation_check_pages(). How is the user supposed to
> > understand the difference between what this function does and what the
> > new verify_heapam() in amcheck does? The answer is that the latter
> > does far more extensive checks, but this isn't obvious from the SGML
> > documentation, which says only that the blocks are "verified," as if
> > an end-user can reasonably be expected to know what that means. It
> > seems likely to lead users to the belief that if this function passes,
> > they are in good shape, which is extremely far from being true. Just
> > look at what PageIsVerified() checks compared to what verify_heapam()
> > checks.
>
> Yea I had similar thoughts, it should just be called
> pg_checksum_verify_relation() or something.
>

+1.


> > In fact, I would argue that this functionality ought to live in
> > amcheck rather than core, though there could usefully be enabling
> > functions in core.
>
> I'm not really convinced by this though. It's not really AM-specific -
> it works for all types of relations with storage; but I don't really
> object either...

Yeah, I'm not sure about that one either. Also what would happen
if/when we get checksums on things that aren't even relations? (though
maybe that goes for other parts of amcheck at some point as well?)


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Tue, Nov 03, 2020 at 09:31:20AM +0100, Magnus Hagander wrote:
> On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:
>> On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
>>> It feels really confusing to me that the user-exposed function here is
>>> called pg_relation_check_pages(). How is the user supposed to
>>> understand the difference between what this function does and what the
>>> new verify_heapam() in amcheck does? The answer is that the latter
>>> does far more extensive checks, but this isn't obvious from the SGML
>>> documentation, which says only that the blocks are "verified," as if
>>> an end-user can reasonably be expected to know what that means. It
>>> seems likely to lead users to the belief that if this function passes,
>>> they are in good shape, which is extremely far from being true. Just
>>> look at what PageIsVerified() checks compared to what verify_heapam()
>>> checks.

The checks of verify_heapam() go much deeper, but they target only one
AM, while this stuff should remain more general.  There is some
overlap with the basic checks done by bufmgr.c, and when running the
AM checks you may not want to be that intrusive with the existing
buffer pool either.  It also seems to me that the use cases are quite
different for both: the original goal of this thread is to detect
physical corruption for all AMs, while verify_heapam() looks for
logical corruption in the way heap is handled.

>> Yea I had similar thoughts, it should just be called
>> pg_checksum_verify_relation() or something.
>
> +1.

I mentioned that upthread: is there really a dependency on checksums
here?  There are two cases where we can still apply some checks on a
page without any need for checksums:
- The state of the page header.
- A zeroed page if pd_upper is 0.  Those pages are valid, and don't
have a checksum computed.
So it seems to me that when it comes to relation pages, the check of a
page should answer the question: is this page loadable into shared
buffers, or not?
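
A simplified sketch of such a "loadable?" check, in the spirit of
PageIsVerified()'s sanity checks (the header struct here is a cut-down
mock for illustration, not the real PageHeaderData):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 8192

/*
 * Cut-down mock of a page header, loosely inspired by PageHeaderData;
 * the field layout is simplified for illustration.
 */
typedef struct MockPageHeader
{
    uint16_t pd_checksum;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
} MockPageHeader;

/*
 * The two checksum-free checks described above:
 * (1) a page with pd_upper == 0 is valid only if it is all zeroes
 *     (a freshly extended page that was never initialized);
 * (2) otherwise the header pointers must be ordered sanely.
 */
static bool
page_is_loadable(const unsigned char *page)
{
    const MockPageHeader *hdr = (const MockPageHeader *) page;

    if (hdr->pd_upper == 0)
    {
        for (size_t i = 0; i < MOCK_PAGE_SIZE; i++)
            if (page[i] != 0)
                return false;
        return true;
    }

    return hdr->pd_lower >= sizeof(MockPageHeader) &&
           hdr->pd_lower <= hdr->pd_upper &&
           hdr->pd_upper <= hdr->pd_special &&
           hdr->pd_special <= MOCK_PAGE_SIZE;
}
```
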

>>> In fact, I would argue that this functionality ought to live in
>>> amcheck rather than core, though there could usefully be enabling
>>> functions in core.
>>
>> I'm not really convinced by this though. It's not really AM-specific -
>> it works for all types of relations with storage; but I don't really
>> object either...
>
> Yeah, I'm not sure about that one either. Also what would happen
> if/when we get checksums on things that aren't even relations? (though
> maybe that goes for other parts of amcheck at some point as well?)

I also thought about amcheck when looking at this thread, but it did
not seem the right place, as this applies to any AM that can load
stuff into shared buffers.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Mon, Nov 02, 2020 at 11:34:57AM -0800, Andres Freund wrote:
> On 2020-11-02 12:35:30 -0500, Robert Haas wrote:
>> On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
>>> I think this needs to be quickly reworked or reverted.
>
> I think it's fairly clear by now that revert is appropriate for now.

Yep, that's clear.  I'll deal with that tomorrow.  That's more than a
simple rework.

>> I don't like this patch, either. In addition to what Andres mentioned,
>> CheckBuffer() is a completely-special case mechanism which can't be
>> reused by anything else. In particular, the amcheck for heap stuff
>> which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
>> would really like a way to examine a buffer without risking an error
>> if PageIsVerified() should happen to fail, but this patch is of
>> absolutely no use in getting there, because CheckBuffer() doesn't give
>> the caller any way to access the contents of the buffer. It can only
>> do the checks that it knows how to do, and that's it. That doesn't
>> seem like a good design.
>
> Wouldn't this be better served by having a ReadBufferExtended() flag,
> preventing erroring out and zeroing the buffer? I'm not sure that
> handling both the case where the buffer contents need to be valid and
> the one where it doesn't will make for a good API.

If you grep for ReadBuffer_common() in some of the emails I sent...  I
was rather interested in something like that.

>> I don't like the fact that CheckBuffer() silently skips dirty buffers,
>> either. The comment should really say that it checks the state of a
>> buffer without loading it into shared buffers, except sometimes it
>> doesn't actually check it.
>
> Yea, I don't see a good reason for that either. There are reasons
> for dirty buffers that aren't WAL-logged - so if the on-disk page is
> broken, a standby taken outside pg_basebackup would possibly still
> end up with a corrupt on-disk page. Similar with a crash restart.

Er, if you don't skip dirty buffers, wouldn't you actually report some
pages as broken when running the checks on a standby that may have
some torn pages from a previous base backup?  You could still run into
problems post-promotion, until the first post-recovery checkpoint
happens, no?

>> If the buffer is in shared buffers, we could take a share-lock
>> on the buffer and copy the contents of the page as it exists on disk,
>> and then still check it.
>
> Don't think we need a share lock. That still allows the buffer to be
> written out (and thus a torn read). What we need is to set
> BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
> that. And then unset that again, without unsetting
> BM_DIRTY/BM_JUST_DIRTIED.

If that can work, we could make use of some of that for base backups
for a single retry of a page that initially failed.
--
Michael

Attachment

Re: Online checksums verification in the backend

From
Robert Haas
Date:
On Mon, Nov 2, 2020 at 2:35 PM Andres Freund <andres@anarazel.de> wrote:
> Wouldn't this be better served by having a ReadBufferExtended() flag,
> preventing erroring out and zeroing the buffer? I'm not sure that
> handling both the case where the buffer contents need to be valid and
> the one where it doesn't will make for a good API.

I'm not sure. The goal I had in mind was giving a caller a way to get
a copy of a buffer even if it's one we wouldn't normally admit into
shared_buffers. I think it's probably a bad idea to allow for a back
door where things that fail PageIsVerified() can nevertheless escape
into the buffer, but that doesn't mean a checker or recovery tool
shouldn't be allowed to see them.

> > If the buffer is in shared buffers, we could take a share-lock
> > on the buffer and copy the contents of the page as it exists on disk,
> > and then still check it.
>
> Don't think we need a share lock. That still allows the buffer to be
> written out (and thus a torn read). What we need is to set
> BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
> that. And then unset that again, without unsetting
> BM_DIRTY/BM_JUST_DIRTIED.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online checksums verification in the backend

From
Michael Paquier
Date:
On Tue, Nov 03, 2020 at 06:46:12PM +0900, Michael Paquier wrote:
> Yep, that's clear.  I'll deal with that tomorrow.  That's more than a
> simple rework.

This part is now done as of e152506a.
--
Michael

Attachment