Thread: pg_stat_database.checksum_failures vs shared relations

pg_stat_database.checksum_failures vs shared relations

From
Andres Freund
Date:
Hi,

First - I find it rather shocking that we have absolutely *zero* tests of
checksum failures in the backend. Zero. As evidenced by [1].  I really can't
quite believe it.  Nor do we have tests of ignore_checksum_failure or
zero_damaged_pages.


I was trying to write some tests for checksums vs AIO, and one thing I noticed
is that our docs seem to rather strongly hint that checksum failures on shared
objects would be reported to the dbname IS NULL row in pg_stat_database:

   The <structname>pg_stat_database</structname> view will contain one row
   for each database in the cluster, plus one for shared objects, showing
   database-wide statistics.

       Name of this database, or <literal>NULL</literal> for shared
       objects.

       Number of data page checksum failures detected in this
       database (or on a shared object), or NULL if data checksums are
       disabled.

But that is not the case, they always get reported to MyDatabaseId by
PageIsVerifiedExtended().

The docs hint more clearly at checksum failures of shared rels reported this
way starting with:

commit 252b707bc41
Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-04-17 13:51:48 +0200

    Return NULL for checksum failures if checksums are not enabled

which made changes like:
      <entry>Number of data page checksum failures detected in this
-     database</entry>
+      database (or on a shared object), or NULL if data checksums are not
+      enabled.</entry>
     </row>


The stats tracking of checksum failures was introduced in

Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-03-09 10:45:17 -0800

    Track block level checksum failures in pg_stat_database

which already did report stats on shared objects that way:

@@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno)
                  errmsg("page verification failed, calculated checksum %u but expected %u",
                         checksum, p->pd_checksum)));

+        pgstat_report_checksum_failure();
+
         if (header_sane && ignore_checksum_failure)
             return true;
     }

In basebackup.c however, it didn't track stats on shared relations at that time:

@@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
         ereport(WARNING,
                 (errmsg("file \"%s\" has a total of %d checksum verification "
                         "failures", readfilename, checksum_failures)));
+
+        if (dboid != InvalidOid)
+            pgstat_report_checksum_failures_in_db(dboid, checksum_failures);
     }
     total_checksum_failures += checksum_failures;


that was changed in:

commit 77bd49adba4
Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-04-12 14:04:50 +0200

    Show shared object statistics in pg_stat_database

    This adds a row to the pg_stat_database view with datoid 0 and datname
    NULL for those objects that are not in a database. This was added
    particularly for checksums, but we were already tracking more satistics
    for these objects, just not returning it.
...

which made the call to pgstat_report_checksum_failures_in_db()
in basebackup.c unconditional. It did leave this comment though:

 * If dboid is anything other than InvalidOid then any checksum failures
 * detected will get reported to the cumulative stats system.


I think the above commit makes it pretty clear that the intent is for checksum
errors on shared database entries to be reported to the "shared" entry in
pg_stat_database.


So, today we have the weird situation that *some* checksum errors on shared
relations get attributed to the current database (if they happen in a backend
normally accessing a shared relation), whereas others get reported to the
"shared relations" "database" (if they happen during a base backup).  That
seems ... not optimal.


One question is whether we consider this a bug that should be backpatched.

To fix it we'd need to provide a bit more information to
PageIsVerifiedExtended(), it currently doesn't know what database the page it
is verifying is in and therefore can't report an error with
pgstat_report_checksum_failures_in_db() (rather than
pgstat_report_checksum_failure(), which attributes to MyDatabaseId).

Obviously having to change the signature of PageIsVerifiedExtended() makes it
harder to fix in the backbranches.

Greetings,

Andres Freund

[1] https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136



Re: pg_stat_database.checksum_failures vs shared relations

From
Robert Haas
Date:
On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> So, today we have the weird situation that *some* checksum errors on shared
> relations get attributed to the current database (if they happen in a backend
> normally accessing a shared relation), whereas others get reported to the
> "shared relations" "database" (if they happen during a base backup).  That
> seems ... not optimal.
>
> One question is whether we consider this a bug that should be backpatched.

I think it would be defensible if pg_basebackup reported all errors
with OID 0 and backend connections reported all errors with OID
MyDatabaseId, but it seems hard to justify having pg_basebackup take
care to report things using the correct database OID and individual
backend connections not take care to do the same thing. So I think
this is a bug. If fixing it in the back-branches is too annoying, I
think it would be reasonable to fix it only in master, but
back-patching seems OK too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pg_stat_database.checksum_failures vs shared relations

From
Michael Paquier
Date:
On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> > So, today we have the weird situation that *some* checksum errors on shared
> > relations get attributed to the current database (if they happen in a backend
> > normally accessing a shared relation), whereas others get reported to the
> > "shared relations" "database" (if they happen during a base backup).  That
> > seems ... not optimal.
> >
> > One question is whether we consider this a bug that should be backpatched.
>
> I think it would be defensible if pg_basebackup reported all errors
> with OID 0 and backend connections reported all errors with OID
> MyDatabaseId, but it seems hard to justify having pg_basebackup take
> care to report things using the correct database OID and individual
> backend connections not take care to do the same thing. So I think
> this is a bug. If fixing it in the back-branches is too annoying, I
> think it would be reasonable to fix it only in master, but
> back-patching seems OK too.

Being able to get a better reporting for shared relations in back
branches would be nice, but that's going to require some invasive
chirurgy, isn't it?

We don't know currently the OID of the relation whose block is
corrupted with only PageIsVerifiedExtended().  There are two callers
of PIV_REPORT_STAT on HEAD:
- The checksum reports from RelationCopyStorage() know the
SMgrRelation.
- ReadBuffersOperation() has an optional Relation and a
SMgrRelationData.

We could just refactor PageIsVerifiedExtended() so as it reports a
state about why the verification failed and let the callers report the
checksum failure with a relation OID, splitting the data for shared
and non-shared relations?
--
Michael

Attachment

Re: pg_stat_database.checksum_failures vs shared relations

From
Andres Freund
Date:
Hi,

On 2025-03-28 09:44:58 +0900, Michael Paquier wrote:
> On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> > > So, today we have the weird situation that *some* checksum errors on shared
> > > relations get attributed to the current database (if they happen in a backend
> > > normally accessing a shared relation), whereas others get reported to the
> > > "shared relations" "database" (if they happen during a base backup).  That
> > > seems ... not optimal.
> > >
> > > One question is whether we consider this a bug that should be backpatched.
> > 
> > I think it would be defensible if pg_basebackup reported all errors
> > with OID 0 and backend connections reported all errors with OID
> > MyDatabaseId, but it seems hard to justify having pg_basebackup take
> > care to report things using the correct database OID and individual
> > backend connections not take care to do the same thing. So I think
> > this is a bug. If fixing it in the back-branches is too annoying, I
> > think it would be reasonable to fix it only in master, but
> > back-patching seems OK too.
> 
> Being able to get a better reporting for shared relations in back
> branches would be nice, but that's going to require some invasive
> chirurgy, isn't it?

Yea, that's what I was worried about too.  I think we basically would need a
PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(),
with optional arguments that the "fixed" callers would use.


> We don't know currently the OID of the relation whose block is
> corrupted with only PageIsVerifiedExtended().

I don't think the relation oid is really the relevant bit, it's the database
oid (or alternatively tablespace). But PageIsVerifiedExtended() doesn't know
that either, obviously.


> There are two callers of PIV_REPORT_STAT on HEAD:

> - The checksum reports from RelationCopyStorage() know the
> SMgrRelation.
> - ReadBuffersOperation() has an optional Relation and a
> SMgrRelationData.

An SMgrRelationData suffices, via ->smgr_rlocator.locator.dbOid.


FWIW, it turns out that there are more cases than just MyDatabaseId and
InvalidOid - ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer()
read buffers in a different database than MyDatabaseId.


> We could just refactor PageIsVerifiedExtended() so as it reports a
> state about why the verification failed and let the callers report the
> checksum failure with a relation OID, splitting the data for shared
> and non-shared relations?

Yea, I think we basically need a *checksum_failed out argument, and then the
callers need to do

if (checksum_failure)
   pgstat_report_checksum_failures_in_db(src->smgr_rlocator.locator.dbOid, 1);

Or alternatively we can optionally pass in the rlocator to
PageIsVerifiedExtended2(), so it can do the above internally.

Btw, it seems somewhat odd that we accumulate stats for checksum failures but
not for invalid page headers - the latter seems even worse...

Greetings,

Andres Freund



Re: pg_stat_database.checksum_failures vs shared relations

From
Julien Rouhaud
Date:
On Thu, Mar 27, 2025 at 09:02:02PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-28 09:44:58 +0900, Michael Paquier wrote:
> > On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote:
> > > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote:
> > > > So, today we have the weird situation that *some* checksum errors on shared
> > > > relations get attributed to the current database (if they happen in a backend
> > > > normally accessing a shared relation), whereas others get reported to the
> > > > "shared relations" "database" (if they happen during a base backup).  That
> > > > seems ... not optimal.
> > > >
> > > > One question is whether we consider this a bug that should be backpatched.
> > >
> > > I think it would be defensible if pg_basebackup reported all errors
> > > with OID 0 and backend connections reported all errors with OID
> > > MyDatabaseId, but it seems hard to justify having pg_basebackup take
> > > care to report things using the correct database OID and individual
> > > backend connections not take care to do the same thing. So I think
> > > this is a bug. If fixing it in the back-branches is too annoying, I
> > > think it would be reasonable to fix it only in master, but
> > > back-patching seems OK too.
> >
> > Being able to get a better reporting for shared relations in back
> > branches would be nice, but that's going to require some invasive
> > chirurgy, isn't it?
>
> Yea, that's what I was worried about too.  I think we basically would need a
> PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(),
> with optional arguments that the "fixed" callers would use.

While it would be nice, I'm not sure that it would really be worth the trouble.

Maybe that's just me, but if I hit a corruption failure knowing whether it's a
global relation vs normal relation is definitely not something that will
radically change the following days / weeks of pain to fully resolve the
issue.  Instead there would be other improvements that I would welcome on top
of fixing those counters, which would impact such new API.

For instance one of the thing you need to do in case of a corruption is to
understand the reason for the corruption, and for that knowing the underlying
tablespace rather than the database seems like a way more useful information to
track.  For the rest, the relfilelocator, forknum and blocknum should already
be reported in the logs so you have the full details of what was intercepted
even if the pg_stat_database view is broken in the back branches.

But even if we had all that, there is still no guarantee (at least for now)
that we do see all the corruption as you might not read the "real" version of
the blockss if they are in shared buffers and/or in the OS cache, depending on
where the corruption actually happened.

And even if you could actually check what is physically stored on disk, that
would probably won't give you any strong guarantee that the rest data is
actually ok anyway.  The biggest source of corruption I know is an old vmware
bug usually referred as the SEsparse bug, where in some occasion some blocks
would get written at the wrong location.  In that case, the checksum can tell
me which are the blocks where the wrong write happened, but not what are the
blocks where the write should have happened, which are also entirely
inconsistent too.  That's clearly out of postgres scope, but that's in my
opinion just one out of probably a lot more examples that makes the current bug
in back branches not worth spending too many efforts to fix.



Re: pg_stat_database.checksum_failures vs shared relations

From
Andres Freund
Date:
Hi,

Attached is a fix for the issue.

I looked around and didn't find extensions using PageIsVerified[Extended]() in
codesearch.debian.org, so I got rid of the compat macro and renamed
PageIsVerifiedExtended back to PageIsVerified().

Normally I'd commit tests as part of a fix like this, but since I've already
written test infrastructure for checksum failures and their stats as part of
aio, and those tests don't work without more of aio applied, I don't think it
makes sense to write them for just this test.  It's not like anybody has ever
bothered to test checksum failures before...

Greetings,

Andres Freund

Attachment

Re: pg_stat_database.checksum_failures vs shared relations

From
Andres Freund
Date:
Hi,

On 2025-03-28 13:47:16 -0400, Andres Freund wrote:
> Attached is a fix for the issue.

I plan to push this fix soon, unless somebody protests...

Greetings,

Andres Freund



Re: pg_stat_database.checksum_failures vs shared relations

From
Andres Freund
Date:
Hi,

On 2025-03-29 13:17:44 -0400, Andres Freund wrote:
> On 2025-03-28 13:47:16 -0400, Andres Freund wrote:
> > Attached is a fix for the issue.
> 
> I plan to push this fix soon, unless somebody protests...

And done.

Greetings,

Andres Freund



Re: pg_stat_database.checksum_failures vs shared relations

From
Magnus Hagander
Date:
On Sat, Mar 29, 2025 at 7:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2025-03-29 13:17:44 -0400, Andres Freund wrote:
> On 2025-03-28 13:47:16 -0400, Andres Freund wrote:
> > Attached is a fix for the issue.
>
> I plan to push this fix soon, unless somebody protests...

And done.

Hi!

Sorry to get into this thread a bit late. Just to let you know that now that I'm caught up, I do agree  it looks right.

And  - thanks for handling this!

//Magnus