pg_stat_database.checksum_failures vs shared relations - Mailing list pgsql-hackers

From Andres Freund
Subject pg_stat_database.checksum_failures vs shared relations
Date
Msg-id mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz
Whole thread Raw
Responses Re: pg_stat_database.checksum_failures vs shared relations
List pgsql-hackers
Hi,

First - I find it rather shocking that we have absolutely *zero* tests of
checksum failures in the backend. Zero. As evidenced by [1].  I really can't
quite believe it.  Nor do we have tests of ignore_checksum_failure or
zero_damaged_pages.


I was trying to write some tests for checksums vs AIO, and one thing I noticed
is that our docs seem to rather strongly hint that checksum failures on shared
objects would be reported to the dbname IS NULL row in pg_stat_database:

   The <structname>pg_stat_database</structname> view will contain one row
   for each database in the cluster, plus one for shared objects, showing
   database-wide statistics.

       Name of this database, or <literal>NULL</literal> for shared
       objects.

       Number of data page checksum failures detected in this
       database (or on a shared object), or NULL if data checksums are
       disabled.

But that is not the case, they always get reported to MyDatabaseId by
PageIsVerifiedExtended().

The docs hint more clearly at checksum failures of shared rels reported this
way starting with:

commit 252b707bc41
Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-04-17 13:51:48 +0200

    Return NULL for checksum failures if checksums are not enabled

which made changes like:
      <entry>Number of data page checksum failures detected in this
-     database</entry>
+      database (or on a shared object), or NULL if data checksums are not
+      enabled.</entry>
     </row>


The stats tracking of checksum failures was introduced in

Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-03-09 10:45:17 -0800

    Track block level checksum failures in pg_stat_database

which already did report stats on shared objects that way:

@@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno)
                  errmsg("page verification failed, calculated checksum %u but expected %u",
                         checksum, p->pd_checksum)));

+        pgstat_report_checksum_failure();
+
         if (header_sane && ignore_checksum_failure)
             return true;
     }

In basebackup.c however, it didn't track stats on shared relations at that time:

@@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
         ereport(WARNING,
                 (errmsg("file \"%s\" has a total of %d checksum verification "
                         "failures", readfilename, checksum_failures)));
+
+        if (dboid != InvalidOid)
+            pgstat_report_checksum_failures_in_db(dboid, checksum_failures);
     }
     total_checksum_failures += checksum_failures;


that was changed in:

commit 77bd49adba4
Author: Magnus Hagander <magnus@hagander.net>
Date:   2019-04-12 14:04:50 +0200

    Show shared object statistics in pg_stat_database

    This adds a row to the pg_stat_database view with datoid 0 and datname
    NULL for those objects that are not in a database. This was added
    particularly for checksums, but we were already tracking more satistics
    for these objects, just not returning it.
...

which made the call to pgstat_report_checksum_failures_in_db()
in basebackup.c unconditional. It did leave this comment though:

 * If dboid is anything other than InvalidOid then any checksum failures
 * detected will get reported to the cumulative stats system.


I think the above commit makes it pretty clear that the intent is for checksum
errors on shared database entries to be reported to the "shared" entry in
pg_stat_database.


So, today we have the weird situation that *some* checksum errors on shared
relations get attributed to the current database (if they happen in a backend
normally accessing a shared relation), whereas others get reported to the
"shared relations" "database" (if they happen during a base backup).  That
seems ... not optimal.


One question is whether we consider this a bug that should be backpatched.

To fix it we'd need to provide a bit more information to
PageIsVerifiedExtended(), it currently doesn't know what database the page it
is verifying is in and therefore can't report an error with
pgstat_report_checksum_failures_in_db() (rather than
pgstat_report_checksum_failure(), which attributes to MyDatabaseId).

Obviously having to change the signature of PageIsVerifiedExtended() makes it
harder to fix in the backbranches.

Greetings,

Andres Freund

[1] https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: libpq maligning postgres stability
Next
From: Renan Alves Fonseca
Date:
Subject: Remove restrictions in recursive query