pg_stat_database.checksum_failures vs shared relations - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | pg_stat_database.checksum_failures vs shared relations |
Date | |
Msg-id | mglpvvbhighzuwudjxzu4br65qqcxsnyvio3nl4fbog3qknwhg@e4gt7npsohuz Whole thread Raw |
Responses |
Re: pg_stat_database.checksum_failures vs shared relations
|
List | pgsql-hackers |
Hi, First - I find it rather shocking that we have absolutely *zero* tests of checksum failures in the backend. Zero. As evidenced by [1]. I really can't quite believe it. Nor do we have tests of ignore_checksum_failure or zero_damaged_pages. I was trying to write some tests for checksums vs AIO, and one thing I noticed is that our docs seem to rather strongly hint that checksum failures on shared objects would be reported to the dbname IS NULL row in pg_stat_database: The <structname>pg_stat_database</structname> view will contain one row for each database in the cluster, plus one for shared objects, showing database-wide statistics. Name of this database, or <literal>NULL</literal> for shared objects. Number of data page checksum failures detected in this database (or on a shared object), or NULL if data checksums are disabled. But that is not the case, they always get reported to MyDatabaseId by PageIsVerifiedExtended(). The docs hint more clearly at checksum failures of shared rels reported this way starting with: commit 252b707bc41 Author: Magnus Hagander <magnus@hagander.net> Date: 2019-04-17 13:51:48 +0200 Return NULL for checksum failures if checksums are not enabled which made changes like: <entry>Number of data page checksum failures detected in this - database</entry> + database (or on a shared object), or NULL if data checksums are not + enabled.</entry> </row> The stats tracking of checksum failures was introduced in Author: Magnus Hagander <magnus@hagander.net> Date: 2019-03-09 10:45:17 -0800 Track block level checksum failures in pg_stat_database which already did report stats on shared objects that way: @@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno) errmsg("page verification failed, calculated checksum %u but expected %u", checksum, p->pd_checksum))); + pgstat_report_checksum_failure(); + if (header_sane && ignore_checksum_failure) return true; } In basebackup.c however, it didn't track stats on shared relations at that time: @@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf ereport(WARNING, (errmsg("file \"%s\" has a total of %d checksum verification " "failures", readfilename, checksum_failures))); + + if (dboid != InvalidOid) + pgstat_report_checksum_failures_in_db(dboid, checksum_failures); } total_checksum_failures += checksum_failures; that was changed in: commit 77bd49adba4 Author: Magnus Hagander <magnus@hagander.net> Date: 2019-04-12 14:04:50 +0200 Show shared object statistics in pg_stat_database This adds a row to the pg_stat_database view with datoid 0 and datname NULL for those objects that are not in a database. This was added particularly for checksums, but we were already tracking more satistics for these objects, just not returning it. ... which made the call to pgstat_report_checksum_failures_in_db() in basebackup.c unconditional. It did leave this comment though: * If dboid is anything other than InvalidOid then any checksum failures * detected will get reported to the cumulative stats system. I think the above commit makes it pretty clear that the intent is for checksum errors on shared database entries to be reported to the "shared" entry in pg_stat_database. So, today we have the weird situation that *some* checksum errors on shared relations get attributed to the current database (if they happen in a backend normally accessing a shared relation), whereas others get reported to the "shared relations" "database" (if they happen during a base backup). That seems ... not optimal. One question is whether we consider this a bug that should be backpatched. To fix it we'd need to provide a bit more information to PageIsVerifiedExtended(), it currently doesn't know what database the page it is verifying is in and therefore can't report an error with pgstat_report_checksum_failures_in_db() (rather than pgstat_report_checksum_failure(), which attributes to MyDatabaseId). Obviously having to change the signature of PageIsVerifiedExtended() makes it harder to fix in the backbranches. Greetings, Andres Freund [1] https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136
pgsql-hackers by date: