Thread: pg_stat_database.checksum_failures vs shared relations
Hi, First - I find it rather shocking that we have absolutely *zero* tests of checksum failures in the backend. Zero. As evidenced by [1]. I really can't quite believe it. Nor do we have tests of ignore_checksum_failure or zero_damaged_pages. I was trying to write some tests for checksums vs AIO, and one thing I noticed is that our docs seem to rather strongly hint that checksum failures on shared objects would be reported to the dbname IS NULL row in pg_stat_database: The <structname>pg_stat_database</structname> view will contain one row for each database in the cluster, plus one for shared objects, showing database-wide statistics. Name of this database, or <literal>NULL</literal> for shared objects. Number of data page checksum failures detected in this database (or on a shared object), or NULL if data checksums are disabled. But that is not the case, they always get reported to MyDatabaseId by PageIsVerifiedExtended(). The docs hint more clearly at checksum failures of shared rels reported this way starting with: commit 252b707bc41 Author: Magnus Hagander <magnus@hagander.net> Date: 2019-04-17 13:51:48 +0200 Return NULL for checksum failures if checksums are not enabled which made changes like: <entry>Number of data page checksum failures detected in this - database</entry> + database (or on a shared object), or NULL if data checksums are not + enabled.</entry> </row> The stats tracking of checksum failures was introduced in Author: Magnus Hagander <magnus@hagander.net> Date: 2019-03-09 10:45:17 -0800 Track block level checksum failures in pg_stat_database which already did report stats on shared objects that way: @@ -151,6 +152,8 @@ PageIsVerified(Page page, BlockNumber blkno) errmsg("page verification failed, calculated checksum %u but expected %u", checksum, p->pd_checksum))); + pgstat_report_checksum_failure(); + if (header_sane && ignore_checksum_failure) return true; } In basebackup.c however, it didn't track stats on shared relations at that time: @@ -1580,6 +1583,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf ereport(WARNING, (errmsg("file \"%s\" has a total of %d checksum verification " "failures", readfilename, checksum_failures))); + + if (dboid != InvalidOid) + pgstat_report_checksum_failures_in_db(dboid, checksum_failures); } total_checksum_failures += checksum_failures; that was changed in: commit 77bd49adba4 Author: Magnus Hagander <magnus@hagander.net> Date: 2019-04-12 14:04:50 +0200 Show shared object statistics in pg_stat_database This adds a row to the pg_stat_database view with datoid 0 and datname NULL for those objects that are not in a database. This was added particularly for checksums, but we were already tracking more satistics for these objects, just not returning it. ... which made the call to pgstat_report_checksum_failures_in_db() in basebackup.c unconditional. It did leave this comment though: * If dboid is anything other than InvalidOid then any checksum failures * detected will get reported to the cumulative stats system. I think the above commit makes it pretty clear that the intent is for checksum errors on shared database entries to be reported to the "shared" entry in pg_stat_database. So, today we have the weird situation that *some* checksum errors on shared relations get attributed to the current database (if they happen in a backend normally accessing a shared relation), whereas others get reported to the "shared relations" "database" (if they happen during a base backup). That seems ... not optimal. One question is whether we consider this a bug that should be backpatched. To fix it we'd need to provide a bit more information to PageIsVerifiedExtended(), it currently doesn't know what database the page it is verifying is in and therefore can't report an error with pgstat_report_checksum_failures_in_db() (rather than pgstat_report_checksum_failure(), which attributes to MyDatabaseId). Obviously having to change the signature of PageIsVerifiedExtended() makes it harder to fix in the backbranches. Greetings, Andres Freund [1] https://coverage.postgresql.org/src/backend/storage/page/bufpage.c.gcov.html#136
On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > So, today we have the weird situation that *some* checksum errors on shared > relations get attributed to the current database (if they happen in a backend > normally accessing a shared relation), whereas others get reported to the > "shared relations" "database" (if they happen during a base backup). That > seems ... not optimal. > > One question is whether we consider this a bug that should be backpatched. I think it would be defensible if pg_basebackup reported all errors with OID 0 and backend connections reported all errors with OID MyDatabaseId, but it seems hard to justify having pg_basebackup take care to report things using the correct database OID and individual backend connections not take care to do the same thing. So I think this is a bug. If fixing it in the back-branches is too annoying, I think it would be reasonable to fix it only in master, but back-patching seems OK too. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote: > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > > So, today we have the weird situation that *some* checksum errors on shared > > relations get attributed to the current database (if they happen in a backend > > normally accessing a shared relation), whereas others get reported to the > > "shared relations" "database" (if they happen during a base backup). That > > seems ... not optimal. > > > > One question is whether we consider this a bug that should be backpatched. > > I think it would be defensible if pg_basebackup reported all errors > with OID 0 and backend connections reported all errors with OID > MyDatabaseId, but it seems hard to justify having pg_basebackup take > care to report things using the correct database OID and individual > backend connections not take care to do the same thing. So I think > this is a bug. If fixing it in the back-branches is too annoying, I > think it would be reasonable to fix it only in master, but > back-patching seems OK too. Being able to get a better reporting for shared relations in back branches would be nice, but that's going to require some invasive chirurgy, isn't it? We don't know currently the OID of the relation whose block is corrupted with only PageIsVerifiedExtended(). There are two callers of PIV_REPORT_STAT on HEAD: - The checksum reports from RelationCopyStorage() know the SMgrRelation. - ReadBuffersOperation() has an optional Relation and a SMgrRelationData. We could just refactor PageIsVerifiedExtended() so as it reports a state about why the verification failed and let the callers report the checksum failure with a relation OID, splitting the data for shared and non-shared relations? -- Michael
Attachment
Hi, On 2025-03-28 09:44:58 +0900, Michael Paquier wrote: > On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote: > > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > > > So, today we have the weird situation that *some* checksum errors on shared > > > relations get attributed to the current database (if they happen in a backend > > > normally accessing a shared relation), whereas others get reported to the > > > "shared relations" "database" (if they happen during a base backup). That > > > seems ... not optimal. > > > > > > One question is whether we consider this a bug that should be backpatched. > > > > I think it would be defensible if pg_basebackup reported all errors > > with OID 0 and backend connections reported all errors with OID > > MyDatabaseId, but it seems hard to justify having pg_basebackup take > > care to report things using the correct database OID and individual > > backend connections not take care to do the same thing. So I think > > this is a bug. If fixing it in the back-branches is too annoying, I > > think it would be reasonable to fix it only in master, but > > back-patching seems OK too. > > Being able to get a better reporting for shared relations in back > branches would be nice, but that's going to require some invasive > chirurgy, isn't it? Yea, that's what I was worried about too. I think we basically would need a PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(), with optional arguments that the "fixed" callers would use. > We don't know currently the OID of the relation whose block is > corrupted with only PageIsVerifiedExtended(). I don't think the relation oid is really the relevant bit, it's the database oid (or alternatively tablespace). But PageIsVerifiedExtended() doesn't know that either, obviously. > There are two callers of PIV_REPORT_STAT on HEAD: > - The checksum reports from RelationCopyStorage() know the > SMgrRelation. > - ReadBuffersOperation() has an optional Relation and a > SMgrRelationData. An SMgrRelationData suffices, via ->smgr_rlocator.locator.dbOid. FWIW, it turns out that there are more cases than just MyDatabaseId and InvalidOid - ScanSourceDatabasePgClass() and RelationCopyStorageUsingBuffer() read buffers in a different database than MyDatabaseId. > We could just refactor PageIsVerifiedExtended() so as it reports a > state about why the verification failed and let the callers report the > checksum failure with a relation OID, splitting the data for shared > and non-shared relations? Yea, I think we basically need a *checksum_failed out argument, and then the callers need to do if (checksum_failure) pgstat_report_checksum_failures_in_db(src->smgr_rlocator.locator.dbOid, 1); Or alternatively we can optionally pass in the rlocator to PageIsVerifiedExtended2(), so it can do the above internally. Btw, it seems somewhat odd that we accumulate stats for checksum failures but not for invalid page headers - the latter seems even worse... Greetings, Andres Freund
On Thu, Mar 27, 2025 at 09:02:02PM -0400, Andres Freund wrote: > Hi, > > On 2025-03-28 09:44:58 +0900, Michael Paquier wrote: > > On Thu, Mar 27, 2025 at 12:06:45PM -0400, Robert Haas wrote: > > > On Thu, Mar 27, 2025 at 11:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > So, today we have the weird situation that *some* checksum errors on shared > > > > relations get attributed to the current database (if they happen in a backend > > > > normally accessing a shared relation), whereas others get reported to the > > > > "shared relations" "database" (if they happen during a base backup). That > > > > seems ... not optimal. > > > > > > > > One question is whether we consider this a bug that should be backpatched. > > > > > > I think it would be defensible if pg_basebackup reported all errors > > > with OID 0 and backend connections reported all errors with OID > > > MyDatabaseId, but it seems hard to justify having pg_basebackup take > > > care to report things using the correct database OID and individual > > > backend connections not take care to do the same thing. So I think > > > this is a bug. If fixing it in the back-branches is too annoying, I > > > think it would be reasonable to fix it only in master, but > > > back-patching seems OK too. > > > > Being able to get a better reporting for shared relations in back > > branches would be nice, but that's going to require some invasive > > chirurgy, isn't it? > > Yea, that's what I was worried about too. I think we basically would need a > PageIsVerifiedExtended2() that backs the current PageIsVerifiedExtended(), > with optional arguments that the "fixed" callers would use. While it would be nice, I'm not sure that it would really be worth the trouble. Maybe that's just me, but if I hit a corruption failure knowing whether it's a global relation vs normal relation is definitely not something that will radically change the following days / weeks of pain to fully resolve the issue. Instead there would be other improvements that I would welcome on top of fixing those counters, which would impact such new API. For instance one of the thing you need to do in case of a corruption is to understand the reason for the corruption, and for that knowing the underlying tablespace rather than the database seems like a way more useful information to track. For the rest, the relfilelocator, forknum and blocknum should already be reported in the logs so you have the full details of what was intercepted even if the pg_stat_database view is broken in the back branches. But even if we had all that, there is still no guarantee (at least for now) that we do see all the corruption as you might not read the "real" version of the blockss if they are in shared buffers and/or in the OS cache, depending on where the corruption actually happened. And even if you could actually check what is physically stored on disk, that would probably won't give you any strong guarantee that the rest data is actually ok anyway. The biggest source of corruption I know is an old vmware bug usually referred as the SEsparse bug, where in some occasion some blocks would get written at the wrong location. In that case, the checksum can tell me which are the blocks where the wrong write happened, but not what are the blocks where the write should have happened, which are also entirely inconsistent too. That's clearly out of postgres scope, but that's in my opinion just one out of probably a lot more examples that makes the current bug in back branches not worth spending too many efforts to fix.
Hi, Attached is a fix for the issue. I looked around and didn't find extensions using PageIsVerified[Extended]() in codesearch.debian.org, so I got rid of the compat macro and renamed PageIsVerifiedExtended back to PageIsVerified(). Normally I'd commit tests as part of a fix like this, but since I've already written test infrastructure for checksum failures and their stats as part of aio, and those tests don't work without more of aio applied, I don't think it makes sense to write them for just this test. It's not like anybody has ever bothered to test checksum failures before... Greetings, Andres Freund
Attachment
Hi, On 2025-03-28 13:47:16 -0400, Andres Freund wrote: > Attached is a fix for the issue. I plan to push this fix soon, unless somebody protests... Greetings, Andres Freund
Hi, On 2025-03-29 13:17:44 -0400, Andres Freund wrote: > On 2025-03-28 13:47:16 -0400, Andres Freund wrote: > > Attached is a fix for the issue. > > I plan to push this fix soon, unless somebody protests... And done. Greetings, Andres Freund
On Sat, Mar 29, 2025 at 7:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2025-03-29 13:17:44 -0400, Andres Freund wrote:
> On 2025-03-28 13:47:16 -0400, Andres Freund wrote:
> > Attached is a fix for the issue.
>
> I plan to push this fix soon, unless somebody protests...
And done.
Hi!
Sorry to get into this thread a bit late. Just to let you know that now that I'm caught up, I do agree it looks right.
And - thanks for handling this!
//Magnus