Re: Online checksums verification in the backend - Mailing list pgsql-hackers
From | Julien Rouhaud |
---|---|
Subject | Re: Online checksums verification in the backend |
Date | |
Msg-id | CAOBaU_YZTowoJJ+NzsBieV0K5OvpOS+dAMD1cmz25hxtVZpQqA@mail.gmail.com Whole thread Raw |
In response to | Re: Online checksums verification in the backend (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Online checksums verification in the backend
|
List | pgsql-hackers |
On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote: > > And Michael just told me that I also missed adding one of the C files > > while splitting the patch into two. > > + if (PageIsNew(page)) > + { > + /* > + * Check if the page is really new or if there's corruption that > + * affected PageIsNew detection. Note that PageIsVerified won't try to > + * detect checksum corruption in this case, so there's no risk of > + * duplicated corruption report. > + */ > + if (PageIsVerified(page, blkno)) > + { > + /* No corruption. */ > + return true; > + } > Please note that this part of your patch overlaps with a proposal for > a bug fix related to zero-only pages with the checksum verification of > base backups: > https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru > > Your patch is trying to adapt itself to the existing logic we have in > PageIsVerified() so as you don't get a duplicated report, as does the > base backup part. Note that I cannot find in the wild any open code > making use of PageIsVerified(), but I'd like to believe that it is > rather handy to use for table AMs at least (?), so if we can avoid any > useless ABI breakage, it may be better to have a new > PageIsVerifiedExtended() able to take additional options, one to > report to pgstat and one to generate this elog(WARNING). And then > this patch could just make use of it? Indeed, that would be great. > + /* > + * There's corruption, but since this affects PageIsNew, we > + * can't compute a checksum, so set NoComputedChecksum for the > + * expected checksum. > + */ > + *chk_expected = NoComputedChecksum; > + *chk_found = hdr->pd_checksum; > + return false; > [...] > + /* > + * This can happen if corruption makes the block appears as > + * PageIsNew() but isn't a new page. > + */ > + if (chk_expected == NoComputedChecksum) > + nulls[i++] = true; > + else > + values[i++] = UInt16GetDatum(chk_expected); > Somewhat related to the first point, NoComputedChecksum exists > because, as the current patch is shaped, we need to report an existing > checksum to the user even for the zero-only case. I'm not sure that I understand your point. The current patch only returns something to users when there's a corruption. If by "zero-only case" you mean "page corrupted in a way that PageIsNew() returns true while not being all zero", then it's a corrupted page and then obviously yes it needs to be returned to users. > PageIsVerified() is > not that flexible so we could change it to report a status depending > on the error faced (checksum, header or zero-only) on top of getting a > checksum. Now, I am not completely sure either that it is worth the > complication to return in the SRF of the check function the expected > checksum. It seemed to me that it could be something useful to get with this kind of tool. You may be able to recover a corrupted page from backup/WAL if the checksum itself wasn't corrupted so that you know what to look for. There would be a lot of caveats and low level work, but if you're desperate enough that may save you a bit of time. > So, wouldn't it be better to just rely on PageIsVerified() > (well it's rather-soon-to-be extended flavor) for the checksum check, > the header sanity check and the zero-only check? My point is to keep > a single entry point for all the page sanity checks, so as base > backups, your patch, and the buffer manager apply the same things. > Base backups got that partially wrong because the base backup code > wants to keep control of the number of failures and the error > reports. I'm fine with it. > Your patch actually wishes to report a failure, but you want > to add more context with the fork name and such. Another option we > could use here is to add an error context so as PageIsVerified() > reports the WARNING, but the SQL function provides more context with > the block number and the relation involved in the check. Also, returning actual data rather than a bunch of warnings is way easier to process for client code. And as mentioned previously having an API that returns a list of corrupted blocks could be useful for a single-page recovery feature.
pgsql-hackers by date: