On Mon, Dec 12, 2022 at 12:40 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Dec 11, 2022 at 09:18:42PM +0100, Magnus Hagander wrote:
> > It would be less of a concern yes, but I think it still would be a
> > concern. If you have a large amount of corruption you could quickly
> > get to millions of rows to keep track of, which would definitely be
> > a problem in shared memory as well, wouldn't it?
> Yes. I have discussed this item with Bertrand off-list and I share the
> same concern. This would lead to a lot of extra workload on a large
> seqscan of a corrupted relation when the stats are written (shutdown
> delay), while bloating shared memory with potentially millions of
> items, even if variable lists are handled through a dshash and DSM.
> > But perhaps we could keep a list of "the last 100 checksum failures"
> > or something like that?
> Applying a threshold is one solution. Now, a second thing I have seen
> in the past is that some disk partitions were busted but not others,
> and the current database-level counters are not enough to make a
> difference when it comes to spotting patterns in this area. A list of
> the last N failures may be able to show some pattern, but that would
> be like analyzing things with a lot of noise, without a clear
> conclusion.
> Anyway, the workload caused by the threshold number had better be
> measured before being decided on (a large set of relation files with a
> full range of corrupted blocks, much better if these are in the OS
> cache when scanned), which does not change the need for a benchmark.
> What about just adding a counter tracking the number of checksum
> failures for relfilenodes in a new structure related to them (note
> that I did not write PgStat_StatTabEntry)?
>
> If we do that, it is then possible to cross-check the failures with
> tablespaces, which would point to disk areas that are more sensitive
> to corruption.
If that's the concern, then perhaps the level we should be tracking things at is the tablespace? We don't have any stats per tablespace today, I believe, but that doesn't mean we couldn't create them.