On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:
> On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Some people might prefer notices, because you can get those while the
>> thing is still running, rather than a result set, which you will only
>> see when the query finishes. Other people might prefer an SRF, because
>> they want to have the data in structured form so that they can
>> postprocess it. Not sure what you mean by "more globally."
>
> I meant having the results available system-wide, not only to the
> caller. I think that emitting a log/notice level should always be
> done on top of whatever other communication facility we're using.
The problem with notices and logs is that they tend to be ignored.
Now I see no problem either in writing something to the logs that can
be parsed later on, say with pgbadger or your friendly neighborhood
grep, on top of a SRF returned to the caller which includes all the
corruption details. I think that any backend function should also
make sure to call pgstat_report_checksum_failure() so that the
failure becomes visible at database level in the catalogs, making it
usable as a cheap high-level warning. The details of the failures
could always be dug out of the logs or the result of the function
itself after finding out that something is wrong in
pg_stat_database.
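
For reference, here is a minimal sketch of what the per-page check
could look like in such a backend function. check_one_page() is a
hypothetical helper name, the page is assumed to be already read
into a local buffer, and only pg_checksum_page() and
pgstat_report_checksum_failure() from the existing backend code are
used:

    #include "postgres.h"
    #include "pgstat.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /* Hypothetical helper: verify one page, report any failure. */
    static bool
    check_one_page(char *page, BlockNumber blkno)
    {
        PageHeader  phdr = (PageHeader) page;
        uint16      checksum;

        /* New (all-zero) pages have no checksum to verify. */
        if (PageIsNew(page))
            return true;

        checksum = pg_checksum_page(page, blkno);
        if (checksum != phdr->pd_checksum)
        {
            /* Makes the failure visible in pg_stat_database. */
            pgstat_report_checksum_failure();
            ereport(WARNING,
                    (errmsg("page verification failed in block %u: "
                            "calculated checksum %u but expected %u",
                            blkno, checksum, phdr->pd_checksum)));
            return false;
        }

        return true;
    }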
>> I guess one
>> idea would be to provide a way to kick this off in the background via
>> a background worker or similar and then have it put the results in a
>> table. But that might fail if there are checksum errors in the
>> catalogs themselves.
>
> Yes that's a concern. We could maintain a list in (dynamic) shared
> memory with a simple SQL wrapper to read the data, but that would be
> lost with a crash/restart. Or use
> pgstat_report_checksum_failures_in_db(), modifying it to take a
> relfilenode, blocknum and forknum and append that to some flat
> files, hoping that it won't get corrupted either.
If a lot of blocks are corrupted, that could bloat things. Hence
some retention policy would be necessary, and that's tricky to
define and configure properly. I'd tend to be in the school of just
logging the information and being done with it, because that's
simple and because you don't need to worry about any extra
configuration. Doing the work in the background is still a separate
problem from a SQL-callable function though, no? In this case you
need a connection to a database so that the checksum verification
can happen on the relfilenode of the relation to check, and also
because you want the thing to be safe concurrently (background work
here would be a combo of a launcher bgworker triggering dynamic
children each working on one database, not necessarily something
that needs to be in core).
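
For what it's worth, that kind of per-database child maps directly
to the existing dynamic bgworker API. A minimal sketch, assuming a
hypothetical "checksum_helper" module (library, entry-point and
function names are made up for the example):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    /* Launcher side: spawn one dynamic worker for database dboid. */
    static void
    launch_checksum_worker(Oid dboid)
    {
        BackgroundWorker        worker;
        BackgroundWorkerHandle *handle;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_library_name, BGW_MAXLEN,
                 "checksum_helper");
        snprintf(worker.bgw_function_name, BGW_MAXLEN,
                 "checksum_worker_main");
        snprintf(worker.bgw_name, BGW_MAXLEN,
                 "checksum worker for database %u", dboid);
        snprintf(worker.bgw_type, BGW_MAXLEN, "checksum worker");
        worker.bgw_main_arg = ObjectIdGetDatum(dboid);
        worker.bgw_notify_pid = MyProcPid;

        if (!RegisterDynamicBackgroundWorker(&worker, &handle))
            ereport(ERROR,
                    (errmsg("could not register background worker")));
    }

    /* Worker side: connect to the target database, then scan it. */
    void
    checksum_worker_main(Datum main_arg)
    {
        Oid     dboid = DatumGetObjectId(main_arg);

        BackgroundWorkerUnblockSignals();
        BackgroundWorkerInitializeConnectionByOid(dboid, InvalidOid, 0);

        /* ... iterate over the relations of this database here ... */
    }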
--
Michael