Re: Checksum errors in pg_stat_database - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Checksum errors in pg_stat_database
Msg-id CABUevExGXxStJaM0hLQY_kht_S3HnszgVH1=zk0xcx5ccz7tBQ@mail.gmail.com
In response to Re: Checksum errors in pg_stat_database  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses Re: Checksum errors in pg_stat_database
List pgsql-hackers
On Thu, Dec 8, 2022 at 2:35 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:


> On 4/2/19 7:06 PM, Magnus Hagander wrote:
> > On Tue, Apr 2, 2019 at 8:47 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> >     On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> >      > On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
> >      >>  One thing which is not
> >      >> proposed on this patch, and I am fine with it as a first draft, is
> >      >> that we don't have any information about the broken block number and
> >      >> the file involved.  My gut tells me that we'd want a separate view,
> >      >> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
> >      >> blck) to be complete.  But that's just for future work.
> >      >
> >      > That could indeed be nice.
> >
> >     Actually, backpedaling on this one...  pg_stat_checksums_details may
> >     be a bad idea as we could finish with one row per broken block.  If
> >     a corruption is spreading quickly, pgstat would not be able to sustain
> >     that amount of objects.  Having pg_stat_checksums would allow us to
> >     plugin more data easily based on the last failure state:
> >     - last relid of failure
> >     - last fork type of failure
> >     - last block number of failure.
> >     Not saying to do that now, but having that in pg_stat_database does
> >     not seem very natural to me.  And on top of that we would have an
> >     extra row full of NULLs for shared objects in pg_stat_database if we
> >     adopt the unique view approach...  I find that rather ugly.
> >
> >
> > I think that tracking each and every block is of course a non-starter, as you've noticed.
>
> I think that's less of a concern now that the stats collector process has gone and that the stats are now collected in shared memory, what do you think?

It would be less of a concern, yes, but I think it would still be a concern. With a large amount of corruption you could quickly get to millions of rows to keep track of, which would definitely be a problem in shared memory as well, wouldn't it?

But perhaps we could keep a list of "the last 100 checksum failures" or something like that?  
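
To make the "last 100 failures" idea a bit more concrete, here is a rough sketch of a fixed-size ring buffer. All names in it (ChecksumFailureEntry, ChecksumFailureLog, checksum_failure_log_add and so on) are made up for illustration and are not existing PostgreSQL code, the entry fields just mirror the (dboid, rel, fork, blck) tuple mentioned upthread, and it ignores the locking a real shared-memory structure would need:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch only; none of these names exist in PostgreSQL. */
#define CHECKSUM_FAILURE_LOG_SIZE 100

typedef struct ChecksumFailureEntry
{
    uint32_t    dboid;      /* database OID */
    uint32_t    relid;      /* relation OID */
    int32_t     forknum;    /* fork number */
    uint32_t    blkno;      /* block number */
} ChecksumFailureEntry;

typedef struct ChecksumFailureLog
{
    ChecksumFailureEntry entries[CHECKSUM_FAILURE_LOG_SIZE];
    uint64_t    total;      /* failures ever recorded; also picks the next slot */
} ChecksumFailureLog;

/* Record a failure, overwriting the oldest entry once the buffer is full. */
static void
checksum_failure_log_add(ChecksumFailureLog *log, ChecksumFailureEntry entry)
{
    log->entries[log->total % CHECKSUM_FAILURE_LOG_SIZE] = entry;
    log->total++;
}

/* Number of entries currently held (at most CHECKSUM_FAILURE_LOG_SIZE). */
static size_t
checksum_failure_log_count(const ChecksumFailureLog *log)
{
    if (log->total < CHECKSUM_FAILURE_LOG_SIZE)
        return (size_t) log->total;
    return CHECKSUM_FAILURE_LOG_SIZE;
}

/* Fetch the i-th most recent failure (0 = newest); returns 0 if out of range. */
static int
checksum_failure_log_get(const ChecksumFailureLog *log, size_t i,
                         ChecksumFailureEntry *out)
{
    if (i >= checksum_failure_log_count(log))
        return 0;
    *out = log->entries[(log->total - 1 - i) % CHECKSUM_FAILURE_LOG_SIZE];
    return 1;
}
```

The point is that memory use stays constant no matter how fast corruption spreads: older entries are simply overwritten, while the total counter still tells you how many failures were seen overall.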
