Re: Checksum errors in pg_stat_database - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Checksum errors in pg_stat_database
Date
Msg-id CABUevEzorGZ-wNzbAhzQj3B+byo20BCWrH-S+tBh+sQykVj2iQ@mail.gmail.com
Whole thread Raw
In response to Re: Checksum errors in pg_stat_database  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Checksum errors in pg_stat_database
Re: Checksum errors in pg_stat_database
List pgsql-hackers
On Tue, Apr 2, 2019 at 8:47 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>>  One thing which is not
>> proposed on this patch, and I am fine with it as a first draft, is
>> that we don't have any information about the broken block number and
>> the file involved.  My gut tells me that we'd want a separate view,
>> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
>> blck) to be complete.  But that's just for future work.
>
> That could indeed be nice.

Actually, backpedaling on this one...  pg_stat_checksums_details may
be a bad idea as we could finish with one row per broken block.  If
a corruption is spreading quickly, pgstat would not be able to sustain
that amount of objects.  Having pg_stat_checksums would allow us to
plugin more data easily based on the last failure state:
- last relid of failure
- last fork type of failure
- last block number of failure.
Not saying to do that now, but having that in pg_stat_database does
not seem very natural to me.  And on top of that we would have an
extra row full of NULLs for shared objects in pg_stat_database if we
adopt the unique view approach...  I find that rather ugly.

I think that tracking each and every block is of course a non-starter, as you've noticed.

I'm really not sure how much those three extra fields help, TBH. As I see it the real usecase for this is automated monitoring and quick-checks of the kind of "is my db currently broken somewhere", in combination with "did this occur recently" (for people who have never looked at their stats).

This gives people enough information to know where to go look in the logs.

I mean, what's the actual usecase for tracking relid/fork/block of the *last* failure only? To monitor and see if it changes? What do I do when I have 10 failures, and I only know about the last one? (I have to go to the logs anyway)

I think having the count and hte last time make sense, but I'm very sceptical about the rest.

I can somewhat agree that splitting it  on a per database level might even at that be overdoing it. What might actually be more interesting from a failure-location perspective would be tablespace, rather than any of the others. Or we could reduce it down to just putting it in pg_stat_bgwriter and only count global values perhaps, if in the end we don't think the split-per-database is reasonable?

--

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Inadequate executor locking of indexes
Next
From: David Rowley
Date:
Subject: Re: COPY FROM WHEN condition