Home > mailing lists

Re: Checksum errors in pg_stat_database - Mailing list pgsql-hackers

From	Magnus Hagander
Subject	Re: Checksum errors in pg_stat_database
Date	April 2, 2019 17:06:35
Msg-id	CABUevEzorGZ-wNzbAhzQj3B+byo20BCWrH-S+tBh+sQykVj2iQ@mail.gmail.com Whole thread
In response to	Re: Checksum errors in pg_stat_database (Michael Paquier <michael@paquier.xyz>)
Responses	Re: Checksum errors in pg_stat_database Re: Checksum errors in pg_stat_database
List	pgsql-hackers

Tree view

On Tue, Apr 2, 2019 at 8:47 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>> One thing which is not
>> proposed on this patch, and I am fine with it as a first draft, is
>> that we don't have any information about the broken block number and
>> the file involved. My gut tells me that we'd want a separate view,
>> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
>> blck) to be complete. But that's just for future work.
>
> That could indeed be nice.

Actually, backpedaling on this one... pg_stat_checksums_details may
be a bad idea as we could finish with one row per broken block. If
a corruption is spreading quickly, pgstat would not be able to sustain
that amount of objects. Having pg_stat_checksums would allow us to
plugin more data easily based on the last failure state:
- last relid of failure
- last fork type of failure
- last block number of failure.
Not saying to do that now, but having that in pg_stat_database does
not seem very natural to me. And on top of that we would have an
extra row full of NULLs for shared objects in pg_stat_database if we
adopt the unique view approach... I find that rather ugly.

I think that tracking each and every block is of course a non-starter, as you've noticed.

I'm really not sure how much those three extra fields help, TBH. As I see it the real usecase for this is automated monitoring and quick-checks of the kind of "is my db currently broken somewhere", in combination with "did this occur recently" (for people who have never looked at their stats).

This gives people enough information to know where to go look in the logs.

I mean, what's the actual usecase for tracking relid/fork/block of the *last* failure only? To monitor and see if it changes? What do I do when I have 10 failures, and I only know about the last one? (I have to go to the logs anyway)

I think having the count and hte last time make sense, but I'm very sceptical about the rest.

I can somewhat agree that splitting it on a per database level might even at that be overdoing it. What might actually be more interesting from a failure-location perspective would be tablespace, rather than any of the others. Or we could reduce it down to just putting it in pg_stat_bgwriter and only count global values perhaps, if in the end we don't think the split-per-database is reasonable?

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

pgsql-hackers by date:

From: Tom Lane
Date: 02 April 2019, 17:03:39
Subject: Re: Inadequate executor locking of indexes

From: David Rowley
Date: 02 April 2019, 17:41:49
Subject: Re: COPY FROM WHEN condition

Re: Checksum errors in pg_stat_database - Mailing list pgsql-hackers

Previous

Next