Re: RFC: Add 'taint' field to pg_control. - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: RFC: Add 'taint' field to pg_control.
Date
Msg-id CAMsr+YGJqHDP=HkLxAukhVz0R56MTfEj1++t8M-AWb+xFTwZqA@mail.gmail.com
In response to RFC: Add 'taint' field to pg_control.  (Andres Freund <andres@anarazel.de>)
Responses Re: RFC: Add 'taint' field to pg_control.  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 1 March 2018 at 05:43, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand.  Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used
>
> What do others think?


A huge +1 from me for the idea. I can't even count the number of black-box "WTF did you DO?!?" servers I've looked at, where bizarre behaviour has turned out to be down to the user doing something very silly and not saying anything about it.

It's only some flags, so putting it in pg_control is arguably somewhat wasteful but so minor as to be of no real concern. And that's probably the best way to make sure it follows the cluster around no matter what backup/restore/copy mechanisms are used and how "clever" they try to be.

What I'd _really_ love would be to blow the scope of this up a bit and turn it into a key-events cluster journal, recording key parameter changes, recoveries (and LSN ranges), pg_upgrade runs, etc. But then we'd run into people with weird workloads who managed to turn it into some massive file, we'd have to make sure we had a way to stop it getting left out of copies/backups, and it'd generally be irritating. So let's not do that. Proper support for class-based logging and multiple outputs would be a good solution for this at some future point.

What you propose is simple enough to be quick to implement, adds no admin overhead, and will be plenty useful enough.

I'd like to add "postmaster.pid was absent when the cluster started" to this list, please. Sure, it's not conclusive, and there are legit reasons why that might be the case, but so often it's somebody kill -9'ing the postmaster, then removing the postmaster.pid and starting up again without killing the workers....
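
To make the shape of the idea concrete, here is a rough sketch of how such a field could be modelled as sticky bit flags. Everything in it is hypothetical: the flag names, the bit assignments, the TaintedControl struct and the helper functions are made up for illustration and do not exist in PostgreSQL, and a real patch would add the field to the existing pg_control struct and go through the normal control-file update path instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit assignments; none of these names exist in PostgreSQL. */
#define TAINT_RESETXLOG      (UINT32_C(1) << 0)	/* pg_resetxlog used non-passively */
#define TAINT_FSYNC_OFF      (UINT32_C(1) << 1)	/* ran with fsync=off */
#define TAINT_FPW_OFF        (UINT32_C(1) << 2)	/* ran with full_page_writes=off */
#define TAINT_PG_UPGRADE     (UINT32_C(1) << 3)	/* pg_upgrade was used */
#define TAINT_PIDFILE_ABSENT (UINT32_C(1) << 4)	/* postmaster.pid absent at start */

/* Stand-in for the one new field; not the real control file layout. */
typedef struct TaintedControl
{
	uint32_t	taint;
} TaintedControl;

static const struct
{
	uint32_t	bit;
	const char *name;
}			taint_names[] = {
	{TAINT_RESETXLOG, "resetxlog"},
	{TAINT_FSYNC_OFF, "fsync_off"},
	{TAINT_FPW_OFF, "fpw_off"},
	{TAINT_PG_UPGRADE, "pg_upgrade"},
	{TAINT_PIDFILE_ABSENT, "pidfile_absent"},
};

/* Bits are sticky: they can be set but this sketch offers no way to clear them. */
static void
taint_set(TaintedControl *c, uint32_t flag)
{
	c->taint |= flag;
}

static bool
taint_is_set(const TaintedControl *c, uint32_t flag)
{
	return (c->taint & flag) != 0;
}

/*
 * Write a comma-separated list of the set flags into buf.
 * Returns the number of characters written, excluding the terminator.
 * A name that does not fit is dropped rather than cut in half.
 */
static size_t
taint_describe(uint32_t taint, char *buf, size_t len)
{
	size_t		used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(taint_names) / sizeof(taint_names[0]); i++)
	{
		int			n;

		if ((taint & taint_names[i].bit) == 0)
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
					 used > 0 ? "," : "", taint_names[i].name);
		if (n < 0 || (size_t) n >= len - used)
		{
			buf[used] = '\0';	/* discard the partially written name */
			break;
		}
		used += (size_t) n;
	}
	return used;
}
```

Under these assumptions, the startup code would call taint_set() whenever it noticed one of the conditions, and a tool reading the control file could print the taint_describe() output next to the other fields.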

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
