Re: new heapcheck contrib module - Mailing list pgsql-hackers

From Robert Haas
Subject Re: new heapcheck contrib module
Msg-id CA+TgmoY+YQ1PqMGr6GcgsAp8SLX5zntbL_HVJ7Us99NSkMrJ9Q@mail.gmail.com
In response to Re: new heapcheck contrib module  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Nov 19, 2020 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Ideally heapallindexed verification would verify 1:1 correspondence. It
> doesn't do that right now, but it could.

Well, that might be a cool new mode, but it doesn't necessarily have
to supplant the thing we have now. The problem immediately before us
is just making sure that the user can understand what we will and
won't be checking.
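
To make it concrete, here's roughly what we expose today at the SQL
level (just a sketch; 'my_index' is a placeholder):

    -- heapallindexed verifies that every heap tuple that ought to be
    -- indexed has a matching entry in the index; it does not check the
    -- reverse direction, so it is not a 1:1 correspondence check.
    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('my_index'::regclass, heapallindexed => true);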

> My thoughts on these two options:
>
> * I don't think that users will ever want rootdescend verification.

That seems too absolute. I think it's fine to say, we don't think that
users will want this, so let's not do it by default. But if it's so
useless as to not be worth a command-line option, then it was a
mistake to put it into contrib at all. Let's expose all the things we
have, and try to set the defaults according to what we expect to be
most useful.
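
For the record, the SQL level already exposes rootdescend, so hiding it
from pg_amcheck would be a bit odd (again just a sketch; 'my_index' is a
placeholder, and the eventual command-line spelling is a separate
question):

    -- rootdescend re-finds each leaf tuple by performing a new search
    -- from the root of the B-tree; it is only available with the parent
    -- check, which takes stronger locks.
    SELECT bt_index_parent_check('my_index'::regclass,
                                 heapallindexed => true,
                                 rootdescend => true);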

> * heapallindexed is kind of expensive, but valuable. But the extra
> check is probably less likely to help on the second or subsequent
> index on a table.
>
> It might be worth considering an option that uses it with only one
> index: preferably the primary key index, failing that some unique
> index, and failing that some other index.

This seems a bit too clever for me. I would prefer a simpler scheme,
where we choose the default we think most people will want and use it
for everything -- and allow the user to override.
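
For what it's worth, a user who really wanted that one-index-per-table
behavior could get pretty close with a catalog query, without us baking
the heuristic into the tool. A rough sketch (it ignores details like
invalid indexes):

    -- Pick one btree index per table for heapallindexed verification:
    -- prefer the primary key, then some unique index, then anything else.
    SELECT DISTINCT ON (i.indrelid)
           i.indexrelid::regclass AS index_to_check
    FROM pg_index i
    JOIN pg_class ic ON ic.oid = i.indexrelid
    JOIN pg_am am ON am.oid = ic.relam
    WHERE am.amname = 'btree'
    ORDER BY i.indrelid, i.indisprimary DESC, i.indisunique DESC;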

> Even if your user is just average, they still have one major advantage
> over the architects of pg_amcheck: actual knowledge of the problem in
> front of them.

Quite so.

> I think that you need to have a kind of epistemic modesty with this
> stuff. Okay, we guarantee that the backend won't crash when certain
> amcheck functions are run, based on these caveats. But don't we always
> guarantee something like that? And are the specific caveats actually
> that different in each case, when you get right down to it? A
> guarantee does not exist in a vacuum. It always has implicit
> limitations. For example, any guarantee implicitly comes with the
> caveat "unless I, the guarantor, am wrong".

Yep.

> I'm also suspicious of guarantees like this for less philosophical
> reasons. It seems to me like it solves our problem rather than the
> user's problem. Having data that is so badly corrupt that it's
> difficult to avoid segfaults when we perform some kind of standard
> transformations on it is an appalling state of affairs for the user.
> The segfault itself is very much not the point at all.

I mostly agree with everything you say here, but I think we need to be
careful not to accept the position that seg faults are no big deal.
Consider the following users, all of whom start with a database that
they believe to be non-corrupt:

Alice runs pg_amcheck. It says that nothing is wrong, and that happens
to be true.
Bob runs pg_amcheck. It says that there are problems, and there are.
Carol runs pg_amcheck. It says that nothing is wrong, but in fact
something is wrong.
Dan runs pg_amcheck. It says that there are problems, but in fact
there are none.
Erin runs pg_amcheck. The server crashes.

Alice and Bob are clearly in the best shape here, but Carol and Dan
arguably haven't been harmed very much. Sure, Carol enjoys a false
sense of security, but since she otherwise believed things were OK,
the impact of whatever problems exist is evidently not that bad. Dan
is worrying over nothing, but the damage is only to his psyche, not
his database; we can hope he'll eventually sort out what has happened
without grave consequences. Erin, on the other hand, is very possibly
in a lot of trouble with her boss and her coworkers. She had what
seemed to be a healthy database, and from their perspective, she shot
it in the head without any real cause. It will be faint consolation to
her and her coworkers that the database was corrupt all along: until
she ran the %$! tool, they did not have a problem that affected the
ability of their business to generate revenue. Now they have had an
outage, and that does.

While I obviously haven't seen this exact scenario play out for a
customer, because pg_amcheck is not committed, I have seen similar
scenarios over and over. It's REALLY bad when the database goes down.
Then the application goes down, and then it gets really ugly. As long
as the database was just returning wrong answers or eating data,
nobody's boss really cared that much, but now that it's down, they
care A LOT. This is of course not to say that nobody cares about the
accuracy of results from the database: many people care a lot, and
that's why it's good to have tools like this. But we should not
underestimate the horror caused by a crash. A working database, even
with some wrong data in it, is a problem people would probably like to
get fixed. A down database is an emergency. So I think we should
actually get a lot more serious about ensuring that corrupt data on
disk doesn't cause crashes, even for regular SELECT statements. I
don't think we can take an arbitrary performance hit to get there,
which is a challenge, but I do think that even a brief outage is
nothing to take lightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


