Re: Corrupt index stopping autovacuum system wide - Mailing list pgsql-general

From Peter Geoghegan
Subject Re: Corrupt index stopping autovacuum system wide
Date
Msg-id CAH2-Wzk3U+3OTQAAWCkgvYNR59V05BknJxbjyP7AVQzEgb9eyg@mail.gmail.com
Whole thread Raw
In response to Re: Corrupt index stopping autovacuum system wide  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-general
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> This might make things worse operationally, though.  If searches aren't
> failing but vacuum is, we'd break a production system that currently
> works.

If searches aren't failing and VACUUM works, then that's probably down
to dumb luck. The user's luck could change at any time (actually, it's
quite possible that the index is already giving wrong answers without
anybody realizing). That's not always true, of course -- you could
have an OOM condition in VACUUM, where it really does make sense to
retry. But it should be true for the category of errors where we
behave more aggressively than just giving up, such as "failed to
re-find parent key" error Aaron noticed.

> Well, vacuum knows what index is being processed.  Maybe you're thinking
> that autovac can get an out-of-memory condition or something like that;
> perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED
> condition is reported (and make sure all such conditions do that.  As
> far as I remember we have a patch for this particular error to be
> reported as such.)

I don't think that it would be that hard to identify errors that
nbtree VACUUM could throw that clearly indicate corruption, without
any hope of the problem self-correcting without the DBA running a
REINDEX. There will be a small amount of gray area, perhaps, but
probably not enough to matter.

> > (c) automatically disabling constraint indexes seems less than desirable.
>
> Disabling them for writes, yeah.

I think that it's fair to say that all bets are off once you see the
"failed to re-find parent key" error, or any other such error that
indicates corruption. Admittedly it isn't 100% clear that disabling
constraint enforcement to unblock autovacuum for the whole cluster is
better than any available alternative; it's really hard to reason
about things when we already know that the database has corruption.

I think that it's okay that almost anything can break when somebody
creates an index on a non-immutable expression (including VACUUM),
provided that all the problems only affect the table with the broken
index. OTOH, taking down the entire Postgres cluster as an indirect
consequence of one person's ill-considered CREATE INDEX really sucks.
That distinction seems important to me.

-- 
Peter Geoghegan



pgsql-general by date:

Previous
From: Perumal Raj
Date:
Subject: Resolved: Looking for Postgres upgrade Metrix
Next
From: Peter Geoghegan
Date:
Subject: Re: Corrupt index stopping autovacuum system wide