Re: Corrupt index stopping autovacuum system wide - Mailing list pgsql-general

From Peter Geoghegan
Subject Re: Corrupt index stopping autovacuum system wide
Date
Msg-id CAH2-Wzm9boEusD8pyz8S2eXey01VC6PZmqAegOgUO+yRCQgTiA@mail.gmail.com
In response to Re: Corrupt index stopping autovacuum system wide  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Corrupt index stopping autovacuum system wide
List pgsql-general
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, you're eventually going to get to a forced shutdown if vacuum never
> succeeds on one table; no question that that's bad.

It occurs to me that we use operator class/insertion scankey
comparisons within page deletion, to relocate a leaf page that looks
like a candidate for deletion. Despite this, README.HOT claims:

"Standard vacuuming scans the indexes to ensure all such index entries
are removed, amortizing the index scan cost across as many dead tuples
as possible; this approach does not scale down well to the case of
reclaiming just a few tuples.  In principle one could recompute the
index keys and do standard index searches to find the index entries,
but this is risky in the presence of possibly-buggy user-defined
functions in functional indexes.  An allegedly immutable function that
in fact is not immutable might prevent us from re-finding an index
entry"

That probably wasn't the problem in Aaron's case, but it is worth
considering as a possibility.
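
To make the README's warning concrete, here is a toy sketch in Python. It is not PostgreSQL code: build_index, refind, and shaky_key are hypothetical names, and a sorted list stands in for a B-tree. It only shows how recomputing a key with a function that has quietly changed behavior can fail to re-find an entry that is still physically present.

```python
import bisect

def build_index(rows, key_func):
    """Return a sorted list of (key, row_id) pairs, a stand-in for an ordered index."""
    return sorted((key_func(value), row_id) for row_id, value in rows.items())

def refind(index, rows, row_id, key_func):
    """Recompute the key for row_id and binary-search for its index entry."""
    probe = (key_func(rows[row_id]), row_id)
    pos = bisect.bisect_left(index, probe)
    return pos < len(index) and index[pos] == probe

rows = {1: "apple", 2: "Banana", 3: "cherry"}

# A truly immutable key function: the entry can always be re-found.
stable_index = build_index(rows, str.lower)
found_stable = refind(stable_index, rows, 2, str.lower)

# An "allegedly immutable" function whose result depends on hidden state.
mode = {"fold_case": True}

def shaky_key(value):
    return value.lower() if mode["fold_case"] else value

shaky_index = build_index(rows, shaky_key)
mode["fold_case"] = False  # behavior changes after the index was built
found_shaky = refind(shaky_index, rows, 2, shaky_key)

print(found_stable, found_shaky)
```

In the second case the entry for row 2 is still in the list, but the recomputed key sorts somewhere else, so the search misses it.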

> My concern here is
> that if we have blinders on to the extent of only processing that one
> table or DB, we're unnecessarily allowing bloat to occur in other tables,
> and causing that missed vacuuming work to pile up so that there's more of
> it to be done once the breakage is cleared.  If the DBA doesn't notice the
> problem until getting into a forced shutdown, that is going to extend his
> outage time --- and, in a really bad worst case, maybe make the difference
> between being able to recover at all and not.

The comment about "...any db at risk of Xid wraparound..." within
do_start_worker() hints at such a problem.
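
One reading of that comment can be sketched as a toy model in Python. This is an assumption-laden simplification, not the real logic: Db, pick_database, frozenxid_age, and force_limit are hypothetical names, plain integer ages replace wraparound-aware XID comparisons, and multixacts are ignored. The point is only that a database whose vacuum never succeeds stays "at risk" and keeps being chosen ahead of the others.

```python
from dataclasses import dataclass

@dataclass
class Db:
    name: str
    frozenxid_age: int   # hypothetical: age of the oldest unfrozen XID
    last_autovac: float  # hypothetical: time of the last autovacuum pass

def pick_database(dbs, force_limit):
    """Prefer any database past the wraparound limit, oldest first;
    otherwise take the least recently autovacuumed one."""
    at_risk = [d for d in dbs if d.frozenxid_age > force_limit]
    if at_risk:
        return max(at_risk, key=lambda d: d.frozenxid_age)
    return min(dbs, key=lambda d: d.last_autovac)

dbs = [
    Db("broken", 210_000_000, 50.0),  # vacuum fails here on a corrupt index
    Db("app", 5_000_000, 10.0),
    Db("logs", 9_000_000, 20.0),
]

picks = []
for tick in range(3):
    chosen = pick_database(dbs, force_limit=200_000_000)
    picks.append(chosen.name)
    chosen.last_autovac = 100.0 + tick
    if chosen.name != "broken":
        chosen.frozenxid_age = 0  # a successful vacuum advances the horizon
    # For "broken" the age never drops, so it remains at risk.

print(picks)
```

In this model "app" and "logs" are never selected while "broken" remains over the limit, which is the kind of starvation the quoted concern describes.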

Maybe nbtree VACUUM should do something more aggressive than give up
when there is a "failed to re-find parent key" or similar condition.
Perhaps it would make more sense to make the index inactive (for some
value of "inactive") instead of just complaining. That might be the
least worst option, all things considered.

--
Peter Geoghegan


