Re: Corrupt index stopping autovacuum system wide - Mailing list pgsql-general

From Alvaro Herrera
Subject Re: Corrupt index stopping autovacuum system wide
Date
Msg-id 20190717184305.GA25848@alvherre.pgsql
In response to Re: Corrupt index stopping autovacuum system wide  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Corrupt index stopping autovacuum system wide
List pgsql-general
On 2019-Jul-17, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Jul-17, Peter Geoghegan wrote:
> >> Maybe nbtree VACUUM should do something more aggressive than give up
> >> when there is a "failed to re-find parent key" or similar condition.
> >> Perhaps it would make more sense to make the index inactive (for some
> >> value of "inactive") instead of just complaining. That might be the
> >> least worst option, all things considered.
> 
> > Maybe we can mark an index as unvacuumable in some way?  As far as I
> > understand, all queries using that index work, as do index updates; it's
> > just vacuuming that fails.  If we mark the index as unvacuumable, then
> > vacuum just skips it (and does not run phase 3 for that table), and
> > things can proceed; the table's age can still be advanced.  Obviously
> > it'll result in more bloat than in normal condition, but it shouldn't
> > cause the whole cluster to go down.
> 
> If an index is corrupt enough to break vacuum, I think it takes a rather
> large leap of faith to believe that it's not going to cause problems for
> inserts or searches.

Maybe, but it's what happened in the reported case.  (Note Aaron was
careful to do the index replacement concurrently -- he wouldn't have
done that if the table wasn't in active use.)

> I'd go with just marking the index broken and
> insisting that it be REINDEX'd before we touch it again.

This might make things worse operationally, though.  If searches aren't
failing but vacuum is, we'd break a production system that currently
works.

> (a) once the transaction's failed, you can't go making catalog updates; 

Maybe we can defer the actual update to some other transaction -- say
register an autovacuum work-item, which can be executed separately.

> (b) even when you know the transaction's failed, blaming it on a
> particular index seems a bit chancy; 

Well, vacuum knows which index is being processed.  Maybe you're thinking
that autovac can get an out-of-memory condition or something like that;
perhaps we can limit the above to cases where an ERRCODE_DATA_CORRUPTED
condition is reported (and make sure all corruption checks report that
code).  As far as I remember, we have a patch for this particular error
to be reported as such.
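
To make the shape of that proposal concrete, here is a rough sketch in
Python rather than PostgreSQL's C.  It is only a toy model under stated
assumptions: VacuumError, VacuumPolicy, vacuum_indexes and run_work_items
are hypothetical names, not PostgreSQL APIs.  The only real-world detail
is that ERRCODE_DATA_CORRUPTED corresponds to SQLSTATE XX001.  The failing
vacuum still fails; it only queues a work-item, and the "unvacuumable"
mark is applied later, outside the failed transaction.

```python
from dataclasses import dataclass, field

# SQLSTATE for ERRCODE_DATA_CORRUPTED; everything else below is hypothetical.
ERRCODE_DATA_CORRUPTED = "XX001"


class VacuumError(Exception):
    """Hypothetical error carrying a SQLSTATE, standing in for ereport(ERROR)."""

    def __init__(self, sqlstate, message):
        super().__init__(message)
        self.sqlstate = sqlstate


@dataclass
class VacuumPolicy:
    # Indexes that later vacuums skip (stand-in for a catalog flag).
    unvacuumable: set = field(default_factory=set)
    # Deferred requests (stand-in for autovacuum work-items).
    work_items: list = field(default_factory=list)

    def vacuum_indexes(self, indexes, vacuum_one):
        """Vacuum each index in turn; return the names actually processed."""
        done = []
        for name in indexes:
            if name in self.unvacuumable:
                continue  # skip an index already marked as unvacuumable
            try:
                vacuum_one(name)
                done.append(name)
            except VacuumError as err:
                # Only a corruption report blames the index; other errors
                # (out of memory and the like) queue nothing.
                if err.sqlstate == ERRCODE_DATA_CORRUPTED:
                    self.work_items.append(name)
                raise  # the current vacuum still fails either way
        return done

    def run_work_items(self):
        """Apply the deferred marks, as a separate transaction would."""
        while self.work_items:
            self.unvacuumable.add(self.work_items.pop(0))
```

In this model the first vacuum of a table with a corrupt index raises as
it does today, run_work_items() later records the mark, and the next
vacuum skips that index and finishes the others.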

> (c) automatically disabling constraint indexes seems less than desirable.

Disabling them for writes, yeah.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


