On 2019-Jul-17, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Jul-17, Peter Geoghegan wrote:
> >> Maybe nbtree VACUUM should do something more aggressive than give up
> >> when there is a "failed to re-find parent key" or similar condition.
> >> Perhaps it would make more sense to make the index inactive (for some
> >> value of "inactive") instead of just complaining. That might be the
> >> least worst option, all things considered.
>
> > Maybe we can mark an index as unvacuumable in some way? As far as I
> > understand, all queries using that index work, as do index updates; it's
> > just vacuuming that fails. If we mark the index as unvacuumable, then
> > vacuum just skips it (and does not run phase 3 for that table), and
> > things can proceed; the table's age can still be advanced. Obviously
> > it'll result in more bloat than in normal condition, but it shouldn't
> > cause the whole cluster to go down.
>
> If an index is corrupt enough to break vacuum, I think it takes a rather
> large leap of faith to believe that it's not going to cause problems for
> inserts or searches.
Maybe, but it's what happened in the reported case. (Note Aaron was
careful to do the index replacement concurrently -- he wouldn't have
done that if the table wasn't in active use.)
> I'd go with just marking the index broken and
> insisting that it be REINDEX'd before we touch it again.
This might make things worse operationally, though. If searches aren't
failing but vacuum is, we'd break a production system that currently
works.
> (a) once the transaction's failed, you can't go making catalog updates;
Maybe we can defer the actual update to some other transaction -- say
register an autovacuum work-item, which can be executed separately.
> (b) even when you know the transaction's failed, blaming it on a
> particular index seems a bit chancy;
Well, vacuum knows what index is being processed. Maybe you're thinking
that autovac can get an out-of-memory condition or something like that;
perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED
condition is reported (and make sure all such conditions do that. As
far as I remember we have a patch for this particular error to be
reported as such.)
> (c) automatically disabling constraint indexes seems less than desirable.
Disabling them for writes, yeah.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services