Backends stalled in 'startup' state: index corruption - Mailing list pgsql-hackers

From Greg Sabino Mullane
Subject Backends stalled in 'startup' state: index corruption
Date
Msg-id 20120524173325.GA5533@tinybird.home
Whole thread Raw
Responses Re: Backends stalled in 'startup' state: index corruption
List pgsql-hackers
Yesterday I had a client that experienced a sudden high load on
one of their servers (8.3.5 - yes, I know. Those of you with
clients will understand). When I checked, almost all connections
were in a "startup" state, very similar to this thread:

http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5 minute plus wait before a query was issued.
So obviously, blocking on something. Unlike the thread above, I *did* find
problems in the system catalogs. For example, both pg_class and pg_index
gave warnings like this for every index during a VACUUM FULL
VERBOSE tablename:

WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed
by a VAC showed the same warning and hint.

The next step was dropping to standalone mode, but before that could
be done, the REINDEXes fixed the problem (no warnings, no stalled
connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested?
Why would it then start "working"?

* What exactly is the relationship between bad indexes and shared
memory locks?

* Is there some other emergency fix when it happens, such as killing
all backends and hoping you kill the one that is actually holding
the lock (if any was).

* Did anything in the 8.3 series fix this? I saw nothing relevant in
the release notes for everything up to 8.3.18 (which it will be on soon).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

pgsql-hackers by date:

Previous
From: Florian Pflug
Date:
Subject: Re: [RFC] Interface of Row Level Security
Next
From: Alex Hunsaker
Date:
Subject: Re: plperl_helpers.h fix for clang