Re: Backends stalled in 'startup' state: index corruption - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Backends stalled in 'startup' state: index corruption
Msg-id 27844.1338148415@sss.pgh.pa.us
In response to Re: Backends stalled in 'startup' state: index corruption  (Greg Sabino Mullane <greg@endpoint.com>)
List pgsql-hackers
I've been continuing to poke at this business of relcache-related
startup stalls, and have come to some new conclusions.  One is that
it now seems quite unlikely that the pg_attribute rows for system
catalogs are anywhere other than at the front of pg_attribute, because
the commands that might be used to update them are disallowed even to
superusers:

foo=# alter table pg_index alter column indnatts set statistics -1;
ERROR:  permission denied: "pg_index" is a system catalog
foo=# alter table pg_index alter column indnatts set (n_distinct = -1);
ERROR:  permission denied: "pg_index" is a system catalog

Now a sufficiently stubborn DBA could override that protection (via
direct manual UPDATE on pg_attribute, or by doing ALTER TABLE in a
standalone backend with allowSystemTableMods set).  But it doesn't seem
like something anybody would try in a production database.  So while
I'd still like to see the results of that ctid query in Jeff's "broken"
database, it seems pretty unlikely that we will see high block numbers.
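For reference, something along these lines is what I have in mind
(re-sketched here rather than copied from my earlier message); the
block-number half of each ctid ought to point at the first few pages
of the relation if nothing has relocated those rows:

    SELECT attrelid::regclass, attname, ctid
    FROM pg_attribute
    WHERE attrelid < 16384;   -- below FirstNormalObjectId, i.e. system catalogs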

But, if all those rows are at the front of pg_attribute, how do we
explain the high block numbers seen in Jeff's stack traces?  He posted
three traces that show RelationBuildTupleDesc doing heapscans:
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00186.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00196.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00197.php
and in each one of these, the block number being passed to
ReadBufferExtended is well past where those rows ought to be.
What's more, the three traces are performing the pg_attribute heapscan
for three different catalogs; so to blame this observation on some
rows having gotten relocated to high block numbers, you'd have to
believe that had happened for all three of these catalogs (one of
which is an index, so there'd be no reason to fool with its stats
attributes anyway).

AFAICS, if there are no rows at high block numbers, the only other
possibility is that syncscan mode was active.  (A third explanation is
that some rows are just plain missing, but then the system would have
been erroring out altogether, which it was not.)  That theory requires
us to invent an explanation for the fact that Jeff now observes
pg_attribute to be slightly smaller than 1/4th shared_buffers, but
perhaps autovacuum managed to shrink it at some point after the trouble
happened.
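For anyone who wants to check that threshold on their own
installation, a comparison along these lines (just a sketch, not
something Jeff has run) puts the two numbers side by side; heapam only
considers a table for synchronized scanning once it exceeds a quarter
of shared_buffers, and pg_settings reports shared_buffers in units of
the block size, so both columns below are in blocks:

    SELECT pg_relation_size('pg_attribute')
             / current_setting('block_size')::bigint    AS pg_attribute_blocks,
           (SELECT setting::bigint FROM pg_settings
             WHERE name = 'shared_buffers') / 4          AS syncscan_threshold_blocks;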

Another thing that can be deduced from those stack traces is that sinval
resets were happening.  For example, in the third message linked above,
the heapscan is being done to load up a relcache entry for relation 2601
(pg_am).  This would be unsurprising, except that stack frames 17 and 18
show this is happening during the fifth call of load_critical_index, and
we should have had both pg_am's reldesc and the syscache entry that's
being fetched at relcache.c:1010 loaded up in the first call.  So those
cache entries have to have gotten blown away by sinval reset.  This is
not terribly surprising if there were a steady flux of DDL happening in
the database, for instance temp table creations/drops.  (Jeff, can you
comment on that?)  If heapscans over the whole of pg_attribute were
occurring, they'd take long enough to expose the process to sinval
overrun even with not-very-high DDL rates.
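To make that concrete, a toy loop like the one below (purely
illustrative; "churn_tmp" is a made-up name, and the DO syntax needs
9.0 or later) would generate exactly the sort of steady sinval traffic
I have in mind: every temp-table create/drop inserts and then deletes
pg_class/pg_attribute rows, and the resulting invalidation messages
have to be absorbed by every other backend:

    DO $$
    BEGIN
      FOR i IN 1..1000 LOOP
        -- each iteration adds catalog rows (and sinval messages) ...
        EXECUTE 'CREATE TEMP TABLE churn_tmp (id int, payload text)';
        -- ... and then deletes them again, leaving dead rows behind
        EXECUTE 'DROP TABLE churn_tmp';
      END LOOP;
    END;
    $$;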

Now the interesting thing about that is that if an sinval reset occurs
at any time while relcache.c is trying to load up initial relcache
entries, it will not try to write a new relcache init file.  This means
that a steady but not very large flow of DDL activity from existing
sessions in the database would be sufficient to prevent incoming
sessions from ever writing pg_internal.init, and thus the stall behavior
could persist for a long time, which is something I hadn't been able to
explain before.  (If this were not happening, the stalls would disappear
as soon as any incoming session successfully got through its stall.)
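One way to see whether that is actually happening on a live system
(a diagnostic sketch of mine, not something from the thread) is to
watch the modification time of the init file itself, for instance with
the superuser-only pg_stat_file() function; if incoming sessions were
managing to finish, the timestamp would keep advancing, whereas under
a steady stream of sinval resets it will sit still, or the file may be
missing altogether (in which case pg_stat_file simply errors out):

    SELECT modification
    FROM pg_stat_file('base/'
           || (SELECT oid FROM pg_database WHERE datname = current_database())
           || '/pg_internal.init');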

So it appears to me that Jeff's problem is adequately explained if we
assume (1) pg_attribute a bit larger than 1/4th shared_buffers, and
(2) steady flow of DDL activity from existing sessions; and that if you
want to dispute (1) you have to invent some other explanation for the
observed attempts to read high block numbers in pg_attribute.  Also,
if there were lots of temp table creation/dropping going on, this could
result in lots of dead rows at the end of pg_attribute, which might
explain how autovacuum could've shortened the relation later.
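The dead-row half of that is at least checkable after the fact;
something like this (again just a suggestion, not output we have from
Jeff's system) would show whether pg_attribute has been carrying a lot
of dead tuples and when autovacuum last got around to it:

    SELECT n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
    FROM pg_stat_sys_tables
    WHERE relname = 'pg_attribute';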

Now this theory does not explain Greg's problem, because his
pg_attribute was apparently many times too small to make syncscans
credible, or to make them take very long even if they did happen.
But in view of the fact that that was 8.3.5 :-(, and that the problem
was observed in conjunction with corruption of system indexes, I don't
feel a compulsion to assume that Greg's problem was the same as Jeff's.
Trawling through the commit logs I find several post-8.3.5 fixes for
race conditions and other bugs in relcache initialization, notably
commits 8407a11c5a004671142a9ac2fe29de3844360b0b and
8b720b5723fcd59c22e641e05b62dc92f53c9125.  Greg mentioned lots of temp
table creation/deletion, which would produce sinval reports and the
occasional sinval reset, which is a necessary triggering condition for
both of those bugs.  The previously recognized symptoms of those bugs
look a bit different from what Greg reported, but I can easily believe
that they'd lead to index corruption if they didn't cause immediate
crashes, and maybe they explain his stall behavior as well.

So at this point I'm feeling that my commits of yesterday to suppress
syncscan during relcache rebuild represent an adequate response to these
reports.  Yesterday I proposed also trying to suppress parallel rebuilds
of the init files, but today I'm afraid that that might not be a good
idea.  The thing that is bothering me is that in the presence of a
steady stream of sinval resets, incoming backends might never be able to
write new relcache init files, in which case the proposed new lock would
just serialize admittance to the database without actually saving any
work.  The "thundering herd" problem only arises if relcache
initialization can take a long time (i.e. a large fraction of a second)
so that many new arrivals can show up and create a resource saturation
condition.  So far as I can tell, that will not happen once we suppress
syncscans here.  When operating as designed, those heapscans only touch
a few buffers' worth of pg_attribute, so the data will all stay safely
cached; hence the initialization shouldn't be that expensive even if
many incoming sessions have to do it.
        regards, tom lane

