Re: Backends stalled in 'startup' state: index corruption - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Backends stalled in 'startup' state: index corruption
Date:
Msg-id: 27844.1338148415@sss.pgh.pa.us
In response to: Re: Backends stalled in 'startup' state: index corruption (Greg Sabino Mullane <greg@endpoint.com>)
Responses: Re: Backends stalled in 'startup' state: index corruption
List: pgsql-hackers
I've been continuing to poke at this business of relcache-related startup stalls, and have come to some new conclusions. One is that it no longer seems at all likely that the pg_attribute rows for system catalogs aren't at the front of pg_attribute, because the commands that might be used to update them are disallowed even to superusers:

foo=# alter table pg_index alter column indnatts set statistics -1;
ERROR: permission denied: "pg_index" is a system catalog
foo=# alter table pg_index alter column indnatts set (n_distinct = -1);
ERROR: permission denied: "pg_index" is a system catalog

Now a sufficiently stubborn DBA could override that protection (via direct manual UPDATE on pg_attribute, or by doing ALTER TABLE in a standalone backend with allowSystemTableMods set). But it doesn't seem like something anybody would try in a production database. So while I'd still like to see the results of that ctid query in Jeff's "broken" database, it seems pretty unlikely that we will see high block numbers.

But, if all those rows are at the front of pg_attribute, how do we explain the high block numbers seen in Jeff's stack traces? He posted three traces that show RelationBuildTupleDesc doing heapscans:

http://archives.postgresql.org/pgsql-bugs/2012-04/msg00186.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00196.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00197.php

and in each one of these, the block number being passed to ReadBufferExtended is well past where those rows ought to be. What's more, the three traces are performing the pg_attribute heapscan for three different catalogs; so to blame this observation on some rows having gotten relocated to high block numbers, you'd have to believe that had happened for all three of these catalogs (one of which is an index, so there'd be no reason to fool with its stats attributes anyway). AFAICS, if there are not rows at high block numbers, the only other possibility is that syncscan mode was active. (A third explanation is that some rows are just plain missing, but then the system would have been erroring out altogether, which it was not.) That theory requires us to invent an explanation for the fact that Jeff now observes pg_attribute to be slightly smaller than 1/4th shared_buffers, but perhaps autovacuum managed to shrink it at some point after the trouble happened.
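For reference, a check along these lines (just a sketch -- this is not the exact ctid query discussed earlier in the thread, and the list of catalogs is only illustrative) would show where the relevant pg_attribute rows actually live, and how pg_attribute's size compares to the shared_buffers/4 syncscan threshold:

-- Block numbers (the first component of ctid) of the pg_attribute rows
-- describing a few of the catalogs named in the stack traces; anything
-- much past the first few blocks would be suspicious.
SELECT attrelid::regclass AS catalog, attname, ctid
FROM pg_attribute
WHERE attrelid IN ('pg_class'::regclass, 'pg_attribute'::regclass,
                   'pg_index'::regclass, 'pg_am'::regclass)
ORDER BY attrelid, attnum;

-- pg_attribute's size versus one quarter of shared_buffers, the point
-- at which its sequential scans become candidates for syncscan.
SELECT pg_relation_size('pg_attribute') AS pg_attribute_bytes,
       (SELECT setting::bigint FROM pg_settings
         WHERE name = 'shared_buffers')
       * current_setting('block_size')::bigint / 4 AS syncscan_threshold_bytes;

Neither query relies on anything beyond the standard catalogs and size/settings functions, so it should be safe to run against a live database.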
Another thing that can be deduced from those stack traces is that sinval resets were happening. For example, in the third message linked above, the heapscan is being done to load up a relcache entry for relation 2601 (pg_am). This would be unsurprising, except that stack frames 17 and 18 show this is happening during the fifth call of load_critical_index, and we should have had both pg_am's reldesc and the syscache entry that's being fetched at relcache.c:1010 loaded up in the first call. So those cache entries have to have gotten blown away by sinval reset. This is not terribly surprising if there were a steady flux of DDL happening in the database, for instance temp table creations/drops. (Jeff, can you comment on that?) If heapscans over the whole of pg_attribute were occurring, they'd take long enough to expose the process to sinval overrun even with not-very-high DDL rates.

Now the interesting thing about that is that if an sinval reset occurs at any time while relcache.c is trying to load up initial relcache entries, it will not try to write a new relcache init file. This means that a steady but not very large flow of DDL activity from existing sessions in the database would be sufficient to prevent incoming sessions from ever writing pg_internal.init, and thus the stall behavior could persist for a long time, which is something I hadn't been able to explain before. (If this were not happening, the stalls would disappear as soon as any incoming session successfully got through its stall.)

So it appears to me that Jeff's problem is adequately explained if we assume (1) pg_attribute a bit larger than 1/4th shared_buffers, and (2) steady flow of DDL activity from existing sessions; and that if you want to dispute (1) you have to invent some other explanation for the observed attempts to read high block numbers in pg_attribute. Also, if there were lots of temp table creation/dropping going on, this could result in lots of dead rows at the end of pg_attribute, which might explain how autovacuum could've shortened the relation later.

Now this theory does not explain Greg's problem, because his pg_attribute was apparently many times too small to make syncscans credible, or to make them take very long even if they did happen. But in view of the fact that that was 8.3.5 :-(, and that the problem was observed in conjunction with corruption of system indexes, I don't feel a compulsion to assume that Greg's problem was the same as Jeff's. Trawling through the commit logs I find several post-8.3.5 fixes for race conditions and other bugs in relcache initialization, notably commits 8407a11c5a004671142a9ac2fe29de3844360b0b and 8b720b5723fcd59c22e641e05b62dc92f53c9125. Greg mentioned lots of temp table creation/deletion, which would produce sinval reports and the occasional sinval reset, which is a necessary triggering condition for both of those bugs. The previously recognized symptoms of those bugs look a bit different from what Greg reported, but I can easily believe that they'd lead to index corruption if they didn't cause immediate crashes, and maybe they explain his stall behavior as well.

So at this point I'm feeling that my commits of yesterday to suppress syncscan during relcache rebuild represent an adequate response to these reports. Yesterday I proposed also trying to suppress parallel rebuilds of the init files, but today I'm afraid that that might not be a good idea. The thing that is bothering me is that in the presence of a steady stream of sinval resets, incoming backends might never be able to write new relcache init files, in which case the proposed new lock would just serialize admittance to the database without actually saving any work. The "thundering herd" problem only arises if relcache initialization can take a long time (i.e. a large fraction of a second) so that many new arrivals can show up and create a resource saturation condition. So far as I can tell, that will not happen once we suppress syncscans here. When operating as designed, those heapscans only touch a few buffers' worth of pg_attribute, so the data will all stay safely cached; hence the initialization shouldn't be that expensive even if many incoming sessions have to do it.

regards, tom lane
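As a further sketch along the same lines (pg_stat_sys_tables is the standard statistics view; the exact set of columns available varies a bit across versions), the temp-table-churn-plus-autovacuum theory could be checked by watching whether pg_attribute accumulates dead rows and when autovacuum last processed it:

-- Dead-row accumulation and autovacuum activity on pg_attribute.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_sys_tables
WHERE relname = 'pg_attribute';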