Thread: Backends stalled in 'startup' state: index corruption
Yesterday I had a client that experienced a sudden high load on one of
their servers (8.3.5 - yes, I know. Those of you with clients will
understand). When I checked, almost all connections were in a "startup"
state, very similar to this thread:
http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5-minute-plus wait before a query was issued.
So obviously, it was blocking on something. Unlike the thread above, I *did*
find problems in the system catalogs. For example, both pg_class and
pg_index gave warnings like this for every index during a
VACUUM FULL VERBOSE tablename:

WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed by a
VAC showed the same warning and hint. The next step was dropping to
standalone mode, but before that could be done, the REINDEXes fixed the
problem (no warnings, no stalled connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested? Why would
  it then start "working"?
* What exactly is the relationship between bad indexes and shared memory
  locks?
* Is there some other emergency fix when it happens, such as killing all
  backends and hoping you kill the one that is actually holding the lock
  (if any was)?
* Did anything in the 8.3 series fix this? I saw nothing relevant in the
  release notes for everything up to 8.3.18 (which it will be on soon).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
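For reference, the catalog maintenance described above amounts to roughly
the following; a paraphrased sketch, not the exact session that was run:

vacuum full verbose pg_catalog.pg_class;
vacuum full verbose pg_catalog.pg_index;
-- the step the HINT asks for:
reindex table pg_catalog.pg_class;
reindex table pg_catalog.pg_index;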
Greg Sabino Mullane <greg@endpoint.com> writes:
> Yesterday I had a client that experienced a sudden high load on
> one of their servers (8.3.5 - yes, I know. Those of you with
> clients will understand). When I checked, almost all connections
> were in a "startup" state, very similar to this thread:
> http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

> Running a strace showed a lot of semop activity, and the logs showed a
> successful connection, then a 5 minute plus wait before a query was issued.
> So obviously, blocking on something.

Did you check I/O activity?  I looked again at Jeff Frost's report and
now think that what he saw was probably a lot of seqscans on bloated
system catalogs, cf
http://archives.postgresql.org/message-id/28484.1337887297@sss.pgh.pa.us

> Unlike the thread above, I *did* find
> problems in the system catalogs. For example, both pg_class and pg_index
> gave warnings like this for every index during a VACUUM FULL
> VERBOSE tablename:
> WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions,
> but table contains 9471 row versions
> HINT: Rebuild the index with REINDEX.

That's fairly interesting, but if it was a bloat situation then it
would've been the VAC FULL that fixed it rather than the REINDEX.
Did you happen to save the VERBOSE output?  It'd be really useful to
know whether there was any major shrinkage of the core catalogs
(esp. pg_class, pg_attribute).

> * Did anything in the 8.3 series fix this?

I think there are probably two independent issues here.  The missing
index entries are clearly bad but it's not clear that they had anything
to do with the startup stall.  There are a couple of fixes in recent
8.3.x releases that might possibly explain the index corruption,
especially if you're in the habit of reindexing the system catalogs
frequently.

            regards, tom lane
On Thu, May 24, 2012 at 03:54:54PM -0400, Tom Lane wrote:
> Did you check I/O activity?  I looked again at Jeff Frost's report and
> now think that what he saw was probably a lot of seqscans on bloated
> system catalogs, cf
> http://archives.postgresql.org/message-id/28484.1337887297@sss.pgh.pa.us

Thank you for the response. Yes, we did look at I/O, but nothing unusual
was seen that would explain the load. If it happens again I'll see if the
system catalogs are getting a lot of seqscans.

> That's fairly interesting, but if it was a bloat situation then it
> would've been the VAC FULL that fixed it rather than the REINDEX.
> Did you happen to save the VERBOSE output?  It'd be really useful to
> know whether there was any major shrinkage of the core catalogs
> (esp. pg_class, pg_attribute).

I did have them in screen, but the home screen box crashed just a few hours
ago (after weeks of uptime: Murphy's law). It certainly could have been the
VAC FULL, as we have processes that do both VAC FULL and REINDEX
periodically; I simply assumed based on the HINT that it was the REINDEX
that cleared it up. The only snippet I have is:

# vacuum full verbose pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 43 removable, 10376 nonremovable row versions in 518 pages
DETAIL: 6078 dead row versions cannot be removed yet.
Nonremovable row versions range from 160 to 628 bytes long.
There were 7367 unused item pointers.
Total free space (including removable row versions) is 157000 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_class_oid_index" now contains 7712 row versions in 32 pages
DETAIL: 6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_class_oid_index" contains 7712 row versions, but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.
INFO: index "pg_class_relname_nsp_index" now contains 7712 row versions in 113 pages
DETAIL: 6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.
VACUUM
Time: 65.635 ms

Actually, since we have full logging, those warnings should still be
there... (checks) ... yep. Hints elided, the last ones seen were:

WARNING: index "pg_class_oid_index" contains 14877 row versions, but table contains 66648 row versions
WARNING: index "pg_class_relname_nsp_index" contains 14877 row versions, but table contains 66648 row versions
WARNING: index "pg_class_oid_index" contains 60943 row versions, but table contains 112714 row versions
WARNING: index "pg_class_relname_nsp_index" contains 60943 row versions, but table contains 112714 row versions

Looks like there are some more going back a ways: I'll gather those together
and send them offlist after this.

> I think there are probably two independent issues here.  The missing
> index entries are clearly bad but it's not clear that they had anything
> to do with the startup stall.  There are a couple of fixes in recent
> 8.3.x releases that might possibly explain the index corruption,
> especially if you're in the habit of reindexing the system catalogs
> frequently.

Yes, we are in that habit.
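Coming back to the seqscan check mentioned above, a minimal sketch of what
that could look like next time: the seq_scan / seq_tup_read counters in
pg_stat_sys_tables are cumulative, so the idea would be to compare
snapshots taken before and during an incident.

select relname, seq_scan, seq_tup_read, idx_scan
from pg_stat_sys_tables
where relname in ('pg_class', 'pg_attribute', 'pg_index', 'pg_am')
order by relname;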
I hope they are not independent: I'd rather have one problem to worry about
than two. :)  I'll certainly report if I see either problem pop up again.

Oh, almost forgot: reading your reply to the old thread reminded me of
something I saw in one of the straces right as it "woke up" and left the
startup state to do some work. Here's a summary:

12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
(x a gazillion)
...
12:18:40 brk(0x1c0af000) = 0x1c0af000
...(some more semops)...
12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062c98000
...(handful of semops)...
12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such file or directory)
12:18:40 open("base/1554846571/pg_internal.init.11803", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062cd9000
12:18:40 write(13, ...
...(normalish looking strace output after this)...

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
> I think there are probably two independent issues here.  The missing
> index entries are clearly bad but it's not clear that they had anything
> to do with the startup stall.

On further log digging, I think you are correct, as those index warnings go
back many days before the startup problems appeared. Let me know if you'd
like any of those warnings from the logs.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Greg Sabino Mullane <greg@endpoint.com> writes:
> Oh, almost forgot: reading your reply to the old thread reminded me of
> something I saw in one of the straces right as it "woke up" and left
> the startup state to do some work. Here's a summary:

> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> (x a gazillion)
> ...
> 12:18:40 brk(0x1c0af000) = 0x1c0af000
> ...(some more semops)...
> 12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062c98000
> ...(handful of semops)...
> 12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such file or directory)
> 12:18:40 open("base/1554846571/pg_internal.init.11803", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
> 12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> 12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062cd9000
> 12:18:40 write(13, ...
> ...(normalish looking strace output after this)...

Yeah, this is proof that what it was doing is the same as what we saw in
Jeff's backtrace, ie loading up the system catalog relcache entries the
hard way via seqscans on the core catalogs.  So the question to be
answered is why that's suddenly a big performance bottleneck.  It's not
a cheap operation of course (that's why we cache the results ;-)) but
it shouldn't take minutes either.  And, because they are seqscans, it
doesn't seem like messed-up indexes should matter.

The theory I have in mind about Jeff's case is that it was basically an
I/O storm, but it's not clear whether the same explanation works for
your case.  There may be some other contributing factor that we haven't
identified yet.

            regards, tom lane
> Yeah, this is proof that what it was doing is the same as what we saw in
> Jeff's backtrace, ie loading up the system catalog relcache entries the
> hard way via seqscans on the core catalogs.  So the question to be
> answered is why that's suddenly a big performance bottleneck.  It's not
> a cheap operation of course (that's why we cache the results ;-)) but
> it shouldn't take minutes either.  And, because they are seqscans, it
> doesn't seem like messed-up indexes should matter.

FWIW, this appeared to be an all-or-nothing event: either every new backend
was suffering through this, or none were. They all seemed to clear up at
the same time as well.

> The theory I have in mind about Jeff's case is that it was basically an
> I/O storm, but it's not clear whether the same explanation works for
> your case.  There may be some other contributing factor that we haven't
> identified yet.

Let me know if you think of anything particular I can test while it is
happening again. I'll try to arrange a (netapp) snapshot the next time it
happens as well (this system is too busy and too large to do anything else).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Greg Sabino Mullane <greg@endpoint.com> writes:
>> Yeah, this is proof that what it was doing is the same as what we saw in
>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>> hard way via seqscans on the core catalogs.  So the question to be
>> answered is why that's suddenly a big performance bottleneck.  It's not
>> a cheap operation of course (that's why we cache the results ;-)) but
>> it shouldn't take minutes either.  And, because they are seqscans, it
>> doesn't seem like messed-up indexes should matter.

> FWIW, this appeared to be an all-or-nothing event: either every new backend
> was suffering through this, or none were. They all seemed to clear up
> at the same time as well.

Mostly not surprising.  They'd definitely all hit the missing init file
at the same time, so the stalling would start consistently for all.  And
once any one process successfully created a new file, subsequent incoming
sessions wouldn't stall.  However, the remaining processes trying to
compute new init files would still have to complete the process, so I'd
expect there to be a diminishing effect --- the ones that were stalling
shouldn't all release exactly together.  Unless there is some additional
effect that's syncing them all.  (I wonder for instance if the syncscan
logic is kicking in here.)

One interesting question is why there's a thundering herd of new
arrivals in the first place.  IIRC you said you were using a connection
pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
drop and reopen all its connections simultaneously.

> Let me know if you think of anything particular I can test while it is
> happening again.

I just noticed something that might explain the persistency of the effect
as observed by Jeff.  The code that seqscans pg_attribute (in
RelationBuildTupleDesc) knows how many rows it's expecting to find for a
given catalog, and it falls out of the seqscan loop as soon as it's gotten
them all.  Now, the rows belonging to core system catalogs are generally
right near the front of pg_attribute, being the oldest rows in that
catalog, which means that generally this results in not having to seqscan
very far even if pg_attribute is large.  It strikes me though that there
are at least two ways that nice property could get broken, resulting in
much seqscan work if pg_attribute is large:

1. Somebody decides to update one of those rows, and it gets dropped in
some remote region of the table.  The only really plausible reason for
this is deciding to fool with the column-specific stats target
(attstattarget) of a system catalog.  Does that sound like something
either of you might have done?  You could check it by looking at the
ctid columns of the pg_attribute rows for system catalogs, and seeing
if any have large block numbers.

2. If the syncscan logic were to kick in and cause some backend to pick
up its seqscan of pg_attribute from a point beyond where some of the
target rows are, that backend would have to scan all of pg_attribute,
circling back around to the start, before it would find all the rows it
seeks.  And most likely this effect would lather-rinse-repeat for each
catalog it's seeking the pg_attribute entries for.  Not only does this
mean a much-worse-than-normal startup time for that backend, but any
other ones that arrive while the synchronized scan is in progress would
be caught in the undertow and likewise spend a long time to get their
results.

So point 2 is suddenly looking like a really attractive theory for
explaining what happened.
As we were just remarking in an adjacent thread, the threshold for this
to start happening would be for pg_attribute to get larger than one-fourth
of shared_buffers; the syncscan logic doesn't kick in for relations smaller
than that.  IIRC this is close enough to the numbers Jeff mentioned to make
it plausible that it happened to him, and plausible that his new
installation has pg_attribute just enough smaller to avoid the scenario.
Not sure about Greg's case, but he should be able to tell us the size of
pg_attribute and his shared_buffers setting ...

If this is the explanation, then it's easy enough to deal with point 2 ---
just disable syncscan for these searches.  I don't see an easy answer for
problems of ilk #1, other than "don't do that in a database that's been
around for awhile".

Another idea we might consider is to prevent the thundering herd effect by
not having all the incoming backends try to update pg_internal.init
independently.  Instead let the first one take a lock while it's doing
that, and the rest just queue up on that lock.  They'd be stalled anyway,
and they are not helping matters by duplicating the first one's effort.
This is a rather more invasive change though.

            regards, tom lane
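A minimal sketch of the size check being asked for here, comparing only the
pg_attribute heap (not its indexes) against shared_buffers; the working
assumption above is that syncscan can kick in once the heap exceeds roughly
a quarter of shared_buffers:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute')) as pg_attribute_heap,
       current_setting('shared_buffers') as shared_buffers;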
On May 25, 2012, at 4:02 PM, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
>>> Yeah, this is proof that what it was doing is the same as what we saw in
>>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>>> hard way via seqscans on the core catalogs.  So the question to be
>>> answered is why that's suddenly a big performance bottleneck.  It's not
>>> a cheap operation of course (that's why we cache the results ;-)) but
>>> it shouldn't take minutes either.  And, because they are seqscans, it
>>> doesn't seem like messed-up indexes should matter.
>> FWIW, this appeared to be an all-or-nothing event: either every new backend
>> was suffering through this, or none were. They all seemed to clear up
>> at the same time as well.
>
> Mostly not surprising.  They'd definitely all hit the missing init file
> at the same time, so the stalling would start consistently for all.  And
> once any one process successfully created a new file, subsequent incoming
> sessions wouldn't stall.  However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together.  Unless there is some additional
> effect that's syncing them all.  (I wonder for instance if the syncscan
> logic is kicking in here.)
In our customer's case, the size of pg_attribute was a little less than 1/4
of shared_buffers, so it might not be the syncscan?

BTW, in our case, I thought to take the system down to single user mode and
reindex these. When the indexes were disabled, I immediately experienced the
slow startup, so it certainly seems like an issue with seq scanning these.

I'll see if I can reproduce that behavior by starting up with system indexes
disabled. This probably won't happen until Tuesday when we get that data
directory moved to a test server.

In our customer's case, it would happen for a while, then stop happening for
some time... presumably this was after the caching; then it would start up
again... presumably after something invalidated the cache.

Switching from the master to the streaming replica made the situation
better, but did not make it go away.

Then a full initdb solved the problem. I bet a vacuum full of pg_attribute
would've done the trick though.
> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table.  The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog.  Does that sound like something
> either of you might have done?  You could check it by looking at the
> ctid columns of the pg_attribute rows for system catalogs, and seeing
> if any have large block numbers.
Definitely wasn't done by me and I'm pretty sure the customer wouldn't have done that either.
---
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
Jeff Frost <jeff@pgexperts.com> writes:
> In our customer's case, the size of pg_attribute was a little less than
> 1/4 of shared_buffers, so might not be the syncscan?

Could you go back and double check that?  If the shared_buffers setting
were 7GB not 8GB, that would fall right between the pg_attribute sizes you
posted before.

I'm getting somewhat convinced that this is the right answer, because I've
been able to reproduce an unexpectedly long stall with multiple clients
connecting simultaneously to an init-file-less database whose pg_attribute
is large enough to trigger syncscans.  The particular case I'm testing has
pg_attribute of about 1GB (in a machine with just 4GB RAM, so I'm not going
to push it up much further).  If I just remove the init file and connect
with psql, there's about a 1-second startup delay, which is bad enough; but
if I throw 50 concurrent connections at it with a hacked-up version of
pgbench, it takes about 45 seconds for all of them to get through startup.
(You need to hack pgbench to suppress the single initialization connection
it normally likes to make, else the test degenerates to the
one-incoming-connection case.)

I think that a big chunk of this is coming from the syncscan logic
defeating the early-exit optimization in RelationBuildTupleDesc; but it
still seems like there's another inefficiency somewhere.  Maybe our
syncscan logic just really sucks given enough backends trying to piggyback
on the same syncscan.  Now that I can reproduce it, I'll take a closer look
at that.

In the meantime, it looks like a trivial workaround is to disable
synchronize_seqscans via postgresql.conf.

            regards, tom lane
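For anyone who needs that stopgap right away, it is a single setting plus a
reload; a minimal sketch (synchronize_seqscans is an ordinary reloadable
parameter, so newly started backends pick up the change):

# postgresql.conf
synchronize_seqscans = off

-- then, from any superuser session (or use pg_ctl reload):
select pg_reload_conf();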
On May 25, 2012, at 7:12 PM, Tom Lane wrote:

> Jeff Frost <jeff@pgexperts.com> writes:
>> In our customer's case, the size of pg_attribute was a little less than
>> 1/4 of shared_buffers, so might not be the syncscan?
>
> Could you go back and double check that?  If the shared_buffers setting
> were 7GB not 8GB,

It definitely started happening with 8GB of shared_buffers. We actually
tried reducing shared_buffers to 2GB to see if we were writing out too many
dirty buffers at checkpoint time or something, but that had no effect, so
we put it back to 8GB.
On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
> However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together.  Unless there is some additional
> effect that's syncing them all.  (I wonder for instance if the syncscan
> logic is kicking in here.)

How fast would you expect that to happen? As far as I could tell, they all
released at once, or at least within probably 15 seconds of each other; I
wasn't running ps constantly. I could check the logs and get a better
figure if you think it's an important data point.

> One interesting question is why there's a thundering herd of new
> arrivals in the first place.  IIRC you said you were using a connection
> pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
> drop and reopen all its connections simultaneously.

No, we are not. Or rather, there is some pooling, but there is also a
fairly large influx of new connections. As far as I could tell, the few
existing connections were not affected.

> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table.  The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog.  Does that sound like something
> either of you might have done?

No, zero chance of this, barring some rogue intruder on the network with a
strange sense of humor.

> pg_attribute just enough smaller to avoid the scenario.  Not sure about
> Greg's case, but he should be able to tell us the size of pg_attribute
> and his shared_buffers setting ...

pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However,
there is a *lot* of churn in pg_attribute and pg_class, mostly due to lots
of temporary tables.

P.S. Hmmm that's weird, I just double-checked the above and pg_attribute
is now 52MB/70MB (the original figures were from yesterday). At any rate,
nowhere near 1/4 shared buffers.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
>> pg_attribute just enough smaller to avoid the scenario.  Not sure about
>> Greg's case, but he should be able to tell us the size of pg_attribute
>> and his shared_buffers setting ...

> pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However,
> there is a *lot* of churn in pg_attribute and pg_class, mostly due
> to lots of temporary tables.

> P.S. Hmmm that's weird, I just double-checked the above and pg_attribute
> is now 52MB/70MB (the original figures were from yesterday). At any rate,
> nowhere near 1/4 shared buffers.

Hmph.  And Jeff swears his isn't large enough to trigger syncscans either.
This could all be due to the "thundering herd" effect of a lot of processes
all doing the same only-moderately-expensive-in-itself thing; except it's
hard to see how the problem gets rolling unless the single-process cache
reload time is already a lot more than the inter-arrival time.

Would you guys please try this in the problem databases:

select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;

If you see any block numbers above about 20 then maybe the triggering
condition is a row relocation after all.

            regards, tom lane
On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
> If you see any block numbers above about 20 then maybe the triggering
> condition is a row relocation after all.

Highest was 13.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
>> If you see any block numbers above about 20 then maybe the triggering
>> condition is a row relocation after all.

> Highest was 13.

Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
So it's not clear whether this is reflective of the state at the time
the problem was happening.

            regards, tom lane
On Sat, May 26, 2012 at 01:25:29PM -0400, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
>> On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
>>> If you see any block numbers above about 20 then maybe the triggering
>>> condition is a row relocation after all.
>
>> Highest was 13.
>
> Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
> So it's not clear whether this is reflective of the state at the time
> the problem was happening.

True. I'll try to get a high water mark when (er...if!) it happens again.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
I've been continuing to poke at this business of relcache-related startup
stalls, and have come to some new conclusions.  One is that it no longer
seems at all likely that the pg_attribute rows for system catalogs aren't
at the front of pg_attribute, because the commands that might be used to
update them are disallowed even to superusers:

foo=# alter table pg_index alter column indnatts set statistics -1;
ERROR:  permission denied: "pg_index" is a system catalog
foo=# alter table pg_index alter column indnatts set (n_distinct = -1);
ERROR:  permission denied: "pg_index" is a system catalog

Now a sufficiently stubborn DBA could override that protection (via direct
manual UPDATE on pg_attribute, or by doing ALTER TABLE in a standalone
backend with allowSystemTableMods set).  But it doesn't seem like something
anybody would try in a production database.  So while I'd still like to see
the results of that ctid query in Jeff's "broken" database, it seems pretty
unlikely that we will see high block numbers.

But, if all those rows are at the front of pg_attribute, how do we explain
the high block numbers seen in Jeff's stack traces?  He posted three traces
that show RelationBuildTupleDesc doing heapscans:
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00186.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00196.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00197.php
and in each one of these, the block number being passed to
ReadBufferExtended is well past where those rows ought to be.  What's more,
the three traces are performing the pg_attribute heapscan for three
different catalogs; so to blame this observation on some rows having gotten
relocated to high block numbers, you'd have to believe that had happened
for all three of these catalogs (one of which is an index, so there'd be no
reason to fool with its stats attributes anyway).

AFAICS, if there are not rows at high block numbers, the only other
possibility is that syncscan mode was active.  (A third explanation is that
some rows are just plain missing, but then the system would have been
erroring out altogether, which it was not.)  That theory requires us to
invent an explanation for the fact that Jeff now observes pg_attribute to
be slightly smaller than 1/4th shared_buffers, but perhaps autovacuum
managed to shrink it at some point after the trouble happened.

Another thing that can be deduced from those stack traces is that sinval
resets were happening.  For example, in the third message linked above, the
heapscan is being done to load up a relcache entry for relation 2601
(pg_am).  This would be unsurprising, except that stack frames 17 and 18
show this is happening during the fifth call of load_critical_index, and we
should have had both pg_am's reldesc and the syscache entry that's being
fetched at relcache.c:1010 loaded up in the first call.  So those cache
entries have to have gotten blown away by sinval reset.  This is not
terribly surprising if there were a steady flux of DDL happening in the
database, for instance temp table creations/drops.  (Jeff, can you comment
on that?)  If heapscans over the whole of pg_attribute were occurring,
they'd take long enough to expose the process to sinval overrun even with
not-very-high DDL rates.

Now the interesting thing about that is that if an sinval reset occurs at
any time while relcache.c is trying to load up initial relcache entries, it
will not try to write a new relcache init file.
This means that a steady but not very large flow of DDL activity from
existing sessions in the database would be sufficient to prevent incoming
sessions from ever writing pg_internal.init, and thus the stall behavior
could persist for a long time, which is something I hadn't been able to
explain before.  (If this were not happening, the stalls would disappear as
soon as any incoming session successfully got through its stall.)

So it appears to me that Jeff's problem is adequately explained if we
assume (1) pg_attribute a bit larger than 1/4th shared_buffers, and
(2) steady flow of DDL activity from existing sessions; and that if you
want to dispute (1) you have to invent some other explanation for the
observed attempts to read high block numbers in pg_attribute.  Also, if
there were lots of temp table creation/dropping going on, this could result
in lots of dead rows at the end of pg_attribute, which might explain how
autovacuum could've shortened the relation later.

Now this theory does not explain Greg's problem, because his pg_attribute
was apparently many times too small to make syncscans credible, or to make
them take very long even if they did happen.  But in view of the fact that
that was 8.3.5 :-(, and that the problem was observed in conjunction with
corruption of system indexes, I don't feel a compulsion to assume that
Greg's problem was the same as Jeff's.  Trawling through the commit logs I
find several post-8.3.5 fixes for race conditions and other bugs in
relcache initialization, notably commits
8407a11c5a004671142a9ac2fe29de3844360b0b and
8b720b5723fcd59c22e641e05b62dc92f53c9125.  Greg mentioned lots of temp
table creation/deletion, which would produce sinval reports and the
occasional sinval reset, which is a necessary triggering condition for both
of those bugs.  The previously recognized symptoms of those bugs look a bit
different from what Greg reported, but I can easily believe that they'd
lead to index corruption if they didn't cause immediate crashes, and maybe
they explain his stall behavior as well.

So at this point I'm feeling that my commits of yesterday to suppress
syncscan during relcache rebuild represent an adequate response to these
reports.  Yesterday I proposed also trying to suppress parallel rebuilds of
the init files, but today I'm afraid that that might not be a good idea.
The thing that is bothering me is that in the presence of a steady stream
of sinval resets, incoming backends might never be able to write new
relcache init files, in which case the proposed new lock would just
serialize admittance to the database without actually saving any work.
The "thundering herd" problem only arises if relcache initialization can
take a long time (i.e. a large fraction of a second) so that many new
arrivals can show up and create a resource saturation condition.  So far as
I can tell, that will not happen once we suppress syncscans here.  When
operating as designed, those heapscans only touch a few buffers' worth of
pg_attribute, so the data will all stay safely cached; hence the
initialization shouldn't be that expensive even if many incoming sessions
have to do it.

            regards, tom lane
On May 27, 2012, at 12:53 PM, Tom Lane wrote:
> Another thing that can be deduced from those stack traces is that sinval
> resets were happening.  For example, in the third message linked above,
> the heapscan is being done to load up a relcache entry for relation 2601
> (pg_am).  This would be unsurprising, except that stack frames 17 and 18
> show this is happening during the fifth call of load_critical_index, and
> we should have had both pg_am's reldesc and the syscache entry that's
> being fetched at relcache.c:1010 loaded up in the first call.  So those
> cache entries have to have gotten blown away by sinval reset.  This is
> not terribly surprising if there were a steady flux of DDL happening in
> the database, for instance temp table creations/drops.  (Jeff, can you
> comment on that?)  If heapscans over the whole of pg_attribute were
> occurring, they'd take long enough to expose the process to sinval
> overrun even with not-very-high DDL rates.
As it turns out, there are quite a few temporary tables created.
During the busiest periods of the day, this system averages 1.75 temp tables per second.
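A rough way to put a number on that churn from the SQL side (a sketch only:
the counters are cumulative, so you'd sample them over a known interval; on
a system like this, the insert/delete traffic on pg_class and pg_attribute
is dominated by temp table creation and destruction):

select relname, n_tup_ins, n_tup_del, n_dead_tup
from pg_stat_sys_tables
where relname in ('pg_class', 'pg_attribute');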
On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
> On May 27, 2012, at 12:53 PM, Tom Lane wrote:
>> occurring, they'd take long enough to expose the process to sinval
>> overrun even with not-very-high DDL rates.

> As it turns out, there are quite a few temporary tables created.

For the record, same here. We do *lots* of DDL (hence the cronjobs to
vac/reindex system catalogs).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
>> On May 27, 2012, at 12:53 PM, Tom Lane wrote:
>>> occurring, they'd take long enough to expose the process to sinval
>>> overrun even with not-very-high DDL rates.

>> As it turns out, there are quite a few temporary tables created.

> For the record, same here. We do *lots* of DDL (hence the cronjobs
> to vac/reindex system catalogs).

I wonder if it could've been something like a transient problem with a
cronjob leading to massive bloat of pg_attribute, eventually triggering the
syncscan issue, then fixed by a successful VAC FULL before we thought to
look closely at the table size.  The syncscan issue definitely was there in
8.3.5; it's only the question of pg_attribute size that made me doubt it
was happening for you.

            regards, tom lane
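A bloat check along those lines is quick to script; a minimal sketch,
looking at the current on-disk size of pg_attribute together with its
live/dead tuple counts (a runaway cronjob would show up as a steadily
growing size and dead-tuple count):

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute')) as heap_size,
       n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_sys_tables
where relname = 'pg_attribute';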
On May 26, 2012, at 9:17 AM, Tom Lane wrote:
> Would you guys please try this in the problem databases:
>
> select a.ctid, c.relname
> from pg_attribute a join pg_class c on a.attrelid=c.oid
> where c.relnamespace=11 and c.relkind in ('r','i')
> order by 1 desc;
>
> If you see any block numbers above about 20 then maybe the triggering
> condition is a row relocation after all.
Sorry for such a long delay on the reply. Took a while to get the data directory moved elsewhere:
select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;
  ctid   | relname
---------+-----------------------------------------
 (18,31) | pg_extension_name_index
 (18,30) | pg_extension_oid_index
 (18,29) | pg_seclabel_object_index
 (18,28) | pg_seclabel_object_index
 (18,27) | pg_seclabel_object_index
 (18,26) | pg_seclabel_object_index
> As the next step, I'd suggest verifying that the stall is reproducible
> if you remove pg_internal.init (and that it's not there otherwise), and
> then strace'ing the single incoming connection to see what it's doing.
It does take a little while, but not nearly as long as the stalls we were seeing before. The pg_internal.init is 108k in case that's an interesting data point.
Any other tests you'd like me to run on that bad data dir?
Also, thus far, the newly initdb'd system continues to hum along just fine. It's also been upgraded to 9.1.4, so if it was the rebuilding of pg_internal.init, then your fix should keep it happy.