Thread: Backends stalled in 'startup' state: index corruption

Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
Yesterday I had a client that experienced a sudden high load on
one of their servers (8.3.5 - yes, I know. Those of you with
clients will understand). When I checked, almost all connections
were in a "startup" state, very similar to this thread:

http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5-minute-plus wait before a query was issued.
So obviously it was blocking on something. Unlike the thread above, I *did* find
problems in the system catalogs. For example, both pg_class and pg_index
gave warnings like this for every index during a VACUUM FULL
VERBOSE tablename:

WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed
by a VAC showed the same warning and hint.
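
(For concreteness, the commands in play were roughly these; I'm reconstructing
from memory, so the exact statements and order may have differed:)

REINDEX TABLE pg_catalog.pg_class;
REINDEX TABLE pg_catalog.pg_index;
VACUUM FULL VERBOSE pg_catalog.pg_class;
VACUUM FULL VERBOSE pg_catalog.pg_index;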

The next step was dropping to standalone mode, but before that could
be done, the REINDEXes fixed the problem (no warnings, no stalled
connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested?
Why would it then start "working"?

* What exactly is the relationship between bad indexes and shared
memory locks?

* Is there some other emergency fix when it happens, such as killing
all backends and hoping you kill the one that is actually holding
the lock (if any was)?

* Did anything in the 8.3 series fix this? I saw nothing relevant in
the release notes for any version up to 8.3.18 (which it will be upgraded to soon).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
> Yesterday I had a client that experienced a sudden high load on 
> one of their servers (8.3.5 - yes, I know. Those of you with 
> clients will understand). When I checked, almost all connections 
> were in a "startup" state, very similar to this thread:

> http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

> Running a strace showed a lot of semop activity, and the logs showed a
> successful connection, then a 5 minute plus wait before a query was issued.
> So obviously, blocking on something.

Did you check I/O activity?  I looked again at Jeff Frost's report and
now think that what he saw was probably a lot of seqscans on bloated
system catalogs, cf
http://archives.postgresql.org/message-id/28484.1337887297@sss.pgh.pa.us

> Unlike the thread above, I *did* find
> problems in the system catalogs. For example, both pg_class and pg_index
> gave warnings like this for every index during a VACUUM FULL
> VERBOSE tablename:

> WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions,
>   but table contains 9471 row versions
> HINT:  Rebuild the index with REINDEX.

That's fairly interesting, but if it was a bloat situation then it
would've been the VAC FULL that fixed it rather than the REINDEX.
Did you happen to save the VERBOSE output?  It'd be really useful to
know whether there was any major shrinkage of the core catalogs
(esp. pg_class, pg_attribute).

> * Did anything in the 8.3 series fix this?

I think there are probably two independent issues here.  The missing
index entries are clearly bad but it's not clear that they had anything
to do with the startup stall.  There are a couple of fixes in recent
8.3.x releases that might possibly explain the index corruption,
especially if you're in the habit of reindexing the system catalogs
frequently.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
On Thu, May 24, 2012 at 03:54:54PM -0400, Tom Lane wrote:
> Did you check I/O activity?  I looked again at Jeff Frost's report and
> now think that what he saw was probably a lot of seqscans on bloated
> system catalogs, cf
> http://archives.postgresql.org/message-id/28484.1337887297@sss.pgh.pa.us

Thank you for the response. Yes, we did look at I/O, but nothing unusual
was seen that would explain the load. If it happens again I'll see if the
system catalogs are getting a lot of seqscans.
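
Presumably something along these lines would show it (the counters in
pg_stat_sys_tables are cumulative, so I'd sample it twice and compare):

SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_sys_tables
WHERE relname IN ('pg_class', 'pg_attribute', 'pg_index')
ORDER BY seq_tup_read DESC;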

> That's fairly interesting, but if it was a bloat situation then it
> would've been the VAC FULL that fixed it rather than the REINDEX.
> Did you happen to save the VERBOSE output?  It'd be really useful to
> know whether there was any major shrinkage of the core catalogs
> (esp. pg_class, pg_attribute).

I did have them in a screen session, but the box at home running it crashed a
few hours ago (after weeks of uptime: Murphy's law). It certainly could
have been the VAC FULL, as we have processes that do both VAC FULL and REINDEX
periodically; I simply assumed based on the HINT that it was the REINDEX that
cleared it up. The only snippet I have is:

# vacuum full verbose pg_class;
INFO:  vacuuming "pg_catalog.pg_class"
INFO:  "pg_class": found 43 removable, 10376 nonremovable row versions in 518 pages
DETAIL:  6078 dead row versions cannot be removed yet.
Nonremovable row versions range from 160 to 628 bytes long.
There were 7367 unused item pointers.
Total free space (including removable row versions) is 157000 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "pg_class_oid_index" now contains 7712 row versions in 32 pages
DETAIL:  6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index "pg_class_oid_index" contains 7712 row versions, but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.
INFO:  index "pg_class_relname_nsp_index" now contains 7712 row versions in 113 pages
DETAIL:  6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING:  index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.
VACUUM
Time: 65.635 ms

Actually, since we have full logging, those warnings should still be
there... (checks) ... yep. Hints elided, the most recent ones seen were:

WARNING: index "pg_class_oid_index" contains 14877 row versions, but table contains 66648 row versions
WARNING: index "pg_class_relname_nsp_index" contains 14877 row versions, but table contains 66648 row versions
WARNING: index "pg_class_oid_index" contains 60943 row versions, but table contains 112714 row versions
WARNING:  index "pg_class_relname_nsp_index" contains 60943 row versions, but table contains 112714 row versions

Looks like there are some more going back a ways; I'll gather those together
and send them offlist after this.
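
(The check itself is nothing fancier than a grep over the server logs; the
path below is just an example:)

$ grep 'row versions, but table contains' /path/to/pg_log/postgresql-*.log | tail -n 20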

> I think there are probably two independent issues here.  The missing
> index entries are clearly bad but it's not clear that they had anything
> to do with the startup stall.  There are a couple of fixes in recent
> 8.3.x releases that might possibly explain the index corruption,
> especially if you're in the habit of reindexing the system catalogs
> frequently.

Yes, we are in that habit. I hope they are not independent: I'd rather
have one problem to worry about than two. :) I'll certainly report if I
see either problem pop up again.

Oh, almost forgot: reading your reply to the old thread reminded me of
something I saw in one of the straces right as it "woke up" and left
the startup state to do some work. Here's a summary:

12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
(x a gazillion)
...
12:18:40 brk(0x1c0af000)                = 0x1c0af000
...(some more semops)...
12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062c98000
...(handful of semops)...
12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such file or directory)
12:18:40 open("base/1554846571/pg_internal.init.11803", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062cd9000
12:18:40 write(13, ...
...(normalish looking strace output after this)...

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
> I think there are probably two independent issues here.  The missing
> index entries are clearly bad but it's not clear that they had anything
> to do with the startup stall.

On further log digging, I think you are correct, as those index
warnings go back many days before the startup problems appeared.
Let me know if you'd like any of those warnings from the logs.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
> Oh, almost forgot: reading your reply to the old thread reminded me of 
> something I saw in one of the straces right as it "woke up" and left 
> the startup state to do some work. Here's a summary:

> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> 12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
> (x a gazillion)
> ...
> 12:18:40 brk(0x1c0af000)                = 0x1c0af000
> ...(some more semops)...
> 12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062c98000
> ...(handful of semops)...
> 12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such file or directory)
> 12:18:40 open("base/1554846571/pg_internal.init.11803", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
> 12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> 12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062cd9000
> 12:18:40 write(13, ...
> ...(normalish looking strace output after this)...

Yeah, this is proof that what it was doing is the same as what we saw in
Jeff's backtrace, ie loading up the system catalog relcache entries the
hard way via seqscans on the core catalogs.  So the question to be
answered is why that's suddenly a big performance bottleneck.  It's not
a cheap operation of course (that's why we cache the results ;-)) but
it shouldn't take minutes either.  And, because they are seqscans, it
doesn't seem like messed-up indexes should matter.

The theory I have in mind about Jeff's case is that it was basically an
I/O storm, but it's not clear whether the same explanation works for
your case.  There may be some other contributing factor that we haven't
identified yet.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
> Yeah, this is proof that what it was doing is the same as what we saw in
> Jeff's backtrace, ie loading up the system catalog relcache entries the
> hard way via seqscans on the core catalogs.  So the question to be
> answered is why that's suddenly a big performance bottleneck.  It's not
> a cheap operation of course (that's why we cache the results ;-)) but
> it shouldn't take minutes either.  And, because they are seqscans, it
> doesn't seem like messed-up indexes should matter.

FWIW, this appeared to be an all-or-nothing event: either every new backend
was suffering through this, or none were. They all seemed to clear up
at the same time as well.

> The theory I have in mind about Jeff's case is that it was basically an
> I/O storm, but it's not clear whether the same explanation works for
> your case.  There may be some other contributing factor that we haven't
> identified yet.

Let me know if you think of anything particular I can test while it is
happening again. I'll try to arrange a (netapp) snapshot the next time
it happens as well (this system is too busy and too large to do anything
else).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
>> Yeah, this is proof that what it was doing is the same as what we saw in
>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>> hard way via seqscans on the core catalogs.  So the question to be
>> answered is why that's suddenly a big performance bottleneck.  It's not
>> a cheap operation of course (that's why we cache the results ;-)) but
>> it shouldn't take minutes either.  And, because they are seqscans, it
>> doesn't seem like messed-up indexes should matter.

> FWIW, this appeared to be an all-or-nothing event: either every new backend 
> was suffering through this, or none were. They all seemed to clear up 
> at the same time as well.

Mostly not surprising.  They'd definitely all hit the missing init file
at the same time, so the stalling would start consistently for all.  And
once any one process successfully created a new file, subsequent incoming
sessions wouldn't stall.  However, the remaining processes trying to
compute new init files would still have to complete the process, so I'd
expect there to be a diminishing effect --- the ones that were stalling
shouldn't all release exactly together.  Unless there is some additional
effect that's syncing them all.  (I wonder for instance if the syncscan
logic is kicking in here.)

One interesting question is why there's a thundering herd of new
arrivals in the first place.  IIRC you said you were using a connection
pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
drop and reopen all its connections simultaneously.

> Let me know if you think of anything particular I can test while it is 
> happening again.

I just noticed something that might explain the persistency of the
effect as observed by Jeff.  The code that seqscans pg_attribute (in
RelationBuildTupleDesc) knows how many rows it's expecting to find for
a given catalog, and it falls out of the seqscan loop as soon as it's
gotten them all.  Now, the rows belonging to core system catalogs are
generally right near the front of pg_attribute, being the oldest rows in
that catalog, which means that generally this results in not having to
seqscan very far even if pg_attribute is large.  It strikes me though
that there are at least two ways that nice property could get broken,
resulting in much seqscan work if pg_attribute is large:

1. Somebody decides to update one of those rows, and it gets dropped in
some remote region of the table.  The only really plausible reason for
this is deciding to fool with the column-specific stats target
(attstattarget) of a system catalog.  Does that sound like something
either of you might have done?  You could check it by looking at the
ctid columns of the pg_attribute rows for system catalogs, and seeing
if any have large block numbers.

2. If the syncscan logic were to kick in and cause some backend to pick
up its seqscan of pg_attribute from a point beyond where some of the
target rows are, that backend would have to scan all of pg_attribute,
circling back around to the start, before it would find all the rows it
seeks.  And most likely this effect would lather-rinse-repeat for each
catalog it's seeking the pg_attribute entries for.  Not only does this
mean a much-worse-than-normal startup time for that backend, but any
other ones that arrive while the synchronized scan is in progress would
be caught in the undertow and likewise spend a long time to get their
results.

So point 2 is suddenly looking like a really attractive theory for
explaining what happened.  As we were just remarking in an adjacent
thread, the threshold for this to start happening would be for
pg_attribute to get larger than one-fourth of shared_buffers; the
syncscan logic doesn't kick in for relations smaller than that.  IIRC
this is close enough to the numbers Jeff mentioned to make it plausible
that it happened to him, and plausible that his new installation has
pg_attribute just enough smaller to avoid the scenario.  Not sure about
Greg's case, but he should be able to tell us the size of pg_attribute
and his shared_buffers setting ...
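
(i.e. something like this, run in the affected database:)

SHOW shared_buffers;
SELECT pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute')) AS heap,
       pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute')) AS with_indexes;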

If this is the explanation, then it's easy enough to deal with point 2
--- just disable syncscan for these searches.  I don't see an easy
answer for problems of ilk #1, other than "don't do that in a database
that's been around for awhile".

Another idea we might consider is to prevent the thundering herd effect
by not having all the incoming backends try to update pg_internal.init
independently.  Instead let the first one take a lock while it's doing
that, and the rest just queue up on that lock.  They'd be stalled
anyway, and they are not helping matters by duplicating the first one's
effort.  This is a rather more invasive change though.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Jeff Frost
Date:

On May 25, 2012, at 4:02 PM, Tom Lane wrote:

> Greg Sabino Mullane <greg@endpoint.com> writes:
>>> Yeah, this is proof that what it was doing is the same as what we saw in
>>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>>> hard way via seqscans on the core catalogs.  So the question to be
>>> answered is why that's suddenly a big performance bottleneck.  It's not
>>> a cheap operation of course (that's why we cache the results ;-)) but
>>> it shouldn't take minutes either.  And, because they are seqscans, it
>>> doesn't seem like messed-up indexes should matter.

>> FWIW, this appeared to be an all-or-nothing event: either every new backend
>> was suffering through this, or none were. They all seemed to clear up
>> at the same time as well.

> Mostly not surprising.  They'd definitely all hit the missing init file
> at the same time, so the stalling would start consistently for all.  And
> once any one process successfully created a new file, subsequent incoming
> sessions wouldn't stall.  However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together.  Unless there is some additional
> effect that's syncing them all.  (I wonder for instance if the syncscan
> logic is kicking in here.)

In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so might not be the syncscan?

BTW, in our case, I thought to take the system down to single-user mode and reindex these.  When the indexes were disabled, I immediately experienced the slow startup, so it certainly seems like an issue with seq scanning these.

I'll see if I can reproduce that behavior by starting up with system indexes disabled.  This probably won't happen until Tuesday, when we get that data directory moved to a test server.

In our customer's case, it would happen for a while, then stop happening for some time... presumably this was after the caching; then it would start up again... presumably after something invalidated the cache.

Switching from the master to the streaming replica made the situation better, but not go away.

Then a full initdb solved the problem. I bet a vacuum full of pg_attribute would've done the trick though.

> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table.  The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog.  Does that sound like something
> either of you might have done?  You could check it by looking at the
> ctid columns of the pg_attribute rows for system catalogs, and seeing
> if any have large block numbers.

Definitely wasn't done by me and I'm pretty sure the customer wouldn't have done that either.

---
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Jeff Frost <jeff@pgexperts.com> writes:
> In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so might not be the
> syncscan?

Could you go back and double check that?  If the shared_buffers setting
were 7GB not 8GB, that would fall right between the pg_attribute sizes
you posted before.  I'm getting somewhat convinced that this is the
right answer, because I've been able to reproduce an unexpectedly long
stall with multiple clients connecting simultaneously to an
init-file-less database whose pg_attribute is large enough to trigger
syncscans.  The particular case I'm testing has pg_attribute of about
1GB (in a machine with just 4GB RAM, so I'm not going to push it up
much further).  If I just remove the init file and connect with psql,
there's about a 1-second startup delay, which is bad enough; but if
I throw 50 concurrent connections at it with a hacked-up version of
pgbench, it takes about 45 seconds for all of them to get through
startup.  (You need to hack pgbench to suppress the single
initialization connection it normally likes to make, else the test
degenerates to the one-incoming-connection case.)
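
(Roughly the shape of the test, minus the pgbench hack; the script and
database names here are arbitrary:)

-- connect.sql: trivial custom script, so each client mostly just pays
-- the connection-startup cost
SELECT 1;

$ rm $PGDATA/base/<database-oid>/pg_internal.init
$ pgbench -n -c 50 -t 1 -f connect.sql test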

I think that a big chunk of this is coming from the syncscan logic
defeating the early-exit optimization in RelationBuildTupleDesc;
but it still seems like there's another inefficiency somewhere.
Maybe our syncscan logic just really sucks given enough backends
trying to piggyback on the same syncscan.  Now that I can reproduce it,
I'll take a closer look at that.

In the meantime, it looks like a trivial workaround is to disable
synchronize_seqscans via postgresql.conf.
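
(That is:)

# in postgresql.conf, then reload (pg_ctl reload, or SELECT pg_reload_conf()):
synchronize_seqscans = off

It's a USERSET parameter, so for testing it can also be turned off per-session
with SET synchronize_seqscans = off.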
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Jeff Frost
Date:
On May 25, 2012, at 7:12 PM, Tom Lane wrote:

> Jeff Frost <jeff@pgexperts.com> writes:
>> In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so might not be the
>> syncscan?
>
> Could you go back and double check that?  If the shared_buffers setting
> were 7GB not 8GB,

It definitely started happening with 8GB of shared_buffers.

We actually tried reducing shared_buffers to 2GB to see if we were writing out too many dirty buffers at checkpoint
time or something, but that had no effect, so we put it back to 8GB.

Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
> However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together.  Unless there is some additional
> effect that's syncing them all.  (I wonder for instance if the syncscan
> logic is kicking in here.)

How fast would you expect that to happen? As far as I could tell, they all
released at once, or at least within probably 15 seconds of each other;
I wasn't running ps constantly. I could check the logs and get a better
figure if you think it's an important data point.

> One interesting question is why there's a thundering herd of new
> arrivals in the first place.  IIRC you said you were using a connection
> pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
> drop and reopen all its connections simultaneously.

No, we are not. Or rather, there is some pooling, but there is also a
fairly large influx of new connections. As far as I could tell, the
few existing connections were not affected.

> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table.  The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog.  Does that sound like something
> either of you might have done?

No, zero chance of this, barring some rogue intruder on the network
with a strange sense of humor.

> pg_attribute just enough smaller to avoid the scenario.  Not sure about
> Greg's case, but he should be able to tell us the size of pg_attribute
> and his shared_buffers setting ...

pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However,
there is a *lot* of churn in pg_attribute and pg_class, mostly due
to lots of temporary tables.

P.S. Hmmm that's weird, I just double-checked the above and pg_attribute
is now 52MB/70MB (the original figures were from yesterday). At any rate,
nowhere near 1/4 shared buffers.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Fri, May 25, 2012 at 07:02:42PM -0400, Tom Lane wrote:
>> pg_attribute just enough smaller to avoid the scenario.  Not sure about
>> Greg's case, but he should be able to tell us the size of pg_attribute
>> and his shared_buffers setting ...

> pg_attribute around 5 MB (+6MB indexes), shared_buffers 4GB. However, 
> there is a *lot* of churn in pg_attribute and pg_class, mostly due 
> to lots of temporary tables.

> P.S. Hmmm that's weird, I just double-checked the above and pg_attribute 
> is now 52MB/70MB (the original figures were from yesterday). At any rate, 
> nowhere near 1/4 shared buffers.

Hmph.  And Jeff swears his isn't large enough to trigger syncscans
either.  This could all be due to the "thundering herd" effect of a lot
of processes all doing the same only-moderately-expensive-in-itself
thing; except it's hard to see how the problem gets rolling unless the
single-process cache reload time is already a lot more than the
inter-arrival time.

Would you guys please try this in the problem databases:

select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;

If you see any block numbers above about 20 then maybe the triggering
condition is a row relocation after all.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
> If you see any block numbers above about 20 then maybe the triggering
> condition is a row relocation after all.

Highest was 13.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
>> If you see any block numbers above about 20 then maybe the triggering
>> condition is a row relocation after all.

> Highest was 13.

Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
So it's not clear whether this is reflective of the state at the time
the problem was happening.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
On Sat, May 26, 2012 at 01:25:29PM -0400, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
> > On Sat, May 26, 2012 at 12:17:04PM -0400, Tom Lane wrote:
> >> If you see any block numbers above about 20 then maybe the triggering
> >> condition is a row relocation after all.
>
> > Highest was 13.
>
> Hm ... but wait, you said you'd done a VACUUM FULL on the catalogs.
> So it's not clear whether this is reflective of the state at the time
> the problem was happening.

True. I'll try to get a high water mark when (er...if!) it happens again.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
I've been continuing to poke at this business of relcache-related
startup stalls, and have come to some new conclusions.  One is that
it no longer seems at all likely that the pg_attribute rows for system
catalogs aren't at the front of pg_attribute, because the commands that
might be used to update them are disallowed even to superusers:

foo=# alter table pg_index alter column indnatts set statistics -1;
ERROR:  permission denied: "pg_index" is a system catalog
foo=# alter table pg_index alter column indnatts set (n_distinct = -1);
ERROR:  permission denied: "pg_index" is a system catalog

Now a sufficiently stubborn DBA could override that protection (via
direct manual UPDATE on pg_attribute, or by doing ALTER TABLE in a
standalone backend with allowSystemTableMods set).  But it doesn't seem
like something anybody would try in a production database.  So while
I'd still like to see the results of that ctid query in Jeff's "broken"
database, it seems pretty unlikely that we will see high block numbers.

But, if all those rows are at the front of pg_attribute, how do we
explain the high block numbers seen in Jeff's stack traces?  He posted
three traces that show RelationBuildTupleDesc doing heapscans:
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00186.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00196.php
http://archives.postgresql.org/pgsql-bugs/2012-04/msg00197.php
and in each one of these, the block number being passed to
ReadBufferExtended is well past where those rows ought to be.
What's more, the three traces are performing the pg_attribute heapscan
for three different catalogs; so to blame this observation on some
rows having gotten relocated to high block numbers, you'd have to
believe that had happened for all three of these catalogs (one of
which is an index, so there'd be no reason to fool with its stats
attributes anyway).

AFAICS, if there are not rows at high block numbers, the only other
possibility is that syncscan mode was active.  (A third explanation is
that some rows are just plain missing, but then the system would have
been erroring out altogether, which it was not.)  That theory requires
us to invent an explanation for the fact that Jeff now observes
pg_attribute to be slightly smaller than 1/4th shared_buffers, but
perhaps autovacuum managed to shrink it at some point after the trouble
happened.

Another thing that can be deduced from those stack traces is that sinval
resets were happening.  For example, in the third message linked above,
the heapscan is being done to load up a relcache entry for relation 2601
(pg_am).  This would be unsurprising, except that stack frames 17 and 18
show this is happening during the fifth call of load_critical_index, and
we should have had both pg_am's reldesc and the syscache entry that's
being fetched at relcache.c:1010 loaded up in the first call.  So those
cache entries have to have gotten blown away by sinval reset.  This is
not terribly surprising if there were a steady flux of DDL happening in
the database, for instance temp table creations/drops.  (Jeff, can you
comment on that?)  If heapscans over the whole of pg_attribute were
occurring, they'd take long enough to expose the process to sinval
overrun even with not-very-high DDL rates.

Now the interesting thing about that is that if an sinval reset occurs
at any time while relcache.c is trying to load up initial relcache
entries, it will not try to write a new relcache init file.  This means
that a steady but not very large flow of DDL activity from existing
sessions in the database would be sufficient to prevent incoming
sessions from ever writing pg_internal.init, and thus the stall behavior
could persist for a long time, which is something I hadn't been able to
explain before.  (If this were not happening, the stalls would disappear
as soon as any incoming session successfully got through its stall.)

So it appears to me that Jeff's problem is adequately explained if we
assume (1) pg_attribute a bit larger than 1/4th shared_buffers, and
(2) steady flow of DDL activity from existing sessions; and that if you
want to dispute (1) you have to invent some other explanation for the
observed attempts to read high block numbers in pg_attribute.  Also,
if there were lots of temp table creation/dropping going on, this could
result in lots of dead rows at the end of pg_attribute, which might
explain how autovacuum could've shortened the relation later.

Now this theory does not explain Greg's problem, because his
pg_attribute was apparently many times too small to make syncscans
credible, or to make them take very long even if they did happen.
But in view of the fact that that was 8.3.5 :-(, and that the problem
was observed in conjunction with corruption of system indexes, I don't
feel a compulsion to assume that Greg's problem was the same as Jeff's.
Trawling through the commit logs I find several post-8.3.5 fixes for
race conditions and other bugs in relcache initialization, notably
commits 8407a11c5a004671142a9ac2fe29de3844360b0b and
8b720b5723fcd59c22e641e05b62dc92f53c9125.  Greg mentioned lots of temp
table creation/deletion, which would produce sinval reports and the
occasional sinval reset, which is a necessary triggering condition for
both of those bugs.  The previously recognized symptoms of those bugs
look a bit different from what Greg reported, but I can easily believe
that they'd lead to index corruption if they didn't cause immediate
crashes, and maybe they explain his stall behavior as well.

So at this point I'm feeling that my commits of yesterday to suppress
syncscan during relcache rebuild represent an adequate response to these
reports.  Yesterday I proposed also trying to suppress parallel rebuilds
of the init files, but today I'm afraid that that might not be a good
idea.  The thing that is bothering me is that in the presence of a
steady stream of sinval resets, incoming backends might never be able to
write new relcache init files, in which case the proposed new lock would
just serialize admittance to the database without actually saving any
work.  The "thundering herd" problem only arises if relcache
initialization can take a long time (i.e. a large fraction of a second)
so that many new arrivals can show up and create a resource saturation
condition.  So far as I can tell, that will not happen once we suppress
syncscans here.  When operating as designed, those heapscans only touch
a few buffers' worth of pg_attribute, so the data will all stay safely
cached; hence the initialization shouldn't be that expensive even if
many incoming sessions have to do it.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Jeff Frost
Date:
On May 27, 2012, at 12:53 PM, Tom Lane wrote:

> Another thing that can be deduced from those stack traces is that sinval
> resets were happening.  For example, in the third message linked above,
> the heapscan is being done to load up a relcache entry for relation 2601
> (pg_am).  This would be unsurprising, except that stack frames 17 and 18
> show this is happening during the fifth call of load_critical_index, and
> we should have had both pg_am's reldesc and the syscache entry that's
> being fetched at relcache.c:1010 loaded up in the first call.  So those
> cache entries have to have gotten blown away by sinval reset.  This is
> not terribly surprising if there were a steady flux of DDL happening in
> the database, for instance temp table creations/drops.  (Jeff, can you
> comment on that?)  If heapscans over the whole of pg_attribute were
> occurring, they'd take long enough to expose the process to sinval
> overrun even with not-very-high DDL rates.

As it turns out, there are quite a few temporary tables created.

During the busiest periods of the day, this system averages 1.75 temp tables per second.

Re: Backends stalled in 'startup' state: index corruption

From
Greg Sabino Mullane
Date:
On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
> On May 27, 2012, at 12:53 PM, Tom Lane wrote:
> > occurring, they'd take long enough to expose the process to sinval
> > overrun even with not-very-high DDL rates.
> As it turns out, there are quite a few temporary tables created.

For the record, same here. We do *lots* of DDL (hence the cronjobs
to vac/reindex system catalogs).

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Backends stalled in 'startup' state: index corruption

From
Tom Lane
Date:
Greg Sabino Mullane <greg@endpoint.com> writes:
> On Sun, May 27, 2012 at 05:44:15PM -0700, Jeff Frost wrote:
>> On May 27, 2012, at 12:53 PM, Tom Lane wrote:
>>> occurring, they'd take long enough to expose the process to sinval
>>> overrun even with not-very-high DDL rates.

>> As it turns out, there are quite a few temporary tables created.

> For the record, same here. We do *lots* of DDL (hence the cronjobs 
> to vac/reindex system catalogs).

I wonder if it could've been something like transient problem with
a cronjob leading to massive bloat of pg_attribute, eventually
triggering the syncscan issue, then fixed by a successful VAC FULL
before we thought to look closely at the table size.  The syncscan
issue definitely was there in 8.3.5, it's only the question of
pg_attribute size that made me doubt it was happening for you.
        regards, tom lane


Re: Backends stalled in 'startup' state: index corruption

From
Jeff Frost
Date:
On May 26, 2012, at 9:17 AM, Tom Lane wrote:

> Would you guys please try this in the problem databases:
>
> select a.ctid, c.relname
> from pg_attribute a join pg_class c on a.attrelid=c.oid
> where c.relnamespace=11 and c.relkind in ('r','i')
> order by 1 desc;
>
> If you see any block numbers above about 20 then maybe the triggering
> condition is a row relocation after all.

Sorry for such a long delay on the reply.  Took a while to get the data directory moved elsewhere:

select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;

  ctid   |                 relname                 
---------+-----------------------------------------
 (18,31) | pg_extension_name_index
 (18,30) | pg_extension_oid_index
 (18,29) | pg_seclabel_object_index
 (18,28) | pg_seclabel_object_index
 (18,27) | pg_seclabel_object_index
 (18,26) | pg_seclabel_object_index

> As the next step, I'd suggest verifying that the stall is reproducible
> if you remove pg_internal.init (and that it's not there otherwise), and
> then strace'ing the single incoming connection to see what it's doing.

It does take a little while, but not nearly as long as the stalls we were seeing before.  The pg_internal.init is 108k in case that's an interesting data point.
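
(For the record, the test was basically just this; the OID and database name
are placeholders:)

$ rm $PGDATA/base/<database-oid>/pg_internal.init   # relcache init file; rebuilt on the next connect
$ time psql -d <dbname> -c 'SELECT 1'               # rough measure of backend startup time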

Any other tests you'd like me to run on that bad data dir?

Also, thus far, the newly initdb'd system continues to hum along just fine.  It's also been upgraded to 9.1.4, so if it was the rebuilding of pg_internal.init, then your fix should keep it happy.