From: Tom Lane
Subject: Re: Backends stalled in 'startup' state: index corruption
Date:
Msg-id: 5513.1337986962@sss.pgh.pa.us
In response to: Re: Backends stalled in 'startup' state: index corruption  (Greg Sabino Mullane <greg@endpoint.com>)
List: pgsql-hackers
Greg Sabino Mullane <greg@endpoint.com> writes:
>> Yeah, this is proof that what it was doing is the same as what we saw in
>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>> hard way via seqscans on the core catalogs.  So the question to be
>> answered is why that's suddenly a big performance bottleneck.  It's not
>> a cheap operation of course (that's why we cache the results ;-)) but
>> it shouldn't take minutes either.  And, because they are seqscans, it
>> doesn't seem like messed-up indexes should matter.

> FWIW, this appeared to be an all-or-nothing event: either every new backend 
> was suffering through this, or none were. They all seemed to clear up 
> at the same time as well.

Mostly not surprising.  They'd definitely all hit the missing init file
at the same time, so the stalling would start consistently for all.  And
once any one process successfully created a new file, subsequent incoming
sessions wouldn't stall.  However, the remaining processes trying to
compute new init files would still have to complete the process, so I'd
expect there to be a diminishing effect --- the ones that were stalling
shouldn't all release exactly together.  Unless there is some additional
effect that's syncing them all.  (I wonder for instance if the syncscan
logic is kicking in here.)
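
For what it's worth, whether the per-database init file is there at any
given moment can be watched from SQL.  A minimal sketch, assuming
superuser access; pg_stat_file() just raises an error when the file is
absent, which in this situation is itself the interesting datum:

-- Does this database's relcache init file exist right now, and when was
-- it last rewritten?  The path is relative to the data directory.
SELECT size, modification
FROM pg_stat_file('base/'
                  || (SELECT oid FROM pg_database
                      WHERE datname = current_database())
                  || '/pg_internal.init');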

One interesting question is why there's a thundering herd of new
arrivals in the first place.  IIRC you said you were using a connection
pooler.  I wonder if it has a bug^H^H^Hdesign infelicity that makes it
drop and reopen all its connections simultaneously.

> Let me know if you think of anything particular I can test while it is 
> happening again.

I just noticed something that might explain the persistency of the
effect as observed by Jeff.  The code that seqscans pg_attribute (in
RelationBuildTupleDesc) knows how many rows it's expecting to find for
a given catalog, and it falls out of the seqscan loop as soon as it's
gotten them all.  Now, the rows belonging to core system catalogs are
generally right near the front of pg_attribute, being the oldest rows in
that catalog, which means that generally this results in not having to
seqscan very far even if pg_attribute is large.  It strikes me though
that there are at least two ways that nice property could get broken,
resulting in much seqscan work if pg_attribute is large:

1. Somebody decides to update one of those rows, and it gets dropped in
some remote region of the table.  The only really plausible reason for
this is deciding to fool with the column-specific stats target
(attstattarget) of a system catalog.  Does that sound like something
either of you might have done?  You could check it by looking at the
ctid columns of the pg_attribute rows for system catalogs, and seeing
if any have large block numbers (a query to do that is sketched below,
after point 2).

2. If the syncscan logic were to kick in and cause some backend to pick
up its seqscan of pg_attribute from a point beyond where some of the
target rows are, that backend would have to scan all of pg_attribute,
circling back around to the start, before it would find all the rows it
seeks.  And most likely this effect would lather-rinse-repeat for each
catalog it's seeking the pg_attribute entries for.  Not only does this
mean a much-worse-than-normal startup time for that backend, but any
other ones that arrive while the synchronized scan is in progress would
be caught in the undertow and likewise take a long time to get their
results.
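
As for actually running the check in point 1, a query along these lines
ought to do it (just a sketch: 16384 is FirstNormalObjectId, below which
all the built-in catalogs' OIDs fall, and the block-number cutoff of 100
is arbitrary):

-- Show pg_attribute rows for system catalogs that have wandered into
-- high-numbered blocks; the block number is extracted from ctid via the
-- text/point cast trick.
SELECT attrelid::regclass AS catalog, attname, ctid
FROM pg_attribute
WHERE attrelid < 16384
  AND (ctid::text::point)[0] > 100;

An empty result would pretty much rule out theory #1.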

So point 2 is suddenly looking like a really attractive theory for
explaining what happened.  As we were just remarking in an adjacent
thread, the threshold for this to start happening would be for
pg_attribute to get larger than one-fourth of shared_buffers; the
syncscan logic doesn't kick in for relations smaller than that.  IIRC
this is close enough to the numbers Jeff mentioned to make it plausible
that it happened to him, and plausible that his new installation has
pg_attribute just enough smaller to avoid the scenario.  Not sure about
Greg's case, but he should be able to tell us the size of pg_attribute
and his shared_buffers setting ...
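
Something like this would grab both numbers (a sketch; the eyeball
comparison against one-quarter of shared_buffers is left to the reader):

-- pg_attribute's main-fork size versus the shared_buffers setting;
-- syncscans only kick in once the relation exceeds roughly a quarter
-- of shared_buffers.
SELECT pg_size_pretty(pg_relation_size('pg_attribute')) AS pg_attribute_size,
       current_setting('shared_buffers') AS shared_buffers;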

If this is the explanation, then it's easy enough to deal with point 2
--- just disable syncscan for these searches.  I don't see an easy
answer for problems of ilk #1, other than "don't do that in a database
that's been around for a while".
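
In the meantime, if either of you wants a stopgap to experiment with,
the syncscan behavior is controlled globally by the synchronize_seqscans
GUC.  I'd want it verified rather than assumed that turning it off
actually reaches these particular catalog scans, but it's cheap to try:

-- Possible stopgap to test.  It's a USERSET GUC, but since the stalls
-- happen during connection startup it would have to be set in
-- postgresql.conf (followed by a reload), not per-session.
SHOW synchronize_seqscans;
-- in postgresql.conf:
--   synchronize_seqscans = off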

Another idea we might consider is to prevent the thundering herd effect
by not having all the incoming backends try to update pg_internal.init
independently.  Instead let the first one take a lock while it's doing
that, and the rest just queue up on that lock.  They'd be stalled
anyway, and they are not helping matters by duplicating the first one's
effort.  This is a rather more invasive change though.
        regards, tom lane

