Re: Backends stalled in 'startup' state: index corruption - Mailing list pgsql-hackers

From: Jeff Frost
Subject: Re: Backends stalled in 'startup' state: index corruption
Msg-id: 9CD57B4F-B1AB-4F1B-BC7A-8C6A86B0C509@pgexperts.com
In response to: Re: Backends stalled in 'startup' state: index corruption (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Backends stalled in 'startup' state: index corruption
List: pgsql-hackers
On May 25, 2012, at 4:02 PM, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
>>> Yeah, this is proof that what it was doing is the same as what we saw in
>>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>>> hard way via seqscans on the core catalogs. So the question to be
>>> answered is why that's suddenly a big performance bottleneck. It's not
>>> a cheap operation of course (that's why we cache the results ;-)) but
>>> it shouldn't take minutes either. And, because they are seqscans, it
>>> doesn't seem like messed-up indexes should matter.
>> FWIW, this appeared to be an all-or-nothing event: either every new backend
>> was suffering through this, or none were. They all seemed to clear up
>> at the same time as well.
>
> Mostly not surprising. They'd definitely all hit the missing init file
> at the same time, so the stalling would start consistently for all. And
> once any one process successfully created a new file, subsequent incoming
> sessions wouldn't stall. However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together. Unless there is some additional
> effect that's syncing them all. (I wonder for instance if the syncscan
> logic is kicking in here.)
In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so perhaps it's not the syncscan logic?
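In case it helps anyone else eyeball the same comparison, it can be done directly in SQL; a quick sketch using the standard size/setting functions (the 1/4-of-shared_buffers figure is the threshold the synchronized-scan logic uses):

```sql
-- Compare pg_attribute's on-disk size against shared_buffers;
-- synchronized scans only engage once a table exceeds 1/4 of shared_buffers.
SELECT pg_size_pretty(pg_relation_size('pg_attribute')) AS pg_attribute_size,
       current_setting('shared_buffers') AS shared_buffers;
```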
BTW, in our case, I thought to take the system down to single-user mode and reindex these. When the indexes were disabled, I immediately experienced the slow startup, so it certainly seems like an issue with seq scanning these catalogs.
I'll see if I can reproduce that behavior by starting up with system indexes disabled. That probably won't happen until Tuesday, when we get that data directory moved to a test server.
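For reference, a sketch of the standalone-backend invocation I have in mind (the data directory path and database name are placeholders; -P tells the backend to ignore system indexes when reading the system tables):

```shell
# Placeholder data directory and database name.
# -P forces catalog reads to ignore system indexes, i.e. use seqscans.
postgres --single -P -D /path/to/data postgres
```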
In our customer's case, it would happen for a while, then stop happening for some time (presumably after the caching), then it would start up again (presumably after something invalidated the cache).
Switching from the master to the streaming replica made the situation better, but didn't make it go away.
Then a full initdb solved the problem. I bet a VACUUM FULL of pg_attribute would've done the trick, though.
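For the archives, the (untested) repair I have in mind is just the following, run as superuser, ideally with applications paused, since it takes an exclusive lock:

```sql
-- Rewrites pg_attribute compactly, pulling the hot system-catalog rows
-- back into the early blocks; takes an ACCESS EXCLUSIVE lock.
VACUUM FULL pg_attribute;
```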
> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table. The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog. Does that sound like something
> either of you might have done? You could check it by looking at the
> ctid columns of the pg_attribute rows for system catalogs, and seeing
> if any have large block numbers.
Definitely wasn't done by me, and I'm pretty sure the customer wouldn't have done that either.
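For anyone following along, a query along these lines ought to show whether any system-catalog rows in pg_attribute have wandered to high block numbers (the first component of ctid is the block number); this is just my sketch of the check Tom describes:

```sql
-- List the pg_attribute rows for system catalogs with the largest ctids.
SELECT a.ctid, c.relname, a.attname
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_catalog'
ORDER BY a.ctid DESC
LIMIT 20;
```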
---
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122