Re: Backends stalled in 'startup' state: index corruption - Mailing list pgsql-hackers

From: Jeff Frost
Subject: Re: Backends stalled in 'startup' state: index corruption
Msg-id: 9CD57B4F-B1AB-4F1B-BC7A-8C6A86B0C509@pgexperts.com
In response to: Re: Backends stalled in 'startup' state: index corruption (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Backends stalled in 'startup' state: index corruption
List: pgsql-hackers
On May 25, 2012, at 4:02 PM, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
>>> Yeah, this is proof that what it was doing is the same as what we saw in
>>> Jeff's backtrace, ie loading up the system catalog relcache entries the
>>> hard way via seqscans on the core catalogs. So the question to be
>>> answered is why that's suddenly a big performance bottleneck. It's not
>>> a cheap operation of course (that's why we cache the results ;-)) but
>>> it shouldn't take minutes either. And, because they are seqscans, it
>>> doesn't seem like messed-up indexes should matter.
>> FWIW, this appeared to be an all-or-nothing event: either every new backend
>> was suffering through this, or none were. They all seemed to clear up
>> at the same time as well.
>
> Mostly not surprising. They'd definitely all hit the missing init file
> at the same time, so the stalling would start consistently for all. And
> once any one process successfully created a new file, subsequent incoming
> sessions wouldn't stall. However, the remaining processes trying to
> compute new init files would still have to complete the process, so I'd
> expect there to be a diminishing effect --- the ones that were stalling
> shouldn't all release exactly together. Unless there is some additional
> effect that's syncing them all. (I wonder for instance if the syncscan
> logic is kicking in here.)
In our customer's case, the size of pg_attribute was a little less than 1/4 of shared_buffers, so perhaps it's not the syncscan logic?
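In case it helps anyone else eyeball the same comparison, it can be done directly in SQL; a quick sketch using the standard size/setting functions (the 1/4-of-shared_buffers figure is the threshold the synchronized-scan logic uses):

```sql
-- Compare pg_attribute's on-disk size against shared_buffers;
-- synchronized scans only engage once a table exceeds 1/4 of shared_buffers.
SELECT pg_size_pretty(pg_relation_size('pg_attribute')) AS pg_attribute_size,
       current_setting('shared_buffers') AS shared_buffers;
```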
BTW, in our case, I thought to take the system down to single-user mode and reindex these. When the indexes were disabled, I immediately experienced the slow startup, so it certainly seems like an issue with seq scanning these catalogs.
I'll see if I can reproduce that behavior by starting up with system indexes disabled. That probably won't happen until Tuesday, when we get that data directory moved to a test server.
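For reference, a sketch of the standalone-backend invocation I have in mind (the data directory path and database name are placeholders; -P tells the backend to ignore system indexes when reading the system tables):

```shell
# Placeholder data directory and database name.
# -P forces catalog reads to ignore system indexes, i.e. use seqscans.
postgres --single -P -D /path/to/data postgres
```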
In our customer's case, it would happen for a while, then stop happening for some time (presumably after the caching), then it would start up again (presumably after something invalidated the cache).
Switching from the master to the streaming replica made the situation better, but didn't make it go away.
Then a full initdb solved the problem. I bet a VACUUM FULL of pg_attribute would've done the trick, though.
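For the archives, the (untested) repair I have in mind is just the following, run as superuser, ideally with applications paused, since it takes an exclusive lock:

```sql
-- Rewrites pg_attribute compactly, pulling the hot system-catalog rows
-- back into the early blocks; takes an ACCESS EXCLUSIVE lock.
VACUUM FULL pg_attribute;
```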
> 1. Somebody decides to update one of those rows, and it gets dropped in
> some remote region of the table. The only really plausible reason for
> this is deciding to fool with the column-specific stats target
> (attstattarget) of a system catalog. Does that sound like something
> either of you might have done? You could check it by looking at the
> ctid columns of the pg_attribute rows for system catalogs, and seeing
> if any have large block numbers.
Definitely wasn't done by me, and I'm pretty sure the customer wouldn't have done that either.
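For anyone following along, a query along these lines ought to show whether any system-catalog rows in pg_attribute have wandered to high block numbers (the first component of ctid is the block number); this is just my sketch of the check Tom describes:

```sql
-- List the pg_attribute rows for system catalogs with the largest ctids.
SELECT a.ctid, c.relname, a.attname
FROM pg_attribute a
JOIN pg_class c ON c.oid = a.attrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_catalog'
ORDER BY a.ctid DESC
LIMIT 20;
```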
---
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122