Re: 9.1.3 backends getting stuck in 'startup' - Mailing list pgsql-bugs

From Tom Lane
Subject Re: 9.1.3 backends getting stuck in 'startup'
Date
Msg-id 28484.1337887297@sss.pgh.pa.us
Whole thread Raw
In response to 9.1.3 backends getting stuck in 'startup'  (Jeff Frost <jeff@pgexperts.com>)
Responses Re: 9.1.3 backends getting stuck in 'startup'  (Jeff Frost <jeff@pgexperts.com>)
List pgsql-bugs
Jeff Frost <jeff@pgexperts.com> writes:
> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
> get stuck in 'startup' mode.  By that I mean the set_ps_output mode. Postgres
> is installed via Martin Pitt's packages.

I took another look at your original report here, and noticed something
that had escaped me the first time: the backtrace shows that what that
backend is doing is loading critical relcache entries the hard way, that
is, via seqscans on the system catalogs.  This happens when the relcache
cache file (pg_internal.init) is missing or stale.  A backend that has
to go through that will write a new pg_internal.init file when it's
done; but if many sessions are connecting at the same time, they'll each
have to do it until someone succeeds in creating a new cache file.

Another piece of evidence is here:

> #5  0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
> forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
> out>) at bufmgr.c:246

It's possible that gdb is lying to us, but it sure looks like this read
request is for block 201362 of pg_attribute (I know it's pg_attribute
from the location of the heap_getnext call).  That's 1.5Gb into the
relation, and who knows if it's anywhere near the end yet.

So in other words, we've got a boatload of backends all seqscanning a
pretty darn large relation, and doing so multiple times.  That's
probably enough to explain why it took a long time to finish
initializing; the locking business is likely a red herring.

How big is pg_attribute normally in this database?  I'm suspicious that
what triggered this is some major bloating of pg_attribute (and maybe
some of the other catalogs too).  Once you had that, relcache cache file
invalidations are a fairly routine occurrence --- I think an autovacuum
on one of the system catalogs would trigger that, for example.

            regards, tom lane

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #6664: Postgres server process does not come up in foreground and parent process is set to init ( PID 1 )
Next
From: Tom Lane
Date:
Subject: Re: BUG #6664: Postgres server process does not come up in foreground and parent process is set to init ( PID 1 )