Re: 9.1.3 backends getting stuck in 'startup' - Mailing list pgsql-bugs

From Jeff Frost
Subject Re: 9.1.3 backends getting stuck in 'startup'
Date
Msg-id 4FBE9CDB.3030009@pgexperts.com
Whole thread Raw
In response to Re: 9.1.3 backends getting stuck in 'startup'  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: 9.1.3 backends getting stuck in 'startup'  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 05/24/12 12:21, Tom Lane wrote:
> Jeff Frost <jeff@pgexperts.com> writes:
>> A few times today, we've seen postgresql 9.1.3 backends on Ubuntu 11.10 x86_64
>> get stuck in 'startup' mode.  By that I mean the set_ps_output mode. Postgres
>> is installed via Martin Pitt's packages.
> I took another look at your original report here, and noticed something
> that had escaped me the first time: the backtrace shows that what that
> backend is doing is loading critical relcache entries the hard way, that
> is, via seqscans on the system catalogs.  This happens when the relcache
> cache file (pg_internal.init) is missing or stale.  A backend that has
> to go through that will write a new pg_internal.init file when it's
> done; but if many sessions are connecting at the same time, they'll each
> have to do it until someone succeeds in creating a new cache file.
>
> Another piece of evidence is here:
>
>> #5  0x00007f7dd3d91ca4 in ReadBufferExtended (reln=0x7f7dd3b17540,
>> forkNum=MAIN_FORKNUM, blockNum=201362, mode=RBM_NORMAL, strategy=<optimized
>> out>) at bufmgr.c:246
> It's possible that gdb is lying to us, but it sure looks like this read
> request is for block 201362 of pg_attribute (I know it's pg_attribute
> from the location of the heap_getnext call).  That's 1.5Gb into the
> relation, and who knows if it's anywhere near the end yet.
>
> So in other words, we've got a boatload of backends all seqscanning a
> pretty darn large relation, and doing so multiple times.  That's
> probably enough to explain why it took a long time to finish
> initializing; the locking business is likely a red herring.
>
> How big is pg_attribute normally in this database?  I'm suspicious that
> what triggered this is some major bloating of pg_attribute (and maybe
> some of the other catalogs too).  Once you had that, relcache cache file
> invalidations are a fairly routine occurrence --- I think an autovacuum
> on one of the system catalogs would trigger that, for example.
>
>             regards, tom lane

Interesting!  Fortunately, that data directory is still lying around even
though we initdb'd to get a working system.

So, the current working system's pg_attribute looks like:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
 pg_size_pretty
----------------
 1671 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
 pg_size_pretty
----------------
 2439 MB


and on the old broken system:

select pg_size_pretty(pg_relation_size('pg_catalog.pg_attribute'));
 pg_size_pretty
----------------
 1858 MB
(1 row)

select pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
 pg_size_pretty
----------------
 1907 MB
(1 row)

So, interestingly, they're both quite large, but the old broken system is
quite a bit larger.  Any other data points be helpful?

--
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #6664: Postgres server process does not come up in foreground and parent process is set to init ( PID 1 )
Next
From: Tom Lane
Date:
Subject: Re: 9.1.3 backends getting stuck in 'startup'