On May 26, 2012, at 9:17 AM, Tom Lane wrote:
> Would you guys please try this in the problem databases:
>
> select a.ctid, c.relname
> from pg_attribute a join pg_class c on a.attrelid=c.oid
> where c.relnamespace=11 and c.relkind in ('r','i')
> order by 1 desc;
>
> If you see any block numbers above about 20 then maybe the triggering
> condition is a row relocation after all.
Sorry for the long delay in replying; it took a while to get the data directory moved elsewhere:
select a.ctid, c.relname
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i')
order by 1 desc;
  ctid   |                 relname
---------+-----------------------------------------
 (18,31) | pg_extension_name_index
 (18,30) | pg_extension_oid_index
 (18,29) | pg_seclabel_object_index
 (18,28) | pg_seclabel_object_index
 (18,27) | pg_seclabel_object_index
 (18,26) | pg_seclabel_object_index
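
For a quicker check against the "about 20" threshold, the highest block number can be pulled out directly instead of reading it off the sorted list. This is only a sketch: the ctid-to-text-to-point cast is a commonly used trick, not part of the query as originally posted, and max_block is just an illustrative alias.

```sql
-- Sketch only: report the highest heap block that holds a pg_attribute
-- row belonging to a system table or index (relnamespace 11 = pg_catalog).
-- A ctid prints as '(block,offset)', which also parses as a point,
-- so element [0] of the point is the block number.
select max((a.ctid::text::point)[0]::int) as max_block
from pg_attribute a join pg_class c on a.attrelid=c.oid
where c.relnamespace=11 and c.relkind in ('r','i');
```

Judging by the rows shown above, this should come back as 18 on this data directory.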
> As the next step, I'd suggest verifying that the stall is reproducible
> if you remove pg_internal.init (and that it's not there otherwise), and
> then strace'ing the single incoming connection to see what it's doing.
It does take a little while with pg_internal.init removed, but not nearly as long as the stalls we were seeing before. The pg_internal.init file is 108 kB, in case that's an interesting data point.
Any other tests you'd like me to run on that bad data dir?
Also, thus far, the newly initdb'd system continues to hum along just fine. It's also been upgraded to 9.1.4, so if it was the rebuilding of pg_internal.init, then your fix should keep it happy.