Odd blocking (or massively latent) issue - even with EXPLAIN - Mailing list pgsql-performance

(First, apologies if this post now appears twice - it appears our mail server rewrites my address!)

Hello all. I'm a pgsql performance virgin so hope I cross all the 't's
and dot the lower-case 'j's when posting this query...

On our production database server we're experiencing behaviour that
several engineers are unable to explain - hence this Email. First, our
specs;

Scientific Linux 6.2, kernel 2.6.32
PG version 9.1.3, release 1PGDG.rhel6
24GB RAM
8 cores
2x software SSD-based RAIDs:
  a) ~660GB, RAID 5, 4 SSDs (data)
  b) ~160GB, RAID 1, 2 SSDs (xlogs + tmp tables)

We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
statements hang indefinitely until *something* (we don't know what)
releases some kind of resource or no longer becomes a massive bottle
neck. These are the symptoms.

However, the system seems healthy - no table ('heavyweight') locks are
held by any session (this happens with only a few connected sessions),
all indexes are used correctly, other transactions are writing data (we
generally only have a few sessions running at a time - perhaps 10) etc.
etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s
sometimes.

We regularly run vacuum analyze at quiet periods - generally 1-2s daily.

These sessions (that only read data) that are blocked can block from
anything from between only 5 minutes to 10s of hours then miraculously
complete successfully at once.

Any suggestions for my next avenue of investigation? I'll try and
capture more data by observation next time it happens (it is relatively
intermittent).

Regards,

Jim

PS. These are the settings that differ from the default:

checkpoint_segments = 128
maintenance_work_mem = 256MB
synchronous_commit = off
random_page_cost = 3.0
wal_buffers = 16MB
shared_buffers = 8192MB
checkpoint_completion_target = 0.9
effective_cache_size = 18432MB
work_mem = 32MB
effective_io_concurrency = 12
max_stack_depth = 8MB
log_autovacuum_min_duration = 0
log_lock_waits = on
autovacuum_vacuum_scale_factor = 0.1
autovacuum_naptime = 8
autovacuum_max_workers = 4

PPS. I've just noticed that our memory configuration is over subscribed!
     shared_buffers + effective_cache_size > Total available RAM! Could
     this be the root cause somehow?

--
Jim Vanns
Systems Programmer
Framestore


pgsql-performance by date:

Previous
From: Ioannis Anagnostopoulos
Date:
Subject: Re: A very long running query....
Next
From: Daniele Varrazzo
Date:
Subject: Shards + hash = forever running queries