I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a really
strange, annoying transient problem with one particular query stalling.
The symptom here is that when this query is made with X or more records in
a temp table involved in the join (where X is constant when the problem
manifests, but is different between manifestations) the query takes about
20 minutes. When the query is made with X-1 records it takes less than a
second. It's just this one query -- for everything else the system's nice
and snappy. The machine load's never above 9 (it's a 32 CPU box) and
hasn't had less than 60G of free system memory on it.
An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even
more bizarre. Not only are the two query plans identical (save trivial
differences because of the record count differences) but the explain
EXPLAIN ANALYZE total runtimes are near-identical -- the fast case showed
259ms, the slow case 265ms.
When the slow query was actually run, though, it took 20 minutes. There
were no blocked back ends shown in pg_stat_activity, and the back end
itself was definitely moving. I trussed the back end stuck running the
slow query and it spent nearly all its time doing kread() and kwrite()
calls. The DB log didn't show anything interesting in it. I checked to
make sure the SQL statement that was EXPLAIN ANALYZEd was the one actually
executed, and I pulled the client code into the debugger and
single-stepped through just to make sure it was getting stuck on that one
SQL statement and it wasn't the client doing something unexpected.
Just to be even more annoying, the problem goes away. Right now I can't
trigger the problem. Last friday it happened reliably feeding 58 records
into this query. The week before it was 38 records. Today? Nothing, the
system's full of snappy.
At the moment I'm at a loss as to what's going wrong, and the fact that I
can't duplicate it right now when I actually have time to look into the
problem's damned annoying. What I'm looking for are some hints as to where
to look the next time it does happen, or things I can flip on to catch
more in the log. (The logging parameters are all set at their defaults)
I'm going to try attaching with dbx to get a stack trace (I assume this is
relatively safe?) but past that I'm kind of stumped.
Help?