Re: Trying to track down weird query stalls - Mailing list pgsql-performance

From dan@sidhe.org
Subject Re: Trying to track down weird query stalls
Date
Msg-id 56921.199.172.169.36.1238438533.squirrel@localhost
Whole thread Raw
In response to Re: Trying to track down weird query stalls  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Trying to track down weird query stalls  (Robert Haas <robertmhaas@gmail.com>)
Re: Trying to track down weird query stalls  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-performance
> On Mon, Mar 30, 2009 at 1:50 PM,  <dan@sidhe.org> wrote:
>> I'm running a 64-bit build of Postgres 8.3.5 on AIX 5.3, and have a
>> really
>> strange, annoying transient problem with one particular query stalling.
>>
>> The symptom here is that when this query is made with X or more records
>> in
>> a temp table involved in the join (where X is constant when the problem
>> manifests, but is different between manifestations) the query takes
>> about
>> 20 minutes. When the query is made with X-1 records it takes less than a
>> second. It's just this one query -- for everything else the system's
>> nice
>> and snappy. The machine load's never above 9 (it's a 32 CPU box) and
>> hasn't had less than 60G of free system memory on it.
>>
>> An EXPLAIN ANALYZE of the two queries (with X-1 and X records) is even
>> more bizarre. Not only are the two query plans identical (save trivial
>> differences because of the record count differences) but the explain
>> EXPLAIN ANALYZE total runtimes are near-identical -- the fast case
>> showed
>> 259ms, the slow case 265ms.
>
> log_min_duration_statement is a good place to start, but it sounds
> like the query plan you're getting when you test it by hand isn't the
> same one that's actually executing, so I'm not sure how far that's
> going to get you.  contrib/auto_explain sounds like it would be just
> the thing, but unfortunately that's an 8.4 feature which hasn't been
> released yet.  I'm not sure whether it could be built and run against
> 8.3.

I'm not executing any of the EXPLAINs by hand, because I didn't want to
have to worry about typos or filling in temp tables with test data. Inside
the app the SQL for the problematic query's stored in a variable -- when
the task runs with debugging enabled it first executes the query with
EXPLAIN ANALYZE prepended and dumps the output, then it executes the query
itself. It's possible something's going wrong in that, but the code's
pretty simple.

Arguably in this case the actual query should run faster than the EXPLAIN
ANALYZE version, since the cache is hot. (Though that'd only likely shave
a few dozen ms off the runtime)

-Dan

pgsql-performance by date:

Previous
From: Robert Haas
Date:
Subject: Re: Trying to track down weird query stalls
Next
From: Robert Haas
Date:
Subject: Re: Trying to track down weird query stalls