Thread: Some queries starting to hang

Some queries starting to hang

From
Chris Beecroft
Date:
Hello,

I've noticed some posts on hanging queries but haven't seen any
solutions so far.

Our problem is that about a week and a half ago we started to get some
queries that would (seemingly) never return (e.g., normally run in a
couple minutes, but after 2.5 hours, they were still running, the
process pushing the processor up to 99.9% active).

We are running Postgres 8.1.1 on Red Hat 7.3 using Dell PowerEdge
quad-processor boxes with 4 GB of memory.  We have a main database that
is replicated via Slony to an identical system.

Things we've tried so far:

We've stopped and restarted postgres, and that didn't seem to help.
We've rebuilt all the indexes, and that didn't seem to help either.
We've stopped replication between the boxes, and that didn't do
anything.  We've tried the queries on both the production and the
replicated box, and there is no difference in the queries (or query
plans).

We do have another identical system that is a backup box (same type of
box, Postgres 8.1.1, Redhat 7.3, etc), and there, the query does
complete executing in a short time.  We loaded up a current copy of the
production database and it still responded quickly.

Generally these queries, although not complicated, are on the more
complex side of our application, and they had been running fine up
until a few weeks ago.

Attached are an example query: Query.sql
The query plan from our production server: QueryPlanBroke.txt
The working query plan from our backup server: QueryPlanWork.txt

What we found that has worked so far is to remove all the outer joins,
put the results into a temp table, and then left join from the temp
table to get our results.  Certainly this isn't a solution, but rather
something we have resorted to in a place or two as we limp along.
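A rough sketch of what we're doing (the table and column names here are
made-up examples, not our actual schema):

```sql
-- Materialize the expensive inner part of the query first...
CREATE TEMP TABLE tmp_candidates AS
SELECT o.order_id, o.customer_id
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
WHERE o.created_on >= '2006-05-01';

-- ...then do the outer joins against the small, known-size temp table.
SELECT t.order_id, c.name, a.city
FROM tmp_candidates t
LEFT JOIN customers c ON c.customer_id = t.customer_id
LEFT JOIN addresses a ON a.customer_id = t.customer_id;
```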


Any help would be greatly appreciated.

Thanks,
Chris Beecroft


Attachment

Re: Some queries starting to hang

From
Andrew Sullivan
Date:
On Mon, Jun 05, 2006 at 12:05:08PM -0700, Chris Beecroft wrote:
> Our problem is that about a week and a half ago we started to get some
> queries that would (seemingly) never return (e.g., normally run in a
> couple minutes, but after 2.5 hours, they were still running, the
> process pushing the processor up to 99.9% active).

Are there any locks preventing the query from completing?  I can't
recall how you check in 7.3, but if nothing else, you can check with
ps for something WAITING.
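On 8.1 you can also look at pg_locks directly; something like this
(untested here, but the columns are as in the 8.1 catalogs) should show
any blocked lock requests:

```sql
-- Any row with granted = false is a backend waiting on a lock.
SELECT relation::regclass AS locked_table, pid, mode, granted
FROM pg_locks
WHERE NOT granted;
```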

A



--
Andrew Sullivan  | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack.
        --Scott Morris

Re: Some queries starting to hang

From
Tom Lane
Date:
Chris Beecroft <CBeecroft@PrO-Unlimited.com> writes:
> Our problem is that about a week and a half ago we started to get some
> queries that would (seemingly) never return (e.g., normally run in a
> couple minutes, but after 2.5 hours, they were still running, the
> process pushing the processor up to 99.9% active).

> Attached are an example query plan: Query.sql
> The query plan from our production server: QueryPlanBroke.txt
> The working query plan from our backup server: QueryPlanWork.txt

Note the major difference in estimated row counts.  That's the key to
your problem... you need to find out why the "broke" case thinks only
one row is getting selected.

broke:
>    ->  Nested Loop  (cost=30150.77..129334.04 rows=1 width=305)

work:
>    ->  Hash Join  (cost=30904.77..125395.89 rows=1810 width=306)

I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
custom adjustments of statistics target settings, etc.
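The first thing to try is a fresh ANALYZE; if the estimates for
particular columns stay bad after that, raising their statistics
targets may help (the table/column names below are placeholders):

```sql
-- Refresh planner statistics for the whole database.
ANALYZE;

-- If a particular column's estimates remain bad, give it a larger
-- statistics target (the 8.1 default is 10) and re-analyze.
ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 100;
ANALYZE some_table;
```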

            regards, tom lane

Re: Some queries starting to hang

From
Andrew Sullivan
Date:
On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
>
> broke:
> >    ->  Nested Loop  (cost=30150.77..129334.04 rows=1 width=305)
>
> work:
> >    ->  Hash Join  (cost=30904.77..125395.89 rows=1810 width=306)
>
> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> custom adjustments of statistics target settings, etc.

But even the nested loop shouldn't be a "never returns" case, should
it?  For 1800 rows?

(I've _had_ bad plans that picked nestloop, for sure, but they're
usually for tens of thousands of rows when they take forever).

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
        --Damien Katz

Re: Some queries starting to hang

From
Chris Beecroft
Date:
Thanks Tom,

I knew you would come through again!

Query is now returning with results on our replicated database.  Will
vacuum analyze production now.  So it seems to have done the trick.  Now
the question is whether our autovacuum has failed or was not set up
properly...  A question for my IT people.

Thanks once again,
Chris Beecroft

On Mon, 2006-06-05 at 13:07, Tom Lane wrote:
> Chris Beecroft <CBeecroft@PrO-Unlimited.com> writes:
> > Our problem is that about a week and a half ago we started to get some
> > queries that would (seemingly) never return (e.g., normally run in a
> > couple minutes, but after 2.5 hours, they were still running, the
> > process pushing the processor up to 99.9% active).
>
> > Attached are an example query plan: Query.sql
> > The query plan from our production server: QueryPlanBroke.txt
> > The working query plan from our backup server: QueryPlanWork.txt
>
> Note the major difference in estimated row counts.  That's the key to
> your problem... you need to find out why the "broke" case thinks only
> one row is getting selected.
>
> broke:
> >    ->  Nested Loop  (cost=30150.77..129334.04 rows=1 width=305)
>
> work:
> >    ->  Hash Join  (cost=30904.77..125395.89 rows=1810 width=306)
>
> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> custom adjustments of statistics target settings, etc.
>
>             regards, tom lane


Re: Some queries starting to hang

From
Tom Lane
Date:
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
>> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
>> custom adjustments of statistics target settings, etc.

> But even the nested loop shouldn't be a "never returns" case, should
> it?  For 1800 rows?

Well, it's a big query.  If it ought to take a second or two, and
instead is taking an hour or two (1800 times the expected runtime), that
might be close enough to "never" to exhaust Chris' patience.  Besides,
we don't know whether the 1800 might itself be an underestimate (too bad
Chris didn't provide EXPLAIN ANALYZE results).  The hash plan will scale
to larger numbers of rows much more gracefully than the nestloop ...

            regards, tom lane

Re: Some queries starting to hang

From
Chris Beecroft
Date:
On Mon, 2006-06-05 at 14:06, Tom Lane wrote:
> Andrew Sullivan <ajs@crankycanuck.ca> writes:
> > On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
> >> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> >> custom adjustments of statistics target settings, etc.
>
> > But even the nested loop shouldn't be a "never returns" case, should
> > it?  For 1800 rows?
>
> Well, it's a big query.  If it ought to take a second or two, and
> instead is taking an hour or two (1800 times the expected runtime), that
> might be close enough to "never" to exhaust Chris' patience.  Besides,
> we don't know whether the 1800 might itself be an underestimate (too bad
> Chris didn't provide EXPLAIN ANALYZE results).  The hash plan will scale
> to larger numbers of rows much more gracefully than the nestloop ...
>
>             regards, tom lane

Hello,

If anyone is curious, I've attached an explain analyze from the now
working replicated database.  Explain analyze did not seem to return on
the 'broken' database (or at least, when we originally tried to test
these, it did not return after an hour and a half, which was enough
time to head right past patient into crabby...)

Chris

Attachment

Re: Some queries starting to hang

From
Simon Riggs
Date:
On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
> Andrew Sullivan <ajs@crankycanuck.ca> writes:
> > On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
> >> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> >> custom adjustments of statistics target settings, etc.
>
> > But even the nested loop shouldn't be a "never returns" case, should
> > it?  For 1800 rows?
>
> Well, it's a big query.  If it ought to take a second or two, and
> instead is taking an hour or two (1800 times the expected runtime), that
> might be close enough to "never" to exhaust Chris' patience.  Besides,
> we don't know whether the 1800 might itself be an underestimate (too bad
> Chris didn't provide EXPLAIN ANALYZE results).

This is a good example of a case where the inefficiency of EXPLAIN
ANALYZE would be a contributory factor to it not actually being
available for diagnosing a problem.

Maybe we need something even more drastic than recent proposed changes
to EXPLAIN ANALYZE?

Perhaps we could annotate the query tree with individual limits. That
way a node that was expecting to deal with 1 row would simply stop
executing the EXPLAIN ANALYZE when it hit N times as many rows
(default=no limit). That way, we would still be able to see a bad plan
even without waiting for the whole query to execute - just stop at a
point where the plan is far enough off track. That would give us what we
need: pinpoint exactly which part of the plan is off-track and see how
far off track it is. If the limits were configurable, we'd be able to
opt for faster-but-less-accurate or slower-yet-100% accuracy behaviour.
We wouldn't need to worry about timing overhead either then.

e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: Some queries starting to hang

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
>> Well, it's a big query.  If it ought to take a second or two, and
>> instead is taking an hour or two (1800 times the expected runtime), that
>> might be close enough to "never" to exhaust Chris' patience.  Besides,
>> we don't know whether the 1800 might itself be an underestimate (too bad
>> Chris didn't provide EXPLAIN ANALYZE results).

> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.

Huh?  The problem is the inefficiency of the underlying query.

            regards, tom lane

Re: Some queries starting to hang

From
Simon Riggs
Date:
On Tue, 2006-06-06 at 10:43 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
> >> Well, it's a big query.  If it ought to take a second or two, and
> >> instead is taking an hour or two (1800 times the expected runtime), that
> >> might be close enough to "never" to exhaust Chris' patience.  Besides,
> >> we don't know whether the 1800 might itself be an underestimate (too bad
> >> Chris didn't provide EXPLAIN ANALYZE results).
>
> > This is a good example of a case where the inefficiency of EXPLAIN
> > ANALYZE would be a contributory factor to it not actually being
> > available for diagnosing a problem.
>
> Huh?  The problem is the inefficiency of the underlying query.

Of course that was the main problem from the OP.

You mentioned it would be good if the OP had delivered an EXPLAIN
ANALYZE; I agree(d). The lack of an EXPLAIN ANALYZE is frequently
because you can't get it to run to completion - more so when the query
you wish to analyze doesn't appear to complete either.

The idea I just had was: why do we need EXPLAIN ANALYZE to run to
completion? In severe cases like this thread, we might be able to
discover the root cause by a *partial* execution of the plan, as long as
it was properly instrumented. That way, the OP might have been able to
discover the root cause himself...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: Some queries starting to hang

From
"Jim C. Nasby"
Date:
On Mon, Jun 05, 2006 at 01:39:38PM -0700, Chris Beecroft wrote:
> Thanks Tom,
>
> I knew you would come through again!
>
> Query is now returning with results on our replicated database.  Will
> vacuum analyze production now.  So it seems to have done the trick.  Now
> the question is whether our autovacuum has failed or was not set up
> properly...  A question for my IT people.

You should almost certainly be running the autovacuum that's built in
now. If you enable vacuum_cost_delay you should be able to make it so
that vacuum's impact on production is minimal. The other thing you'll
want to do is cut all the vacuum threshold and scale settings in half
(the defaults are very conservative).
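For 8.1 that would look something like this in postgresql.conf (the
numbers are illustrative starting points assuming the stock 8.1
defaults, not tuned recommendations):

```
# postgresql.conf (8.1) -- illustrative starting points, not tuned values
stats_start_collector = on              # both required for autovacuum
stats_row_level = on                    # in 8.1
autovacuum = on
autovacuum_vacuum_threshold = 500       # half the default of 1000
autovacuum_vacuum_scale_factor = 0.2    # half the default of 0.4
autovacuum_analyze_threshold = 250      # half the default of 500
autovacuum_analyze_scale_factor = 0.1   # half the default of 0.2
vacuum_cost_delay = 10                  # ms; throttles vacuum's I/O impact
```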
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Some queries starting to hang

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> You mentioned it would be good if the OP had delivered an EXPLAIN
> ANALYZE; I agree(d). The lack of EXPLAIN ANALYZE is frequently because
> you can't get them to run to completion - more so when the query you
> wish to analyze doesn't appear to complete either.

Well, he could have shown EXPLAIN ANALYZE for the server that was
managing to run the query in a reasonable amount of time.

> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> completion? In severe cases like this thread, we might be able to
> discover the root cause by a *partial* execution of the plan, as long as
> it was properly instrumented. That way, the OP might have been able to
> discover the root cause himself...

I don't think that helps, as it just replaces one uncertainty by
another: how far did the EXPLAIN really get towards completion of the
plan?  You still don't have any hard data.

            regards, tom lane

Re: Some queries starting to hang

From
"Jim C. Nasby"
Date:
On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> > The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > completion? In severe cases like this thread, we might be able to
> > discover the root cause by a *partial* execution of the plan, as long as
> > it was properly instrumented. That way, the OP might have been able to
> > discover the root cause himself...
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan?  You still don't have any hard data.

Does that really matter, though? The point is to find the node where the
estimate proved to be fantasy. It might even make sense to highlight
that node in the output, so that users don't have to wade through a sea
of numbers to find it.

If it is important to report how far along the query got, it seems that
could always be added to the explain output.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Some queries starting to hang

From
"A.M."
Date:
Explain analyze could at least put an asterisk around actual time that
deviated by some factor from the estimated time.

On Tue, June 6, 2006 10:39 am, Simon Riggs wrote:

>
> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.
>
> Maybe we need something even more drastic than recent proposed changes
> to EXPLAIN ANALYZE?
>
> Perhaps we could annotate the query tree with individual limits. That
> way a node that was expecting to deal with 1 row would simply stop
> executing the EXPLAIN ANALYZE when it hit N times as many rows (default=no
> limit). That way, we would still be able to see a bad plan even without
> waiting for the whole query to execute - just stop at a point where the
> plan is far enough off track. That would give us what we need: pinpoint
> exactly which part of the plan is off-track and see how far off track it
> is. If the limits were configurable, we'd be able to opt for
> faster-but-less-accurate or slower-yet-100% accuracy behaviour. We
> wouldn't need to worry about timing overhead either then.
>
> e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...
>
> --
> Simon Riggs
> EnterpriseDB   http://www.enterprisedb.com



Re: Some queries starting to hang

From
Andrew Sullivan
Date:
On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> > it was properly instrumented. That way, the OP might have been able to
> > discover the root cause himself...
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan?  You still don't have any hard data.

Well, you _might_ get something useful, if you're trying to work on a
maladjusted production system, because you get to the part that trips
the limit, and then you know, "Well, I gotta fix it that far,
anyway."

Often, when you're in real trouble, you can't or don't wait for the
full plan to come back from EXPLAIN ANALYSE, because a manager is
helpfully standing over your shoulder asking whether you're there
yet.  Being able to say, "Aha, we have the first symptom," might be
helpful to users.  Because the impatient simply won't wait for the
full report to come back, and therefore they'll end up flying blind
instead.  (Note that "the impatient" is not always the person logged
in and executing the commands.)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: Some queries starting to hang

From
"A.M."
Date:
Hmmm...It could generate NOTICEs whenever there is a drastic difference in
rowcount or actual time...

On Tue, June 6, 2006 11:29 am, Andrew Sullivan wrote:
> On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
>
>>> it was properly instrumented. That way, the OP might have been able
>>> to discover the root cause himself...
>>
>> I don't think that helps, as it just replaces one uncertainty by
>> another: how far did the EXPLAIN really get towards completion of the
>> plan?  You still don't have any hard data.
>
> Well, you _might_ get something useful, if you're trying to work on a
> maladjusted production system, because you get to the part that trips the
> limit, and then you know, "Well, I gotta fix it that far, anyway."
>
> Often, when you're in real trouble, you can't or don't wait for the
> full plan to come back from EXPLAIN ANALYSE, because a manager is helpfully
> standing over your shoulder asking whether you're there yet.  Being able
> to say, "Aha, we have the first symptom," might be helpful to users.
> Because the impatient simply won't wait for the
> full report to come back, and therefore they'll end up flying blind
> instead.  (Note that "the impatient" is not always the person logged in
> and executing the commands.)
>
> A
>
>
> --
> Andrew Sullivan  | ajs@crankycanuck.ca
> I remember when computers were frustrating because they *did* exactly what
>  you told them to.  That actually seems sort of quaint now. --J.D. Baldwin



Re: Some queries starting to hang

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> completion? In severe cases like this thread, we might be able to
> discover the root cause by a *partial* execution of the plan, as long as
> it was properly instrumented. That way, the OP might have been able to
> discover the root cause himself...

An alternate approach would be to implement a SIGINFO handler that prints out
the explain analyze output for the data built up so far. You would be able to
keep hitting C-t and keep getting updates until the query completes or you
decided to hit C-c.

I'm not sure how easy this would be to implement but it sure would be nice
from a user's point of view. Much nicer than having to specify some arbitrary
limit before running the query.

--
greg

Re: Some queries starting to hang

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
>> I don't think that helps, as it just replaces one uncertainty by
>> another: how far did the EXPLAIN really get towards completion of the
>> plan?  You still don't have any hard data.

> Does that really matter, though? The point is to find the node where the
> estimate proved to be fantasy.

No, the point is to find out what reality is.  Just knowing that the
estimates are wrong doesn't really get you anywhere (we pretty much knew
that before we even started looking at the EXPLAIN, eh?).

            regards, tom lane

Re: Some queries starting to hang

From
Andrew Sullivan
Date:
On Tue, Jun 06, 2006 at 11:37:46AM -0400, Greg Stark wrote:

> An alternate approach would be to implement a SIGINFO handler that
> prints out the explain analyze output for the data built up so far.
> You would be able to keep hitting C-t and keep getting updates
> until the query completes or you decided to hit C-c.

This is even better, and pretty much along the lines I was thinking
in my other mail.  If you can see the _first_ spot you break, you can
start working.  We all know (or I hope so, anyway) that it would be
better to get the full result, and know everything that needs
attention before starting.  As nearly as I can tell, however, they
don't teach Mill's methods to MBAs of a certain stripe, so changes
start getting made without all the data being available.  It'd be
nice to be able to bump the set of available data to something
higher than "none".

(That said, I appreciate that there's precious little reason to spend
a lot of work optimising a feature that is mostly there to counteract
bad management practices.)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
When my information changes, I alter my conclusions.  What do you do sir?
        --attr. John Maynard Keynes

Re: Some queries starting to hang

From
"Craig A. James"
Date:
Simon Riggs wrote:
>>Well, it's a big query.  If it ought to take a second or two, and
>>instead is taking an hour or two (1800 times the expected runtime), that
>>might be close enough to "never" to exhaust Chris' patience.  Besides,
>>we don't know whether the 1800 might itself be an underestimate (too bad
>>Chris didn't provide EXPLAIN ANALYZE results).
>
> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.

This is a frustration I have, but Simon expressed it much more
concisely.  The first question one gets in this forum is, "did you run
EXPLAIN ANALYZE?"  But if EXPLAIN ANALYZE never finishes, you can't get
the information you need to diagnose the problem.  Simon's proposal,

> e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...

or something similar, would be a big help.  I.e. "If you can't finish
in a reasonable time, at least tell me as much as you can."

Craig

Re: Some queries starting to hang

From
"Craig A. James"
Date:
Tom Lane wrote:
>>The idea I just had was: why do we need EXPLAIN ANALYZE to run to
>>completion? In severe cases like this thread, we might be able to
>>discover the root cause by a *partial* execution of the plan, as long as
>>it was properly instrumented. That way, the OP might have been able to
>>discover the root cause himself...
>
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan?  You still don't have any hard data.

But at least you have some data, which is better than no data.  Even
knowing that the plan got stuck on a particular node of the query plan
could be vital information.  For a query that never finishes, you can't
even find out where it's getting stuck.

That's why Simon's proposal might help in some particularly difficult situations.

Regards,
Craig

Re: Some queries starting to hang

From
Scott Marlowe
Date:
On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> Tom Lane wrote:
> >>The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> >>completion? In severe cases like this thread, we might be able to
> >>discover the root cause by a *partial* execution of the plan, as long as
> >>it was properly instrumented. That way, the OP might have been able to
> >>discover the root cause himself...
> >
> >
> > I don't think that helps, as it just replaces one uncertainty by
> > another: how far did the EXPLAIN really get towards completion of the
> > plan?  You still don't have any hard data.
>
> But at least you have some data, which is better than no data.  Even
> knowing that the plan got stuck on a particular node of the query plan
> could be vital information.  For a query that never finishes, you can't
> even find out where it's getting stuck.
>
> That's why Simon's proposal might help in some particularly difficult situations.

Hmmmmm.  I wonder if it would be hard to give explain analyze a
per-node timeout qualifier?  Something that said if a node takes more
than x milliseconds, kill the explain analyze and list the plan up to
the nasty node that's using all the time up?

That would be extremely useful.

Re: Some queries starting to hang

From
"Jim C. Nasby"
Date:
On Tue, Jun 06, 2006 at 12:54:27PM -0500, Scott Marlowe wrote:
> On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> > Tom Lane wrote:
> > >>The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > >>completion? In severe cases like this thread, we might be able to
> > >>discover the root cause by a *partial* execution of the plan, as long as
> > >>it was properly instrumented. That way, the OP might have been able to
> > >>discover the root cause himself...
> > >
> > >
> > > I don't think that helps, as it just replaces one uncertainty by
> > > another: how far did the EXPLAIN really get towards completion of the
> > > plan?  You still don't have any hard data.
> >
> > But at least you have some data, which is better than no data.  Even
> > knowing that the plan got stuck on a particular node of the query plan
> > could be vital information.  For a query that never finishes, you can't
> > even find out where it's getting stuck.
> >
> > That's why Simon's proposal might help in some particularly difficult situations.
>
> Hmmmmm.  I wonder if it would be hard to give explain analyze a
> per-node timeout qualifier?  Something that said if a node takes more
> than x milliseconds, kill the explain analyze and list the plan up to
> the nasty node that's using all the time up?
>
> That would be extremely useful.

Maybe, maybe not. It would be very easy for this to croak on the first
sort it hits. I suspect the original proposal of aborting once a
rowcount estimate proves to be way off is a better idea.

For the record, I also think being able to get a current snapshot is
great, too.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Some queries starting to hang

From
Scott Marlowe
Date:
On Tue, 2006-06-06 at 15:51, Jim C. Nasby wrote:
> On Tue, Jun 06, 2006 at 12:54:27PM -0500, Scott Marlowe wrote:
> > On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> > > Tom Lane wrote:
> > > >>The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > > >>completion? In severe cases like this thread, we might be able to
> > > >>discover the root cause by a *partial* execution of the plan, as long as
> > > >>it was properly instrumented. That way, the OP might have been able to
> > > >>discover the root cause himself...
> > > >
> > > >
> > > > I don't think that helps, as it just replaces one uncertainty by
> > > > another: how far did the EXPLAIN really get towards completion of the
> > > > plan?  You still don't have any hard data.
> > >
> > > But at least you have some data, which is better than no data.  Even
> > > knowing that the plan got stuck on a particular node of the query plan
> > > could be vital information.  For a query that never finishes, you can't
> > > even find out where it's getting stuck.
> > >
> > > That's why Simon's proposal might help in some particularly difficult situations.
> >
> > Hmmmmm.  I wonder if it would be hard to give explain analyze a
> > per-node timeout qualifier?  Something that said if a node takes more
> > than x milliseconds, kill the explain analyze and list the plan up to
> > the nasty node that's using all the time up?
> >
> > That would be extremely useful.
>
> Maybe, maybe not. It would be very easy for this to croak on the first
> sort it hits. I suspect the original proposal of aborting once a
> rowcount estimate proves to be way off is a better idea.
>
> For the record, I also think being able to get a current snapshot is
> great, too.

I can see value in both.

Just because the row count is right doesn't mean it won't take a
fortnight of processing.  :)

The problem with the row count estimate being off from the real thing is
you only get it AFTER the set is retrieved for that node.

The cost of aborting on the first sort is minimal.  You just turn up the
number for the timeout and run it again.  1 minute or so wasted.

The cost of not aborting on the first sort is that you may never see
what the part of the plan is that's killing your query, since you never
get the actual plan.

Re: Some queries starting to hang

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> The cost of not aborting on the first sort is that you may never see
> what the part of the plan is that's killing your query, since you never
> get the actual plan.

Well, you can get the plan without waiting a long time; that's what
plain EXPLAIN is for.  But I still disagree with the premise that you
can extrapolate anything very useful from an unfinished EXPLAIN ANALYZE
run.  As an example, if the plan involves setup steps such as sorting or
loading a hashtable, cancelling after a minute might make it look like
the setup step is the big problem, distracting you from the possibility
that the *rest* of the plan would take weeks to run if you ever got to
it.

            regards, tom lane

Re: Some queries starting to hang

From
Scott Marlowe
Date:
On Tue, 2006-06-06 at 16:11, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > The cost of not aborting on the first sort is that you may never see
> > what the part of the plan is that's killing your query, since you never
> > get the actual plan.
>
> Well, you can get the plan without waiting a long time; that's what
> plain EXPLAIN is for.  But I still disagree with the premise that you
> can extrapolate anything very useful from an unfinished EXPLAIN ANALYZE
> run.  As an example, if the plan involves setup steps such as sorting or
> loading a hashtable, cancelling after a minute might make it look like
> the setup step is the big problem, distracting you from the possibility
> that the *rest* of the plan would take weeks to run if you ever got to
> it.

Sure, but it would be nice to see it report the partial work.

i.e. I got to using a nested loop, thought there would be 20 rows,
processed 250,000 or so, timed out at 10 minutes, and gave up.

I would find that useful.

Re: Some queries starting to hang

From
Simon Riggs
Date:
On Tue, 2006-06-06 at 11:41 -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> >> I don't think that helps, as it just replaces one uncertainty by
> >> another: how far did the EXPLAIN really get towards completion of the
> >> plan?  You still don't have any hard data.
>
> > Does that really matter, though? The point is to find the node where the
> > estimate proved to be fantasy.
>
> No, the point is to find out what reality is.

My point is knowing reality with less than 100% certainty is still very
frequently useful.

> Just knowing that the
> estimates are wrong doesn't really get you anywhere (we pretty much knew
> that before we even started looking at the EXPLAIN, eh?).

We were lucky enough to have two EXPLAINS that could be examined for
differences. Often, you have just one EXPLAIN and no idea which estimate
is incorrect, or whether they are all exactly correct. That is when an
EXPLAIN ANALYZE becomes essential - yet a *full* execution isn't
required in order to tell you what you need to know.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Re: Some queries starting to hang

From
Markus Schaber
Date:
Hi, Chris,

Chris Beecroft wrote:

> Query is now returning with results on our replicated database.  Will
> vacuum analyze production now.  So it seems to have done the trick.  Now
> the question is whether our autovacuum has failed or was not set up
> properly...  A question for my IT people.

In most of the cases where we had database bloat despite running
autovacuum, it was due to a too-low free space map (max_fsm_pages)
setting.
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org