Thread: Some queries starting to hang
Hello,

I've noticed some posts on hanging queries but haven't seen any solutions so far.

Our problem is that about a week and a half ago we started to get some queries that would (seemingly) never return (e.g., normally run in a couple minutes, but after 2.5 hours, they were still running, the process pushing the processor up to 99.9% active).

We are running Postgres 8.1.1 on RedHat 7.3 using Dell PowerEdge quad-processor boxes with 4 GB of memory. We have a main database that is replicated via Slony to an identical system.

Things we've tried so far:
- We've stopped and restarted Postgres; that didn't seem to help.
- We've rebuilt all the indexes; that didn't seem to help either.
- We've stopped replication between the boxes; that didn't do anything.
- We've tried the queries on both the production and the replicated box, and there is no difference in the queries (or query plans).

We do have another identical system that is a backup box (same type of box, Postgres 8.1.1, RedHat 7.3, etc.), and there the query does complete in a short time. We loaded up a current copy of the production database onto it and it still responded quickly. Generally these queries, although not complicated, are on the more complex side of our application, and they had been running fine up until a few weeks ago.

Attached are an example query plan: Query.sql
The query plan from our production server: QueryPlanBroke.txt
The working query plan from our backup server: QueryPlanWork.txt

What we have found that works so far is to remove all the outer joins, put the results into a temp table, and then left join from the temp table to get our results. Certainly this isn't a solution, but rather something we have resorted to in a place or two as we limp along.

Any help would be greatly appreciated.

Thanks,
Chris Beecroft
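[Editorial note: the temp-table workaround Chris describes looks roughly like the sketch below. All table and column names (orders, order_items, candidates, notes) are invented for illustration, since the real query is only in the attachment.]

-- Materialize the selective inner part of the query first
-- (placeholder names, not from the attached query).
CREATE TEMP TABLE candidates AS
SELECT o.order_id, o.customer_id, i.item_id
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
WHERE o.created_on >= CURRENT_DATE - 30;

-- Give the planner real row counts for the temp table.
ANALYZE candidates;

-- Then do the outer joins against the much smaller temp table.
SELECT c.order_id, c.item_id, n.note_text
FROM candidates c
LEFT JOIN notes n ON n.item_id = c.item_id;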
On Mon, Jun 05, 2006 at 12:05:08PM -0700, Chris Beecroft wrote:
> Our problem is that about a week and a half ago we started to get some
> queries that would (seemingly) never return (e.g., normally run in a
> couple minutes, but after 2.5 hours, they were still running, the
> process pushing the processor up to 99.9% active).

Are there any locks preventing the query from completing? I can't recall how you check in 7.3, but if nothing else, you can check with ps for something WAITING.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful than
reformatting your hard drive when it gets out of whack.
        --Scott Morris
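[Editorial note: a quick way to look for blocked backends on a server of this vintage is to list ungranted entries in pg_locks. A minimal sketch; the procpid and current_query columns assume a pre-9.2 pg_stat_activity, and current_query is only populated if stats_command_string is enabled.]

-- Lock requests that have not been granted, with the waiting backend's query.
SELECT l.locktype, l.relation::regclass AS relation, l.mode,
       a.procpid, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;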
Chris Beecroft <CBeecroft@PrO-Unlimited.com> writes:
> Our problem is that about a week and a half ago we started to get some
> queries that would (seemingly) never return (e.g., normally run in a
> couple minutes, but after 2.5 hours, they were still running, the
> process pushing the processor up to 99.9% active).

> Attached are an example query plan: Query.sql
> The query plan from our production server: QueryPlanBroke.txt
> The working query plan from our backup server: QueryPlanWork.txt

Note the major difference in estimated row counts. That's the key to your problem... you need to find out why the "broke" case thinks only one row is getting selected.

broke:
> -> Nested Loop (cost=30150.77..129334.04 rows=1 width=305)

work:
> -> Hash Join (cost=30904.77..125395.89 rows=1810 width=306)

I'm wondering about out-of-date or nonexistent ANALYZE stats, missing custom adjustments of statistics target settings, etc.

        regards, tom lane
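[Editorial note: the usual first steps when a plan shows a wildly low row estimate are to re-gather statistics and, if a column's distribution is skewed, raise its statistics target. A sketch follows; the table and column names are invented for illustration.]

-- Refresh the planner's statistics for the suspect table
-- ("orders" and "status" are placeholder names, not from Chris's query).
ANALYZE VERBOSE orders;

-- Collect a larger sample for a skewed column (the default statistics
-- target in 8.1 is 10), then analyze again.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 100;
ANALYZE orders;

-- See what the planner currently believes about that column.
SELECT n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';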
On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
>
> broke:
> > -> Nested Loop (cost=30150.77..129334.04 rows=1 width=305)
>
> work:
> > -> Hash Join (cost=30904.77..125395.89 rows=1810 width=306)
>
> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> custom adjustments of statistics target settings, etc.

But even the nested loop shouldn't be a "never returns" case, should it? For 1800 rows? (I've _had_ bad plans that picked nestloop, for sure, but they're usually for tens of thousands of rows when they take forever).

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
        --Damien Katz
Thanks Tom,

I knew you would come through again!

Query is now returning with results on our replicated database. Will vacuum analyze production now. So it seems to have done the trick. Now the question is has our auto vacuum failed or was not set up properly... A question for my IT people.

Thanks once again,
Chris Beecroft

On Mon, 2006-06-05 at 13:07, Tom Lane wrote:
> Chris Beecroft <CBeecroft@PrO-Unlimited.com> writes:
> > Our problem is that about a week and a half ago we started to get some
> > queries that would (seemingly) never return (e.g., normally run in a
> > couple minutes, but after 2.5 hours, they were still running, the
> > process pushing the processor up to 99.9% active).
>
> > Attached are an example query plan: Query.sql
> > The query plan from our production server: QueryPlanBroke.txt
> > The working query plan from our backup server: QueryPlanWork.txt
>
> Note the major difference in estimated row counts. That's the key to
> your problem... you need to find out why the "broke" case thinks only
> one row is getting selected.
>
> broke:
> > -> Nested Loop (cost=30150.77..129334.04 rows=1 width=305)
>
> work:
> > -> Hash Join (cost=30904.77..125395.89 rows=1810 width=306)
>
> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> custom adjustments of statistics target settings, etc.
>
> regards, tom lane
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
>> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
>> custom adjustments of statistics target settings, etc.

> But even the nested loop shouldn't be a "never returns" case, should
> it? For 1800 rows?

Well, it's a big query. If it ought to take a second or two, and instead is taking an hour or two (1800 times the expected runtime), that might be close enough to "never" to exhaust Chris' patience. Besides, we don't know whether the 1800 might itself be an underestimate (too bad Chris didn't provide EXPLAIN ANALYZE results). The hash plan will scale to larger numbers of rows much more gracefully than the nestloop ...

        regards, tom lane
On Mon, 2006-06-05 at 14:06, Tom Lane wrote:
> Andrew Sullivan <ajs@crankycanuck.ca> writes:
> > On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
> >> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> >> custom adjustments of statistics target settings, etc.
>
> > But even the nested loop shouldn't be a "never returns" case, should
> > it? For 1800 rows?
>
> Well, it's a big query. If it ought to take a second or two, and
> instead is taking an hour or two (1800 times the expected runtime), that
> might be close enough to "never" to exhaust Chris' patience. Besides,
> we don't know whether the 1800 might itself be an underestimate (too bad
> Chris didn't provide EXPLAIN ANALYZE results). The hash plan will scale
> to larger numbers of rows much more gracefully than the nestloop ...
>
> regards, tom lane

Hello,

If anyone is curious, I've attached an explain analyze from the now-working replicated database. Explain analyze did not seem to return on the 'broken' database (or at least, when we originally tried to test these, it did not return after an hour and a half, which was enough time to head right past patient into crabby...)

Chris
On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
> Andrew Sullivan <ajs@crankycanuck.ca> writes:
> > On Mon, Jun 05, 2006 at 04:07:19PM -0400, Tom Lane wrote:
> >> I'm wondering about out-of-date or nonexistent ANALYZE stats, missing
> >> custom adjustments of statistics target settings, etc.
>
> > But even the nested loop shouldn't be a "never returns" case, should
> > it? For 1800 rows?
>
> Well, it's a big query. If it ought to take a second or two, and
> instead is taking an hour or two (1800 times the expected runtime), that
> might be close enough to "never" to exhaust Chris' patience. Besides,
> we don't know whether the 1800 might itself be an underestimate (too bad
> Chris didn't provide EXPLAIN ANALYZE results).

This is a good example of a case where the inefficiency of EXPLAIN ANALYZE would be a contributory factor to it not actually being available for diagnosing a problem.

Maybe we need something even more drastic than recent proposed changes to EXPLAIN ANALYZE?

Perhaps we could annotate the query tree with individual limits. That way a node that was expecting to deal with 1 row would simply stop executing the EXPLAIN ANALYZE when it hit N times as many rows (default = no limit). That way, we would still be able to see a bad plan even without waiting for the whole query to execute - just stop at a point where the plan is far enough off track. That would give us what we need: pinpoint exactly which part of the plan is off-track and see how far off track it is. If the limits were configurable, we'd be able to opt for faster-but-less-accurate or slower-yet-100%-accurate behaviour. We wouldn't need to worry about timing overhead either then.

e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
>> Well, it's a big query. If it ought to take a second or two, and
>> instead is taking an hour or two (1800 times the expected runtime), that
>> might be close enough to "never" to exhaust Chris' patience. Besides,
>> we don't know whether the 1800 might itself be an underestimate (too bad
>> Chris didn't provide EXPLAIN ANALYZE results).

> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.

Huh? The problem is the inefficiency of the underlying query.

        regards, tom lane
On Tue, 2006-06-06 at 10:43 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-06-05 at 17:06 -0400, Tom Lane wrote:
> >> Well, it's a big query. If it ought to take a second or two, and
> >> instead is taking an hour or two (1800 times the expected runtime), that
> >> might be close enough to "never" to exhaust Chris' patience. Besides,
> >> we don't know whether the 1800 might itself be an underestimate (too bad
> >> Chris didn't provide EXPLAIN ANALYZE results).
>
> > This is a good example of a case where the inefficiency of EXPLAIN
> > ANALYZE would be a contributory factor to it not actually being
> > available for diagnosing a problem.
>
> Huh? The problem is the inefficiency of the underlying query.

Of course that was the main problem from the OP.

You mentioned it would be good if the OP had delivered an EXPLAIN ANALYZE; I agree(d). The lack of EXPLAIN ANALYZE is frequently because you can't get them to run to completion - more so when the query you wish to analyze doesn't appear to complete either.

The idea I just had was: why do we need EXPLAIN ANALYZE to run to completion? In severe cases like this thread, we might be able to discover the root cause by a *partial* execution of the plan, as long as it was properly instrumented. That way, the OP might have been able to discover the root cause himself...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 05, 2006 at 01:39:38PM -0700, Chris Beecroft wrote:
> Thanks Tom,
>
> I knew you would come through again!
>
> Query is now returning with results on our replicated database. Will
> vacuum analyze production now. So it seems to have done the trick. Now
> the question is has our auto vacuum failed or was not set up properly...
> A question for my IT people.

You should almost certainly be running the autovacuum that's built in now. If you enable vacuum_cost_delay you should be able to make it so that vacuum's impact on production is minimal. The other thing you'll want to do is cut all the vacuum threshold and scale settings in half (the defaults are very conservative).

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
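[Editorial note: the settings Jim refers to live in postgresql.conf on an 8.1 server. The excerpt below is only a sketch of "defaults cut in half" plus a modest cost delay, not tuned recommendations for Chris's system.]

# postgresql.conf excerpt (8.1-era parameter names; values illustrative only)
stats_start_collector = on             # both stats settings are required
stats_row_level = on                   # for the integrated autovacuum in 8.1
autovacuum = on                        # off by default in 8.1
autovacuum_naptime = 60                # seconds between autovacuum cycles
autovacuum_vacuum_threshold = 500      # 8.1 default: 1000
autovacuum_vacuum_scale_factor = 0.2   # 8.1 default: 0.4
autovacuum_analyze_threshold = 250     # 8.1 default: 500
autovacuum_analyze_scale_factor = 0.1  # 8.1 default: 0.2
vacuum_cost_delay = 10                 # ms; throttles vacuum's I/O impact
                                       # (autovacuum inherits this unless
                                       # autovacuum_vacuum_cost_delay is set)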
Simon Riggs <simon@2ndquadrant.com> writes:
> You mentioned it would be good if the OP had delivered an EXPLAIN
> ANALYZE; I agree(d). The lack of EXPLAIN ANALYZE is frequently because
> you can't get them to run to completion - more so when the query you
> wish to analyze doesn't appear to complete either.

Well, he could have shown EXPLAIN ANALYZE for the server that was managing to run the query in a reasonable amount of time.

> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> completion? In severe cases like this thread, we might be able to
> discover the root cause by a *partial* execution of the plan, as long as
> it was properly instrumented. That way, the OP might have been able to
> discover the root cause himself...

I don't think that helps, as it just replaces one uncertainty by another: how far did the EXPLAIN really get towards completion of the plan? You still don't have any hard data.

        regards, tom lane
On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> > The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > completion? In severe cases like this thread, we might be able to
> > discover the root cause by a *partial* execution of the plan, as long as
> > it was properly instrumented. That way, the OP might have been able to
> > discover the root cause himself...
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan? You still don't have any hard data.

Does that really matter, though? The point is to find the node where the estimate proved to be fantasy. It might even make sense to highlight that node in the output, so that users don't have to wade through a sea of numbers to find it.

If it is important to report how far along the query got, it seems that could always be added to the explain output.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Explain analyze could at least put an asterisk around actual time that deviated by some factor from the estimated time.

On Tue, June 6, 2006 10:39 am, Simon Riggs wrote:
>
> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.
>
> Maybe we need something even more drastic than recent proposed changes
> to EXPLAIN ANALYZE?
>
> Perhaps we could annotate the query tree with individual limits. That
> way a node that was expecting to deal with 1 row would simply stop
> executing the EXPLAIN ANALYZE when it hit N times as many rows
> (default = no limit). That way, we would still be able to see a bad plan
> even without waiting for the whole query to execute - just stop at a
> point where the plan is far enough off track. That would give us what we
> need: pinpoint exactly which part of the plan is off-track and see how
> far off track it is. If the limits were configurable, we'd be able to
> opt for faster-but-less-accurate or slower-yet-100%-accurate behaviour.
> We wouldn't need to worry about timing overhead either then.
>
> e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> > it was properly instrumented. That way, the OP might have been able to
> > discover the root cause himself...
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan? You still don't have any hard data.

Well, you _might_ get something useful, if you're trying to work on a maladjusted production system, because you get to the part that trips the limit, and then you know, "Well, I gotta fix it that far, anyway."

Often, when you're in real trouble, you can't or don't wait for the full plan to come back from EXPLAIN ANALYSE, because a manager is helpfully standing over your shoulder asking whether you're there yet. Being able to say, "Aha, we have the first symptom," might be helpful to users. Because the impatient simply won't wait for the full report to come back, and therefore they'll end up flying blind instead. (Note that "the impatient" is not always the person logged in and executing the commands.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to. That actually seems sort of quaint now.
        --J.D. Baldwin
Hmmm... It could generate NOTICEs whenever there is a drastic difference in rowcount or actual time...

On Tue, June 6, 2006 11:29 am, Andrew Sullivan wrote:
> On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
>
>>> it was properly instrumented. That way, the OP might have been able
>>> to discover the root cause himself...
>>
>> I don't think that helps, as it just replaces one uncertainty by
>> another: how far did the EXPLAIN really get towards completion of the
>> plan? You still don't have any hard data.
>
> Well, you _might_ get something useful, if you're trying to work on a
> maladjusted production system, because you get to the part that trips the
> limit, and then you know, "Well, I gotta fix it that far, anyway."
>
> Often, when you're in real trouble, you can't or don't wait for the
> full plan to come back from EXPLAIN ANALYSE, because a manager is
> helpfully standing over your shoulder asking whether you're there yet.
> Being able to say, "Aha, we have the first symptom," might be helpful
> to users. Because the impatient simply won't wait for the full report
> to come back, and therefore they'll end up flying blind instead. (Note
> that "the impatient" is not always the person logged in and executing
> the commands.)
>
> A
>
> --
> Andrew Sullivan | ajs@crankycanuck.ca
> I remember when computers were frustrating because they *did* exactly what
> you told them to. That actually seems sort of quaint now. --J.D. Baldwin
Simon Riggs <simon@2ndquadrant.com> writes:
> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> completion? In severe cases like this thread, we might be able to
> discover the root cause by a *partial* execution of the plan, as long as
> it was properly instrumented. That way, the OP might have been able to
> discover the root cause himself...

An alternate approach would be to implement a SIGINFO handler that prints out the explain analyze output for the data built up so far. You would be able to keep hitting C-t and keep getting updates until the query completes or you decided to hit C-c.

I'm not sure how easy this would be to implement but it sure would be nice from a user's point of view. Much nicer than having to specify some arbitrary limit before running the query.

--
greg
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote: >> I don't think that helps, as it just replaces one uncertainty by >> another: how far did the EXPLAIN really get towards completion of the >> plan? You still don't have any hard data. > Does that really matter, though? The point is to find the node where the > estimate proved to be fantasy. No, the point is to find out what reality is. Just knowing that the estimates are wrong doesn't really get you anywhere (we pretty much knew that before we even started looking at the EXPLAIN, eh?). regards, tom lane
On Tue, Jun 06, 2006 at 11:37:46AM -0400, Greg Stark wrote:
> An alternate approach would be to implement a SIGINFO handler that
> prints out the explain analyze output for the data built up so far.
> You would be able to keep hitting C-t and keep getting updates
> until the query completes or you decided to hit C-c.

This is even better, and pretty much along the lines I was thinking in my other mail. If you can see the _first_ spot you break, you can start working. We all know (or I hope so, anyway) that it would be better to get the full result, and know everything that needs attention before starting. As nearly as I can tell, however, they don't teach Mill's methods to MBAs of a certain stripe, so changes start getting made without all the data being available. It'd be nice to be able to bump the set of available data to something higher than "none".

(That said, I appreciate that there's precious little reason to spend a lot of work optimising a feature that is mostly there to counteract bad management practices.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
When my information changes, I alter my conclusions. What do you do sir?
        --attr. John Maynard Keynes
Simon Riggs wrote:
>> Well, it's a big query. If it ought to take a second or two, and
>> instead is taking an hour or two (1800 times the expected runtime), that
>> might be close enough to "never" to exhaust Chris' patience. Besides,
>> we don't know whether the 1800 might itself be an underestimate (too bad
>> Chris didn't provide EXPLAIN ANALYZE results).
>
> This is a good example of a case where the inefficiency of EXPLAIN
> ANALYZE would be a contributory factor to it not actually being
> available for diagnosing a problem.

This is a frustration I have, but Simon expressed it much more concisely. The first question one gets in this forum is, "Did you run EXPLAIN ANALYZE?" But if EXPLAIN ANALYZE never finishes, you can't get the information you need to diagnose the problem.

Simon's proposal,

> e.g. EXPLAIN ANALYZE ERRLIMIT 10 SELECT ...

or something similar, would be a big help. I.e. "If you can't finish in a reasonable time, at least tell me as much as you can."

Craig
Tom Lane wrote:
>> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
>> completion? In severe cases like this thread, we might be able to
>> discover the root cause by a *partial* execution of the plan, as long as
>> it was properly instrumented. That way, the OP might have been able to
>> discover the root cause himself...
>
> I don't think that helps, as it just replaces one uncertainty by
> another: how far did the EXPLAIN really get towards completion of the
> plan? You still don't have any hard data.

But at least you have some data, which is better than no data. Even knowing that the plan got stuck on a particular node of the query plan could be vital information. For a query that never finishes, you can't even find out where it's getting stuck.

That's why Simon's proposal might help in some particularly difficult situations.

Regards,
Craig
On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> Tom Lane wrote:
> >> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> >> completion? In severe cases like this thread, we might be able to
> >> discover the root cause by a *partial* execution of the plan, as long as
> >> it was properly instrumented. That way, the OP might have been able to
> >> discover the root cause himself...
> >
> > I don't think that helps, as it just replaces one uncertainty by
> > another: how far did the EXPLAIN really get towards completion of the
> > plan? You still don't have any hard data.
>
> But at least you have some data, which is better than no data. Even
> knowing that the plan got stuck on a particular node of the query plan
> could be vital information. For a query that never finishes, you can't
> even find out where it's getting stuck.
>
> That's why Simon's proposal might help in some particularly difficult
> situations.

Hmmmmm. I wonder if it would be hard to have explain analyze take a timeout-per-node qualifier? Something that said: if a node takes more than x milliseconds, kill the explain analyze and list the plan up to the nasty node that's using up all the time?

That would be extremely useful.
On Tue, Jun 06, 2006 at 12:54:27PM -0500, Scott Marlowe wrote:
> On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> > Tom Lane wrote:
> > >> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > >> completion? In severe cases like this thread, we might be able to
> > >> discover the root cause by a *partial* execution of the plan, as long as
> > >> it was properly instrumented. That way, the OP might have been able to
> > >> discover the root cause himself...
> > >
> > > I don't think that helps, as it just replaces one uncertainty by
> > > another: how far did the EXPLAIN really get towards completion of the
> > > plan? You still don't have any hard data.
> >
> > But at least you have some data, which is better than no data. Even
> > knowing that the plan got stuck on a particular node of the query plan
> > could be vital information. For a query that never finishes, you can't
> > even find out where it's getting stuck.
> >
> > That's why Simon's proposal might help in some particularly difficult
> > situations.
>
> Hmmmmm. I wonder if it would be hard to have explain analyze take a
> timeout-per-node qualifier? Something that said: if a node takes more
> than x milliseconds, kill the explain analyze and list the plan up to
> the nasty node that's using up all the time?
>
> That would be extremely useful.

Maybe, maybe not. It would be very easy for this to croak on the first sort it hits. I suspect the original proposal of aborting once a rowcount estimate proves to be way off is a better idea.

For the record, I also think being able to get a current snapshot is great, too.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Tue, 2006-06-06 at 15:51, Jim C. Nasby wrote:
> On Tue, Jun 06, 2006 at 12:54:27PM -0500, Scott Marlowe wrote:
> > On Tue, 2006-06-06 at 12:50, Craig A. James wrote:
> > > Tom Lane wrote:
> > > >> The idea I just had was: why do we need EXPLAIN ANALYZE to run to
> > > >> completion? In severe cases like this thread, we might be able to
> > > >> discover the root cause by a *partial* execution of the plan, as
> > > >> long as it was properly instrumented. That way, the OP might have
> > > >> been able to discover the root cause himself...
> > > >
> > > > I don't think that helps, as it just replaces one uncertainty by
> > > > another: how far did the EXPLAIN really get towards completion of the
> > > > plan? You still don't have any hard data.
> > >
> > > But at least you have some data, which is better than no data. Even
> > > knowing that the plan got stuck on a particular node of the query plan
> > > could be vital information. For a query that never finishes, you can't
> > > even find out where it's getting stuck.
> > >
> > > That's why Simon's proposal might help in some particularly difficult
> > > situations.
> >
> > Hmmmmm. I wonder if it would be hard to have explain analyze take a
> > timeout-per-node qualifier? Something that said: if a node takes more
> > than x milliseconds, kill the explain analyze and list the plan up to
> > the nasty node that's using up all the time?
> >
> > That would be extremely useful.
>
> Maybe, maybe not. It would be very easy for this to croak on the first
> sort it hits. I suspect the original proposal of aborting once a
> rowcount estimate proves to be way off is a better idea.
>
> For the record, I also think being able to get a current snapshot is
> great, too.

I can see value in both. Just because the row count is right doesn't mean it won't take a fortnight of processing. :)

The problem with the row count estimate being off from the real thing is that you only get it AFTER the set is retrieved for that node.

The cost of aborting on the first sort is minimal. You just turn up the number for the timeout and run it again; a minute or so wasted. The cost of not aborting on the first sort is that you may never see what the part of the plan is that's killing your query, since you never get the actual plan.
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> The cost of not aborting on the first sort is that you may never see
> what the part of the plan is that's killing your query, since you never
> get the actual plan.

Well, you can get the plan without waiting a long time; that's what plain EXPLAIN is for. But I still disagree with the premise that you can extrapolate anything very useful from an unfinished EXPLAIN ANALYZE run. As an example, if the plan involves setup steps such as sorting or loading a hashtable, cancelling after a minute might make it look like the setup step is the big problem, distracting you from the possibility that the *rest* of the plan would take weeks to run if you ever got to it.

        regards, tom lane
On Tue, 2006-06-06 at 16:11, Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
> > The cost of not aborting on the first sort is that you may never see
> > what the part of the plan is that's killing your query, since you never
> > get the actual plan.
>
> Well, you can get the plan without waiting a long time; that's what
> plain EXPLAIN is for. But I still disagree with the premise that you
> can extrapolate anything very useful from an unfinished EXPLAIN ANALYZE
> run. As an example, if the plan involves setup steps such as sorting or
> loading a hashtable, cancelling after a minute might make it look like
> the setup step is the big problem, distracting you from the possibility
> that the *rest* of the plan would take weeks to run if you ever got to
> it.

Sure, but it would be nice to see it report the partial work. I.e. "I got to using a nested loop, thought there would be 20 rows, processed 250,000 or so, timed out at 10 minutes, and gave up."

I would find that useful.
On Tue, 2006-06-06 at 11:41 -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Tue, Jun 06, 2006 at 11:06:09AM -0400, Tom Lane wrote:
> >> I don't think that helps, as it just replaces one uncertainty by
> >> another: how far did the EXPLAIN really get towards completion of the
> >> plan? You still don't have any hard data.
>
> > Does that really matter, though? The point is to find the node where the
> > estimate proved to be fantasy.
>
> No, the point is to find out what reality is.

My point is knowing reality with less than 100% certainty is still very frequently useful.

> Just knowing that the
> estimates are wrong doesn't really get you anywhere (we pretty much knew
> that before we even started looking at the EXPLAIN, eh?).

We were lucky enough to have two EXPLAINs that could be examined for differences. Often, you have just one EXPLAIN and no idea which estimate is incorrect, or whether they are all exactly correct. That is when an EXPLAIN ANALYZE becomes essential - yet a *full* execution isn't required in order to tell you what you need to know.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Hi, Chris,

Chris Beecroft wrote:
> Query is now returning with results on our replicated database. Will
> vacuum analyze production now. So it seems to have done the trick. Now
> the question is has our auto vacuum failed or was not set up properly...
> A question for my IT people.

In most of the cases where we had database bloat despite running autovacuum, it was due to a low free_space_map setting.

Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
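[Editorial note: on 8.1 the free space map limits Markus mentions are set in postgresql.conf via max_fsm_pages and max_fsm_relations, and a database-wide VACUUM VERBOSE is the usual way to check whether they are large enough. A minimal sketch; the exact wording of the summary message varies by release.]

-- Run as superuser against the whole database; on 8.1-era servers the last
-- few lines of output summarize free space map usage and warn when
-- max_fsm_pages is too small.
VACUUM VERBOSE;

-- Current limits; raising them means editing postgresql.conf and
-- restarting the server on 8.1.
SHOW max_fsm_pages;
SHOW max_fsm_relations;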