Thread: Query sometimes takes down server
I am having a serious problem with my application and I hope someone can help me out. This could not happen at a worse time, as a consulting firm is at my client's site to recommend a new financial system, and the inventory system (which I developed) keeps locking up.

I have a dynamically built query that will periodically (2 times a day, and becoming more frequent) make my server totally unresponsive. The query that hangs the system is requesting a count(*) based on some parameters the user selects. Since I set my timeout to 2 minutes (the only way I have been able to deal with this so far), I see the offending query in the log.

I took the query from the logs, pasted it into pgAdmin, and ran it a few times. Sometimes it takes 700-900 ms, but other times it takes 60-100 seconds, and other times it never stops (I waited 10 minutes). If I run it 10 times I get roughly 8 runs under 1 second and 2 runs of 5-10 seconds, and maybe 1 in 20 will not complete at all.

Dev server specs:
1 CPU Xeon 5472 quad core 3.0 GHz, 1600 MHz FSB
2x 10k Raptor RAID 1
DB is 50 MB with a lot of table joins

These queries are being run with nothing else running on the server. My guess is that the planner is picking an inefficient plan sometimes. I have not looked into the detail of the explain output, but I can see visually that very different plans are being chosen. How can I see the plan that was chosen when the time is very high? Is there a way to print the query plan chosen when the query times out as well?
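For reference, the timeout is set roughly like this in postgresql.conf (the values are just what I am using, shown for illustration; log_min_duration_statement is optional but is what gets the slow statements into the log before they hit the timeout):

# abort any statement that runs longer than 2 minutes (value in milliseconds)
statement_timeout = 120000
# log any statement that takes longer than 1 second
log_min_duration_statement = 1000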
On Thu, 2009-01-15 at 18:44 -0600, Jason Long wrote:
> The query that hangs the system is requesting a count(*) based on some parameters the user selects.

Can you show an example of the full offending query? How big is the table?

> And maybe 1 in 20 will not complete at all.

If you really have nothing else running on the server, the queries should complete.

> I have not looked into the detail of the explain output, but I can see visually that very different plans are being chosen.

It would help to share these plans with us. Run it several times using EXPLAIN ANALYZE, and send the output of one that completes quickly, one that takes a while, and preferably wait for a long-running one to finish. For those that you can't wait for, show plain EXPLAIN output.

> How can I see the plan that was chosen when the time is very high?
> Is there a way to print the query plan chosen when the query times out as well?

There's a new module coming in 8.4 that makes this much easier. Right now, you have to run it yourself with EXPLAIN ANALYZE as above.

Regards,
Jeff Davis
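For example, something along these lines in psql (the table and column names here are placeholders for your real query):

EXPLAIN ANALYZE
SELECT count(*)
FROM some_table t
JOIN other_table o ON o.some_table_id = t.id
WHERE t.some_column = 'some value';

The "actual time" numbers in the output show where the time really goes, and comparing a fast run against a slow run usually makes the bad join choice obvious.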
> I have a dynamically built query that will periodically (2 times a day and becoming more frequent) make my server totally unresponsive.

Does this query involve more than geqo_threshold (default 12) tables? If so, this is most probably GEQO (the genetic query optimizer) kicking in. Try fiddling with some of the GEQO parameters (raising geqo_effort is the first choice).

Greetings
Marcin
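You can check and experiment in a psql session before touching postgresql.conf, e.g. (the value is only a starting point):

SHOW geqo_threshold;   -- GEQO is used once a query joins at least this many relations (default 12)
SET geqo_effort = 10;  -- range 1-10, default 5; higher spends more planning time looking for a better plan
-- then re-run the problem query (or its EXPLAIN ANALYZE) in the same session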
On Thu, 2009-01-15 at 19:37 -0600, Jason Long wrote:
> > > I have not looked into the detail of the explain output, but I can see visually that very different plans are being chosen.
> >
> > It would help to share these plans with us...
>
> See EXPLAIN ANALYZE below for three different plans @ .5, 6, and 84 seconds

Well, it's clear why you're getting a variety of plans. With that many tables in the FROM list there are a huge number of possible plans, so PostgreSQL will use a genetic algorithm to try to find the best one, which is based on random numbers.

My only advice would be to step back for a minute and see if there might be a better way to solve the original problem. Typically, numbers in table and field names are a sign something could be better designed. I may have some better advice if you say why you are joining so many tables, and what the numbers are for.

Regards,
Jeff Davis
On Fri, 2009-01-16 at 08:43 -0600, Jason Long wrote:
> The numbers in the table names are due to hibernate generating the query.

Well, that's what auto-generated schemas and queries do, I guess.

> Now we are getting somewhere.
> Someone suggested tweaking the genetic algorithm parameters.
> Has anyone else had to do this, and what results did you achieve?
> Can someone offer me some detailed advice on tweaking these parameters?

There are a lot of tables, so no matter what you do this will require GEQO (the genetic algorithm I was talking about).

The fact that some of the plans are fast is good news: it means that it's possible to execute the query quickly.

The other good news is that the slower plans are, indeed, estimated to be slower in the examples you provided (not by exactly proportional amounts, but it's still a good sign). If the estimates were so far off that they were basically random, GEQO wouldn't help much; but in your case they look surprisingly good.

I would try increasing geqo_effort, tweaking geqo_pool_size and geqo_generations (mostly try increasing these last two, but smaller values might be useful), and varying geqo_selection_bias between 1.5 and 2.

See the useful ranges of the parameters here:
http://www.postgresql.org/docs/8.3/static/runtime-config-query.html

When you start to get stable execution times (make sure you don't just get lucky once), keep the values you're using. Post to the list with your results.

You may be able to fix some of your queries (like this one), but I suspect this will just make the problem more rare. When you come up with some new query later, I think the problem will come back. The solution is really to have a more reasonable schema, something that PostgreSQL (and humans) can understand well enough to optimize.

Regards,
Jeff Davis
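For example, one combination to try in a psql session before putting anything in postgresql.conf (these particular numbers are only illustrative starting points, not recommendations):

SET geqo_effort = 10;           -- range 1-10, default 5
SET geqo_pool_size = 500;       -- 0 (the default) means "derive from geqo_effort"
SET geqo_generations = 500;     -- 0 (the default) means "derive from geqo_effort"
SET geqo_selection_bias = 1.8;  -- useful range 1.5 to 2.0

Then run the EXPLAIN ANALYZE of the problem query a number of times under each combination and compare.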
Good idea, although tweaks to geqo_pool_size, geqo_generations, and geqo_selection_bias will affect all queries.
For large and unwieldy queries you might want to look at breaking the query down into smaller pieces, e.g.:

Break each statement down to 2 tables with 1 join (preferably an inner join with USING), and of course index the columns used in the join.

If for any reason the results become unwieldy (the plan shows cartesian joins, full table scans, or the dreaded sort-merge join where a nested loop would do), you may want to consider creating a temp table which holds the results for the current session for each join (see the sketch below).

Alternatively, create a view which contains the results in read-only form, possibly using a materialized view to allow indexes and refresh capability.
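A rough sketch of the temp table idea (all table and column names here are made up; adjust to your schema):

-- materialize one join once for the session...
CREATE TEMP TABLE item_with_status AS
SELECT i.id, i.description, s.status_code
FROM item i
JOIN item_status s USING (status_id);

-- ...then index and analyze it so later joins against it plan well
CREATE INDEX item_with_status_idx ON item_with_status (status_code);
ANALYZE item_with_status;

-- subsequent queries in the session join against item_with_status instead of repeating the join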
HTH
Martin
Jeff Davis wrote:
> Well, that's what auto-generated schemas and queries do, I guess.

The schema is not auto-generated. It evolved as I created my inventory system. It is relatively easy for humans to understand, or at least for me, since I wrote it.

I am familiar with global optimization. It was part of my research for my master's degree.

> There are a lot of tables, so no matter what you do this will require GEQO (the genetic algorithm I was talking about).

I raised geqo_effort to 10 and this made the problem happen much less frequently, but it still happens fairly often.

> I would try increasing geqo_effort, tweaking geqo_pool_size and geqo_generations (mostly try increasing these last two, but smaller values might be useful), and varying geqo_selection_bias between 1.5 and 2.

Thanks for the advice. I will post my results if I achieve improvement. I think I am going to write a script and run the query enough times to see some statistics on how my tuning of the parameters works. Would anyone have such a script already made? Maybe I should use a genetic algorithm to analyze all the possible combinations of GEQO parameters for my case. :)

But realistically, I am slammed with work, and while this is very interesting to me, I will have to keep cranking out new features to keep the clients happy. Hopefully I can find a better set of parameters through trial and error.

> You may be able to fix some of your queries (like this one), but I suspect this will just make the problem more rare.

Making them more rare is probably good enough for now. Thank you very much for the advice.
On Fri, 2009-01-16 at 12:35 -0600, Jason Long wrote:
> The schema is not auto-generated. It evolved as I created my inventory system.
> It is relatively easy for humans to understand, or at least for me, since I wrote it.

On second look, there aren't that many tables. There are just a lot of self joins. Those are going to be hard for PostgreSQL to optimize effectively.

Looks like it's just the query that's hard to understand. You might do better just writing it yourself from scratch.

Regards,
Jeff Davis
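P.S. If you do rewrite it by hand, one trick worth knowing is that you can spell out the join order yourself and tell the planner to keep it, which sidesteps both the exhaustive search and GEQO for that one query. A purely illustrative sketch (the table and column names are made up):

SET join_collapse_limit = 1;  -- the planner keeps explicit JOIN syntax in the order written
SELECT count(*)
FROM pipe p
JOIN pipe parent ON parent.id = p.parent_id   -- one of the self joins
JOIN yard y ON y.id = p.yard_id
WHERE y.name = 'Houston';

This only helps if the hand-picked order is actually good, so verify it with EXPLAIN ANALYZE first.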
On Fri, 2009-01-16 at 13:37 -0500, Martin Gainty wrote:
> Good idea, although tweaks to geqo_pool_size, geqo_generations, and geqo_selection_bias will affect all queries.

Only queries that invoke GEQO.

Regards,
Jeff Davis
On Thu, Jan 15, 2009 at 11:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2009-01-15 at 19:37 -0600, Jason Long wrote:
>> See EXPLAIN ANALYZE below for three different plans @ .5, 6, and 84 seconds

Jason, note that the post with the explain analyze didn't make it to the mailing list, so you're effectively leaving everyone else who could give you advice out of the conversation.
Weird. I wonder if the attachment is too big and the mailing list server is chopping it off of the email.

On Fri, Jan 16, 2009 at 1:19 PM, Jason Long <mailing.list@supernovasoftware.com> wrote:
> I just sent it again. I wonder why it did not make it. Hopefully this one will. Please let me know if this one does not either.

--
When fascism comes to America, it will be draped in a flag and carrying a cross - Sinclair Lewis
Scott Marlowe wrote:
> Weird. I wonder if the attachment is too big and the mailing list server is chopping it off of the email.

I just tried it by sending text only instead of text and html. We will see if it goes through this time. Other than that, do you see anything weird about my email?
Scott Marlowe wrote:
> Jason, note that the post with the explain analyze didn't make it to the mailing list, so you're effectively leaving everyone else who could give you advice out of the conversation.

I just sent it again. I wonder why it did not make it. Hopefully this one will. Please let me know if this one does not either.
On Fri, Jan 16, 2009 at 3:27 PM, Jason Long <mailing.list@supernovasoftware.com> wrote:
> I just tried it by sending text only instead of text and html. We will see if it goes through this time.
> Other than that, do you see anything weird about my email?

Still nothing. Do you have webspace you could place it on? If not, you could use http://explain-analyze.info and simply provide the list with links to the plans.

--
- David T. Wilson
david.t.wilson@gmail.com
David Wilson wrote:
> Still nothing. Do you have webspace you could place it on? If not, you could use http://explain-analyze.info and simply provide the list with links to the plans.

Actually, I tweaked the GEQO parameters as follows:
# - Genetic Query Optimizer -
geqo = on
geqo_threshold = 12
geqo_effort = 10 # range 1-10
geqo_pool_size = 1000 # selects default based on effort
geqo_generations = 1000 # selects default based on effort
geqo_selection_bias = 2.0 # range 1.5-2.0
I use the following script to test the settings.
#!/bin/bash
for ((i=0;i<=1000;i+=1)); do
psql -d pipetracker-dev -f /opt/main_list_count.sql | grep runtime >> /var/lib/pgsql/test.txt
done
The query was able to run 1000 times without even getting close to the 2 minute timeout I have set.
Here are some statistics I have from the runs.
Out of 1000 runs, only these were more than 3 seconds:
17377.86 ms
15040.32 ms
12343.12 ms
11915.26 ms
11409.88 ms
9719.72 ms
8535.49 ms
5531.8 ms
5286.9 ms
4920.17 ms
3849.4 ms
Avg 468.74 ms
Min 173.3 ms
Max 17377.86 ms
STDEV 1102.35 ms
I then put the settings back to the default and got a timeout after 45 tries.
These are the results greater than 3 seconds from those 45
114450.17 ms
79529.79 ms
40999.69 ms
28981.87 ms
25802.51 ms
13346.41 ms
9569.23 ms
9267.75 ms
8221.4 ms
8213.87 ms
5975.05 ms
Thank you everyone for your help.
Has anyone developed some sort of testing script that is much better than my simple bash script shown above?

I would like to just let some queries run, have the script alter some parameters in postgresql.conf, and run them again. After running for a while it could tune these parameters based on the queries I have it run.
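Something along these lines is what I have in mind; this is an untested sketch (the geqo_effort values and output file names are just examples) that reuses my query file from above:

#!/bin/bash
# For each candidate geqo_effort value, run the query repeatedly in the same
# session as the SET and collect the "Total runtime" lines for comparison.
for effort in 5 7 10; do
  for ((i = 0; i < 100; i++)); do
    {
      echo "SET geqo_effort = $effort;"
      cat /opt/main_list_count.sql
    } | psql -d pipetracker-dev | grep runtime >> /var/lib/pgsql/test_effort_${effort}.txt
  done
done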
Thank you everyone for your advice. My users will be much happier. :)
Can anyone see any problems with the settings I have and how they might affect my application as a whole?