Thread: Query sometimes takes down server

Query sometimes takes down server

From: Jason Long
I am having a serious problem with my application and I hope someone can
help me out.
This could not happen at a worse time, as a consulting firm is at my
client's to recommend a new financial system and the inventory
system (which I developed) keeps locking up.

I have a dynamically built query that will periodically (2 times a day,
and becoming more frequent) make my server totally unresponsive.

The query that hangs the system is requesting a count(*)
based on some parameters the user selects.

Since I set my timeout to 2 minutes (the only way I have been able to
deal with this so far), I see the offending query in the log.

I took the query from the logs and pasted it into pgAdmin and ran it a
few times.

Sometimes it takes 700-900 ms, but other times it takes 60-100 seconds.
Other times it never stops (I waited 10 minutes).

If I run it 10 times I get:

8 runs in less than 1 sec
2 runs in 5-10 sec

And maybe 1 in 20 will not complete.

Dev server specs:
1 CPU Xeon 5472 quad-core 3.0 GHz, 1600 MHz FSB
2x 10k Raptors in RAID 1
DB 50 MB with a lot of table joins

These queries are being run with nothing else running on the server.

My guess is that the planner is picking an inefficient plan sometimes.

I have not looked into the details of the EXPLAIN output, but I can see
that very different plans are being chosen.

How can I see the plan that was chosen when the time is very high?
Is there a way to print the query plan chosen when the query times out
as well?





Re: Query sometimes takes down server

From: Jeff Davis
On Thu, 2009-01-15 at 18:44 -0600, Jason Long wrote:
> The query that hangs the system is requesting a count(*)
> based on some parameters the user selects.

Can you show an example of the full offending query? How big is the
table?

> And maybe 1 in 20 will not complete.

If you really have nothing else running on the server, the queries
should complete.

> I have not looked into the details of the EXPLAIN output, but I can see
> that very different plans are being chosen.

It would help to share these plans with us...

Run it several times using EXPLAIN ANALYZE, and send the output of one
that completes quickly, one that takes a while, and preferably wait for
a long-running one to finish. For those that you can't wait for, show
plain EXPLAIN output.
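
For example, something like this in psql (just a sketch; the table name
and WHERE clause are placeholders, substitute your actual generated
query):

SET statement_timeout = 0;   -- don't cut the slow runs short while measuring
EXPLAIN ANALYZE
SELECT count(*)
FROM inventory_item
WHERE status = 'IN_STOCK';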

> How can I see the plan that was chosen when the time is very high?
> Is there a way to print the query plan chosen when the query times out
> as well?
>

There's a new module coming in 8.4 that makes this much easier. Right
now, you have to run it yourself with EXPLAIN ANALYZE as above.
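
(If I recall correctly, that module is contrib/auto_explain. With it
loaded, something along these lines makes the server log the plan of any
statement slower than a threshold; the '5s' here is just an example
value, and LOAD requires superuser privileges:

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';
)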

Regards,
    Jeff Davis


Re: Query sometimes takes down server

From: "marcin mank"
> I have a dynamically built query that will periodically (2 times a day, and
> becoming more frequent) make my server totally unresponsive.

Does this query involve more than geqo_threshold (default 12) tables?
If so, this is most probably GEQO (the genetic query optimizer) kicking
in. Try fiddling with some of the GEQO parameters (raising geqo_effort
is the first choice).
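
A quick per-session experiment could look like this (geqo_effort
defaults to 5, range 1-10):

SHOW geqo_threshold;
SET geqo_effort = 10;

then re-run the problem query a number of times in the same session.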

Greetings
Marcin

Re: Query sometimes takes down server

From: Jeff Davis
On Thu, 2009-01-15 at 19:37 -0600, Jason Long wrote:
> >
> > > I have not looked into the details of the EXPLAIN output, but I can see
> > > that very different plans are being chosen.
> > >
> >
> > It would help to share these plans with us...
> >
> See EXPLAIN ANALYZE below for three different plans @ .5, 6, and 84
> seconds

Well, it's clear why you're getting a variety of plans. With that many
tables in the FROM list there are a huge number of possible plans, so
PostgreSQL uses a genetic algorithm (GEQO) to search for a good one, and
that search is driven by random numbers, which is why the chosen plan
varies from run to run.

My only advice would be to step back for a minute and see if there might
be a better way to solve the original problem. Typically, numbers in
table and field names are a sign something could be better designed.

I may have some better advice if you say why you are joining so many
tables, and what the numbers are for.

Regards,
    Jeff Davis


Re: Query sometimes takes down server

From: Jeff Davis
On Fri, 2009-01-16 at 08:43 -0600, Jason Long wrote:
> The numbers in the table names are due to hibernate generating the
> query.

Well, that's what auto-generated schemas and queries do, I guess.

> Now we are getting somewhere.
> Someone suggested tweaking the genetic algorithm parameters.
> Has anyone else had to do this and what results did you achieve?
> Can someone offer me some detailed advice on tweaking these
> parameters?

There are a lot of tables, so no matter what you do, the query will
require GEQO (the genetic algorithm I was talking about).

The fact that some of the plans are fast is good news: it means that
it's possible to execute the query quickly.

The other good news is that the slower plans are, indeed, estimated to
be slower in the examples you provided (not by exactly proportional
amounts, but it's still a good sign). If the estimations are so far off
that they are basically random, GEQO won't help much; but in your case
they look surprisingly good.

I would try increasing geqo_effort, and tweaking geqo_pool_size and
geqo_generations (mostly try increasing these last two, but smaller
values might be useful), and tweak geqo_selection_bias randomly between
1.5 and 2.
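
For instance (just a sketch; the exact values are something you will
have to experiment with):

SET geqo_effort = 10;            -- range 1-10, default 5
SET geqo_pool_size = 1000;       -- 0 selects a default based on effort
SET geqo_generations = 1000;     -- 0 selects a default based on effort
SET geqo_selection_bias = 1.8;   -- range 1.5-2.0

SET only affects the current session, which makes it convenient to test
values before putting them in postgresql.conf.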

See useful ranges of the parameters here:
http://www.postgresql.org/docs/8.3/static/runtime-config-query.html

When you start to get stable execution times (make sure you don't just
get lucky once), keep the values you're using. Post to the list with
your results.

You may be able to fix some of your queries (like this one), but I
suspect this will just make the problem more rare. When you come up with
some new query later, I think the problem will come back. The solution
is really to have a more reasonable schema, something that PostgreSQL
(and humans) can understand well enough to optimize.

Regards,
    Jeff Davis


Re: Query sometimes takes down server

From: Martin Gainty
Good idea, although tweaks to geqo_pool_size, geqo_generations, and geqo_selection_bias will affect all queries.

For larger, unwieldy queries you might want to look at breaking the query down into smaller pieces, e.g.
break each statement down to 2 tables with 1 join (preferably an inner join with USING), and of course index the columns used in the join.

If for any reason the results become unwieldy (the plan shows cartesian joins, full table scans, or a dreaded sort-merge where nested loops would do), you may want to consider creating a temp table to hold the results for the current session for each join.
Alternatively, create a view which contains the results in read-only form, possibly using a materialized view to allow indexes and refresh capability.
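
A rough sketch of the temp table idea (the table and column names here
are made up for illustration, not taken from the actual schema):

CREATE TEMP TABLE item_vendor AS
SELECT i.id, i.status, v.name
FROM item i
JOIN vendor v USING (vendor_id);

CREATE INDEX item_vendor_status_idx ON item_vendor (status);

SELECT count(*) FROM item_vendor WHERE status = 'IN_STOCK';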

HTH
Martin

Re: Query sometimes takes down server

From: Jason Long
Jeff Davis wrote:
> On Fri, 2009-01-16 at 08:43 -0600, Jason Long wrote:
> > The numbers in the table names are due to hibernate generating the
> > query.
>
> Well, that's what auto-generated schemas and queries do, I guess.

The schema is not auto generated.  It evolved as I created my inventory
system.  It is relatively easy for humans to understand.  Or at least for
me, since I wrote it.

> > Now we are getting somewhere.
> > Someone suggested tweaking the genetic algorithm parameters.
> > Has anyone else had to do this and what results did you achieve?
> > Can someone offer me some detailed advice on tweaking these
> > parameters?
>
> There are a lot of tables, so no matter what you do, the query will
> require GEQO (the genetic algorithm I was talking about).

I am familiar with global optimization.  It was part of my research for
my master's degree.

> The fact that some of the plans are fast is good news: it means that
> it's possible to execute the query quickly.
>
> The other good news is that the slower plans are, indeed, estimated to
> be slower in the examples you provided (not by exactly proportional
> amounts, but it's still a good sign). If the estimations are so far off
> that they are basically random, GEQO won't help much; but in your case
> they look surprisingly good.
>
> I would try increasing geqo_effort, and tweaking geqo_pool_size and
> geqo_generations (mostly try increasing these last two, but smaller
> values might be useful), and tweak geqo_selection_bias randomly between
> 1.5 and 2.

I raised geqo_effort to 10 and this made the problem happen much less
frequently, but it still happens fairly often.
Thanks for the advice.  I will post my results if I achieve improvement.

> See useful ranges of the parameters here:
> http://www.postgresql.org/docs/8.3/static/runtime-config-query.html
>
> When you start to get stable execution times (make sure you don't just
> get lucky once), keep the values you're using. Post to the list with
> your results.

I think I am going to write a script and run the query enough times to
see some statistics on how my tuning of the parameters works.
Would anyone have such a script already made?

Maybe I should use a genetic algorithm to analyze all the possible
combinations of GEQO parameters for my case. :)
But realistically I am slammed with work, and while this is very
interesting to me, I will have to keep cranking out new features to keep
the clients happy.
Hopefully I can find a better set of parameters through trial and error.

> You may be able to fix some of your queries (like this one), but I
> suspect this will just make the problem more rare. When you come up with
> some new query later, I think the problem will come back. The solution
> is really to have a more reasonable schema, something that PostgreSQL
> (and humans) can understand well enough to optimize.

Making them more rare is probably good enough for now.  Thank you very
much for the advice.

> Regards,
>     Jeff Davis

Re: Query sometimes takes down server

From: Jeff Davis
On Fri, 2009-01-16 at 12:35 -0600, Jason Long wrote:
> The schema is not auto generated.  It evolved as I created my
> inventory system.
> It is relatively easy for humans to understand.  Or at least for me
> since I wrote it.

On second look, there aren't that many tables. There are just a lot of
self joins. Those are going to be hard for PostgreSQL to optimize
effectively.

Looks like it's just the query that's hard to understand. You might do
better just writing it yourself from scratch.
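
(For example, Hibernate tends to generate SQL along these hypothetical
lines:

SELECT count(*)
FROM item t0
JOIN item t1 ON t1.parent_id = t0.id
JOIN item t2 ON t2.parent_id = t1.id
WHERE t0.status = 'IN_STOCK';

and a query like that can often be rewritten by hand with fewer self
joins once you know what question it is really asking.)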

Regards,
    Jeff Davis


Re: Query sometimes takes down server

From: Jeff Davis
On Fri, 2009-01-16 at 13:37 -0500, Martin Gainty wrote:
> good idea although tweaks to geqo_pool_size, geqo_generations, and
> geqo_selection_bias will affect all queries

Only queries that invoke GEQO.

Regards,
    Jeff Davis


Re: Query sometimes takes down server

From: "Scott Marlowe"
On Thu, Jan 15, 2009 at 11:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2009-01-15 at 19:37 -0600, Jason Long wrote:
>> >
>> > > I have not looked into the details of the EXPLAIN output, but I can see
>> > > that very different plans are being chosen.
>> > >
>> >
>> > It would help to share these plans with us...
>> >
>> See EXPLAIN ANALYZE below for three different plans @ .5, 6, and 84
>> seconds

Jason, note that the post with explain analyze didn't make it to the
mailing list, so you're effectively leaving everyone else who could
give you advice out of the conversation.

Re: Query sometimes takes down server

From: "Scott Marlowe"
Weird.  I wonder if the attachment is too big and the mailing list
server is chopping it off of the email.

On Fri, Jan 16, 2009 at 1:19 PM, Jason Long
<mailing.list@supernovasoftware.com> wrote:
> Scott Marlowe wrote:
>> Jason, note that the post with explain analyze didn't make it to the
>> mailing list, so you're effectively leaving everyone else who could
>> give you advice out of the conversation.
>
> I just sent it again.  I wonder why it did not make it.  Hopefully this one
> will.  Please let me know if this one does not either.



--
When fascism comes to America, it will be draped in a flag and
carrying a cross - Sinclair Lewis

Re: Query sometimes takes down server

From: Jason Long
Scott Marlowe wrote:
> Weird.  I wonder if the attachment is too big and the mailing list
> server is chopping it off of the email.
>
I just tried it by sending text only instead of text and html.  We will
see if it goes through this time.
Other than that do you see anything weird about my email?

Re: Query sometimes takes down server

From: Jason Long
Scott Marlowe wrote:
> On Thu, Jan 15, 2009 at 11:04 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> On Thu, 2009-01-15 at 19:37 -0600, Jason Long wrote:
>>> I have not looked into the details of the EXPLAIN output, but I can see
>>> that very different plans are being chosen.
>>
>> It would help to share these plans with us...
>
>> See EXPLAIN ANALYZE below for three different plans @ .5, 6, and 84
>> seconds
>
> Jason, note that the post with explain analyze didn't make it to the
> mailing list, so you're effectively leaving everyone else who could
> give you advice out of the conversation.

I just sent it again.  I wonder why it did not make it.  Hopefully this
one will.  Please let me know if this one does not either.

Re: Query sometimes takes down server

From: "David Wilson"
On Fri, Jan 16, 2009 at 3:27 PM, Jason Long
<mailing.list@supernovasoftware.com> wrote:

> I just tried it by sending text only instead of text and html.  We will see
> if it goes through this time.
> Other than that do you see anything weird about my email?

Still nothing. Do you have webspace you could place it on? If not, you
could use http://explain-analyze.info and simply provide the list with
links to the plans.

--
- David T. Wilson
david.t.wilson@gmail.com

Re: Query sometimes takes down server

From: Jason Long
David Wilson wrote:
> On Fri, Jan 16, 2009 at 3:27 PM, Jason Long
> <mailing.list@supernovasoftware.com> wrote:
>
>> I just tried it by sending text only instead of text and html.  We will see
>> if it goes through this time.
>> Other than that do you see anything weird about my email?
>
> Still nothing. Do you have webspace you could place it on? If not, you
> could use http://explain-analyze.info and simply provide the list with
> links to the plans.
Actually I tweaked the GEQO parameters as follows:

# - Genetic Query Optimizer -
geqo = on
geqo_threshold = 12
geqo_effort = 10                # range 1-10
geqo_pool_size = 1000           # 0 selects default based on effort
geqo_generations = 1000         # 0 selects default based on effort
geqo_selection_bias = 2.0       # range 1.5-2.0

I use the following script to test the settings.

#!/bin/bash
# Run the query 1000 times, keeping only the output lines that contain
# "runtime" (the timing lines reported by the query file).
for ((i=1;i<=1000;i++)); do
    psql -d pipetracker-dev -f /opt/main_list_count.sql | grep runtime >> /var/lib/pgsql/test.txt
done

The query was able to run 1000 times without even getting close to the 2 minute timeout I have set.
Here are some statistics I have from the runs.

Out of 1000 runs, only these took more than 3 seconds:

17377.86 ms
15040.32 ms
12343.12 ms
11915.26 ms
11409.88 ms
 9719.72 ms
 8535.49 ms
 5531.80 ms
 5286.90 ms
 4920.17 ms
 3849.40 ms

Avg     468.74 ms
Min     173.30 ms
Max   17377.86 ms
STDEV  1102.35 ms

I then put the settings back to the defaults and got a timeout after 45 tries.
These are the results greater than 3 seconds from those 45:

114450.17 ms
 79529.79 ms
 40999.69 ms
 28981.87 ms
 25802.51 ms
 13346.41 ms
  9569.23 ms
  9267.75 ms
  8221.40 ms
  8213.87 ms
  5975.05 ms

Thank you everyone for your help.

Has anyone developed some sort of testing script that is much better than my simple bash script shown above?

I would like to just let some queries run, have the script alter some
parameters in postgresql.conf, and run again.
After running for a while it could tune these parameters based on the
queries I have it run.
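
Something like this is what I have in mind (just a sketch; the parameter
values and file paths are made up, and PGOPTIONS works here because the
GEQO settings can be changed per session):

#!/bin/bash
# Sweep a few GEQO settings and record the runtimes for each combination.
for effort in 5 8 10; do
    for pool in 0 500 1000; do
        for ((i=1;i<=100;i++)); do
            PGOPTIONS="-c geqo_effort=$effort -c geqo_pool_size=$pool" \
                psql -d pipetracker-dev -f /opt/main_list_count.sql \
                | grep runtime >> /var/lib/pgsql/geqo_${effort}_${pool}.txt
        done
    done
done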

Thank you everyone for your advice.  My users will be much happier. :)

Can anyone see any problems with the settings I have and how they might affect my application as a whole?