Thread: Long running queries degrade performance

Long running queries degrade performance

From
Chris Kratz
Date:
Hello all,

My apologies if this is not the right mailing list to ask this question, but
we are wondering about general performance tuning principles for our main db
server.

We have a web app with a postgres backend.  Most queries have subsecond
response times through the web even with high usage.  Every once in a while
someone will run either an ad-hoc query or some other long-running db
process.  For some reason, it seems that a small number (3-4) of these jobs
running in parallel absolutely floors our server.  In monitoring the jobs,
linux (Kernel 2.4) drops the long-running jobs' priority, but even so they
seem to hog the system resources, making subsequent requests for everyone else
very slow.  Our database at this point is almost entirely processor and
memory bound, because it is still small enough that most of the working data
fits into memory.  There is generally little disk activity when this occurs.

These long-running processes are almost always complex select statements, not
generally inserts or updates.  We continue to monitor and rework the
bottlenecks, but what is a little scary to us is how easily the database
becomes almost completely unresponsive with several large jobs running,
especially since we have a large number of users.  It only takes one user
clicking multiple times on a page backed by one of these selects, because it
doesn't come back quickly, to bring our system to its knees for hours.

We are looking to move to Kernel 2.6 and possibly a dedicated multiprocessor
machine for postgres towards the end of this year.  But I am wondering if
there is anything we can do now to increase interactive performance while
long-running selects are running as well.  Are there ways to adjust the
priority of backend processes, or things to tweak to maximize interactive
throughput for the quick jobs while the long-running ones run in the
background?  Or, if worst comes to worst, a way to actually kill long-running
processes without taking down the whole db, as we have had to do on occasion.

Our server is a modest 2.4Ghz P4 with mirrored UW SCSI drives and 1G of
memory.  The db on disk is around 800M and this machine also hosts our web
app, so there is some contention for the processor.

Does anyone have any suggestions or thoughts on things we could look at?  Is a
multiprocessor box the only answer, or are there other things we should be
looking at hardware-wise?  Thank you for your time.
--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com

Re: Long running queries degrade performance

From
Mike Nolan
Date:
> We have a web app with a postgres backend.  Most queries have subsecond
> response times through the web even with high usage.  Every once in awhile
> someone will run either an ad-hoc query or some other long running db
> process.

Are you sure it is postgres where the delay is occurring?  I ask this
because I also have a web-based front end to postgres, and while most of
the time the queries respond in about a second, every now and then I see
one that takes much longer, sometimes 10-15 seconds.

I've seen this behavior on both my development system and on the
production server.

The same query a while later might respond quickly again.

I'm not sure where to look for the delay, either, and it is intermittent
enough that I'm not even sure what monitoring techniques to use.
--
Mike Nolan

Re: Long running queries degrade performance

From
Chris Kratz
Date:
Fairly sure.  When it is happening, postgres is usually taking up the top slots
for CPU usage as reported by top.  Perhaps there is a better way to monitor
this?
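
For what it's worth, one way to look at it from inside the database rather
than from top (this assumes the stats collector is on, i.e.
stats_command_string = true in postgresql.conf, or current_query shows up
empty) is to peek at pg_stat_activity:

```sql
-- One row per backend.  procpid is the backend's OS process id, which is
-- what shows up in top; current_query needs stats_command_string enabled.
SELECT procpid, usename, current_query
FROM pg_stat_activity;
```

That at least tells you which backend (and whose query) owns the process
that top says is chewing the CPU.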

The other thing for us is that others talk about disks being the bottleneck,
whereas for us it is almost always the processor.  I expected the drives to
kill us early on (we have two UW SCSI mirrored drives), but there is very
little disk activity.  The disks rarely do much during load for us (at this
point).  Most likely this is related more to data volume.

As for your case, is there a lot of disk activity happening?  More
likely you have a situation where something else is happening which blocks
the current thread.  We ran into two situations recently which exhibited this
behavior.  One was adding and dropping tables in a transaction, which blocks
any other transaction trying to do the same.  The other was two threads
inserting records with the same primary key value, which blocks the second
till the first finishes.  Both of these were triggered by users double-clicking
links in our web app and were fixed by a better implementation.  Perhaps
something like that is causing what you are seeing.

-Chris

On Friday 16 April 2004 11:46 am, Mike Nolan wrote:
> > We have a web app with a postgres backend.  Most queries have subsecond
> > response times through the web even with high usage.  Every once in
> > awhile someone will run either an ad-hoc query or some other long running
> > db process.
>
> Are you sure it is postgres where the delay is occurring?  I ask this
> because I also have a web-based front end to postgres, and while most of
> the time the queries respond in about a second every now and then I see
> one that takes much longer, sometimes 10-15 seconds.
>
> I've seen this behavior on both my development system and on the
> production server.
>
> The same query a while later might respond quickly again.
>
> I'm not sure where to look for the delay, either, and it is intermittent
> enough that I'm not even sure what monitoring techniques to use.
> --
> Mike Nolan

--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com


Re: Long running queries degrade performance

From
Mike Nolan
Date:
> Fairly sure, when it is happening, postgres usually is taking up the top slots
> for cpu usage as reported by top.  Perhaps there is a better way to monitor
> this?

Given the intermittent nature of the problem and its relative brevity
(5-10 seconds), I don't know whether top offers the granularity needed to
locate the bottleneck.

> likely you have a situation where something else is happening which blocks
> the current thread.

It happens on my development system, and I'm the only one on it.  I know
I've seen it on the production server, but I think it is a bit more
common on the development server, though that may simply reflect which system
I spend the most time on.  (Also, the production server is 1300 miles away
over a DSL connection, so I may just be seeing network delays some of
the time there.)

> Both of these were triggered by users double clicking links in our
> web app and were fixed by a better implementation.  Perhaps something like
> that is causing what you are seeing.

My web app traps double-clicks in javascript and ignores all but the first one.
That's because some of the users have mice that give double-clicks even
when they only want one click.
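
The implementation isn't shown here, but a minimal sketch of that kind of
guard (names are made up; the real app may do it differently, e.g. by
disabling the submit button) could look like:

```javascript
// Hypothetical double-click guard: lets the first click through, then
// swallows any further clicks that arrive within the given window.
function makeClickGuard(windowMs) {
  var last = 0;
  return function () {
    var now = Date.now();
    if (now - last < windowMs) {
      return false; // duplicate click: ignore it
    }
    last = now;
    return true; // first click: let it through
  };
}
```

Wiring the onclick handler to return the guard's result keeps the browser
from firing the duplicate request at the server.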
--
Mike Nolan

Re: Long running queries degrade performance

From
Chris Kratz
Date:
On Friday 16 April 2004 4:25 pm, Mike Nolan wrote:
> Given the intermittent nature of the problem and its relative brevity
> (5-10 seconds), I don't know whether top offers the granularity needed to
> locate the bottleneck.

Our long-running processes run on the order of multiple minutes (sometimes
over an hour), and that's expected because the sql can be quite complex over
somewhat large datasets.  But it's the bringing of the server to its knees
that I'm trying to figure out how to address, if we can.  In other words, let
those long-running processes run, but somehow still get decent performance
for "quick" requests.

Yours reminds me of what used to happen in our apps back when I worked in Java
and the garbage collector kicked in.  Suddenly everything would stop for
10-15s and then continue on.  Sort of makes you think the app has frozen for
some reason.

> It happens on my development system, and I'm the only one on it.  I know
> I've seen it on the production server, but I think it is a bit more
> common on the development server, though that may be a case of which system
> I spend the most time on.  (Also, the production server is 1300 miles away
> with a DSL connection, so I may just be seeing network delays some of
> the time there.)

Interesting.  Have you tried running a processor monitor to see whether you
get a cpu or disk spike when you get the blips?  Postgres has been pretty
constant for us in its average runtime for any particular query.  We do get
some fluctuation, but I've always attributed that to other things happening
in the background.  I sometimes run gkrellm off the server just to "see"
what's happening on a macro scale.  It's a great early indicator when we are
getting slammed one way or another (network, memory, processor, disk, etc.).
Plus it shows a couple of seconds of history, so you can see blips pretty
easily.

> My web app traps double-clicks in javascript and ignores all but the first
> one. That's because some of the users have mice that give double-clicks
> even when they only want one click.

Hmmm, never thought of doing that.  Might be interesting to do something like
that in a few key places where we have problems.

> --
> Mike Nolan

--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com

Re: Long running queries degrade performance

From
Tom Lane
Date:
Chris Kratz <chris.kratz@vistashare.com> writes:
> ... Or if worse comes to worse to actually kill long running
> processes without taking down the whole db as we have had to do on occasion.

A quick "kill -INT" suffices to issue a query cancel, which I think is
what you want here.  You could also consider putting an upper limit on
how long things can run by means of statement_timeout.
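
For example, something along these lines (the ten-minute ceiling is only
illustrative; pick whatever limit fits your workload):

```sql
-- Cancel any statement in this session that runs longer than 10 minutes.
-- Value is in milliseconds; 0 disables the limit.  It can also be set
-- globally in postgresql.conf instead of per session.
SET statement_timeout = 600000;
```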

Those are just band-aids though.  Not sure about the underlying problem.
Ordinarily I'd guess that the big-hog queries are causing trouble by
evicting everything the other queries need from cache.  But since your
database fits in RAM, that doesn't seem to hold water.

What PG version are you running?

            regards, tom lane

Re: Long running queries degrade performance

From
Chris Kratz
Date:
On Friday 16 April 2004 5:12 pm, Tom Lane wrote:
> Chris Kratz <chris.kratz@vistashare.com> writes:
> > ... Or if worse comes to worse to actually kill long running
> > processes without taking down the whole db as we have had to do on
> > occasion.
>
> A quick "kill -INT" suffices to issue a query cancel, which I think is
> what you want here.  You could also consider putting an upper limit on
> how long things can run by means of statement_timeout.

Wow, that's exactly what I've been looking for.  I thought I had scoured the
manuals, but I must have missed that one.  I need to think about
statement_timeout; that might be a good idea to use as well.

> Those are just band-aids though.  Not sure about the underlying problem.
> Ordinarily I'd guess that the big-hog queries are causing trouble by
> evicting everything the other queries need from cache.  But since your
> database fits in RAM, that doesn't seem to hold water.

That makes some sense; perhaps there is some other cache somewhere that is
causing the problems.  I am doing some tuning and have set the following
items in our postgresql.conf:

shared_buffers = 4096
max_fsm_relations = 1000
max_fsm_pages = 20000
sort_mem = 2048
effective_cache_size = 64000
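
If I have the units right (shared_buffers and effective_cache_size are
counted in 8k disk pages, sort_mem in kB), those values work out to roughly:

```python
# Rough translation of the postgresql.conf values above into bytes,
# assuming the standard 8 kB page size.
PAGE = 8 * 1024
KB = 1024
MB = 1024 * 1024

shared_buffers = 4096 * PAGE          # 32 MB of shared buffer cache
sort_mem = 2048 * KB                  # 2 MB per sort step, per backend
effective_cache_size = 64000 * PAGE   # 500 MB assumed OS file cache

print(shared_buffers // MB, sort_mem // MB, effective_cache_size // MB)
# prints: 32 2 500
```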

I believe these are the only performance-related items we've modified.  One
thing I did today was raise effective_cache_size, since we seem to have about
600M of memory available for file caches.  It used to be set much lower, so
perhaps that was causing some of the problems.

> What PG version are you running?

7.3.4 with grand hopes to move to 7.4 this summer.

>             regards, tom lane

--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com

Re: Long running queries degrade performance

From
Christopher Browne
Date:
A long time ago, in a galaxy far, far away, nolan@gw.tssi.com (Mike Nolan) wrote:
>> We have a web app with a postgres backend.  Most queries have subsecond
>> response times through the web even with high usage.  Every once in awhile
>> someone will run either an ad-hoc query or some other long running db
>> process.
>
> Are you sure it is postgres where the delay is occurring?  I ask this
> because I also have a web-based front end to postgres, and while most of
> the time the queries respond in about a second every now and then I see
> one that takes much longer, sometimes 10-15 seconds.
>
> I've seen this behavior on both my development system and on the
> production server.
>
> The same query a while later might respond quickly again.
>
> I'm not sure where to look for the delay, either, and it is
> intermittent enough that I'm not even sure what monitoring
> techniques to use.

Well, a first thing to do is to see what query plans get set up for
the queries.  If the plans are varying over time, that suggests
something's up with ANALYZEs.
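
Something of this shape, with your own query and table substituted (the
names below are placeholders), is the usual starting point:

```sql
-- Compare the plan (and the actual row counts and timings) between a
-- fast run and a slow run of the same query.
EXPLAIN ANALYZE SELECT count(*) FROM your_table WHERE some_column = 42;

-- If the plans drift over time, refresh the planner's statistics.
ANALYZE your_table;
```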

If the plans look a bit questionable, then you may be encountering the
situation where cache is helping you on the _second_ query but not the
first.  I did some tuning yesterday involving the same sort of
"symptoms," and that turned out to be what was happening.

I'll describe (in vague detail ;-)) what I was seeing.

- The table being queried was a "transaction" table, containing tens of
  thousands of records per day.

- The query was pulling summary information about one or another
  customer's activity on that day.

- The best index we had was on transaction date.

Thus, the query would walk through the "txn date" index, pulling
records into memory, and filtering them against the other selection
criteria.

The table is big, so that data is pretty widely scattered across many
pages.

The _first_ time the query is run, the data is all out on disk, and
there are hundreds-to-thousands of page reads to collect it all.  That
took 10-15 seconds.

The _second_ time it was run (as well as subsequent occasions), those
pages were all in cache, so the query runs in under a second.

What I wound up doing was to add an index on transaction date and
customer ID, so that a query that specifies both criteria will look
just for the few hundred (at most) records relevant to a particular
customer.  That's fast even the first time around.
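
In DDL terms (table and column names here are made up, since the real schema
wasn't shown), the fix was of this shape:

```sql
-- Composite index: the planner can descend straight to the handful of
-- rows for one customer on one day, instead of walking every row for
-- that day and filtering by customer.
CREATE INDEX txn_date_customer_idx ON txn (txn_date, customer_id);
```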

We had a really useful "hook" on this one because the developer
noticed that the first time he queried for a particular day, it was
slow.  We could "repeat" the test easily by just changing to a day
that we hadn't pulled into cache yet.
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://cbbrowne.com/info/lisp.html
Referring to undocumented  private communications allows one to  claim
virtually anything: "we discussed this idea in  our working group last
year, and concluded that it was totally brain-damaged".
-- from the Symbolics Guidelines for Sending Mail