Thread: Long running queries degrade performance
Hello all,

My apologies if this is not the right mailing list to ask this question, but we are wondering about general performance tuning principles for our main db server.

We have a web app with a postgres backend. Most queries have subsecond response times through the web even with high usage. Every once in a while someone will run either an ad-hoc query or some other long-running db process. For some reason, it seems that a small number (3-4) of these jobs running in parallel absolutely floors our server. In monitoring the jobs, linux (Kernel 2.4) drops the long-running jobs' priority, but even so they seem to hog the system resources, making subsequent requests for everyone else very slow.

Our database at this point is almost entirely processor and memory bound because it isn't too large, so most of the working data fits into memory. There is generally little disk activity when this occurs. These long-running processes are almost always complex select statements, not generally inserts or updates.

We continue to monitor and rework the bottlenecks, but what is a little scary to us is how easily the database becomes almost completely unresponsive with several large jobs running, especially since we have a large number of users. It only takes one user trying to view a page with one of these selects, clicking multiple times because it doesn't come back quickly, to bring our system to its knees for hours.

We are looking to move to Kernel 2.6 and possibly a dedicated multiprocessor machine for postgres towards the end of this year. But I am wondering if there is anything we can do now to increase the interactive performance while there are long-running selects running as well. Are there ways to adjust the priority of backend processes, or things to tweak to maximize interactive throughput for the quick jobs while the long-running ones run in the background?
Or if worse comes to worse, to actually kill long running processes without taking down the whole db, as we have had to do on occasion.

Our server is a modest 2.4GHz P4 with mirrored UW SCSI drives and 1G of memory. The db on disk is around 800M, and this machine also hosts our web app, so there is some contention for the processor.

Does anyone have any suggestions or thoughts on things we could look at? Is a multiprocessor box the only answer, or are there other things we should be looking at hardware-wise?

Thank you for your time.

--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com
> We have a web app with a postgres backend. Most queries have subsecond
> response times through the web even with high usage. Every once in awhile
> someone will run either an ad-hoc query or some other long running db
> process.

Are you sure it is postgres where the delay is occurring? I ask this because I also have a web-based front end to postgres, and while most of the time the queries respond in about a second, every now and then I see one that takes much longer, sometimes 10-15 seconds.

I've seen this behavior on both my development system and on the production server. The same query a while later might respond quickly again.

I'm not sure where to look for the delay, either, and it is intermittent enough that I'm not even sure what monitoring techniques to use.
--
Mike Nolan
Fairly sure. When it is happening, postgres is usually taking up the top slots for cpu usage as reported by top. Perhaps there is a better way to monitor this?

The other thing for us is that others talk about disks being the bottleneck, whereas for us it is almost always the processor. I expected the drives to kill us early on (we have two uw scsi mirrored drives), but there is very little disk activity. The disks rarely do much during load for us (at this point). Most likely this is related more to data volume at this point.

As far as your case goes, is there a lot of disk activity happening? More likely you have a situation where something else is happening which blocks the current thread. We ran into two situations recently which exhibited this behavior. One was adding and dropping tables in a transaction, which blocks any other transaction trying to do the same. The other was two threads inserting records with the same primary key value, which blocks the second until the first finishes. Both of these were triggered by users double clicking links in our web app and were fixed by a better implementation. Perhaps something like that is causing what you are seeing.

-Chris

On Friday 16 April 2004 11:46 am, Mike Nolan wrote:
> > We have a web app with a postgres backend. Most queries have subsecond
> > response times through the web even with high usage. Every once in
> > awhile someone will run either an ad-hoc query or some other long running
> > db process.
>
> Are you sure it is postgres where the delay is occurring? I ask this
> because I also have a web-based front end to postgres, and while most of
> the time the queries respond in about a second every now and then I see
> one that takes much longer, sometimes 10-15 seconds.
>
> I've seen this behavior on both my development system and on the
> production server.
>
> The same query a while later might respond quickly again.
>
> I'm not sure where to look for the delay, either, and it is intermittent
> enough that I'm not even sure what monitoring techniques to use.
> --
> Mike Nolan
--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com
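[For readers hitting similar mystery stalls: the two blocking situations Chris describes can be reproduced from two psql sessions. This is a hypothetical sketch; the table and column names are invented for illustration.]

```sql
-- Session 1: DDL inside an open transaction holds its locks
-- until COMMIT or ROLLBACK.
BEGIN;
CREATE TABLE scratch_report (id integer PRIMARY KEY);
-- Session 2 now blocks if it tries the same CREATE or a DROP
-- of this table until session 1 ends its transaction.

-- Session 1: an uncommitted insert of a given primary key value...
BEGIN;
INSERT INTO orders (order_id) VALUES (42);
-- ...makes a second session's
--   INSERT INTO orders (order_id) VALUES (42);
-- wait on the unique-index check until session 1 commits
-- (second insert fails) or rolls back (second insert succeeds).
```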
> Fairly sure, when it is happening, postgres usually is taking up the top slots
> for cpu usage as reported by top. Perhaps there is a better way to monitor
> this?

Given the intermittent nature of the problem and its relative brevity (5-10 seconds), I don't know whether top offers the granularity needed to locate the bottleneck.

> likely you have a situation where something else is happening which blocks
> the current thread.

It happens on my development system, and I'm the only one on it. I know I've seen it on the production server, but I think it is a bit more common on the development server, though that may be a case of which system I spend the most time on. (Also, the production server is 1300 miles away with a DSL connection, so I may just be seeing network delays some of the time there.)

> Both of these were triggered by users double clicking links in our
> web app and were fixed by a better implementation. Perhaps something like
> that is causing what you are seeing.

My web app traps double-clicks in javascript and ignores all but the first one. That's because some of the users have mice that give double-clicks even when they only want one click.
--
Mike Nolan
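[A double-click trap like the one Mike describes might look something like this. This is a hypothetical sketch, not his actual code, and the one-second cooldown value is an assumption.]

```javascript
// Returns a guard that swallows clicks arriving within cooldownMs of the
// previously accepted click (e.g. from a chattering mouse or an impatient
// user hammering a slow link).
function makeClickGuard(cooldownMs) {
  let lastAccepted = -Infinity;
  return function allowClick(nowMs) {
    if (nowMs - lastAccepted < cooldownMs) {
      return false;           // too soon after the last accepted click: ignore
    }
    lastAccepted = nowMs;     // accept the click and start a new cooldown window
    return true;
  };
}

// Wiring it to a form, roughly:
//   const guard = makeClickGuard(1000);
//   form.onsubmit = () => guard(Date.now());
```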
On Friday 16 April 2004 4:25 pm, Mike Nolan wrote:
> Given the intermittent nature of the problem and its relative brevity
> (5-10 seconds), I don't know whether top offers the granularity needed to
> locate the bottleneck.

Our long-running processes run on the order of multiple minutes (sometimes for over an hour), and that's expected because the sql can be quite complex over somewhat large datasets. But it's the bringing of the server to its knees that I'm trying to figure out how to address, if we can. In other words, let those long-running processes run, but somehow still get decent performance for "quick" requests.

Yours reminds me of what used to happen in our apps back when I worked in java and the garbage collector kicked in. Suddenly everything would stop for 10-15s and then continue on. Sort of makes you think the app froze for some reason.

> It happens on my development system, and I'm the only one on it. I know
> I've seen it on the production server, but I think it is a bit more
> common on the development server, though that may be a case of which system
> I spend the most time on. (Also, the production server is 1300 miles away
> with a DSL connection, so I may just be seeing network delays some of
> the time there.)

Interesting. Have you tried running a processor monitor and seeing if you are getting a cpu or disk spike when you get the blips? Postgres has been pretty constant for us in its average runtime for any particular query. We do get some fluctuation, but I've always attributed that to other things happening in the background. I sometimes run gkrellm off the server just to "see" what's happening on a macro scale. It's a great early indicator when we are getting slammed one way or another (network, memory, processor, disk, etc). Plus it shows a couple of seconds of history, so you can see blips pretty easily.

> My web app traps double-clicks in javascript and ignores all but the first
> one.
> That's because some of the users have mice that give double-clicks
> even when they only want one click.

Hmmm, never thought of doing that. Might be interesting to do something like that in a few key places where we have problems.

> --
> Mike Nolan
--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com
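[For catching short spikes from the command line when a graphical monitor like gkrellm isn't available, a rough sampling loop along these lines can help; the intervals and counts here are arbitrary choices.]

```shell
# Sample run queue, CPU, and I/O-wait once a second for a few seconds;
# a jump in "r" (runnable) or "wa" (I/O wait) during a blip narrows
# the bottleneck down to CPU contention vs. disk.
vmstat 1 3

# One batch-mode snapshot of the top CPU consumers, suitable for
# appending to a log file from cron while chasing intermittent stalls.
top -b -n 1 | head -20
```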
Chris Kratz <chris.kratz@vistashare.com> writes:
> ... Or if worse comes to worse to actually kill long running
> processes without taking down the whole db as we have had to do on occasion.

A quick "kill -INT" suffices to issue a query cancel, which I think is what you want here. You could also consider putting an upper limit on how long things can run by means of statement_timeout.

Those are just band-aids though. Not sure about the underlying problem. Ordinarily I'd guess that the big-hog queries are causing trouble by evicting everything the other queries need from cache. But since your database fits in RAM, that doesn't seem to hold water.

What PG version are you running?

			regards, tom lane
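[Concretely, Tom's two suggestions might look like the sketch below. The PID is hypothetical; note that in the 7.3-era pg_stat_activity view the PID column is procpid, and current_query is only populated when stats_command_string is enabled.]

```sql
-- Find the backend running the runaway query:
SELECT procpid, current_query FROM pg_stat_activity;

-- Then, from a shell as root or the postgres user, cancel just that
-- query; SIGINT cancels the current statement without killing the
-- backend or taking down the postmaster:
--   kill -INT 12345

-- Or cap runtimes up front (milliseconds; 0 disables the limit).
-- Here, cancel anything that runs longer than ten minutes:
SET statement_timeout = 600000;
```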
On Friday 16 April 2004 5:12 pm, Tom Lane wrote:
> Chris Kratz <chris.kratz@vistashare.com> writes:
> > ... Or if worse comes to worse to actually kill long running
> > processes without taking down the whole db as we have had to do on
> > occasion.
>
> A quick "kill -INT" suffices to issue a query cancel, which I think is
> what you want here. You could also consider putting an upper limit on
> how long things can run by means of statement_timeout.

Wow, that's exactly what I've been looking for. I thought I had scoured the manuals, but must have missed that one. I need to think about statement_timeout; that might be a good idea to use as well.

> Those are just band-aids though. Not sure about the underlying problem.
> Ordinarily I'd guess that the big-hog queries are causing trouble by
> evicting everything the other queries need from cache. But since your
> database fits in RAM, that doesn't seem to hold water.

That makes some sense; perhaps there is some other cache somewhere that is causing the problems. I am doing some tuning and have set the following items in our postgresql.conf:

shared_buffers = 4096
max_fsm_relations = 1000
max_fsm_pages = 20000
sort_mem = 2048
effective_cache_size = 64000

I believe these are the only performance related items we've modified. One thing I did today was raise effective_cache_size, since we seem to run with about 600M of memory available for file caches. The effective cache size used to be much lower, so perhaps that was causing some of the problems.

> What PG version are you running?

7.3.4 with grand hopes to move to 7.4 this summer.

> 			regards, tom lane
--
Chris Kratz
Systems Analyst/Programmer
VistaShare LLC
www.vistashare.com
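[For reference, several of these settings are counted in pages or kilobytes rather than bytes, so the values above translate roughly as follows; the byte figures are my own arithmetic, assuming the default 8k page size.]

```
shared_buffers = 4096         # 8k pages -> ~32MB of shared buffer cache
max_fsm_relations = 1000      # relations tracked by the free space map
max_fsm_pages = 20000         # pages tracked by the free space map
sort_mem = 2048               # KB per sort; each concurrent sort may use this much
effective_cache_size = 64000  # 8k pages -> ~500MB, matching the ~600M of OS file cache
```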
A long time ago, in a galaxy far, far away, nolan@gw.tssi.com (Mike Nolan) wrote:
>> We have a web app with a postgres backend. Most queries have subsecond
>> response times through the web even with high usage. Every once in awhile
>> someone will run either an ad-hoc query or some other long running db
>> process.
>
> Are you sure it is postgres where the delay is occurring? I ask this
> because I also have a web-based front end to postgres, and while most of
> the time the queries respond in about a second every now and then I see
> one that takes much longer, sometimes 10-15 seconds.
>
> I've seen this behavior on both my development system and on the
> production server.
>
> The same query a while later might respond quickly again.
>
> I'm not sure where to look for the delay, either, and it is
> intermittent enough that I'm not even sure what monitoring
> techniques to use.

Well, a first thing to do is to see what query plans get set up for the queries. If the plans are varying over time, that suggests something's up with ANALYZEs.

If the plans look a bit questionable, then you may be encountering the situation where cache is helping you on the _second_ query but not the first. I did some tuning yesterday involving the same sort of "symptoms," and that turned out to be what was happening. I'll describe (in vague detail ;-)) what I was seeing.

- The table being queried was a "transaction" table, containing tens of thousands of records per day.
- The query was pulling summary information about one or another customer's activity on that day.
- The best index we had was on transaction date.

Thus, the query would walk through the "txn date" index, pulling records into memory and filtering them against the other selection criteria. The table is big, so that data is pretty widely scattered across many pages. The _first_ time the query is run, the data is all out on disk, and there are hundreds-to-thousands of page reads to collect it all. That took 10-15 seconds.
The _second_ time it was run (as well as on subsequent occasions), those pages were all in cache, so the query runs in under a second.

What I wound up doing was to add an index on transaction date and customer ID, so that a query that specifies both criteria will look just for the few hundred (at most) records relevant to a particular customer. That's fast even the first time around.

We had a really useful "hook" on this one because the developer noticed that the first time he queried for a particular day, it was slow. We could "repeat" the test easily by just changing to a day that we hadn't pulled into cache yet.
--
(format nil "~S@~S" "cbbrowne" "acm.org") http://cbbrowne.com/info/lisp.html
Referring to undocumented private communications allows one to claim virtually anything: "we discussed this idea in our working group last year, and concluded that it was totally brain-damaged".
-- from the Symbolics Guidelines for Sending Mail
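[The fix described above, in sketch form; the table and column names here are invented for illustration, not taken from the actual schema.]

```sql
-- Before: only an index on transaction date, so a per-customer query
-- has to pull in every transaction for that day and filter in memory.
CREATE INDEX txns_date_idx ON txns (txn_date);

-- After: a composite index lets the same query touch only the handful
-- of index and heap pages holding that one customer's rows for the day.
CREATE INDEX txns_date_cust_idx ON txns (txn_date, customer_id);

-- Verify the planner actually uses it (and compare first-run timings):
EXPLAIN ANALYZE
SELECT count(*), sum(amount)
  FROM txns
 WHERE txn_date = '2004-04-15' AND customer_id = 1234;
```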