Thread: Queries sometimes take 1000 times the normal time
Hello, We're running a set of Half-Life based game servers that lookup user privileges from a central PostgreSQL 7.3.4 database server (I recently ported the MySQL code in Adminmod to PostgreSQL to be able to do this). The data needed by the game servers are combined from several different tables, so we have some views set up to provide the data in the format needed. Currently there's only a few users in the database for testing purposes, and most of the time the user lookup's take 2-3 ms (I have syslog'ing of queries and duration turned on), but several times per hour the duration for one of the queries is 2-3 seconds (1000 times larger), while the surrounding lookups take the usual 2-3 ms. This is rather critical, as the game server software isn't asynchonous and thus waits for a reply before continuing, so when someone connects, and the user lookup happens to have one of these very long durations, the players on this server experience a major lag spike, which isn't very popular :-( All the game servers and the database server are connected to the same switch, so I don't think, that it is a network problem. So far I've been unable to locate the problem, so any suggestions are very welcome. Regards, Anders K. Pedersen
> Currently there's only a few users in the database for testing purposes, > and most of the time the user lookup's take 2-3 ms (I have syslog'ing of > queries and duration turned on), but several times per hour the duration > for one of the queries is 2-3 seconds (1000 times larger), while the > surrounding lookups take the usual 2-3 ms. Are there any other jobs running at the time of these excessive queries?
Attachment
"Anders K. Pedersen" <akp@cohaesio.com> writes: > Currently there's only a few users in the database for testing purposes, > and most of the time the user lookup's take 2-3 ms (I have syslog'ing of > queries and duration turned on), but several times per hour the duration > for one of the queries is 2-3 seconds (1000 times larger), while the > surrounding lookups take the usual 2-3 ms. One thing that comes to mind is that the slow query could be occurring at the same time as a checkpoint, or some other cycle-chewing background operation. It's not clear why a checkpoint would slow things down that much, though. Anyway I'd suggest looking for such activities; once we know if that's the issue or not, we can make some progress. regards, tom lane
On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote: > Hello, > > We're running a set of Half-Life based game servers that lookup user > privileges from a central PostgreSQL 7.3.4 database server (I recently > ported the MySQL code in Adminmod to PostgreSQL to be able to do this). > > The data needed by the game servers are combined from several different > tables, so we have some views set up to provide the data in the format > needed. > > Currently there's only a few users in the database for testing purposes, > and most of the time the user lookup's take 2-3 ms (I have syslog'ing of > queries and duration turned on), but several times per hour the duration > for one of the queries is 2-3 seconds (1000 times larger), while the > surrounding lookups take the usual 2-3 ms. Check vmstat during the same period if it is syncing at that point as Tom suggested. Are you using pooled connections? If yes you could shorten life of a connection and force making a new connection every 10-15 minutes say. That would avoid IO avelanche at the end of the hour types. HTH. Bye Shridhar -- ignorance, n.: When you don't know anything, and someone else finds out.
We have a somewhat similar situation - we're running a fairly constant, but low priority, background load of about 70 selects and 40 inserts per second (batched into fairly large transactions), and on top of that we're trying to run time-sensitive queries for a web site (well two). I should emphasize that this is low low priority - if a query is delayed by an hour here, it doesn't matter. The web site queries will jump up one or two orders of magnitude (I have seen a normally 100ms query take in excess of 30 seconds) in duration at seemingly random points. It's not always when the transactions are committing, and it doesn't seem to be during checkpointing either. The same thing happens with WAL switched off. It appears to happen the first time the query runs after a while. If I run the same query immediately afterwards, it will take the normal amount of time. Any ideas? Cheers, Russ Garrett pgsql-performance-owner@postgresql.org wrote: > Subject: [PERFORM] Queries sometimes take 1000 times the normal time > > > Hello, > > We're running a set of Half-Life based game servers that lookup user > privileges from a central PostgreSQL 7.3.4 database server (I recently > ported the MySQL code in Adminmod to PostgreSQL to be able to do > this). > > The data needed by the game servers are combined from several > different tables, so we have some views set up to provide the data in > the format needed. > > Currently there's only a few users in the database for testing > purposes, and most of the time the user lookup's take 2-3 ms (I have > syslog'ing of queries and duration turned on), but several times per > hour the duration for one of the queries is 2-3 seconds (1000 times > larger), while the surrounding lookups take the usual 2-3 ms. > > This is rather critical, as the game server software isn't asynchonous > and thus waits for a reply before continuing, so when someone > connects, and the user lookup happens to have one of these very long > durations, the players on this server experience a major lag spike, > which isn't very popular :-( > > All the game servers and the database server are connected to the same > switch, so I don't think, that it is a network problem. > > So far I've been unable to locate the problem, so any suggestions are > very welcome. > > Regards, > Anders K. Pedersen
On 28 Aug 2003 at 10:02, Russell Garrett wrote: > The web site queries will jump up one or two orders of magnitude (I have > seen a normally 100ms query take in excess of 30 seconds) in duration at > seemingly random points. It's not always when the transactions are > committing, and it doesn't seem to be during checkpointing either. The same > thing happens with WAL switched off. It appears to happen the first time the > query runs after a while. If I run the same query immediately afterwards, it > will take the normal amount of time. Looks like it got flushed out of every type of cache and IO scheduler could not deliver immediately because of other loads... Bye Shridhar -- Abstainer, n.: A weak person who yields to the temptation of denying himself a pleasure. -- Ambrose Bierce, "The Devil's Dictionary"
>> The web site queries will jump up one or two orders of magnitude (I >> have seen a normally 100ms query take in excess of 30 seconds) in >> duration at seemingly random points. It's not always when the >> transactions are committing, and it doesn't seem to be during >> checkpointing either. The same thing happens with WAL switched off. >> It appears to happen the first time the query runs after a while. If >> I run the same query immediately afterwards, it will take the normal >> amount of time. > > Looks like it got flushed out of every type of cache and IO scheduler > could not deliver immediately because of other loads... Yeah, I wasn't sure what (or how) Postgres caches. The db server does have 2Gb of memory, but then again the database amounts to more than 2Gb, so it's fairly possible it's getting pushed out of cache. It's also fairly possible that it's not tuned completely optimally. I wonder if FreeBSD/kernel 2.6 would perform better in such a situation... Russ
Rod Taylor wrote: >>Currently there's only a few users in the database for testing purposes, >>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of >>queries and duration turned on), but several times per hour the duration >>for one of the queries is 2-3 seconds (1000 times larger), while the >>surrounding lookups take the usual 2-3 ms. > > Are there any other jobs running at the time of these excessive queries? I don't know if you're referring to jobs inside the PostgreSQL database or just jobs on the server, but I'm pretty sure that nothing major is going on inside the database - the only other job using it is doing an insert whenever one of our game admins executes an administrative command (like ban or kick), but this doesn't happen all that often, and according the PostgreSQL log isn't happening at the same times as the long queries. With regards to other jobs on the server, there is a MySQL server on it as well, which from time to time has some multi-second queries generated from a webserver also on this host, but the MySQL is running with nice 10 (PostgreSQL isn't nice'd). Someone else asked about vmstat results, and I've been running this for a while now, and I will report the results shortly. Regards, Anders K. Pedersen
Shridhar Daithankar wrote: > On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote: >>We're running a set of Half-Life based game servers that lookup user >>privileges from a central PostgreSQL 7.3.4 database server (I recently >>ported the MySQL code in Adminmod to PostgreSQL to be able to do this). >> >>The data needed by the game servers are combined from several different >>tables, so we have some views set up to provide the data in the format >>needed. >> >>Currently there's only a few users in the database for testing purposes, >>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of >>queries and duration turned on), but several times per hour the duration >>for one of the queries is 2-3 seconds (1000 times larger), while the >>surrounding lookups take the usual 2-3 ms. > > > Check vmstat during the same period if it is syncing at that point as Tom > suggested. I've been running a vmstat 1 logging process for a while now, and the sample below shows what happende around one of these spikes - at 18:18:03 specifically (actually there were two 1 second long queries, that finished at 18:18:03). Thu Aug 28 18:17:53 2003 0 0 0 40904 4568 22288 404352 0 0 12 0 181 362 2 1 97 Thu Aug 28 18:17:54 2003 0 0 0 40904 4580 22260 404380 0 0 128 0 205 330 2 3 95 Thu Aug 28 18:17:55 2003 0 0 2 40904 4576 22264 404380 0 0 0 284 224 127 0 1 99 Thu Aug 28 18:17:56 2003 0 0 2 40904 5008 22268 404512 0 0 128 728 571 492 2 3 95 Thu Aug 28 18:17:57 2003 0 0 1 40904 5000 22276 404512 0 0 0 120 201 181 1 0 99 Thu Aug 28 18:17:58 2003 0 0 1 40904 4936 22284 404528 0 0 8 0 1147 2204 12 3 85 Thu Aug 28 18:17:59 2003 0 0 0 40904 4784 22304 404660 0 0 148 0 2112 3420 2 3 95 Thu Aug 28 18:18:00 2003 1 1 3 40904 4760 22324 404664 0 0 20 456 2374 3277 2 1 97 Thu Aug 28 18:18:01 2003 0 2 10 40904 4436 22000 401456 0 0 144 540 510 457 11 6 83 Thu Aug 28 18:18:02 2003 1 1 2 40904 8336 22032 401512 0 0 68 676 1830 2540 4 3 93 Thu Aug 28 18:18:04 2003 1 0 1 40904 8160 22052 401664 0 0 140 220 2308 3253 2 3 95 Thu Aug 28 18:18:05 2003 0 0 1 40904 7748 22064 402064 0 0 288 0 1941 2856 1 3 96 Thu Aug 28 18:18:06 2003 0 0 3 40904 6704 22064 403100 0 0 496 992 2326 3510 0 5 95 Thu Aug 28 18:18:07 2003 1 0 0 40904 6324 22088 402716 0 0 260 188 1984 2927 11 4 85 Thu Aug 28 18:18:08 2003 0 0 0 40904 6920 22088 402828 0 0 72 0 419 1473 17 5 78 Thu Aug 28 18:18:09 2003 0 0 0 40904 6784 22088 402964 0 0 128 0 235 476 2 1 97 Thu Aug 28 18:18:10 2003 0 0 1 40904 6404 22088 402980 0 0 0 0 343 855 14 2 84 As this shows, some disk I/O and an increased amount of interrupts and context switches is taking place at this time, and this also happens at the same time as all the other long queries I examined. However, vmstat also shows this pattern at a lot of other times, where the queries aren't affected by it. > Are you using pooled connections? If yes you could shorten life of a connection > and force making a new connection every 10-15 minutes say. That would avoid IO > avelanche at the end of the hour types. I'm not quite sure, what you mean by "pooled connections". Each game server has one connection to the PostgreSQL server, which is opened, when the server is first started, and then never closed (until the game server terminates, but there's days between this happens). Regards, Anders K. Pedersen
Tom Lane wrote: > "Anders K. Pedersen" <akp@cohaesio.com> writes: > >>Currently there's only a few users in the database for testing purposes, >>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of >>queries and duration turned on), but several times per hour the duration >>for one of the queries is 2-3 seconds (1000 times larger), while the >>surrounding lookups take the usual 2-3 ms. > > > One thing that comes to mind is that the slow query could be occurring > at the same time as a checkpoint, or some other cycle-chewing background > operation. It's not clear why a checkpoint would slow things down that > much, though. Anyway I'd suggest looking for such activities; once we > know if that's the issue or not, we can make some progress. One of my colleagues suggested looking for checkpoints as well; I searched the log, but only the following messages turned up: Aug 11 15:21:04 gs1 postgres[5447]: [2] LOG: checkpoint record is at 0/80193C Aug 23 13:59:51 gs1 postgres[16451]: [2] LOG: checkpoint record is at 0/201EB74 Aug 25 02:48:17 gs1 postgres[1059]: [2] LOG: checkpoint record is at 0/2B787D0 Currently there are only relatively few changes to the database - one INSERT everytime one of our game admins executes an administrative command (like ban or kick), and this happens at most 10 times per hour. As I understand checkpoints, this should mean, that they aren't happening very often, and when they do, should be able to finish almost immediately. Regards, Anders K. Pedersen
> With regards to other jobs on the server, there is a MySQL server on it > as well, which from time to time has some multi-second queries generated > from a webserver also on this host, but the MySQL is running with nice > 10 (PostgreSQL isn't nice'd). Do those MySQL queries hit disk hard? I've never seen PostgreSQL have hicups like you describe when running on a machine by itself. I have experienced similar issues when another process (cron job in my case) caused brief swapping to occur.
Attachment
Rod Taylor wrote: >>With regards to other jobs on the server, there is a MySQL server on it >>as well, which from time to time has some multi-second queries generated >>from a webserver also on this host, but the MySQL is running with nice >>10 (PostgreSQL isn't nice'd). > > Do those MySQL queries hit disk hard? I guess they may be able to do so - the MySQL database is 450 MB, and the server has 512 MB RAM, and some of the queries pretty summarizes everything in the database. However, I just cross-referenced the access logs from the webserver with the duration logs, and although some of the spikes did happen, while there would have been some MySQL activity (I can't tell for sure, if it was simple queries or the long ones), other spikes happened without any website activity in the surrounding minutes. > I've never seen PostgreSQL have hicups like you describe when running on > a machine by itself. I have experienced similar issues when another > process (cron job in my case) caused brief swapping to occur. OK. I may have to try to put the database on a separate server. Regards, Anders K. Pedersen
Just to add to the clutch here, also check your bdflush settings (if you're on linux) or equivalent (if you're not.) Many times the swapping algo in linux can be quite bursty if you have it set to move too many pages at a time during cleanup / flush.
scott.marlowe wrote: > Just to add to the clutch here, also check your bdflush settings (if > you're on linux) or equivalent (if you're not.) > > Many times the swapping algo in linux can be quite bursty if you have it > set to move too many pages at a time during cleanup / flush. According to vmstat it doesn't swap near the spikes, so I don't think this is the problem. I posted a vmstat sample in another reply, where you can see an example of what happens. Regards, Anders K. Pedersen
On 28 Aug 2003 at 20:16, Anders K. Pedersen wrote: > Shridhar Daithankar wrote: > > On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote: > >>We're running a set of Half-Life based game servers that lookup user > >>privileges from a central PostgreSQL 7.3.4 database server (I recently > >>ported the MySQL code in Adminmod to PostgreSQL to be able to do this). > >> > >>The data needed by the game servers are combined from several different > >>tables, so we have some views set up to provide the data in the format > >>needed. > >> > >>Currently there's only a few users in the database for testing purposes, > >>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of > >>queries and duration turned on), but several times per hour the duration > >>for one of the queries is 2-3 seconds (1000 times larger), while the > >>surrounding lookups take the usual 2-3 ms. > > > > > > Check vmstat during the same period if it is syncing at that point as Tom > > suggested. > > I've been running a vmstat 1 logging process for a while now, and the > sample below shows what happende around one of these spikes - at > 18:18:03 specifically (actually there were two 1 second long queries, > that finished at 18:18:03). > > Thu Aug 28 18:17:53 2003 0 0 0 40904 4568 22288 404352 0 0 > 12 0 181 362 2 1 97 > Thu Aug 28 18:17:54 2003 0 0 0 40904 4580 22260 404380 0 0 > 128 0 205 330 2 3 95 > Thu Aug 28 18:17:55 2003 0 0 2 40904 4576 22264 404380 0 0 > 0 284 224 127 0 1 99 > Thu Aug 28 18:17:56 2003 0 0 2 40904 5008 22268 404512 0 0 > 128 728 571 492 2 3 95 > Thu Aug 28 18:17:57 2003 0 0 1 40904 5000 22276 404512 0 0 > 0 120 201 181 1 0 99 > Thu Aug 28 18:17:58 2003 0 0 1 40904 4936 22284 404528 0 0 > 8 0 1147 2204 12 3 85 > Thu Aug 28 18:17:59 2003 0 0 0 40904 4784 22304 404660 0 0 > 148 0 2112 3420 2 3 95 > Thu Aug 28 18:18:00 2003 1 1 3 40904 4760 22324 404664 0 0 > 20 456 2374 3277 2 1 97 > Thu Aug 28 18:18:01 2003 0 2 10 40904 4436 22000 401456 0 0 > 144 540 510 457 11 6 83 > Thu Aug 28 18:18:02 2003 1 1 2 40904 8336 22032 401512 0 0 > 68 676 1830 2540 4 3 93 > Thu Aug 28 18:18:04 2003 1 0 1 40904 8160 22052 401664 0 0 > 140 220 2308 3253 2 3 95 > Thu Aug 28 18:18:05 2003 0 0 1 40904 7748 22064 402064 0 0 > 288 0 1941 2856 1 3 96 > Thu Aug 28 18:18:06 2003 0 0 3 40904 6704 22064 403100 0 0 > 496 992 2326 3510 0 5 95 > Thu Aug 28 18:18:07 2003 1 0 0 40904 6324 22088 402716 0 0 > 260 188 1984 2927 11 4 85 > Thu Aug 28 18:18:08 2003 0 0 0 40904 6920 22088 402828 0 0 > 72 0 419 1473 17 5 78 > Thu Aug 28 18:18:09 2003 0 0 0 40904 6784 22088 402964 0 0 > 128 0 235 476 2 1 97 > Thu Aug 28 18:18:10 2003 0 0 1 40904 6404 22088 402980 0 0 > 0 0 343 855 14 2 84 > Notice a pattern. In first few entries free memory is increasing coupled with IO. Few entries down the line it's decreasing again with IO. I would guess that something terminated and started.. However given how idle CPU is, I wonder would it matter. Besides changes in stats are pretty small to be any significant. I wouldn't worry about context switches or interrupts. They don't seem to be any dramatic.. I wonder what kernel you are using. While running pgbench on 2.4 and 2.6 couple of days back, I noticed several stalls with 2.4 where neither CPU or disk does anything but nothing moves forward, for 30 sec. or so. If possible try with 2.6 Check which scheduler you are using and which works best for you. http://marc.theaimsgroup.com/?l=linux-kernel&m=105743728122143&w=2 If you want a step by step how to install 2.6, I can give that too.. It's pretty simple.. > > Are you using pooled connections? If yes you could shorten life of a connection > > and force making a new connection every 10-15 minutes say. That would avoid IO > > avelanche at the end of the hour types. > > I'm not quite sure, what you mean by "pooled connections". Each game > server has one connection to the PostgreSQL server, which is opened, > when the server is first started, and then never closed (until the game > server terminates, but there's days between this happens). I would say let each server start a new connection every 15 minutes. As soon as new connection is established, close old one. See if that cures the problem. Apache uses similar methods albeit it measure by nubmer of requests served by each child. HTH Bye Shridhar -- It is necessary to have purpose. -- Alice #1, "I, Mudd", stardate 4513.3