Thread: Queries sometimes take 1000 times the normal time

Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
Hello,

We're running a set of Half-Life based game servers that lookup user
privileges from a central PostgreSQL 7.3.4 database server (I recently
ported the MySQL code in Adminmod to PostgreSQL to be able to do this).

The data needed by the game servers are combined from several different
tables, so we have some views set up to provide the data in the format
needed.

Currently there's only a few users in the database for testing purposes,
and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
queries and duration turned on), but several times per hour the duration
for one of the queries is 2-3 seconds (1000 times larger), while the
surrounding lookups take the usual 2-3 ms.

This is rather critical, as the game server software isn't asynchonous
and thus waits for a reply before continuing, so when someone connects,
and the user lookup happens to have one of these very long durations,
the players on this server experience a major lag spike, which isn't
very popular :-(

All the game servers and the database server are connected to the same
switch, so I don't think, that it is a network problem.

So far I've been unable to locate the problem, so any suggestions are
very welcome.

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
Rod Taylor
Date:
> Currently there's only a few users in the database for testing purposes,
> and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
> queries and duration turned on), but several times per hour the duration
> for one of the queries is 2-3 seconds (1000 times larger), while the
> surrounding lookups take the usual 2-3 ms.

Are there any other jobs running at the time of these excessive queries?

Attachment

Re: Queries sometimes take 1000 times the normal time

From
Tom Lane
Date:
"Anders K. Pedersen" <akp@cohaesio.com> writes:
> Currently there's only a few users in the database for testing purposes,
> and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
> queries and duration turned on), but several times per hour the duration
> for one of the queries is 2-3 seconds (1000 times larger), while the
> surrounding lookups take the usual 2-3 ms.

One thing that comes to mind is that the slow query could be occurring
at the same time as a checkpoint, or some other cycle-chewing background
operation.  It's not clear why a checkpoint would slow things down that
much, though.  Anyway I'd suggest looking for such activities; once we
know if that's the issue or not, we can make some progress.

            regards, tom lane

Re: Queries sometimes take 1000 times the normal time

From
"Shridhar Daithankar"
Date:
On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote:

> Hello,
>
> We're running a set of Half-Life based game servers that lookup user
> privileges from a central PostgreSQL 7.3.4 database server (I recently
> ported the MySQL code in Adminmod to PostgreSQL to be able to do this).
>
> The data needed by the game servers are combined from several different
> tables, so we have some views set up to provide the data in the format
> needed.
>
> Currently there's only a few users in the database for testing purposes,
> and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
> queries and duration turned on), but several times per hour the duration
> for one of the queries is 2-3 seconds (1000 times larger), while the
> surrounding lookups take the usual 2-3 ms.

Check vmstat during the same period if it is syncing at that point as Tom
suggested.

Are you using pooled connections? If yes you could shorten life of a connection
and force making a new connection every 10-15 minutes say. That would avoid IO
avelanche at the end of the hour types.

HTH.

Bye
 Shridhar

--
ignorance, n.:    When you don't know anything, and someone else finds out.


Re: Queries sometimes take 1000 times the normal time

From
"Russell Garrett"
Date:
We have a somewhat similar situation - we're running a fairly constant, but
low priority, background load of about 70 selects and 40 inserts per second
(batched into fairly large transactions), and on top of that we're trying to
run time-sensitive queries for a web site (well two). I should emphasize
that this is low low priority - if a query is delayed by an hour here, it
doesn't matter.

The web site queries will jump up one or two orders of magnitude (I have
seen a normally 100ms query take in excess of 30 seconds) in duration at
seemingly random points. It's not always when the transactions are
committing, and it doesn't seem to be during checkpointing either. The same
thing happens with WAL switched off. It appears to happen the first time the
query runs after a while. If I run the same query immediately afterwards, it
will take the normal amount of time.

Any ideas?

Cheers,

Russ Garrett

pgsql-performance-owner@postgresql.org wrote:
> Subject: [PERFORM] Queries sometimes take 1000 times the normal time
>
>
> Hello,
>
> We're running a set of Half-Life based game servers that lookup user
> privileges from a central PostgreSQL 7.3.4 database server (I recently
> ported the MySQL code in Adminmod to PostgreSQL to be able to do
> this).
>
> The data needed by the game servers are combined from several
> different tables, so we have some views set up to provide the data in
> the format needed.
>
> Currently there's only a few users in the database for testing
> purposes, and most of the time the user lookup's take 2-3 ms (I have
> syslog'ing of queries and duration turned on), but several times per
> hour the duration for one of the queries is 2-3 seconds (1000 times
> larger), while the surrounding lookups take the usual 2-3 ms.
>
> This is rather critical, as the game server software isn't asynchonous
> and thus waits for a reply before continuing, so when someone
> connects, and the user lookup happens to have one of these very long
> durations, the players on this server experience a major lag spike,
> which isn't very popular :-(
>
> All the game servers and the database server are connected to the same
> switch, so I don't think, that it is a network problem.
>
> So far I've been unable to locate the problem, so any suggestions are
> very welcome.
>
> Regards,
> Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
"Shridhar Daithankar"
Date:
On 28 Aug 2003 at 10:02, Russell Garrett wrote:

> The web site queries will jump up one or two orders of magnitude (I have
> seen a normally 100ms query take in excess of 30 seconds) in duration at
> seemingly random points. It's not always when the transactions are
> committing, and it doesn't seem to be during checkpointing either. The same
> thing happens with WAL switched off. It appears to happen the first time the
> query runs after a while. If I run the same query immediately afterwards, it
> will take the normal amount of time.

Looks like it got flushed out of every type of cache and IO scheduler could not
deliver immediately because of other loads...


Bye
 Shridhar

--
Abstainer, n.:    A weak person who yields to the temptation of denying himself a
pleasure.        -- Ambrose Bierce, "The Devil's Dictionary"


Re: Queries sometimes take 1000 times the normal time

From
"Russell Garrett"
Date:
>> The web site queries will jump up one or two orders of magnitude (I
>> have seen a normally 100ms query take in excess of 30 seconds) in
>> duration at seemingly random points. It's not always when the
>> transactions are committing, and it doesn't seem to be during
>> checkpointing either. The same thing happens with WAL switched off.
>> It appears to happen the first time the query runs after a while. If
>> I run the same query immediately afterwards, it will take the normal
>> amount of time.
>
> Looks like it got flushed out of every type of cache and IO scheduler
> could not deliver immediately because of other loads...

Yeah, I wasn't sure what (or how) Postgres caches. The db server does have
2Gb of memory, but then again the database amounts to more than 2Gb, so it's
fairly possible it's getting pushed out of cache. It's also fairly possible
that it's not tuned completely optimally. I wonder if FreeBSD/kernel 2.6
would perform better in such a situation...

Russ


Re: Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
Rod Taylor wrote:
>>Currently there's only a few users in the database for testing purposes,
>>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
>>queries and duration turned on), but several times per hour the duration
>>for one of the queries is 2-3 seconds (1000 times larger), while the
>>surrounding lookups take the usual 2-3 ms.
>
> Are there any other jobs running at the time of these excessive queries?

I don't know if you're referring to jobs inside the PostgreSQL database
or just jobs on the server, but I'm pretty sure that nothing major is
going on inside the database - the only other job using it is doing an
insert whenever one of our game admins executes an administrative
command (like ban or kick), but this doesn't happen all that often, and
according the PostgreSQL log isn't happening at the same times as the
long queries.

With regards to other jobs on the server, there is a MySQL server on it
as well, which from time to time has some multi-second queries generated
from a webserver also on this host, but the MySQL is running with nice
10 (PostgreSQL isn't nice'd).

Someone else asked about vmstat results, and I've been running this for
a while now, and I will report the results shortly.

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
Shridhar Daithankar wrote:
> On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote:
>>We're running a set of Half-Life based game servers that lookup user
>>privileges from a central PostgreSQL 7.3.4 database server (I recently
>>ported the MySQL code in Adminmod to PostgreSQL to be able to do this).
>>
>>The data needed by the game servers are combined from several different
>>tables, so we have some views set up to provide the data in the format
>>needed.
>>
>>Currently there's only a few users in the database for testing purposes,
>>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
>>queries and duration turned on), but several times per hour the duration
>>for one of the queries is 2-3 seconds (1000 times larger), while the
>>surrounding lookups take the usual 2-3 ms.
>
>
> Check vmstat during the same period if it is syncing at that point as Tom
> suggested.

I've been running a vmstat 1 logging process for a while now, and the
sample below shows what happende around one of these spikes - at
18:18:03 specifically (actually there were two 1 second long queries,
that finished at 18:18:03).

Thu Aug 28 18:17:53 2003 0  0  0  40904   4568  22288 404352   0   0
12     0  181   362  2   1  97
Thu Aug 28 18:17:54 2003 0  0  0  40904   4580  22260 404380   0   0
128     0  205   330  2   3  95
Thu Aug 28 18:17:55 2003 0  0  2  40904   4576  22264 404380   0   0
  0   284  224   127  0   1  99
Thu Aug 28 18:17:56 2003 0  0  2  40904   5008  22268 404512   0   0
128   728  571   492  2   3  95
Thu Aug 28 18:17:57 2003 0  0  1  40904   5000  22276 404512   0   0
  0   120  201   181  1   0  99
Thu Aug 28 18:17:58 2003 0  0  1  40904   4936  22284 404528   0   0
  8     0 1147  2204 12   3  85
Thu Aug 28 18:17:59 2003 0  0  0  40904   4784  22304 404660   0   0
148     0 2112  3420  2   3  95
Thu Aug 28 18:18:00 2003 1  1  3  40904   4760  22324 404664   0   0
20   456 2374  3277  2   1  97
Thu Aug 28 18:18:01 2003 0  2 10  40904   4436  22000 401456   0   0
144   540  510   457 11   6  83
Thu Aug 28 18:18:02 2003 1  1  2  40904   8336  22032 401512   0   0
68   676 1830  2540  4   3  93
Thu Aug 28 18:18:04 2003 1  0  1  40904   8160  22052 401664   0   0
140   220 2308  3253  2   3  95
Thu Aug 28 18:18:05 2003 0  0  1  40904   7748  22064 402064   0   0
288     0 1941  2856  1   3  96
Thu Aug 28 18:18:06 2003 0  0  3  40904   6704  22064 403100   0   0
496   992 2326  3510  0   5  95
Thu Aug 28 18:18:07 2003 1  0  0  40904   6324  22088 402716   0   0
260   188 1984  2927 11   4  85
Thu Aug 28 18:18:08 2003 0  0  0  40904   6920  22088 402828   0   0
72     0  419  1473 17   5  78
Thu Aug 28 18:18:09 2003 0  0  0  40904   6784  22088 402964   0   0
128     0  235   476  2   1  97
Thu Aug 28 18:18:10 2003 0  0  1  40904   6404  22088 402980   0   0
  0     0  343   855 14   2  84

As this shows, some disk I/O and an increased amount of interrupts and
context switches is taking place at this time, and this also happens at
the same time as all the other long queries I examined. However, vmstat
also shows this pattern at a lot of other times, where the queries
aren't affected by it.

> Are you using pooled connections? If yes you could shorten life of a connection
> and force making a new connection every 10-15 minutes say. That would avoid IO
> avelanche at the end of the hour types.

I'm not quite sure, what you mean by "pooled connections". Each game
server has one connection to the PostgreSQL server, which is opened,
when the server is first started, and then never closed (until the game
server terminates, but there's days between this happens).

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
Tom Lane wrote:
> "Anders K. Pedersen" <akp@cohaesio.com> writes:
>
>>Currently there's only a few users in the database for testing purposes,
>>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
>>queries and duration turned on), but several times per hour the duration
>>for one of the queries is 2-3 seconds (1000 times larger), while the
>>surrounding lookups take the usual 2-3 ms.
>
>
> One thing that comes to mind is that the slow query could be occurring
> at the same time as a checkpoint, or some other cycle-chewing background
> operation.  It's not clear why a checkpoint would slow things down that
> much, though.  Anyway I'd suggest looking for such activities; once we
> know if that's the issue or not, we can make some progress.

One of my colleagues suggested looking for checkpoints as well; I
searched the log, but only the following messages turned up:

Aug 11 15:21:04 gs1 postgres[5447]: [2] LOG:  checkpoint record is at
0/80193C
Aug 23 13:59:51 gs1 postgres[16451]: [2] LOG:  checkpoint record is at
0/201EB74
Aug 25 02:48:17 gs1 postgres[1059]: [2] LOG:  checkpoint record is at
0/2B787D0

Currently there are only relatively few changes to the database - one
INSERT everytime one of our game admins executes an administrative
command (like ban or kick), and this happens at most 10 times per hour.
As I understand checkpoints, this should mean, that they aren't
happening very often, and when they do, should be able to finish almost
immediately.

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
Rod Taylor
Date:
> With regards to other jobs on the server, there is a MySQL server on it
> as well, which from time to time has some multi-second queries generated
> from a webserver also on this host, but the MySQL is running with nice
> 10 (PostgreSQL isn't nice'd).

Do those MySQL queries hit disk hard?

I've never seen PostgreSQL have hicups like you describe when running on
a machine by itself.  I have experienced similar issues when another
process (cron job in my case) caused brief swapping to occur.

Attachment

Re: Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
Rod Taylor wrote:
>>With regards to other jobs on the server, there is a MySQL server on it
>>as well, which from time to time has some multi-second queries generated
>>from a webserver also on this host, but the MySQL is running with nice
>>10 (PostgreSQL isn't nice'd).
>
> Do those MySQL queries hit disk hard?

I guess they may be able to do so - the MySQL database is 450 MB, and
the server has 512 MB RAM, and some of the queries pretty summarizes
everything in the database.

However, I just cross-referenced the access logs from the webserver with
the duration logs, and although some of the spikes did happen, while
there would have been some MySQL activity (I can't tell for sure, if it
was simple queries or the long ones), other spikes happened without any
website activity in the surrounding minutes.

> I've never seen PostgreSQL have hicups like you describe when running on
> a machine by itself.  I have experienced similar issues when another
> process (cron job in my case) caused brief swapping to occur.

OK. I may have to try to put the database on a separate server.

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
"scott.marlowe"
Date:
Just to add to the clutch here, also check your bdflush settings (if
you're on linux) or equivalent (if you're not.)

Many times the swapping algo in linux can be quite bursty if you have it
set to move too many pages at a time during cleanup / flush.


Re: Queries sometimes take 1000 times the normal time

From
"Anders K. Pedersen"
Date:
scott.marlowe wrote:
> Just to add to the clutch here, also check your bdflush settings (if
> you're on linux) or equivalent (if you're not.)
>
> Many times the swapping algo in linux can be quite bursty if you have it
> set to move too many pages at a time during cleanup / flush.

According to vmstat it doesn't swap near the spikes, so I don't think
this is the problem. I posted a vmstat sample in another reply, where
you can see an example of what happens.

Regards,
Anders K. Pedersen


Re: Queries sometimes take 1000 times the normal time

From
"Shridhar Daithankar"
Date:
On 28 Aug 2003 at 20:16, Anders K. Pedersen wrote:

> Shridhar Daithankar wrote:
> > On 28 Aug 2003 at 1:07, Anders K. Pedersen wrote:
> >>We're running a set of Half-Life based game servers that lookup user
> >>privileges from a central PostgreSQL 7.3.4 database server (I recently
> >>ported the MySQL code in Adminmod to PostgreSQL to be able to do this).
> >>
> >>The data needed by the game servers are combined from several different
> >>tables, so we have some views set up to provide the data in the format
> >>needed.
> >>
> >>Currently there's only a few users in the database for testing purposes,
> >>and most of the time the user lookup's take 2-3 ms (I have syslog'ing of
> >>queries and duration turned on), but several times per hour the duration
> >>for one of the queries is 2-3 seconds (1000 times larger), while the
> >>surrounding lookups take the usual 2-3 ms.
> >
> >
> > Check vmstat during the same period if it is syncing at that point as Tom
> > suggested.
>
> I've been running a vmstat 1 logging process for a while now, and the
> sample below shows what happende around one of these spikes - at
> 18:18:03 specifically (actually there were two 1 second long queries,
> that finished at 18:18:03).
>
> Thu Aug 28 18:17:53 2003 0  0  0  40904   4568  22288 404352   0   0
> 12     0  181   362  2   1  97
> Thu Aug 28 18:17:54 2003 0  0  0  40904   4580  22260 404380   0   0
> 128     0  205   330  2   3  95
> Thu Aug 28 18:17:55 2003 0  0  2  40904   4576  22264 404380   0   0
>   0   284  224   127  0   1  99
> Thu Aug 28 18:17:56 2003 0  0  2  40904   5008  22268 404512   0   0
> 128   728  571   492  2   3  95
> Thu Aug 28 18:17:57 2003 0  0  1  40904   5000  22276 404512   0   0
>   0   120  201   181  1   0  99
> Thu Aug 28 18:17:58 2003 0  0  1  40904   4936  22284 404528   0   0
>   8     0 1147  2204 12   3  85
> Thu Aug 28 18:17:59 2003 0  0  0  40904   4784  22304 404660   0   0
> 148     0 2112  3420  2   3  95
> Thu Aug 28 18:18:00 2003 1  1  3  40904   4760  22324 404664   0   0
> 20   456 2374  3277  2   1  97
> Thu Aug 28 18:18:01 2003 0  2 10  40904   4436  22000 401456   0   0
> 144   540  510   457 11   6  83
> Thu Aug 28 18:18:02 2003 1  1  2  40904   8336  22032 401512   0   0
> 68   676 1830  2540  4   3  93
> Thu Aug 28 18:18:04 2003 1  0  1  40904   8160  22052 401664   0   0
> 140   220 2308  3253  2   3  95
> Thu Aug 28 18:18:05 2003 0  0  1  40904   7748  22064 402064   0   0
> 288     0 1941  2856  1   3  96
> Thu Aug 28 18:18:06 2003 0  0  3  40904   6704  22064 403100   0   0
> 496   992 2326  3510  0   5  95
> Thu Aug 28 18:18:07 2003 1  0  0  40904   6324  22088 402716   0   0
> 260   188 1984  2927 11   4  85
> Thu Aug 28 18:18:08 2003 0  0  0  40904   6920  22088 402828   0   0
> 72     0  419  1473 17   5  78
> Thu Aug 28 18:18:09 2003 0  0  0  40904   6784  22088 402964   0   0
> 128     0  235   476  2   1  97
> Thu Aug 28 18:18:10 2003 0  0  1  40904   6404  22088 402980   0   0
>   0     0  343   855 14   2  84
>

Notice a pattern. In first few entries free memory is increasing coupled with
IO. Few entries down the line it's decreasing again with IO.  I would guess
that something terminated and started..

However given how idle CPU is, I wonder would it matter. Besides changes in
stats are pretty small to be any significant. I wouldn't worry about context
switches or interrupts. They don't seem to be any dramatic..

I wonder what kernel you are using. While running pgbench on 2.4 and 2.6 couple
of days back, I noticed several stalls with 2.4 where neither CPU or disk does
anything but nothing moves forward, for 30 sec. or so.

If possible try with 2.6 Check which scheduler you are using and which works
best for you.

    http://marc.theaimsgroup.com/?l=linux-kernel&m=105743728122143&w=2

If you want a step by step how to install 2.6, I can give that too.. It's
pretty simple..

> > Are you using pooled connections? If yes you could shorten life of a connection
> > and force making a new connection every 10-15 minutes say. That would avoid IO
> > avelanche at the end of the hour types.
>
> I'm not quite sure, what you mean by "pooled connections". Each game
> server has one connection to the PostgreSQL server, which is opened,
> when the server is first started, and then never closed (until the game
> server terminates, but there's days between this happens).

I would say let each server start a new connection every 15 minutes. As soon as
new connection is established, close old one.

See if that cures the problem. Apache uses similar methods albeit it measure by
nubmer of requests served by each child.

HTH

Bye
 Shridhar

--
It is necessary to have purpose.        -- Alice #1, "I, Mudd", stardate 4513.3