Thread: Simple query showing 270 hours of CPU time
Today, I looked at 'top' on my PG server and saw a pid that reported 270 hours of CPU time. Considering this is a very simple query, I was surprised to say the least. I was about to just kill the pid, but I figured I'd try and see exactly what it was stuck doing for so long. Here's the strace summary as run for a few second sample: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 97.25 0.671629 92 7272 semop 1.76 0.012171 406 30 recvfrom 0.57 0.003960 66 60 gettimeofday 0.36 0.002512 28 90 sendto 0.05 0.000317 10 32 lseek 0.01 0.000049 1 48 select ------ ----------- ----------- --------- --------- ---------------- 100.00 0.690638 7532 total Here's the query: select id from eventkeywords where word = '00003322' If I run the query manually, it completes in about 500ms, which is very reasonable. There are 408563 rows in this table. I just noticed there is no index on word ( there should be! ). Would this have caused the problem? This is 8.0.12 Linux sunrise 2.6.15-26-amd64-server #1 SMP Fri Sep 8 20:33:15 UTC 2006 x86_64 GNU/Linux Any idea what might have set it into this loop? -Dan
Dan Harris <fbsd@drivefaster.net> writes: > Here's the strace summary as run for a few second sample: > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 97.25 0.671629 92 7272 semop > 1.76 0.012171 406 30 recvfrom > 0.57 0.003960 66 60 gettimeofday > 0.36 0.002512 28 90 sendto > 0.05 0.000317 10 32 lseek > 0.01 0.000049 1 48 select > ------ ----------- ----------- --------- --------- ---------------- > 100.00 0.690638 7532 total > Here's the query: > select id from eventkeywords where word = '00003322' How sure are you that (a) that's really what it's doing and (b) you are not observing multiple executions of the query? There are no recvfrom calls in the inner loops of the backend AFAIR, so this looks to me like the execution of 30 different queries. The number of semops is distressingly high, but that's a contention issue not an amount-of-runtime issue. I think you're looking at a backend that has simply executed one heckuva lot of queries on behalf of its client, and that inquiring into what the client thinks it's doing might be the first order of business. regards, tom lane
> Today, I looked at 'top' on my PG server and saw a pid that reported 270 > hours of CPU time. Considering this is a very simple query, I was > surprised to say the least. I was about to just kill the pid, but I > figured I'd try and see exactly what it was stuck doing for so long. If you are using connection pooling, or if your client keeps the connections for a long time, this backend could be very old... With PHP's persistent connections, for instance, backends restart when you restart the webserver, which isn't usually very often.
Tom Lane wrote: > Dan Harris <fbsd@drivefaster.net> writes: >> Here's the strace summary as run for a few second sample: > >> % time seconds usecs/call calls errors syscall >> ------ ----------- ----------- --------- --------- ---------------- >> 97.25 0.671629 92 7272 semop >> 1.76 0.012171 406 30 recvfrom >> 0.57 0.003960 66 60 gettimeofday >> 0.36 0.002512 28 90 sendto >> 0.05 0.000317 10 32 lseek >> 0.01 0.000049 1 48 select >> ------ ----------- ----------- --------- --------- ---------------- >> 100.00 0.690638 7532 total > >> Here's the query: >> select id from eventkeywords where word = '00003322' > > How sure are you that (a) that's really what it's doing and (b) you are > not observing multiple executions of the query? There are no recvfrom > calls in the inner loops of the backend AFAIR, so this looks to me like > the execution of 30 different queries. The number of semops is > distressingly high, but that's a contention issue not an > amount-of-runtime issue. You were absolutely right. This is one connection that is doing a whole lot of ( slow ) queries. I jumped the gun on this and assumed it was a single query taking this long. Sorry to waste time and bandwidth. Since you mentioned the number of semops is distressingly high, does this indicate a tuning problem? The machine has 64GB of RAM and as far as I can tell about 63GB is all cache. I wonder if this is a clue to an undervalued memory-related setting somewhere? -Dan
Dan Harris <fbsd@drivefaster.net> writes: > Since you mentioned the number of semops is distressingly high, does this > indicate a tuning problem? More like an old-version problem. We've done a lot of work on concurrent performance since 8.0.x, and most likely you are hitting one of the bottlenecks that have been solved since then. regards, tom lane