Thread: Lots of "semop" calls under load

From: "Albe Laurenz"
Date:
On a database (PostgreSQL 8.2.4 on 64-bit Linux 2.6.18 on 8 AMD Opterons)
that is under high load, I observe the following:

- About 200 database sessions concurrently issue queries, most of them small,
  but I have found one that touches 38000 table and index blocks.
- "vmstat" shows that CPU time is divided between "idle" and "iowait",
  with user and sys time practically zero.
- The run queue is short, but the blocked queue (uninterruptible sleep) is around 10.
- Many context switches are reported (over a hundred per second).
- "sar" says that the disk with the database is at 100% of its capacity.
  Storage is on a SAN box.

Queries that normally take seconds at most require up to an hour to finish.

I ran "lsof -p" on a backend running the big query mentioned above, and
it does not use any temp files (work_mem = 20MB).
The query accesses only one table and its index.

What puzzles me is the "strace -tt" output from that backend:

13:44:58.263598 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.313448 semop(229382, 0x7fff482f6070, 1) = 0
13:44:58.313567 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.442917 semop(229382, 0x7fff482f6070, 1) = 0
13:44:58.443074 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.565313 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.682178 semop(229382, 0x7fff482f6070, 1) = 0
13:44:58.682333 semop(393227, 0x7fff482f6480, 1) = 0
13:44:58.807452 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.924425 semop(393227, 0x7fff482f6480, 1) = 0
13:44:58.924727 semop(393227, 0x7fff482f6050, 1) = 0
13:44:59.045456 semop(393227, 0x7fff482f6050, 1) = 0
13:44:59.169011 semop(393227, 0x7fff482f6480, 1) = 0
13:44:59.169226 semop(327689, 0x7fff482f64a0, 1) = 0
[many more semops]
13:44:59.602532 semop(327689, 0x7fff482f6070, 1) = 0
13:44:59.602648 lseek(32, 120176640, SEEK_SET) = 120176640
13:44:59.602742 read(32, "{\0\0\0xwv\227\1\0\0\0\320\0\350\0\0 \3  \237\300\1@\236\300\1`\235\300\1"..., 8192) = 8192
13:44:59.602825 semop(327689, 0x7fff482f64d0, 1) = 0
13:44:59.602872 semop(393227, 0x7fff482f6080, 1) = 0
13:44:59.602929 semop(393227, 0x7fff482f6050, 1) = 0
13:44:59.614559 semop(360458, 0x7fff482f6070, 1) = 0
[many more semops]
13:44:59.742103 semop(229382, 0x7fff482f64a0, 1) = 0
13:44:59.742172 semop(393227, 0x7fff482f6050, 1) = 0
13:44:59.756526 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
13:44:59.758096 semop(393227, 0x7fff482f6480, 1) = 0
13:44:59.771655 semop(393227, 0x7fff482f6050, 1) = 0
[hundreds of semops]
13:45:14.339905 semop(393227, 0x7fff482f6050, 1) = 0
13:45:14.466992 semop(360458, 0x7fff482f6070, 1) = 0
13:45:14.467102 lseek(33, 332693504, SEEK_SET) = 332693504
13:45:14.467138 read(33, "{\0\0\0\210\235\351\331\1\0\0\0\204\0010\32\360\37\3 \340\237 \0\320\237 \0\300\237 \0"..., 8192) = 8192
13:45:14.599815 semop(163844, 0x7fff482f60a0, 1) = 0
13:45:14.600006 lseek(32, 125034496, SEEK_SET) = 125034496
13:45:14.600305 read(32, "{\0\0\0\230\257\270\227\1\0\0\0\330\0\340\0\0 \3  \237\300\1@\236\300\1`\235\300\1"..., 8192) = 8192
13:45:14.600391 semop(163844, 0x7fff482f64d0, 1) = 0
13:45:14.600519 semop(393227, 0x7fff482f6480, 1) = 0

and so on. File 32 is the table, file 33 is the index.

Many of the table and index blocks are probably already in shared memory
(shared_buffers = 6GB) and don't have to be read from disk.
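As an aside, the balance between semops and actual I/O in a trace like the one above is easy to quantify by tallying syscall names from the `strace -tt` log. A minimal sketch (the sample lines are copied from the trace above; in practice you would feed the whole saved log):

```python
import re
from collections import Counter

# A few lines copied from the strace -tt output above.
SAMPLE_LOG = """\
13:44:58.263598 semop(393227, 0x7fff482f6050, 1) = 0
13:44:58.313448 semop(229382, 0x7fff482f6070, 1) = 0
13:44:59.602648 lseek(32, 120176640, SEEK_SET) = 120176640
13:44:59.602742 read(32, "...", 8192) = 8192
"""

# strace -tt lines start with an HH:MM:SS.usec timestamp, then the syscall name.
LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+)\s+(\w+)\(")

def tally_syscalls(log: str) -> Counter:
    """Count occurrences of each syscall name in an strace -tt log."""
    counts = Counter()
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts

print(tally_syscalls(SAMPLE_LOG))
```

On the full log this makes it obvious at a glance whether the backend spends its calls on semaphores or on reads.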

My questions:

Is the long duration of the query caused by something other than I/O overload?
What are the semops? Lightweight locks waiting for shared buffers?
Are the lseek and read operations really that fast although the disk is at 100%?

Is this normal behavior under overload or is something ill tuned?

Yours,
Laurenz Albe

Re: Lots of "semop" calls under load

From: Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> On a database (PostgreSQL 8.2.4 on 64-bit Linux 2.6.18 on 8 AMD Opterons)
> that is under high load, I observe the following:
> ...
> - "vmstat" shows that CPU time is divided between "idle" and "iowait",
>   with user and sys time practically zero.
> - "sar" says that the disk with the database is at 100% of its capacity.

It sounds like you've simply saturated the disk's I/O bandwidth.
(I've noticed that Linux isn't all that good about distinguishing "idle"
from "iowait" --- more than likely you're really looking at 100% iowait.)

>   Storage is on a SAN box.

What kind of SAN box?  You're going to need something pretty beefy to
keep all those CPUs busy.

> What puzzles me is the "strace -tt" output from that backend:

Some low level of contention and consequent semops/context switches
is to be expected.  I don't think you need to worry if it's only
100/sec.  The sort of "context swap storm" behavior we've seen in
the past is in the tens of thousands of swaps/sec on hardware
much weaker than what you have here --- if you were seeing one of
those I bet you'd be well above 100000 swaps/sec.
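To put a number on that threshold, the backend's semop rate can be estimated directly from the timestamps in the same `strace -tt` log. A minimal sketch (the sample timestamps are taken from the trace above; real use would feed the full log):

```python
import re
from datetime import datetime

# Timestamps copied from the strace -tt output earlier in the thread.
SAMPLE = [
    "13:44:58.263598 semop(393227, 0x7fff482f6050, 1) = 0",
    "13:44:59.169226 semop(327689, 0x7fff482f64a0, 1) = 0",
    "13:45:14.339905 semop(393227, 0x7fff482f6050, 1) = 0",
]

TS_RE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+)\s+semop\(")

def semops_per_second(lines):
    """Average semop rate over the span covered by the trace lines."""
    stamps = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S.%f"))
    if len(stamps) < 2:
        return 0.0
    span = (stamps[-1] - stamps[0]).total_seconds()
    return len(stamps) / span if span > 0 else float("inf")

print(semops_per_second(SAMPLE))
```

A result several orders of magnitude below the tens of thousands per second seen in a "context swap storm" supports the conclusion that lock contention is not the bottleneck here.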

> Are the lseek and read operations really that fast although the disk is at 100%?

lseek is (should be) cheap ... it doesn't do any actual I/O.  The
read()s you're showing here were probably satisfied from kernel disk
cache.  If you look at a larger sample you'll find slower ones, I think.
Another thing to look for is slow writes.
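If the trace is recaptured with `strace -ttT`, strace appends each call's wall-clock duration in angle brackets, which makes the slow reads and writes predicted here easy to flag. A minimal sketch (the sample durations are illustrative, not taken from the original trace):

```python
import re

# strace -ttT appends each call's duration in angle brackets, e.g. <0.000084>.
SAMPLE = """\
13:44:59.602742 read(32, "...", 8192) = 8192 <0.000084>
13:45:14.467138 read(33, "...", 8192) = 8192 <0.132677>
"""

CALL_RE = re.compile(r"^\S+\s+(read|write)\((\d+),.*<(\d+\.\d+)>")

def slow_io(log, threshold=0.010):
    """Return (syscall, fd, seconds) for reads/writes slower than threshold."""
    hits = []
    for line in log.splitlines():
        m = CALL_RE.match(line)
        if m and float(m.group(3)) >= threshold:
            hits.append((m.group(1), int(m.group(2)), float(m.group(3))))
    return hits

print(slow_io(SAMPLE))  # only the 132 ms read on fd 33 exceeds the threshold
```

Cache-hit reads complete in microseconds; anything in the tens of milliseconds had to wait for the saturated disk.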

            regards, tom lane

Re: Lots of "semop" calls under load

From: "Albe Laurenz"
Date:
Tom Lane wrote:
>> On a database (PostgreSQL 8.2.4 on 64-bit Linux 2.6.18 on 8 AMD Opterons)
>> that is under high load, I observe the following: ...
>> - "vmstat" shows that CPU time is divided between "idle" and "iowait",
>>   with user and sys time practically zero.
>> - "sar" says that the disk with the database is at 100% of its capacity.
>
> It sounds like you've simply saturated the disk's I/O bandwidth.
> (I've noticed that Linux isn't all that good about distinguishing "idle"
> from "iowait" --- more than likely you're really looking at
> 100% iowait.)
>
>>   Storage is on a SAN box.
>
> What kind of SAN box?  You're going to need something pretty beefy to
> keep all those CPUs busy.

HP EVA 8100. Our storage people think that the observed I/O rate is not OK.
They mutter something about kernel disk cache configuration.

>> What puzzles me is the "strace -tt" output from that backend:
>
> I don't think you need to worry [...]

Thanks for explaining the strace output.

I am now more confident that the I/O overload is not the fault of PostgreSQL.
Most execution plans look as good as they can be, so it's probably either
the I/O system or the application that's at fault.

Yours,
Laurenz Albe