"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> On a database (PostgreSQL 8.2.4 on 64-bit Linux 2.6.18 on 8 AMD Opterons)
> that is under high load, I observe the following:
> ...
> - "vmstat" shows that CPU time is divided between "idle" and "iowait",
> with user and sys time practically zero.
> - "sar" says that the disk with the database is on 100% of its capacity.
It sounds like you've simply saturated the disk's I/O bandwidth.
(I've noticed that Linux isn't all that good about distinguishing "idle"
from "iowait" --- more than likely you're really looking at 100% iowait.)
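To see the idle/iowait split without going through vmstat, you can sample the aggregate "cpu" line of /proc/stat twice and compare. This is only a rough sketch: it assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal), and the function names are made up for illustration.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (assumed Linux layout).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def cpu_fields(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def cpu_shares(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in after if k in before}
    total = sum(delta.values())
    if total <= 0:
        return {k: 0.0 for k in delta}
    return {k: 100.0 * v / total for k, v in delta.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two readings 'interval' seconds apart and return the shares."""
    with open(path) as f:
        before = cpu_fields(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cpu_fields(f.read())
    return cpu_shares(before, after)
```

Calling sample() on the database host and adding the "idle" and "iowait" shares together gives the figure I'd actually pay attention to, since the kernel's split between the two is not very trustworthy.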
> Storage is on a SAN box.
What kind of SAN box? You're going to need something pretty beefy to
keep all those CPUs busy.
> What puzzles me is the "strace -tt" output from that backend:
Some low level of contention and consequent semops/context switches
is to be expected. I don't think you need to worry if it's only
100/sec. The sort of "context switch storm" behavior we've seen in
the past is in the tens of thousands of switches/sec on hardware
much weaker than what you have here --- if you were seeing one of
those I bet you'd be well above 100000 switches/sec.
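If you want a number to compare against those thresholds, the system-wide context switch counter is the "ctxt" line in /proc/stat, and the rate is just the difference between two readings divided by the interval. A minimal sketch, with hypothetical helper names, assuming a Linux /proc:

```python
import time


def ctxt_count(stat_text):
    """Return the cumulative context switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "ctxt":
            return int(parts[1])
    raise ValueError("no ctxt line found")


def switches_per_sec(before, after, seconds):
    """Average context switches per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (after - before) / seconds


def sample_rate(interval=5.0, path="/proc/stat"):
    """Read the counter twice, 'interval' seconds apart, and return the rate."""
    with open(path) as f:
        before = ctxt_count(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = ctxt_count(f.read())
    return switches_per_sec(before, after, interval)
```

Note that this counts switches for the whole machine, not just one backend, so it is an upper bound on what any single process contributes.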
> Are the lseek and read operations really that fast although the disk is on 100%?
lseek is (should be) cheap ... it doesn't do any actual I/O. The
read()s you're showing here were probably satisfied from kernel disk
cache. If you look at a larger sample you'll find slower ones, I think.
Another thing to look for is slow writes.
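One way to go through a larger sample is to save the "strace -tt" output to a file and pick out the calls followed by a long gap before the next line. The sketch below assumes single-process output where every line starts with an HH:MM:SS.microseconds timestamp; the function name and the 10 ms threshold are arbitrary choices for illustration. Keep in mind the gap also includes whatever time the backend spent in user space, so it only approximates the syscall duration; "strace -T" reports the time spent inside each call directly.

```python
import re
from datetime import datetime

# Matches a "strace -tt" line: wall-clock timestamp, then the syscall name.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(\w+)\(")


def slow_calls(lines, threshold=0.01):
    """Return (syscall, gap_seconds) pairs where the gap until the next
    traced line is at least 'threshold' seconds."""
    parsed = []
    for line in lines:
        m = LINE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            parsed.append((ts, m.group(2)))

    result = []
    for (ts, name), (next_ts, _) in zip(parsed, parsed[1:]):
        gap = (next_ts - ts).total_seconds()
        # A negative gap means the trace crossed midnight; it is simply skipped.
        if gap >= threshold:
            result.append((name, gap))
    return result
```

Running it over an open trace file, for example slow_calls(open("trace.txt")) with a file name of your choosing, and looking at which read() and write() calls show up should tell you how often the backend actually waits on the disk.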
regards, tom lane