Dear List,
We are having scalability issues on high-end hardware.
The hardware is:
CPU  = 4 x Opteron 6272, 16 cores each (64 cores total)
RAM  = 128 GB DDR3
Disk = high-performance RAID10 with many 15K spindles and a working BBU cache
Normally the 1-minute load average of the system stays between 0.5 and 1.0.
The problem is that occasionally the load average spikes to > 50
very rapidly (i.e. from 0.5 to 50 within 10 seconds), stays there
for a while, and then slowly returns to its normal value.
During these periods of high load average we observe no I/O wait in
the system, and the CPUs are even 50% idle. In any case, I/O wait
always remains below 1% and is mostly 0. Hence the load is not due
to high I/O wait, which was generally the case with our previous hardware.
We are puzzled as to why the CPU and disk I/O subsystems are not
being fully utilized, and would appreciate the list's wisdom on this.
We have set up sar to poll the system parameters every minute, and
the data is graphed with cacti. If required, any system or
PostgreSQL parameter can easily be put under cacti monitoring and
graphed.
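For reference, the kind of per-minute collection we are doing can be
reproduced interactively with sar; a minimal sketch (assuming the
sysstat package is installed; in production the sampling is normally
driven from cron via sa1 rather than run by hand):

```shell
# One 60-second sample each; these are the metrics most relevant here.
sar -q 60 1   # run-queue length and load averages
sar -u 60 1   # CPU utilization (user/system/iowait/idle)
sar -b 60 1   # I/O and transfer-rate statistics
```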
The query load is mostly read-only.
It is also possible to reproduce the problem with pgbench to some
extent. With -s 100 and -t 10000 the load does shoot up, but not as
spectacularly as under the real-world usage.
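For concreteness, the pgbench run looks roughly like this (the
database name, the SELECT-only flag to match our read-only load, and
the client/thread counts are illustrative assumptions, not part of
the original test):

```shell
# Initialize a pgbench database at scale factor 100;
# "bench" is a placeholder database name.
pgbench -i -s 100 bench

# Read-only run (-S = built-in SELECT-only script), 10000 transactions
# per client; -c/-j chosen here only as plausible values for a 64-core box.
pgbench -S -c 64 -j 8 -t 10000 bench
```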
Any help would be greatly appreciated.
Just a thought: would it be a good idea to partition the host
hardware into 4 equal virtual environments, i.e. 1 for the master
(r/w) and 3 read-only slaves, and distribute the read-only load
across the 3 slaves?
regds
mallah