Thread: High CPU Utilization

From:
Joe Uhl
Date:

Our production database is seeing very heavy CPU utilization - anyone
have any ideas/input considering the following?

CPU utilization gradually increases during the day until it approaches
90%-100% at our peak time.  When this happens our transactions/sec
drops and our site becomes very slow.  When in this state, I can see
hundreds of queries in pg_stat_activity that are not waiting on locks
but sit there for minutes.  When the database is not in this state,
those same queries can complete in fractions of a second - faster than
my script that watches pg_stat_activity can keep track of them.
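For reference, the sort of query such a watcher script can run (a sketch against the 8.3 catalog; note that `procpid`, `waiting`, and `current_query` were renamed in later PostgreSQL releases):

```sql
-- Long-running statements that are not blocked on a lock (PostgreSQL 8.3).
SELECT procpid,
       now() - query_start AS runtime,
       waiting,
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
  AND NOT waiting
ORDER BY runtime DESC;
```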

This server has dual quad core xeon 5310s, 32 GB RAM, and a few
different disk arrays (all managed in hardware by either the Perc5/i
or Perc5/e adapter).  The Postgres data is on a 14 disk 7.2k SATA raid
10.  This server runs nothing but Postgres.

The PostgreSQL database (according to pg_database_size) is 55GB and we
are running PostgreSQL 8.3.5 and the 2.6.28.7-2 kernel under Arch Linux.

Right now (not under peak load) this server is running at 68% CPU
utilization and its SATA raid 10 is doing about 2MB/s writes and 11MB/s
reads.  When I run dd I can hit 200+MB/s writes and 230+MB/s reads, so
we are barely using the available IO.  Further, when I run dd the CPU
utilization of that process only approaches 20%-30% of one core.

Additionally, when I view "top -c" I generally see a dozen or so
"idle" postgres processes (they appear and drop away quickly though)
consuming very large chunks of CPU (as much as 60% of a core).  At any
given time we have hundreds of idle postgres processes due to the JDBC
connection pooling but most of them are 0% as I would expect them to
be.  I also see selects and inserts consuming very large percentages
of CPU but I view that as less strange since they are doing work.

Any ideas as to what is causing our CPUs to struggle?  Is the fact
that our RAM covers a significant portion of the database causing our
CPUs to do a bunch of thrashing as they work with memory while our
disk controllers sit idle?  According to top we barely use any swap.

We currently have max_connections set to 1000 (roughly the sum of the
JDBC pools on our application servers).  Would decreasing this value
help?  We can decrease the JDBC pools or switch to pgbouncer for
pooling if this is the case.
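If pgbouncer is the route taken, a minimal transaction-pooling sketch might look like the following; the listen port, database name, and pool sizes are illustrative assumptions, not recommendations from this thread:

```ini
; Hypothetical pgbouncer.ini sketch -- values are illustrative, not tuned.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction   ; share real backends across many JDBC clients
max_client_conn = 1000    ; app-side connection count can stay high
default_pool_size = 32    ; actual PostgreSQL backends per database/user
```

One caveat with transaction pooling: server-side prepared statements do not survive across transactions, so JDBC drivers may need prepared-statement caching disabled.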

Really just looking for any input/ideas.  Our workload is primarily
OLTP in nature - essentially a social network.  The transactions/sec
figure at the start comes from the xact_commit value in pg_stat_database.
Please let me know if this value is not appropriate for getting a tps
estimate.  Right now, with the 60% CPU utilization and low IO use,
xact_commit is increasing at a rate of 1070 a second.
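Since xact_commit is a cumulative counter, a tps estimate is just the delta between two samples divided by the sampling interval; a small sketch (the counter values are made up for illustration):

```python
# Sketch: deriving a tps estimate from two samples of xact_commit
# (pg_stat_database) taken interval_s seconds apart.  The counter is
# cumulative, so tps is simply the delta over the interval.

def tps_from_samples(commit_before, commit_after, interval_s):
    """Approximate transactions/sec between two xact_commit readings."""
    return (commit_after - commit_before) / interval_s

# e.g. a counter that advances by 10700 commits over a 10 second window
print(tps_from_samples(1_234_000, 1_244_700, 10))  # -> 1070.0
```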

I have an identical PITR slave I can pause the PITR sync on to run any
test against.  I will happily provide any additional information that
would be helpful.

Any assistance is greatly appreciated.

Joe Uhl

From:
Alan Hodgson
Date:

On Monday 16 March 2009, Joe Uhl <> wrote:
> Right now (not under peak load) this server is running at 68% CPU
> utilization and its SATA raid 10 is doing about 2MB/s writes and 11MB/
> s reads.  When I run dd I can hit 200+MB/s writes and 230+ MB/s reads,
> so we are barely using the available IO.  Further when I run dd the
> CPU utilization of that process only approaches 20%-30% of one core.

What does vmstat say when it's slow? The output of "vmstat 1 30" would be
informative.

note: dd is sequential I/O. Normal database usage is random I/O.

--
Even a sixth-grader can figure out that you can’t borrow money to pay off
your debt

From:
Joe Uhl
Date:

Here is vmstat 1 30.  We are under peak load right now so I can gather
information from the real deal :)

Had an almost complete lockup a moment ago, number of non-idle
postgres connections was 637.  Going to drop our JDBC pool sizes a bit
and bounce everything.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
12 35  95056 11102380  56856 14954948    3    4   669   541    1    2 23  3 54 19
12 39  95056 11092484  56876 14963204    0    0  6740  1204 10066 13277 91  5  0  4
 8 42  95056 11081712  56888 14972244    0    0  8620  1168 10659 17020 78  6  0 15
10 30  95052 11069768  56904 14982628    0    0  8944   976 9809 15109 81  6  1 12
 4 27  95048 11059576  56916 14991296    0    0  8852   440 7652 13294 63  4  2 32
 5 42  95048 11052524  56932 14996496    0    0  4700   384 6383 11249 64  4  4 28
 5 33  95048 11047492  56956 15001428    0    0  3852   572 6029 14010 36  4  5 56
 7 35  95048 11041184  56960 15005480    0    0  3964   136 5042 10802 40  3  1 56
 1 33  95048 11037988  56968 15009240    0    0  3892   168 3384 6479 26  1  3 69
 3 28  95048 11029332  56980 15015744    0    0  6724   152 4964 12844 11  2  8 79
 0 34  95048 11025880  56988 15020168    0    0  3852   160 3616 8614 11  1  6 82
 3 25  95048 10996356  57044 15044796    0    0  7892   456 3126 7115  4  3  8 85
 1 26  95048 10991692  57052 15050100    0    0  5188   176 2566 5976  3  2 12 83
 0 29  95048 10985408  57060 15054968    0    0  4200    80 2586 6582  4  1 12 83
 1 29  95048 10980828  57064 15058992    0    0  4560    64 2966 7557  7  2  6 85
 2 28  95048 10977192  57072 15063176    0    0  3860    72 2695 6742 11  1  7 81
 2 29  95048 10969120  57088 15067808    0    0  5084    84 3296 8067 14  1  0 84
 0 25  95048 10962096  57104 15072984    0    0  4440   500 2721 6263 12  1  6 80
 0 23  95044 10955320  57108 15079260    0    0  5712   232 2678 5990  6  1  6 87
 2 25  95044 10948644  57120 15084524    0    0  5120   184 3499 8143 20  3  9 69
 3 21  95044 10939744  57128 15090644    0    0  5756   264 4724 10272 32  3  5 60
 1 19  95040 10933196  57144 15095024   12    0  4440   180 2585 5244 13  2 15 70
 0 21  95040 10927596  57148 15098684    0    0  3248   136 2973 7292  8  1  9 81
 1 20  95040 10920708  57164 15104244    0    0  5192   360 1865 4547  3  1  9 87
 1 24  95040 10914552  57172 15105856    0    0  2308    16 1948 4450  6  1  1 93
 0 24  95036 10909148  57176 15110240    0    0  3824   152 1330 2632  3  1  6 90
 1 21  95036 10900628  57192 15116332    0    0  5680   180 1898 3986  4  1 11 84
 0 19  95036 10888356  57200 15121736    0    0  5952   120 2252 3991 12  1  8 79
 2 22  95036 10874336  57204 15128252    0    0  6320   112 2831 6755  5  2  8 85
 3 26  95036 10857592  57220 15134020    0    0  5124   216 3067 5296 32  6  3 59

Alan, my apologies if you get this twice.  Didn't reply back to the
list on first try.

On Mar 16, 2009, at 3:52 PM, Alan Hodgson wrote:

> What does vmstat say when it's slow? The output of "vmstat 1 30" would be
> informative.
>
> note: dd is sequential I/O. Normal database usage is random I/O.
>
> --
> Sent via pgsql-performance mailing list
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


From:
Greg Smith
Date:

On Mon, 16 Mar 2009, Joe Uhl wrote:

> Here is vmstat 1 30.  We are under peak load right now so I can gather
> information from the real deal

Quite helpful, reformatting a bit and picking an informative section:

procs -----------memory----------    ---swap- ----io--- -system-- ----cpu----
r  b   swpd     free   buff  cache   si   so   bi    bo   in   cs us sy id wa
0 34  95048 11025880  56988 15020168 0    0  3852   160 3616 8614 11  1  6 82
3 25  95048 10996356  57044 15044796 0    0  7892   456 3126 7115  4  3  8 85
1 26  95048 10991692  57052 15050100 0    0  5188   176 2566 5976  3  2 12 83

This says that your server is spending all its time waiting for I/O,
actual CPU utilization is minimal.  You're only achieving around 3-8MB/s
of random I/O.  That's the reality of what your disk I/O subsystem is
capable of, regardless of what its sequential performance with dd looks
like.  If you were to run a more complicated benchmark like bonnie++
instead, I'd bet that your "seeks/second" results are very low, even
though sequential read/write is fine.

The Perc5 controllers have a pretty bad reputation for performance on this
list, even in RAID10.  Not much you can do about that beyond scrapping the
controller and getting a better one.

What you might do in order to reduce the total number of writes needed is
some standard postgresql.conf tuning; see
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

What you could do here is increase shared_buffers, checkpoint_segments,
and checkpoint_completion_target as described there.  Having more buffers
dedicated to the database and having fewer checkpoints can result in less
random I/O showing up, as popular data pages will stay in RAM longer
without getting written out so much.
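Concretely, those suggestions might look like this in postgresql.conf (illustrative starting values for an 8.3 server with 32 GB RAM, not tuned recommendations):

```
# Illustrative postgresql.conf deltas -- starting points only.
shared_buffers = 8GB                  # keep more hot pages in the database cache
checkpoint_segments = 64              # fewer, larger checkpoints (8.3-era knob;
                                      # superseded by max_wal_size in 9.5+)
checkpoint_completion_target = 0.9    # spread checkpoint writes over the interval
```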

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Joe Uhl
Date:

I dropped the pool sizes and brought things back up.  Things are
stable, the site is fast, and CPU utilization is still high.  Probably
just a matter of time before the issue comes back (we get slammed as
kids get out of school in the US).

Now when I run vmstat 1 30 it looks very different (below).  Waiting
is minimal, user is very high.  Under nontrivial load, according to
xact_commit in pg_stat_database we are doing 1800+ tps.

Appreciate the input and explanation on vmstat.  I am going to throw
some of these numbers into zabbix so I can keep a better eye on them.
This server is a couple years old so the purchase of a new controller
and/or disks is not out of the question.

On a final note, I have made several changes to postgresql.conf.  Some
of those here:
max_connections = 1000
shared_buffers = 7680MB
work_mem = 30MB
synchronous_commit = off
checkpoint_segments = 50
effective_cache_size = 20000MB

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 9  8  73036 500164  82200 23497748    3    4   669   541    1    1 23  3 54 19
20  4  73036 497252  82200 23500836    0    0  2500   680 11145 15168 91  4  2  2
21  1  73036 491416  82204 23503832    0    0  1916   920 10303 14032 94  4  1  1
23  5  73036 489580  82212 23505860    0    0  1348  3296 11682 15970 94  5  1  0
31  1  73036 481408  82220 23507752    0    0   984  8988 10123 11289 97  3  0  0
25  4  73036 483248  82232 23509420    0    0  1268  1312 10705 14063 96  4  0  0
23  4  73036 480096  82232 23512380    0    0  2372   472 9805 13996 94  5  1  1
24  4  73036 476732  82236 23515196    0    0  2012   720 10365 14307 96  3  1  0
22  1  73036 474468  82236 23516584    0    0   944  3108 9838 12831 95  4  1  0
14  1  73036 455756  82284 23534548    0    0   908  3284 9096 11333 94  4  1  0
10  2  73036 455224  82292 23536304    0    0  1760   416 12454 17736 89  6  3  2
17  0  73036 460620  82292 23538888    0    0  1292   968 12030 18333 90  7  2  1
13  4  73036 459764  82292 23539724    0    0   332   288 9722 14197 92  5  2  1
17  5  73036 457516  82292 23542176    0    0  1872 17752 10458 15465 91  5  2  1
19  4  73036 450804  82300 23545640    0    0  2980   640 10602 15621 90  6  2  2
24  0  73036 447660  82312 23547644    0    0  1736 10724 12401 15413 93  6  1  0
20  6  73036 444380  82320 23550692    0    0  2064   476 9008 10985 94  4  1  0
22  2  73036 442880  82328 23553640    0    0  2496  3156 10739 15211 93  5  1  1
11  1  73036 441448  82328 23555632    0    0  1452  3552 10812 15337 93  5  2  1
 6  2  73036 439812  82348 23557420    0    0  1052  1128 8603 10514 91  3  3  2
 6  3  73036 433456  82348 23560860    0    0  2484   656 7636 13033 68  4 14 14
 6  3  73036 433084  82348 23562628    0    0  1400   408 6046 11778 70  3 18  9
 5  0  73036 430776  82356 23564264    0    0  1108  1300 7549 13754 73  4 16  7
 5  2  73036 430124  82360 23565580    0    0  1016  2216 7844 14507 72  4 18  6
 4  2  73036 429652  82380 23567480    0    0  1168  2468 7694 15466 58  4 24 14
 6  2  73036 427304  82384 23569668    0    0  1132   752 5993 13606 49  5 36 10
 7  1  73036 423020  82384 23571932    0    0  1244   824 8085 18072 56  3 30 10
 4  0  73036 420816  82392 23573824    0    0  1292   820 5370 10958 46  2 41 10
 9  1  73020 418048  82392 23576900   52    0  1632  2592 5931 11629 60  3 29  8
 4  2  73004 415164  82424 23578620   56    0  1812  4116 7503 14674 71  3 15 12

On Mar 16, 2009, at 4:19 PM, Dave Youatt wrote:
> Last column "wa" is % cpu time spent waiting (for IO to complete).
> 80s
> and 90s is pretty high, probably too high.
>
> Might also want to measure the IO/s performance of your RAID
> controller.  From the descriptions, it will be much more important
> than long sequential reads/writes for characterizing your workload.
>
> There are also some disappointing HW RAID controllers out there.
> Generally, Areca and Promise are good, Adaptec good, depending on
> model, and the ones that Dell ships w/their servers haven't had good
> reviews/reports.
>
>
> On 03/16/2009 01:10 PM, Joe Uhl wrote:
>> Here is vmstat 1 30.  We are under peak load right now so I can
>> gather information from the real deal :)
>> [...]

On Mar 16, 2009, at 4:35 PM, Greg Smith wrote:

> This says that your server is spending all its time waiting for I/O,
> actual CPU utilization is minimal.  You're only achieving around
> 3-8MB/s of random I/O.  [...]
>
> The Perc5 controllers have a pretty bad reputation for performance
> on this list, even in RAID10.  [...]


From:
Scott Marlowe
Date:

On Mon, Mar 16, 2009 at 2:50 PM, Joe Uhl <> wrote:
> I dropped the pool sizes and brought things back up.  Things are stable,
> site is fast, CPU utilization is still high.  Probably just a matter of time
> before issue comes back (we get slammed as kids get out of school in the
> US).

Yeah, I'm guessing your server (or more specifically its RAID card)
just isn't up to the task.  We had the same problem last year with a
machine with 16 GB RAM and dual dual-core 3.0GHz Xeons with a Perc 5
something or other.  No matter how we tuned it or played with it, we
just couldn't get good random performance out of it.  It's since been
replaced by a white-box unit with a Tyan mobo, dual 4-core Opterons,
an Areca 1680, and a 12-drive RAID-10.  We can sustain 30 to 60 megs
a second random access with 0 to 10% iowait.

Here's a typical vmstat 10 output when our load factor is hovering around 8...
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  1    460 170812  92856 29928156    0    0   604  3986 4863 10146 74  3 20  3  0
 7  1    460 124160  92912 29939660    0    0   812  5701 4829 9733 70  3 23  3  0
13  0    460 211036  92984 29947636    0    0   589  3178 4429 9964 69  3 25  3  0
 7  2    460  90968  93068 29963368    0    0  1067  4463 4915 11081 78  3 14  5  0
 7  3    460 115216  93100 29963336    0    0  3008  3197 4032 11812 69  4 15 12  0
 6  1    460 142120  93088 29923736    0    0  1112  6390 4991 11023 75  4 15  6  0
 6  0    460 157896  93208 29932576    0    0   698  2196 4151 8877 71  2 23  3  0
11  0    460 124868  93296 29948824    0    0   963  3645 4891 10382 74  3 19  4  0
 5  3    460  95960  93272 29918064    0    0   592 30055 5550 7430 56  3 18 23  0
 9  0    460  95408  93196 29914556    0    0  1090  3522 4463 10421 71  3 21  5  0
 9  0    460 128632  93176 29916412    0    0   883  4774 4757 10378 76  4 17  3  0

Note the bursty parts where we're shoving out 30 megs a second and the
wait jumps to 23%.  That's about as bad as it gets during the day for
us.  Note that in your vmstat output your bi column appears to be
dominating your bo column, so it looks like you're reaching a point
where the write cache on the controller gets full and your real
throughput is shown to be ~1 megabyte a second outbound, and the
inbound traffic either has priority or is just filling in the gaps.
It looks to me like your RAID card is prioritizing reads over writes,
and the whole system is just slowing to a crawl.  I'm willing to bet
that if you were running pure SW RAID with no RAID controller you'd
get better numbers.

From:
Gregory Stark
Date:

Greg Smith <> writes:

> The Perc5 controllers have a pretty bad reputation for performance on this
> list, even in RAID10.  Not much you can do about that beyond scrapping the
> controller and getting a better one.

Hm, well the tests I ran for posix_fadvise were actually on a Perc5 -- though
who knows if it was the same under the hood -- and I saw better performance
than this. I saw about 4MB/s for a single drive and up to about 35MB/s for 15
drives. However this was using linux md raid-0, not hardware raid.

But you shouldn't get your hopes up too much for random i/o. 3-8MB/s
seems low, but consider the following:

 $ units
 2445 units, 71 prefixes, 33 nonlinear units

 You have: 8kB / .5|7200min
 You want: MB/s
         * 1.92
         / 0.52083333
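The units(1) estimate above can be reproduced in plain arithmetic: one 8 kB page read per average rotational latency (half a revolution at 7200 rpm) bounds a single drive's worst-case random throughput. A sketch:

```python
# Sketch of the per-drive random-I/O bound: one 8 kB page per average
# rotational latency (half a revolution at 7200 rpm).

def random_mb_per_s(page_bytes=8_000, rpm=7_200):
    seconds_per_rev = 60.0 / rpm           # 1/120 s per revolution
    avg_latency = seconds_per_rev / 2      # half a rotation on average
    return page_bytes / avg_latency / 1e6  # bytes/s -> MB/s

print(round(random_mb_per_s(), 2))  # -> 1.92 MB/s per drive, matching units(1)
```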

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

From:
Greg Smith
Date:

On Mon, 16 Mar 2009, Joe Uhl wrote:

> Now when I run vmstat 1 30 it looks very different (below).

That looks much better.  Obviously you'd like some more headroom on the
CPU situation than you're seeing, but that's way better than having so
much time spent waiting for I/O.

> max_connections = 1000
> work_mem = 30MB

Be warned that you need to be careful with this combination.  If all 1000
connections were to sort something at once, you could end up with >30GB
worth of RAM used for that purpose.  It's probably quite unlikely that
will happen, but 30MB is on the high side with that many connections.
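The arithmetic behind that warning, as a quick sketch (and the real ceiling is higher still, since each query can allocate work_mem once per sort or hash node):

```python
# Worst case if every connection used its full work_mem for one sort at once.
max_connections = 1000
work_mem_mb = 30

worst_case_mb = max_connections * work_mem_mb
print(worst_case_mb)  # -> 30000 MB, i.e. roughly 30 GB -- and a single query
                      #    can allocate work_mem multiple times, so ">30GB"
```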

I wonder if your pool might work better, in terms of lowering total CPU
usage, if you reduced the number of incoming connections.  Each connection
adds some overhead, and now that you've got the I/O situation under better
control you might get by with fewer simultaneous connections.  Something
to consider.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Greg Smith
Date:

On Tue, 17 Mar 2009, Gregory Stark wrote:

> Hm, well the tests I ran for posix_fadvise were actually on a Perc5 -- though
> who knows if it was the same under the hood -- and I saw better performance
> than this. I saw about 4MB/s for a single drive and up to about 35MB/s for 15
> drives. However this was using linux md raid-0, not hardware raid.

Right, it's the hardware RAID on the Perc5 I think people mainly complain
about.  If you use it in JBOD mode and let the higher performance CPU in
your main system drive the RAID functions it's not so bad.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Joe Uhl
Date:

On Mar 17, 2009, at 12:19 AM, Greg Smith wrote:

> On Tue, 17 Mar 2009, Gregory Stark wrote:
>
>> Hm, well the tests I ran for posix_fadvise were actually on a Perc5
>> -- though
>> who knows if it was the same under the hood -- and I saw better
>> performance
>> than this. I saw about 4MB/s for a single drive and up to about
>> 35MB/s for 15
>> drives. However this was using linux md raid-0, not hardware raid.
>
> Right, it's the hardware RAID on the Perc5 I think people mainly
> complain about.  If you use it in JBOD mode and let the higher
> performance CPU in your main system drive the RAID functions it's
> not so bad.
>
> --
> * Greg Smith  http://www.gregsmith.com
> Baltimore, MD

I have not yet had a chance to try software raid on the standby server
(still planning to) but wanted to follow up to see if there was any
good way to figure out what the postgresql processes are spending
their CPU time on.

We are under peak load right now, and I have Zabbix plotting CPU
utilization and CPU wait (from vmstat output) along with all sorts of
other vitals on charts.  CPU utilization is a sustained 90% - 95% and
CPU Wait is hanging below 10%.  Since being pointed at vmstat by this
list I have been watching CPU Wait and it does get high at times
(hence still wanting to try Perc5 in JBOD) but then there are
sustained periods, right now included, where our CPUs are just getting
crushed while wait and IO (only doing about 1.5 MB/sec right now) are
very low.

This high CPU utilization only occurs when under peak load and when
our JDBC pools are fully loaded.  We are moving more things into our
cache and constantly tuning indexes/tables but just want to see if
there is some underlying cause that is killing us.

Any recommendations for figuring out what our database is spending its
CPU time on?

From:
Scott Marlowe
Date:

On Fri, Mar 20, 2009 at 2:26 PM, Joe Uhl <> wrote:
> [...]
>
> Any recommendations for figuring out what our database is spending its CPU
> time on?

What does the cs entry on vmstat say at this time?  If your cs is
skyrocketing then you're getting a context switch storm, which is
usually a sign that there are just too many things going on at once,
or you've got an old kernel - things like that.

From:
Joe Uhl
Date:

On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:

> On Fri, Mar 20, 2009 at 2:26 PM, Joe Uhl <> wrote:
>> On Mar 17, 2009, at 12:19 AM, Greg Smith wrote:
>>
>>> On Tue, 17 Mar 2009, Gregory Stark wrote:
>>>
>>>> Hm, well the tests I ran for posix_fadvise were actually on a
>>>> Perc5 --
>>>> though
>>>> who knows if it was the same under the hood -- and I saw better
>>>> performance
>>>> than this. I saw about 4MB/s for a single drive and up to about
>>>> 35MB/s
>>>> for 15
>>>> drives. However this was using linux md raid-0, not hardware raid.
>>>
>>> Right, it's the hardware RAID on the Perc5 I think people mainly
>>> complain
>>> about.  If you use it in JBOD mode and let the higher performance
>>> CPU in
>>> your main system drive the RAID functions it's not so bad.
>>>
>>> --
>>> * Greg Smith  http://www.gregsmith.com
>>> Baltimore, MD
>>
>> I have not yet had a chance to try software raid on the standby
>> server
>> (still planning to) but wanted to follow up to see if there was any
>> good way
>> to figure out what the postgresql processes are spending their CPU
>> time on.
>>
>> We are under peak load right now, and I have Zabbix plotting CPU
>> utilization
>> and CPU wait (from vmstat output) along with all sorts of other
>> vitals on
>> charts.  CPU utilization is a sustained 90% - 95% and CPU Wait is
>> hanging
>> below 10%.  Since being pointed at vmstat by this list I have been
>> watching
>> CPU Wait and it does get high at times (hence still wanting to try
>> Perc5 in
>> JBOD) but then there are sustained periods, right now included,
>> where our
>> CPUs are just getting crushed while wait and IO (only doing about
>> 1.5 MB/sec
>> right now) are very low.
>>
>> This high CPU utilization only occurs when under peak load and when
>> our JDBC
>> pools are fully loaded.  We are moving more things into our cache and
>> constantly tuning indexes/tables but just want to see if there is
>> some
>> underlying cause that is killing us.
>>
>> Any recommendations for figuring out what our database is spending
>> its CPU
>> time on?
>
> What does the cs entry on vmstat say at this time?  If you're cs is
> skyrocketing then you're getting a context switch storm, which is
> usually a sign that there are just too many things going on at once /
> you've got an old kernel things like that.

cs column (plus cpu columns) of vmstat 1 30 reads as follows:

cs    us  sy id wa
11172 95  4  1  0
12498 94  5  1  0
14121 91  7  1  1
11310 90  7  1  1
12918 92  6  1  1
10613 93  6  1  1
9382  94  4  1  1
14023 89  8  2  1
10138 92  6  1  1
11932 94  4  1  1
15948 93  5  2  1
12919 92  5  3  1
10879 93  4  2  1
14014 94  5  1  1
9083  92  6  2  0
11178 94  4  2  0
10717 94  5  1  0
9279  97  2  1  0
12673 94  5  1  0
8058  82 17  1  1
8150  94  5  1  1
11334 93  6  0  0
13884 91  8  1  0
10159 92  7  0  0
9382  96  4  0  0
11450 95  4  1  0
11947 96  3  1  0
8616  95  4  1  0
10717 95  3  1  0

We are running the 2.6.28.7-2 kernel.  I am unfamiliar with vmstat
output, but reading the man page (and seeing that cs = "context
switches per second") makes my numbers seem very high.
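For anyone wanting to sanity-check their own numbers, here is a quick
sketch for averaging the cs column out of a saved capture (the three
sample rows are lifted from the output above; the filename is
arbitrary):

```shell
# Average the cs (context switches/sec) column of a saved vmstat capture.
# The sample rows below are the first three from the table posted above;
# in practice you would capture real output with: vmstat 1 30 > vmstat_sample.txt
cat > vmstat_sample.txt <<'EOF'
11172 95  4  1  0
12498 94  5  1  0
14121 91  7  1  1
EOF
awk '{ sum += $1; n++ } END { printf "avg cs: %d\n", sum / n }' vmstat_sample.txt
# prints "avg cs: 12597"
```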

Our JDBC pools combined currently top out at 400 connections (and we
are doing work on all 400 right now).  I may try dropping those pools
down even smaller.  Are there any general rules of thumb for figuring
out how many connections you should service at maximum?  I know about
the memory constraints, but I'm thinking more along the lines of
connections per CPU core.


From:
Scott Marlowe
Date:

On Fri, Mar 20, 2009 at 2:49 PM, Joe Uhl <> wrote:
>
> On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:

>> What does the cs entry on vmstat say at this time?  If you're cs is
>> skyrocketing then you're getting a context switch storm, which is
>> usually a sign that there are just too many things going on at once /
>> you've got an old kernel things like that.
>
> cs column (plus cpu columns) of vmstat 1 30 reads as follows:
>
> [30 lines of vmstat output snipped; full table above]
>
> We are running on 2.6.28.7-2 kernel.  I am unfamiliar with vmstat output but
> reading the man page (and that cs = "context switches per second") makes my
> numbers seem very high.

No, those aren't really all that high.  If you were hitting cs
contention, I'd expect it to be in the 25k to 100k range.  <10k
average under load is pretty reasonable.

> Our sum JDBC pools currently top out at 400 connections (and we are doing
> work on all 400 right now).  I may try dropping those pools down even
> smaller. Are there any general rules of thumb for figuring out how many
> connections you should service at maximum?  I know of the memory
> constraints, but thinking more along the lines of connections per CPU core.

Well, maximum efficiency is usually somewhere in the range of 1 to 2
times the number of cores you have, so trying to get the pool down to
a dozen or two connections would generally be the direction to head.
That may not be reasonable or doable, though.
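Sketching that rule of thumb against the hardware described in this
thread (8 cores, a 400-connection pool):

```shell
# Rule-of-thumb pool sizing: 1x to 2x the number of CPU cores.
# cores=8 matches the dual quad-core server in this thread;
# current=400 is the combined JDBC pool size mentioned above.
cores=8
current=400
echo "suggested pool: $((cores * 1))-$((cores * 2)) connections (current: $current)"
# prints "suggested pool: 8-16 connections (current: 400)"
```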

From:
Joe Uhl
Date:

On Mar 20, 2009, at 4:58 PM, Scott Marlowe wrote:

> On Fri, Mar 20, 2009 at 2:49 PM, Joe Uhl <> wrote:
>>
>> On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:
>
>>> What does the cs entry on vmstat say at this time?  If you're cs is
>>> skyrocketing then you're getting a context switch storm, which is
>>> usually a sign that there are just too many things going on at
>>> once /
>>> you've got an old kernel things like that.
>>
>> cs column (plus cpu columns) of vmstat 1 30 reads as follows:
>>
>> [30 lines of vmstat output snipped; full table above]
>>
>> We are running on 2.6.28.7-2 kernel.  I am unfamiliar with vmstat
>> output but
>> reading the man page (and that cs = "context switches per second")
>> makes my
>> numbers seem very high.
>
> No, those aren't really all that high.  If you were hitting cs
> contention, I'd expect it to be in the 25k to 100k range.  <10k
> average under load is pretty reasonable.
>
>> Our sum JDBC pools currently top out at 400 connections (and we are
>> doing
>> work on all 400 right now).  I may try dropping those pools down even
>> smaller. Are there any general rules of thumb for figuring out how
>> many
>> connections you should service at maximum?  I know of the memory
>> constraints, but thinking more along the lines of connections per
>> CPU core.
>
> Well, maximum efficiency is usually somewhere in the range of 1 to 2
> times the number of cores you have, so trying to get the pool down to
> a dozen or two connections would be the direction to generally head.
> May not be reasonable or doable though.

Thanks for the info.  Figure I can tune our pools down and monitor
throughput/CPU/IO and look for a sweet spot with our existing
hardware.  Just wanted to see if tuning connections down could
potentially help.

I feel as though we are going to have to replicate this DB before too
long.  We've got an almost identical server doing nothing but PITR,
with 8 CPU cores mostly idle, that could be better spent.  Our
pgfouine reports, though we only log queries that take over 1 second,
show 90% reads.

I have heard much about Slony, but has anyone used the newer version
of Mammoth Replicator (now apparently called PostgreSQL + Replication)
on 8.3?  From the documentation it appears to be easier to set up and
less invasive, but I struggle to find usage information/stories
online.


From:
Joe Uhl
Date:

On Mar 20, 2009, at 4:58 PM, Scott Marlowe wrote:

> On Fri, Mar 20, 2009 at 2:49 PM, Joe Uhl <> wrote:
>>
>> On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:
>
>>> What does the cs entry on vmstat say at this time?  If you're cs is
>>> skyrocketing then you're getting a context switch storm, which is
>>> usually a sign that there are just too many things going on at
>>> once /
>>> you've got an old kernel things like that.
>>
>> cs column (plus cpu columns) of vmstat 1 30 reads as follows:
>>
>> [30 lines of vmstat output snipped; full table above]
>>
>> We are running on 2.6.28.7-2 kernel.  I am unfamiliar with vmstat
>> output but
>> reading the man page (and that cs = "context switches per second")
>> makes my
>> numbers seem very high.
>
> No, those aren't really all that high.  If you were hitting cs
> contention, I'd expect it to be in the 25k to 100k range.  <10k
> average under load is pretty reasonable.
>
>> Our sum JDBC pools currently top out at 400 connections (and we are
>> doing
>> work on all 400 right now).  I may try dropping those pools down even
>> smaller. Are there any general rules of thumb for figuring out how
>> many
>> connections you should service at maximum?  I know of the memory
>> constraints, but thinking more along the lines of connections per
>> CPU core.
>
> Well, maximum efficiency is usually somewhere in the range of 1 to 2
> times the number of cores you have, so trying to get the pool down to
> a dozen or two connections would be the direction to generally head.
> May not be reasonable or doable though.

Turns out we may have an opportunity to purchase a new database server
to handle this increased load.  It seems the best route, based on
feedback to this thread, is to go whitebox, get quad Opterons, and get
a very good disk controller.

Can anyone recommend a whitebox vendor?

Is there a current controller anyone on this list has experience with
that they could recommend?

This will be a bigger purchase so will be doing research and
benchmarking but any general pointers to a vendor/controller greatly
appreciated.



From:
Greg Smith
Date:

On Tue, 24 Mar 2009, Joe Uhl wrote:

> Can anyone recommend a whitebox vendor?

I dumped a list of recommended vendors from a discussion here a while
back at http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks that
you could get started with.

--
* Greg Smith  http://www.gregsmith.com
Baltimore, MD

From:
Scott Marlowe
Date:

On Tue, Mar 24, 2009 at 1:29 PM, Greg Smith <> wrote:
> On Tue, 24 Mar 2009, Joe Uhl wrote:
>
>> Can anyone recommend a whitebox vendor?
>
> I dumped a list of recommended vendors from a discussion here a while back
> at http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks you could get
> started with.

I'd add Aberdeen Inc to that list.  They supply quality white box
servers with 3ware, areca, or LSI controllers, and provide a 5 year
all inclusive warranty.  Their customer service is top notch too.

From:
Ron
Date:

At 02:47 PM 3/24/2009, Joe Uhl wrote:

>Turns out we may have an opportunity to purchase a new database
>server with this increased load.  Seems that the best route, based
>on feedback to this thread, is to go whitebox, get quad opterons,
>and get a very good disk controller.
>
>Can anyone recommend a whitebox vendor?
I'll 2nd the Aberdeen recommendation.  I'll add Pogolinux to that list as well.


>Is there a current controller anyone on this list has experience
>with that they could recommend?
The 2 best performing RAID controller vendors at this time are AMCC
(AKA 3ware) and Areca.
In general, the 8+ port Arecas with their battery-backed cache maxed
out outperform every other controller available.


>This will be a bigger purchase so will be doing research and
>benchmarking but any general pointers to a vendor/controller greatly
>appreciated.

Be =very= careful to thoroughly bench both the AMD and Intel CPU
options.  It is far from clear which is the better purchase.

I'd be very interested to see the results of your research and
benchmarks posted here on pgsql-performance.

Ron Peacetree


From:
Scott Marlowe
Date:

On Tue, Mar 24, 2009 at 4:58 PM, Ron <> wrote:
> At 02:47 PM 3/24/2009, Joe Uhl wrote:
>
>> Turns out we may have an opportunity to purchase a new database server
>> with this increased load.  Seems that the best route, based on feedback to
>> this thread, is to go whitebox, get quad opterons, and get a very good disk
>> controller.
>>
>> Can anyone recommend a whitebox vendor?
>
> I'll 2nd the Aberdeen recommendation.  I'll add Pogolinux to that list as
> well.
>
>
>> Is there a current controller anyone on this list has experience with that
>> they could recommend?
>
> The 2 best performing RAID controller vendors at this time are AMCC (AKA
> 3Ware) and Areca.
> In general, the 8+ port Areca's with their BB cache maxed outperform every
> other controller available.
>
>
>> This will be a bigger purchase so will be doing research and benchmarking
>> but any general pointers to a vendor/controller greatly appreciated.
>
> Be =very= careful to thoroughly bench both the AMD and Intel CPU options.
>  It is far from clear which is the better purchase.

My anecdotal experience has been that the Opterons stay afloat longer
as load increases, but I haven't had machines with similar enough
hardware to really test that.

> I'd be very interested to see the results of your research and benchmarks
> posted here on pgsql-performance.

Me too.  I'm gonna spend some time this summer benchmarking and tuning
the database servers that I pretty much had to burn in and put in
production this year due to time pressures.

From:
Scott Carey
Date:

On 3/24/09 4:16 PM, "Scott Marlowe" <> wrote:

> On Tue, Mar 24, 2009 at 4:58 PM, Ron <> wrote:
>> At 02:47 PM 3/24/2009, Joe Uhl wrote:
>>
>>> Turns out we may have an opportunity to purchase a new database server
>>> with this increased load.  Seems that the best route, based on feedback to
>>> this thread, is to go whitebox, get quad opterons, and get a very good disk
>>> controller.
>>>
>>> Can anyone recommend a whitebox vendor?
>>
>> I'll 2nd the Aberdeen recommendation.  I'll add Pogolinux to that list as
>> well.
>>
>>
>>> Is there a current controller anyone on this list has experience with that
>>> they could recommend?
>>
>> The 2 best performing RAID controller vendors at this time are AMCC (AKA
>> 3Ware) and Areca.
>> In general, the 8+ port Areca's with their BB cache maxed outperform every
>> other controller available.

I personally have had rather bad performance experiences with 3Ware
9550/9650 SATA cards.  I have no experience with the AMCC SAS stuff though.
Adaptec demolished the 9650 on arrays larger than 4 drives, and Areca will
do better at the very high end.

However, if CPU is the issue for this particular case, then the RAID
controller details are less significant.

I don't know how much data you have, but don't forget the option of SSDs, or
a mix of hard drives and SSDs for different data.  Ideally, you would want
the OS to just extend its page cache onto an SSD, but only OpenSolaris can do
that right now, and it is rather new (it doesn't yet persist across reboots).

http://blogs.sun.com/brendan/entry/test
http://blogs.sun.com/brendan/entry/l2arc_screenshots
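For the curious, the OpenSolaris mechanism referred to is ZFS's L2ARC.
As a rough sketch (the pool name and device name below are
hypothetical; substitute your own), attaching an SSD as a cache device
is a one-liner:

```shell
# Sketch: extend the ZFS ARC onto an SSD (L2ARC) under OpenSolaris.
# "tank" and c1t2d0 are made-up names for an existing pool and an SSD.
zpool add tank cache c1t2d0
zpool iostat -v tank   # the cache device appears in its own section
```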


>>
>>
>>> This will be a bigger purchase so will be doing research and benchmarking
>>> but any general pointers to a vendor/controller greatly appreciated.
>>
>> Be =very= careful to thoroughly bench both the AMD and Intel CPU options.
>>  It is far from clear which is the better purchase.
>
> My anecdotal experience has been that the Opterons stay afloat longer
> as load increases, but I haven't had machines with similar enough
> hardware to really test that.
>

One may want to note that Intel's next generation servers are due out within
45 days, from what I can tell ('Q2' traditionally means ~April 1 for Intel
when on time).  These should be a rather significant bump for a database, as
they adopt the AMD/Alpha style memory-controller-on-CPU architecture and
add a lot of cache.  Other relevant improvements: increased performance on
compare-and-swap operations, the return of Hyper-Threading, and ridiculous
memory bandwidth per CPU (3 DDR3 memory channels per CPU).

>> I'd be very interested to see the results of your research and benchmarks
>> posted here on pgsql-performance.
>
> Me too.  I'm gonna spend some time this summer benchmarking and tuning
> the database servers that I pretty much had to burn in and put in
> production this year due to time pressures.
>