Re: Two identical systems, radically different performance - Mailing list pgsql-performance

From: Evgeny Shishkin
Subject: Re: Two identical systems, radically different performance
Msg-id: B5EA62D3-A446-4A1F-BC4E-4B8290FEE1C2@gmail.com
In response to: Two identical systems, radically different performance  (Craig James <cjames@emolecules.com>)
List: pgsql-performance

On Oct 9, 2012, at 1:45 AM, Craig James <cjames@emolecules.com> wrote:

This is driving me crazy.  A new server, virtually identical to an old one, gets only 50% of the old one's pgbench performance.  I've checked everything I can think of.

The setups (call the servers "old" and "new"):

old: 2 x 4-core Intel Xeon E5620
new: 4 x 4-core Intel Xeon E5606

both:

  Memory: 12 GB DDR ECC
  Disks: 12x500GB disks (Western Digital 7200RPM SATA)
    2 disks, RAID1: OS (ext4) and postgres xlog (ext2)
    8 disks, RAID10: $PGDATA

  3WARE 9650SE-12ML with battery-backed cache.  The admin tool (tw_cli)
  indicates that the battery is charged and the cache is working on both units.

  Linux: 2.6.32-41-server #94-Ubuntu SMP (new server's disk was
  actually cloned from old server).


  Postgres: 8.4.4 (yes, I should update.  But both are identical.)
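
For reference, the controller and BBU state mentioned above can be compared across the two machines with tw_cli.  A minimal sketch; the /c0 and /u0 object names are assumptions, so check what "tw_cli show" actually lists:

    tw_cli show                  # list controllers and their IDs
    tw_cli /c0/bbu show all      # battery status (should be charged/OK on both)
    tw_cli /c0/u0 show all       # cache policy and state of the RAID10 unit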

The postgresql.conf files are identical; the diffs from the stock configuration are:

    max_connections = 500
    shared_buffers = 1000MB
    work_mem = 128MB
    synchronous_commit = off
    full_page_writes = off
    wal_buffers = 256kB
    checkpoint_segments = 30
    effective_cache_size = 4GB
    track_activities = on
    track_counts = on
    track_functions = none
    autovacuum = on
    autovacuum_naptime = 5min
    escape_string_warning = off
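
A quick way to verify that the running settings really do match is to dump pg_settings on each server and diff the results (the user name comes from the pgbench commands below; the output file names are placeholders):

    psql -U test -Atc "SELECT name || '=' || setting FROM pg_settings ORDER BY name" > old.settings
    # run the same command on the new server, then:
    diff old.settings new.settings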

Note that the old server is in production and was serving a light load while this test was running, so in theory it should be slower, not faster, than the new server.

pgbench: Old server

    pgbench -i -s 100 -U test
    pgbench -U test -c ... -t ...
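
The -c/-t pairs below keep the total transaction count constant at 100,000, so as a sketch the sweep can be scripted like this:

    for c in 5 10 20 30 40 50; do
        pgbench -U test -c $c -t $((100000 / c))
    done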

    -c  -t      TPS
     5  20000  3777
    10  10000  2622
    20  5000   3759
    30  3333   5712
    40  2500   5953
    50  2000   6141

New server
    -c  -t      TPS
    5   20000  2733
    10  10000  2783
    20  5000   3241
    30  3333   2987
    40  2500   2739
    50  2000   2119

On the new server PostgreSQL does not scale at all. Looks like contention.


As you can see, the new server is dramatically slower than the old one.

I tested both the RAID10 data disk and the RAID1 xlog disk with bonnie++.  The xlog disks were almost identical in performance.  The RAID10 pg-data disks looked like this:
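
The exact bonnie++ invocation isn't shown; judging from the output below (24064 MB test size, 16 file chunks, concurrency 1, i.e. the defaults apart from size), it was presumably something along these lines, with the target directory and user being guesses:

    bonnie++ -d "$PGDATA" -s 24064 -u postgres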

Old server:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xenon        24064M   687  99 203098  26 81904  16  3889  96 403747  31 737.6  31
Latency             20512us     469ms     394ms   21402us     396ms     112ms
Version  1.96       ------Sequential Create------ --------Random Create--------
xenon               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 15953  27 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency             43291us     857us     519us    1588us      37us     178us
1.96,1.96,xenon,1,1349726125,24064M,,687,99,203098,26,81904,16,3889,96,403747,31,737.6,31,16,,,,,15953,27,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,20512us,469ms,394ms,21402us,396ms,112ms,43291us,857us,519us,1588us,37us,178us


New server:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zinc         24064M   862  99 212143  54 96008  14  4921  99 279239  17 752.0  23
Latency             15613us     598ms     597ms    2764us     398ms     215ms
Version  1.96       ------Sequential Create------ --------Random Create--------
zinc                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20380  26 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               487us     627us     407us     972us      29us     262us
1.96,1.96,zinc,1,1349722017,24064M,,862,99,212143,54,96008,14,4921,99,279239,17,752.0,23,16,,,,,20380,26,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,15613us,598ms,597ms,2764us,398ms,215ms,487us,627us,407us,972us,29us,262us

I don't know enough about bonnie++ to know if these differences are interesting.

One dramatic difference showed up via vmstat.  On the old server, the I/O load during the bonnie++ run was steady, like this:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  2  71800 2117612  17940 9375660    0    0 82948 81944 1992 1341  1  3 86 10
 0  2  71800 2113328  17948 9383896    0    0 76288 75806 1751 1167  0  2 86 11
 0  1  71800 2111004  17948 9386540   92    0 93324 94232 2230 1510  0  4 86 10
 0  1  71800 2106796  17948 9387436  114    0 67698 67588 1572 1088  0  2 87 11
 0  1  71800 2106724  17956 9387968   50    0 81970 85710 1918 1287  0  3 86 10
 1  1  71800 2103304  17956 9390700    0    0 92096 92160 1970 1194  0  4 86 10
 0  2  71800 2103196  17976 9389204    0    0 70722 69680 1655 1116  1  3 86 10
 1  1  71800 2099064  17980 9390824    0    0 57346 57348 1357  949  0  2 87 11
 0  1  71800 2095596  17980 9392720    0    0 57344 57348 1379  987  0  2 86 12

But on the new server the I/O load varied wildly during bonnie++:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 4518352  12004 7167000    0    0 118894 120838 2613 1539  0  2 93  5
 0  1      0 4517252  12004 7167824    0    0  52116  53248 1179  793  0  1 94  5
 0  1      0 4515864  12004 7169088    0    0  46764  49152 1104  733  0  1 91  7
 0  1      0 4515180  12012 7169764    0    0  32924  30724  750  542  0  1 93  6
 0  1      0 4514328  12016 7170780    0    0  42188  45056 1019  664  0  1 90  9
 0  1      0 4513072  12016 7171856    0    0  67528  65540 1487  993  0  1 96  4
 0  1      0 4510852  12016 7173160    0    0  56876  57344 1358  942  0  1 94  5
 0  1      0 4500280  12044 7179924    0    0  91564  94220 2505 2504  1  2 91  6
 0  1      0 4495564  12052 7183492    0    0 102660 104452 2289 1473  0  2 92  6
 0  1      0 4492092  12052 7187720    0    0  98498  96274 2140 1385  0  2 93  5
 0  1      0 4488608  12060 7190772    0    0  97628 100358 2176 1398  0  1 94  4
 1  0      0 4485880  12052 7192600    0    0 112406 114686 2461 1509  0  3 90  7
 1  0      0 4483424  12052 7195612    0    0  64678  65536 1449  948  0  1 91  8
 0  1      0 4480252  12052 7199404    0    0  99608 100356 2217 1452  0  1 96  3
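
To capture this kind of trace alongside a run, something like the following works; the 5-second sampling interval is an assumption, since the original interval isn't stated:

    vmstat 5 > vmstat.log &      # sample I/O and CPU stats in the background
    VMSTAT_PID=$!
    bonnie++ -d "$PGDATA" -s 24064 -u postgres
    kill "$VMSTAT_PID"           # stop sampling when the benchmark finishes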


Also note the difference in the free/cache distribution, unless you took these numbers at completely different stages of the bonnie++ run.

Any ideas where to look next would be greatly appreciated.

Craig

