Re: Two identical systems, radically different performance - Mailing list pgsql-performance

From: Evgeny Shishkin
Subject: Re: Two identical systems, radically different performance
Msg-id: B5EA62D3-A446-4A1F-BC4E-4B8290FEE1C2@gmail.com
In response to: Two identical systems, radically different performance  (Craig James <cjames@emolecules.com>)
List: pgsql-performance

On Oct 9, 2012, at 1:45 AM, Craig James <cjames@emolecules.com> wrote:

This is driving me crazy.  A new server, virtually identical to an old one, gets only 50% of the old one's pgbench performance.  I've checked everything I can think of.

The setups (call the servers "old" and "new"):

old: 2 x 4-core Intel Xeon E5620
new: 4 x 4-core Intel Xeon E5606

both:

  Memory: 12 GB DDR ECC
  Disks: 12x500GB disks (Western Digital 7200RPM SATA)
    2 disks, RAID1: OS (ext4) and postgres xlog (ext2)
    8 disks, RAID10: $PGDATA

  3WARE 9650SE-12ML with battery-backed cache.  The admin tool (tw_cli)
  indicates that the battery is charged and the cache is working on both units.

  Linux: 2.6.32-41-server #94-Ubuntu SMP (new server's disk was
  actually cloned from old server).


  Postgres: 8.4.4 (yes, I should update.  But both are identical.)
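
For reference, the controller and BBU state mentioned above can be compared across the two machines with tw_cli.  A minimal sketch; the /c0 and /u0 object names are assumptions, so check what "tw_cli show" actually lists:

    tw_cli show                  # list controllers and their IDs
    tw_cli /c0/bbu show all      # battery status (should be charged/OK on both)
    tw_cli /c0/u0 show all       # cache policy and state of the RAID10 unit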

The postgresql.conf files are identical; the diffs from the stock configuration are:

    max_connections = 500
    shared_buffers = 1000MB
    work_mem = 128MB
    synchronous_commit = off
    full_page_writes = off
    wal_buffers = 256kB
    checkpoint_segments = 30
    effective_cache_size = 4GB
    track_activities = on
    track_counts = on
    track_functions = none
    autovacuum = on
    autovacuum_naptime = 5min
    escape_string_warning = off
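
A quick way to verify that the running settings really do match is to dump pg_settings on each server and diff the results (the user name comes from the pgbench commands below; the output file names are placeholders):

    psql -U test -Atc "SELECT name || '=' || setting FROM pg_settings ORDER BY name" > old.settings
    # run the same command on the new server, then:
    diff old.settings new.settings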

Note that the old server is in production and was serving a light load while this test was running, so in theory it should be slower, not faster, than the new server.

pgbench: Old server

    pgbench -i -s 100 -U test
    pgbench -U test -c ... -t ...
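
The -c/-t pairs below keep the total transaction count constant at 100,000, so as a sketch the sweep can be scripted like this:

    for c in 5 10 20 30 40 50; do
        pgbench -U test -c $c -t $((100000 / c))
    done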

    -c  -t      TPS
     5  20000  3777
    10  10000  2622
    20  5000   3759
    30  3333   5712
    40  2500   5953
    50  2000   6141

New server
    -c  -t      TPS
    5   20000  2733
    10  10000  2783
    20  5000   3241
    30  3333   2987
    40  2500   2739
    50  2000   2119

On the new server PostgreSQL does not scale at all. Looks like contention.


As you can see, the new server is dramatically slower than the old one.

I tested both the RAID10 data disk and the RAID1 xlog disk with bonnie++.  The xlog disks were almost identical in performance.  The RAID10 pg-data disks looked like this:
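
The exact bonnie++ invocation isn't shown; judging from the output below (24064 MB test size, 16 file chunks, concurrency 1, i.e. the defaults apart from size), it was presumably something along these lines, with the target directory and user being guesses:

    bonnie++ -d "$PGDATA" -s 24064 -u postgres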

Old server:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xenon        24064M   687  99 203098  26 81904  16  3889  96 403747  31 737.6  31
Latency             20512us     469ms     394ms   21402us     396ms     112ms
Version  1.96       ------Sequential Create------ --------Random Create--------
xenon               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 15953  27 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency             43291us     857us     519us    1588us      37us     178us
1.96,1.96,xenon,1,1349726125,24064M,,687,99,203098,26,81904,16,3889,96,403747,31,737.6,31,16,,,,,15953,27,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,20512us,469ms,394ms,21402us,396ms,112ms,43291us,857us,519us,1588us,37us,178us


New server:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zinc         24064M   862  99 212143  54 96008  14  4921  99 279239  17 752.0  23
Latency             15613us     598ms     597ms    2764us     398ms     215ms
Version  1.96       ------Sequential Create------ --------Random Create--------
zinc                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20380  26 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency               487us     627us     407us     972us      29us     262us
1.96,1.96,zinc,1,1349722017,24064M,,862,99,212143,54,96008,14,4921,99,279239,17,752.0,23,16,,,,,20380,26,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,15613us,598ms,597ms,2764us,398ms,215ms,487us,627us,407us,972us,29us,262us

I don't know enough about bonnie++ to know if these differences are interesting.

One dramatic difference showed up via vmstat.  On the old server, the I/O load during the bonnie++ run was steady, like this:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  2  71800 2117612  17940 9375660    0    0 82948 81944 1992 1341  1  3 86 10
 0  2  71800 2113328  17948 9383896    0    0 76288 75806 1751 1167  0  2 86 11
 0  1  71800 2111004  17948 9386540   92    0 93324 94232 2230 1510  0  4 86 10
 0  1  71800 2106796  17948 9387436  114    0 67698 67588 1572 1088  0  2 87 11
 0  1  71800 2106724  17956 9387968   50    0 81970 85710 1918 1287  0  3 86 10
 1  1  71800 2103304  17956 9390700    0    0 92096 92160 1970 1194  0  4 86 10
 0  2  71800 2103196  17976 9389204    0    0 70722 69680 1655 1116  1  3 86 10
 1  1  71800 2099064  17980 9390824    0    0 57346 57348 1357  949  0  2 87 11
 0  1  71800 2095596  17980 9392720    0    0 57344 57348 1379  987  0  2 86 12

But on the new server the I/O load varied wildly during bonnie++:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 4518352  12004 7167000    0    0 118894 120838 2613 1539  0  2 93  5
 0  1      0 4517252  12004 7167824    0    0  52116  53248 1179  793  0  1 94  5
 0  1      0 4515864  12004 7169088    0    0  46764  49152 1104  733  0  1 91  7
 0  1      0 4515180  12012 7169764    0    0  32924  30724  750  542  0  1 93  6
 0  1      0 4514328  12016 7170780    0    0  42188  45056 1019  664  0  1 90  9
 0  1      0 4513072  12016 7171856    0    0  67528  65540 1487  993  0  1 96  4
 0  1      0 4510852  12016 7173160    0    0  56876  57344 1358  942  0  1 94  5
 0  1      0 4500280  12044 7179924    0    0  91564  94220 2505 2504  1  2 91  6
 0  1      0 4495564  12052 7183492    0    0 102660 104452 2289 1473  0  2 92  6
 0  1      0 4492092  12052 7187720    0    0  98498  96274 2140 1385  0  2 93  5
 0  1      0 4488608  12060 7190772    0    0  97628 100358 2176 1398  0  1 94  4
 1  0      0 4485880  12052 7192600    0    0 112406 114686 2461 1509  0  3 90  7
 1  0      0 4483424  12052 7195612    0    0  64678  65536 1449  948  0  1 91  8
 0  1      0 4480252  12052 7199404    0    0  99608 100356 2217 1452  0  1 96  3
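
To capture this kind of trace alongside a run, something like the following works; the 5-second sampling interval is an assumption, since the original interval isn't stated:

    vmstat 5 > vmstat.log &      # sample I/O and CPU stats in the background
    VMSTAT_PID=$!
    bonnie++ -d "$PGDATA" -s 24064 -u postgres
    kill "$VMSTAT_PID"           # stop sampling when the benchmark finishes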


Also note the difference in the free/cache distribution, unless you took these numbers at completely different stages of the bonnie++ run.

Any ideas where to look next would be greatly appreciated.

Craig

