Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL - Mailing list pgsql-performance

From Pietro Pugni
Subject Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL
Date
Msg-id 732B95A3-FEC8-4623-9AC5-98FD30F00863@gmail.com
Whole thread Raw
Responses Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL  (Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com>)
Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL  (Aidan Van Dyk <aidan@highrise.ca>)
Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL  (Jeff Janes <jeff.janes@gmail.com>)
Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL  (Gerardo Herzig <gherzig@fmed.uba.ar>)
List pgsql-performance

This question was posted originally on http://dba.stackexchange.com/questions/96444/cant-get-dell-pe-t420-perc-h710-perform-better-than-a-macmini-with-postgresql and they suggested to post it on this mailing list.

It's months that I'm trying to solve a performance issue with PostgreSQL. I’m able to give you all the technical details needed.

SYSTEM CONFIGURATION

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:

  • VD0: two 15k SAS disks (ext4, OS partition, WAL partition, RAID1)
  • VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)

This system has the following configuration:

  • Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
  • 128GB RAM (DDR3, 8x16GB @1600Mhz)
  • two Intel Xeon E5-2640 v2 @2Ghz
  • Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "disabled"):
    • VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
    • VD1 (Postgres data partition): ten 10k SAS disks (XFS, RAID5)
  • PostgreSQL 9.4 (updated to the latest available version)
  • moved pg_stat_tmp to RAM disk

My personal low cost and low profile development machine is a MacMini configured in this way:

  • OS X Server 10.7.5
  • 8GB RAM (DDR3, 2x4GB @1333Mhz)
  • one Intel i7 @2.2Ghz
  • two Internal 500GB 7.2k SAS HDD (non RAID) for OS partition
  • external Promise Pegasus R1 connected with Thunderbolt v1 (512MB RAM, four 1TB 7.2k SAS HDD 32MB cache, RAID5, Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "enabled", NCQ: "enabled")
  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)
  • moved pg_stat_tmp to RAM disk

So far I've made a lot of tuning adjustments to both machines, including kernel reccomended ones on the official Postgres doc site.

APPLICATION

The deployment machine runs a web platform which instructs Postgres to make big transactions over billion of records. It's a platform designed for one user because system resources have to be dedicated as much as possible to one single job due to data size (I don't like to call it big data because big data are in the order ob ten of billion).

ISSUEs

I've found the deployment machine to be a lot slower than the development machine. This is paradoxal because the two machine really differs in many aspects. I've run many queries to investigate this strange behaviour and have done a lot of tuning adjustments.

During the last two months I've prepared and executed two type of query sets:

  • A: these sets make use of SELECT ... INTO, CREATE INDEX, CLUSTER and VACUUM ANALYZE.
  • B: these sets are from our application generated transactions and make use of SELECT over the tables created with set A.

A and B were always slower on T420. The only type of operation that was faster is the VACUUM ANALYZE.

RESULTS

A type set:

  • T420: went from 311seconds (default postgresql.conf) to 195seconds doing tuning adjustments over RAID, kernel and postgresql.conf;
  • MacMini: 40seconds.

B type set:

  • T420: 141seconds;
  • MacMini: 101seconds.

I've to mention that we have also adjusted the BIOS on T420 setting all possible parameters to "performance" and disabling low energy profiles. This lowered time execution over a type A set from 240seconds to 211seconds.

We have also upgrade all firmware and BIOS to the latest available versions.


Here are two benchmarks generated using pg_test_fsync:

T420 pg_test_fsync

60 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                   23358.758 ops/sec      43 usecs/op       fdatasync                       21417.018 ops/sec      47 usecs/op       fsync                           21112.662 ops/sec      47 usecs/op       fsync_writethrough                            n/a       open_sync                       23082.764 ops/sec      43 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                   11737.746 ops/sec      85 usecs/op       fdatasync                       19222.074 ops/sec      52 usecs/op       fsync                           18608.405 ops/sec      54 usecs/op       fsync_writethrough                            n/a       open_sync                       11510.074 ops/sec      87 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)        1 * 16kB open_sync write       21484.546 ops/sec      47 usecs/op        2 *  8kB open_sync writes      11478.119 ops/sec      87 usecs/op        4 *  4kB open_sync writes       5885.149 ops/sec     170 usecs/op        8 *  2kB open_sync writes       3027.676 ops/sec     330 usecs/op       16 *  1kB open_sync writes       1512.922 ops/sec     661 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       write, fsync, close             17946.690 ops/sec      56 usecs/op       write, close, fsync             17976.202 ops/sec      56 usecs/op

Non-Sync'ed 8kB writes:       write                           343202.937 ops/sec       3 usecs/op

MacMini pg_test_fsync

60 seconds per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                      3780.341 ops/sec     265 usecs/op       fdatasync                          3117.094 ops/sec     321 usecs/op       fsync                              3156.298 ops/sec     317 usecs/op       fsync_writethrough                  110.300 ops/sec    9066 usecs/op       open_sync                          3077.932 ops/sec     325 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                      1522.400 ops/sec     657 usecs/op       fdatasync                          2700.055 ops/sec     370 usecs/op       fsync                              2670.652 ops/sec     374 usecs/op       fsync_writethrough                   98.462 ops/sec   10156 usecs/op       open_sync                          1532.235 ops/sec     653 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)        1 * 16kB open_sync write          2634.754 ops/sec     380 usecs/op        2 *  8kB open_sync writes         1547.801 ops/sec     646 usecs/op        4 *  4kB open_sync writes          801.542 ops/sec    1248 usecs/op        8 *  2kB open_sync writes          405.515 ops/sec    2466 usecs/op       16 *  1kB open_sync writes          204.095 ops/sec    4900 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       write, fsync, close                2747.345 ops/sec     364 usecs/op       write, close, fsync                3070.877 ops/sec     326 usecs/op

Non-Sync'ed 8kB writes:       write                              3275.716 ops/sec     305 usecs/op

This confirms the hardware IO capabilities of T420 but doesn't explain why MacMini is MUCH MORE FAST.



Now let’s propose some query profiling times.

B type set are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two querys from a single transactions and executed the twos on both system. Here are the results:

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


COMPILING PGSQL

I compiled and tested all the latest pgsql versions (9.0.19, 9.1.15, 9.2.10, 9.3.6 and 9.4.1) using different combinations of parameters for gcc-4.9.1 (gcc 4.7 for pgsql 9.0.19) and Postgres (I’ve tried also clang compiler with different optimization flags with no benefits). I followed this article but I was unable to test the -flto option due to several errors returned by make.  After two days of testing I went down from 195 to 189 seconds on T420 where MacMini still is 40 seconds (A set); and from 141 to 129 seconds where MacMini is 101 seconds (B set). On MacMini I’ve used the built-in pgsql 9.0.13 version while on T420 I've used the following optimal compiling options:

./configure CFLAGS="-O3 -fno-inline-functions -march=native" --with-openssl --with-libxml --with-libxslt --with-wal-blocksize=64 --with-blocksize=32 --with-wal-segsize=64 --with-segsize=1

I've also tried to disable Hyper-Threading with echo 0 > /sys/devices/system/cpu/cpuN/online where cpuN is the N-th logical CPU but nothing changed over B set queries. We have 2 CPU with 8 cores for a total of 16 physical cores and 16 logical cores.


It seems like T420 doesn’t push on single transaction while is probably able to manage multiple connections much better than MacMini. I can’t figure out why it’s much much much slower than MacMini on any kind of query (from data loading to da selection).

pgsql-performance by date:

Previous
From: Przemysław Deć
Date:
Subject: Re: Some performance testing?
Next
From: Ilya Kosmodemiansky
Date:
Subject: Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL