Thread: 60 core performance with 9.3

60 core performance with 9.3

From
Mark Kirkwood
Date:
I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].

The context is that the current machine in use by the customer is a 32 core
one, and due to growth we are looking at something larger (hence 60 cores).

Some initial tests show similar pgbench read only performance to what
Robert found here
http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
(actually a bit quicker, around 400000 tps).

However a mixed read-write workload is giving results the same as or
only marginally quicker than the 32 core machine - particularly at
higher numbers of clients (e.g. 200 - 500). I have yet to break out the
perf toolset, but I'm wondering if anyone has compared 32 and 60 (or
64) core read-write pgbench performance?
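
For reference, the runs were just the standard pgbench select-only and
TPC-B style scripts, something along these lines (client counts and
durations here are illustrative, not the exact flags used):

pgbench -i -s 500 pgbench                  # initialise
pgbench -S -c 200 -j 60 -T 300 pgbench     # read only
pgbench -c 200 -j 60 -T 300 pgbench        # mixed read-write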

regards

Mark

[1] Details:

4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04  (Linux 3.13)



Re: 60 core performance with 9.3

From
Scott Marlowe
Date:
On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].
>
> The context is the current machine in use by the customer is a 32 core one,
> and due to growth we are looking at something larger (hence 60 cores).
>
> Some initial tests show similar pgbench read only performance to what Robert
> found here
> http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
> (actually a bit quicker around 400000 tps).
>
> However doing a mixed read-write workload is getting results the same or
> only marginally quicker than the 32 core machine - particularly at higher
> number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
> but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
> pgbench performance?

My guess is that the read only test is CPU / memory bandwidth limited,
but the mixed test is IO bound.

What's your iostat / vmstat / iotop etc look like when you're doing
both read only and read/write mixed?
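
For example, something like the following captured while each test is
running (just a sketch):

iostat -xm 5
vmstat 5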


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 27/06/14 14:01, Scott Marlowe wrote:
> On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].
>>
>> The context is the current machine in use by the customer is a 32 core one,
>> and due to growth we are looking at something larger (hence 60 cores).
>>
>> Some initial tests show similar pgbench read only performance to what Robert
>> found here
>> http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
>> (actually a bit quicker around 400000 tps).
>>
>> However doing a mixed read-write workload is getting results the same or
>> only marginally quicker than the 32 core machine - particularly at higher
>> number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
>> but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
>> pgbench performance?
>
> My guess is that the read only test is CPU / memory bandwidth limited,
> but the mixed test is IO bound.
>
> What's your iostat / vmstat / iotop etc look like when you're doing
> both read only and read/write mixed?
>
>

That was what I would have thought too, but it does not appear to be the
case; here is a typical iostat:

Device:         rrqm/s   wrqm/s     r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00  4448.00     0.00    41.47    19.10     0.14    0.03    0.00    0.03   0.03  14.40
nvme1n1           0.00     0.00    0.00  4448.00     0.00    41.47    19.10     0.15    0.03    0.00    0.03   0.03  15.20
nvme2n1           0.00     0.00    0.00  4549.00     0.00    42.20    19.00     0.15    0.03    0.00    0.03   0.03  15.20
nvme3n1           0.00     0.00    0.00  4548.00     0.00    42.19    19.00     0.16    0.04    0.00    0.04   0.04  16.00
dm-0              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 17961.00     0.00    83.67     9.54     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


My feeling is spinlock or similar, 'perf top' shows

kernel find_busiest_group
kernel _raw_spin_lock

as the top time users.


Re: 60 core performance with 9.3

From
Andres Freund
Date:
On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
> My feeling is spinlock or similar, 'perf top' shows
>
> kernel find_busiest_group
> kernel _raw_spin_lock
>
> as the top time users.

Those don't tell that much by themselves, could you do a hierarchical
profile? I.e. perf record -ga? That'll at least give the callers for
kernel level stuff. For more information compile postgres with
-fno-omit-frame-pointer.
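
For example, roughly:

perf record -a -g -- sleep 60    # while the pgbench run is in progress
perf report -g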

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 27/06/14 21:19, Andres Freund wrote:
> On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
>> My feeling is spinlock or similar, 'perf top' shows
>>
>> kernel find_busiest_group
>> kernel _raw_spin_lock
>>
>> as the top time users.
>
> Those don't tell that much by themselves, could you do a hierarchical
> profile? I.e. perf record -ga? That'll at least give the callers for
> kernel level stuff. For more information compile postgres with
> -fno-omit-frame-pointer.
>

Excellent suggestion, will do next week!

regards

Mark



Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 27/06/14 21:19, Andres Freund wrote:
> On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
>> My feeling is spinlock or similar, 'perf top' shows
>>
>> kernel find_busiest_group
>> kernel _raw_spin_lock
>>
>> as the top time users.
>
> Those don't tell that much by themselves, could you do a hierarchical
> profile? I.e. perf record -ga? That'll at least give the callers for
> kernel level stuff. For more information compile postgres with
> -fno-omit-frame-pointer.
>

Unfortunately this did not help - had lots of unknown symbols from
postgres in the profile - I'm guessing the Ubuntu postgresql-9.3 package
needs either the -dev package or to be rebuilt with the enable profile
option (debug and no-omit-frame-pointer seem to be there already).

However further investigation did uncover *very* interesting things.
Firstly, I had previously said that read only performance looked
ok... this was wrong, based purely on comparison to Robert's blog post.
Rebooting the 60 core box with only 32 cores enabled showed that we got
*better* read only scaling, illustrating that we were hitting a serious
regression with more cores. At this point data is needed:

Test: pgbench
Options: scale 500
          read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
     max_connections = 200
     shared_buffers = 10GB
     maintenance_work_mem = 1GB
     effective_io_concurrency = 10
     wal_buffers = 32MB
     checkpoint_segments = 192
     checkpoint_completion_target = 0.8


Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
--------+------------------+-----------------
6       |  70400           |  71028
12      |  98918           | 129140
24      | 230345           | 240631
48      | 324042           | 409510
96      | 346929           | 120464
192     | 312621           |  92663

So we have anti-scaling with 60 cores as we increase the client
connections. Ouch! A level of urgency led to trying out Andres's
'rwlock' 9.4 branch [1] - cherry picking the last 5 commits onto a 9.4
branch, building a package from that, and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+--------------------------
6       |  70189
12      | 128894
24      | 233542
48      | 422754
96      | 590796
192     | 630672

Wow - that is more like it! Andres that is some nice work, we definitely
owe you some beers for that :-) I am aware that I need to retest with an
unpatched 9.4 src - as it is not clear from this data how much is due to
Andres's patches and how much to the steady stream of 9.4 development.
I'll post an update on that later, but figured this was interesting
enough to note for now.


Regards

Mark

[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git,
commits:
4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611



Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 01/07/14 21:48, Mark Kirkwood wrote:

> [1] from git://git.postgresql.org/git/users/andresfreund/postgres.git,
> commits:
> 4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
> 10d72b97f108b6002210ea97a414076a62302d4e
> 67ffebe50111743975d54782a3a94b15ac4e755f
> fe686ed18fe132021ee5e557c67cc4d7c50a1ada
> f2378dc2fa5b73c688f696704976980bab90c611
>
>

Hmmm, that should read "the last 5 commits in 'rwlock-contention'", and I had
pasted the commit numbers from my tree, not Andres's - sorry, here are the right ones:
472c87400377a7dc418d8b77e47ba08f5c89b1bb
e1e549a8e42b753cc7ac60e914a3939584cb1c56
65c2174469d2e0e7c2894202dc63b8fa6f8d2a7f
959aa6e0084d1264e5b228e5a055d66e5173db7d
a5c3ddaef0ee679cf5e8e10d59e0a1fe9f0f1893
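
For completeness, the procedure was roughly along these lines (the remote
name, local branch and pick order shown here are assumptions rather than the
exact commands used):

git remote add andres git://git.postgresql.org/git/users/andresfreund/postgres.git
git fetch andres
# on a local 9.4 development branch, oldest commit first:
git cherry-pick 472c8740 e1e549a8 65c21744 959aa6e0 a5c3ddae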




Re: 60 core performance with 9.3

From
Andres Freund
Date:
On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
> On 27/06/14 21:19, Andres Freund wrote:
> >On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
> >>My feeling is spinlock or similar, 'perf top' shows
> >>
> >>kernel find_busiest_group
> >>kernel _raw_spin_lock
> >>
> >>as the top time users.
> >
> >Those don't tell that much by themselves, could you do a hierarchical
> >profile? I.e. perf record -ga? That'll at least give the callers for
> >kernel level stuff. For more information compile postgres with
> >-fno-omit-frame-pointer.
> >
>
> Unfortunately this did not help - had lots of unknown symbols from postgres
> in the profile - I'm guessing the Ubuntu postgresql-9.3 package needs either
> the -dev package or to be rebuilt with the enable profile option (debug and
> no-omit-frame-pointer seem to be there already).

You need to install the -dbg package. My bet is you'll see s_lock high
in the profile, called mainly from the procarray and buffer mapping
lwlocks.
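
For the Ubuntu packages that should be something like (package name assumed,
possibly from the PGDG apt repository):

sudo apt-get install postgresql-9.3-dbg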

> Test: pgbench
> Options: scale 500
>          read only
> Os: Ubuntu 14.04
> Pg: 9.3.4
> Pg Options:
>     max_connections = 200

Just as an experiment I'd suggest increasing max_connections by one and
two and quickly retesting - there are some cacheline alignment issues that
aren't fixed yet that happen to vanish with some max_connections
settings.

>     shared_buffers = 10GB
>     maintenance_work_mem = 1GB
>     effective_io_concurrency = 10
>     wal_buffers = 32MB
>     checkpoint_segments = 192
>     checkpoint_completion_target = 0.8
>
>
> Results
>
> Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
> --------+------------------+-----------------
> 6       |  70400           |  71028
> 12      |  98918           | 129140
> 24      | 230345           | 240631
> 48      | 324042           | 409510
> 96      | 346929           | 120464
> 192     | 312621           |  92663
>
> So we have anti scaling with 60 cores as we increase the client connections.
> Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1]
> - cherry picking the last 5 commits into 9.4 branch and building a package
> from that and retesting:
>
> Clients | 9.4 tps 60 cores (rwlock)
> --------+--------------------------
> 6       |  70189
> 12      | 128894
> 24      | 233542
> 48      | 422754
> 96      | 590796
> 192     | 630672
>
> Wow - that is more like it! Andres that is some nice work, we definitely owe
> you some beers for that :-) I am aware that I need to retest with an
> unpatched 9.4 src - as it is not clear from this data how much is due to
> Andres's patches and how much to the steady stream of 9.4 development. I'll
> post an update on that later, but figured this was interesting enough to
> note for now.

Cool. That's what I like (and expect) to see :). I don't think unpatched
9.4 will show significantly different results than 9.3, but it'd be good
to validate that. If you do so, could you post the results in the
-hackers thread I just CCed you on? That'll help the work to get into
9.5.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 01/07/14 22:13, Andres Freund wrote:
> On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
>> - cherry picking the last 5 commits into 9.4 branch and building a package
>> from that and retesting:
>>
>> Clients | 9.4 tps 60 cores (rwlock)
>> --------+--------------------------
>> 6       |  70189
>> 12      | 128894
>> 24      | 233542
>> 48      | 422754
>> 96      | 590796
>> 192     | 630672
>>
>> Wow - that is more like it! Andres that is some nice work, we definitely owe
>> you some beers for that :-) I am aware that I need to retest with an
>> unpatched 9.4 src - as it is not clear from this data how much is due to
>> Andres's patches and how much to the steady stream of 9.4 development. I'll
>> post an update on that later, but figured this was interesting enough to
>> note for now.
>
> Cool. That's what I like (and expect) to see :). I don't think unpatched
> 9.4 will show significantly different results than 9.3, but it'd be good
> to validate that. If you do so, could you post the results in the
> -hackers thread I just CCed you on? That'll help the work to get into
> 9.5.

So we seem to have nailed read only performance. Going back and
revisiting read write performance finds:

Postgres 9.4 beta
rwlock patch
pgbench scale = 2000

max_connections = 200;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "32MB";
checkpoint_segments = 192;
checkpoint_completion_target = 0.8;

clients  | tps (32 cores) | tps (60 cores)
---------+----------------+----------------
6        |   8313         |   8175
12       |  11012         |  14409
24       |  16151         |  17191
48       |  21153         |  23122
96       |  21977         |  22308
192      |  22917         |  23109


So we are back to not doing significantly better than 32 cores. Hmmm.
Doing quite a few more tweaks gets some better numbers:

kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=5000000
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]

+checkpoint_segments = 1920
+wal_buffers = "256MB";
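
The kernel settings above were applied roughly as follows (exact commands
assumed here, not copied from the run notes):

sysctl -w kernel.sched_autogroup_enabled=0
sysctl -w kernel.sched_migration_cost_ns=5000000
sysctl -w net.core.somaxconn=1024
echo never > /sys/kernel/mm/transparent_hugepage/enabled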


clients  | tps
---------+---------
6        |   8366
12       |  15988
24       |  19828
48       |  30315
96       |  31649
192      |  29497

One more:

+wal_sync_method = "open_datasync"

clients  | tps
---------+---------
6        |  9566
12       | 17129
24       | 22962
48       | 34564
96       | 32584
192      | 28367

So this looks better - however I suspect 32 core performance would
improve with these as well!

The problem does *not* look to be connected with IO (I will include some
iostat below). So time to get the profiler out (192 clients for 1 minute):

Full report http://paste.ubuntu.com/7777886/

# ========
# captured on: Fri Jul 11 03:09:06 2014
# hostname : ncel-prod-db3
# os release : 3.13.0-24-generic
# perf version : 3.13.9
# arch : x86_64
# nrcpus online : 60
# nrcpus avail : 60
# cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,7
# total memory : 1056692116 kB
# cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2
= 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1,
precise_ip = 0, attr_mmap2 = 0, attr_mmap  = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, uncore_cbox_10 = 17, uncore_cbox_11 = 18,
uncore_cbox_12 = 19, uncore_cbox_13 = 20, uncore_cbox_14 = 21, software
= 1, uncore_irp = 33, uncore_pcu = 22, tracepoint = 2, uncore_imc_0 =
25, uncore_imc_1 = 26, uncore_imc_2 = 27, uncore_imc_3 = 28,
uncore_imc_4 = 29, uncore_imc_5 = 30, uncore_imc_6 = 31, uncore_imc_7 =
32, uncore_qpi_0 = 34, uncore_qpi_1 = 35, uncore_qpi_2 = 36,
uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 =
10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13,
uncore_cbox_7 = 14, uncore_cbox_8 = 15, uncore_cbox_9 = 16,
uncore_r2pcie = 37, uncore_r3qpi_0 = 38, uncore_r3qpi_1 = 39, breakpoint
= 5, uncore_ha_0 = 23, uncore_ha_1 = 24, uncore_ubox = 6
# ========
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 359906321606
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ..........................
#
     8.82%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--75.69%-- pagevec_lru_move_fn
               |          __lru_cache_add
               |          lru_cache_add
               |          putback_lru_page
               |          migrate_pages
               |          migrate_misplaced_page
               |          do_numa_page
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--31.07%-- PinBuffer
               |          |          ReadBuffer_common
               |          |          ReadBufferExtended
               |          |          |
               |          |          |--71.62%-- index_fetch_heap
               |          |          |          index_getnext
               |          |          |          IndexNext
               |          |          |          ExecScan
               |          |          |          ExecProcNode
               |          |          |          ExecModifyTable
               |          |          |          ExecProcNode
               |          |          |          standard_ExecutorRun
               |          |          |          ProcessQuery
               |          |          |          PortalRunMulti
               |          |          |          PortalRun
               |          |          |          PostgresMain
               |          |          |          ServerLoop
               |          |          |
               |          |          |--17.47%-- heap_hot_search
               |          |          |          _bt_check_unique
               |          |          |          _bt_doinsert
               |          |          |          btinsert
               |          |          |          FunctionCall6Coll
               |          |          |          index_insert
               |          |          |          ExecInsertIndexTuples
               |          |          |          ExecModifyTable
               |          |          |          ExecProcNode
               |          |          |          standard_ExecutorRun
               |          |          |          ProcessQuery
               |          |          |          PortalRunMulti
               |          |          |          PortalRun
               |          |          |          PostgresMain
               |          |          |          ServerLoop
               |          |          |
               |          |          |--3.81%-- RelationGetBufferForTuple
               |          |          |          heap_update
               |          |          |          ExecModifyTable
               |          |          |          ExecProcNode
               |          |          |          standard_ExecutorRun
               |          |          |          ProcessQuery
               |          |          |          PortalRunMulti
               |          |          |          PortalRun
               |          |          |          PostgresMain
               |          |          |          ServerLoop
               |          |          |
               |          |          |--3.65%-- _bt_relandgetbuf
               |          |          |          _bt_search
               |          |          |          _bt_first
               |          |          |          btgettuple
               |          |          |          FunctionCall2Coll
               |          |          |          index_getnext_tid
               |          |          |          index_getnext
               |          |          |          IndexNext
               |          |          |          ExecScan
               |          |          |          ExecProcNode
               |          |          |          |
               |          |          |          |--97.56%-- ExecModifyTable
               |          |          |          |          ExecProcNode
               |          |          |          |          standard_ExecutorRun
               |          |          |          |          ProcessQuery
               |          |          |          |          PortalRunMulti
               |          |          |          |          PortalRun
               |          |          |          |          PostgresMain
               |          |          |          |          ServerLoop
               |          |          |          |
               |          |          |           --2.44%-- standard_ExecutorRun
               |          |          |                     PortalRunSelect
               |          |          |                     PortalRun
               |          |          |                     PostgresMain
               |          |          |                     ServerLoop
               |          |          |
               |          |          |--2.69%-- fsm_readbuf
               |          |          |          fsm_set_and_search
               |          |          |          RecordPageWithFreeSpace
               |          |          |          lazy_vacuum_rel
               |          |          |          vacuum_rel
               |          |          |          vacuum
               |          |          |          do_autovacuum
               |          |          |
               |          |           --0.75%-- lazy_vacuum_rel
               |          |                     vacuum_rel
               |          |                     vacuum
               |          |                     do_autovacuum
               |          |
               |          |--4.66%-- SearchCatCache
               |          |          |
               |          |          |--49.62%-- oper
               |          |          |          make_op
               |          |          |          transformExprRecurse
               |          |          |          transformExpr
               |          |          |          |
               |          |          |          |--90.02%-- transformTargetEntry
               |          |          |          |          transformTargetList
               |          |          |          |          transformStmt
               |          |          |          |          parse_analyze
               |          |          |          |          pg_analyze_and_rewrite
               |          |          |          |          PostgresMain
               |          |          |          |          ServerLoop
               |          |          |          |
               |          |          |           --9.98%-- transformWhereClause
               |          |          |                     transformStmt
               |          |          |                     parse_analyze
               |          |          |                     pg_analyze_and_rewrite
               |          |          |                     PostgresMain
               |          |          |                     ServerLoop



With respect to IO, here are typical iostat outputs:

sda: HW RAID 10 array of SAS SSD [data]
md0: SW RAID 10 of nvme[0-3]n1 PCIe SSD [xlog]

Non Checkpoint

Device:         rrqm/s   wrqm/s     r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    15.00    0.00     3.00     0.00     0.07    50.67     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00  4198.00     0.00   146.50    71.47     0.18    0.05    0.00    0.05   0.04  18.40
nvme1n1           0.00     0.00    0.00  4198.00     0.00   146.50    71.47     0.18    0.04    0.00    0.04   0.04  17.20
nvme2n1           0.00     0.00    0.00  4126.00     0.00   146.08    72.51     0.15    0.04    0.00    0.04   0.03  14.00
nvme3n1           0.00     0.00    0.00  4125.00     0.00   146.03    72.50     0.15    0.04    0.00    0.04   0.03  14.40
md0               0.00     0.00    0.00 16022.00     0.00   292.53    37.39     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    18.00     0.00     0.07     8.44     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


Checkpoint

Device:         rrqm/s   wrqm/s     r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    29.00    1.00 96795.00     0.00  1074.52    22.73   133.13    1.38    4.00    1.38   0.01 100.00
nvme0n1           0.00     0.00    0.00  3564.00     0.00    56.71    32.59     0.12    0.03    0.00    0.03   0.03  11.60
nvme1n1           0.00     0.00    0.00  3564.00     0.00    56.71    32.59     0.12    0.03    0.00    0.03   0.03  12.00
nvme2n1           0.00     0.00    0.00  3884.00     0.00    59.12    31.17     0.14    0.04    0.00    0.04   0.04  13.60
nvme3n1           0.00     0.00    0.00  3884.00     0.00    59.12    31.17     0.13    0.03    0.00    0.03   0.03  12.80
md0               0.00     0.00    0.00 14779.00     0.00   115.80    16.05     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00     3.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    1.00 96830.00     0.00  1074.83    22.73   134.79    1.38    4.00    1.38   0.01 100.00
dm-4              0.00     0.00    0.00     0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


Thanks for your patience if you have read this far!

Regards

Mark



Re: 60 core performance with 9.3

From
Andres Freund
Date:
On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
> On 01/07/14 22:13, Andres Freund wrote:
> >On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
> >>- cherry picking the last 5 commits into 9.4 branch and building a package
> >>from that and retesting:
> >>
> >>Clients | 9.4 tps 60 cores (rwlock)
> >>--------+--------------------------
> >>6       |  70189
> >>12      | 128894
> >>24      | 233542
> >>48      | 422754
> >>96      | 590796
> >>192     | 630672
> >>
> >>Wow - that is more like it! Andres that is some nice work, we definitely owe
> >>you some beers for that :-) I am aware that I need to retest with an
> >>unpatched 9.4 src - as it is not clear from this data how much is due to
> >>Andres's patches and how much to the steady stream of 9.4 development. I'll
> >>post an update on that later, but figured this was interesting enough to
> >>note for now.
> >
> >Cool. That's what I like (and expect) to see :). I don't think unpatched
> >9.4 will show significantly different results than 9.3, but it'd be good
> >to validate that. If you do so, could you post the results in the
> >-hackers thread I just CCed you on? That'll help the work to get into
> >9.5.
>
> So we seem to have nailed read only performance. Going back and revisiting
> read write performance finds:
>
> Postgres 9.4 beta
> rwlock patch
> pgbench scale = 2000
>
> max_connections = 200;
> shared_buffers = "10GB";
> maintenance_work_mem = "1GB";
> effective_io_concurrency = 10;
> wal_buffers = "32MB";
> checkpoint_segments = 192;
> checkpoint_completion_target = 0.8;
>
> clients  | tps (32 cores) | tps
> ---------+----------------+---------
> 6        |   8313         |   8175
> 12       |  11012         |  14409
> 24       |  16151         |  17191
> 48       |  21153         |  23122
> 96       |  21977         |  22308
> 192      |  22917         |  23109

On that scale - that's bigger than shared_buffers IIRC - I'd not expect
the patch to make much of a difference.

> kernel.sched_autogroup_enabled=0
> kernel.sched_migration_cost_ns=5000000
> net.core.somaxconn=1024
> /sys/kernel/mm/transparent_hugepage/enabled [never]
>
> Full report http://paste.ubuntu.com/7777886/

> #
>      8.82%        postgres  [kernel.kallsyms]        [k]
> _raw_spin_lock_irqsave
>                   |
>                   --- _raw_spin_lock_irqsave
>                      |
>                      |--75.69%-- pagevec_lru_move_fn
>                      |          __lru_cache_add
>                      |          lru_cache_add
>                      |          putback_lru_page
>                      |          migrate_pages
>                      |          migrate_misplaced_page
>                      |          do_numa_page
>                      |          handle_mm_fault
>                      |          __do_page_fault
>                      |          do_page_fault
>                      |          page_fault

So, the majority of the time is spent in numa page migration. Can you
disable numa_balancing? I'm not sure if your kernel version does that at
runtime or whether you need to reboot.
The kernel.numa_balancing sysctl might work. Otherwise you probably need
to boot with numa_balancing=0.

It'd also be worthwhile to test this with numactl --interleave.
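
For example, something like this (a sketch - assuming the sysctl is present
on your 3.13 kernel):

sysctl -w kernel.numa_balancing=0
numactl --interleave=all pg_ctl -D $PGDATA start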

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 11/07/14 20:22, Andres Freund wrote:
> On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:

>> Postgres 9.4 beta
>> rwlock patch
>> pgbench scale = 2000
>>
> On that scale - that's bigger than shared_buffers IIRC - I'd not expect
> the patch to make much of a difference.
>

Right - we did test with it bigger (can't recall exactly how big), but
will retry again after setting the numa parameters below.

>> #
>>       8.82%        postgres  [kernel.kallsyms]        [k]
>> _raw_spin_lock_irqsave
>>                    |
>>                    --- _raw_spin_lock_irqsave
>>                       |
>>                       |--75.69%-- pagevec_lru_move_fn
>>                       |          __lru_cache_add
>>                       |          lru_cache_add
>>                       |          putback_lru_page
>>                       |          migrate_pages
>>                       |          migrate_misplaced_page
>>                       |          do_numa_page
>>                       |          handle_mm_fault
>>                       |          __do_page_fault
>>                       |          do_page_fault
>>                       |          page_fault
>
> So, the majority of the time is spent in numa page migration. Can you
> disable numa_balancing? I'm not sure if your kernel version does that at
> runtime or whether you need to reboot.
> The kernel.numa_balancing sysctl might work. Otherwise you probably need
> to boot with numa_balancing=0.
>
> It'd also be worthwhile to test this with numactl --interleave.
>

That was my feeling too - but I had no idea what the magic switch was to
tame it (appears to be in 3.13 kernels), will experiment and report
back. Thanks again!

Mark



Re: 60 core performance with 9.3

From
Kevin Grittner
Date:
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 11/07/14 20:22, Andres Freund wrote:

>> So, the majority of the time is spent in numa page migration.
>> Can you disable numa_balancing? I'm not sure if your kernel
>> version does that at runtime or whether you need to reboot.
>> The kernel.numa_balancing sysctl might work. Otherwise you
>> probably need to boot with numa_balancing=0.
>>
>> It'd also be worthwhile to test this with numactl --interleave.
>
> That was my feeling too - but I had no idea what the magic switch
> was to tame it (appears to be in 3.13 kernels), will experiment
> and report back. Thanks again!

It might be worth a test using a cpuset to interleave OS cache and
the NUMA patch I submitted to the current CF to see whether this is
getting into territory where the patch makes a bigger difference.
I would expect it to do much better than using numactl --interleave
because work_mem and other process-local memory would be allocated
in "near" memory for each process.

http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com
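
A minimal cpuset setup for the postmaster would look something like this
(sketch only - the mount point, node list and memory_spread_page setting are
assumptions for your box):

mkdir /dev/cpuset
mount -t cpuset none /dev/cpuset
mkdir /dev/cpuset/postgres
echo 0-59 > /dev/cpuset/postgres/cpus
echo 0-3 > /dev/cpuset/postgres/mems
echo 1 > /dev/cpuset/postgres/memory_spread_page   # spread OS cache over nodes
echo <postmaster pid> > /dev/cpuset/postgres/tasks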

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 11/07/14 20:22, Andres Freund wrote:
> On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
>> Full report http://paste.ubuntu.com/7777886/
>
>> #
>>       8.82%        postgres  [kernel.kallsyms]        [k]
>> _raw_spin_lock_irqsave
>>                    |
>>                    --- _raw_spin_lock_irqsave
>>                       |
>>                       |--75.69%-- pagevec_lru_move_fn
>>                       |          __lru_cache_add
>>                       |          lru_cache_add
>>                       |          putback_lru_page
>>                       |          migrate_pages
>>                       |          migrate_misplaced_page
>>                       |          do_numa_page
>>                       |          handle_mm_fault
>>                       |          __do_page_fault
>>                       |          do_page_fault
>>                       |          page_fault
>
> So, the majority of the time is spent in numa page migration. Can you
> disable numa_balancing? I'm not sure if your kernel version does that at
> runtime or whether you need to reboot.
> The kernel.numa_balancing sysctl might work. Otherwise you probably need
> to boot with numa_balancing=0.
>
> It'd also be worthwhile to test this with numactl --interleave.
>

Trying out with numa_balancing=0 seemed to get essentially the same
performance. Similarly for wrapping postgres startup with --interleave.

All this made me want to try with numa *really* disabled. So I rebooted
the box with "numa=off" appended to the kernel cmdline. Somewhat
surprisingly (to me anyway), the numbers were essentially identical. The
profile, however, is quite different:

Full report at http://paste.ubuntu.com/7806285/


     4.56%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--41.89%-- try_to_wake_up
               |          |
               |          |--96.12%-- default_wake_function
               |          |          |
               |          |          |--99.96%-- pollwake
               |          |          |          __wake_up_common
               |          |          |          __wake_up_sync_key
               |          |          |          sock_def_readable
               |          |          |          |
               |          |          |          |--99.94%-- unix_stream_sendmsg
               |          |          |          |          sock_sendmsg
               |          |          |          |          SYSC_sendto
               |          |          |          |          sys_sendto
               |          |          |          |          tracesys
               |          |          |          |          __libc_send
               |          |          |          |          pq_flush
               |          |          |          |          ReadyForQuery
               |          |          |          |          PostgresMain
               |          |          |          |          ServerLoop
               |          |          |          |          PostmasterMain
               |          |          |          |          main
               |          |          |          |          __libc_start_main
               |          |          |           --0.06%-- [...]
               |          |           --0.04%-- [...]
               |          |
               |          |--2.87%-- wake_up_process
               |          |          |
               |          |          |--95.71%-- wake_up_sem_queue_do
               |          |          |          SYSC_semtimedop
               |          |          |          sys_semop
               |          |          |          tracesys
               |          |          |          __GI___semop
               |          |          |          |
               |          |          |          |--99.75%-- LWLockRelease
               |          |          |          |          |
               |          |          |          |          |--25.09%-- RecordTransactionCommit
               |          |          |          |          |          CommitTransaction
               |          |          |          |          |          CommitTransactionCommand
               |          |          |          |          |          finish_xact_command.part.4
               |          |          |          |          |          PostgresMain
               |          |          |          |          |          ServerLoop
               |          |          |          |          |          PostmasterMain
               |          |          |          |          |          main
               |          |          |          |          |          __libc_start_main



regards

Mark



Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 12/07/14 01:19, Kevin Grittner wrote:
>
> It might be worth a test using a cpuset to interleave OS cache and
> the NUMA patch I submitted to the current CF to see whether this is
> getting into territory where the patch makes a bigger difference.
> I would expect it to do much better than using numactl --interleave
> because work_mem and other process-local memory would be allocated
> in "near" memory for each process.
>
> http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com
>

Thanks Kevin - I did try this out - seemed slightly better than using
--interleave, but almost identical to the results posted previously.

However looking at my postgres binary with ldd, I'm not seeing any link
to libnuma (despite it demanding the library whilst building), so I
wonder if my package build has somehow vanilla-ified the result :-(

Also I am guessing that with 60 cores I do:

$ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus"

i.e. cpus are cores, not packages...? If I've stuffed it up I'll redo!


Cheers

Mark


Re: 60 core performance with 9.3

From
Kevin Grittner
Date:
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 12/07/14 01:19, Kevin Grittner wrote:
>>
>> It might be worth a test using a cpuset to interleave OS cache and
>> the NUMA patch I submitted to the current CF to see whether this is
>> getting into territory where the patch makes a bigger difference.
>> I would expect it to do much better than using numactl --interleave
>> because work_mem and other process-local memory would be allocated
>> in "near" memory for each process.
>>
> http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com
>
> Thanks Kevin - I did try this out - seemed slightly better than using
> --interleave, but almost identical to the results posted previously.
>
> However looking at my postgres binary with ldd, I'm not seeing any link
> to libnuma (despite it demanding the library whilst building), so I
> wonder if my package build has somehow vanilla-ified the result :-(

That is odd; not sure what to make of that!

> Also I am guessing that with 60 cores I do:
>
> $ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus"
>
> i.e cpus are cores not packages...?

Right; basically, as a guide, you can use the output from:

$ numactl --hardware

Use the union of all the "cpu" numbers from the "node n cpus" lines.  The
above command is also a good way to see how unbalanced memory usage has
become while running a test.
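
The relevant lines look something like this (format illustrative only):

node 0 cpus: 0 1 2 ...
node 0 size: ... MB
node 0 free: ... MB
node 1 cpus: ...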

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 17/07/14 11:58, Mark Kirkwood wrote:

>
> Trying out with numa_balancing=0 seemed to get essentially the same
> performance. Similarly wrapping postgres startup with --interleave.
>
> All this made me want to try with numa *really* disabled. So rebooted
> the box with "numa=off" appended to the kernel cmdline. Somewhat
> surprisingly (to me anyway), the numbers were essentially identical. The
> profile, however is quite different:
>

A little more tweaking got some further improvement:

rwlocks patch as before

wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

LSI RAID adaptor: read ahead and write cache disabled (for SSD fast path mode)
numa_balancing = 0


Pgbench scale 2000 again:

clients  | tps (prev) |  tps (tweaked config)
---------+------------+---------
6        |   8175     |   8281
12       |  14409     |  15896
24       |  17191     |  19522
48       |  23122     |  29776
96       |  22308     |  32352
192      |  23109     |  28804


Now recall that we were seeing no actual tps change with numa_balancing=0 or
1 (so the improvement above is from the other changes), but we figured it
might be informative to try to track down what the non-numa bottlenecks
looked like. We tried profiling the entire 10 minute run, which showed
the stats collector as a possible source of contention:


      3.86%        postgres  [kernel.kallsyms]        [k] _raw_spin_lock_bh
                   |
                   --- _raw_spin_lock_bh
                      |
                      |--95.78%-- lock_sock_nested
                      |          udpv6_sendmsg
                      |          inet_sendmsg
                      |          sock_sendmsg
                      |          SYSC_sendto
                      |          sys_sendto
                      |          tracesys
                      |          __libc_send
                      |          |
                      |          |--99.17%-- pgstat_report_stat
                      |          |          PostgresMain
                      |          |          ServerLoop
                      |          |          PostmasterMain
                      |          |          main
                      |          |          __libc_start_main
                      |          |
                      |          |--0.77%-- pgstat_send_bgwriter
                      |          |          BackgroundWriterMain
                      |          |          AuxiliaryProcessMain
                      |          |          0x7f08efe8d453
                      |          |          reaper
                      |          |          __restore_rt
                      |          |          PostmasterMain
                      |          |          main
                      |          |          __libc_start_main
                      |           --0.07%-- [...]
                      |
                      |--2.54%-- __lock_sock
                      |          |
                      |          |--91.95%-- lock_sock_nested
                      |          |          udpv6_sendmsg
                      |          |          inet_sendmsg
                      |          |          sock_sendmsg
                      |          |          SYSC_sendto
                      |          |          sys_sendto
                      |          |          tracesys
                      |          |          __libc_send
                      |          |          |
                      |          |          |--99.73%-- pgstat_report_stat
                      |          |          |          PostgresMain
                      |          |          |          ServerLoop



Disabling track_counts and rerunning pgbench:

clients  | tps (no counts)
---------+------------
6        |    9806
12       |   18000
24       |   29281
48       |   43703
96       |   54539
192      |   36114


While these numbers look great in the middle range (12-96 clients), the
benefit looks to be tailing off as client numbers increase. Also, running
with no stats (and hence no autovacuum or analyze) is way too scary!

Trying out less write heavy workloads shows that the stats overhead does
not appear to be significant for *read* heavy cases, so this result
above is perhaps more of a curiosity than anything (given that read
heavy is more typical...and our real workload is more similar to read
heavy).

The profile for counts off looks like:

     4.79%  swapper  [kernel.kallsyms]  [k] read_hpet
            |
            --- read_hpet
               |
               |--97.10%-- ktime_get
               |          |
               |          |--35.24%-- clockevents_program_event
               |          |          tick_program_event
               |          |          |
               |          |          |--56.59%-- __hrtimer_start_range_ns
               |          |          |          |
               |          |          |          |--78.12%-- hrtimer_start_range_ns
               |          |          |          |          tick_nohz_restart
               |          |          |          |          tick_nohz_idle_exit
               |          |          |          |          cpu_startup_entry
               |          |          |          |          |
               |          |          |          |          |--98.84%-- start_secondary
               |          |          |          |          |
               |          |          |          |           --1.16%-- rest_init
               |          |          |          |                     start_kernel
               |          |          |          |                     x86_64_start_reservations
               |          |          |          |                     x86_64_start_kernel
               |          |          |          |
               |          |          |           --21.88%-- hrtimer_start
               |          |          |                     tick_nohz_stop_sched_tick
               |          |          |                     __tick_nohz_idle_enter
               |          |          |                     |
               |          |          |                     |--99.89%-- tick_nohz_idle_enter
               |          |          |                     |          cpu_startup_entry
               |          |          |                     |          |
               |          |          |                     |          |--98.30%-- start_secondary
               |          |          |                     |          |
               |          |          |                     |           --1.70%-- rest_init
               |          |          |                     |                     start_kernel
               |          |          |                     |                     x86_64_start_reservations
               |          |          |                     |                     x86_64_start_kernel
               |          |          |                      --0.11%-- [...]
               |          |          |
               |          |          |--40.25%-- hrtimer_force_reprogram
               |          |          |          __remove_hrtimer
               |          |          |          |
               |          |          |          |--89.68%-- __hrtimer_start_range_ns
               |          |          |          |          hrtimer_start
               |          |          |          |          tick_nohz_stop_sched_tick
               |          |          |          |          __tick_nohz_idle_enter
               |          |          |          |          |
               |          |          |          |          |--99.90%-- tick_nohz_idle_enter
               |          |          |          |          |          cpu_startup_entry
               |          |          |          |          |          |
               |          |          |          |          |          |--99.04%-- start_secondary
               |          |          |          |          |          |
               |          |          |          |          |           --0.96%-- rest_init
               |          |          |          |          |                     start_kernel
               |          |          |          |          |                     x86_64_start_reservations
               |          |          |          |          |                     x86_64_start_kernel
               |          |          |          |           --0.10%-- [...]



Any thoughts on how to proceed further would be appreciated!

Cheers,

Mark


Re: 60 core performance with 9.3

From
"Tomas Vondra"
Date:
On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>
> While these numbers look great in the middle range (12-96 clients), then
> benefit looks to be tailing off as client numbers increase. Also running
> with no stats (and hence no auto vacuum or analyze) is way too scary!

I assume you've disabled the statistics collector, which has nothing to do
with vacuum or analyze.

There are two kinds of statistics in PostgreSQL - data distribution
statistics (which are collected by ANALYZE and stored in actual tables
within the database) and runtime statistics (which are collected by the
stats collector and stored in a file somewhere on disk).

By disabling the statistics collector you lose the runtime counters - number
of sequential/index scans on a table, tuples read from a relation etc. But
it does not influence VACUUM or planning at all.

Also, it's mostly async (send over UDP and you're done) and shouldn't make
much difference unless you have a large number of objects. There are ways to
improve this (e.g. by placing the stat files on a tmpfs).
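
For example (paths are just an illustration; stats_temp_directory is the
relevant setting in 9.3):

# postgresql.conf
stats_temp_directory = '/run/postgresql/stats_temp'

# with that directory on a small tmpfs:
mount -t tmpfs -o size=64M tmpfs /run/postgresql/stats_temp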

Tomas



Re: 60 core performance with 9.3

From
Tom Lane
Date:
"Tomas Vondra" <tv@fuzzy.cz> writes:
> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>> While these numbers look great in the middle range (12-96 clients), then
>> benefit looks to be tailing off as client numbers increase. Also running
>> with no stats (and hence no auto vacuum or analyze) is way too scary!

> By disabling statistics collector you loose runtime counters - number of
> sequential/index scans on a table, tuples read from a relation aetc. But
> it does not influence VACUUM or planning at all.

It does break autovacuum.

            regards, tom lane


Re: 60 core performance with 9.3

From
"Tomas Vondra"
Date:
On 30 Červenec 2014, 14:39, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> On 30 ??ervenec 2014, 3:44, Mark Kirkwood wrote:
>>> While these numbers look great in the middle range (12-96 clients),
>>> then
>>> benefit looks to be tailing off as client numbers increase. Also
>>> running
>>> with no stats (and hence no auto vacuum or analyze) is way too scary!
>
>> By disabling statistics collector you loose runtime counters - number of
>> sequential/index scans on a table, tuples read from a relation aetc. But
>> it does not influence VACUUM or planning at all.
>
> It does break autovacuum.

Of course, you're right. It throws away the info about how much data was
modified and when the table was last (auto)vacuumed.

This is clear proof that I really need to drink at least one cup of
coffee in the morning before doing anything.

Tomas



Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
Hi Tomas,

Unfortunately I think you are mistaken - disabling the stats collector
(i.e. track_counts = off) means that autovacuum has no idea about
when/if it needs to start a worker (as it uses those counts to decide),
and hence you lose all automatic vacuum and analyze as a result.

With respect to comments like "it shouldn't make much difference" etc,
well, the profile suggests otherwise, and the change in tps numbers
supports the observation.

regards

Mark

On 30/07/14 20:42, Tomas Vondra wrote:
> On 30 July 2014, 3:44, Mark Kirkwood wrote:
>>
>> While these numbers look great in the middle range (12-96 clients), then
>> benefit looks to be tailing off as client numbers increase. Also running
>> with no stats (and hence no auto vacuum or analyze) is way too scary!
>
> I assume you've disabled the statistics collector, which has nothing to do
> with vacuum or analyze.
>
> There are two kinds of statistics in PostgreSQL - data distribution
> statistics (which are collected by ANALYZE and stored in actual tables
> within the database) and runtime statistics (which are collected by the
> stats collector and stored in a file somewhere on the disk).
>
> By disabling the statistics collector you lose runtime counters - number of
> sequential/index scans on a table, tuples read from a relation, etc. But
> it does not influence VACUUM or planning at all.
>
> Also, it's mostly async (send over UDP and you're done) and shouldn't make
> much difference unless you have a large number of objects. There are ways to
> improve this (e.g. by placing the stat files into a tmpfs).
>
> Tomas
>



Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 31/07/14 00:47, Tomas Vondra wrote:
> On 30 July 2014, 14:39, Tom Lane wrote:
>> "Tomas Vondra" <tv@fuzzy.cz> writes:
>>> On 30 July 2014, 3:44, Mark Kirkwood wrote:
>>>> While these numbers look great in the middle range (12-96 clients),
>>>> then
>>>> benefit looks to be tailing off as client numbers increase. Also
>>>> running
>>>> with no stats (and hence no auto vacuum or analyze) is way too scary!
>>
>>> By disabling the statistics collector you lose runtime counters - number of
>>> sequential/index scans on a table, tuples read from a relation, etc. But
>>> it does not influence VACUUM or planning at all.
>>
>> It does break autovacuum.
>
> Of course, you're right. It throws away info about how much data was
> modified and when the table was last (auto)vacuumed.
>
> This is clear proof that I really need to drink at least one cup of
> coffee before doing anything in the morning.
>

Lol - thanks for taking a look anyway. Yes, coffee is often an important
part of the exercise.

Regards

Mark



Re: 60 core performance with 9.3

From
Matt Clarkson
Date:
I've been assisting Mark with the benchmarking of these new servers.

The drop-off in both throughput and CPU utilisation that we've been
observing as the client count increases has led me to investigate which
lwlocks are dominant at different client counts.

I've recompiled postgres with Andres' LWLock improvements, Kevin's libnuma
patch, and with LWLOCK_STATS enabled.
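
For reference, LWLOCK_STATS is a compile-time switch rather than a GUC; one
way to enable it is shown below (the configure options are placeholders, not
what we actually used):

  ./configure --prefix=$HOME/pg-lwstats
  make COPT='-DLWLOCK_STATS' && make install

The per-lock counters are then written to the server log as each backend
exits.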

The LWLOCK_STATS below suggest that ProcArrayLock might be the main
source of locking that's causing throughput to take a dive as the client
count increases beyond the core count.


wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

pgbench -s 2000 -T 600


Results:

 clients |  tps
---------+---------
     6   |  9490
    12   | 17558
    24   | 25681
    48   | 41175
    96   | 48954
   192   | 31887
   384   | 15564



LWLOCK_STATS at 48 clients

  Lock              |    Blk   | SpinDelay | Blk % | SpinDelay %
--------------------+----------+-----------+-------+-------------
 BufFreelistLock    |  31144   |      11   |  1.64 |   1.62
 ShmemIndexLock     |    192   |       1   |  0.01 |   0.15
 OidGenLock         |  32648   |      14   |  1.72 |   2.06
 XidGenLock         |  35731   |      18   |  1.88 |   2.64
 ProcArrayLock      | 291121   |     215   | 15.36 |  31.57
 SInvalReadLock     |  32136   |      13   |  1.70 |   1.91
 SInvalWriteLock    |  32141   |      12   |  1.70 |   1.76
 WALBufMappingLock  |  31662   |      15   |  1.67 |   2.20
 WALWriteLock       | 825380   |      45   | 36.31 |   6.61
 CLogControlLock    | 583458   |     337   | 26.93 |  49.49



LWLOCK_STATS at 96 clients

  Lock              |    Blk   | SpinDelay | Blk % | SpinDelay %
--------------------+----------+-----------+-------+-------------
 BufFreelistLock    |   62954  |      12   |  1.54 |   0.27
 ShmemIndexLock     |   62635  |       4   |  1.54 |   0.09
 OidGenLock         |   92232  |      22   |  2.26 |   0.50
 XidGenLock         |   98326  |      18   |  2.41 |   0.41
 ProcArrayLock      |  928871  |    3188   | 22.78 |  72.57
 SInvalReadLock     |   58392  |      13   |  1.43 |   0.30
 SInvalWriteLock    |   57429  |      14   |  1.41 |   0.32
 WALBufMappingLock  |  138375  |      14   |  3.39 |   0.32
 WALWriteLock       | 1480707  |      42   | 36.31 |   0.96
 CLogControlLock    | 1098239  |    1066   | 26.93 |  27.27



LWLOCK_STATS at 384 clients

  Lock              |    Blk   | SpinDelay | Blk % | SpinDelay %
--------------------+----------+-----------+-------+-------------
 BufFreelistLock    |  184298  |     158   |  1.93 |   0.03
 ShmemIndexLock     |  183573  |     164   |  1.92 |   0.03
 OidGenLock         |  184558  |     173   |  1.93 |   0.03
 XidGenLock         |  200239  |     213   |  2.09 |   0.04
 ProcArrayLock      | 4035527  |  579666   | 42.22 |  98.62
 SInvalReadLock     |  182204  |     152   |  1.91 |   0.03
 SInvalWriteLock    |  182898  |     137   |  1.91 |   0.02
 WALBufMappingLock  |  219936  |     215   |  2.30 |   0.04
 WALWriteLock       | 3172725  |     457   | 24.67 |   0.08
 CLogControlLock    | 1012458  |    6423   | 10.59 |   1.09


The same test done with a read-only workload shows virtually no SpinDelay
at all.


Any thoughts or comments on these results are welcome!


Regards,
Matt.





Re: 60 core performance with 9.3

From
Alvaro Herrera
Date:
Matt Clarkson wrote:

> The LWLOCK_STATS below suggest that ProcArrayLock might be the main
> source of locking that's causing throughput to take a dive as the client
> count increases beyond the core count.

> Any thoughts or comments on these results are welcome!

Do these results change if you use Heikki's patch for CSN-based
snapshots?  See
http://www.postgresql.org/message-id/539AD153.9000004@vmware.com for the
patch (but note that you need to apply it on top of 89cf2d52030 in the
master branch -- maybe it applies to HEAD of the 9.4 branch, but I didn't
try).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 01/08/14 09:38, Alvaro Herrera wrote:
> Matt Clarkson wrote:
>
>> The LWLOCK_STATS below suggest that ProcArrayLock might be the main
>> source of locking that's causing throughput to take a dive as the client
>> count increases beyond the core count.
>
>> Any thoughts or comments on these results are welcome!
>
> Do these results change if you use Heikki's patch for CSN-based
> snapshots?  See
> http://www.postgresql.org/message-id/539AD153.9000004@vmware.com for the
> patch (but note that you need to apply it on top of 89cf2d52030 in the
> master branch -- maybe it applies to HEAD of the 9.4 branch, but I didn't
> try).
>

Hi Alvaro,

Applying the CSN patch on top of the rwlock + numa patches in 9.4 (a bit of
a patch-fest we have here now) shows a modest improvement at the highest
client counts (but appears to hurt performance in the mid range):

 clients |  tps
---------+--------
      6  |  8445
     12  | 14548
     24  | 20043
     48  | 27451
     96  | 27718
    192  | 23614
    384  | 24737


Initial runs were quite disappointing, until we moved the csnlog directory
onto the same filesystem that the xlogs are on (PCIe SSD). We could
potentially look at locating them on their own separate volume if that makes
sense.
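
For anyone wanting to do the same, a symlink move along these lines should
work (the directory name and mount point here are assumptions on my part):

  pg_ctl -D $PGDATA stop
  mv $PGDATA/pg_csnlog /mnt/nvme/pg_csnlog
  ln -s /mnt/nvme/pg_csnlog $PGDATA/pg_csnlog
  pg_ctl -D $PGDATA start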

Adding in LWLOCK stats again shows quite a different picture from the
previous:

48 clients

   Lock              |    Blk   | SpinDelay | Blk %     | SpinDelay %
--------------------+----------+-----------+-----------+-------------
WALWriteLock        | 25426001 | 1239      | 62.227442 | 14.373550
CLogControlLock     |  1793739 | 1376      |  4.389986 | 15.962877
ProcArrayLock       |  1007765 | 1305      |  2.466398 | 15.139211
CSNLogControlLock   |  609556  | 1722      |  1.491824 | 19.976798
WALInsertLocks 4    |  994170  |  247      |  2.433126 |  2.865429
WALInsertLocks 7    |  983497  |  243      |  2.407005 |  2.819026
WALInsertLocks 5    |  993068  |  239      |  2.430429 |  2.772622
WALInsertLocks 3    |  991446  |  229      |  2.426459 |  2.656613
WALInsertLocks 0    |  964185  |  235      |  2.359741 |  2.726218
WALInsertLocks 1    |  995237  |  221      |  2.435737 |  2.563805
WALInsertLocks 2    |  997593  |  213      |  2.441503 |  2.470998
WALInsertLocks 6    |  978178  |  201      |  2.393987 |  2.331787
BufFreelistLock     |  887194  |  206      |  2.171313 |  2.389791
XidGenLock          |  327385  |  366      |  0.801240 |  4.245940
CheckpointerCommLock|  104754  |  151      |  0.256374 |  1.751740
WALBufMappingLock   |  274226  |    7      |  0.671139 |  0.081206


96 clients

   Lock              |    Blk   | SpinDelay | Blk %     | SpinDelay %
--------------------+----------+-----------+-----------+-------------
WALWriteLock        | 30097625 |  9616     | 48.550747 | 19.068393
CLogControlLock     |  3193429 | 13490     | 5.151349  | 26.750481
ProcArrayLock       |  2007103 | 11754     | 3.237676  | 23.308017
CSNLogControlLock   |  1303172 |  5022     | 2.102158  |  9.958556
BufFreelistLock     |  1921625 |  1977     | 3.099790  |  3.920363
WALInsertLocks 0    |  2011855 |   681     | 3.245341  |  1.350413
WALInsertLocks 5    |  1829266 |   627     | 2.950805  |  1.243332
WALInsertLocks 7    |  1806966 |   632     | 2.914833  |  1.253247
WALInsertLocks 4    |  1847372 |   591     | 2.980012  |  1.171945
WALInsertLocks 1    |  1948553 |   557     | 3.143228  |  1.104523
WALInsertLocks 6    |  1818717 |   582     | 2.933789  |  1.154098
WALInsertLocks 3    |  1873964 |   552     | 3.022908  |  1.094608
WALInsertLocks 2    |  1912007 |   523     | 3.084276  |  1.037102
XidGenLock          |   512521 |   699     | 0.826752  |  1.386107
CheckpointerCommLock|   386853 |   711     | 0.624036  |  1.409903
WALBufMappingLock   |   546462 |    65     | 0.881503  |  0.128894


384 clients

   Lock              |    Blk   | SpinDelay | Blk %     | SpinDelay %
--------------------+----------+-----------+-----------+-------------
WALWriteLock        | 20703796 |  87265    | 27.749961 | 15.360068
CLogControlLock     |  3273136 | 122616    |  4.387089 | 21.582422
ProcArrayLock       |  3969918 | 100730    |  5.321008 | 17.730128
CSNLogControlLock   |  3191989 | 115068    |  4.278325 | 20.253851
BufFreelistLock     |  2014218 |  27952    |  2.699721 |  4.920009
WALInsertLocks 0    |  2750082 |   5438    |  3.686023 |  0.957177
WALInsertLocks 1    |  2584155 |   5312    |  3.463626 |  0.934999
WALInsertLocks 2    |  2477782 |   5497    |  3.321051 |  0.967562
WALInsertLocks 4    |  2375977 |   5441    |  3.184598 |  0.957705
WALInsertLocks 5    |  2349769 |   5458    |  3.149471 |  0.960697
WALInsertLocks 6    |  2329982 |   5367    |  3.122950 |  0.944680
WALInsertLocks 3    |  2415965 |   4771    |  3.238195 |  0.839774
WALInsertLocks 7    |  2316144 |   4930    |  3.104402 |  0.867761
CheckpointerCommLock|   584419 |  10794    |  0.783316 |  1.899921
XidGenLock          |   391212 |   6963    |  0.524354 |  1.225602
WALBufMappingLock   |   484693 |     83    |  0.649650 |  0.014609



So we're seeing delay coming fairly equally from 5 lwlocks.

Thanks again - any other suggestions welcome!

Cheers

Mark


Re: 60 core performance with 9.3

From
Josh Berkus
Date:
Mark,

Is the 60-core machine using some of the Intel chips which have 20
hyperthreaded virtual cores?

If so, I've been seeing some performance issues on these processors.
I'm currently doing a side-by-side hyperthreading on/off test.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 15/08/14 06:18, Josh Berkus wrote:
> Mark,
>
> Is the 60-core machine using some of the Intel chips which have 20
> hyperthreaded virtual cores?
>
> If so, I've been seeing some performance issues on these processors.
> I'm currently doing a side-by-side hyperthreading on/off test.
>

Hi Josh,

The board has 4 sockets with E7-4890 v2 cpus. They have 15 cores/30
threads. We're running with hyperthreading off (we noticed the usual
steep/sudden scaling dropoff with it on).

What model are your 20-core cpus?

Cheers

Mark






Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
Mark, all:

So, this is pretty damning:

Read-only test with HT ON:

[pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 30
query mode: simple
number of clients: 20
number of threads: 4
duration: 600 s
number of transactions actually processed: 47167533
tps = 78612.471802 (including connections establishing)
tps = 78614.604352 (excluding connections establishing)

Read-only test with HT Off:

[pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 30
query mode: simple
number of clients: 20
number of threads: 4
duration: 600 s
number of transactions actually processed: 82457739
tps = 137429.508196 (including connections establishing)
tps = 137432.893796 (excluding connections establishing)


On a read-write test, it's 10% faster with HT off as well.

Further, from their production machine we've seen that having HT on
causes the machine to slow down by 5X whenever you get more than 40
cores (as in 100% of real cores or 50% of HT cores) worth of activity.

So we're definitely back to "If you're using PostgreSQL, turn off
Hyperthreading".

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Shaun Thomas
Date:
On 08/20/2014 02:13 PM, Josh Berkus wrote:

> So we're definitely back to "If you're using PostgreSQL, turn off
> Hyperthreading".

That's so strange. Back when I did my Nehalem tests, we got a very
strong 30%+ increase by enabling HT. We only got a hit when we turned
off turbo, or forgot to disable power saving features.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 21/08/14 07:13, Josh Berkus wrote:
> Mark, all:
>
> So, this is pretty damning:
>
> Read-only test with HT ON:
>
> [pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
> starting vacuum...end.
> transaction type: SELECT only
> scaling factor: 30
> query mode: simple
> number of clients: 20
> number of threads: 4
> duration: 600 s
> number of transactions actually processed: 47167533
> tps = 78612.471802 (including connections establishing)
> tps = 78614.604352 (excluding connections establishing)
>
> Read-only test with HT Off:
>
> [pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
> starting vacuum...end.
> transaction type: SELECT only
> scaling factor: 30
> query mode: simple
> number of clients: 20
> number of threads: 4
> duration: 600 s
> number of transactions actually processed: 82457739
> tps = 137429.508196 (including connections establishing)
> tps = 137432.893796 (excluding connections establishing)
>
>
> On a read-write test, it's 10% faster with HT off as well.
>
> Further, from their production machine we've seen that having HT on
> causes the machine to slow down by 5X whenever you get more than 40
> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>
> So we're definitely back to "If you're using PostgreSQL, turn off
> Hyperthreading".
>


Hmm - that is interesting - I don't think we compared read-only scaling
for hyperthreading on and off (only read-write). You didn't mention what
cpu this is for (or how many sockets etc), would be useful to know.

Notwithstanding the above results, my workmate Matt made an interesting
observation: the scaling graph for (our) 60 core box (HT off), looks
just like the one for our 32 core box with HT *on*.

We are wondering if a lot of the previous analysis of HT performance
regressions should actually be reevaluated in the light of ...err is it
just that we have a lot more cores...? [1]

Regards

Mark

[1] Particularly as in *some* cases (single socket i7 for instance) HT
on seems to scale fine.


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Peter Geoghegan
Date:
On Wed, Aug 20, 2014 at 1:36 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> That's so strange. Back when I did my Nehalem tests, we got a very strong
> 30%+ increase by enabling HT. We only got a hit when we turned off turbo, or
> forgot to disable power saving features.

In my experience, it is crucially important to consider power saving
features in most benchmarks these days, where that might not have been
true a few years ago. The CPU scaling governor can alter the outcome
of many benchmarks quite significantly.
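
For instance, checking and pinning the governor looks roughly like this
(whether the sysfs paths exist, or whether you use cpupower instead, varies
by distro and kernel):

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done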

--
Regards,
Peter Geoghegan


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Bruce Momjian
Date:
On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
> On a read-write test, it's 10% faster with HT off as well.
>
> Further, from their production machine we've seen that having HT on
> causes the machine to slow down by 5X whenever you get more than 40
> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>
> So we're definitely back to "If you're using PostgreSQL, turn off
> Hyperthreading".

Not sure how you can make such a blanket statement when so many people
have tested and shown the benefits of hyper-threading.  I am also
unclear exactly what you tested, as I didn't see it mentioned in the
email --- CPU type, CPU count, and operating system would be the minimal
information required.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Tatsuo Ishii
Date:
> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
>> On a read-write test, it's 10% faster with HT off as well.
>>
>> Further, from their production machine we've seen that having HT on
>> causes the machine to slow down by 5X whenever you get more than 40
>> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>>
>> So we're definitely back to "If you're using PostgreSQL, turn off
>> Hyperthreading".
>
> Not sure how you can make such a blanket statement when so many people
> have tested and shown the benefits of hyper-threading.  I am also
> unclear exactly what you tested, as I didn't see it mentioned in the
> email --- CPU type, CPU count, and operating system would be the minimal
> information required.

HT off is common knowledge for better benchmarking results, at least
for me. I've never seen a better result with HT on, except on POWER.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 21/08/14 11:14, Mark Kirkwood wrote:
>
> You didn't mention what
> cpu this is for (or how many sockets etc), would be useful to know.
>

Just to clarify - while you mentioned that the production system was 40
cores, it wasn't immediately obvious that the same system was the source
of the measurements you posted...sorry if I'm being a mixture of
pedantic and dense - just trying to make sure it is clear what
systems/cpus etc we are talking about (with this in mind it never hurts
to quote cpu and mobo model numbers)!

Cheers

Mark




Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Shaun Thomas
Date:
On 08/20/2014 06:14 PM, Mark Kirkwood wrote:

> Notwithstanding the above results, my workmate Matt made an interesting
> observation: the scaling graph for (our) 60 core box (HT off), looks
> just like the one for our 32 core box with HT *on*.

Hmm. I know this sounds stupid and unlikely, but has anyone actually
tested PostgreSQL on a system with more than 64 legitimate cores? The
work Robert Haas did to fix the CPU locking way back when showed
significant improvements up to 64, but so far as I know, nobody really
tested beyond that.

I seem to remember similar choking effects when pre-9.2 systems
encountered high CPU counts. I somehow doubt Intel would allow their HT
architecture to regress so badly from Nehalem, which is almost
3 generations old at this point. This smells like something in the
software stack, up to and including the Linux kernel.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
>> On a read-write test, it's 10% faster with HT off as well.
>>
>> Further, from their production machine we've seen that having HT on
>> causes the machine to slow down by 5X whenever you get more than 40
>> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>>
>> So we're definitely back to "If you're using PostgreSQL, turn off
>> Hyperthreading".
>
> Not sure how you can make such a blanket statement when so many people
> have tested and shown the benefits of hyper-threading.

Actually, I don't know that anyone has posted the benefits of HT.  Link?
 I want to compare results so that we can figure out what's different
between my case and theirs.  Also, it makes a big difference if there is
an advantage to turning HT on for some workloads.

> I am also
> unclear exactly what you tested, as I didn't see it mentioned in the
> email --- CPU type, CPU count, and operating system would be the minimal
> information required.

Ooops!  I thought I'd posted that earlier, but I didn't.

The processors in question are the Intel(R) Xeon(R) CPU E7-4850, with 4
of them for a total of 40 cores or 80 HT cores.

OS is RHEL with 2.6.32-431.3.1.el6.x86_64.

I've emailed a kernel hacker who works at Intel for comment; for one
thing, I'm wondering if the older kernel version is a problem for a
system like this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Bruce Momjian
Date:
On Thu, Aug 21, 2014 at 02:02:26PM -0700, Josh Berkus wrote:
> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> > On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
> >> On a read-write test, it's 10% faster with HT off as well.
> >>
> >> Further, from their production machine we've seen that having HT on
> >> causes the machine to slow down by 5X whenever you get more than 40
> >> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
> >>
> >> So we're definitely back to "If you're using PostgreSQL, turn off
> >> Hyperthreading".
> >
> > Not sure how you can make such a blanket statement when so many people
> > have tested and shown the benefits of hyper-threading.
>
> Actually, I don't know that anyone has posted the benefits of HT.  Link?
>  I want to compare results so that we can figure out what's different
> between my case and theirs.  Also, it makes a big difference if there is
> an advantage to turning HT on for some workloads.

I had Greg Smith test my system when it was installed, and he
recommended hyper-threading.  The system is Debian Squeeze
(2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
On 08/21/2014 02:11 PM, Bruce Momjian wrote:
> On Thu, Aug 21, 2014 at 02:02:26PM -0700, Josh Berkus wrote:
>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
>>>> On a read-write test, it's 10% faster with HT off as well.
>>>>
>>>> Further, from their production machine we've seen that having HT on
>>>> causes the machine to slow down by 5X whenever you get more than 40
>>>> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>>>>
>>>> So we're definitely back to "If you're using PostgreSQL, turn off
>>>> Hyperthreading".
>>>
>>> Not sure how you can make such a blanket statement when so many people
>>> have tested and shown the benefits of hyper-threading.
>>
>> Actually, I don't know that anyone has posted the benefits of HT.  Link?
>>  I want to compare results so that we can figure out what's different
>> between my case and theirs.  Also, it makes a big difference if there is
>> an advantage to turning HT on for some workloads.
>
> I had Greg Smith test my system when it was installed, and he
> recommended hyper-threading.  The system is Debian Squeeze
> (2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.

Can you post some numerical results?

I'm serious.  It's obviously easier for our users if we can blanket
recommend turning HT off; that's a LOT easier for them than "you might
want to turn HT off if these conditions ...".  So I want to establish
that HT is sometimes a benefit, if indeed it is.

I personally have never seen HT be a benefit.  I've seen it be harmless
(most of the time) but never beneficial.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Bruce Momjian
Date:
On Thu, Aug 21, 2014 at 02:17:13PM -0700, Josh Berkus wrote:
> >> Actually, I don't know that anyone has posted the benefits of HT.  Link?
> >>  I want to compare results so that we can figure out what's different
> >> between my case and theirs.  Also, it makes a big difference if there is
> >> an advantage to turning HT on for some workloads.
> >
> > I had Greg Smith test my system when it was installed, and he
> > recommended hyper-threading.  The system is Debian Squeeze
> > (2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.
>
> Can you post some numerical results?
>
> I'm serious.  It's obviously easier for our users if we can blanket
> recommend turning HT off; that's a LOT easier for them than "you might
> want to turn HT off if these conditions ...".  So I want to establish
> that HT is a benefit sometimes if it is.
>
> I personally have never seen HT be a benefit.  I've seen it be harmless
> (most of the time) but never beneficial.

I know that when hyperthreading was introduced it was mostly a
negative, but then it was improved, and it might have gotten bad
again.  I am afraid the results depend on the type of CPU, so I am not
sure we can give a general answer.

I know I asked Greg Smith, and I assume he would know.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Scott Marlowe
Date:
On Thu, Aug 21, 2014 at 3:02 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>
>> I am also
>> unclear exactly what you tested, as I didn't see it mentioned in the
>> email --- CPU type, CPU count, and operating system would be the minimal
>> information required.
>
> Ooops!  I thought I'd posted that earlier, but I didn't.
>
> The processors in question are the Intel(R) Xeon(R) CPU E7-4850, with 4
> of them for a total of 40 cores or 80 HT cores.
>
> OS is RHEL with 2.6.32-431.3.1.el6.x86_64.

I'm running almost the exact same setup in production as a spare. It
has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
it's a spare node I might be able to do some testing on it as well.
It's running a 3.2 kernel right now. I could probably get a later
model kernel on it even.

--
To understand recursion, one must first understand recursion.


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Scott Marlowe
Date:
On Thu, Aug 21, 2014 at 3:26 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Aug 21, 2014 at 3:02 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>
>>> I am also
>>> unclear exactly what you tested, as I didn't see it mentioned in the
>>> email --- CPU type, CPU count, and operating system would be the minimal
>>> information required.
>>
>> Ooops!  I thought I'd posted that earlier, but I didn't.
>>
>> The processors in question are the Intel(R) Xeon(R) CPU E7-4850, with 4
>> of them for a total of 40 cores or 80 HT cores.
>>
>> OS is RHEL with 2.6.32-431.3.1.el6.x86_64.
>
> I'm running almost the exact same setup in production as a spare. It
> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
> it's a spare node I might be able to do some testing on it as well.
> It's running a 3.2 kernel right now. I could probably get a later
> model kernel on it even.
>
> --
> To understand recursion, one must first understand recursion.

To update this last post, the machine I have is running Ubuntu 12.04.1
right now, and I have kernels 3.2, 3.5, 3.8, 3.11, and 3.13 available
to put on it. We're looking at removing it from our current production
cluster, so I could likely do all kinds of crazy tests on it.


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
"Graeme B. Bell"
Date:
> HT off is common knowledge for better benchmarking results

It's wise to use the qualifier 'for better benchmarking results'.

It's worth keeping in mind here that a benchmark is not the same as normal
production use.

For example, where I work we do lots of long-running queries in parallel over
a big range of datasets rather than many short-term transactions as fast as
possible. Our biggest DB server is also used for GDAL work and R at the same
time*. Pretty far from pgbench; not everyone is constrained by locks.

I suppose that if your code is basically N copies of the same function,
hyper-threading isn't likely to help much, because it was introduced to allow
different parts of the processor to be used in parallel when you're running
heterogeneous code.

But if you're hammering just one part of the CPU... well, adding another
layer of logical complexity for your CPU to manage probably isn't going to do
much good.

Should HT be on or off when you're running 64 very mixed types of long-term
queries which involve variously either heavy use of real number calculations
or e.g. logic/string handling, and different data sets? It's a much more
complex question than simply maxing out your pgbench scores.

I don't have the data now unfortunately, but I remember seeing a benefit for
HT on our 4-core E3 when running GDAL/PostGIS work in parallel last year.
It's not surprising though; the GDAL calls are almost certainly using
different functions of the processor compared to postgres and there should be
very little lock contention. In light of this interesting data I'm now
leaning towards proposing HT off for our mapservers (which receive short,
similar requests over and over), but for the heterogeneous servers, I think
I'll keep it on for now.

Graeme.



* unrelated. There are also huge advantages for us in keeping these different
programs running on the same machine, since we found we can get much better
transfer rates through unix sockets than with TCP over the network.
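
Concretely that just means pointing clients at the socket directory instead
of a hostname - the values below are illustrative:

  psql "host=/var/run/postgresql dbname=gis"   # local unix-domain socket
  psql "host=db1.example.com dbname=gis"       # TCP over the network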

Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
On 08/21/2014 02:26 PM, Scott Marlowe wrote:
> I'm running almost the exact same setup in production as a spare. It
> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
> it's a spare node I might be able to do some testing on it as well.
> It's running a 3.2 kernel right now. I could probably get a later
> model kernel on it even.

You know about the IO performance issues with 3.2, yes?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Steve Crawford
Date:
On 08/21/2014 03:51 PM, Josh Berkus wrote:
> On 08/21/2014 02:26 PM, Scott Marlowe wrote:
>> I'm running almost the exact same setup in production as a spare. It
>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
>> it's a spare node I might be able to do some testing on it as well.
>> It's running a 3.2 kernel right now. I could probably get a later
>> model kernel on it even.
> You know about the IO performance issues with 3.2, yes?
>
Were those 3.2 only and since fixed or are there issues persisting in
3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13.

Cheers,
Steve



Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
On 08/21/2014 04:08 PM, Steve Crawford wrote:
> On 08/21/2014 03:51 PM, Josh Berkus wrote:
>> On 08/21/2014 02:26 PM, Scott Marlowe wrote:
>>> I'm running almost the exact same setup in production as a spare. It
>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
>>> it's a spare node I might be able to do some testing on it as well.
>>> It's running a 3.2 kernel right now. I could probably get a later
>>> model kernel on it even.
>> You know about the IO performance issues with 3.2, yes?
>>
> Were those 3.2 only and since fixed or are there issues persisting in
> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13.

The issues I know of were fixed in 3.9.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 22/08/14 11:29, Josh Berkus wrote:
> On 08/21/2014 04:08 PM, Steve Crawford wrote:
>> On 08/21/2014 03:51 PM, Josh Berkus wrote:
>>> On 08/21/2014 02:26 PM, Scott Marlowe wrote:
>>>> I'm running almost the exact same setup in production as a spare. It
>>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
>>>> it's a spare node I might be able to do some testing on it as well.
>>>> It's running a 3.2 kernel right now. I could probably get a later
>>>> model kernel on it even.
>>> You know about the IO performance issues with 3.2, yes?
>>>
>> Were those 3.2 only and since fixed or are there issues persisting in
>> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13.
>
> The issues I know of were fixed in 3.9.
>

There is a 3.11 kernel series for Ubuntu 12.04 Precise.

Regards

Mark


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
"Joshua D. Drake"
Date:
On 08/21/2014 04:29 PM, Josh Berkus wrote:
>
> On 08/21/2014 04:08 PM, Steve Crawford wrote:
>> On 08/21/2014 03:51 PM, Josh Berkus wrote:
>>> On 08/21/2014 02:26 PM, Scott Marlowe wrote:
>>>> I'm running almost the exact same setup in production as a spare. It
>>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
>>>> it's a spare node I might be able to do some testing on it as well.
>>>> It's running a 3.2 kernel right now. I could probably get a later
>>>> model kernel on it even.
>>> You know about the IO performance issues with 3.2, yes?
>>>
>> Were those 3.2 only and since fixed or are there issues persisting in
>> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13.
>
> The issues I know of were fixed in 3.9.
>

Correct. If you run trusty backports you are good to go.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
              not be surprised when they come back as Romans."


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Scott Marlowe
Date:
On Thu, Aug 21, 2014 at 5:29 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 08/21/2014 04:08 PM, Steve Crawford wrote:
>> On 08/21/2014 03:51 PM, Josh Berkus wrote:
>>> On 08/21/2014 02:26 PM, Scott Marlowe wrote:
>>>> I'm running almost the exact same setup in production as a spare. It
>>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since
>>>> it's a spare node I might be able to do some testing on it as well.
>>>> It's running a 3.2 kernel right now. I could probably get a later
>>>> model kernel on it even.
>>> You know about the IO performance issues with 3.2, yes?
>>>
>> Were those 3.2 only and since fixed or are there issues persisting in
>> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13.
>
> The issues I know of were fixed in 3.9.
>
I thought they were fixed in 3.8.something? We're running 3.8 on our
production servers but IO is not an issue for us.


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Shaun Thomas
Date:
On 08/22/2014 01:37 AM, Scott Marlowe wrote:

> I thought they were fixed in 3.8.something? We're running 3.8 on our
> production servers but IO is not an issue for us.

Yeah. 3.8 fixed a ton of issues that were plaguing us. There were still
a couple patches I wanted that didn't get in until 3.11+, but the worst
of the behavior was solved before that.

Bugs in kernel cache page aging algorithms are bad, m'kay?

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Andres Freund
Date:
On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> > Not sure how you can make such a blanket statement when so many people
> > have tested and shown the benefits of hyper-threading.
>
> Actually, I don't know that anyone has posted the benefits of HT.
> Link?

There's definitely cases where it can help. But it's highly workload
*and* hardware dependent.

> OS is RHEL with 2.6.32-431.3.1.el6.x86_64.
>
> I've emailed a kernel hacker who works at Intel for comment; for one
> thing, I'm wondering if the older kernel version is a problem for a
> system like this.

I'm not sure if it has been backported by redhat, but there
definitely have been significant improvements in SMT-aware scheduling
since vanilla 2.6.32.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Josh Berkus
Date:
On 08/22/2014 07:02 AM, Andres Freund wrote:
> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>> Not sure how you can make such a blanket statement when so many people
>>> have tested and shown the benefits of hyper-threading.
>>
>> Actually, I don't know that anyone has posted the benefits of HT.
>> Link?
>
> There's definitely cases where it can help. But it's highly workload
> *and* hardware dependent.

The only cases I've seen where HT can be beneficial is when you have
large numbers of idle connections.  Then the idle connections can be
"parked" on the HT virtual cores.  However, even in this case I haven't
seen a head-to-head performance comparison.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 26/08/14 10:13, Josh Berkus wrote:
> On 08/22/2014 07:02 AM, Andres Freund wrote:
>> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>>> Not sure how you can make such a blanket statement when so many people
>>>> have tested and shown the benefits of hyper-threading.
>>>
>>> Actually, I don't know that anyone has posted the benefits of HT.
>>> Link?
>>
>> There's definitely cases where it can help. But it's highly workload
>> *and* hardware dependent.
>
> The only cases I've seen where HT can be beneficial is when you have
> large numbers of idle connections.  Then the idle connections can be
> "parked" on the HT virtual cores.  However, even in this case I haven't
> seen a head-to-head performance comparison.
>

I recall HT being beneficial on a single socket (i3 or i7), using pgbench as
the measuring tool. However I didn't save the results at the time. I've
just got some new SSDs to play with, so I might run some pgbench tests on
my home machine (Haswell i7) with HT on and off.

Regards

Mark


Re: Turn off Hyperthreading! WAS: 60 core performance with 9.3

From
Mark Kirkwood
Date:
On 26/08/14 10:13, Josh Berkus wrote:
> On 08/22/2014 07:02 AM, Andres Freund wrote:
>> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>>> Not sure how you can make such a blanket statement when so many people
>>>> have tested and shown the benefits of hyper-threading.
>>>
>>> Actually, I don't know that anyone has posted the benefits of HT.
>>> Link?
>>
>> There's definitely cases where it can help. But it's highly workload
>> *and* hardware dependent.
>
> The only cases I've seen where HT can be beneficial is when you have
> large numbers of idle connections.  Then the idle connections can be
> "parked" on the HT virtual cores.  However, even in this case I haven't
> seen a head-to-head performance comparison.
>

I've just had a pair of Crucial m550s arrive, so a bit of benchmarking
is in order. The results (below) seem to suggest that HT enabled is
certainly not inhibiting scaling performance for single-socket i7s. I
performed several runs (typical results shown below).

Intel i7-4770 3.4 Ghz, 16G
2x Crucial m550
Ubuntu 14.04
Postgres 9.4 beta2

logging_collector = on
max_connections = 600
shared_buffers = 1GB
wal_buffers = 32MB
checkpoint_segments = 128
effective_cache_size = 10GB

pgbench scale = 300
test duration (each) = 600s

db on 1x m550
xlog on 1x m550
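
Each run was a plain pgbench invocation along these lines (the -j thread
count is a guess on my part; only the scale, duration and client counts were
recorded):

  pgbench -i -s 300 bench
  pgbench -c 256 -j 4 -T 600 bench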

clients |  tps (HT)|  tps (no HT)
--------+----------+-------------
4       |  517     |  520
8       | 1013     |  999
16      | 1938     | 1913
32      | 3574     | 3560
64      | 5873     | 5412
128     | 8351     | 7450
256     | 9426     | 7840
512     | 9357     | 7288


Regards

Mark