Thread: 60 core performance with 9.3
I have a nice toy to play with: a Dell R920 with 60 cores and 1TB of ram [1].

The context is that the current machine in use by the customer is a 32 core one, and due to growth we are looking at something larger (hence 60 cores).

Some initial tests show similar pgbench read only performance to what Robert found here http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html (actually a bit quicker, around 400000 tps).

However, a mixed read-write workload is getting results the same as, or only marginally quicker than, the 32 core machine - particularly at higher numbers of clients (e.g. 200 - 500). I have yet to break out the perf toolset, but I'm wondering if anyone has compared 32 and 60 (or 64) core read-write pgbench performance?

regards

Mark

[1] Details:

4x E7-4890, 15 cores each
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD
Ubuntu 14.04 (Linux 3.13)
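For anyone wanting to reproduce the comparison, the read only and mixed runs being discussed here can be driven roughly as follows - the scale factor, client counts, thread count and duration below are illustrative only, not necessarily the exact values used:

    # initialise a test database at a given scale factor
    createdb pgbench
    pgbench -i -s 500 pgbench

    # default TPC-B-like read/write mix
    pgbench -c 200 -j 20 -T 600 pgbench

    # read only variant (-S runs the SELECT-only script)
    pgbench -c 200 -j 20 -T 600 -S pgbench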
On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].
>
> The context is the current machine in use by the customer is a 32 core one,
> and due to growth we are looking at something larger (hence 60 cores).
>
> Some initial tests show similar pgbench read only performance to what Robert
> found here
> http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
> (actually a bit quicker around 400000 tps).
>
> However doing a mixed read-write workload is getting results the same or
> only marginally quicker than the 32 core machine - particularly at higher
> number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
> but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
> pgbench performance?

My guess is that the read only test is CPU / memory bandwidth limited, but the mixed test is IO bound.

What's your iostat / vmstat / iotop etc look like when you're doing both read only and read/write mixed?
On 27/06/14 14:01, Scott Marlowe wrote:
> On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz> wrote:
>> I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].
>>
>> The context is the current machine in use by the customer is a 32 core one,
>> and due to growth we are looking at something larger (hence 60 cores).
>>
>> Some initial tests show similar pgbench read only performance to what Robert
>> found here
>> http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
>> (actually a bit quicker around 400000 tps).
>>
>> However doing a mixed read-write workload is getting results the same or
>> only marginally quicker than the 32 core machine - particularly at higher
>> number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
>> but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
>> pgbench performance?
>
> My guess is that the read only test is CPU / memory bandwidth limited,
> but the mixed test is IO bound.
>
> What's your iostat / vmstat / iotop etc look like when you're doing
> both read only and read/write mixed?

That was what I would have thought too, but it does not appear to be the case, here is a typical iostat:

Device:   rrqm/s  wrqm/s  r/s   w/s       rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda       0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
nvme0n1   0.00    0.00    0.00  4448.00   0.00   41.47  19.10     0.14      0.03   0.00     0.03     0.03   14.40
nvme1n1   0.00    0.00    0.00  4448.00   0.00   41.47  19.10     0.15      0.03   0.00     0.03     0.03   15.20
nvme2n1   0.00    0.00    0.00  4549.00   0.00   42.20  19.00     0.15      0.03   0.00     0.03     0.03   15.20
nvme3n1   0.00    0.00    0.00  4548.00   0.00   42.19  19.00     0.16      0.04   0.00     0.04     0.04   16.00
dm-0      0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
md0       0.00    0.00    0.00  17961.00  0.00   83.67  9.54      0.00      0.00   0.00     0.00     0.00   0.00
dm-1      0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
dm-2      0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
dm-3      0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00
dm-4      0.00    0.00    0.00  0.00      0.00   0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00

My feeling is spinlock or similar, 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.
On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
> My feeling is spinlock or similar, 'perf top' shows
>
> kernel find_busiest_group
> kernel _raw_spin_lock
>
> as the top time users.

Those don't tell that much by themselves, could you do a hierarchical profile? I.e. perf record -ga? That'll at least give the callers for kernel level stuff. For more information compile postgres with -fno-omit-frame-pointer.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
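For reference, a hierarchical profile of a running benchmark can be captured along these lines - the install prefix, build flags and sampling window are placeholders, and postgres frames will only resolve if debug symbols (or a local build) are available:

    # build postgres so that call graphs are walkable (assuming a from-source build)
    ./configure CFLAGS="-O2 -fno-omit-frame-pointer" --prefix=/usr/local/pgsql
    make -j8 && make install

    # record call graphs system-wide for 60s while pgbench is running
    perf record -a -g -- sleep 60

    # browse callers/callees hierarchically
    perf report -g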
On 27/06/14 21:19, Andres Freund wrote: > On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote: >> My feeling is spinlock or similar, 'perf top' shows >> >> kernel find_busiest_group >> kernel _raw_spin_lock >> >> as the top time users. > > Those don't tell that much by themselves, could you do a hierarchical > profile? I.e. perf record -ga? That'll at least give the callers for > kernel level stuff. For more information compile postgres with > -fno-omit-frame-pointer. > Excellent suggestion, will do next week! regards Mark
On 27/06/14 21:19, Andres Freund wrote:
> On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
>> My feeling is spinlock or similar, 'perf top' shows
>>
>> kernel find_busiest_group
>> kernel _raw_spin_lock
>>
>> as the top time users.
>
> Those don't tell that much by themselves, could you do a hierarchical
> profile? I.e. perf record -ga? That'll at least give the callers for
> kernel level stuff. For more information compile postgres with
> -fno-omit-frame-pointer.
>

Unfortunately this did not help - there were lots of unknown symbols from postgres in the profile - I'm guessing the Ubuntu postgresql-9.3 package needs either the -dev package or to be rebuilt with the enable profile option (debug and no-omit-frame-pointer seem to be there already).

However further investigation did uncover *very* interesting things. Firstly, I had previously said that read only performance looked ok... that was wrong, and was purely based on comparison to Robert's blog post. Rebooting the 60 core box with 32 cores enabled showed that we got *better* scaling performance in the read only case, and illustrated that we were hitting a serious regression with more cores. At this point data is needed:

Test: pgbench
Options: scale 500
         read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
max_connections = 200
shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 32MB
checkpoint_segments = 192
checkpoint_completion_target = 0.8

Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
--------+------------------+-----------------
6       | 70400            | 71028
12      | 98918            | 129140
24      | 230345           | 240631
48      | 324042           | 409510
96      | 346929           | 120464
192     | 312621           | 92663

So we have anti scaling with 60 cores as we increase the client connections. Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1] - cherry picking the last 5 commits into the 9.4 branch, building a package from that and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+--------------------------
6       | 70189
12      | 128894
24      | 233542
48      | 422754
96      | 590796
192     | 630672

Wow - that is more like it! Andres, that is some nice work, we definitely owe you some beers for that :-) I am aware that I need to retest with an unpatched 9.4 src - as it is not clear from this data how much is due to Andres's patches and how much to the steady stream of 9.4 development. I'll post an update on that later, but figured this was interesting enough to note for now.

Regards

Mark

[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git, commits:

4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611
On 01/07/14 21:48, Mark Kirkwood wrote:
> [1] from git://git.postgresql.org/git/users/andresfreund/postgres.git,
> commits:
> 4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
> 10d72b97f108b6002210ea97a414076a62302d4e
> 67ffebe50111743975d54782a3a94b15ac4e755f
> fe686ed18fe132021ee5e557c67cc4d7c50a1ada
> f2378dc2fa5b73c688f696704976980bab90c611
>

Hmmm, that should have read "the last 5 commits in 'rwlock-contention'", and I had pasted the commit numbers from my tree, not Andres's - sorry. Here are the right ones:

472c87400377a7dc418d8b77e47ba08f5c89b1bb
e1e549a8e42b753cc7ac60e914a3939584cb1c56
65c2174469d2e0e7c2894202dc63b8fa6f8d2a7f
959aa6e0084d1264e5b228e5a055d66e5173db7d
a5c3ddaef0ee679cf5e8e10d59e0a1fe9f0f1893
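For anyone wanting to repeat the experiment, the build amounts to a stock 9.4 tree with those five commits applied on top. A rough sketch, with the remote/branch names as placeholders and the commits applied oldest-first (check 'git log --oneline' on the rwlock-contention branch for the order):

    # fetch Andres's tree alongside an existing postgres checkout
    git remote add andresfreund git://git.postgresql.org/git/users/andresfreund/postgres.git
    git fetch andresfreund

    # start from the 9.4 development branch and apply the five commits, oldest first
    git checkout -b rwlock-test origin/master
    git cherry-pick <commit1> <commit2> <commit3> <commit4> <commit5>

    # then build/package as usual
    ./configure --prefix=/usr/local/pgsql && make -j8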
On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote: > On 27/06/14 21:19, Andres Freund wrote: > >On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote: > >>My feeling is spinlock or similar, 'perf top' shows > >> > >>kernel find_busiest_group > >>kernel _raw_spin_lock > >> > >>as the top time users. > > > >Those don't tell that much by themselves, could you do a hierarchical > >profile? I.e. perf record -ga? That'll at least give the callers for > >kernel level stuff. For more information compile postgres with > >-fno-omit-frame-pointer. > > > > Unfortunately this did not help - had lots of unknown symbols from postgres > in the profile - I'm guessing the Ubuntu postgresql-9.3 package needs either > the -dev package or to be rebuilt with the enable profile option (debug and > no-omit-frame-pointer seem to be there already). You need to install the -dbg package. My bet is you'll see s_lock high in the profile, called mainly from the procarray and buffer mapping lwlocks. > Test: pgbench > Options: scale 500 > read only > Os: Ubuntu 14.04 > Pg: 9.3.4 > Pg Options: > max_connections = 200 Just as an experiment I'd suggest increasing max_connections by one and two and quickly retesting - there's some cacheline alignment issues that aren't fixed yet that happen to vanish with some max_connections settings. > shared_buffers = 10GB > maintenance_work_mem = 1GB > effective_io_concurrency = 10 > wal_buffers = 32MB > checkpoint_segments = 192 > checkpoint_completion_target = 0.8 > > > Results > > Clients | 9.3 tps 32 cores | 9.3 tps 60 cores > --------+------------------+----------------- > 6 | 70400 | 71028 > 12 | 98918 | 129140 > 24 | 230345 | 240631 > 48 | 324042 | 409510 > 96 | 346929 | 120464 > 192 | 312621 | 92663 > > So we have anti scaling with 60 cores as we increase the client connections. > Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1] > - cherry picking the last 5 commits into 9.4 branch and building a package > from that and retesting: > > Clients | 9.4 tps 60 cores (rwlock) > --------+-------------------------- > 6 | 70189 > 12 | 128894 > 24 | 233542 > 48 | 422754 > 96 | 590796 > 192 | 630672 > > Wow - that is more like it! Andres that is some nice work, we definitely owe > you some beers for that :-) I am aware that I need to retest with an > unpatched 9.4 src - as it is not clear from this data how much is due to > Andres's patches and how much to the steady stream of 9.4 development. I'll > post an update on that later, but figured this was interesting enough to > note for now. Cool. That's what I like (and expect) to see :). I don't think unpatched 9.4 will show significantly different results than 9.3, but it'd be good to validate that. If you do so, could you post the results in the -hackers thread I just CCed you on? That'll help the work to get into 9.5. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 01/07/14 22:13, Andres Freund wrote:
> On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
>> - cherry picking the last 5 commits into 9.4 branch and building a package
>> from that and retesting:
>>
>> Clients | 9.4 tps 60 cores (rwlock)
>> --------+--------------------------
>> 6       | 70189
>> 12      | 128894
>> 24      | 233542
>> 48      | 422754
>> 96      | 590796
>> 192     | 630672
>>
>> Wow - that is more like it! Andres that is some nice work, we definitely owe
>> you some beers for that :-) I am aware that I need to retest with an
>> unpatched 9.4 src - as it is not clear from this data how much is due to
>> Andres's patches and how much to the steady stream of 9.4 development. I'll
>> post an update on that later, but figured this was interesting enough to
>> note for now.
>
> Cool. That's what I like (and expect) to see :). I don't think unpatched
> 9.4 will show significantly different results than 9.3, but it'd be good
> to validate that. If you do so, could you post the results in the
> -hackers thread I just CCed you on? That'll help the work to get into
> 9.5.

So we seem to have nailed read only performance. Going back and revisiting read write performance finds:

Postgres 9.4 beta
rwlock patch
pgbench scale = 2000

max_connections = 200;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "32MB";
checkpoint_segments = 192;
checkpoint_completion_target = 0.8;

clients | tps (32 cores) | tps (60 cores)
--------+----------------+---------------
6       | 8313           | 8175
12      | 11012          | 14409
24      | 16151          | 17191
48      | 21153          | 23122
96      | 21977          | 22308
192     | 22917          | 23109

So we are back to not doing significantly better than 32 cores. Hmmm. Doing quite a few more tweaks gets some better numbers:

kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=5000000
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]
+checkpoint_segments = 1920
+wal_buffers = "256MB";

clients | tps
--------+---------
6       | 8366
12      | 15988
24      | 19828
48      | 30315
96      | 31649
192     | 29497

One more:

+wal_sync_method = "open_datasync"

clients | tps
--------+---------
6       | 9566
12      | 17129
24      | 22962
48      | 34564
96      | 32584
192     | 28367

So this looks better - however I suspect 32 core performance would improve with these as well! The problem does *not* look to be connected with IO (I will include some iostat below).
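For reference, the kernel-side settings above can be applied at runtime (as root) roughly as follows; the names are the stock sysctls/sysfs paths on a 3.13 kernel, and making them persistent via /etc/sysctl.d is left out:

    sysctl -w kernel.sched_autogroup_enabled=0
    sysctl -w kernel.sched_migration_cost_ns=5000000
    sysctl -w net.core.somaxconn=1024
    echo never > /sys/kernel/mm/transparent_hugepage/enabled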
So time to get the profiler out (192 clients for 1 minute): Full report http://paste.ubuntu.com/7777886/ # ======== # captured on: Fri Jul 11 03:09:06 2014 # hostname : ncel-prod-db3 # os release : 3.13.0-24-generic # perf version : 3.13.9 # arch : x86_64 # nrcpus online : 60 # nrcpus avail : 60 # cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz # cpuid : GenuineIntel,6,62,7 # total memory : 1056692116 kB # cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0 # HEADER_CPU_TOPOLOGY info available, use -I to display # HEADER_NUMA_TOPOLOGY info available, use -I to display # pmu mappings: cpu = 4, uncore_cbox_10 = 17, uncore_cbox_11 = 18, uncore_cbox_12 = 19, uncore_cbox_13 = 20, uncore_cbox_14 = 21, software = 1, uncore_irp = 33, uncore_pcu = 22, tracepoint = 2, uncore_imc_0 = 25, uncore_imc_1 = 26, uncore_imc_2 = 27, uncore_imc_3 = 28, uncore_imc_4 = 29, uncore_imc_5 = 30, uncore_imc_6 = 31, uncore_imc_7 = 32, uncore_qpi_0 = 34, uncore_qpi_1 = 35, uncore_qpi_2 = 36, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_cbox_8 = 15, uncore_cbox_9 = 16, uncore_r2pcie = 37, uncore_r3qpi_0 = 38, uncore_r3qpi_1 = 39, breakpoint = 5, uncore_ha_0 = 23, uncore_ha_1 = 24, uncore_ubox = 6 # ======== # # Samples: 1M of event 'cycles' # Event count (approx.): 359906321606 # # Overhead Command Shared Object Symbol # ........ .............. ....................... ..................................................... # 8.82% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave | --- _raw_spin_lock_irqsave | |--75.69%-- pagevec_lru_move_fn | __lru_cache_add | lru_cache_add | putback_lru_page | migrate_pages | migrate_misplaced_page | do_numa_page | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | | | |--31.07%-- PinBuffer | | | | | --100.00%-- ReadBuffer_common | | | | | --100.00%-- ReadBufferExtended | | | | | |--71.62%-- index_fetch_heap | | | index_getnext | | | IndexNext | | | ExecScan | | | ExecProcNode | | | ExecModifyTable | | | ExecProcNode | | | standard_ExecutorRun | | | ProcessQuery | | | PortalRunMulti | | | PortalRun | | | PostgresMain | | | ServerLoop | | | | | |--17.47%-- heap_hot_search | | | _bt_check_unique | | | _bt_doinsert | | | btinsert | | | FunctionCall6Coll | | | index_insert | | | | | | | --100.00%-- ExecInsertIndexTuples | | | ExecModifyTable | | | ExecProcNode | | | standard_ExecutorRun | | | ProcessQuery | | | PortalRunMulti | | | PortalRun | | | PostgresMain | | | ServerLoop | | | | | |--3.81%-- RelationGetBufferForTuple | | | heap_update | | | ExecModifyTable | | | ExecProcNode | | | standard_ExecutorRun | | | ProcessQuery | | | PortalRunMulti | | | PortalRun | | | PostgresMain | | | ServerLoop | | | | | |--3.65%-- _bt_relandgetbuf | | | _bt_search | | | _bt_first | | | | | | | --100.00%-- btgettuple | | | FunctionCall2Coll | | | index_getnext_tid | | | index_getnext | | | IndexNext | | | ExecScan | | | ExecProcNode | | | | | | | |--97.56%-- ExecModifyTable | | | | ExecProcNode | | | | standard_ExecutorRun | | | | ProcessQuery | | | | PortalRunMulti | | | | PortalRun | | | | PostgresMain | | | | ServerLoop | | | | | | | --2.44%-- standard_ExecutorRun | | | PortalRunSelect | | | PortalRun | | | PostgresMain | | | ServerLoop | | | | | 
|--2.69%-- fsm_readbuf | | | fsm_set_and_search | | | RecordPageWithFreeSpace | | | lazy_vacuum_rel | | | vacuum_rel | | | vacuum | | | do_autovacuum | | | | | --0.75%-- lazy_vacuum_rel | | vacuum_rel | | vacuum | | do_autovacuum | | | |--4.66%-- SearchCatCache | | | | | |--49.62%-- oper | | | make_op | | | transformExprRecurse | | | transformExpr | | | | | | | |--90.02%-- transformTargetEntry | | | | transformTargetList | | | | transformStmt | | | | parse_analyze | | | | pg_analyze_and_rewrite | | | | PostgresMain | | | | ServerLoop | | | | | | | --9.98%-- transformWhereClause | | | transformStmt | | | parse_analyze | | | pg_analyze_and_rewrite | | | PostgresMain | | | ServerLoop With respect to IO, here are typical iostat outputs: sda HW RAID 10 array SAS SSD [data] md0 SW RAID 10 of nvme[0-3]n1 PCie SSD [xlog] Non Checkpoint Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 15.00 0.00 3.00 0.00 0.07 50.67 0.00 0.00 0.00 0.00 0.00 0.00 nvme0n1 0.00 0.00 0.00 4198.00 0.00 146.50 71.47 0.18 0.05 0.00 0.05 0.04 18.40 nvme1n1 0.00 0.00 0.00 4198.00 0.00 146.50 71.47 0.18 0.04 0.00 0.04 0.04 17.20 nvme2n1 0.00 0.00 0.00 4126.00 0.00 146.08 72.51 0.15 0.04 0.00 0.04 0.03 14.00 nvme3n1 0.00 0.00 0.00 4125.00 0.00 146.03 72.50 0.15 0.04 0.00 0.04 0.03 14.40 md0 0.00 0.00 0.00 16022.00 0.00 292.53 37.39 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 0.00 18.00 0.00 0.07 8.44 0.00 0.00 0.00 0.00 0.00 0.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Checkpoint Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 29.00 1.00 96795.00 0.00 1074.52 22.73 133.13 1.38 4.00 1.38 0.01 100.00 nvme0n1 0.00 0.00 0.00 3564.00 0.00 56.71 32.59 0.12 0.03 0.00 0.03 0.03 11.60 nvme1n1 0.00 0.00 0.00 3564.00 0.00 56.71 32.59 0.12 0.03 0.00 0.03 0.03 12.00 nvme2n1 0.00 0.00 0.00 3884.00 0.00 59.12 31.17 0.14 0.04 0.00 0.04 0.04 13.60 nvme3n1 0.00 0.00 0.00 3884.00 0.00 59.12 31.17 0.13 0.03 0.00 0.03 0.03 12.80 md0 0.00 0.00 0.00 14779.00 0.00 115.80 16.05 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 3.00 0.00 0.01 8.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-3 0.00 0.00 1.00 96830.00 0.00 1074.83 22.73 134.79 1.38 4.00 1.38 0.01 100.00 dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Thanks for your patience if you have read this far! Regards Mark
On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote: > On 01/07/14 22:13, Andres Freund wrote: > >On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote: > >>- cherry picking the last 5 commits into 9.4 branch and building a package > >>from that and retesting: > >> > >>Clients | 9.4 tps 60 cores (rwlock) > >>--------+-------------------------- > >>6 | 70189 > >>12 | 128894 > >>24 | 233542 > >>48 | 422754 > >>96 | 590796 > >>192 | 630672 > >> > >>Wow - that is more like it! Andres that is some nice work, we definitely owe > >>you some beers for that :-) I am aware that I need to retest with an > >>unpatched 9.4 src - as it is not clear from this data how much is due to > >>Andres's patches and how much to the steady stream of 9.4 development. I'll > >>post an update on that later, but figured this was interesting enough to > >>note for now. > > > >Cool. That's what I like (and expect) to see :). I don't think unpatched > >9.4 will show significantly different results than 9.3, but it'd be good > >to validate that. If you do so, could you post the results in the > >-hackers thread I just CCed you on? That'll help the work to get into > >9.5. > > So we seem to have nailed read only performance. Going back and revisiting > read write performance finds: > > Postgres 9.4 beta > rwlock patch > pgbench scale = 2000 > > max_connections = 200; > shared_buffers = "10GB"; > maintenance_work_mem = "1GB"; > effective_io_concurrency = 10; > wal_buffers = "32MB"; > checkpoint_segments = 192; > checkpoint_completion_target = 0.8; > > clients | tps (32 cores) | tps > ---------+----------------+--------- > 6 | 8313 | 8175 > 12 | 11012 | 14409 > 24 | 16151 | 17191 > 48 | 21153 | 23122 > 96 | 21977 | 22308 > 192 | 22917 | 23109 On that scale - that's bigger than shared_buffers IIRC - I'd not expect the patch to make much of a difference. > kernel.sched_autogroup_enabled=0 > kernel.sched_migration_cost_ns=5000000 > net.core.somaxconn=1024 > /sys/kernel/mm/transparent_hugepage/enabled [never] > > Full report http://paste.ubuntu.com/7777886/ > # > 8.82% postgres [kernel.kallsyms] [k] > _raw_spin_lock_irqsave > | > --- _raw_spin_lock_irqsave > | > |--75.69%-- pagevec_lru_move_fn > | __lru_cache_add > | lru_cache_add > | putback_lru_page > | migrate_pages > | migrate_misplaced_page > | do_numa_page > | handle_mm_fault > | __do_page_fault > | do_page_fault > | page_fault So, the majority of the time is spent in numa page migration. Can you disable numa_balancing? I'm not sure if your kernel version does that at runtime or whether you need to reboot. The kernel.numa_balancing sysctl might work. Otherwise you probably need to boot with numa_balancing=0. It'd also be worthwhile to test this with numactl --interleave. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
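For the record, on a 3.13 kernel both of Andres's suggestions can be tried without a rebuild; something along these lines (the data directory path is a placeholder, and if the sysctl is absent the fallback is numa_balancing=0 on the kernel command line plus a reboot):

    # disable automatic NUMA balancing / page migration at runtime, if supported
    sysctl -w kernel.numa_balancing=0

    # and/or start the postmaster with its memory interleaved across all nodes
    numactl --interleave=all pg_ctl -D /path/to/data start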
On 11/07/14 20:22, Andres Freund wrote: > On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote: >> Postgres 9.4 beta >> rwlock patch >> pgbench scale = 2000 >> > On that scale - that's bigger than shared_buffers IIRC - I'd not expect > the patch to make much of a difference. > Right - we did test with it bigger (can't recall exactly how big), but will retry again after setting the numa parameters below. >> # >> 8.82% postgres [kernel.kallsyms] [k] >> _raw_spin_lock_irqsave >> | >> --- _raw_spin_lock_irqsave >> | >> |--75.69%-- pagevec_lru_move_fn >> | __lru_cache_add >> | lru_cache_add >> | putback_lru_page >> | migrate_pages >> | migrate_misplaced_page >> | do_numa_page >> | handle_mm_fault >> | __do_page_fault >> | do_page_fault >> | page_fault > > So, the majority of the time is spent in numa page migration. Can you > disable numa_balancing? I'm not sure if your kernel version does that at > runtime or whether you need to reboot. > The kernel.numa_balancing sysctl might work. Otherwise you probably need > to boot with numa_balancing=0. > > It'd also be worthwhile to test this with numactl --interleave. > That was my feeling too - but I had no idea what the magic switch was to tame it (appears to be in 3.13 kernels), will experiment and report back. Thanks again! Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 11/07/14 20:22, Andres Freund wrote: >> So, the majority of the time is spent in numa page migration. >> Can you disable numa_balancing? I'm not sure if your kernel >> version does that at runtime or whether you need to reboot. >> The kernel.numa_balancing sysctl might work. Otherwise you >> probably need to boot with numa_balancing=0. >> >> It'd also be worthwhile to test this with numactl --interleave. > > That was my feeling too - but I had no idea what the magic switch > was to tame it (appears to be in 3.13 kernels), will experiment > and report back. Thanks again! It might be worth a test using a cpuset to interleave OS cache and the NUMA patch I submitted to the current CF to see whether this is getting into territory where the patch makes a bigger difference. I would expect it to do much better than using numactl --interleave because work_mem and other process-local memory would be allocated in "near" memory for each process. http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11/07/14 20:22, Andres Freund wrote: > On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote: >> Full report http://paste.ubuntu.com/7777886/ > >> # >> 8.82% postgres [kernel.kallsyms] [k] >> _raw_spin_lock_irqsave >> | >> --- _raw_spin_lock_irqsave >> | >> |--75.69%-- pagevec_lru_move_fn >> | __lru_cache_add >> | lru_cache_add >> | putback_lru_page >> | migrate_pages >> | migrate_misplaced_page >> | do_numa_page >> | handle_mm_fault >> | __do_page_fault >> | do_page_fault >> | page_fault > > So, the majority of the time is spent in numa page migration. Can you > disable numa_balancing? I'm not sure if your kernel version does that at > runtime or whether you need to reboot. > The kernel.numa_balancing sysctl might work. Otherwise you probably need > to boot with numa_balancing=0. > > It'd also be worthwhile to test this with numactl --interleave. > Trying out with numa_balancing=0 seemed to get essentially the same performance. Similarly wrapping postgres startup with --interleave. All this made me want to try with numa *really* disabled. So rebooted the box with "numa=off" appended to the kernel cmdline. Somewhat surprisingly (to me anyway), the numbers were essentially identical. The profile, however is quite different: Full report at http://paste.ubuntu.com/7806285/ 4.56% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave | --- _raw_spin_lock_irqsave | |--41.89%-- try_to_wake_up | | | |--96.12%-- default_wake_function | | | | | |--99.96%-- pollwake | | | __wake_up_common | | | __wake_up_sync_key | | | sock_def_readable | | | | | | | |--99.94%-- unix_stream_sendmsg | | | | sock_sendmsg | | | | SYSC_sendto | | | | sys_sendto | | | | tracesys | | | | __libc_send | | | | pq_flush | | | | ReadyForQuery | | | | PostgresMain | | | | ServerLoop | | | | PostmasterMain | | | | main | | | | __libc_start_main | | | --0.06%-- [...] | | --0.04%-- [...] | | | |--2.87%-- wake_up_process | | | | | |--95.71%-- wake_up_sem_queue_do | | | SYSC_semtimedop | | | sys_semop | | | tracesys | | | __GI___semop | | | | | | | |--99.75%-- LWLockRelease | | | | | | | | | |--25.09%-- RecordTransactionCommit | | | | | CommitTransaction | | | | | CommitTransactionCommand | | | | | finish_xact_command.part.4 | | | | | PostgresMain | | | | | ServerLoop | | | | | PostmasterMain | | | | | main | | | | | __libc_start_main regards Mark
On 12/07/14 01:19, Kevin Grittner wrote: > > It might be worth a test using a cpuset to interleave OS cache and > the NUMA patch I submitted to the current CF to see whether this is > getting into territory where the patch makes a bigger difference. > I would expect it to do much better than using numactl --interleave > because work_mem and other process-local memory would be allocated > in "near" memory for each process. > > http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com > Thanks Kevin - I did try this out - seemed slightly better than using --interleave, but almost identical to the results posted previously. However looking at my postgres binary with ldd, I'm not seeing any link to libnuma (despite it demanding the library whilst building), so I wonder if my package build has somehow vanilla-ified the result :-( Also I am guessing that with 60 cores I do: $ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus" i.e cpus are cores not packages...? If I've stuffed it up I'll redo! Cheers Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > On 12/07/14 01:19, Kevin Grittner wrote: >> >> It might be worth a test using a cpuset to interleave OS cache and >> the NUMA patch I submitted to the current CF to see whether this is >> getting into territory where the patch makes a bigger difference. >> I would expect it to do much better than using numactl --interleave >> because work_mem and other process-local memory would be allocated >> in "near" memory for each process. >> > http://www.postgresql.org/message-id/1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com > > Thanks Kevin - I did try this out - seemed slightly better than using > --interleave, but almost identical to the results posted previously. > > However looking at my postgres binary with ldd, I'm not seeing any link > to libnuma (despite it demanding the library whilst building), so I > wonder if my package build has somehow vanilla-ified the result :-( That is odd; not sure what to make of that! > Also I am guessing that with 60 cores I do: > > $ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus" > > i.e cpus are cores not packages...? Right; basically, as a guide, you can use the output from: $ numactl --hardware Use the union of all the "cpu" numbers from the "node n cpus" lines. The above statement is also a good way to see how unbalance memory usage has become while running a test. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
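Putting Kevin's suggestion together with the command already used above, the cpuset setup being discussed looks roughly like the sketch below. This is an assumption about the intended recipe, not something from Kevin's patch itself: the cpu/mem ranges are for this particular 4-node box, memory_spread_page spreads the OS page cache across nodes, and the data directory path is a placeholder.

    # legacy cpuset filesystem, as used earlier in the thread
    mkdir -p /dev/cpuset
    mount -t cpuset none /dev/cpuset
    mkdir /dev/cpuset/postgres

    # all 60 cores and all four memory nodes, with page cache spread across nodes
    /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"
    /bin/bash -c "echo 0-3 > /dev/cpuset/postgres/mems"
    /bin/bash -c "echo 1 > /dev/cpuset/postgres/memory_spread_page"

    # move the postmaster into the cpuset (children started later inherit it)
    PGPID=$(head -1 /path/to/data/postmaster.pid)
    /bin/bash -c "echo $PGPID > /dev/cpuset/postgres/tasks"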
On 17/07/14 11:58, Mark Kirkwood wrote: > > Trying out with numa_balancing=0 seemed to get essentially the same > performance. Similarly wrapping postgres startup with --interleave. > > All this made me want to try with numa *really* disabled. So rebooted > the box with "numa=off" appended to the kernel cmdline. Somewhat > surprisingly (to me anyway), the numbers were essentially identical. The > profile, however is quite different: > A little more tweaking got some further improvement: rwlocks patch as before wal_buffers = 256MB checkpoint_segments = 1920 wal_sync_method = open_datasync LSI RAID adaptor disable read ahead and write cache for SSD fast path mode numa_balancing = 0 Pgbench scale 2000 again: clients | tps (prev) | tps (tweaked config) ---------+------------+--------- 6 | 8175 | 8281 12 | 14409 | 15896 24 | 17191 | 19522 48 | 23122 | 29776 96 | 22308 | 32352 192 | 23109 | 28804 Now recall we were seeing no actual tps changes with numa_balancing=0 or 1 (so the improvement above is from the other changes), but figured it might be informative to try to track down what the non-numa bottlenecks looked like. We tried profiling the entire 10 minute run which showed the stats collector as a possible source of contention: 3.86% postgres [kernel.kallsyms] [k] _raw_spin_lock_bh | --- _raw_spin_lock_bh | |--95.78%-- lock_sock_nested | udpv6_sendmsg | inet_sendmsg | sock_sendmsg | SYSC_sendto | sys_sendto | tracesys | __libc_send | | | |--99.17%-- pgstat_report_stat | | PostgresMain | | ServerLoop | | PostmasterMain | | main | | __libc_start_main | | | |--0.77%-- pgstat_send_bgwriter | | BackgroundWriterMain | | AuxiliaryProcessMain | | 0x7f08efe8d453 | | reaper | | __restore_rt | | PostmasterMain | | main | | __libc_start_main | --0.07%-- [...] | |--2.54%-- __lock_sock | | | |--91.95%-- lock_sock_nested | | udpv6_sendmsg | | inet_sendmsg | | sock_sendmsg | | SYSC_sendto | | sys_sendto | | tracesys | | __libc_send | | | | | |--99.73%-- pgstat_report_stat | | | PostgresMain | | | ServerLoop Disabling track_counts and rerunning pgbench: clients | tps (no counts) ---------+------------ 6 | 9806 12 | 18000 24 | 29281 48 | 43703 96 | 54539 192 | 36114 While these numbers look great in the middle range (12-96 clients), then benefit looks to be tailing off as client numbers increase. Also running with no stats (and hence no auto vacuum or analyze) is way too scary! Trying out less write heavy workloads shows that the stats overhead does not appear to be significant for *read* heavy cases, so this result above is perhaps more of a curiosity than anything (given that read heavy is more typical...and our real workload is more similar to read heavy). 
The profile for counts off looks like: 4.79% swapper [kernel.kallsyms] [k] read_hpet | --- read_hpet | |--97.10%-- ktime_get | | | |--35.24%-- clockevents_program_event | | tick_program_event | | | | | |--56.59%-- __hrtimer_start_range_ns | | | | | | | |--78.12%-- hrtimer_start_range_ns | | | | tick_nohz_restart | | | | tick_nohz_idle_exit | | | | cpu_startup_entry | | | | | | | | | |--98.84%-- start_secondary | | | | | | | | | --1.16%-- rest_init | | | | start_kernel | | | | x86_64_start_reservations | | | | x86_64_start_kernel | | | | | | | --21.88%-- hrtimer_start | | | tick_nohz_stop_sched_tick | | | __tick_nohz_idle_enter | | | | | | | |--99.89%-- tick_nohz_idle_enter | | | | cpu_startup_entry | | | | | | | | | |--98.30%-- start_secondary | | | | | | | | | --1.70%-- rest_init | | | | start_kernel | | | | x86_64_start_reservations | | | | x86_64_start_kernel | | | --0.11%-- [...] | | | | | |--40.25%-- hrtimer_force_reprogram | | | __remove_hrtimer | | | | | | | |--89.68%-- __hrtimer_start_range_ns | | | | hrtimer_start | | | | tick_nohz_stop_sched_tick | | | | __tick_nohz_idle_enter | | | | | | | | | |--99.90%-- tick_nohz_idle_enter | | | | | cpu_startup_entry | | | | | | | | | | | |--99.04%-- start_secondary | | | | | | | | | | | --0.96%-- rest_init | | | | | start_kernel | | | | | x86_64_start_reservations | | | | | x86_64_start_kernel | | | | --0.10%-- [...] | | | | Any thoughts on how to proceed further appreciated! Cheers, Mark
On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>
> While these numbers look great in the middle range (12-96 clients), then
> benefit looks to be tailing off as client numbers increase. Also running
> with no stats (and hence no auto vacuum or analyze) is way too scary!

I assume you've disabled the statistics collector, which has nothing to do with vacuum or analyze.

There are two kinds of statistics in PostgreSQL - data distribution statistics (which are collected by ANALYZE and stored in actual tables within the database) and runtime statistics (which are collected by the stats collector and stored in a file somewhere on the disk).

By disabling the statistics collector you lose runtime counters - number of sequential/index scans on a table, tuples read from a relation, etc. But it does not influence VACUUM or planning at all.

Also, it's mostly async (send over UDP and you're done) and shouldn't make much difference unless you have a large number of objects. There are ways to improve this (e.g. by placing the stats files on a tmpfs).

Tomas
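The tmpfs trick Tomas mentions is just a matter of pointing stats_temp_directory at memory-backed storage - a minimal sketch, with the mount point and size as placeholders:

    # memory-backed location for the runtime stats files
    mkdir -p /var/run/pg_stats_tmp
    mount -t tmpfs -o size=64M tmpfs /var/run/pg_stats_tmp

    # postgresql.conf:
    #   stats_temp_directory = '/var/run/pg_stats_tmp'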
"Tomas Vondra" <tv@fuzzy.cz> writes: > On 30 Červenec 2014, 3:44, Mark Kirkwood wrote: >> While these numbers look great in the middle range (12-96 clients), then >> benefit looks to be tailing off as client numbers increase. Also running >> with no stats (and hence no auto vacuum or analyze) is way too scary! > By disabling statistics collector you loose runtime counters - number of > sequential/index scans on a table, tuples read from a relation aetc. But > it does not influence VACUUM or planning at all. It does break autovacuum. regards, tom lane
On 30 Červenec 2014, 14:39, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>>> While these numbers look great in the middle range (12-96 clients), then
>>> benefit looks to be tailing off as client numbers increase. Also running
>>> with no stats (and hence no auto vacuum or analyze) is way too scary!
>
>> By disabling statistics collector you loose runtime counters - number of
>> sequential/index scans on a table, tuples read from a relation aetc. But
>> it does not influence VACUUM or planning at all.
>
> It does break autovacuum.

Of course, you're right. It throws away info about how much data was modified and when the table was last (auto)vacuumed.

This is clear proof that I really need to drink at least one cup of coffee before doing anything in the morning.

Tomas
Hi Tomas, Unfortunately I think you are mistaken - disabling the stats collector (i.e. track_counts = off) means that autovacuum has no idea about when/if it needs to start a worker (as it uses those counts to decide), and hence you lose all automatic vacuum and analyze as a result. With respect to comments like "it shouldn't make difference" etc etc, well the profile suggests otherwise, and the change in tps numbers support the observation. regards Mark On 30/07/14 20:42, Tomas Vondra wrote: > On 30 Červenec 2014, 3:44, Mark Kirkwood wrote: >> >> While these numbers look great in the middle range (12-96 clients), then >> benefit looks to be tailing off as client numbers increase. Also running >> with no stats (and hence no auto vacuum or analyze) is way too scary! > > I assume you've disabled statistics collector, which has nothing to do > with vacuum or analyze. > > There are two kinds of statistics in PostgreSQL - data distribution > statistics (which is collected by ANALYZE and stored in actual tables > within the database) and runtime statistics (which is collected by the > stats collector and stored in a file somewhere on the dist). > > By disabling statistics collector you loose runtime counters - number of > sequential/index scans on a table, tuples read from a relation aetc. But > it does not influence VACUUM or planning at all. > > Also, it's mostly async (send over UDP and you're done) and shouldn't make > much difference unless you have large number of objects. There are ways to > improve this (e.g. by placing the stat files into a tmpfs). > > Tomas >
On 31/07/14 00:47, Tomas Vondra wrote: > On 30 Červenec 2014, 14:39, Tom Lane wrote: >> "Tomas Vondra" <tv@fuzzy.cz> writes: >>> On 30 ??ervenec 2014, 3:44, Mark Kirkwood wrote: >>>> While these numbers look great in the middle range (12-96 clients), >>>> then >>>> benefit looks to be tailing off as client numbers increase. Also >>>> running >>>> with no stats (and hence no auto vacuum or analyze) is way too scary! >> >>> By disabling statistics collector you loose runtime counters - number of >>> sequential/index scans on a table, tuples read from a relation aetc. But >>> it does not influence VACUUM or planning at all. >> >> It does break autovacuum. > > Of course, you're right. It throws away info about how much data was > modified and when the table was last (auto)vacuumed. > > This is a clear proof that I really need to drink at least one cup of > coffee in the morning before doing anything in the morning. > Lol - thanks for taking a look anyway. Yes, coffee is often an important part of the exercise. Regards Mark
I've been assisting Mark with the benchmarking of these new servers.

The drop off in both throughput and CPU utilisation that we've been observing as the client count increases has led me to investigate which lwlocks are dominant at different client counts.

I've recompiled postgres with Andres's LWLock improvements, Kevin's libnuma patch and with LWLOCK_STATS enabled.

The LWLOCK_STATS below suggest that ProcArrayLock might be the main source of locking that's causing throughput to take a dive as the client count increases beyond the core count.

wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

pgbench -s 2000 -T 600

Results:

clients | tps
--------+---------
6       | 9490
12      | 17558
24      | 25681
48      | 41175
96      | 48954
192     | 31887
384     | 15564

LWLOCK_STATS at 48 clients

Lock                 | Blk      | SpinDelay | Blk % | SpinDelay %
---------------------+----------+-----------+-------+-------------
BufFreelistLock      | 31144    | 11        | 1.64  | 1.62
ShmemIndexLock       | 192      | 1         | 0.01  | 0.15
OidGenLock           | 32648    | 14        | 1.72  | 2.06
XidGenLock           | 35731    | 18        | 1.88  | 2.64
ProcArrayLock        | 291121   | 215       | 15.36 | 31.57
SInvalReadLock       | 32136    | 13        | 1.70  | 1.91
SInvalWriteLock      | 32141    | 12        | 1.70  | 1.76
WALBufMappingLock    | 31662    | 15        | 1.67  | 2.20
WALWriteLock         | 825380   | 45        | 36.31 | 6.61
CLogControlLock      | 583458   | 337       | 26.93 | 49.49

LWLOCK_STATS at 96 clients

Lock                 | Blk      | SpinDelay | Blk % | SpinDelay %
---------------------+----------+-----------+-------+-------------
BufFreelistLock      | 62954    | 12        | 1.54  | 0.27
ShmemIndexLock       | 62635    | 4         | 1.54  | 0.09
OidGenLock           | 92232    | 22        | 2.26  | 0.50
XidGenLock           | 98326    | 18        | 2.41  | 0.41
ProcArrayLock        | 928871   | 3188      | 22.78 | 72.57
SInvalReadLock       | 58392    | 13        | 1.43  | 0.30
SInvalWriteLock      | 57429    | 14        | 1.41  | 0.32
WALBufMappingLock    | 138375   | 14        | 3.39  | 0.32
WALWriteLock         | 1480707  | 42        | 36.31 | 0.96
CLogControlLock      | 1098239  | 1066      | 26.93 | 27.27

LWLOCK_STATS at 384 clients

Lock                 | Blk      | SpinDelay | Blk % | SpinDelay %
---------------------+----------+-----------+-------+-------------
BufFreelistLock      | 184298   | 158       | 1.93  | 0.03
ShmemIndexLock       | 183573   | 164       | 1.92  | 0.03
OidGenLock           | 184558   | 173       | 1.93  | 0.03
XidGenLock           | 200239   | 213       | 2.09  | 0.04
ProcArrayLock        | 4035527  | 579666    | 42.22 | 98.62
SInvalReadLock       | 182204   | 152       | 1.91  | 0.03
SInvalWriteLock      | 182898   | 137       | 1.91  | 0.02
WALBufMappingLock    | 219936   | 215       | 2.30  | 0.04
WALWriteLock         | 3172725  | 457       | 24.67 | 0.08
CLogControlLock      | 1012458  | 6423      | 10.59 | 1.09

The same test done with a readonly workload shows virtually no SpinDelay at all.

Any thoughts or comments on these results are welcome!

Regards,
Matt.
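For reference, LWLOCK_STATS is a compile-time switch rather than a GUC, so reproducing these numbers means rebuilding the server with the define enabled - roughly as below (the configure options are placeholders). The per-lock counters are then dumped to stderr (i.e. the server log) as each backend exits.

    # rebuild with lock statistics compiled in
    ./configure --prefix=/usr/local/pgsql CPPFLAGS="-DLWLOCK_STATS"
    make -j8 && make install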
Matt Clarkson wrote:
> The LWLOCK_STATS below suggest that ProcArrayLock might be the main
> source of locking that's causing throughput to take a dive as the client
> count increases beyond the core count.

> Any thoughts or comments on these results are welcome!

Do these results change if you use Heikki's patch for CSN-based snapshots? See http://www.postgresql.org/message-id/539AD153.9000004@vmware.com for the patch (but note that you need to apply it on top of 89cf2d52030 in the master branch -- maybe it applies to the HEAD of the 9.4 branch but I didn't try).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/08/14 09:38, Alvaro Herrera wrote: > Matt Clarkson wrote: > >> The LWLOCK_STATS below suggest that ProcArrayLock might be the main >> source of locking that's causing throughput to take a dive as the client >> count increases beyond the core count. > >> Any thoughts or comments on these results are welcome! > > Do these results change if you use Heikki's patch for CSN-based > snapshots? See > http://www.postgresql.org/message-id/539AD153.9000004@vmware.com for the > patch (but note that you need to apply on top of 89cf2d52030 in the > master branch -- maybe it applies to HEAD the 9.4 branch but I didn't > try). > Hi Alvaro, Applying the CSN patch on top of the rwlock + numa in 9.4 (bit of a patch-fest we have here now) shows modest improvement at highest client number (but appears to hurt performance in the mid range): clients | tps ---------+-------- 6 | 8445 12 | 14548 24 | 20043 48 | 27451 96 | 27718 192 | 23614 384 | 24737 Initial runs were quite disappointing, until we moved the csnlog directory onto the same filesystem that the xlogs are on (PCIe SSD). We could potentially look at locating them on their own separate volume if that make sense. Adding in LWLOCK stats again shows quite a different picture from the previous: 48 clients Lock | Blk | SpinDelay | Blk % | SpinDelay % --------------------+----------+-----------+-----------+------------- WALWriteLock | 25426001 | 1239 | 62.227442 | 14.373550 CLogControlLock | 1793739 | 1376 | 4.389986 | 15.962877 ProcArrayLock | 1007765 | 1305 | 2.466398 | 15.139211 CSNLogControlLock | 609556 | 1722 | 1.491824 | 19.976798 WALInsertLocks 4 | 994170 | 247 | 2.433126 | 2.865429 WALInsertLocks 7 | 983497 | 243 | 2.407005 | 2.819026 WALInsertLocks 5 | 993068 | 239 | 2.430429 | 2.772622 WALInsertLocks 3 | 991446 | 229 | 2.426459 | 2.656613 WALInsertLocks 0 | 964185 | 235 | 2.359741 | 2.726218 WALInsertLocks 1 | 995237 | 221 | 2.435737 | 2.563805 WALInsertLocks 2 | 997593 | 213 | 2.441503 | 2.470998 WALInsertLocks 6 | 978178 | 201 | 2.393987 | 2.331787 BufFreelistLock | 887194 | 206 | 2.171313 | 2.389791 XidGenLock | 327385 | 366 | 0.801240 | 4.245940 CheckpointerCommLock| 104754 | 151 | 0.256374 | 1.751740 WALBufMappingLock | 274226 | 7 | 0.671139 | 0.081206 96 clients Lock | Blk | SpinDelay | Blk % | SpinDelay % --------------------+----------+-----------+-----------+------------- WALWriteLock | 25426001 | 1239 | 62.227442 | 14.373550 WALWriteLock | 30097625 | 9616 | 48.550747 | 19.068393 CLogControlLock | 3193429 | 13490 | 5.151349 | 26.750481 ProcArrayLock | 2007103 | 11754 | 3.237676 | 23.308017 CSNLogControlLock | 1303172 | 5022 | 2.102158 | 9.958556 BufFreelistLock | 1921625 | 1977 | 3.099790 | 3.920363 WALInsertLocks 0 | 2011855 | 681 | 3.245341 | 1.350413 WALInsertLocks 5 | 1829266 | 627 | 2.950805 | 1.243332 WALInsertLocks 7 | 1806966 | 632 | 2.914833 | 1.253247 WALInsertLocks 4 | 1847372 | 591 | 2.980012 | 1.171945 WALInsertLocks 1 | 1948553 | 557 | 3.143228 | 1.104523 WALInsertLocks 6 | 1818717 | 582 | 2.933789 | 1.154098 WALInsertLocks 3 | 1873964 | 552 | 3.022908 | 1.094608 WALInsertLocks 2 | 1912007 | 523 | 3.084276 | 1.037102 XidGenLock | 512521 | 699 | 0.826752 | 1.386107 CheckpointerCommLock| 386853 | 711 | 0.624036 | 1.409903 WALBufMappingLock | 546462 | 65 | 0.881503 | 0.128894 384 clients Lock | Blk | SpinDelay | Blk % | SpinDelay % --------------------+----------+-----------+-----------+------------- WALWriteLock | 25426001 | 1239 | 62.227442 | 14.373550 WALWriteLock | 20703796 | 87265 | 27.749961 | 15.360068 
CLogControlLock | 3273136 | 122616 | 4.387089 | 21.582422 ProcArrayLock | 3969918 | 100730 | 5.321008 | 17.730128 CSNLogControlLock | 3191989 | 115068 | 4.278325 | 20.253851 BufFreelistLock | 2014218 | 27952 | 2.699721 | 4.920009 WALInsertLocks 0 | 2750082 | 5438 | 3.686023 | 0.957177 WALInsertLocks 1 | 2584155 | 5312 | 3.463626 | 0.934999 WALInsertLocks 2 | 2477782 | 5497 | 3.321051 | 0.967562 WALInsertLocks 4 | 2375977 | 5441 | 3.184598 | 0.957705 WALInsertLocks 5 | 2349769 | 5458 | 3.149471 | 0.960697 WALInsertLocks 6 | 2329982 | 5367 | 3.122950 | 0.944680 WALInsertLocks 3 | 2415965 | 4771 | 3.238195 | 0.839774 WALInsertLocks 7 | 2316144 | 4930 | 3.104402 | 0.867761 CheckpointerCommLock| 584419 | 10794 | 0.783316 | 1.899921 XidGenLock | 391212 | 6963 | 0.524354 | 1.225602 WALBufMappingLock | 484693 | 83 | 0.649650 | 0.014609 So we're seeing delay coming fairly equally from 5 lwlocks. Thanks again - any other suggestions welcome! Cheers Mark
Mark, Is the 60-core machine using some of the Intel chips which have 20 hyperthreaded virtual cores? If so, I've been seeing some performance issues on these processors. I'm currently doing a side-by-side hyperthreading on/off test. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 15/08/14 06:18, Josh Berkus wrote:
> Mark,
>
> Is the 60-core machine using some of the Intel chips which have 20
> hyperthreaded virtual cores?
>
> If so, I've been seeing some performance issues on these processors.
> I'm currently doing a side-by-side hyperthreading on/off test.
>

Hi Josh,

The board has 4 sockets with E7-4890 v2 cpus. They have 15 cores/30 threads each. We're running with hyperthreading off (we noticed the usual steep/sudden scaling dropoff with it on).

What model are your 20 core cpus?

Cheers

Mark
Mark, all:

So, this is pretty damning:

Read-only test with HT ON:

[pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 30
query mode: simple
number of clients: 20
number of threads: 4
duration: 600 s
number of transactions actually processed: 47167533
tps = 78612.471802 (including connections establishing)
tps = 78614.604352 (excluding connections establishing)

Read-only test with HT Off:

[pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 30
query mode: simple
number of clients: 20
number of threads: 4
duration: 600 s
number of transactions actually processed: 82457739
tps = 137429.508196 (including connections establishing)
tps = 137432.893796 (excluding connections establishing)

On a read-write test, it's 10% faster with HT off as well.

Further, from their production machine we've seen that having HT on causes the machine to slow down by 5X whenever you get more than 40 cores (as in 100% of real cores or 50% of HT cores) worth of activity.

So we're definitely back to "If you're using PostgreSQL, turn off Hyperthreading".

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
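As an aside, for anyone wanting to run this comparison without a trip into the BIOS, sibling threads can be taken offline at runtime on Linux. A rough sketch (offlining CPUs affects the whole box, and the BIOS switch remains the cleaner option):

    # report threads per core (2 means HT is active)
    lscpu | grep -i 'thread(s) per core'

    # take every CPU that is only a hyperthread sibling offline
    for s in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
        cpu=${s%/topology/*}; cpu=${cpu##*cpu}
        primary=$(cut -d, -f1 "$s" | cut -d- -f1)
        [ "$cpu" != "$primary" ] && echo 0 > "${s%/topology/*}/online"
    done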
On 08/20/2014 02:13 PM, Josh Berkus wrote: > So we're definitely back to "If you're using PostgreSQL, turn off > Hyperthreading". That's so strange. Back when I did my Nehalem tests, we got a very strong 30%+ increase by enabling HT. We only got a hit when we turned off turbo, or forgot to disable power saving features. -- Shaun Thomas OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On 21/08/14 07:13, Josh Berkus wrote: > Mark, all: > > So, this is pretty damming: > > Read-only test with HT ON: > > [pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench > starting vacuum...end. > transaction type: SELECT only > scaling factor: 30 > query mode: simple > number of clients: 20 > number of threads: 4 > duration: 600 s > number of transactions actually processed: 47167533 > tps = 78612.471802 (including connections establishing) > tps = 78614.604352 (excluding connections establishing) > > Read-only test with HT Off: > > [pgtest@db ~]$ pgbench -c 20 -j 4 -T 600 -S bench > starting vacuum...end. > transaction type: SELECT only > scaling factor: 30 > query mode: simple > number of clients: 20 > number of threads: 4 > duration: 600 s > number of transactions actually processed: 82457739 > tps = 137429.508196 (including connections establishing) > tps = 137432.893796 (excluding connections establishing) > > > On a read-write test, it's 10% faster with HT off as well. > > Further, from their production machine we've seen that having HT on > causes the machine to slow down by 5X whenever you get more than 40 > cores (as in 100% of real cores or 50% of HT cores) worth of activity. > > So we're definitely back to "If you're using PostgreSQL, turn off > Hyperthreading". > Hmm - that is interesting - I don't think we compared read only scaling for hyperthreading on and off (only read write). You didn't mention what cpu this is for (or how many sockets etc), would be useful to know. Notwithstanding the above results, my workmate Matt made an interesting observation: the scaling graph for (our) 60 core box (HT off), looks just like the one for our 32 core box with HT *on*. We are wondering if a lot of the previous analysis of HT performance regressions should actually be reevaluated in the light of ...err is it just that we have a lot more cores...? [1] Regards Mark [1] Particularly as in *some* cases (single socket i7 for instance) HT on seems to scale fine.
On Wed, Aug 20, 2014 at 1:36 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> That's so strange. Back when I did my Nehalem tests, we got a very strong
> 30%+ increase by enabling HT. We only got a hit when we turned off turbo, or
> forgot to disable power saving features.

In my experience, it is crucially important to consider power saving features in most benchmarks these days, where that might not have been true a few years ago. The CPU scaling governor can alter the outcome of many benchmarks quite significantly.

--
Regards,
Peter Geoghegan
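For completeness, checking and pinning the governor before a benchmark run looks something like this (cpupower typically comes from the linux-tools packages on Ubuntu; the sysfs loop is the tool-free equivalent):

    # inspect the current scaling driver/governor
    cpupower frequency-info

    # pin all cores to the performance governor for the duration of the run
    cpupower frequency-set -g performance

    # equivalent without cpupower
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done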
On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote: > On a read-write test, it's 10% faster with HT off as well. > > Further, from their production machine we've seen that having HT on > causes the machine to slow down by 5X whenever you get more than 40 > cores (as in 100% of real cores or 50% of HT cores) worth of activity. > > So we're definitely back to "If you're using PostgreSQL, turn off > Hyperthreading". Not sure how you can make such a blanket statement when so many people have tested and shown the benefits of hyper-threading. I am also unclear exactly what you tested, as I didn't see it mentioned in the email --- CPU type, CPU count, and operating system would be the minimal information required. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote: >> On a read-write test, it's 10% faster with HT off as well. >> >> Further, from their production machine we've seen that having HT on >> causes the machine to slow down by 5X whenever you get more than 40 >> cores (as in 100% of real cores or 50% of HT cores) worth of activity. >> >> So we're definitely back to "If you're using PostgreSQL, turn off >> Hyperthreading". > > Not sure how you can make such a blanket statement when so many people > have tested and shown the benefits of hyper-threading. I am also > unclear exactly what you tested, as I didn't see it mentioned in the > email --- CPU type, CPU count, and operating system would be the minimal > information required. HT off is common knowledge for better benchmarking result, at least for me. I've never seen better result with HT on, except POWER. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
On 21/08/14 11:14, Mark Kirkwood wrote: > > You didn't mention what > cpu this is for (or how many sockets etc), would be useful to know. > Just to clarify - while you mentioned that the production system was 40 cores, it wasn't immediately obvious that the same system was the source of the measurements you posted...sorry if I'm being a mixture of pedantic and dense - just trying to make sure it is clear what systems/cpus etc we are talking about (with this in mind it never hurts to quote cpu and mobo model numbers)! Cheers Mark
On 08/20/2014 06:14 PM, Mark Kirkwood wrote: > Notwithstanding the above results, my workmate Matt made an interesting > observation: the scaling graph for (our) 60 core box (HT off), looks > just like the one for our 32 core box with HT *on*. Hmm. I know this sounds stupid and unlikely, but has anyone actually tested PostgreSQL on a system with more than 64 legitimate cores? The work Robert Haas did to fix the CPU locking way back when showed significant improvements up to 64, but so far as I know, nobody really tested beyond that. I seem to remember similar choking effects when pre-9.2 systems encountered high CPU counts. I somehow doubt Intel would allow their HT architecture to regress so badly from Nehalem, which is almost 3-generations old at this point. This smells like something in the software stack, up to and including the Linux kernel. -- Shaun Thomas OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
>> On a read-write test, it's 10% faster with HT off as well.
>>
>> Further, from their production machine we've seen that having HT on
>> causes the machine to slow down by 5X whenever you get more than 40
>> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>>
>> So we're definitely back to "If you're using PostgreSQL, turn off
>> Hyperthreading".
>
> Not sure how you can make such a blanket statement when so many people
> have tested and shown the benefits of hyper-threading.

Actually, I don't know that anyone has posted the benefits of HT. Link? I want to compare results so that we can figure out what's different between my case and theirs. Also, it makes a big difference if there is an advantage to turning HT on for some workloads.

> I am also
> unclear exactly what you tested, as I didn't see it mentioned in the
> email --- CPU type, CPU count, and operating system would be the minimal
> information required.

Ooops! I thought I'd posted that earlier, but I didn't. The processor in question is the Intel(R) Xeon(R) CPU E7-4850, with 4 of them for a total of 40 cores or 80 HT cores. The OS is RHEL with kernel 2.6.32-431.3.1.el6.x86_64.

I've emailed a kernel hacker who works at Intel for comment; for one thing, I'm wondering if the older kernel version is a problem for a system like this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Aug 21, 2014 at 02:02:26PM -0700, Josh Berkus wrote:
> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> > On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
> >> On a read-write test, it's 10% faster with HT off as well.
> >>
> >> Further, from their production machine we've seen that having HT on
> >> causes the machine to slow down by 5X whenever you get more than 40
> >> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
> >>
> >> So we're definitely back to "If you're using PostgreSQL, turn off
> >> Hyperthreading".
> >
> > Not sure how you can make such a blanket statement when so many people
> > have tested and shown the benefits of hyper-threading.
>
> Actually, I don't know that anyone has posted the benefits of HT. Link?
> I want to compare results so that we can figure out what's different
> between my case and theirs. Also, it makes a big difference if there is
> an advantage to turning HT on for some workloads.

I had Greg Smith test my system when it was installed, and he recommended hyper-threading. The system is Debian Squeeze (2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
On 08/21/2014 02:11 PM, Bruce Momjian wrote:
> On Thu, Aug 21, 2014 at 02:02:26PM -0700, Josh Berkus wrote:
>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>> On Wed, Aug 20, 2014 at 12:13:50PM -0700, Josh Berkus wrote:
>>>> On a read-write test, it's 10% faster with HT off as well.
>>>>
>>>> Further, from their production machine we've seen that having HT on
>>>> causes the machine to slow down by 5X whenever you get more than 40
>>>> cores (as in 100% of real cores or 50% of HT cores) worth of activity.
>>>>
>>>> So we're definitely back to "If you're using PostgreSQL, turn off
>>>> Hyperthreading".
>>>
>>> Not sure how you can make such a blanket statement when so many people
>>> have tested and shown the benefits of hyper-threading.
>>
>> Actually, I don't know that anyone has posted the benefits of HT. Link?
>> I want to compare results so that we can figure out what's different
>> between my case and theirs. Also, it makes a big difference if there is
>> an advantage to turning HT on for some workloads.
>
> I had Greg Smith test my system when it was installed, and he
> recommended hyper-threading. The system is Debian Squeeze
> (2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.

Can you post some numerical results?

I'm serious. It's obviously easier for our users if we can blanket recommend turning HT off; that's a LOT easier for them than "you might want to turn HT off if these conditions ...". So I want to establish whether HT really is a benefit for some workloads.

I personally have never seen HT be a benefit. I've seen it be harmless (most of the time) but never beneficial.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Aug 21, 2014 at 02:17:13PM -0700, Josh Berkus wrote:
> >> Actually, I don't know that anyone has posted the benefits of HT. Link?
> >> I want to compare results so that we can figure out what's different
> >> between my case and theirs. Also, it makes a big difference if there is
> >> an advantage to turning HT on for some workloads.
> >
> > I had Greg Smith test my system when it was installed, and he
> > recommended hyper-threading. The system is Debian Squeeze
> > (2.6.32-5-amd64), CPUs are dual Xeon E5620, 8 cores, 16 virtual cores.
>
> Can you post some numerical results?
>
> I'm serious. It's obviously easier for our users if we can blanket
> recommend turning HT off; that's a LOT easier for them than "you might
> want to turn HT off if these conditions ...". So I want to establish
> whether HT really is a benefit for some workloads.
>
> I personally have never seen HT be a benefit. I've seen it be harmless
> (most of the time) but never beneficial.

I know that when hyperthreading was introduced it was mostly a negative, that this was later improved, and that it might have gotten bad again. I am afraid the results depend on the type of CPU, so I am not sure we can give a general answer. I know I asked Greg Smith, and I assume he would know.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ Everyone has their own god. +
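[For what it's worth, producing the kind of head-to-head numbers being asked for here doesn't take much: the sketch below runs the same read-write pgbench series twice, once with HT enabled and once disabled, on otherwise identical settings. The scale factor, duration, and flags are illustrative assumptions, not what anyone in the thread actually ran:

# Initialise the test database once
pgbench -i -s 300 pgbench

# One series per HT setting; collect the "excluding connections" tps for each client count
for c in 8 16 32 64 128 256; do
    pgbench -c $c -j $c -T 600 pgbench | awk -v c=$c '/^tps.*excluding/ {print c, $3}' >> ht_on.log
done
# ...then disable HT (BIOS, or offline the sibling threads) and repeat into ht_off.log
]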
On Thu, Aug 21, 2014 at 3:02 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 08/20/2014 07:40 PM, Bruce Momjian wrote: > >> I am also >> unclear exactly what you tested, as I didn't see it mentioned in the >> email --- CPU type, CPU count, and operating system would be the minimal >> information required. > > Ooops! I thought I'd posted that earlier, but I didn't. > > The processors in question is the Intel(R) Xeon(R) CPU E7- 4850, with 4 > of them for a total of 40 cores or 80 HT cores. > > OS is RHEL with 2.6.32-431.3.1.el6.x86_64. I'm running almost the exact same setup in production as a spare. It has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since it's a spare node I might be able to do some testing on it as well. It's running a 3.2 kernel right now. I could probably get a later model kernel on it even. -- To understand recursion, one must first understand recursion.
On Thu, Aug 21, 2014 at 3:26 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Aug 21, 2014 at 3:02 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 08/20/2014 07:40 PM, Bruce Momjian wrote: >> >>> I am also >>> unclear exactly what you tested, as I didn't see it mentioned in the >>> email --- CPU type, CPU count, and operating system would be the minimal >>> information required. >> >> Ooops! I thought I'd posted that earlier, but I didn't. >> >> The processors in question is the Intel(R) Xeon(R) CPU E7- 4850, with 4 >> of them for a total of 40 cores or 80 HT cores. >> >> OS is RHEL with 2.6.32-431.3.1.el6.x86_64. > > I'm running almost the exact same setup in production as a spare. It > has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since > it's a spare node I might be able to do some testing on it as well. > It's running a 3.2 kernel right now. I could probably get a later > model kernel on it even. > > -- > To understand recursion, one must first understand recursion. To update this last post, the machine I have is running ubuntu 12.04.1 right now, and I have kernels 3.2, 3.5, 3.8, 3.11, and 3.13 available to put on it. We're looking at removing it from our current production cluster so I could likely do all kinds of crazy tests on it.
> HT off is common knowledge for better benchmarking results

It's wise to use the qualifier "for better benchmarking results". It's worth keeping in mind here that a benchmark is not the same as normal production use. For example, where I work we do lots of long-running queries in parallel over a big range of datasets, rather than as many short-term transactions as fast as possible. Our biggest DB server is also used for GDAL work and R at the same time*. Pretty far from pgbench; not everyone is constrained by locks.

I suppose that if your code is basically N copies of the same function, hyper-threading isn't likely to help much, because it was introduced to allow different parts of the processor to be used in parallel when you're running heterogeneous code. But if you're hammering just one part of the CPU... well, adding another layer of logical complexity for your CPU to manage probably isn't going to do much good.

Should HT be on or off when you're running 64 very mixed types of long-term queries, which variously involve heavy use of real-number calculations or e.g. logic/string handling, over different data sets? It's a much more complex question than simply maxing out your pgbench scores.

I don't have the data now unfortunately, but I remember seeing a benefit from HT on our 4 core E3 when running GDAL/PostGIS work in parallel last year. It's not surprising though; the GDAL calls are almost certainly using different functions of the processor compared to postgres, and there should be very little lock contention. In light of this interesting data I'm now leaning towards proposing HT off for our mapservers (which receive short, similar requests over and over), but for the heterogeneous servers I think I'll keep it on for now.

Graeme.

* Unrelated: there are also huge advantages for us in keeping these different programs running on the same machine, since we found we can get much better transfer rates through unix sockets than with TCP over the network.
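[The unix-socket vs TCP point in the footnote is easy to quantify for a given workload; a rough sketch, where the socket directory and database name are just illustrative assumptions:

# Read-only pgbench over the local unix-domain socket
pgbench -h /var/run/postgresql -S -c 8 -j 8 -T 60 pgbench

# The same test over TCP on the loopback interface, for comparison
pgbench -h 127.0.0.1 -S -c 8 -j 8 -T 60 pgbench
]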
On 08/21/2014 02:26 PM, Scott Marlowe wrote: > I'm running almost the exact same setup in production as a spare. It > has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since > it's a spare node I might be able to do some testing on it as well. > It's running a 3.2 kernel right now. I could probably get a later > model kernel on it even. You know about the IO performance issues with 3.2, yes? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 08/21/2014 03:51 PM, Josh Berkus wrote: > On 08/21/2014 02:26 PM, Scott Marlowe wrote: >> I'm running almost the exact same setup in production as a spare. It >> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since >> it's a spare node I might be able to do some testing on it as well. >> It's running a 3.2 kernel right now. I could probably get a later >> model kernel on it even. > You know about the IO performance issues with 3.2, yes? > Were those 3.2 only and since fixed or are there issues persisting in 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13. Cheers, Steve
On 08/21/2014 04:08 PM, Steve Crawford wrote: > On 08/21/2014 03:51 PM, Josh Berkus wrote: >> On 08/21/2014 02:26 PM, Scott Marlowe wrote: >>> I'm running almost the exact same setup in production as a spare. It >>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since >>> it's a spare node I might be able to do some testing on it as well. >>> It's running a 3.2 kernel right now. I could probably get a later >>> model kernel on it even. >> You know about the IO performance issues with 3.2, yes? >> > Were those 3.2 only and since fixed or are there issues persisting in > 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13. The issues I know of were fixed in 3.9. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 22/08/14 11:29, Josh Berkus wrote: > On 08/21/2014 04:08 PM, Steve Crawford wrote: >> On 08/21/2014 03:51 PM, Josh Berkus wrote: >>> On 08/21/2014 02:26 PM, Scott Marlowe wrote: >>>> I'm running almost the exact same setup in production as a spare. It >>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since >>>> it's a spare node I might be able to do some testing on it as well. >>>> It's running a 3.2 kernel right now. I could probably get a later >>>> model kernel on it even. >>> You know about the IO performance issues with 3.2, yes? >>> >> Were those 3.2 only and since fixed or are there issues persisting in >> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13. > > The issues I know of were fixed in 3.9. > There is a 3.11 kernel series for Ubuntu 12.04 Precise. Regards Mark
On 08/21/2014 04:29 PM, Josh Berkus wrote: > > On 08/21/2014 04:08 PM, Steve Crawford wrote: >> On 08/21/2014 03:51 PM, Josh Berkus wrote: >>> On 08/21/2014 02:26 PM, Scott Marlowe wrote: >>>> I'm running almost the exact same setup in production as a spare. It >>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since >>>> it's a spare node I might be able to do some testing on it as well. >>>> It's running a 3.2 kernel right now. I could probably get a later >>>> model kernel on it even. >>> You know about the IO performance issues with 3.2, yes? >>> >> Were those 3.2 only and since fixed or are there issues persisting in >> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13. > > The issues I know of were fixed in 3.9. > Correct. If you run trusty backports you are good to go. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, @cmdpromptinc "If we send our children to Caesar for their education, we should not be surprised when they come back as Romans."
On Thu, Aug 21, 2014 at 5:29 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 08/21/2014 04:08 PM, Steve Crawford wrote: >> On 08/21/2014 03:51 PM, Josh Berkus wrote: >>> On 08/21/2014 02:26 PM, Scott Marlowe wrote: >>>> I'm running almost the exact same setup in production as a spare. It >>>> has 4 of those CPUs, 256G RAM, and is currently set to use HT. Since >>>> it's a spare node I might be able to do some testing on it as well. >>>> It's running a 3.2 kernel right now. I could probably get a later >>>> model kernel on it even. >>> You know about the IO performance issues with 3.2, yes? >>> >> Were those 3.2 only and since fixed or are there issues persisting in >> 3.2+? The 12.04 LTS release of Ubuntu Server was 3.2 but the 14.04 is 3.13. > > The issues I know of were fixed in 3.9. > I thought they were fixed in 3.8.something? We're running 3.8 on our production servers but IO is not an issue for us.
On 08/22/2014 01:37 AM, Scott Marlowe wrote:
> I thought they were fixed in 3.8.something? We're running 3.8 on our
> production servers but IO is not an issue for us.

Yeah. 3.8 fixed a ton of issues that were plaguing us. There were still a couple patches I wanted that didn't get in until 3.11+, but the worst of the behavior was solved before that.

Bugs in kernel cache page aging algorithms are bad, m'kay?

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
> > Not sure how you can make such a blanket statement when so many people
> > have tested and shown the benefits of hyper-threading.
>
> Actually, I don't know that anyone has posted the benefits of HT.
> Link?

There's definitely cases where it can help. But it's highly workload *and* hardware dependent.

> OS is RHEL with 2.6.32-431.3.1.el6.x86_64.
>
> I've emailed a kernel hacker who works at Intel for comment; for one
> thing, I'm wondering if the older kernel version is a problem for a
> system like this.

I'm not sure if it has been backported by redhat, but there definitely have been significant improvements in SMT-aware scheduling after vanilla 2.6.32.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/22/2014 07:02 AM, Andres Freund wrote:
> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>> Not sure how you can make such a blanket statement when so many people
>>> have tested and shown the benefits of hyper-threading.
>>
>> Actually, I don't know that anyone has posted the benefits of HT.
>> Link?
>
> There's definitely cases where it can help. But it's highly workload
> *and* hardware dependent.

The only case I've seen where HT can be beneficial is when you have large numbers of idle connections. Then the idle connections can be "parked" on the HT virtual cores. However, even in this case I haven't seen a head-to-head performance comparison.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
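[A throttled pgbench run is one rough way to approximate that "many mostly idle connections" case if someone wants to do the head-to-head. A sketch only: --rate needs pgbench 9.4 or later, and the numbers are arbitrary assumptions:

# 400 client connections, but throttled so that most of them are idle at any instant
pgbench -S -c 400 -j 8 --rate=2000 -T 300 pgbench
]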
On 26/08/14 10:13, Josh Berkus wrote:
> On 08/22/2014 07:02 AM, Andres Freund wrote:
>> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>>> Not sure how you can make such a blanket statement when so many people
>>>> have tested and shown the benefits of hyper-threading.
>>>
>>> Actually, I don't know that anyone has posted the benefits of HT.
>>> Link?
>>
>> There's definitely cases where it can help. But it's highly workload
>> *and* hardware dependent.
>
> The only case I've seen where HT can be beneficial is when you have
> large numbers of idle connections. Then the idle connections can be
> "parked" on the HT virtual cores. However, even in this case I haven't
> seen a head-to-head performance comparison.

I recall HT being beneficial on a single socket (i3 or i7), using pgbench as the measuring tool; however, I didn't save the results at the time.

I've just got some new ssd's to play with so might run some pgbench tests on my home machine (Haswell i7) with HT on and off.

Regards

Mark
On 26/08/14 10:13, Josh Berkus wrote:
> On 08/22/2014 07:02 AM, Andres Freund wrote:
>> On 2014-08-21 14:02:26 -0700, Josh Berkus wrote:
>>> On 08/20/2014 07:40 PM, Bruce Momjian wrote:
>>>> Not sure how you can make such a blanket statement when so many people
>>>> have tested and shown the benefits of hyper-threading.
>>>
>>> Actually, I don't know that anyone has posted the benefits of HT.
>>> Link?
>>
>> There's definitely cases where it can help. But it's highly workload
>> *and* hardware dependent.
>
> The only case I've seen where HT can be beneficial is when you have
> large numbers of idle connections. Then the idle connections can be
> "parked" on the HT virtual cores. However, even in this case I haven't
> seen a head-to-head performance comparison.

I've just had a pair of Crucial m550's arrive, so a bit of benchmarking is in order. The results (below) suggest that having HT enabled is certainly not inhibiting scaling performance for single-socket i7s. I performed several runs (typical results shown below).

Intel i7-4770 3.4 GHz, 16G
2x Crucial m550
Ubuntu 14.04
Postgres 9.4 beta2

logging_collector = on
max_connections = 600
shared_buffers = 1GB
wal_buffers = 32MB
checkpoint_segments = 128
effective_cache_size = 10GB

pgbench scale = 300
test duration (each) = 600s
db on 1x m550
xlog on 1x m550

clients | tps (HT) | tps (no HT)
--------+----------+-------------
      4 |      517 |         520
      8 |     1013 |         999
     16 |     1938 |        1913
     32 |     3574 |        3560
     64 |     5873 |        5412
    128 |     8351 |        7450
    256 |     9426 |        7840
    512 |     9357 |        7288

Regards

Mark
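[For anyone wanting to repeat a run like this on their own hardware, the series above corresponds roughly to the following invocation. A sketch: the scale factor and duration come from the post, while the thread count and other flags are assumptions:

# One-off initialisation at the posted scale factor
pgbench -i -s 300 pgbench

# One data point per client count; repeat the whole series with HT toggled
for c in 4 8 16 32 64 128 256 512; do
    pgbench -c $c -j 4 -T 600 pgbench
done
]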