Re: 60 core performance with 9.3 - Mailing list pgsql-performance
From:           Mark Kirkwood
Subject:        Re: 60 core performance with 9.3
Date:
Msg-id:         53BF326F.2070903@catalyst.net.nz
In response to: Re: 60 core performance with 9.3 (Andres Freund <andres@2ndquadrant.com>)
Responses:      Re: 60 core performance with 9.3
List:           pgsql-performance
On 01/07/14 22:13, Andres Freund wrote:
> On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
>> - cherry picking the last 5 commits into 9.4 branch and building a package
>> from that and retesting:
>>
>> Clients | 9.4 tps 60 cores (rwlock)
>> --------+--------------------------
>> 6       | 70189
>> 12      | 128894
>> 24      | 233542
>> 48      | 422754
>> 96      | 590796
>> 192     | 630672
>>
>> Wow - that is more like it! Andres that is some nice work, we definitely owe
>> you some beers for that :-) I am aware that I need to retest with an
>> unpatched 9.4 src - as it is not clear from this data how much is due to
>> Andres's patches and how much to the steady stream of 9.4 development. I'll
>> post an update on that later, but figured this was interesting enough to
>> note for now.
>
> Cool. That's what I like (and expect) to see :). I don't think unpatched
> 9.4 will show significantly different results than 9.3, but it'd be good
> to validate that. If you do so, could you post the results in the
> -hackers thread I just CCed you on? That'll help the work to get into
> 9.5.

So we seem to have nailed read-only performance. Going back and revisiting
read-write performance finds:

Postgres 9.4 beta
rwlock patch
pgbench scale = 2000

max_connections = 200;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "32MB";
checkpoint_segments = 192;
checkpoint_completion_target = 0.8;

 clients | tps (32 cores) | tps (60 cores)
---------+----------------+----------------
 6       | 8313           | 8175
 12      | 11012          | 14409
 24      | 16151          | 17191
 48      | 21153          | 23122
 96      | 21977          | 22308
 192     | 22917          | 23109

So we are back to not doing significantly better than 32 cores. Hmmm. Doing
quite a few more tweaks gets some better numbers:

kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=5000000
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]

+checkpoint_segments = 1920
+wal_buffers = "256MB";

 clients | tps
---------+---------
 6       | 8366
 12      | 15988
 24      | 19828
 48      | 30315
 96      | 31649
 192     | 29497

One more:

+wal_sync_method = "open_datasync"

 clients | tps
---------+---------
 6       | 9566
 12      | 17129
 24      | 22962
 48      | 34564
 96      | 32584
 192     | 28367

So this looks better - however I suspect 32 core performance would improve
with these tweaks as well! The problem does *not* look to be connected with
IO (I will include some iostat output below).
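For anyone wanting to replay the tuning above, it amounts to roughly the
following sketch. The sysctl and transparent hugepage settings are exactly
the ones listed above; the pgbench database name, -j thread count and -T
duration are assumptions, as the exact invocation is not shown here:

# kernel scheduler and network settings listed above
sysctl -w kernel.sched_autogroup_enabled=0
sysctl -w kernel.sched_migration_cost_ns=5000000
sysctl -w net.core.somaxconn=1024

# disable transparent hugepages
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# read/write pgbench runs at scale 2000 over the client counts used above;
# database name, -j and -T values are illustrative only
pgbench -i -s 2000 pgbench
for c in 6 12 24 48 96 192; do
    pgbench -c $c -j $c -T 300 pgbench
done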
So time to get the profiler out (192 clients for 1 minute):

Full report http://paste.ubuntu.com/7777886/

# ========
# captured on: Fri Jul 11 03:09:06 2014
# hostname : ncel-prod-db3
# os release : 3.13.0-24-generic
# perf version : 3.13.9
# arch : x86_64
# nrcpus online : 60
# nrcpus avail : 60
# cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,7
# total memory : 1056692116 kB
# cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, uncore_cbox_10 = 17, uncore_cbox_11 = 18, uncore_cbox_12 = 19, uncore_cbox_13 = 20, uncore_cbox_14 = 21, software = 1, uncore_irp = 33, uncore_pcu = 22, tracepoint = 2, uncore_imc_0 = 25, uncore_imc_1 = 26, uncore_imc_2 = 27, uncore_imc_3 = 28, uncore_imc_4 = 29, uncore_imc_5 = 30, uncore_imc_6 = 31, uncore_imc_7 = 32, uncore_qpi_0 = 34, uncore_qpi_1 = 35, uncore_qpi_2 = 36, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_cbox_8 = 15, uncore_cbox_9 = 16, uncore_r2pcie = 37, uncore_r3qpi_0 = 38, uncore_r3qpi_1 = 39, breakpoint = 5, uncore_ha_0 = 23, uncore_ha_1 = 24, uncore_ubox = 6
# ========
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 359906321606
#
# Overhead   Command        Shared Object            Symbol
# ........  ..............  .......................  .....................................................
#
     8.82%  postgres        [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
               |
               |--75.69%-- pagevec_lru_move_fn
               |          __lru_cache_add
               |          lru_cache_add
               |          putback_lru_page
               |          migrate_pages
               |          migrate_misplaced_page
               |          do_numa_page
               |          handle_mm_fault
               |          __do_page_fault
               |          do_page_fault
               |          page_fault
               |          |
               |          |--31.07%-- PinBuffer
               |          |          |
               |          |           --100.00%-- ReadBuffer_common
               |          |                       |
               |          |                        --100.00%-- ReadBufferExtended
               |          |                                    |
               |          |                                    |--71.62%-- index_fetch_heap
               |          |                                    |          index_getnext
               |          |                                    |          IndexNext
               |          |                                    |          ExecScan
               |          |                                    |          ExecProcNode
               |          |                                    |          ExecModifyTable
               |          |                                    |          ExecProcNode
               |          |                                    |          standard_ExecutorRun
               |          |                                    |          ProcessQuery
               |          |                                    |          PortalRunMulti
               |          |                                    |          PortalRun
               |          |                                    |          PostgresMain
               |          |                                    |          ServerLoop
               |          |                                    |
               |          |                                    |--17.47%-- heap_hot_search
               |          |                                    |          _bt_check_unique
               |          |                                    |          _bt_doinsert
               |          |                                    |          btinsert
               |          |                                    |          FunctionCall6Coll
               |          |                                    |          index_insert
               |          |                                    |          |
               |          |                                    |           --100.00%-- ExecInsertIndexTuples
               |          |                                    |                       ExecModifyTable
               |          |                                    |                       ExecProcNode
               |          |                                    |                       standard_ExecutorRun
               |          |                                    |                       ProcessQuery
               |          |                                    |                       PortalRunMulti
               |          |                                    |                       PortalRun
               |          |                                    |                       PostgresMain
               |          |                                    |                       ServerLoop
               |          |                                    |
               |          |                                    |--3.81%-- RelationGetBufferForTuple
               |          |                                    |          heap_update
               |          |                                    |          ExecModifyTable
               |          |                                    |          ExecProcNode
               |          |                                    |          standard_ExecutorRun
               |          |                                    |          ProcessQuery
               |          |                                    |          PortalRunMulti
               |          |                                    |          PortalRun
               |          |                                    |          PostgresMain
               |          |                                    |          ServerLoop
               |          |                                    |
               |          |                                    |--3.65%-- _bt_relandgetbuf
               |          |                                    |          _bt_search
               |          |                                    |          _bt_first
               |          |                                    |          |
               |          |                                    |           --100.00%-- btgettuple
               |          |                                    |                      FunctionCall2Coll
               |          |                                    |                      index_getnext_tid
               |          |                                    |                      index_getnext
               |          |                                    |                      IndexNext
               |          |                                    |                      ExecScan
               |          |                                    |                      ExecProcNode
               |          |                                    |                      |
               |          |                                    |                      |--97.56%-- ExecModifyTable
               |          |                                    |                      |          ExecProcNode
               |          |                                    |                      |          standard_ExecutorRun
               |          |                                    |                      |          ProcessQuery
               |          |                                    |                      |          PortalRunMulti
               |          |                                    |                      |          PortalRun
               |          |                                    |                      |          PostgresMain
               |          |                                    |                      |          ServerLoop
               |          |                                    |                      |
               |          |                                    |                       --2.44%-- standard_ExecutorRun
               |          |                                    |                                 PortalRunSelect
               |          |                                    |                                 PortalRun
               |          |                                    |                                 PostgresMain
               |          |                                    |                                 ServerLoop
               |          |                                    |
               |          |                                    |--2.69%-- fsm_readbuf
               |          |                                    |          fsm_set_and_search
               |          |                                    |          RecordPageWithFreeSpace
               |          |                                    |          lazy_vacuum_rel
               |          |                                    |          vacuum_rel
               |          |                                    |          vacuum
               |          |                                    |          do_autovacuum
               |          |                                    |
               |          |                                     --0.75%-- lazy_vacuum_rel
               |          |                                               vacuum_rel
               |          |                                               vacuum
               |          |                                               do_autovacuum
               |          |
               |          |--4.66%-- SearchCatCache
               |          |          |
               |          |          |--49.62%-- oper
               |          |          |          make_op
               |          |          |          transformExprRecurse
               |          |          |          transformExpr
               |          |          |          |
               |          |          |          |--90.02%-- transformTargetEntry
               |          |          |          |          transformTargetList
               |          |          |          |          transformStmt
               |          |          |          |          parse_analyze
               |          |          |          |          pg_analyze_and_rewrite
               |          |          |          |          PostgresMain
               |          |          |          |          ServerLoop
               |          |          |          |
               |          |          |           --9.98%-- transformWhereClause
               |          |          |                     transformStmt
               |          |          |                     parse_analyze
               |          |          |                     pg_analyze_and_rewrite
               |          |          |                     PostgresMain
               |          |          |                     ServerLoop

With respect to IO, here are typical iostat outputs:

sda = HW RAID 10 array, SAS SSD [data]
md0 = SW RAID 10 of nvme[0-3]n1 PCIe SSD [xlog]

Non Checkpoint

Device:         rrqm/s   wrqm/s     r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    15.00    0.00       3.00     0.00     0.07    50.67     0.00    0.00    0.00    0.00   0.00   0.00
nvme0n1           0.00     0.00    0.00    4198.00     0.00   146.50    71.47     0.18    0.05    0.00    0.05   0.04  18.40
nvme1n1           0.00     0.00    0.00    4198.00     0.00   146.50    71.47     0.18    0.04    0.00    0.04   0.04  17.20
nvme2n1           0.00     0.00    0.00    4126.00     0.00   146.08    72.51     0.15    0.04    0.00    0.04   0.03  14.00
nvme3n1           0.00     0.00    0.00    4125.00     0.00   146.03    72.50     0.15    0.04    0.00    0.04   0.03  14.40
md0               0.00     0.00    0.00   16022.00     0.00   292.53    37.39     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00      18.00     0.00     0.07     8.44     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Checkpoint

Device:         rrqm/s   wrqm/s     r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    29.00    1.00   96795.00     0.00  1074.52    22.73   133.13    1.38    4.00    1.38   0.01 100.00
nvme0n1           0.00     0.00    0.00    3564.00     0.00    56.71    32.59     0.12    0.03    0.00    0.03   0.03  11.60
nvme1n1           0.00     0.00    0.00    3564.00     0.00    56.71    32.59     0.12    0.03    0.00    0.03   0.03  12.00
nvme2n1           0.00     0.00    0.00    3884.00     0.00    59.12    31.17     0.14    0.04    0.00    0.04   0.04  13.60
nvme3n1           0.00     0.00    0.00    3884.00     0.00    59.12    31.17     0.13    0.03    0.00    0.03   0.03  12.80
md0               0.00     0.00    0.00   14779.00     0.00   115.80    16.05     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00       3.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    1.00   96830.00     0.00  1074.83    22.73   134.79    1.38    4.00    1.38   0.01 100.00
dm-4              0.00     0.00    0.00       0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Thanks for your patience if you have read this far!

Regards

Mark