Thread: Comparative tps question
Hi, I've just been benchmarking a new box I've got, and running pgbench yields what I thought was a slow tps count. It is difficult to find comparisons of other benchmark results online; I'd like to see if I have the box set up reasonably well. I know Oracle et al. prohibit publishing benchmark results, but I was surprised that there don't seem to be any PostgreSQL ones out there..

Anyway, the machine is a Dell R720 with the data on a RAID 10 using 8x Intel 320 SSDs, and a mirrored pair of 15k SAS HDDs configured for the pg_xlog, both on a Dell H710 RAID controller. In addition it has 64GB of 1600MHz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays are all set up with XFS and tweaked as best I could. The drives are 160GB and overprovisioned by another 15%.

I'm running PostgreSQL 9.1 on Ubuntu 12.04.

bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on the main data array. This doesn't seem too bad, although the specs show over 200MB/s should be achievable per drive.

pgbench (using a scaling factor of 100 with 100 clients and 25 threads) gives an average of about 7200tps.

Does this look acceptable? Instinctively it feels on the low side, although I noted that a couple of blogs
(http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/ and
http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html)
show around 1500tps for a single SSD, so maybe this is what is expected.

The interesting parameter differences from the postgresql.conf are:
shared_buffers = 6GB
work_mem = 64MB
max_stack_depth = 4MB
random_page_cost = 1.1
cpu_tuple_cost = 0.1
cpu_index_tuple_cost = 0.05
cpu_operator_cost = 0.025
effective_cache_size = 40GB

I'd be happy to provide any other configs, etc., if the tps values are way off the expected.

Thanks

John

ps. The number of "safe" SSDs available in the UK seems to be rather limited, hence the Intel 320s, which probably aren't as fast as modern drives.
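For reference, the benchmark described above corresponds to an invocation along these lines (a sketch only; the database name and the 60-second run length are assumptions, not taken from the post):

```shell
# initialise the test tables at scaling factor 100 (~1.5GB of data)
pgbench -i -s 100 pgbench

# default TPC-B-like test: 100 clients, 25 worker threads, fixed duration
pgbench -c 100 -j 25 -T 60 pgbench
```

Note that a scaling factor of 100 fits entirely within 64GB of RAM, so the read side of this test is served mostly from cache and the tps figure is dominated by commit (WAL flush) latency.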
On Wed, Nov 28, 2012 at 12:37 PM, John Lister <john.lister@kickstone.com> wrote:
> Hi, I've just been benchmarking a new box I've got and running pgbench
> yields what I thought was a slow tps count. It is difficult to find
> comparisons online of other benchmark results; I'd like to see if I have
> the box set up reasonably well.
>
> I know Oracle et al. prohibit benchmark results, but I was surprised that
> there don't seem to be any PostgreSQL ones out there..
>
> Anyway, the machine is a Dell R720 with the data on a RAID 10 using 8x
> Intel 320 SSDs and a mirrored pair of 15k SAS HDDs configured for the
> pg_xlog, both on a Dell H710 RAID controller. In addition it has 64GB of
> 1600MHz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays
> are all set up with XFS and tweaked as best I could. The drives are 160GB
> and overprovisioned by another 15%.
>
> I'm running PostgreSQL 9.1 on Ubuntu 12.04.
>
> bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on
> the main data array. This doesn't seem too bad, although the specs show
> over 200MB/s should be achievable per drive.

Probably this limitation is coming from the SATA bus. It shouldn't be a
problem in practice. Can you report bonnie++ seek performance? Another
possibility is that the RAID controller is introducing overhead here.

> pgbench (using a scaling factor of 100 with 100 clients and 25 threads)
> gives an average of about 7200tps.
>
> Does this look acceptable? Instinctively it feels on the low side,
> although I noted that a couple of blogs
> (http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/
> and
> http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html)
> show around 1500tps for a single SSD, so maybe this is what is expected.
> The interesting parameter differences from the postgresql.conf are:
> shared_buffers = 6GB
> work_mem = 64MB
> max_stack_depth = 4MB
> random_page_cost = 1.1
> cpu_tuple_cost = 0.1
> cpu_index_tuple_cost = 0.05
> cpu_operator_cost = 0.025
> effective_cache_size = 40GB

*) None of the above settings will influence storage-bound pgbench
results. Influential settings are fsync, synchronous_commit,
wal_sync_method, wal_level, full_page_writes, wal_buffers,
wal_writer_delay, and commit_delay. These settings are basically managing
various tradeoffs, especially in the sense of safety vs performance.

> I'd be happy to provide any other configs, etc., if the tps values are
> way off the expected.

*) The very first thing we need to check is whether we are storage bound
(check i/o wait) and, if so, where the bind-up is. It could be on the WAL
or the heap volume. Another possibility is that we're lock bound, which is
a completely different issue to deal with. So we want to see top, iostat,
vmstat, etc. while the test is happening.

*) Another interesting test to run is a large scaling factor (ideally at
least 2x RAM) read-only test via pgbench -S.

merlin
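The durability-related settings listed above live in postgresql.conf; a sketch of the 9.1 defaults with the main safety/speed tradeoffs as comments (values shown are the stock defaults, not the poster's):

```
fsync = on                    # off: much faster, but can corrupt data on a crash
synchronous_commit = on       # off: faster, risks losing recent commits (no corruption)
wal_sync_method = fdatasync   # platform-dependent default (fdatasync on Linux)
full_page_writes = on         # off: faster, unsafe unless storage writes pages atomically
wal_buffers = -1              # -1 auto-sizes from shared_buffers (9.1+)
commit_delay = 0              # microseconds to wait so several commits share one flush
```

Of these, synchronous_commit = off is the usual first experiment, since it trades bounded data loss for throughput without risking corruption.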
On 28/11/2012 19:21, Merlin Moncure wrote:
> On Wed, Nov 28, 2012 at 12:37 PM, John Lister <john.lister@kickstone.com> wrote:
>> bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on
>> the main data array. This doesn't seem too bad, although the specs show
>> over 200MB/s should be achievable per drive.
> Probably this limitation is coming from the SATA bus. It shouldn't be a
> problem in practice. Can you report bonnie++ seek performance?
> Another possibility is that the RAID controller is introducing overhead
> here.

I must have misread the numbers before when using bonnie++; running it
again I get 1.3GB/s read and 700MB/s write, which looks more promising.
In terms of vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so     bi     bo    in    cs us sy id wa
 5  1      0 275800    564 62541220    0    0 346904 259208 18110 12013  7  3 86  5

and iostat:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.97    0.00    3.95    2.04    0.00   85.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc            4716.00   1271456.00         0.00    1271456          0

Obviously the figures varied for read/write speed during the tests, but
iowait averaged about 3% with the system about 85-90% idle. Oddly, bonnie
reports nearly 80% CPU use during the test, which seems high?

The H710 is capable of driving 6Gbps SATA drives, although the Intel ones
are limited to 3Gbps; given the above results, the IO performance looks to
be OK?

>> The interesting parameter differences from the postgresql.conf are:
>> shared_buffers = 6GB
>> work_mem = 64MB
>> max_stack_depth = 4MB
>> random_page_cost = 1.1
>> cpu_tuple_cost = 0.1
>> cpu_index_tuple_cost = 0.05
>> cpu_operator_cost = 0.025
>> effective_cache_size = 40GB
> *) None of the above settings will influence storage-bound pgbench
> results. Influential settings are fsync, synchronous_commit,
> wal_sync_method, wal_level, full_page_writes, wal_buffers,
> wal_writer_delay, and commit_delay. These settings are basically
> managing various tradeoffs, especially in the sense of safety vs
> performance.

I figured they may influence the planner and the caching of the queries.
Of the ones you list, only this is changed:
wal_level = hot_standby

> *) The very first thing we need to check is whether we are storage bound
> (check i/o wait) and, if so, where the bind-up is. It could be on the WAL
> or the heap volume. Another possibility is that we're lock bound, which
> is a completely different issue to deal with. So we want to see top,
> iostat, vmstat, etc. while the test is happening.

iowait is typically <20%, which is worse than for bonnie. Typical vmstat
figures during pgbench are:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so    bi    bo     in     cs us sy id wa
30  1      0 261900  44340 62650808    0    0 88348 74500 103544 175006 53 20 21  6

and iostat (sda is the WAL device):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          52.80    0.00   17.94   12.22    0.00   17.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda            2544.00         0.00     66432.00          0     132864
sdc            4153.00    132848.00       136.00     265696        272

I noticed that the system values are usually in the 20% region; could this
be the locks? BTW, pgbench is running on the db server, not a client -
would that influence things dramatically?

> *) Another interesting test to run is a large scaling factor (ideally at
> least 2x RAM) read-only test via pgbench -S.

Might give that a go when I next get a chance to run the tests...

John
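The figures above can be captured while a run is in progress with one-liners along these lines (the device names are the ones from the post; intervals are an assumption):

```shell
vmstat 1             # per-second summary: watch the us/sy/id/wa cpu columns
iostat -k 1 sda sdc  # per-device throughput: sda = WAL mirror, sdc = SSD array
top -H               # per-thread view, useful for spotting a saturated pgbench thread
```

Running these in separate terminals alongside pgbench makes it easy to see whether a stall lines up with WAL writes (sda) or heap I/O (sdc).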
On Thu, Nov 29, 2012 at 10:56 AM, John Lister <john.lister@kickstone.com> wrote:
> I must have misread the numbers before when using bonnie++; running it
> again I get 1.3GB/s read and 700MB/s write, which looks more promising.

pretty nice.

> iowait is typically <20%, which is worse than for bonnie.
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda            2544.00         0.00     66432.00          0     132864
> sdc            4153.00    132848.00       136.00     265696        272
>
> I noticed that the system values are usually in the 20% region; could
> this be the locks? BTW, pgbench is running on the db server, not a
> client - would that influence things dramatically?

Since we have some idle cpu% here, we can probably eliminate pgbench as a
bottleneck by messing around with the -j switch. Another thing we want to
test is the "-N" switch -- this doesn't update the tellers and branches
tables, which in high-concurrency situations can bind you from a locking
perspective.

One thing that immediately jumps out here is that your WAL volume could be
holding you up, so it's possible we may want to move the WAL to the SSD
volume.
If you can scrounge up a 9.2 pgbench, we can gather more evidence for that
by running pgbench with the "--unlogged-tables" option, which creates the
tables unlogged so that they are not WAL-logged (for the record, this
causes the tables to be truncated when the server is not shut down in a
clean state). Putting all the options above together (history only, no
WAL, multi-thread), your test more closely approximates random device
write performance.

>> *) Another interesting test to run is a large scaling factor (ideally
>> at least 2x RAM) read-only test via pgbench -S.
>
> Might give that a go when I next get a chance to run the tests...

yeah -- this will tell us the raw seek performance of the SSD volume,
which presumably will be stupendous. 2x is the minimum, btw; 10x would be
more appropriate.

Since you're building a beast, other settings to explore are NUMA
(http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
and the Dell memory BIOS settings that are occasionally set badly from the
factory (see here:
http://bleything.net/articles/postgresql-benchmarking-memory.html).

merlin
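A sketch of the two suggested runs against a 9.2 pgbench (database name and run length are assumptions; --unlogged-tables is an init-time option):

```shell
# unlogged tables plus -N: takes WAL and teller/branch locking out of the picture
pgbench -i -s 100 --unlogged-tables pgbench
pgbench -c 100 -j 25 -N -T 60 pgbench

# read-only seek test at a scale well beyond RAM: s = 10000 is roughly 150GB,
# comfortably past the 64GB of memory (true 10x RAM would need s around 40000)
pgbench -i -s 10000 pgbench
pgbench -c 100 -j 25 -S -T 60 pgbench
```

Comparing the unlogged-table tps against the normal run isolates how much the WAL volume is costing; the -S run isolates random read performance of the SSD array.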
On 29/11/2012 17:33, Merlin Moncure wrote:
> Since we have some idle cpu% here, we can probably eliminate pgbench as a
> bottleneck by messing around with the -j switch. Another thing we want to
> test is the "-N" switch -- this doesn't update the tellers and branches
> tables, which in high-concurrency situations can bind you from a locking
> perspective.

Using -N gives around a 15% increase in tps with no major changes in load,
etc. Using more threads slightly drops the performance (as expected with
only 32 "cores"); dropping the count does give a slight increase
(presumably because half the cores aren't real).

> One thing that immediately jumps out here is that your WAL volume could
> be holding you up, so it's possible we may want to move the WAL to the
> SSD volume. If you can scrounge up a 9.2 pgbench, we can gather more
> evidence for that by running pgbench with the "--unlogged-tables"
> option, which creates the tables unlogged so that they are not
> WAL-logged (for the record, this causes the tables to be truncated when
> the server is not shut down in a clean state).

I did notice that using -S drives the tps up to near 30k, so it is
possibly the WAL volume, although saying that, I did move the pg_xlog
directory onto the SSD array before posting to the list and the difference
wasn't significant. I'll try to repeat that when I next get some downtime
(I'm having to run the current tests while the db is live, but under light
load). I'll have a look at using the 9.2 pgbench and see what happens.

> yeah -- this will tell us the raw seek performance of the SSD volume,
> which presumably will be stupendous. 2x is the minimum, btw; 10x would
> be more appropriate.
>
> Since you're building a beast, other settings to explore are NUMA
> (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
> and the Dell memory BIOS settings that are occasionally set badly from
> the factory (see here:
> http://bleything.net/articles/postgresql-benchmarking-memory.html).
Cheers for the links. I'd already looked at the NUMA stuff and disabled
zone reclaim. I was looking at using the patch previously posted that used
shared mode for the master process and then local only for the workers -
excuse the terminology - but time constraints prevented that.

I made sure the box was in performance mode in the BIOS. Unfortunately I
only spotted Ben's blog when I was setting the box up, so I didn't have
time to go through all the tests. At the time performance seemed OK (well,
better than the previous box :) - but having it live for a while made me
think I, or it, could be doing better.

Anyway, I still think it would be nice to post tps results for comparative
purposes, so if I get a minute or two I'll create a site and stick mine on
there.

John
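For reference, the zone-reclaim change mentioned above is a one-line sysctl (a sketch; run as root, and persist it so it survives a reboot):

```shell
# check the current setting: 0 means zone reclaim is disabled
cat /proc/sys/vm/zone_reclaim_mode

# disable it for the running kernel
sysctl -w vm.zone_reclaim_mode=0

# persist across reboots
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```

On multi-socket boxes like this one, leaving zone reclaim enabled can cause the kernel to evict page cache rather than use memory on the remote NUMA node, which hurts a cache-heavy database badly.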
On 29/11/2012 17:33, Merlin Moncure wrote:
> One thing that immediately jumps out here is that your WAL volume could
> be holding you up, so it's possible we may want to move the WAL to the
> SSD volume. If you can scrounge up a 9.2 pgbench, we can gather more
> evidence for that by running pgbench with the "--unlogged-tables"
> option, which creates the tables unlogged so that they are not
> WAL-logged (for the record, this causes the tables to be truncated when
> the server is not shut down in a clean state).

Ok, I got myself a 9.2 version of pgbench and ran it a few times on
unlogged tables... changing the number of threads makes maybe a 5% change
in the values, which probably isn't too much to worry about.

-j 25 -c 100 -s 100 gives a tps of around 10.5k
using -N ups that to around 20k
using -S ups that again to around 40k

I'll have to wait until I can shut the db down again to try the WAL on an
SSD, although unless I did something wrong it didn't seem to make a huge
difference before....

During these tests, iowait dropped to almost 0, and user and sys stayed
around the same (60% and 20% respectively), although the disk traffic was
only in the tens of MB/s, which seems very low - unless there is some
weird caching going on and it gets dumped at a later date?

John

--
Get the PriceGoblin Browser Addon
www.pricegoblin.co.uk
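As a rough sanity check on the figures above, the average per-transaction latency implied by a tps number at a given client count can be worked out directly (the 10.5k figure is the unlogged-table run reported above):

```shell
# average latency per transaction = clients / tps
clients=100
tps=10500
awk -v c="$clients" -v t="$tps" 'BEGIN { printf "%.1f ms\n", 1000 * c / t }'
# prints "9.5 ms"
```

At the 40k tps read-only figure the same arithmetic gives about 2.5ms per transaction across 100 clients, which is consistent with cache-served reads rather than device-bound seeks.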