Thread: Comparative tps question
Hi, I've just been benchmarking a new box I've got, and running pgbench yields what I thought was a slow tps count. It is difficult to find comparisons of other benchmark results online; I'd like to see if I have the box set up reasonably well. I know Oracle et al. prohibit publishing benchmark results, but I was surprised that there don't seem to be any PostgreSQL ones out there..

Anyway, the machine is a Dell R720 with the data on a RAID 10 using 8x Intel 320 SSDs, and a mirrored pair of 15k SAS HDDs configured for the pg_xlog, both on a Dell H710 RAID controller. In addition it has 64GB of 1600MHz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays are all set up with XFS and tweaked as best I could. The drives are 160GB and overprovisioned by another 15%.

I'm running PostgreSQL 9.1 on Ubuntu 12.04.

bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on the main data array. This doesn't seem too bad, although the specs show over 200MB/s should be achievable per drive.

pgbench (using a scaling factor of 100 with 100 clients and 25 threads) gives an average of about 7200tps.

Does this look acceptable? Instinctively it feels on the low side, although I noted that a couple of blogs
(http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/ and
http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html)
show around 1500tps for a single SSD, so maybe this is what is expected.

The interesting parameter differences from the postgresql.conf are:
shared_buffers = 6GB
work_mem = 64MB
max_stack_depth = 4MB
random_page_cost = 1.1
cpu_tuple_cost = 0.1
cpu_index_tuple_cost = 0.05
cpu_operator_cost = 0.025
effective_cache_size = 40GB

I'd be happy to provide any other configs, etc., if the tps values are way off the expected.

Thanks

John

ps. The number of "safe" SSDs available in the UK seems to be rather limited, hence the Intel 320s, which probably aren't as fast as modern drives.
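For reference, the benchmark described above corresponds to an invocation along these lines (a sketch only; the database name and the 60-second run length are assumptions, not taken from the post):

```shell
# initialise the test tables at scaling factor 100 (~1.5GB of data)
pgbench -i -s 100 pgbench

# default TPC-B-like test: 100 clients, 25 worker threads, fixed duration
pgbench -c 100 -j 25 -T 60 pgbench
```

Note that a scaling factor of 100 fits entirely within 64GB of RAM, so the read side of this test is served mostly from cache and the tps figure is dominated by commit (WAL flush) latency.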
On Wed, Nov 28, 2012 at 12:37 PM, John Lister <john.lister@kickstone.com> wrote:
> Hi, I've just been benchmarking a new box I've got and running pgbench
> yields what I thought was a slow tps count. It is difficult to find
> comparisons online of other benchmark results; I'd like to see if I have
> the box set up reasonably well.
>
> I know Oracle et al. prohibit benchmark results, but I was surprised that
> there don't seem to be any PostgreSQL ones out there..
>
> Anyway, the machine is a Dell R720 with the data on a RAID 10 using 8x
> Intel 320 SSDs and a mirrored pair of 15k SAS HDDs configured for the
> pg_xlog, both on a Dell H710 RAID controller. In addition it has 64GB of
> 1600MHz memory and 2x E5-2650 processors (with HT = 32 cores). The arrays
> are all set up with XFS and tweaked as best I could. The drives are 160GB
> and overprovisioned by another 15%.
>
> I'm running PostgreSQL 9.1 on Ubuntu 12.04.
>
> bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on
> the main data array. This doesn't seem too bad, although the specs show
> over 200MB/s should be achievable per drive.

Probably this limitation is coming from the SATA bus. It shouldn't be a
problem in practice. Can you report bonnie++ seek performance? Another
possibility is that the RAID controller is introducing overhead here.

> pgbench (using a scaling factor of 100 with 100 clients and 25 threads)
> gives an average of about 7200tps.
>
> Does this look acceptable? Instinctively it feels on the low side,
> although I noted that a couple of blogs
> (http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/
> and
> http://it-blog.5amsolutions.com/2010/08/performance-of-postgresql-ssd-vs.html)
> show around 1500tps for a single SSD, so maybe this is what is expected.
> The interesting parameter differences from the postgresql.conf are:
> shared_buffers = 6GB
> work_mem = 64MB
> max_stack_depth = 4MB
> random_page_cost = 1.1
> cpu_tuple_cost = 0.1
> cpu_index_tuple_cost = 0.05
> cpu_operator_cost = 0.025
> effective_cache_size = 40GB

*) None of the above settings will influence storage-bound pgbench
results. Influential settings are fsync, synchronous_commit,
wal_sync_method, wal_level, full_page_writes, wal_buffers,
wal_writer_delay, and commit_delay. These settings are basically managing
various tradeoffs, especially in the sense of safety vs performance.

> I'd be happy to provide any other configs, etc., if the tps values are
> way off the expected.

*) The very first thing we need to check is whether we are storage bound
(check i/o wait) and, if so, where the bind-up is. It could be on the WAL
or the heap volume. Another possibility is that we're lock bound, which is
a completely different issue to deal with. So we want to see top, iostat,
vmstat, etc. while the test is happening.

*) Another interesting test to run is a large scaling factor (ideally at
least 2x RAM) read-only test via pgbench -S.

merlin
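The durability-related settings listed above live in postgresql.conf; a sketch of the 9.1 defaults with the main safety/speed tradeoffs as comments (values shown are the stock defaults, not the poster's):

```
fsync = on                    # off: much faster, but can corrupt data on a crash
synchronous_commit = on       # off: faster, risks losing recent commits (no corruption)
wal_sync_method = fdatasync   # platform-dependent default (fdatasync on Linux)
full_page_writes = on         # off: faster, unsafe unless storage writes pages atomically
wal_buffers = -1              # -1 auto-sizes from shared_buffers (9.1+)
commit_delay = 0              # microseconds to wait so several commits share one flush
```

Of these, synchronous_commit = off is the usual first experiment, since it trades bounded data loss for throughput without risking corruption.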
On 28/11/2012 19:21, Merlin Moncure wrote:
> On Wed, Nov 28, 2012 at 12:37 PM, John Lister <john.lister@kickstone.com> wrote:
>> bonnie++ (using defaults) shows about 600MB/s sequential read/write IO on
>> the main data array. This doesn't seem too bad, although the specs show
>> over 200MB/s should be achievable per drive.
> Probably this limitation is coming from the SATA bus. It shouldn't be a
> problem in practice. Can you report bonnie++ seek performance?
> Another possibility is that the RAID controller is introducing overhead
> here.

I must have misread the numbers before when using bonnie++; running it
again I get 1.3GB/s read and 700MB/s write, which looks more promising.
In terms of vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so     bi     bo    in    cs us sy id wa
 5  1      0 275800    564 62541220    0    0 346904 259208 18110 12013  7  3 86  5

and iostat:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.97    0.00    3.95    2.04    0.00   85.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc            4716.00   1271456.00         0.00    1271456          0

Obviously the figures varied for read/write speed during the tests, but
iowait averaged about 3% with the system about 85-90% idle. Oddly, bonnie
reports nearly 80% CPU use during the test, which seems high?

The H710 is capable of driving 6Gbps SATA drives, although the Intel ones
are limited to 3Gbps; given the above results, the IO performance looks to
be OK?

>> The interesting parameter differences from the postgresql.conf are:
>> shared_buffers = 6GB
>> work_mem = 64MB
>> max_stack_depth = 4MB
>> random_page_cost = 1.1
>> cpu_tuple_cost = 0.1
>> cpu_index_tuple_cost = 0.05
>> cpu_operator_cost = 0.025
>> effective_cache_size = 40GB
> *) None of the above settings will influence storage-bound pgbench
> results. Influential settings are fsync, synchronous_commit,
> wal_sync_method, wal_level, full_page_writes, wal_buffers,
> wal_writer_delay, and commit_delay. These settings are basically
> managing various tradeoffs, especially in the sense of safety vs
> performance.

I figured they may influence the planner and the caching of the queries.
Of the ones you list, only this is changed:
wal_level = hot_standby

> *) The very first thing we need to check is whether we are storage bound
> (check i/o wait) and, if so, where the bind-up is. It could be on the WAL
> or the heap volume. Another possibility is that we're lock bound, which
> is a completely different issue to deal with. So we want to see top,
> iostat, vmstat, etc. while the test is happening.

iowait is typically <20%, which is worse than for bonnie. Typical vmstat
figures during pgbench are:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff    cache   si   so    bi    bo     in     cs us sy id wa
30  1      0 261900  44340 62650808    0    0 88348 74500 103544 175006 53 20 21  6

and iostat (sda is the WAL device):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          52.80    0.00   17.94   12.22    0.00   17.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda            2544.00         0.00     66432.00          0     132864
sdc            4153.00    132848.00       136.00     265696        272

I noticed that the system values are usually in the 20% region; could this
be the locks? BTW, pgbench is running on the db server, not a client -
would that influence things dramatically?

> *) Another interesting test to run is a large scaling factor (ideally at
> least 2x RAM) read-only test via pgbench -S.

Might give that a go when I next get a chance to run the tests...

John
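The figures above can be captured while a run is in progress with one-liners along these lines (the device names are the ones from the post; intervals are an assumption):

```shell
vmstat 1             # per-second summary: watch the us/sy/id/wa cpu columns
iostat -k 1 sda sdc  # per-device throughput: sda = WAL mirror, sdc = SSD array
top -H               # per-thread view, useful for spotting a saturated pgbench thread
```

Running these in separate terminals alongside pgbench makes it easy to see whether a stall lines up with WAL writes (sda) or heap I/O (sdc).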
On Thu, Nov 29, 2012 at 10:56 AM, John Lister <john.lister@kickstone.com> wrote:
> I must have misread the numbers before when using bonnie++; running it
> again I get 1.3GB/s read and 700MB/s write, which looks more promising.

pretty nice.

> iowait is typically <20%, which is worse than for bonnie.
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda            2544.00         0.00     66432.00          0     132864
> sdc            4153.00    132848.00       136.00     265696        272
>
> I noticed that the system values are usually in the 20% region; could
> this be the locks? BTW, pgbench is running on the db server, not a
> client - would that influence things dramatically?

Since we have some idle cpu% here, we can probably eliminate pgbench as a
bottleneck by messing around with the -j switch. Another thing we want to
test is the "-N" switch -- this doesn't update the tellers and branches
tables, which in high-concurrency situations can bind you from a locking
perspective.

One thing that immediately jumps out here is that your WAL volume could be
holding you up, so it's possible we may want to move the WAL to the SSD
volume.
If you can scrounge up a 9.2 pgbench, we can gather more evidence for that
by running pgbench with the "--unlogged-tables" option, which creates the
tables unlogged so that they are not WAL-logged (for the record, this
causes the tables to be truncated when the server is not shut down in a
clean state). Putting all the options above together (history only, no
WAL, multi-thread), your test more closely approximates random device
write performance.

>> *) Another interesting test to run is a large scaling factor (ideally
>> at least 2x RAM) read-only test via pgbench -S.
>
> Might give that a go when I next get a chance to run the tests...

yeah -- this will tell us the raw seek performance of the SSD volume,
which presumably will be stupendous. 2x is the minimum, btw; 10x would be
more appropriate.

Since you're building a beast, other settings to explore are NUMA
(http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
and the Dell memory BIOS settings that are occasionally set badly from the
factory (see here:
http://bleything.net/articles/postgresql-benchmarking-memory.html).

merlin
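A sketch of the two suggested runs against a 9.2 pgbench (database name and run length are assumptions; --unlogged-tables is an init-time option):

```shell
# unlogged tables plus -N: takes WAL and teller/branch locking out of the picture
pgbench -i -s 100 --unlogged-tables pgbench
pgbench -c 100 -j 25 -N -T 60 pgbench

# read-only seek test at a scale well beyond RAM: s = 10000 is roughly 150GB,
# comfortably past the 64GB of memory (true 10x RAM would need s around 40000)
pgbench -i -s 10000 pgbench
pgbench -c 100 -j 25 -S -T 60 pgbench
```

Comparing the unlogged-table tps against the normal run isolates how much the WAL volume is costing; the -S run isolates random read performance of the SSD array.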
On 29/11/2012 17:33, Merlin Moncure wrote:
> Since we have some idle cpu% here, we can probably eliminate pgbench as a
> bottleneck by messing around with the -j switch. Another thing we want to
> test is the "-N" switch -- this doesn't update the tellers and branches
> tables, which in high-concurrency situations can bind you from a locking
> perspective.

Using -N gives around a 15% increase in tps with no major changes in load,
etc. Using more threads slightly drops the performance (as expected with
only 32 "cores"); dropping the count does give a slight increase
(presumably because half the cores aren't real).

> One thing that immediately jumps out here is that your WAL volume could
> be holding you up, so it's possible we may want to move the WAL to the
> SSD volume. If you can scrounge up a 9.2 pgbench, we can gather more
> evidence for that by running pgbench with the "--unlogged-tables"
> option, which creates the tables unlogged so that they are not
> WAL-logged (for the record, this causes the tables to be truncated when
> the server is not shut down in a clean state).

I did notice that using -S drives the tps up to near 30k, so it is
possibly the WAL volume, although saying that, I did move the pg_xlog
directory onto the SSD array before posting to the list and the difference
wasn't significant. I'll try to repeat that when I next get some downtime
(I'm having to run the current tests while the db is live, but under light
load). I'll have a look at using the 9.2 pgbench and see what happens.

> yeah -- this will tell us the raw seek performance of the SSD volume,
> which presumably will be stupendous. 2x is the minimum, btw; 10x would
> be more appropriate.
>
> Since you're building a beast, other settings to explore are NUMA
> (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
> and the Dell memory BIOS settings that are occasionally set badly from
> the factory (see here:
> http://bleything.net/articles/postgresql-benchmarking-memory.html).
Cheers for the links. I'd already looked at the NUMA stuff and disabled
zone reclaim. I was looking at using the patch previously posted that used
shared mode for the master process and then local only for the workers -
excuse the terminology - but time constraints prevented that.

I made sure the box was in performance mode in the BIOS. Unfortunately I
only spotted Ben's blog when I was setting the box up, so I didn't have
time to go through all the tests. At the time performance seemed OK (well,
better than the previous box :) - but having it live for a while made me
think I, or it, could be doing better.

Anyway, I still think it would be nice to post tps results for comparative
purposes, so if I get a minute or two I'll create a site and stick mine on
there.

John
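For reference, the zone-reclaim change mentioned above is a one-line sysctl (a sketch; run as root, and persist it so it survives a reboot):

```shell
# check the current setting: 0 means zone reclaim is disabled
cat /proc/sys/vm/zone_reclaim_mode

# disable it for the running kernel
sysctl -w vm.zone_reclaim_mode=0

# persist across reboots
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```

On multi-socket boxes like this one, leaving zone reclaim enabled can cause the kernel to evict page cache rather than use memory on the remote NUMA node, which hurts a cache-heavy database badly.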
On 29/11/2012 17:33, Merlin Moncure wrote:
> One thing that immediately jumps out here is that your WAL volume could
> be holding you up, so it's possible we may want to move the WAL to the
> SSD volume. If you can scrounge up a 9.2 pgbench, we can gather more
> evidence for that by running pgbench with the "--unlogged-tables"
> option, which creates the tables unlogged so that they are not
> WAL-logged (for the record, this causes the tables to be truncated when
> the server is not shut down in a clean state).

Ok, I got myself a 9.2 version of pgbench and ran it a few times on
unlogged tables... changing the number of threads makes maybe a 5% change
in the values, which probably isn't too much to worry about.

-j 25 -c 100 -s 100 gives a tps of around 10.5k
using -N ups that to around 20k
using -S ups that again to around 40k

I'll have to wait until I can shut the db down again to try the WAL on an
SSD, although unless I did something wrong it didn't seem to make a huge
difference before....

During these tests, iowait dropped to almost 0, and user and sys stayed
around the same (60% and 20% respectively), although the disk traffic was
only in the tens of MB/s, which seems very low - unless there is some
weird caching going on and it gets dumped at a later date?

John

--
Get the PriceGoblin Browser Addon
www.pricegoblin.co.uk
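As a rough sanity check on the figures above, the average per-transaction latency implied by a tps number at a given client count can be worked out directly (the 10.5k figure is the unlogged-table run reported above):

```shell
# average latency per transaction = clients / tps
clients=100
tps=10500
awk -v c="$clients" -v t="$tps" 'BEGIN { printf "%.1f ms\n", 1000 * c / t }'
# prints "9.5 ms"
```

At the 40k tps read-only figure the same arithmetic gives about 2.5ms per transaction across 100 clients, which is consistent with cache-served reads rather than device-bound seeks.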