Thread: Benchmarking a large server

Benchmarking a large server

From
Chris Hoover
Date:
I've got a fun problem.

My employer just purchased some new db servers that are very large.  The specs on them are:

4 Intel X7550 CPUs (32 physical cores, HT turned off)
1 TB RAM
1.3 TB Fusion IO (2 x 1.3 TB Fusion IO Duo cards in a RAID 10)
3 TB SAS array (48 x 15K RPM 146 GB spindles)

The issue we are running into is how to benchmark this server, and specifically how to get valid benchmarks for the Fusion IO card.  Normally, to eliminate the cache effect, you run iozone and other benchmark suites at 2x the RAM.  However, we can't do that here, since 2 TB > 1.3 TB.

So, does anyone have any suggestions/experiences in benchmarking storage when the storage is smaller than 2x memory?

Thanks,

Chris

Re: Benchmarking a large server

From
Merlin Moncure
Date:
On Mon, May 9, 2011 at 3:32 PM, Chris Hoover <revoohc@gmail.com> wrote:
> I've got a fun problem.
> My employer just purchased some new db servers that are very large.  The
> specs on them are:
> 4 Intel X7550 CPU's (32 physical cores, HT turned off)
> 1 TB Ram
> 1.3 TB Fusion IO (2 1.3 TB Fusion IO Duo cards in a raid 10)
> 3TB Sas Array (48 15K 146GB spindles)

my GOODNESS!  :-D.  I mean, just, wow.

> The issue we are running into is how do we benchmark this server,
> specifically, how do we get valid benchmarks for the Fusion IO card?
>  Normally to eliminate the cache effect, you run iozone and other benchmark
> suites at 2x the ram.  However, we can't do that due to 2TB > 1.3TB.
> So, does anyone have any suggestions/experiences in benchmarking storage
> when the storage is smaller then 2x memory?

hm, if it was me, I'd write a small C program that just jumped
directly on the device around and did random writes assuming it wasn't
formatted.  For sequential read, just flush caches and dd the device
to /dev/null.  Probably someone will suggest better tools though.
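
Something like this rough, untested sketch is the kind of thing I mean (the
device name is just a placeholder for wherever the Fusion IO card shows up,
and obviously it scribbles over anything on it):

#define _GNU_SOURCE                     /* for O_DIRECT */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DEVICE "/dev/fioa"              /* placeholder: your raw Fusion IO device */
#define BLOCK  8192                     /* 8 kB, same size as a Postgres page */
#define COUNT  100000L                  /* number of random writes to issue */

int main(void)
{
    int fd = open(DEVICE, O_WRONLY | O_DIRECT);  /* O_DIRECT keeps the page cache out of it */
    if (fd < 0) { perror("open"); return 1; }

    long long nblocks = lseek(fd, 0, SEEK_END) / BLOCK;  /* device size in 8 kB blocks */

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK) != 0) return 1; /* O_DIRECT wants an aligned buffer */
    memset(buf, 0xAB, BLOCK);

    srandom(time(NULL));
    for (long i = 0; i < COUNT; i++) {
        off_t pos = (off_t)(random() % nblocks) * BLOCK;  /* random 8 kB-aligned offset */
        if (pwrite(fd, buf, BLOCK, pos) != BLOCK) { perror("pwrite"); return 1; }
    }
    close(fd);
    return 0;
}

Run it against the raw device while watching iostat to see what the card
actually does; for sequential reads, dropping the caches and dd'ing the
device to /dev/null is enough.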

merlin

Re: Benchmarking a large server

From
David Boreham
Date:
> hm, if it was me, I'd write a small C program that just jumped
> directly on the device around and did random writes assuming it wasn't
> formatted.  For sequential read, just flush caches and dd the device
> to /dev/null.  Probably someone will suggest better tools though.
I have a program I wrote years ago for a purpose like this.  One of the
things it can do is write to the filesystem at the same time as dirtying
pages in a large shared or non-shared memory region.  The idea was to
emulate the behavior of a database reasonably accurately.  Something like
bonnie++ would probably be a good starting point these days though.



Re: Benchmarking a large server

From
Ben Chobot
Date:
On May 9, 2011, at 1:32 PM, Chris Hoover wrote:

> 1.3 TB Fusion IO (2 1.3 TB Fusion IO Duo cards in a raid 10)

Be careful here. What if the entire card hiccups, instead of just a device on it? (We've had that happen to us before.)
Depending on how you've done your raid 10, either all of your redundancy is gone or your data is. 

Re: Benchmarking a large server

From
Shaun Thomas
Date:
On 05/09/2011 03:32 PM, Chris Hoover wrote:

> So, does anyone have any suggestions/experiences in benchmarking storage
> when the storage is smaller then 2x memory?

We had a similar problem when benching our FusionIO setup. What I did
was write a script that cleared out the Linux system cache before every
iteration of our pgbench tests. You can do that easily with:

echo 3 > /proc/sys/vm/drop_caches

Executed as root.

Then we ran short (10, 20, 30, 40 clients, 10,000 transactions each)
pgbench tests, resetting the cache and the DB after every iteration. It
was all automated in a script, so it wasn't too much work.

We got (roughly) a 15x speed improvement over a 6x15k RPM RAID-10 setup
on the same server, with no other changes. This was definitely
corroborated after deployment, when our frequent periods of 100% disk IO
utilization vanished and were replaced by occasional 20-30% spikes. Even
that's an unfair comparison in favor of the RAID, since we also had to add
DRBD to the mix; you can't share a PCI card between two servers.

If you do have two 1.3TB Duo cards in a 4x640GB RAID-10, you should get
even better read times than we did.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com


Re: Benchmarking a large server

From
Merlin Moncure
Date:
On Mon, May 9, 2011 at 3:59 PM, David Boreham <david_list@boreham.org> wrote:
>
>> hm, if it was me, I'd write a small C program that just jumped
>> directly on the device around and did random writes assuming it wasn't
>> formatted.  For sequential read, just flush caches and dd the device
>> to /dev/null.  Probably someone will suggest better tools though.
>
> I have a program I wrote years ago for a purpose like this. One of the
> things it can
> do is write to the filesystem at the same time as dirtying pages in a large
> shared
> or non-shared memory region. The idea was to emulate the behavior of a
> database
> reasonably accurately. Something like bonnie++ would probably be a good
> starting
> point these days though.

The problem with bonnie++ is that the results aren't valid, especially
the read tests.  I think it refuses to even run unless you set special
switches.

merlin

Re: Benchmarking a large server

From
David Boreham
Date:
On 5/9/2011 3:11 PM, Merlin Moncure wrote:
> The problem with bonnie++ is that the results aren't valid, especially
> the read tests.  I think it refuses to even run unless you set special
> switches.

I only care about writes ;)

But definitely, be careful with the tools.  I tend to prefer small programs
written in-house myself, and of course simply running your application
under a synthesized load.





Re: Benchmarking a large server

From
Greg Smith
Date:
On 05/09/2011 04:32 PM, Chris Hoover wrote:
> So, does anyone have any suggestions/experiences in benchmarking
> storage when the storage is smaller then 2x memory?

If you do the Linux trick to drop its caches already mentioned, you can
start a database test with zero information in memory.  In that situation,
whether or not everything could fit in RAM doesn't matter as much; you're
starting with none of it in there, so you can benchmark things without
having twice as much disk space.  You just have to recognize that the test
becomes less useful the longer you run it, and measure the results
accordingly.

A test starting from that state will start out showing you random I/O
speed on the device, slowly moving toward in-memory cached speeds as
the benchmark runs for a while.  You really need to capture the latency
data for every transaction and graph it over time to make any sense of
it.  If you look at "Using and Abusing pgbench" at
http://projects.2ndquadrant.com/talks , starting on P33 I have several
slides showing such a test, done with pgbench and pgbench-tools.  I
added a quick hack to pgbench-tools around then to make it easier to run
this specific type of test, but to my knowledge no one else has ever
used it.  (I've had talks about PostgreSQL in my yard that were better
attended than that session, for which I blame Jonah Harris for doing a
great talk in the room next door concurrent with it.)

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Benchmarking a large server

From
Cédric Villemain
Date:
2011/5/9 Chris Hoover <revoohc@gmail.com>:
> I've got a fun problem.
> My employer just purchased some new db servers that are very large.  The
> specs on them are:
> 4 Intel X7550 CPU's (32 physical cores, HT turned off)
> 1 TB Ram
> 1.3 TB Fusion IO (2 1.3 TB Fusion IO Duo cards in a raid 10)
> 3TB Sas Array (48 15K 146GB spindles)
> The issue we are running into is how do we benchmark this server,
> specifically, how do we get valid benchmarks for the Fusion IO card?
>  Normally to eliminate the cache effect, you run iozone and other benchmark
> suites at 2x the ram.  However, we can't do that due to 2TB > 1.3TB.
> So, does anyone have any suggestions/experiences in benchmarking storage
> when the storage is smaller then 2x memory?

You can reduce the visible memory size at server boot.
If you use Linux, you can add 'mem=512G' to your boot-time kernel
parameters.  (Maybe it supports only K or M suffixes, in which case use
512*1024M.)

> Thanks,
> Chris



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Benchmarking a large server

From
Craig James
Date:
2011/5/9 Chris Hoover<revoohc@gmail.com>:

> I've got a fun problem.
> My employer just purchased some new db servers that are very large.  The
> specs on them are:
> 4 Intel X7550 CPU's (32 physical cores, HT turned off)
> 1 TB Ram
> 1.3 TB Fusion IO (2 1.3 TB Fusion IO Duo cards in a raid 10)
> 3TB Sas Array (48 15K 146GB spindles)
> The issue we are running into is how do we benchmark this server,
> specifically, how do we get valid benchmarks for the Fusion IO card?
>   Normally to eliminate the cache effect, you run iozone and other benchmark
> suites at 2x the ram.  However, we can't do that due to 2TB>  1.3TB.
> So, does anyone have any suggestions/experiences in benchmarking storage
> when the storage is smaller then 2x memory?
Maybe this is a dumb question, but why do you care?  If you have 1TB RAM and just a little more actual disk space, it
seems like your database will always be cached in memory anyway.  If you "eliminate the cache effect," won't the
benchmark actually give you the wrong real-life results? 

Craig


Re: Benchmarking a large server

From
David Boreham
Date:
On 5/9/2011 6:32 PM, Craig James wrote:
> Maybe this is a dumb question, but why do you care?  If you have 1TB
> RAM and just a little more actual disk space, it seems like your
> database will always be cached in memory anyway.  If you "eliminate
> the cach effect," won't the benchmark actually give you the wrong
> real-life results?

The time it takes to populate the cache from a cold start might be
important.

Also, if it were me, I'd be wanting to check for weird performance
behavior at this memory scale.
I've seen cases in the past where the VM subsystem went bananas because the
designers and testers of its algorithms never considered the physical
memory size we deployed.

How many times was the kernel tested with this much memory, for example?
(Never??)



Re: Benchmarking a large server

From
Greg Smith
Date:
Craig James wrote:
> Maybe this is a dumb question, but why do you care?  If you have 1TB
> RAM and just a little more actual disk space, it seems like your
> database will always be cached in memory anyway.  If you "eliminate
> the cach effect," won't the benchmark actually give you the wrong
> real-life results?

If you'd just spent what two FusionIO drives cost, you'd want to make
damn sure they worked as expected too.  Also, if you look carefully,
there is more disk space than this on the server, just not on the SSDs.
It's possible this setup could end up with most of RAM filled with data
that's stored on the regular drives.  In that case the random
performance of the busy SSD would be critical.  It would likely take a
very bad set of disk layout choices for that to happen, but I could see
heavy sequential scans of tables in a data warehouse pushing in that
direction.

Isolating out the SSD performance without using the larger capacity of
the regular drives on the server is an excellent idea here; it's just
tricky to do.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Benchmarking a large server

From
david@lang.hm
Date:
On Mon, 9 May 2011, David Boreham wrote:

> On 5/9/2011 6:32 PM, Craig James wrote:
>> Maybe this is a dumb question, but why do you care?  If you have 1TB RAM
>> and just a little more actual disk space, it seems like your database will
>> always be cached in memory anyway.  If you "eliminate the cach effect,"
>> won't the benchmark actually give you the wrong real-life results?
>
> The time it takes to populate the cache from a cold start might be important.

you may also have other processes that will be contending with the disk
buffers for memory (for that matter, postgres may use a significant amount
of that memory as it is producing its results)

David Lang

> Also, if it were me, I'd be wanting to check for weird performance behavior
> at this memory scale.
> I've seen cases in the past where the VM subsystem went bananas because the
> designers
> and testers of its algorithms never considered the physical memory size we
> deployed.
>
> How many times was the kernel tested with this much memory, for example ?
> (never??)
>
>
>
>

Re: Benchmarking a large server

From
Shaun Thomas
Date:
> How many times was the kernel tested with this much memory, for example
> ? (never??)

This is actually *extremely* relevant.

Take a look at /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio if you have an older Linux system, or
/proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_background_bytes with a newer one. 

On older systems for instance, those are set to 40 and 20 respectively (recent kernels cut these in half). That's
significant because ratio is the *percentage* of memory that can remain dirty before writers get forced into synchronous
writeback, and background_ratio tells it when it should start writing in the background to avoid hitting that higher and
much more disruptive number. This is another source of IO that can be completely independent of the checkpoint spikes
that long plagued PostgreSQL versions prior to 8.3. 

With that much memory (1TB!), that's over 100GB of dirty memory before it starts writing that out to disk even with the
newer, more conservative settings. We had to tweak and test for days to find good settings for these, and our servers
only have 96GB of RAM. You also have to consider that, as fast as the FusionIO drives are, they're still NAND flash,
which has write-amplification issues. How fast do you think they can commit 100GB of dirty memory to disk? Even with a
background setting of 1%, that's 10GB on your system. 

That means you'd need a kernel new enough to have the dirty_bytes and dirty_background_bytes settings, so you can force
those limits down to more sane levels and avoid unpredictable, several-minute-long synchronous flushes. I'm not sure how
much testing Linux sees on massive hardware like that, but that's just one hidden danger of not properly benchmarking
the server and just assuming that 1TB of memory caching the entire dataset can only be an improvement. 

--
Shaun Thomas
Peak6 | 141 W. Jackson Blvd. | Suite 800 | Chicago, IL 60604
312-676-8870
sthomas@peak6.com


Re: Benchmarking a large server

From
Greg Smith
Date:
On 05/09/2011 11:13 PM, Shaun Thomas wrote:
> Take a look at /proc/sys/vm/dirty_ratio and
> /proc/sys/vm/dirty_background_ratio if you have an older Linux system,
> or /proc/sys/vm/dirty_bytes, and /proc/sys/vm/dirty_background_bytes
> with a newer one.
> On older systems for instance, those are set to 40 and 20 respectively (recent kernels cut these in half).

1/4 actually; 10% and 5% starting in kernel 2.6.22.  The main sources of
this on otherwise new servers I see are RedHat Linux RHEL5 systems
running 2.6.18.  But as you say, even the lower defaults of the newer
kernels can be way too much on a system with lots of RAM.

The main downside I've seen of addressing this by using a kernel with
dirty_bytes and dirty_background_bytes is that VACUUM can slow down
considerably.  It really relies on the filesystem having a lot of write
cache to perform well.  In many cases people are happy with VACUUM
throttling if it means nasty I/O spikes go away, but the trade-offs here
are still painful at times.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Benchmarking a large server

From
Yeb Havinga
Date:
On 2011-05-09 22:32, Chris Hoover wrote:
>
> The issue we are running into is how do we benchmark this server,
> specifically, how do we get valid benchmarks for the Fusion IO card?
>  Normally to eliminate the cache effect, you run iozone and other
> benchmark suites at 2x the ram.  However, we can't do that due to 2TB
> > 1.3TB.
>
> So, does anyone have any suggestions/experiences in benchmarking
> storage when the storage is smaller then 2x memory?

Oracle's Orion test tool has a configurable cache size parameter.  It's a
separate download, written specifically to benchmark database OLTP- and
OLAP-like I/O patterns; see
http://www.oracle.com/technetwork/topics/index-089595.html

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data


Re: Benchmarking a large server

From
Claudio Freire
Date:
On Mon, May 9, 2011 at 10:32 PM, Chris Hoover <revoohc@gmail.com> wrote:
> So, does anyone have any suggestions/experiences in benchmarking storage
> when the storage is smaller then 2x memory?

Try writing a small Python script (or C program) that mmaps a large chunk
of memory with MAP_LOCKED; this will keep it in RAM and keep that RAM
from being used for caching.
The script should touch the memory at least once so overcommit doesn't
get smart on you.

I think only root can lock memory, so that small program would have to
run as root.
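
A minimal, untested C sketch of that idea (the 700 GB figure is just an
example; pick whatever leaves the amount of effectively visible RAM you
want to test against):

#define _GNU_SOURCE                     /* for MAP_ANONYMOUS / MAP_LOCKED */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define GIGS 700ULL                     /* example: hide 700 GB of the 1 TB */

int main(void)
{
    size_t len = GIGS * 1024ULL * 1024ULL * 1024ULL;

    /* MAP_LOCKED pins the pages so the kernel can't reclaim them for cache */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 1, len);                  /* touch every page once, as above */

    printf("%llu GB locked; leave this running while you benchmark\n", GIGS);
    pause();                            /* hold on to the memory until killed */
    return 0;
}

Run it as root so the mlock limit doesn't get in the way, leave it sitting
there, and benchmark the Fusion IO with only the remaining RAM available
for caching.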

Re: Benchmarking a large server

From
Jeff
Date:
On May 9, 2011, at 4:50 PM, Merlin Moncure wrote:
>
> hm, if it was me, I'd write a small C program that just jumped
> directly on the device around and did random writes assuming it wasn't
> formatted.  For sequential read, just flush caches and dd the device
> to /dev/null.  Probably someone will suggest better tools though.
>
> merlin
>

<shameless plug>
http://pgfoundry.org/projects/pgiosim

It is a small program we use to beat the [bad word] out of I/O systems.
It randomly seeks, does an 8kB read, optionally writes it back out (and
optionally fsyncs), and reports how fast it is going (you need to watch
iostat output as well so you can see the actual physical tps without the
OS cache interfering).

It goes through regular read & write calls like PG (I didn't want to
bother with junk like O_DIRECT & friends).

It is also now multithreaded, so you can fire up a bunch of random read
threads (rather than firing up a bunch of pgiosims in parallel) and see
how things scale up.


--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: Benchmarking a large server

From
Cédric Villemain
Date:
2011/5/10 Greg Smith <greg@2ndquadrant.com>:
> On 05/09/2011 11:13 PM, Shaun Thomas wrote:
>>
>> Take a look at /proc/sys/vm/dirty_ratio and
>> /proc/sys/vm/dirty_background_ratio if you have an older Linux system, or
>> /proc/sys/vm/dirty_bytes, and /proc/sys/vm/dirty_background_bytes with a
>> newer one.
>> On older systems for instance, those are set to 40 and 20 respectively
>> (recent kernels cut these in half).
>
> 1/4 actually; 10% and 5% starting in kernel 2.6.22.  The main sources of
> this on otherwise new servers I see are RedHat Linux RHEL5 systems  running
> 2.6.18.  But as you say, even the lower defaults of the newer kernels can be
> way too much on a system with lots of RAM.

one can experiment with writeback storms using this program from Chris
Mason, released under the GPLv2:
http://oss.oracle.com/~mason/fsync-tester.c

You need to tweak it a bit: AFAIR this #define SIZE (32768*32) must be
reduced to 8kB blocks if you want something similar to the PG write
pattern.
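
That is, something along the lines of:

#define SIZE 8192   /* 8 kB writes, to match PostgreSQL's block size */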

The program writes a big file and does many small fsyncs, writing to both
at the same time.  Please see http://www.spinics.net/lists/linux-ext4/msg24308.html

It is used as a torture test by some Linux filesystem hackers and may be
useful to the OP for validating the hardware and kernel on his large server.


>
> The main downside I've seen of addressing this by using a kernel with
> dirty_bytes and dirty_background_bytes is that VACUUM can slow down
> considerably.  It really relies on the filesystem having a lot of write
> cache to perform well.  In many cases people are happy with VACUUM
> throttling if it means nasty I/O spikes go away, but the trade-offs here are
> still painful at times.
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Re: Benchmarking a large server

From
Greg Smith
Date:
Greg Smith wrote:
> On 05/09/2011 11:13 PM, Shaun Thomas wrote:
>> Take a look at /proc/sys/vm/dirty_ratio and
>> /proc/sys/vm/dirty_background_ratio if you have an older Linux
>> system, or /proc/sys/vm/dirty_bytes, and
>> /proc/sys/vm/dirty_background_bytes with a newer one.
>> On older systems for instance, those are set to 40 and 20
>> respectively (recent kernels cut these in half).
>
> 1/4 actually; 10% and 5% starting in kernel 2.6.22.  The main sources
> of this on otherwise new servers I see are RedHat Linux RHEL5 systems
> running 2.6.18.  But as you say, even the lower defaults of the newer
> kernels can be way too much on a system with lots of RAM.

Ugh...we're both right, sort of.  2.6.22 dropped them to 5/10:
http://kernelnewbies.org/Linux_2_6_22 as I said.  But on the new
Scientific Linux 6 box I installed yesterday, they're at 10/20--as you
suggested.

Can't believe I'm going to need a table by kernel version and possibly
distribution to keep this all straight now, what a mess.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books