Thread: AMD Shanghai versus Intel Nehalem
Anyone on the list had a chance to benchmark the Nehalems yet? I'm primarily wondering if their promise of performance from 3 memory channels holds up under typical pgsql workloads. I've been really happy with the behavior of my AMD Shanghai based server under heavy loads, but if the Nehalem's much-touted performance increase translates to pgsql, I'd like to know.
Anand did SQL Server and Oracle test results, the Nehalem system looks like a substantial improvement over the Shanghai Opteron 2384:

http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Tue, May 12, 2009 at 8:05 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> Anand did SQL Server and Oracle test results, the Nehalem system looks like
> a substantial improvement over the Shanghai Opteron 2384:
>
> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7

That's an interesting article. Thanks for the link. A couple of points stick out to me:

1: 5520 to 5540 parts only have 1 133MHz step increase in performance
2: 550x parts have no hyperthreading.

Assuming that the part tested (the 5570) was using hyperthreading and two 133MHz steps, at the lower end of the range the 550x parts are likely not that much faster than the Opterons in the same clock speed range, but they are still quite a bit more expensive.

It'd be nice to see some benchmarks on the more reasonably priced CPUs in both ranges: the 2.2 to 2.4GHz Opterons and the 2.0GHz (5504) to 2.26GHz (5520) Nehalems. Since I have to buy > 1 server to handle the load and provide redundancy anyway, single-CPU performance isn't nearly as interesting as aggregate performance / $ spent.

While all the benchmarks on near-3GHz parts are fun to read and salivate over, they're not as relevant to my interests as the performance of the more reasonably priced parts.
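As a quick sanity check of the 133MHz clock steps discussed above, here is a small sketch. The "base clock plus N bus steps" model is my own inference from the listed frequencies (5504 at 2.0GHz, 5520 at 2.26GHz, 5540 at 2.53GHz), not anything from Intel's documentation:

```python
# Nehalem Xeon 55xx clocks modeled as 133MHz steps above the 2.0GHz 5504.
STEP_GHZ = 0.133

def clock(base_ghz, steps):
    """Clock after a given number of 133MHz steps above the base part."""
    return round(base_ghz + steps * STEP_GHZ, 2)

print(clock(2.0, 2))  # two steps above the 5504: ~2.27, close to the 5520's 2.26GHz
print(clock(2.0, 4))  # four steps: 2.53, the 5540's clock
```

The step sizes are small enough that, within the sub-$500 bracket, clock alone won't separate the parts much; features like hyperthreading matter more.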
The $ cost of more CPU power on larger machines ends up such a small % chunk, especially after I/O cost. Sure, the CPU with HyperThreading and the turbo might be 40% more expensive than the other CPU, but if the total system cost is 5% more for 15% more performance . . .

It depends on how CPU limited you are. If you aren't, there isn't much of a reason to look past the cheaper Opterons with a good I/O setup.

I've got a 2 x 5520 system with lots of RAM on the way. The problem with lots of RAM in the Nehalem systems is that the memory speed slows as more is added. I think mine slows from the 1066MHz the processor can handle to 800MHz. It still has way more bandwidth than the old Xeons though. Although my use case is about as far from pgbench as you can get, I might be able to get a run of it in during stress testing.

On 5/12/09 7:28 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> On Tue, May 12, 2009 at 8:05 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>> Anand did SQL Server and Oracle test results, the Nehalem system looks like
>> a substantial improvement over the Shanghai Opteron 2384:
>>
>> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
>> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7
>
> That's an interesting article. Thanks for the link. A couple points
> stick out to me.
>
> 1: 5520 to 5540 parts only have 1 133MHz step increase in performance
> 2: 550x parts have no hyperthreading.
>
> Assuming that the parts tested (5570) were using hyperthreading and
> two 133MHz steps, at the lower end of the range, the 550x parts are
> likely not that much faster than the opterons in their same clock
> speed range, but are still quite a bit more expensive.
>
> It'd be nice to see some benchmarks on the more reasonably priced CPUs
> in both ranges, the 2.2 to 2.4 GHz opterons and the 2.0 (5504) to
> 2.26GHz (5520) nehalems. Since I have to buy > 1 server to handle the
> load and provide redundancy anyway, single cpu performance isn't
> nearly as interesting as aggregate performance / $ spent.
>
> While all the benchmarks on near 3GHz parts are fun to read and
> salivate over, they're not as relevant to my interests as the performance
> of the more reasonably priced parts.
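The 1066MHz-to-800MHz memory downclock mentioned above has a straightforward theoretical cost. A back-of-the-envelope sketch, assuming triple-channel DDR3 moving 8 bytes per channel per transfer (the standard DDR3 bus width):

```python
# Theoretical peak bandwidth for triple-channel DDR3:
# channels * 8 bytes per transfer * effective transfer rate (MT/s).
def peak_bandwidth_gb_s(channels, mt_per_s):
    return channels * 8 * mt_per_s / 1000  # decimal GB/s

full = peak_bandwidth_gb_s(3, 1066)  # DDR3-1066: ~25.6 GB/s
down = peak_bandwidth_gb_s(3, 800)   # downclocked to DDR3-800: 19.2 GB/s
print(full, down, f"{1 - down / full:.0%} drop")
```

Even the downclocked figure is several times what the older FSB-based Xeons could deliver, which matches the observation that it "still has way more bandwidth than the old Xeons."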
On Tue, May 12, 2009 at 8:59 PM, Scott Carey <scott@richrelevance.com> wrote:
> The $ cost of more CPU power on larger machines ends up such a small %
> chunk, especially after I/O cost. Sure, the CPU with HyperThreading and the
> turbo might be 40% more expensive than the other CPU, but if the total
> system cost is 5% more for 15% more performance . . .

But every dollar I spend on CPUs is a dollar I can't spend on RAID controllers, more memory, or more drives. We're looking at machines with, say, 32 1TB SATA drives, which run in the $12k range. The Nehalem 5570s (2.8GHz) are going for something in the range of $1500 or more, the 5540 (2.53GHz) at $774.99, the 5520 (2.26GHz) at $384.99, and the 5506 (2.13GHz) at $274.99. The 5520 is the first one with hyperthreading, so it's a reasonable cost increase. Somewhere around the 5530, the cost for the increase in performance stops making a lot of sense.

The Opterons, like the 2378 Barcelona at 2.4GHz for $279.99, or the 2.5GHz 2380 at $400, are good values. And I know they mostly scale by clock speed, so I can decide which to buy based on that. The 83xx series CPUs are still far too expensive to be cost effective, with 2.2GHz parts running $600 and faster parts climbing VERY quickly after that.

So what I want to know is how the 2.5GHz Barcelonas would compare to the 5506 through 5530 Nehalems, as those parts are all in the same cost range (sub-$500 CPUs).

> It depends on how CPU limited you are. If you aren't, there isn't much of a
> reason to look past the cheaper Opterons with a good I/O setup.

Exactly, which is why I'm looking for the best bang for the buck on the CPU front. Also performance as a "data pump," so to speak, i.e. minimizing memory bandwidth limitations.

> I've got a 2 x 5520 system with lots of RAM on the way. The problem with
> lots of RAM in the Nehalem systems is that the memory speed slows as more
> is added.

I too wondered about that and its effect on performance. Another benchmark I'd like to see: how it runs with more and less memory.

> I think mine slows from the 1066MHz the processor can handle to
> 800MHz. It still has way more bandwidth than the old Xeons though.
> Although my use case is about as far from pgbench as you can get, I might
> be able to get a run of it in during stress testing.

I'd be very interested in hearing how it runs, and not just for pgbench.
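For what it's worth, the price-per-clock arithmetic behind the comparison above can be laid out directly. This is only a sketch using the street prices quoted in this thread; dollars per GHz is a crude proxy that ignores IPC, hyperthreading, cache, and memory bandwidth differences:

```python
# Street prices and clock speeds quoted in this thread (all quad-core parts).
parts = {
    "Xeon 5506":    (274.99, 2.13),  # no hyperthreading
    "Xeon 5520":    (384.99, 2.26),  # first 55xx model with hyperthreading
    "Xeon 5540":    (774.99, 2.53),
    "Opteron 2378": (279.99, 2.40),
    "Opteron 2380": (400.00, 2.50),
}

# Dollars per GHz of clock, cheapest first.
for name, (price, ghz) in sorted(parts.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:13s} ${price / ghz:7.2f} per GHz")
```

By this crude measure the 2378 and 5506 are the value picks and the 5540 costs roughly twice as much per GHz as the 5520, which matches the "somewhere around the 5530 it stops making sense" intuition.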
Just realized I made a mistake: I was under the impression that Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I was wrong; it appears the 8xxx numbers are for 4+ socket servers while the 23xx numbers are for 2 or fewer sockets. So there are several quite affordable Shanghai CPUs out there, and many of the ones I quoted as Barcelonas are in fact Shanghais with the larger 6MB L3 cache.
We have a dual E5540 with 16GB (I think 1066MHz) memory here, but no AMD Shanghai. We haven't done PostgreSQL benchmarks yet, but given our previous experience, PostgreSQL should show a similar speedup to MySQL.

Our database benchmark is actually mostly a CPU/memory benchmark. Comparing the results of the dual E5540 (2.53GHz with HT enabled) to a dual Intel X5355 (2.6GHz quad-core from 2007), the peak load has increased from somewhere between 7 and 10 concurrent clients to somewhere around 25, suggesting more scalable hardware. With 25 concurrent clients we handled 2.5 times the number of queries/second compared to the 7-concurrent-client score for the X5355, both in MySQL 5.0.41. At 7 concurrent clients we still had 1.7 times the previous result.

I'm not really sure how the Shanghai CPUs compare to those older X5355s; the AMDs should be faster, but by how much? I've no idea if we'll get a Shanghai to compare it with, but we will get a dual X5570 soon on which we'll repeat some of the tests, so that should at least help a bit with scaling down the X5570 results seen around the world.

Best regards,

Arjen

On 12-5-2009 20:47 Scott Marlowe wrote:
> Anyone on the list had a chance to benchmark the Nehalems yet? I'm
> primarily wondering if their promise of performance from 3 memory
> channels holds up under typical pgsql workloads. I've been really
> happy with the behavior of my AMD Shanghai based server under heavy
> loads, but if the Nehalem's much-touted performance increase translates
> to pgsql, I'd like to know.
On 5/12/09 10:06 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> Just realized I made a mistake, I was under the impression that
> Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I
> was wrong, it appears the 8xxx numbers are for 4+ socket servers while
> the 23xx numbers are for 2 or fewer sockets. So, there are several
> quite affordable Shanghai CPUs out there, and many of the ones I
> quoted as Barcelonas are in fact Shanghais with the larger 6MB L3
> cache.

At this point, I wouldn't go below the 5520 on the Nehalem side (turbo + HT is just too big a jump, as is the 1066MHz versus 800MHz memory jump). It's $100 extra per CPU on a $10K+ machine.

The next 'step' is the 5550, since it can run 1333MHz memory and has 2x the turbo -- but you would have to be more CPU bound for that. I wouldn't worry about the 5530 or 5540; they will only scale up a little from the 5520.

For Opterons, I wouldn't touch anything but a Shanghai these days, since it's not much more expensive and we know the cache differences are very important for DB loads.
On 5/12/09 11:08 PM, "Arjen van der Meijden" <acmmailing@tweakers.net> wrote:
> We have a dual E5540 with 16GB (I think 1066MHz) memory here, but no AMD
> Shanghai. We haven't done PostgreSQL benchmarks yet, but given our
> previous experience, PostgreSQL should show a similar speedup to MySQL.
>
> Our database benchmark is actually mostly a CPU/memory benchmark.
> Comparing the results of the dual E5540 (2.53GHz with HT enabled) to a
> dual Intel X5355 (2.6GHz quad-core from 2007), the peak load has
> increased from somewhere between 7 and 10 concurrent clients to
> somewhere around 25, suggesting more scalable hardware. With 25
> concurrent clients we handled 2.5 times the number of queries/second
> compared to the 7-concurrent-client score for the X5355, both in MySQL
> 5.0.41. At 7 concurrent clients we still had 1.7 times the previous result.

Excellent! That is a pretty huge boost. I'm curious which aspects of this new architecture helped the most. For Postgres, the following would seem the most relevant:

1. Shared L3 cache per processor -- more efficient shared data structure access.
2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
3. Faster cache coherency.
4. Lower latency RAM with more overall bandwidth (Opteron style).

Can you do a quick and dirty memory bandwidth test? (assuming Linux) On the older X5355 machine and the newer E5540, try:

/sbin/hdparm -T /dev/sd<device>

where <device> is a valid letter for a device on your system.

Here are the results for me on an older system with dual Intel E5335s (2GHz, 4MB cache, family 6 model 15). Best result out of 5 (it's not all that consistent, plus or minus 10%):

/dev/sda:
 Timing cached reads: 10816 MB in 2.00 seconds = 5416.89 MB/sec

And a newer system with dual Xeon X5460s (3.16GHz, 6MB cache, family 6 model 23). Best of 7 results:

/dev/sdb:
 Timing cached reads: 26252 MB in 1.99 seconds = 13174.42 MB/sec

It's not a very accurate measurement, but it's quick and highlights relative hardware differences very easily.

> I'm not really sure how the Shanghai CPUs compare to those older
> X5355s, the AMDs should be faster, but by how much?

I'm not sure either, and the Xeon platforms have evolved such that the chipsets and RAM configurations matter as much as the processor does.

> I've no idea if we'll get a Shanghai to compare it with, but we will get a
> dual X5570 soon on which we'll repeat some of the tests, so that should
> at least help a bit with scaling down the X5570 results seen around the world.
>
> Best regards,
>
> Arjen
FYI: this is an excellent article on the Nehalem CPUs and their memory performance as the CPU and RAM combinations change:

http://blogs.sun.com/jnerl/entry/configuring_and_optimizing_intel_xeon

It's fairly complicated (as it is for the Opteron too).

On 5/13/09 9:58 AM, "Scott Carey" <scott@richrelevance.com> wrote:
> On 5/12/09 10:06 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
>> Just realized I made a mistake, I was under the impression that
>> Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I
>> was wrong, it appears the 8xxx numbers are for 4+ socket servers while
>> the 23xx numbers are for 2 or fewer sockets. So, there are several
>> quite affordable Shanghai CPUs out there, and many of the ones I
>> quoted as Barcelonas are in fact Shanghais with the larger 6MB L3
>> cache.
>
> At this point, I wouldn't go below the 5520 on the Nehalem side (turbo + HT
> is just too big a jump, as is the 1066MHz versus 800MHz memory jump). It's
> $100 extra per CPU on a $10K+ machine.
> The next 'step' is the 5550, since it can run 1333MHz memory and has 2x the
> turbo -- but you would have to be more CPU bound for that. I wouldn't worry
> about the 5530 or 5540, they will only scale a little up from the 5520.
>
> For Opterons, I wouldn't touch anything but a Shanghai these days since it's
> just not much more and we know the cache differences are very important for
> DB loads.
On 13-5-2009 20:39 Scott Carey wrote:
> Excellent! That is a pretty huge boost. I'm curious which aspects of this
> new architecture helped the most. For Postgres, the following would seem
> the most relevant:
> 1. Shared L3 cache per processor -- more efficient shared data structure
> access.
> 2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
> 3. Faster cache coherency.
> 4. Lower latency RAM with more overall bandwidth (Opteron style).

Apart from that, it has a newer Debian (and thus kernel/glibc) and slightly less constraining I/O, which may help as well.

> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
> On the older X5355 machine and the newer E5540, try:
> /sbin/hdparm -T /dev/sd<device>

It is in use, so the results may not be so good. This is the best I got on our dual X5355:
 Timing cached reads: 6314 MB in 2.00 seconds = 3159.08 MB/sec

But this is the best I got for an (also in use) dual E5450 we have:
 Timing cached reads: 13158 MB in 2.00 seconds = 6587.11 MB/sec

And here the best for the (idle) E5540:
 Timing cached reads: 16494 MB in 2.00 seconds = 8256.27 MB/sec

These numbers are with hdparm v8.9.

Best regards,

Arjen
On Wed, 13 May 2009, Scott Carey wrote:
> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>
> /sbin/hdparm -T /dev/sd<device>
>
> ...it's not a very accurate measurement, but it's quick and highlights
> relative hardware differences very easily.

I've found "hdparm -T" to be useful for comparing the relative memory bandwidth of a given system as I change its RAM configuration around, but that's about it. I've seen that result change by a factor of 2X just by changing kernel version on the same hardware. The data volume transferred doesn't seem to be nearly enough to extract the true RAM speed from (guessing at the cause here) things like whether the test/kernel code fits into the CPU cache.

I'm using this nowadays:

sysbench --test=memory --memory-oper=write --memory-block-size=1024MB --memory-total-size=1024MB run

The sysbench read test looks similarly borked by caching effects when I've tried it, but if you write that much it seems to give useful results.

P.S. Too many Scotts who write similarly on this thread. If either of you are at PGCon next week, please flag me down if you see me so I can finally sort you two out.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
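The idea behind the sysbench write test above -- overwrite a buffer larger than the CPU caches and time it, so cache effects can't dominate -- can be sketched in a few lines of Python. This is a hypothetical illustration, not a replacement for sysbench: interpreter overhead means the absolute numbers will be well below what sysbench reports, but the shape of the measurement is the same.

```python
import time

def write_bandwidth_mb_s(total_mb=256, block_mb=8):
    """Crude memory *write* bandwidth estimate: repeatedly overwrite a
    block intended to be larger than the CPU caches, 1MB at a time,
    and report MB written per second."""
    block = bytearray(block_mb * 1024 * 1024)
    chunk = b"\xff" * (1024 * 1024)  # 1MB pattern spliced into the block
    written = 0
    start = time.perf_counter()
    while written < total_mb:
        for off in range(0, len(block), len(chunk)):
            block[off:off + len(chunk)] = chunk
            written += 1
            if written >= total_mb:
                break
    elapsed = time.perf_counter() - start
    return written / elapsed  # MB/s

if __name__ == "__main__":
    print(f"{write_bandwidth_mb_s():.0f} MB/s")
```

As with hdparm, the absolute value is only useful for comparing runs of the same code on the same kernel; the point is that writing a large volume defeats the caching that makes small read tests misleading.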
On 5/13/09 11:52 PM, "Greg Smith" <gsmith@gregsmith.com> wrote:
> On Wed, 13 May 2009, Scott Carey wrote:
>
>> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>>
>> /sbin/hdparm -T /dev/sd<device>
>>
>> ...it's not a very accurate measurement, but it's quick and highlights
>> relative hardware differences very easily.
>
> I've found "hdparm -T" to be useful for comparing the relative memory
> bandwidth of a given system as I change its RAM configuration around, but
> that's about it. I've seen that result change by a factor of 2X just by
> changing kernel version on the same hardware. The data volume transferred
> doesn't seem to be nearly enough to extract the true RAM speed from
> (guessing at the cause here) things like whether the test/kernel code fits
> into the CPU cache.

That's too bad -- I have been using it to compare machines as well, but they are all on the same Linux version/distro. Regardless, the results indicate a 2x to 3x bandwidth improvement, which sounds about right if the older CPU isn't on the newer FBDIMM chipset. If both of those machines are on the same kernel, the relative values should be somewhat valid (though definitely not all that accurate).

> I'm using this nowadays:
>
> sysbench --test=memory --memory-oper=write --memory-block-size=1024MB
> --memory-total-size=1024MB run

Unfortunately, sysbench isn't installed by default on many (most?) distros, or even available as a package on some. So it's a bigger 'ask' to get results from it. Certainly a significantly better overall tool, though.

> The sysbench read test looks similarly borked by caching effects when I've
> tried it, but if you write that much it seems to give useful results.
On 5/13/09 11:21 PM, "Arjen van der Meijden" <acmmailing@tweakers.net> wrote:
> On 13-5-2009 20:39 Scott Carey wrote:
>> Excellent! That is a pretty huge boost. I'm curious which aspects of this
>> new architecture helped the most. For Postgres, the following would seem
>> the most relevant:
>> 1. Shared L3 cache per processor -- more efficient shared data structure access.
>> 2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
>> 3. Faster cache coherency.
>> 4. Lower latency RAM with more overall bandwidth (Opteron style).
>
> Apart from that, it has a newer Debian (and thus kernel/glibc) and
> slightly less constraining I/O, which may help as well.
>
>> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>> On the older X5355 machine and the newer E5540, try:
>> /sbin/hdparm -T /dev/sd<device>
>
> It is in use, so the results may not be so good. This is the best I got
> on our dual X5355:
> Timing cached reads: 6314 MB in 2.00 seconds = 3159.08 MB/sec
>
> But this is the best I got for an (also in use) dual E5450 we have:
> Timing cached reads: 13158 MB in 2.00 seconds = 6587.11 MB/sec
>
> And here the best for the (idle) E5540:
> Timing cached reads: 16494 MB in 2.00 seconds = 8256.27 MB/sec
>
> These numbers are with hdparm v8.9

Thanks! My numbers were with hdparm v6.6 (CentOS 5.3), so they aren't directly comparable. FYI, when my systems are in use, the results are typically 50% to 75% of the idle scores.

But yours are probably roughly comparable to each other -- you're getting more than 2x the memory bandwidth between those systems. Without knowing the exact chipset and RAM configurations, this is definitely a factor in the performance difference at higher concurrency.

> Best regards,
>
> Arjen