Thread: AMD Shanghai versus Intel Nehalem
Anyone on the list had a chance to benchmark the Nehalems yet? I'm primarily wondering if their promise of performance from 3 memory channels holds up under typical pgsql workloads. I've been really happy with the behavior of my AMD Shanghai based server under heavy loads, but if the Nehalem's much-touted performance increase translates to pgsql, I'd like to know.
Anand did SQL Server and Oracle test results, the Nehalem system looks like a substantial improvement over the Shanghai Opteron 2384:

http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Tue, May 12, 2009 at 8:05 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> Anand did SQL Server and Oracle test results, the Nehalem system looks like
> a substantial improvement over the Shanghai Opteron 2384:
>
> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7

That's an interesting article. Thanks for the link. A couple of points stick out to me:

1: 5520 to 5540 parts only have 1 133MHz step increase in performance
2: 550x parts have no hyperthreading.

Assuming that the part tested (the 5570) was using hyperthreading and two 133MHz steps, at the lower end of the range the 550x parts are likely not that much faster than the Opterons in the same clock speed range, but they are still quite a bit more expensive.

It'd be nice to see some benchmarks on the more reasonably priced CPUs in both ranges: the 2.2 to 2.4GHz Opterons and the 2.0GHz (5504) to 2.26GHz (5520) Nehalems. Since I have to buy > 1 server to handle the load and provide redundancy anyway, single-CPU performance isn't nearly as interesting as aggregate performance / $ spent.

While all the benchmarks on near-3GHz parts are fun to read and salivate over, they're not as relevant to my interests as the performance of the more reasonably priced parts.
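As a quick sanity check of the 133MHz clock steps discussed above, here is a small sketch. The "base clock plus N bus steps" model is my own inference from the listed frequencies (5504 at 2.0GHz, 5520 at 2.26GHz, 5540 at 2.53GHz), not anything from Intel's documentation:

```python
# Nehalem Xeon 55xx clocks modeled as 133MHz steps above the 2.0GHz 5504.
STEP_GHZ = 0.133

def clock(base_ghz, steps):
    """Clock after a given number of 133MHz steps above the base part."""
    return round(base_ghz + steps * STEP_GHZ, 2)

print(clock(2.0, 2))  # two steps above the 5504: ~2.27, close to the 5520's 2.26GHz
print(clock(2.0, 4))  # four steps: 2.53, the 5540's clock
```

The step sizes are small enough that, within the sub-$500 bracket, clock alone won't separate the parts much; features like hyperthreading matter more.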
The $ cost of more CPU power on larger machines ends up such a small % chunk, especially after I/O cost. Sure, the CPU with HyperThreading and the turbo might be 40% more expensive than the other CPU, but if the total system cost is 5% more for 15% more performance . . .

It depends on how CPU limited you are. If you aren't, there isn't much of a reason to look past the cheaper Opterons with a good I/O setup.

I've got a 2 x 5520 system with lots of RAM on the way. The problem with lots of RAM in the Nehalem systems is that the memory speed slows as more is added. I think mine slows from the 1066MHz the processor can handle to 800MHz. It still has way more bandwidth than the old Xeons though. Although my use case is about as far from pgbench as you can get, I might be able to get a run of it in during stress testing.

On 5/12/09 7:28 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> On Tue, May 12, 2009 at 8:05 PM, Greg Smith <gsmith@gregsmith.com> wrote:
>> Anand did SQL Server and Oracle test results, the Nehalem system looks like
>> a substantial improvement over the Shanghai Opteron 2384:
>>
>> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=6
>> http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7
>
> That's an interesting article. Thanks for the link. A couple points
> stick out to me.
>
> 1: 5520 to 5540 parts only have 1 133MHz step increase in performance
> 2: 550x parts have no hyperthreading.
>
> Assuming that the parts tested (5570) were using hyperthreading and
> two 133MHz steps, at the lower end of the range, the 550x parts are
> likely not that much faster than the opterons in their same clock
> speed range, but are still quite a bit more expensive.
>
> It'd be nice to see some benchmarks on the more reasonably priced CPUs
> in both ranges, the 2.2 to 2.4 GHz opterons and the 2.0 (5504) to
> 2.26GHz (5520) nehalems. Since I have to buy > 1 server to handle the
> load and provide redundancy anyway, single cpu performance isn't
> nearly as interesting as aggregate performance / $ spent.
>
> While all the benchmarks on near 3GHz parts are fun to read and
> salivate over, they're not as relevant to my interests as the performance
> of the more reasonably priced parts.
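The 1066MHz-to-800MHz memory downclock mentioned above has a straightforward theoretical cost. A back-of-the-envelope sketch, assuming triple-channel DDR3 moving 8 bytes per channel per transfer (the standard DDR3 bus width):

```python
# Theoretical peak bandwidth for triple-channel DDR3:
# channels * 8 bytes per transfer * effective transfer rate (MT/s).
def peak_bandwidth_gb_s(channels, mt_per_s):
    return channels * 8 * mt_per_s / 1000  # decimal GB/s

full = peak_bandwidth_gb_s(3, 1066)  # DDR3-1066: ~25.6 GB/s
down = peak_bandwidth_gb_s(3, 800)   # downclocked to DDR3-800: 19.2 GB/s
print(full, down, f"{1 - down / full:.0%} drop")
```

Even the downclocked figure is several times what the older FSB-based Xeons could deliver, which matches the observation that it "still has way more bandwidth than the old Xeons."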
On Tue, May 12, 2009 at 8:59 PM, Scott Carey <scott@richrelevance.com> wrote:
> The $ cost of more CPU power on larger machines ends up such a small %
> chunk, especially after I/O cost. Sure, the CPU with HyperThreading and the
> turbo might be 40% more expensive than the other CPU, but if the total
> system cost is 5% more for 15% more performance . . .

But every dollar I spend on CPUs is a dollar I can't spend on RAID controllers, more memory, or more drives. We're looking at machines with, say, 32 1TB SATA drives, which run in the $12k range. The Nehalem 5570s (2.8GHz) are going for something in the range of $1500 or more, the 5540 (2.53GHz) at $774.99, the 5520 (2.26GHz) at $384.99, and the 5506 (2.13GHz) at $274.99. The 5520 is the first one with hyperthreading, so it's a reasonable cost increase. Somewhere around the 5530, the cost for the increase in performance stops making a lot of sense.

The Opterons, like the 2378 Barcelona at 2.4GHz for $279.99, or the 2.5GHz 2380 at $400, are good values. And I know they mostly scale by clock speed, so I can decide which to buy based on that. The 83xx series CPUs are still far too expensive to be cost effective, with 2.2GHz parts running $600 and faster parts climbing VERY quickly after that.

So what I want to know is how the 2.5GHz Barcelonas would compare to the 5506 through 5530 Nehalems, as those parts are all in the same cost range (sub-$500 CPUs).

> It depends on how CPU limited you are. If you aren't, there isn't much of a
> reason to look past the cheaper Opterons with a good I/O setup.

Exactly, which is why I'm looking for the best bang for the buck on the CPU front. Also performance as a "data pump," so to speak, i.e. minimizing memory bandwidth limitations.

> I've got a 2 x 5520 system with lots of RAM on the way. The problem with
> lots of RAM in the Nehalem systems is that the memory speed slows as more
> is added.

I too wondered about that and its effect on performance. Another benchmark I'd like to see: how it runs with more and less memory.

> I think mine slows from the 1066MHz the processor can handle to
> 800MHz. It still has way more bandwidth than the old Xeons though.
> Although my use case is about as far from pgbench as you can get, I might
> be able to get a run of it in during stress testing.

I'd be very interested in hearing how it runs, and not just for pgbench.
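For what it's worth, the price-per-clock arithmetic behind the comparison above can be laid out directly. This is only a sketch using the street prices quoted in this thread; dollars per GHz is a crude proxy that ignores IPC, hyperthreading, cache, and memory bandwidth differences:

```python
# Street prices and clock speeds quoted in this thread (all quad-core parts).
parts = {
    "Xeon 5506":    (274.99, 2.13),  # no hyperthreading
    "Xeon 5520":    (384.99, 2.26),  # first 55xx model with hyperthreading
    "Xeon 5540":    (774.99, 2.53),
    "Opteron 2378": (279.99, 2.40),
    "Opteron 2380": (400.00, 2.50),
}

# Dollars per GHz of clock, cheapest first.
for name, (price, ghz) in sorted(parts.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:13s} ${price / ghz:7.2f} per GHz")
```

By this crude measure the 2378 and 5506 are the value picks and the 5540 costs roughly twice as much per GHz as the 5520, which matches the "somewhere around the 5530 it stops making sense" intuition.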
Just realized I made a mistake: I was under the impression that Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I was wrong; it appears the 8xxx numbers are for 4+ socket servers while the 23xx numbers are for 2 or fewer sockets. So there are several quite affordable Shanghai CPUs out there, and many of the ones I quoted as Barcelonas are in fact Shanghais with the larger 6MB L3 cache.
We have a dual E5540 with 16GB (I think 1066MHz) memory here, but no AMD Shanghai. We haven't done PostgreSQL benchmarks yet, but given our previous experience, PostgreSQL should show a similar speedup to MySQL.

Our database benchmark is actually mostly a CPU/memory benchmark. Comparing the results of the dual E5540 (2.53GHz with HT enabled) to a dual Intel X5355 (2.6GHz quad-core from 2007), the peak load has increased from somewhere between 7 and 10 concurrent clients to somewhere around 25, suggesting more scalable hardware. With 25 concurrent clients we handled 2.5 times the number of queries/second compared to the 7-concurrent-client score for the X5355, both in MySQL 5.0.41. At 7 concurrent clients we still had 1.7 times the previous result.

I'm not really sure how the Shanghai CPUs compare to those older X5355s; the AMDs should be faster, but by how much? I've no idea if we'll get a Shanghai to compare it with, but we will get a dual X5570 soon on which we'll repeat some of the tests, so that should at least help a bit with scaling down the X5570 results seen around the world.

Best regards,

Arjen

On 12-5-2009 20:47 Scott Marlowe wrote:
> Anyone on the list had a chance to benchmark the Nehalems yet? I'm
> primarily wondering if their promise of performance from 3 memory
> channels holds up under typical pgsql workloads. I've been really
> happy with the behavior of my AMD Shanghai based server under heavy
> loads, but if the Nehalem's much-touted performance increase translates
> to pgsql, I'd like to know.
On 5/12/09 10:06 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> Just realized I made a mistake, I was under the impression that
> Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I
> was wrong, it appears the 8xxx numbers are for 4+ socket servers while
> the 23xx numbers are for 2 or fewer sockets. So, there are several
> quite affordable Shanghai CPUs out there, and many of the ones I
> quoted as Barcelonas are in fact Shanghais with the larger 6MB L3
> cache.

At this point, I wouldn't go below the 5520 on the Nehalem side (turbo + HT is just too big a jump, as is the 1066MHz versus 800MHz memory jump). It's $100 extra per CPU on a $10K+ machine.

The next 'step' is the 5550, since it can run 1333MHz memory and has 2x the turbo -- but you would have to be more CPU bound for that. I wouldn't worry about the 5530 or 5540; they will only scale up a little from the 5520.

For Opterons, I wouldn't touch anything but a Shanghai these days, since it's not much more expensive and we know the cache differences are very important for DB loads.
On 5/12/09 11:08 PM, "Arjen van der Meijden" <acmmailing@tweakers.net> wrote:
> We have a dual E5540 with 16GB (I think 1066MHz) memory here, but no AMD
> Shanghai. We haven't done PostgreSQL benchmarks yet, but given our
> previous experience, PostgreSQL should show a similar speedup to MySQL.
>
> Our database benchmark is actually mostly a CPU/memory benchmark.
> Comparing the results of the dual E5540 (2.53GHz with HT enabled) to a
> dual Intel X5355 (2.6GHz quad-core from 2007), the peak load has
> increased from somewhere between 7 and 10 concurrent clients to
> somewhere around 25, suggesting more scalable hardware. With 25
> concurrent clients we handled 2.5 times the number of queries/second
> compared to the 7-concurrent-client score for the X5355, both in MySQL
> 5.0.41. At 7 concurrent clients we still had 1.7 times the previous result.

Excellent! That is a pretty huge boost. I'm curious which aspects of this new architecture helped the most. For Postgres, the following would seem the most relevant:

1. Shared L3 cache per processor -- more efficient shared data structure access.
2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
3. Faster cache coherency.
4. Lower latency RAM with more overall bandwidth (Opteron style).

Can you do a quick and dirty memory bandwidth test? (assuming Linux) On the older X5355 machine and the newer E5540, try:

/sbin/hdparm -T /dev/sd<device>

where <device> is a valid letter for a device on your system.

Here are the results for me on an older system with dual Intel E5335s (2GHz, 4MB cache, family 6 model 15). Best result out of 5 (it's not all that consistent, plus or minus 10%):

/dev/sda:
 Timing cached reads: 10816 MB in 2.00 seconds = 5416.89 MB/sec

And a newer system with dual Xeon X5460s (3.16GHz, 6MB cache, family 6 model 23). Best of 7 results:

/dev/sdb:
 Timing cached reads: 26252 MB in 1.99 seconds = 13174.42 MB/sec

It's not a very accurate measurement, but it's quick and highlights relative hardware differences very easily.

> I'm not really sure how the Shanghai CPUs compare to those older
> X5355s, the AMDs should be faster, but by how much?

I'm not sure either, and the Xeon platforms have evolved such that the chipsets and RAM configurations matter as much as the processor does.

> I've no idea if we'll get a Shanghai to compare it with, but we will get a
> dual X5570 soon on which we'll repeat some of the tests, so that should
> at least help a bit with scaling down the X5570 results seen around the world.
>
> Best regards,
>
> Arjen
FYI: this is an excellent article on the Nehalem CPUs and their memory performance as the CPU and RAM combinations change:

http://blogs.sun.com/jnerl/entry/configuring_and_optimizing_intel_xeon

It's fairly complicated (as it is for the Opteron too).

On 5/13/09 9:58 AM, "Scott Carey" <scott@richrelevance.com> wrote:
> On 5/12/09 10:06 PM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
>> Just realized I made a mistake, I was under the impression that
>> Shanghai CPUs had 8xxx numbers while Barcelona had 23xx numbers. I
>> was wrong, it appears the 8xxx numbers are for 4+ socket servers while
>> the 23xx numbers are for 2 or fewer sockets. So, there are several
>> quite affordable Shanghai CPUs out there, and many of the ones I
>> quoted as Barcelonas are in fact Shanghais with the larger 6MB L3
>> cache.
>
> At this point, I wouldn't go below the 5520 on the Nehalem side (turbo + HT
> is just too big a jump, as is the 1066MHz versus 800MHz memory jump). It's
> $100 extra per CPU on a $10K+ machine.
> The next 'step' is the 5550, since it can run 1333MHz memory and has 2x the
> turbo -- but you would have to be more CPU bound for that. I wouldn't worry
> about the 5530 or 5540, they will only scale a little up from the 5520.
>
> For Opterons, I wouldn't touch anything but a Shanghai these days since it's
> just not much more and we know the cache differences are very important for
> DB loads.
On 13-5-2009 20:39 Scott Carey wrote:
> Excellent! That is a pretty huge boost. I'm curious which aspects of this
> new architecture helped the most. For Postgres, the following would seem
> the most relevant:
> 1. Shared L3 cache per processor -- more efficient shared data structure
> access.
> 2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
> 3. Faster cache coherency.
> 4. Lower latency RAM with more overall bandwidth (Opteron style).

Apart from that, it has a newer Debian (and thus kernel/glibc) and slightly less constraining I/O, which may help as well.

> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
> On the older X5355 machine and the newer E5540, try:
> /sbin/hdparm -T /dev/sd<device>

It is in use, so the results may not be so good. This is the best I got on our dual X5355:
 Timing cached reads: 6314 MB in 2.00 seconds = 3159.08 MB/sec

But this is the best I got for an (also in use) dual E5450 we have:
 Timing cached reads: 13158 MB in 2.00 seconds = 6587.11 MB/sec

And here the best for the (idle) E5540:
 Timing cached reads: 16494 MB in 2.00 seconds = 8256.27 MB/sec

These numbers are with hdparm v8.9.

Best regards,

Arjen
On Wed, 13 May 2009, Scott Carey wrote:
> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>
> /sbin/hdparm -T /dev/sd<device>
>
> ...it's not a very accurate measurement, but it's quick and highlights
> relative hardware differences very easily.

I've found "hdparm -T" to be useful for comparing the relative memory bandwidth of a given system as I change its RAM configuration around, but that's about it. I've seen that result change by a factor of 2X just by changing kernel version on the same hardware. The data volume transferred doesn't seem to be nearly enough to extract the true RAM speed from (guessing at the cause here) things like whether the test/kernel code fits into the CPU cache.

I'm using this nowadays:

sysbench --test=memory --memory-oper=write --memory-block-size=1024MB --memory-total-size=1024MB run

The sysbench read test looks similarly borked by caching effects when I've tried it, but if you write that much it seems to give useful results.

P.S. Too many Scotts who write similarly on this thread. If either of you are at PGCon next week, please flag me down if you see me so I can finally sort you two out.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
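The idea behind the sysbench write test above -- overwrite a buffer larger than the CPU caches and time it, so cache effects can't dominate -- can be sketched in a few lines of Python. This is a hypothetical illustration, not a replacement for sysbench: interpreter overhead means the absolute numbers will be well below what sysbench reports, but the shape of the measurement is the same.

```python
import time

def write_bandwidth_mb_s(total_mb=256, block_mb=8):
    """Crude memory *write* bandwidth estimate: repeatedly overwrite a
    block intended to be larger than the CPU caches, 1MB at a time,
    and report MB written per second."""
    block = bytearray(block_mb * 1024 * 1024)
    chunk = b"\xff" * (1024 * 1024)  # 1MB pattern spliced into the block
    written = 0
    start = time.perf_counter()
    while written < total_mb:
        for off in range(0, len(block), len(chunk)):
            block[off:off + len(chunk)] = chunk
            written += 1
            if written >= total_mb:
                break
    elapsed = time.perf_counter() - start
    return written / elapsed  # MB/s

if __name__ == "__main__":
    print(f"{write_bandwidth_mb_s():.0f} MB/s")
```

As with hdparm, the absolute value is only useful for comparing runs of the same code on the same kernel; the point is that writing a large volume defeats the caching that makes small read tests misleading.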
On 5/13/09 11:52 PM, "Greg Smith" <gsmith@gregsmith.com> wrote:
> On Wed, 13 May 2009, Scott Carey wrote:
>
>> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>>
>> /sbin/hdparm -T /dev/sd<device>
>>
>> ...it's not a very accurate measurement, but it's quick and highlights
>> relative hardware differences very easily.
>
> I've found "hdparm -T" to be useful for comparing the relative memory
> bandwidth of a given system as I change its RAM configuration around, but
> that's about it. I've seen that result change by a factor of 2X just by
> changing kernel version on the same hardware. The data volume transferred
> doesn't seem to be nearly enough to extract the true RAM speed from
> (guessing at the cause here) things like whether the test/kernel code fits
> into the CPU cache.

That's too bad -- I have been using it to compare machines as well, but they are all on the same Linux version/distro. Regardless, the results indicate a 2x to 3x bandwidth improvement, which sounds about right if the older CPU isn't on the newer FBDIMM chipset. If both of those machines are on the same kernel, the relative values should be somewhat valid (though definitely not all that accurate).

> I'm using this nowadays:
>
> sysbench --test=memory --memory-oper=write --memory-block-size=1024MB
> --memory-total-size=1024MB run

Unfortunately, sysbench isn't installed by default on many (most?) distros, or even available as a package on some. So it's a bigger 'ask' to get results from it. Certainly a significantly better overall tool, though.

> The sysbench read test looks similarly borked by caching effects when I've
> tried it, but if you write that much it seems to give useful results.
On 5/13/09 11:21 PM, "Arjen van der Meijden" <acmmailing@tweakers.net> wrote:
> On 13-5-2009 20:39 Scott Carey wrote:
>> Excellent! That is a pretty huge boost. I'm curious which aspects of this
>> new architecture helped the most. For Postgres, the following would seem
>> the most relevant:
>> 1. Shared L3 cache per processor -- more efficient shared data structure access.
>> 2. Faster atomic operations -- CompareAndSwap, etc. are much faster.
>> 3. Faster cache coherency.
>> 4. Lower latency RAM with more overall bandwidth (Opteron style).
>
> Apart from that, it has a newer Debian (and thus kernel/glibc) and
> slightly less constraining I/O, which may help as well.
>
>> Can you do a quick and dirty memory bandwidth test? (assuming Linux)
>> On the older X5355 machine and the newer E5540, try:
>> /sbin/hdparm -T /dev/sd<device>
>
> It is in use, so the results may not be so good. This is the best I got
> on our dual X5355:
> Timing cached reads: 6314 MB in 2.00 seconds = 3159.08 MB/sec
>
> But this is the best I got for an (also in use) dual E5450 we have:
> Timing cached reads: 13158 MB in 2.00 seconds = 6587.11 MB/sec
>
> And here the best for the (idle) E5540:
> Timing cached reads: 16494 MB in 2.00 seconds = 8256.27 MB/sec
>
> These numbers are with hdparm v8.9

Thanks! My numbers were with hdparm v6.6 (CentOS 5.3), so they aren't directly comparable. FYI, when my systems are in use, the results are typically 50% to 75% of the idle scores.

But yours are probably roughly comparable to each other -- you're getting more than 2x the memory bandwidth between those systems. Without knowing the exact chipset and RAM configurations, this is definitely a factor in the performance difference at higher concurrency.

> Best regards,
>
> Arjen