Thread: hyperthreaded cpu still an issue in 8.4?
Just wondering if the issue referenced in http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php is still present in 8.4, or if some tunable (or other) made the use of hyperthreading a non-issue. We're looking to upgrade our servers soon for performance reasons, and I'm trying to determine if more CPUs (no HT) or fewer CPUs (with HT) are the way to go. Thx -- Douglas J Hunley http://douglasjhunley.com Twitter: @hunleyd
On Tue, Jul 21, 2009 at 1:42 PM, Doug Hunley<doug@hunley.homeip.net> wrote: > Just wondering is the issue referenced in > http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php > is still present in 8.4 or if some tunable (or other) made the use of > hyperthreading a non-issue. We're looking to upgrade our servers soon > for performance reasons and am trying to determine if more cpus (no > HT) or less cpus (with HT) are the way to go. Thx I wouldn't recommend HT CPUs at all. I think your assumption that HT == CPU is wrong in the first place. Please read more about HT on Intel's website. -- GJ
On Tue, Jul 21, 2009 at 6:42 AM, Doug Hunley<doug@hunley.homeip.net> wrote: > Just wondering is the issue referenced in > http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php > is still present in 8.4 or if some tunable (or other) made the use of > hyperthreading a non-issue. We're looking to upgrade our servers soon > for performance reasons and am trying to determine if more cpus (no > HT) or less cpus (with HT) are the way to go. Thx This isn't really an application tunable so much as a kernel level tunable. PostgreSQL seems to have scaled pretty well a couple years ago in the tweakers.net benchmark of the Sun T1 CPU with 4 threads per core. However, at the time 4 AMD cores were spanking 8 Sun T1 cores with 4 threads each. Now, whether or not their benchmark applies to your application only you can say. Can you get machines on a 30 day trial program to benchmark them and decide which to go with? I'm guessing that dual 6-core Opterons with lots of memory are the current king of the hill for reasonably priced pg servers that are running CPU-bound loads. If you're mostly IO bound then it really doesn't matter which CPU you choose.
2009/7/21 Grzegorz Jaśkiewicz <gryzman@gmail.com>: > On Tue, Jul 21, 2009 at 1:42 PM, Doug Hunley<doug@hunley.homeip.net> wrote: >> Just wondering is the issue referenced in >> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >> is still present in 8.4 or if some tunable (or other) made the use of >> hyperthreading a non-issue. We're looking to upgrade our servers soon >> for performance reasons and am trying to determine if more cpus (no >> HT) or less cpus (with HT) are the way to go. Thx > > I wouldn't recommend HT CPUs at all. I think your assumption, that HT > == CPU is wrong in first place. Not sure the OP said that...
On Tue, Jul 21, 2009 at 3:16 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote: > On Tue, Jul 21, 2009 at 6:42 AM, Doug Hunley<doug@hunley.homeip.net> wrote: >> Just wondering is the issue referenced in >> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >> is still present in 8.4 or if some tunable (or other) made the use of >> hyperthreading a non-issue. We're looking to upgrade our servers soon >> for performance reasons and am trying to determine if more cpus (no >> HT) or less cpus (with HT) are the way to go. Thx > > This isn't really an application tunable so much as a kernel level > tunable. PostgreSQL seems to have scaled pretty well a couple years > ago in the tweakers.net benchmark of the Sun T1 CPU with 4 threads per > core. However, at the time 4 AMD cores were spanking 8 Sun T1 cores > with 4 threads each. > > Now, whether or not their benchmark applies to your application only > you can say. Can you get machines on a 30 day trial program to > benchmark them and decide which to go with? I'm guessing that dual > 6core Opterons with lots of memory is the current king of the hill for > reasonably priced pg servers that are running CPU bound loads. > > If you're mostly IO bound then it really doesn't matter which CPU. Unless he is doing a lot of computations, on small sets of data. Now I am confused, HT is not anywhere near what 'threads' are on sparcs afaik. -- GJ
On 07/21/2009 10:36 AM, Grzegorz Jaśkiewicz wrote:
> On Tue, Jul 21, 2009 at 3:16 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
>> On Tue, Jul 21, 2009 at 6:42 AM, Doug Hunley<doug@hunley.homeip.net> wrote:
>>> Just wondering is the issue referenced in http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php is still present in 8.4 or if some tunable (or other) made the use of hyperthreading a non-issue. We're looking to upgrade our servers soon for performance reasons and am trying to determine if more cpus (no HT) or less cpus (with HT) are the way to go. Thx
>> This isn't really an application tunable so much as a kernel level tunable. PostgreSQL seems to have scaled pretty well a couple years ago in the tweakers.net benchmark of the Sun T1 CPU with 4 threads per core. However, at the time 4 AMD cores were spanking 8 Sun T1 cores with 4 threads each.
> Unless he is doing a lot of computations, on small sets of data. Now I am confused, HT is not anywhere near what 'threads' are on sparcs afaik.
Fun relatively off-topic chat... :-)
Intel "HT" provides the ability to execute two threads per CPU core at the same time.
Sun "CoolThreads" provide the same capability. They have just scaled it further. Instead of Intel's Xeon Series 5500 with dual-processor, quad-core, dual-thread configuration (= 16 active threads at a time), Sun T2+ has dual-processor, eight-core, eight-thread configuration (= 128 active threads at a time).
Just, each Sun "CoolThread" thread is far less capable than an Intel "HT" thread, so the comparison is really about the type of load.
But, the real point is that "thread" (whether "CoolThread" or "HT" thread), is not the same as core, which is not the same as processor. X 2 threads is usually significantly less benefit than X 2 cores. X 2 cores is probably less benefit than X 2 processors.
I think the Intel numbers say that Intel HT provides +15% performance on average.
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
2009/7/21 Mark Mielke <mark@mark.mielke.cc>: > On 07/21/2009 10:36 AM, Grzegorz Jaśkiewicz wrote: > > On Tue, Jul 21, 2009 at 3:16 PM, Scott Marlowe<scott.marlowe@gmail.com> > wrote: > > > On Tue, Jul 21, 2009 at 6:42 AM, Doug Hunley<doug@hunley.homeip.net> wrote: > > > Just wondering is the issue referenced in > http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php > is still present in 8.4 or if some tunable (or other) made the use of > hyperthreading a non-issue. We're looking to upgrade our servers soon > for performance reasons and am trying to determine if more cpus (no > HT) or less cpus (with HT) are the way to go. Thx > > > This isn't really an application tunable so much as a kernel level > tunable. PostgreSQL seems to have scaled pretty well a couple years > ago in the tweakers.net benchmark of the Sun T1 CPU with 4 threads per > core. However, at the time 4 AMD cores were spanking 8 Sun T1 cores > with 4 threads each. > > > Unless he is doing a lot of computations, on small sets of data. > > > Now I am confused, HT is not anywhere near what 'threads' are on sparcs > afaik. > > Fun relatively off-topic chat... :-) > > Intel "HT" provides the ability to execute two threads per CPU core at the > same time. > > Sun "CoolThreads" provide the same capability. They have just scaled it > further. Instead of Intel's Xeon Series 5500 with dual-processor, quad-core, > dual-thread configuration (= 16 active threads at a time), Sun T2+ has > dual-processor, eight-core, eight-thread configuration (= 128 active threads > at a time). > > Just, each Sun "CoolThread" thread is far less capable than an Intel "HT" > thread, so the comparison is really about the type of load. > > But, the real point is that "thread" (whether "CoolThread" or "HT" thread), > is not the same as core, which is not the same as processor. X 2 threads is > usually significantly less benefit than X 2 cores. X 2 cores is probably > less benefit than X 2 processors. Actually, given the faster inter-connect speed and communication, I'd think a single quad core CPU would be faster than the equivalent dual dual core cpu. > I think the Intel numbers says that Intel HT provides +15% performance on > average. It's very dependent on work load, that's for sure. I've some things that are 60 to 80% improved, others that go negative. But 15 to 40% is more typical.
On 7/21/09 9:22 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote: >> But, the real point is that "thread" (whether "CoolThread" or "HT" thread), >> is not the same as core, which is not the same as processor. X 2 threads is >> usually significantly less benefit than X 2 cores. X 2 cores is probably >> less benefit than X 2 processors. > > Actually, given the faster inter-connect speed and communication, I'd > think a single quad core CPU would be faster than the equivalent dual > dual core cpu. It's very workload dependent and system dependent. If the dual core dual cpu setup has 2x the memory bandwidth of the single quad core (Nehalem, Opteron), it also likely has higher memory latency and a dedicated interconnect for memory and cache coherency. And so some workloads will favor the low latency and others will favor more bandwidth. If it's like the older Xeons, where an extra CPU doesn't buy you more memory bandwidth alone (but better chipsets do), then a single quad core is usually faster than dual core dual cpu (if the same chipset). Even more so if there is a lot of lock contention, since that can all be handled on the same CPU rather than communicating across the bus. But back on topic for HT -- HT doesn't like spin-locks much unless they use the right low level instruction sequence rather than actually spinning. With the right instruction, the spin will allow the other thread to do work... With the wrong one, it will tie up the pipeline. I have no idea what Postgres' spin-locks and tool chain compile down to.
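The distinction Scott is drawing is roughly the one below: a minimal C sketch of a spin-lock whose wait loop uses the x86 PAUSE instruction ("rep; nop"). This is not PostgreSQL's actual s_lock code; the type, function names and the GCC __sync builtins are illustrative assumptions. PAUSE tells the core that the thread is only busy-waiting, so an SMT sibling can use the shared execution resources; a plain empty loop in the same place keeps the pipeline occupied and starves the sibling.

#include <sched.h>

typedef volatile int slock_t;

/*
 * PAUSE, encoded as "rep; nop" so it assembles on old toolchains.  On CPUs
 * without SMT it is effectively a no-op, so it is safe to use it
 * unconditionally on x86.
 */
static inline void cpu_relax(void)
{
    __asm__ __volatile__("rep; nop" ::: "memory");
}

static void spin_lock(slock_t *lock)
{
    /* __sync_lock_test_and_set is GCC's atomic-exchange builtin */
    while (__sync_lock_test_and_set(lock, 1))
    {
        int spins = 0;

        /* Wait until the lock looks free before retrying the atomic op. */
        while (*lock)
        {
            cpu_relax();            /* the HT-friendly wait; a plain empty
                                     * loop here is what ties up the shared
                                     * pipeline and starves the sibling */
            if (++spins > 1000)
            {
                sched_yield();      /* stop burning the CPU entirely if the
                                     * lock holder is taking a long time */
                spins = 0;
            }
        }
    }
}

static void spin_unlock(slock_t *lock)
{
    __sync_lock_release(lock);      /* store 0 with release semantics */
}

int main(void)
{
    slock_t lock = 0;

    spin_lock(&lock);
    /* ... critical section ... */
    spin_unlock(&lock);
    return 0;
}

Whether a particular PostgreSQL build ends up with the friendly or the unfriendly variant depends on the platform-specific spinlock assembly and the compiler, which is exactly the uncertainty Scott expresses.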
Scott Carey wrote: > > But back on topic for HT -- HT doesn't like spin-locks much unless they > use the right low level instruction sequence rather than actually > spinning. With the right instruction, the spin will allow the other > thread to do work... With the wrong one, it will tie up the pipeline. I > have no idea what Postgres' spin-locks and tool chain compile down to. > I have two hyperthreaded Xeon processors, so this machine thinks it has four processors. I have not seen the effect of spin locks with postgres. But I can tell that Firefox and Thunderbird use the wrong ones. When one of these is having trouble accessing a site, the processor in question goes up to 100% and the other part of the hyperthreaded processor does nothing even though I run four BOINC processes that would be glad to gobble up the cycles. Of course, since it is common to both Firefox and Thunderbird, perhaps it is a problem in the name server, bind. But wherever it is, it bugs me. -- .~. Jean-David Beyer Registered Linux User 85642. /V\ PGP-Key: 9A2FC99A Registered Machine 241939. /( )\ Shrewsbury, New Jersey http://counter.li.org ^^-^^ 13:55:01 up 6 days, 3:52, 3 users, load average: 4.03, 4.25, 4.45
On Tue, 21 Jul 2009, Doug Hunley wrote: > Just wondering is the issue referenced in > http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php > is still present in 8.4 or if some tunable (or other) made the use of > hyperthreading a non-issue. We're looking to upgrade our servers soon > for performance reasons and am trying to determine if more cpus (no > HT) or less cpus (with HT) are the way to go. If you're talking about the hyperthreading in the latest Intel Nehalem processors, I've been seeing great PostgreSQL performance from those. The kind of weird behavior the old generation hyperthreading designs had seems gone. You can see at http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.16713@westnet.com that I've cleared 90K TPS on a 16 core system (2 quad-core hyperthreaded processors) running a small test using lots of parallel SELECTs. That would not be possible if there were HT spinlock problems still around. There have been both PostgreSQL scaling improvements and hardware improvements since the 2005 messages you saw there that have combined to clear up the issues. While true cores would still be better if everything else were equal, it rarely is, and I wouldn't hesitate to jump on Intel's bandwagon right now. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, 27 Jul 2009, Dave Youatt wrote: > Do you think it's due to the new internal interconnect, that bears a > strong resemblance to AMD's hypertransport (AMD's buzzword for borrowing > lots of interconnect technology from the DEC alpha (EV7?)), or Intel > fixing a not-so-good initial implementation of "hyperthreading" (Intel's > marketing buzzword) from a few years ago. It certainly looks like it's Intel finally getting the interconnect right, because I'm seeing huge improvements in raw memory speeds too. That's the one area I used to see better results from Opterons on sometimes, but Intel pulled way ahead on this last upgrade. The experiment I haven't done yet is to turn off hyperthreading and see how much the performance degrades. This is hard because I'm several thousand miles from the servers I'm running the tests on, which makes low level config changes somewhat hairy. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
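A rough sketch of how Greg's missing experiment could be approximated remotely, assuming a Linux kernel that exposes CPU topology and hotplug through sysfs (/sys/devices/system/cpu/cpuN/topology/thread_siblings_list and /sys/devices/system/cpu/cpuN/online); those paths should be verified on the target machine before relying on this. The program below only reports which logical CPUs are hyperthread siblings; taking one offline is then a matter of writing 0 to its online file as root, with no BIOS visit needed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char path[128];
    char buf[64];
    int cpu;

    for (cpu = 0; ; cpu++)
    {
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (f == NULL)
            break;                              /* ran out of logical CPUs */
        if (fgets(buf, sizeof(buf), f) == NULL)
            buf[0] = '\0';
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';

        /*
         * The sibling list reads like "0,8" or "2-3"; the lowest-numbered
         * CPU in it is the one to keep, anything else shares that core.
         */
        if (atoi(buf) == cpu)
            printf("cpu%d: keep (shares a core with %s)\n", cpu, buf);
        else
            printf("cpu%d: HT sibling, can be offlined via "
                   "/sys/devices/system/cpu/cpu%d/online\n", cpu, cpu);
    }
    return 0;
}

Re-running the same pgbench test with the sibling CPUs offlined would separate the SMT contribution from the rest of the platform.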
On 01/-10/-28163 11:59 AM, Greg Smith wrote: > On Tue, 21 Jul 2009, Doug Hunley wrote: > >> Just wondering is the issue referenced in >> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >> is still present in 8.4 or if some tunable (or other) made the use of >> hyperthreading a non-issue. We're looking to upgrade our servers soon >> for performance reasons and am trying to determine if more cpus (no >> HT) or less cpus (with HT) are the way to go. > > If you're talking about the hyperthreading in the latest Intel Nehalem > processors, I've been seeing great PostgreSQL performance from those. > The kind of weird behavior the old generation hyperthreading designs > had seems gone. You can see at > http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.16713@westnet.com > that I've cleared 90K TPS on a 16 core system (2 quad-core > hyperthreaded processors) running a small test using lots of parallel > SELECTs. That would not be possible if there were HT spinlock > problems still around. There have been both PostgreSQL scaling > improvments and hardware improvements since the 2005 messages you saw > there that have combined to clear up the issues there. While true > cores would still be better if everything else were equal, it rarely > is, and I wouldn't hestitate to jump on Intel's bandwagon right now. Greg, those are compelling numbers for the new Nehalem processors. Great news for postgresql. Do you think it's due to the new internal interconnect, which bears a strong resemblance to AMD's HyperTransport (AMD's buzzword for borrowing lots of interconnect technology from the DEC Alpha (EV7?)), or Intel fixing a not-so-good initial implementation of "hyperthreading" (Intel's marketing buzzword) from a few years ago? Also, and this is getting maybe too far off topic, beyond the buzzwords, what IS the new "hyperthreading" in Nehalems? -- opportunistic superpipelined CPUs? superscalar? What's shared by the cores (bandwidth, cache(s))? What's changed about the new hyperthreading that makes it actually seem to work (or at least not cause other problems)? smarter scheduling of instructions to take advantage of stalls and hazards in another thread's instruction stream? Fixed instruction-level locking/interlocks, or avoiding locking whenever possible? better cache coherency mechanisms (related to the interconnects)? Jedi mind tricks??? I'm guessing it's the better interconnect, but work interferes with finding the time to research and benchmark.
On 7/27/09 11:05 AM, "Dave Youatt" <dave@meteorsolutions.com> wrote: > On 01/-10/-28163 11:59 AM, Greg Smith wrote: >> On Tue, 21 Jul 2009, Doug Hunley wrote: >> > Also, and this is getting maybe too far off topic, beyond the buzzwords, > what IS the new "hyperthreading" in Nehalems? -- opportunistic > superpipelined cpus?, superscalar? What's shared by the cores > (bandwidth, cache(s))? What's changed about the new hyperthreading > that makes it actually seem to work (or at least not causes other > problems)? smarter scheduling of instructions to take advantage of > stalls, hazards another thread's instruction stream? Fixed > instruction-level locking/interlocks, or avoiding locking whenever > possible? better cache coherency mechanicms (related to the > interconnects)? Jedi mind tricks??? > The Nehalems are an iteration off the "Core" processor line, which is a 4-way superscalar, out of order CPU. Also, it has some very sophisticated memory access reordering capability. So, the HyperThreading here (Symmetric Multi-Threading, SMT, is the academic name) will take advantage of that processor's inefficiencies -- a mix of stalls due to waiting for memory, and unused execution 'width' resources. So, if both threads are active and not stalled on memory access or other execution bubbles, there are a lot of internal processor resources to share. And if one of them is misbehaving and spinning, it won't dominate those resources. The old Pentium-4 based HyperThreading was also SMT, but those processors were built to be high frequency and 'narrow' in terms of superscalar execution (2-way superscalar, I believe). So the main advantage of HT there was that one thread could schedule work while another was waiting on memory access. If both were putting demands on the core execution resources there was not much to gain unless one thread stalled on memory access a lot, and if one of them was spinning it would eat up most of the shared resources. In both cases, the main execution resources get split up: L1 cache, instruction buffers and decoders, instruction reorder buffers, etc. But in this release, Intel increased several of these to beyond what is optimal for one thread, to make the HT more efficient. But the type of applications that will benefit the most from this HT is not always the same as for the older one, since the two CPU lines have different weaknesses for SMT to mask or strengths to enhance. > I'm guessing it's the better interconnect, but work interferes with > finding the time to research and benchmark. The new memory and interconnect architecture has a huge impact on performance, but it is separate from the other big features (Turbo being the other one not discussed here). For scalability to many CPUs it is probably the most significant, however. Note that these CPUs have some good power saving technology that helps quite a bit when idle or using just one core or thread, but when all threads are ramped up and all the memory banks are filled the systems draw a LOT of power. AMD still does quite well if you're on a power budget with their latest CPUs.
On Mon, 27 Jul 2009, Dave Youatt wrote: > Greg, those are compelling numbers for the new Nehalem processors. > Great news for postgresql. Do you think it's due to the new internal > interconnect... Unlikely. Different threads on the same CPU core share their resources, so they don't need an explicit communication channel at all (I'm simplifying massively here). A real interconnect is only needed between CPUs and between different cores on a CPU, and of course to the outside world. Scott's explanation of why SMT works better now is much more likely to be the real reason. Matthew -- Ozzy: Life is full of disappointments. Millie: No it isn't - I can always fit more in.
On Mon, 27 Jul 2009, Dave Youatt wrote:
Greg, those are compelling numbers for the new Nehalem processors.
Great news for postgresql. Do you think it's due to the new internal
interconnect...
Unlikely. Different threads on the same CPU core share their resources, so they don't need an explicit communication channel at all (I'm simplifying massively here). A real interconnect is only needed between CPUs and between different cores on a CPU, and of course to the outside world. Scott's explanation of why SMT works better now is much more likely to be the real reason.
:-) there's also this interconnect thingie between sockets, cores and memory. Nehalem has a new one (for Intel), integrated memory controller, that is. And a new on-chip cache organization.
I'm still betting on the interconnect(s), particularly for bandwidth-intensive, data pumping server apps. And it looks like the other new interconnect ("QuickPath") plays well w/the integrated memory controller for multi-socket systems.
Greg, in your spare time... Also, curious how Nehalem compares w/AMD Phenom II, esp the newer ones w/multi-lane(?) HT
And apologies to the list for straying off topic a bit.
On Tue, 28 Jul 2009, Matthew Wakeling wrote: > Unlikely. Different threads on the same CPU core share their resources, so > they don't need an explicit communication channel at all (I'm simplifying > massively here). A real interconnect is only needed between CPUs and between > different cores on a CPU, and of course to the outside world. The question was "why are the new CPUs benchmarking so much faster than the old ones", and I believe that's mainly because the interconnection both between CPUs and between CPUs and memory are dramatically faster. The SMT improvements stack on top of that, but are in my opinion secondary. I base that on also seeing a dramatic improvement in memory transfer speeds on the new platform, which alone might even be sufficient to explain the performance boost. I'll break the two factors apart later to be sure though--all the regulars on this list know where I stand on measuring performance compared with theorizing about it. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
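For anyone wanting to reproduce the raw memory-speed half of that comparison, a minimal STREAM-style C sketch follows: copy a buffer several times larger than the caches and report MB/s. The buffer size is an arbitrary assumption, it runs a single thread so it only measures one core's path to memory, and it is no substitute for a real tool like STREAM or Greg's own test setup, but run on both generations of hardware it gives a first-order view of how much of the gain is plain memory bandwidth rather than SMT.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define N (16 * 1024 * 1024)            /* 16M doubles = 128 MB per array */

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double t0, t1;
    size_t i;

    if (a == NULL || b == NULL)
        return 1;
    for (i = 0; i < N; i++)             /* touch every page so it is really mapped */
        a[i] = 1.0;

    t0 = seconds();
    memcpy(b, a, N * sizeof(double));   /* one read stream plus one write stream */
    t1 = seconds();

    printf("copy: %.0f MB/s (check %.0f)\n",
           2.0 * N * sizeof(double) / (t1 - t0) / (1024.0 * 1024.0),
           b[0] + b[N - 1]);            /* reading b keeps the copy from being optimized away */
    free(a);
    free(b);
    return 0;
}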
On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt<dave@meteorsolutions.com> wrote: > On 01/-10/-28163 11:59 AM, Greg Smith wrote: >> On Tue, 21 Jul 2009, Doug Hunley wrote: >> >>> Just wondering is the issue referenced in >>> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >>> is still present in 8.4 or if some tunable (or other) made the use of >>> hyperthreading a non-issue. We're looking to upgrade our servers soon >>> for performance reasons and am trying to determine if more cpus (no >>> HT) or less cpus (with HT) are the way to go. >> >> If you're talking about the hyperthreading in the latest Intel Nehalem >> processors, I've been seeing great PostgreSQL performance from those. >> The kind of weird behavior the old generation hyperthreading designs >> had seems gone. You can see at >> http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.16713@westnet.com >> that I've cleared 90K TPS on a 16 core system (2 quad-core >> hyperthreaded processors) running a small test using lots of parallel >> SELECTs. That would not be possible if there were HT spinlock >> problems still around. There have been both PostgreSQL scaling >> improvments and hardware improvements since the 2005 messages you saw >> there that have combined to clear up the issues there. While true >> cores would still be better if everything else were equal, it rarely >> is, and I wouldn't hestitate to jump on Intel's bandwagon right now. > > Greg, those are compelling numbers for the new Nehalem processors. > Great news for postgresql. Do you think it's due to the new internal > interconnect, that bears a strong resemblance to AMD's hypertransport [snip] as a point of reference, here are some numbers on a quad core system (2xintel 5160) using the old pgbench, scaling factor 10: pgbench -S -c 16 -t 10000 starting vacuum...end. transaction type: SELECT only scaling factor: 10 query mode: simple number of clients: 16 number of transactions per client: 10000 number of transactions actually processed: 160000/160000 tps = 24088.807000 (including connections establishing) tps = 24201.820189 (excluding connections establishing) This shows actually my system (pre-Nehalem) is pretty close clock for clock, albeit thats not completely fair..I'm using only 4 cores on dual core procs. Still, these are almost two years old now. EDIT: I see now that Greg has only 8 core system not counting hyperthreading...so I'm getting absolutely spanked! Go Intel! Also, I'm absolutely dying to see some numbers on the high end W5580...if anybody has one, please post! merlin
On Tue, Jul 28, 2009 at 2:58 PM, Merlin Moncure<mmoncure@gmail.com> wrote: > On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt<dave@meteorsolutions.com> wrote: >> On 01/-10/-28163 11:59 AM, Greg Smith wrote: >>> On Tue, 21 Jul 2009, Doug Hunley wrote: >>> >>>> Just wondering is the issue referenced in >>>> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >>>> is still present in 8.4 or if some tunable (or other) made the use of >>>> hyperthreading a non-issue. We're looking to upgrade our servers soon >>>> for performance reasons and am trying to determine if more cpus (no >>>> HT) or less cpus (with HT) are the way to go. >>> >>> If you're talking about the hyperthreading in the latest Intel Nehalem >>> processors, I've been seeing great PostgreSQL performance from those. >>> The kind of weird behavior the old generation hyperthreading designs >>> had seems gone. You can see at >>> http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.16713@westnet.com >>> that I've cleared 90K TPS on a 16 core system (2 quad-core >>> hyperthreaded processors) running a small test using lots of parallel >>> SELECTs. That would not be possible if there were HT spinlock >>> problems still around. There have been both PostgreSQL scaling >>> improvments and hardware improvements since the 2005 messages you saw >>> there that have combined to clear up the issues there. While true >>> cores would still be better if everything else were equal, it rarely >>> is, and I wouldn't hestitate to jump on Intel's bandwagon right now. >> >> Greg, those are compelling numbers for the new Nehalem processors. >> Great news for postgresql. Do you think it's due to the new internal >> interconnect, that bears a strong resemblance to AMD's hypertransport > [snip] > > as a point of reference, here are some numbers on a quad core system > (2xintel 5160) using the old pgbench, scaling factor 10: > > pgbench -S -c 16 -t 10000 > starting vacuum...end. > transaction type: SELECT only > scaling factor: 10 > query mode: simple > number of clients: 16 > number of transactions per client: 10000 > number of transactions actually processed: 160000/160000 > tps = 24088.807000 (including connections establishing) > tps = 24201.820189 (excluding connections establishing) > > This shows actually my system (pre-Nehalem) is pretty close clock for > clock, albeit thats not completely fair..I'm using only 4 cores on > dual core procs. Still, these are almost two years old now. > > EDIT: I see now that Greg has only 8 core system not counting > hyperthreading...so I'm getting absolutely spanked! Go Intel! > > Also, I'm absolutely dying to see some numbers on the high end > W5580...if anybody has one, please post! Just FYI, I ran the same basic test but with -c 10 since -c shouldn't really be greater than -s, and got this: pgbench -S -c 10 -t 10000 starting vacuum...end. transaction type: SELECT only scaling factor: 10 number of clients: 10 number of transactions per client: 10000 number of transactions actually processed: 100000/100000 tps = 32855.677494 (including connections establishing) tps = 33344.826183 (excluding connections establishing) With -s at 16 and -c at 16 I got this: pgbench -S -c 16 -t 10000 starting vacuum...end. 
transaction type: SELECT only scaling factor: 16 number of clients: 16 number of transactions per client: 10000 number of transactions actually processed: 160000/160000 tps = 32822.559602 (including connections establishing) tps = 33266.308652 (excluding connections establishing) That's on dual Quad-Core AMD Opteron(tm) Processor 2352 CPUs (2.2GHz) and 16 G ram.
On 7/28/09 1:28 PM, "Greg Smith" <gsmith@gregsmith.com> wrote: > On Tue, 28 Jul 2009, Matthew Wakeling wrote: > >> Unlikely. Different threads on the same CPU core share their resources, so >> they don't need an explicit communication channel at all (I'm simplifying >> massively here). A real interconnect is only needed between CPUs and between >> different cores on a CPU, and of course to the outside world. > > The question was "why are the new CPUs benchmarking so much faster than > the old ones", and I believe that's mainly because the interconnection > both between CPUs and between CPUs and memory are dramatically faster. I believe he was answering the question "What makes SMT work well with Postgres for these CPUs when it had problems on old Xeons?" -- and that doesn't have a lot to do with the interconnect or bandwidth. It may also be a more advanced compiler / OS toolchain. Postgres 8.0 compiled on an older system and OS might not work so well with the new HT. As for the question as to what is so good about the Nehalems -- the on-die memory controller and point-to-point interprocessor interconnect is the biggest performance change. Turbo and SMT are pretty good icing on the cake though.
On 7/28/09 1:58 PM, "Merlin Moncure" <mmoncure@gmail.com> wrote: > On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt<dave@meteorsolutions.com> wrote: >> On 01/-10/-28163 11:59 AM, Greg Smith wrote: >>> On Tue, 21 Jul 2009, Doug Hunley wrote: >>> >>>> Just wondering is the issue referenced in >>>> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >>>> is still present in 8.4 or if some tunable (or other) made the use of >>>> hyperthreading a non-issue. We're looking to upgrade our servers soon >>>> for performance reasons and am trying to determine if more cpus (no >>>> HT) or less cpus (with HT) are the way to go. >>> >>> If you're talking about the hyperthreading in the latest Intel Nehalem >>> processors, I've been seeing great PostgreSQL performance from those. >>> The kind of weird behavior the old generation hyperthreading designs >>> had seems gone. You can see at >>> http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.1671 >>> 3@westnet.com >>> that I've cleared 90K TPS on a 16 core system (2 quad-core >>> hyperthreaded processors) running a small test using lots of parallel >>> SELECTs. That would not be possible if there were HT spinlock >>> problems still around. There have been both PostgreSQL scaling >>> improvments and hardware improvements since the 2005 messages you saw >>> there that have combined to clear up the issues there. While true >>> cores would still be better if everything else were equal, it rarely >>> is, and I wouldn't hestitate to jump on Intel's bandwagon right now. >> >> Greg, those are compelling numbers for the new Nehalem processors. >> Great news for postgresql. Do you think it's due to the new internal >> interconnect, that bears a strong resemblance to AMD's hypertransport > [snip] > > as a point of reference, here are some numbers on a quad core system > (2xintel 5160) using the old pgbench, scaling factor 10: > > pgbench -S -c 16 -t 10000 > starting vacuum...end. > transaction type: SELECT only > scaling factor: 10 > query mode: simple > number of clients: 16 > number of transactions per client: 10000 > number of transactions actually processed: 160000/160000 > tps = 24088.807000 (including connections establishing) > tps = 24201.820189 (excluding connections establishing) > > This shows actually my system (pre-Nehalem) is pretty close clock for > clock, albeit thats not completely fair..I'm using only 4 cores on > dual core procs. Still, these are almost two years old now. > > EDIT: I see now that Greg has only 8 core system not counting > hyperthreading...so I'm getting absolutely spanked! Go Intel! > > Also, I'm absolutely dying to see some numbers on the high end > W5580...if anybody has one, please post! > > merlin Note, that a 5160 is a bit behind. The 52xx and 54xx series were a decent perf boost on their own, with more cache, and usually more total system bandwidth too (50% more than 51xx and 53xx is typical). But the leap to 55xx is far bigger! > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >
On Tue, 28 Jul 2009, Scott Marlowe wrote: > Just FYI, I ran the same basic test but with -c 10 since -c shouldn't > really be greater than -s That's only true if you're running the TPC-B-like or other write tests, where access to the small branches table becomes a serious hotspot for contention. The select-only test has no such specific restriction as it only operates on the big accounts table. Often peak throughput is closer to a very small multiple on the number of cores though, and possibly even clients=cores, presumably because it's more efficient to approximately peg one backend per core rather than switch among more than one on each--reduced L1 cache contention etc. That's the behavior you measured when your test showed better results with c=10 than c=16 on an 8 core system, rather than suffering less from the "c must be < s" contention limitation. Sadly I don't have or expect to have a W5580 in the near future though; the X5550 @ 2.67GHz is the bang for the buck sweet spot right now and accordingly that's what I have in the lab at Truviso. As Merlin points out, that's still plenty to spank any select-only pgbench results I've ever seen. The multi-threaded pgbench patch submitted by Itagaki Takahiro recently is here just in time to really exercise these new processors properly. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, Jul 28, 2009 at 5:21 PM, Greg Smith<gsmith@gregsmith.com> wrote: > On Tue, 28 Jul 2009, Scott Marlowe wrote: > >> Just FYI, I ran the same basic test but with -c 10 since -c shouldn't >> really be greater than -s > > That's only true if you're running the TPC-B-like or other write tests, > where access to the small branches table becomes a serious hotspot for > contention. The select-only test has no such specific restriction as it I thought so too, but my pgbench -S -c 16 was WAY faster on a -s 16 db than on a -s10...
On Tue, Jul 28, 2009 at 4:11 PM, Scott Marlowe<scott.marlowe@gmail.com> wrote: > On Tue, Jul 28, 2009 at 2:58 PM, Merlin Moncure<mmoncure@gmail.com> wrote: >> On Mon, Jul 27, 2009 at 2:05 PM, Dave Youatt<dave@meteorsolutions.com> wrote: >>> On 01/-10/-28163 11:59 AM, Greg Smith wrote: >>>> On Tue, 21 Jul 2009, Doug Hunley wrote: >>>> >>>>> Just wondering is the issue referenced in >>>>> http://archives.postgresql.org/pgsql-performance/2005-11/msg00415.php >>>>> is still present in 8.4 or if some tunable (or other) made the use of >>>>> hyperthreading a non-issue. We're looking to upgrade our servers soon >>>>> for performance reasons and am trying to determine if more cpus (no >>>>> HT) or less cpus (with HT) are the way to go. >>>> >>>> If you're talking about the hyperthreading in the latest Intel Nehalem >>>> processors, I've been seeing great PostgreSQL performance from those. >>>> The kind of weird behavior the old generation hyperthreading designs >>>> had seems gone. You can see at >>>> http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907222158050.16713@westnet.com >>>> that I've cleared 90K TPS on a 16 core system (2 quad-core >>>> hyperthreaded processors) running a small test using lots of parallel >>>> SELECTs. That would not be possible if there were HT spinlock >>>> problems still around. There have been both PostgreSQL scaling >>>> improvments and hardware improvements since the 2005 messages you saw >>>> there that have combined to clear up the issues there. While true >>>> cores would still be better if everything else were equal, it rarely >>>> is, and I wouldn't hestitate to jump on Intel's bandwagon right now. >>> >>> Greg, those are compelling numbers for the new Nehalem processors. >>> Great news for postgresql. Do you think it's due to the new internal >>> interconnect, that bears a strong resemblance to AMD's hypertransport I'd love to see some comparisons on the exact same hardware, same kernel and everything but with HT enabled and disabled. Don't forget that newer (Linux) kernels have vastly improved SMP performance. -- Jon
Greg Smith wrote: > On Tue, 28 Jul 2009, Scott Marlowe wrote: > >> Just FYI, I ran the same basic test but with -c 10 since -c shouldn't >> really be greater than -s > > That's only true if you're running the TPC-B-like or other write tests, > where access to the small branches table becomes a serious hotspot for > contention. The select-only test has no such specific restriction as it > only operations on the big accounts table. Often peak throughput is > closer to a very small multiple on the number of cores though, and > possibly even clients=cores, presumably because it's more efficient to > approximately peg one backend per core rather than switch among more > than one on each--reduced L1 cache contention etc. That's the behavior > you measured when your test showed better results with c=10 than c=16 on > a 8 core system, rather than suffering less from the "c must be < s" > contention limitation. Well the real problem is that pgbench itself does not scale too well to lots of concurrent connections and/or to high transaction rates, so it seriously skews the result. If you look at http://www.kaltenbrunner.cc/blog/index.php?/archives/26-Benchmarking-8.4-Chapter-1Read-Only-workloads.html, it is pretty clear that 90k (or the 83k I got due to the slower E5530) tps is actually a pgbench limit and that the backend can really go almost twice as fast (I only demonstrated ~140k tps using sysbench there, but I later managed to do ~160k tps with queries that are closer to what pgbench does in the lab). Stefan
On Wed, 29 Jul 2009, Stefan Kaltenbrunner wrote: > Well the real problem is that pgbench itself does not scale too well to lots > of concurrent connections and/or to high transaction rates so it seriously > skews the result. Sure, but that's what the multi-threaded pgbench code aims to fix, which didn't show up until after you ran your tests. I got the 90K select TPS with a completely unoptimized postgresql.conf, so that's by no means the best it's possible to get out of the new pgbench code on this hardware. I've seen as much as a 40% improvement over the standard pgbench code in my limited testing so far, and the patch author has seen a 450% one. You might be able to see at least the same results you got from sysbench out of it. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote: > On Wed, 29 Jul 2009, Stefan Kaltenbrunner wrote: > >> Well the real problem is that pgbench itself does not scale too well >> to lots of concurrent connections and/or to high transaction rates so >> it seriously skews the result. > > Sure, but that's what the multi-threaded pgbench code aims to fix, which > didn't show up until after you ran your tests. I got the 90K select TPS > with a completely unoptimized postgresql.conf, so that's by no means the > best it's possible to get out of the new pgbench code on this hardware. > I've seen as much as a 40% improvement over the standard pgbench code in > my limited testing so far, and the patch author has seen a 450% one. > You might be able to see at least the same results you got from sysbench > out of it. oh - the 90k tps are with the new multithreaded pgbench? missed that fact. As you can see from my results I managed to get 83k with the 8.4 pgbench on a slightly slower Nehalem which does not sound too impressive for the new code... Stefan
On Wed, 29 Jul 2009, Stefan Kaltenbrunner wrote: > oh - the 90k tps are with the new multithreaded pgbench? missed that fact. As > you can see from my results I managed to get 83k with the 8.4 pgbench on a > slightly slower Nehalem which does not sound too impressive for the new > code... I got 96K with the default postgresql.conf - 32MB shared_buffers etc. - and I didn't even try to find the sweet spot yet for things like number of threads, that's just the first useful number that popped out. I saw as much as 87K with the regular one too. I already planned to run the test set you did for comparison sake at some point. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, 28 Jul 2009, Scott Carey wrote: > On 7/28/09 1:28 PM, "Greg Smith" <gsmith@gregsmith.com> wrote: >> On Tue, 28 Jul 2009, Matthew Wakeling wrote: >> >>> Unlikely. Different threads on the same CPU core share their resources, so >>> they don't need an explicit communication channel at all (I'm simplifying >>> massively here). A real interconnect is only needed between CPUs and between >>> different cores on a CPU, and of course to the outside world. >> >> The question was "why are the new CPUs benchmarking so much faster than >> the old ones"... > > I believe he was answering the question "What makes SMT work well with > Postgres for these CPUs when it had problems on old Xeons?" Exactly. Interconnects and bandwidth are going to make the CPU faster in general, but won't have any (much?) effect on the relative speed with and without SMT. If the new CPUs are four-way dispatch and the old ones were two-way dispatch, that easily explains why SMT is a bonus on the new CPUs. With a two-way dispatch, a single thread is likely to be able to keep both pipelines busy most of the time. Switching on SMT will try to keep the pipelines busy a bit more, giving a small improvement, however that improvement is cancelled out by the cache being half the size for each thread. One of our applications ran 30% slower with SMT enabled on an old Xeon. On the new CPUs, it would be very hard for a single thread to keep four execution pipelines busy, so switching on SMT increases the throughput in a big way. Also, the bigger caches mean that splitting the cache in half doesn't have nearly as much impact. That's why SMT is a good thing on the new CPUs. However, SMT is always likely to slow down any process that is single-threaded, if that is the only thread doing significant work on the machine. It only really shows its benefit when you have more CPU-intensive processes than real CPU cores. Matthew -- In the beginning was the word, and the word was unsigned, and the main() {} was without form and void...
On Tue, 28 Jul 2009, Dave Youatt wrote: > Unlikely. Different threads on the same CPU core share their resources, so they don't > need an explicit communication channel at all (I'm simplifying massively here). A real > interconnect is only needed between CPUs and between different cores on a CPU, and of > course to the outside world. Scott's explanation of why SMT works better now is much more > likely to be the real reason. Actually, no, I wrote that. Please give at least some indication when replying to an email which parts of it are your words and which are quotes from someone else. Emails can be incredibly confusing without that distinction. You actually wrote: > :-) there's also this interconnect thingie between sockets, cores and memory. Nehalem has > a new one (for Intel), integrated memory controller, that is. And a new on-chip cache > organization. This, (like I mention elsewhere) will make the CPU faster overall, but is unlikely to increase the performance gain of switching SMT on. In fact, having a lower latency memory controller is more likely to reduce some of the problem that SMT is trying to address - that of a single thread stalling on memory access. Having said that, memory access latency is not scaling as quickly as CPU speed, so over time SMT is going to get more important. Matthew -- "Take care that thou useth the proper method when thou taketh the measure of high-voltage circuits so that thou doth not incinerate both thee and the meter; for verily, though thou has no account number and can be easily replaced, the meter doth have one, and as a consequence, bringeth much woe upon the Supply Department." -- The Ten Commandments of Electronics
On Tue, Jul 28, 2009 at 7:21 PM, Greg Smith<gsmith@gregsmith.com> wrote: > On Tue, 28 Jul 2009, Scott Marlowe wrote: > >> Just FYI, I ran the same basic test but with -c 10 since -c shouldn't >> really be greater than -s > > That's only true if you're running the TPC-B-like or other write tests, > where access to the small branches table becomes a serious hotspot for > contention. The select-only test has no such specific restriction as it > only operations on the big accounts table. Often peak throughput is closer > to a very small multiple on the number of cores though, and possibly even > clients=cores, presumably because it's more efficient to approximately peg > one backend per core rather than switch among more than one on each--reduced > L1 cache contention etc. That's the behavior you measured when your test > showed better results with c=10 than c=16 on a 8 core system, rather than > suffering less from the "c must be < s" contention limitation. > > Sadly I don't have or expect to have a W5580 in the near future though, the > X5550 @ 2.67GHz is the bang for the buck sweet spot right now and > accordingly that's what I have in the lab at Truviso. As Merlin points out, > that's still plenty to spank any select-only pgbench results I've ever seen. > The multi-threaded pgbench batch submitted by Itagaki Takahiro recently is > here just in time to really exercise these new processors properly. Can I trouble you for a single client run, say: pgbench -S -c 1 -t 250000 I'd like to see how much of your improvement comes from SMT and how much comes from general improvements to the cpu... merlin